Preparing cancer patients for difficult decisions is the oncologist’s job. However, they don’t always remember to do it. At the University of Pennsylvania Health System, doctors are prompted to discuss a patient’s treatment and end-of-life preferences by an artificially intelligent algorithm that predicts their risk of death.
However, it isn’t a tool that can simply be deployed and forgotten. A routine technical checkup found that the algorithm had degraded during the Covid-19 pandemic, becoming 7 percentage points worse at predicting who would die, according to a 2022 study.
The failures likely had real-life consequences. Ravi Parikh, an oncologist at Emory University and the study’s lead author, told KFF Health News that the tool failed many times to prompt doctors to initiate this important discussion – one that can head off unnecessary chemotherapy – with patients who needed it.
He believes several algorithms aimed at improving health care weakened during the pandemic, not just the one at Penn Medicine. “Many institutions do not routinely monitor the performance” of their products, Parikh said.
Glitches like these are one facet of a dilemma that computer scientists and physicians have long acknowledged, but that is starting to puzzle hospital executives and researchers: AI systems require consistent monitoring and staffing to put in place and to keep working well.
In short: you need people, and more machines, to make sure the new tools don’t fail.
“Everyone thinks AI will help us expand access and opportunities, improve care, etc.” said Nigam Shah, chief data scientist at Stanford Health Care. “This is all well and good, but is it feasible if it increases the cost of care by 20%?”
Government officials fear that hospitals lack the resources to put these technologies into practice. “I looked far and wide,” FDA Commissioner Robert Califf said during a recent agency panel on artificial intelligence. “I do not believe there is a single health care system in the United States that can validate an artificial intelligence algorithm applied to a clinical care system.”
Artificial intelligence is already widespread in health care. Algorithms are used to predict patients’ risk of death or deterioration, suggest diagnoses, triage patients, record and summarize visits to save doctors work, and approve insurance claims.
If technology evangelists are right, the technology will become ubiquitous and profitable. The investment firm Bessemer Venture Partners has identified about 20 health AI startups that are on track to reach $10 million in annual revenue each. The FDA has approved nearly a thousand products containing artificial intelligence.
Evaluating whether these products work is difficult. Assessing whether they continue to work – or whether they’ve developed the software equivalent of a blown gasket or a leaking engine – is even harder.
Take, for instance, a recent study from Yale Medicine that evaluated six “early warning systems,” which alert doctors when a patient’s health may be deteriorating rapidly. Crunching the data took a supercomputer several days, said Dana Edelson, a University of Chicago physician and co-founder of the company that provided one algorithm for the study. The process was fruitful, showing huge differences in performance among the six products.
It isn’t easy for hospitals and health care providers to select the best algorithms for their needs. The average doctor doesn’t have a supercomputer on hand, and there is no Consumer Reports for artificial intelligence.
“We have no standards,” said Jesse Ehrenfeld, former president of the American Medical Association. “Today, I cannot point to anything that is a standard for how to evaluate, monitor, and validate the performance of an algorithm model, whether it is AI-enabled or not, once it is deployed.”
Perhaps the most popular AI product in medical offices is ambient documentation, a tech-enabled assistant that listens to and summarizes patient visits. Last year, Rock Health tracked $353 million in investment flowing into these documentation companies. However, Ehrenfeld said, “At this point, there is no standard for comparing the results of these tools.”
And that is a problem when even small mistakes can be devastating. A team from Stanford University tried using large language models – the technology behind popular AI tools such as ChatGPT – to summarize patients’ medical histories. They compared the results with what a physician would write.
“Even in the best-case scenario, the model error rate was 35%,” said Stanford’s Shah. In medicine, “when you’re writing a summary and you forget one word, like ‘fever’ – I mean, that’s a problem, right?”
Sometimes the reasons an algorithm fails are quite logical. For example, changes to the underlying data can erode its effectiveness, such as when a hospital switches its laboratory supplier.
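For readers who want a concrete picture of what routine monitoring can look like, below is a minimal sketch of a drift check that compares how well a risk model separates patients who died from those who didn’t in a recent window versus a baseline window. It is purely illustrative: the column names, the alert threshold, and the use of AUC are assumptions, not the monitoring setup of Penn Medicine or any system mentioned here.

```python
# Illustrative sketch only: a bare-bones performance-drift check for a
# mortality-risk model. Compares discrimination (AUC) on a recent window of
# predictions against a baseline window and flags a large drop.
# Column names and the 0.05 threshold are assumptions, not any hospital's config.
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_drift_alert(baseline: pd.DataFrame, recent: pd.DataFrame,
                    max_drop: float = 0.05) -> bool:
    """Expects columns 'died' (0/1 outcome) and 'risk_score' (model output).
    Returns True if recent AUC fell more than `max_drop` below baseline AUC."""
    baseline_auc = roc_auc_score(baseline["died"], baseline["risk_score"])
    recent_auc = roc_auc_score(recent["died"], recent["risk_score"])
    print(f"baseline AUC = {baseline_auc:.3f}, recent AUC = {recent_auc:.3f}")
    return (baseline_auc - recent_auc) > max_drop
```

Even a check this small presumes labeled outcomes, a stable prediction window, and a person responsible for acting on the alert, which is exactly the staffing problem the experts describe.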
Sometimes, however, the pitfalls open up for no apparent reason.
Sandy Aronson, chief technology officer of Mass General Brigham’s personalized medicine program in Boston, said that when his team tested one app designed to help genetic counselors locate relevant literature on DNA variants, the product exhibited “nondeterminism” – that is, when asked the same question several times in a short period, it produced different results.
Aronson is enthusiastic about the potential of large language models to summarize knowledge for overburdened genetic counselors, but “the technology needs improvement.”
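To make the nondeterminism issue concrete, a test harness in the spirit of what Aronson describes might repeat the same question and count the distinct answers. The sketch below is hypothetical: `ask_model` is a stand-in for whatever interface the real app exposes, and nothing here reflects Mass General Brigham’s actual testing.

```python
# Illustrative sketch only: send the same prompt several times and tally the
# distinct answers. More than one distinct answer means the tool is
# nondeterministic for that query. `ask_model` is a hypothetical stand-in.
from collections import Counter
from typing import Callable

def nondeterminism_check(ask_model: Callable[[str], str],
                         prompt: str, trials: int = 5) -> Counter:
    """Ask `prompt` `trials` times and return a tally of the distinct answers."""
    answers = Counter(ask_model(prompt).strip() for _ in range(trials))
    if len(answers) > 1:
        print(f"Nondeterministic: {len(answers)} distinct answers over {trials} trials")
    return answers
```

In practice, exact string matching is a crude test; deciding whether two differently worded answers are substantively the same is itself hard, which is part of why evaluation is so labor-intensive.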
If metrics and standards are scarce and errors can arise for strange reasons, what should institutions do? Invest a lot of resources. Shah said that at Stanford University, it took eight to 10 months and 115 person-hours just to audit two models for fairness and reliability.
Experts interviewed by KFF Health News floated the idea of AI monitoring AI, with some (human) data expert monitoring both. All acknowledged that this would require organizations to spend even more money – a tough ask given the realities of hospital budgets and the limited supply of AI technology specialists.
“It’s great to have a vision where we’re melting icebergs so we can have a model monitoring the model,” Shah said. “But is that really what I wanted? How many more people will we need?”