Two health-tech leaders examine potentially groundbreaking human and computing efforts to stop diagnostic errors.
The statistics are troubling.
These errors fall into four broad categories: Missed diagnosis, misdiagnosis (coming to the wrong conclusion on what causes a patient’s symptoms), delayed diagnosis and overdiagnosis. The last problem has become more prevalent in developed countries because they have the ability to perform an endless array of tests that sometimes detect innocuous variants that don’t signal the presence of disease.
Although these errors cause a great deal of human suffering, reducing their toll is more complicated than it might appear. The first issue that needs to be tackled is figuring out how to measure the problem. There is no universally accepted metric to determine the scope of the diagnostic error epidemic. Some researchers have used medical record reviews, while others rely on malpractice claims data, health insurance data, physician surveys or patient questionnaires. The flaw shared by all these yardsticks is that the data take a great deal of time to collect, and since time is money, healthcare providers are looking for a more cost-effective way to measure the incidence of diagnostic errors. Without such a measure, it is almost impossible to gauge the effectiveness of potential solutions.
With the assistance of artificial intelligence (AI)-enhanced tools to track the diagnostic error dilemma and an awareness of the cognitive mistakes that contribute to it, clinicians can hope to get a handle on the problem.
One approach that is garnering attention is SPADE, which attempts to measure diagnostic mistakes by coupling individual symptoms with specific diseases. SPADE, which stands for Symptom-Disease Pair Analysis of Diagnostic Error, uses electronic health records (EHRs) and billing and insurance claims data to measure the rate at which seemingly benign diagnoses precede more serious, life-threatening diseases that were overlooked during the initial medical visit.
One such symptom-disease pair that has been linked to misdiagnosis is dizziness and stroke. A patient may come into the emergency department complaining of dizziness, which the physician diagnoses as otitis media, only to have the patient return to the hospital a few days later with a full-blown cerebrovascular accident. Because researchers have established a clear link between the symptom and the disease, hospitals can track the number of times the coupling occurs to estimate how often their practitioners make the mistake. Then hospitals can take the necessary educational and administrative steps to help correct the problem.
Other pairs that are used in this way are headache and aneurysm, chest pain and myocardial infarction, and fainting and pulmonary embolism.
For the SPADE model to work, providers need a large data set of patient information that includes the symptom and disease occurrences. Healthcare organizations also need to capture these data points regardless of where the patients show up after the more serious event happens. If a significant number of patients with the initial benign diagnosis return to a different health system when they experience the more serious outcome disease, that would skew the results. With that in mind, SPADE is most likely to work when a provider has a very wide reach — for example, when it offers clinical care and insures its patient population. That ensures that the organization will have the follow-up administrative data needed to link symptoms and the misdiagnosed disorders that follow. The SPADE metric would also work if patient data were pulled from a regional health information exchange.
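To make the mechanics concrete, here is a minimal sketch of the kind of symptom-disease pair lookback that SPADE formalizes. It assumes a hypothetical table of encounters containing a patient ID, a visit date and the diagnosis recorded at that visit, along with an illustrative 30-day window; the published approach works on EHR and claims data at population scale.

```python
import pandas as pd

# Hypothetical encounter data: one row per visit, with the diagnosis recorded there.
encounters = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "visit_date": pd.to_datetime(
        ["2023-01-02", "2023-01-06", "2023-02-10", "2023-03-01", "2023-03-20"]
    ),
    "diagnosis": ["dizziness", "stroke", "dizziness", "dizziness", "stroke"],
})

def spade_pair_rate(df, benign_dx, serious_dx, window_days=30):
    """Estimate how often a seemingly benign diagnosis is followed by a serious one
    within a fixed window -- a rough proxy for how often the serious disease was missed."""
    benign = df[df["diagnosis"] == benign_dx]
    serious = df[df["diagnosis"] == serious_dx]
    flagged = 0
    for _, visit in benign.iterrows():
        later = serious[
            (serious["patient_id"] == visit["patient_id"])
            & (serious["visit_date"] > visit["visit_date"])
            & (serious["visit_date"] <= visit["visit_date"] + pd.Timedelta(days=window_days))
        ]
        if not later.empty:
            flagged += 1
    return flagged / len(benign) if len(benign) else 0.0

print(spade_pair_rate(encounters, "dizziness", "stroke"))  # 2 of 3 dizziness visits -> ~0.67
```

Run across pairs like those listed above, a query of this sort gives a hospital a trackable, if rough, estimate of how often the coupling occurs.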
Of course, measuring the incidence of diagnostic errors is only the beginning. The next step is understanding their many causes. One National Academy of Medicine report, Improving Diagnosis in Health Care, contains a long list of contributing issues.
While the list may seem overwhelming to anyone trying to reduce the heavy toll taken by diagnostic errors, most of these issues fall into two broad categories: Systemwide issues and cognitive issues. Among the systemwide issues that urgently need attention is the poor communication that often exists between providers and patients. Addressing that problem doesn’t necessarily require the latest AI tools, but it does call for more human intelligence, and more compassion. It might at first seem counterintuitive to suggest that patients can play a role in the diagnostic process; many clinicians believe patients should play a silent role. But clinicians can learn a great deal from patients if they are willing to listen rather than interrupting before the patient has told their entire story. About 80% of diagnoses can be made correctly on the basis of a patient’s history, which is why William Osler’s maxim still makes sense today: “Listen to your patient, he is telling you the diagnosis.”
Poor communication between testing facilities and clinicians, another barrier to diagnostic accuracy, can be remedied with relatively low-tech solutions. Setting up an electronic system that verifies receipt of important lab results doesn’t require the latest machine-learning algorithms. Similarly, administrative staff and allied health professionals can monitor the back and forth between testing labs and physicians.
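As a sketch of how low-tech such a safety net can be, the snippet below flags lab results that the ordering clinician has not acknowledged within a set window, so staff can follow up on anything it surfaces. The data structure and the 48-hour threshold are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class LabResult:
    order_id: str
    ordering_clinician: str
    resulted_at: datetime
    acknowledged_at: Optional[datetime] = None  # set when the clinician confirms receipt

def unacknowledged_results(results, now, max_delay=timedelta(hours=48)):
    """Return results that were reported but never confirmed as received
    within the allowed window -- candidates for administrative follow-up."""
    return [
        r for r in results
        if r.acknowledged_at is None and now - r.resulted_at > max_delay
    ]

# Example: one overdue result and one that was acknowledged promptly.
now = datetime(2024, 5, 10, 9, 0)
results = [
    LabResult("A100", "Dr. Lee", datetime(2024, 5, 6, 8, 0)),
    LabResult("A101", "Dr. Patel", datetime(2024, 5, 9, 8, 0),
              acknowledged_at=datetime(2024, 5, 9, 10, 0)),
]
for r in unacknowledged_results(results, now):
    print(f"Escalate result {r.order_id} ordered by {r.ordering_clinician}")
```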
The second broad category responsible for diagnostic mistakes, cognitive biases and errors, can distort an individual clinician’s reasoning process and lead them down the wrong path. The list of potential problems is long and includes anchoring, affective bias, availability bias and premature closure. During anchoring, a diagnostician fixates on initial findings and stays anchored to that line of reasoning even when contrary evidence suggests it’s best to change direction. The culture of modern medicine gravitates toward this mindset because it encourages physicians to be overconfident in their own skill set, and because many physicians, like other leaders, believe the appearance of certainty is the best course of action. Clinicians who are swayed by their positive and negative emotional reactions to patients, on the other hand, are guilty of affective bias. Availability bias is common among clinicians who see the same disorder over and over within a short time frame, or who have done research on a specific disorder, which makes that diagnosis more mentally available than the alternatives. Premature closure occurs when a practitioner is too quick to accept the first plausible explanation for all the presenting signs and symptoms.
One way to avoid these cognitive errors is for clinicians to take a more introspective approach to the diagnostic process, which is sometimes referred to as metacognition or “thinking about thinking.” It involves forcing oneself to step back and take a dispassionate look at the steps one takes during the reasoning process. Psychologists who have studied diagnosticians’ thinking patterns believe that most physicians rely on one of two approaches: Type 1, which is intuitive, automatic and stereotypic thinking, and Type 2, which focuses on slow, logical, effortful, calculating reasoning. The dual system came to the public’s attention when Daniel Kahneman won the Nobel Prize in economics for his seminal work on the topic and published Thinking, Fast and Slow. The Type 1 system is “fast, frugal, requires little effort, and frequently gets the right answer. But occasionally it fails, sometimes catastrophically. Predictably, it misses the patient who presents atypically, or when the pattern is mistaken for something else,” according to Pat Croskerry, M.D., an expert in this field. Type 2 thinking is more reflective and takes the time to reason through several possible forks in the road. Experienced clinicians usually rely on a combination of the two approaches and have learned how to switch between fast and slow thinking.
Of course, even the most gifted diagnostician will make mistakes or become overwhelmed with the sheer volume of data in their workflows. As Ziad Obermeyer, M.D., and Thomas Lee, M.D., point out, “The complexity of medicine now exceeds the capacity of the human mind.”
But AI and machine learning are successfully addressing this issue.
In previous installments in our series, we discussed several innovative AI solutions to the diagnostic error dilemma, including the use of neural networks and image analysis to help facilitate the diagnosis of melanoma and other types of machine learning algorithms to assist in the diagnosis of colorectal cancer. But these initiatives only scratch the surface.
Dutch investigators, for example, have shown that deep-learning algorithms are as effective as trained pathologists in detecting the spread of breast cancer to nearby lymph nodes. They reached that conclusion by training the algorithms on glass slides of healthy and diseased tissue samples that were definitively diagnosed using immunohistochemical staining. A second set of pathology slides was then evaluated by the algorithms and by 11 pathologists who were under time constraints to complete the diagnosis, to simulate a real-world setting; another pathologist arrived at a diagnosis without time constraints. Babak Ehteshami Bejnordi and his colleagues found that “the top-performing algorithm achieved a lesion-level, true-positive fraction comparable with that of the pathologist WOTC [without time constraints]” (72.4% vs 64.3-80.4%). Some of the algorithms tested were also as effective as the panel of 11 pathologists who had only two hours to make a diagnosis.
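The algorithms in that study were trained on digitized whole slides and are far more sophisticated, but the underlying technique can be suggested with a toy sketch: a small convolutional network that labels tissue patches as normal or tumor. The architecture, patch size and random stand-in data below are assumptions made for illustration, not the models evaluated in the study.

```python
import torch
import torch.nn as nn

class PatchClassifier(nn.Module):
    """Tiny CNN that labels a 64x64 RGB tissue patch as normal (0) or tumor (1).
    Real metastasis detectors use much deeper networks applied across whole slides."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)

    def forward(self, x):
        x = self.features(x)              # -> (batch, 32, 16, 16)
        return self.classifier(x.flatten(1))

# One training step on a fake batch of labeled patches.
model = PatchClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

patches = torch.randn(8, 3, 64, 64)   # stand-ins for stained-tissue patches
labels = torch.randint(0, 2, (8,))    # 0 = normal, 1 = tumor (ground truth from IHC staining)
loss = loss_fn(model(patches), labels)
loss.backward()
optimizer.step()
print(float(loss))
```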
Similarly, Juan Banda, of the Center for Biomedical Informatics Research at Stanford University, and his colleagues have created a machine-learning-based classifier using a data analytics approach called random forest modeling. It tackles a thorny problem that has plagued physicians for years: How does one spot patients with elevated cholesterol who are at very high risk of heart disease due to a genetic defect called familial hypercholesterolemia (FH)? Having this autosomal dominant condition increases the odds of developing atherosclerotic cardiovascular disease twentyfold when compared to individuals with normal LDL cholesterol levels. While FH affects about one in 250 persons, current screening protocols miss more than 95% of patients with the disorder.
The machine-learning classifier detected 84% of patients at the highest probability threshold of having FH among those who had been cared for at Stanford Health Care. The investigators used structured and unstructured patient data from Stanford’s EHR system to create their classification system and verified its accuracy by applying it to a different patient population, using it to flag FH patients in the Geisinger Health System. That external validation is an important distinction to point out, as many AI projects have fallen short because they have been shown to be reliable only in a narrowly defined internal data set.
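A drastically simplified version of that workflow (train a random-forest classifier on features derived from one health system’s records, then check how well it discriminates on a second system’s data) might look like the sketch below. The feature names and synthetic cohorts are assumptions made for illustration; the Stanford classifier also drew on unstructured clinical notes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def synthetic_cohort(n):
    """Stand-in for EHR-derived features: LDL level, statin use, family history flag."""
    X = np.column_stack([
        rng.normal(160, 40, n),     # LDL cholesterol (mg/dL)
        rng.integers(0, 2, n),      # on a high-intensity statin
        rng.integers(0, 2, n),      # documented family history of early heart disease
    ])
    # Toy label: higher LDL plus a family history makes FH more likely.
    p = 1 / (1 + np.exp(-(0.03 * (X[:, 0] - 190) + 1.5 * X[:, 2] - 2)))
    y = (rng.random(n) < p).astype(int)
    return X, y

# "Internal" training cohort and a separate "external" validation cohort.
X_train, y_train = synthetic_cohort(5000)
X_external, y_external = synthetic_cohort(2000)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# External validation: does the model still discriminate on a different population?
scores = clf.predict_proba(X_external)[:, 1]
print("External AUROC:", round(roc_auc_score(y_external, scores), 3))
```

The external number is the one to watch: a model that holds up on the second cohort is less likely to have memorized the quirks of the first.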
A consequence of diagnostic errors is that they often land patients back in the hospital after they have been discharged. The SPADE tool we discussed earlier is designed to help reduce that eventuality. Other researchers have studied the best way to prevent avoidable readmissions by developing more accurate statistics on who is most likely to be readmitted. Several traditional risk scores exist to help make this prediction, including the LACE, HOSPITAL and Maxim/Right Care scores, none of which is especially reliable. Researchers from the University of Maryland have developed a risk score derived from EHR data, using machine learning that relies on convolutional neural networks and gradient boosting regression. The new algorithm, called the Baltimore or B score, was compared to the more traditional risk scores in three different hospitals. Each hospital was evaluated with its own version of the B score because the tool was derived from patient data at each institution.
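The mechanics of deriving such a hospital-specific score can be suggested with a stripped-down sketch: fit a gradient-boosted model on one institution’s own discharge records and use it to score held-out patients. The features and synthetic outcomes below are illustrative assumptions; the actual B score also incorporates neural-network components and far richer EHR inputs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 4000

# Stand-in EHR features for one hospital: age, prior admissions in the past year, length of stay.
X = np.column_stack([
    rng.normal(65, 15, n),
    rng.poisson(1.0, n),
    rng.gamma(2.0, 2.0, n),
])
# Toy 30-day readmission outcome that rises with prior admissions and length of stay.
logit = -2.5 + 0.8 * X[:, 1] + 0.15 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# A hospital-specific model: retraining the same pipeline on another institution's
# records would produce that hospital's own version of the score.
model = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)
risk = model.predict_proba(X_test)[:, 1]   # predicted readmission risk for each patient
print("Five highest-risk patients:", np.argsort(risk)[-5:])
```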
To compare the risk scores, Daniel Morgan, M.D., and his associates used area under the receiver operating characteristic curve (AUROC). MedCalc states: “In a ROC curve, the true positive rate (sensitivity) is plotted in function of the false positive rate (100 - specificity) for different cut-off points of a parameter. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal).”
Morgan and associates found that the B score was significantly better than all the traditional scoring systems. For example, at 48 hours after admission, the AUROC for the B score was 0.72, versus 0.63, 0.64 and 0.66 for the HOSPITAL, Maxim/Right Care and modified LACE scores, respectively, at one hospital. Put another way, “the B score was able to identify the same number of readmitted patients while flagging 25.5% to 54.9% fewer patients.”
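To see what that comparison involves, here is a minimal example that computes AUROC for two hypothetical risk scores on the same ten patients. The outcomes and score values are fabricated solely to demonstrate the calculation and are not drawn from the study.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical 30-day readmission outcomes (1 = readmitted) for ten discharged patients.
readmitted = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]

# Two made-up risk scores for the same patients; higher should mean higher risk.
score_a = [0.10, 0.20, 0.80, 0.15, 0.70, 0.65, 0.25, 0.90, 0.05, 0.60]  # e.g., an ML-derived score
score_b = [0.40, 0.30, 0.50, 0.20, 0.35, 0.45, 0.60, 0.55, 0.10, 0.30]  # e.g., a traditional score

# AUROC is the probability that a randomly chosen readmitted patient gets a
# higher score than a randomly chosen patient who was not readmitted.
print("Score A AUROC:", roc_auc_score(readmitted, score_a))  # ~0.96 on this toy data
print("Score B AUROC:", roc_auc_score(readmitted, score_b))  # ~0.65 on this toy data
```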
And that would save a hospital the resources and manpower needed to deliver specialized care to patients who were unlikely to be readmitted and probably didn’t need it.
About the Authors
Paul Cerrato has more than 30 years of experience working in healthcare as a clinician, educator, and medical editor. He has written extensively on clinical medicine, electronic health records, protected health information security, practice management, and clinical decision support. He has served as editor of Information Week Healthcare, executive editor of Contemporary OB/GYN, senior editor of RN Magazine, and contributing writer/editor for the Yale University School of Medicine, the American Academy of Pediatrics, Information Week, Medscape, Healthcare Finance News, IMedicalapps.com, and Medpage Today. HIMSS has listed Mr. Cerrato as one of the most influential columnists in healthcare IT.
John D. Halamka, M.D., leads innovation for Beth Israel Lahey Health. Previously, he served for over 20 years as the chief information officer (CIO) at the Beth Israel Deaconess Healthcare System. He is chairman of the New England Healthcare Exchange Network (NEHEN) and a practicing emergency physician. He is also the International Healthcare Innovation professor at Harvard Medical School. As a Harvard professor, he has served the George W. Bush administration, the Obama administration and national governments throughout the world, planning their healthcare IT strategies. In his role at BIDMC, Dr. Halamka was responsible for all clinical, financial, administrative and academic information technology, serving 3,000 doctors, 12,000 employees, and 1 million patients.