By SAURABH JHA, MD
This is part two of a three-part series. Catch up on Part One here.
Clever Hans
Preetham Srinivas, the head of the chest radiograph project at Qure.ai, summoned Bhargava Reddy, Manoj Tadepalli, and Tarun Raj to the meeting room.
“Get ready for an all-nighter, boys,” said Preetham.
Qure’s scientists began investigating the algorithm’s mysteriously high performance on chest radiographs from a new hospital. To recap, the algorithm had an area under the receiver operating characteristic curve (AUC) of 1 – that’s 100 % on a multiple-choice test.
“Someone leaked the paper to AI,” laughed Manoj.
“It’s an engineering college joke,” explained Bhargava. “It means that you saw the questions before the exam. It happens sometimes in India when rich people buy the exam papers.”
Just because you know the questions doesn’t mean you know the answers. And AI wasn’t rich enough to buy the AUC.
The four lads were school friends from Andhra Pradesh. They had all studied computer science at the Indian Institute of Technology (IIT), a freaky improbability given that only a hundred out of a million aspiring youths are selected for this most coveted discipline in India’s most coveted institute. They had revised for exams together, pulling all-nighters – in working together, they worked harder and made work more fun.
Preetham ordered Maggi Noodles – the mysteriously delicious Indian instant noodles – to recharge their energies. Ennio Morricone’s soundtrack from For a Few Dollars More played in the background. We were venturing into the wild west of deep learning.
The lads had to comb a few thousand normal and a few thousand abnormal radiographs to find what AI was seeing. They were engineers, not radiologists, and had no special training in radiology except the kind that comes from looking at thousands of chest radiographs, which they now knew like the backs of their hands. They had carefully fed AI data to teach it radiology. In return, AI taught them radiology – taught them where to look, what to see, and what to find.
They systematically searched the chest radiographs for clues. Radiographs are two-dimensional renditions, mere geometric compressions, maps of sorts. But the real estate they depict has unique personalities. The hila, apices, and tracheobronchial angle are so close to each other that they may as well be one structure, but like the mews, roads, avenues and cul-de-sacs of London, they’re distinct, each expressing unique elements of physiology and pathology.
One patch of real estate which often flummoxes AI is the costophrenic angle (CPA) – a quiet hamlet where the lung meets the diaphragm, two structures of differing capacity to stop x-rays, two opposites which attach. It’s supposedly sharp – hence, an “angle”; the loss of sharpness implies a pleural effusion, which isn’t normal.
The CPA is often blunt. If radiologists called a pleural effusion every time the CPA was blunt, half the world would have a pleural effusion. How radiologists deal with a blunted CPA is often arbitrary. Some call it a pleural effusion, some just describe their observation without ascribing pathology, and some ignore the blunted CPA. I do all three, but on different days of the week. Variation in radiology reporting frustrates clinicians. But as frustrating as reports are, the fact is that radiographs are imperfect instruments interpreted by imperfect arbiters – i.e. Imperfection Squared. Subjectivity is unconquerable. Objectivity is farcical.
Because the radiologist’s interpretation is the gospel truth for AI, variation amongst radiologists messes with AI’s mind. AI prefers that radiologists be consistent like sheep and the report be dogmatic like the Old Testament, so that it can better understand the ground truth, even if the ground truth is really ground truthiness. When all radiologists call a blunted CPA a pleural effusion, AI appears smarter. Offering my two cents, I suggested that perhaps the secret to AI’s mysteriously superb performance was that the radiologists from this new institute were sheep. They all reported the blunted CPA in the same manner. 100 % consistency – like machines.
“I don’t think it’s the CPA, yaar,” objected Tarun, politely. “The problem is probably in the metadata.”
The metadata is a lawless province which drives data scientists insane. Notwithstanding variation in radiology reporting, radiographs – i.e. data – follow well-defined rules, speak a common language, and can be crunched by deep neural networks. But radiographs don’t exist in a vacuum. When stored, they’re drenched in the attributes of the local information technology. And when retrieved, they carry these attributes, which are like local dialects, with them. Before being fed to the neural networks, the radiographs must be cleared of idiosyncrasies in the metadata, which can take months.
It seemed we had a long night ahead. I was looking forward to the second plate of Maggi Noodles.
Around the 50th radiograph, Tarun mumbled, “It’s Clever Hans.” His pitch then rose in excitement, “I’ve figured it out. AI is behaving like Clever Hans.”
Clever Hans was a celebrity German horse which could allegedly add and subtract. He’d answer by tapping his hoof. Researchers, however, figured out his secret. Hans would continue tapping his hoof until the number of taps corresponded to the right numerical answer, which he’d deduce from subtle, non-verbal, visual cues in his owner. The horse would get the wrong answer if he couldn’t stare at his owner’s face. Not quite math Olympiad material, Hans was still quite clever – certainly for a horse, but even by human standards.
“What do you see?” Tarun pointed excitedly to a normal and an abnormal chest radiograph placed side by side. Having interpreted over several thousand radiographs I saw what I usually see but couldn’t see anything mysterious. I felt embarrassed – a radiologist was being upstaged by an engineer, AI, and supposedly a horse, too. I stared intently at the CPA hoping for a flash of inspiration.
“It’s not the CPA, yaar,” Tarun said again – “look at the whole film. Look at the corners.”
I still wasn’t getting it.
“AI is crafty, and just like Hans the clever horse, it seeks the simplest cue. In this hospital all abnormal radiographs are labelled – “PA.” None of the normals are labelled. This is the way they kept track of the abnormals. AI wasn’t seeing the hila, or CPA, or lung apices – it detected the mark – “PA” – which it couldn’t miss,” Tarun explained.
The others soon verified Tarun’s observation. Sure enough, like clockwork, all the abnormal radiographs had “PA” written on them – without exception. This simple mark of abnormality, a local practice, had become AI’s ground truth. It rejected all the sophisticated pedagogy it had been painfully taught for a simple rule. I wasn’t sure whether AI was crafty, pragmatic or lazy, or whether I felt more professionally threatened by AI or by data scientists.
“This can be fixed by a simple code, but that’s for tomorrow,” said Preetham. The second plate of Maggi Noodles never arrived. AI had one more night of God-like performance.
The Language of Ground Truth
Artificial Intelligence’s pragmatic laziness is enviable. To learn, it’ll climb mountains when needed, but where possible it’ll take the shortest path. It prefers climbing molehills to mountains. AI could be my Tyler Durden. It doesn’t give a rat’s tail about how or why, and even if it cared it wouldn’t tell you how it arrived at an answer. AI’s dysphasic insouciance – its black box – means that we don’t know why AI is right, or even that it is. But AI’s pedagogy is structured and continuous.
After acquiring the chest radiographs, Qure’s scientists had to label the images with the ground truth. Which truth, they asked. Though “ground truth” sounds profound it simply means what the patient has. On radiographs, patients have two truths: the radiographic finding, e.g. consolidation – an area of whiteness where there should be lung, and the disease, e.g. pneumonia, causing that finding. The pair is a couplet. Radiologists rhyme their observation with inference. The radiologist observes consolidation and infers pneumonia.
The inference is clinically meaningful as doctors treat pneumonia, not consolidation, with antibiotics. The precise disease, such as the specific pneumonia, e.g. legionella pneumonia, is the whole truth. But training AI on the whole truth isn’t feasible for several reasons.
First, many diseases cause consolidation, or whiteness, on radiographs – pneumonia is just one cause, which means that many diseases look similar. If legionella pneumonia looks like alveolar hemorrhage, why labor to get the whole truth?
Second, there’s seldom external verification of the radiologist’s interpretation. It’s unethical to resect lungs just to see if radiologists are correct. Whether radiologists attribute consolidation to atelectasis (collapse of a portion of the lung, like a folded tent), pneumonia, or dead lung – we don’t know if they’re right. Inference is guesswork.
Another factor is sample size: the preciser the truth, the fewer the cases of that precise truth. There are more cases of consolidation from any cause than of consolidation from legionella pneumonia. AI needs numbers, not just to tighten the confidence intervals around the point estimate – broad confidence intervals imply poor work ethic – but for external validity. The more general the ground truth, the more cases of labelled truth AI sees, and the more generalizable AI gets, allowing it to work in Mumbai, Karachi, and New York.
Thanks to Prashant Warier’s tireless outreach and IIT network, Qure.ai acquired a whopping 2.5 million chest radiographs from nearly fifty centers across the world, from as far afield as Tokyo and Johannesburg and, of course, from Mumbai. AI had a sure shot at going global. But the sheer volume of radiographs made the scientists timorous.
“I said to Prashant, we’ll be here till the next century if we have to search two million medical records for the ground truth, or label two million radiographs,” recalls Preetham. AI could neither be given a blank slate nor be spoon-fed. The way around it was to label a few thousand radiographs with anatomical landmarks such as the hila, diaphragm, and heart – a process known as segmentation. This level of weak supervision could be scaled.
For the ground truth, they’d use the radiologist’s interpretation. Even so, reading over a million radiology reports wasn’t practical. They’d use Natural Language Processing (NLP). NLP can search unstructured (free text) sentences for meaningful words and phrases. NLP would tell AI whether the study was normal or abnormal and what the abnormality was.
Chest x-ray reports are diverse and subjective, with inconsistency added to the mix. Ideally, words should precisely and consistently convey what radiologists see. Radiologists do pay heed to the March Hare’s advice to Alice: “then you should say what you mean,” and to Alice’s retort: “at least I mean what I say.” The trouble is that different radiologists say different things about the same disease and mean different things by the same descriptor.
One radiologist may call every abnormal whiteness an “opacity”, regardless of whether they think the opacity is from pneumonia or an innocuous scar. Another may say “consolidation” instead of “opacity.” Still another may use “consolidation” only when they believe the abnormal whiteness is because of pneumonia, instilling connotation into the denotation. Yet another may use “infiltrate” for viral pneumonia and “consolidation” for bacterial pneumonia.
The endless permutations of language in radiology reports would drive both March Hare and Alice insane. The Fleischner Society lexicon makes descriptors more uniform and meaningful. After perusing several thousand radiology reports, the team selected from that lexicon the following descriptors for labelling: blunted costophrenic angle, cardiomegaly, cavity, consolidation, fibrosis, hilar enlargement, nodule, opacity and pleural effusion.
Not content with publicly available NLPs, which don’t factor in local linguistic culture, the team developed their own NLP. They had two choices – use machine learning to develop the NLP or use humans (programmers) to write the rules. The former is way faster. Preetham opted for the latter because it gave him latitude to incorporate qualifiers in radiology reports such as “vague” and “persistent.” The nuances could come in handy for future iterations.
They started with simple rules, such as negation detection, so that “no abnormality,” “no pneumonia,” and “pneumonia unlikely” would all count as “normal.” They then broadened the rules to incorporate synonyms such as “density” and “lesion,” including the protean “prominent” – a word which can mean anything except what it actually means and which, like “awesome,” has been devalued by overuse. The NLP for chest radiographs accrued nearly 2,500 rules, rapidly becoming more biblical than the regulations of Obamacare.
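To make the idea concrete, here is a minimal sketch, in Python, of the kind of rule-based extraction described above – negation cues plus synonym lists. The rules, synonyms, and function names are illustrative inventions, not Qure’s actual code.

```python
import re

# Illustrative synonym map: different words radiologists use for the same finding.
SYNONYMS = {
    "consolidation": ["consolidation", "opacity", "density", "infiltrate", "lesion"],
    "pleural effusion": ["pleural effusion", "blunted costophrenic angle"],
}

# Illustrative negation cues: phrases that flip a mention into a non-finding.
NEGATION_PATTERNS = [r"\bno\b", r"\bwithout\b", r"\bunlikely\b", r"\bnot seen\b"]

def extract_labels(report: str) -> dict:
    """Return {finding: True/False} for each finding mentioned in the report."""
    labels = {}
    for sentence in re.split(r"[.\n]", report.lower()):
        negated = any(re.search(p, sentence) for p in NEGATION_PATTERNS)
        for finding, words in SYNONYMS.items():
            if any(w in sentence for w in words):
                # A negated mention ("no consolidation") counts as absent.
                labels[finding] = labels.get(finding, False) or not negated
    return labels

print(extract_labels("No consolidation. Blunted costophrenic angle, likely small pleural effusion."))
# {'consolidation': False, 'pleural effusion': True}
```

A production system needs far more than this – qualifiers, scope of negation, misspellings – which is how a seemingly simple task balloons into thousands of rules.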
The first moment of reckoning arrived: does the NLP even work? Testing the NLP is like testing the tester – if the NLP was grossly inaccurate, the whole project would crash. The NLP determines the accuracy of the labelled truth – e.g. whether the radiologist truly said “consolidation” in the report. If the NLP correctly picks up “consolidation” in nine out of ten reports and misses it in one, the occasional radiograph with consolidation but labelled “normal” doesn’t confuse AI. AI can tolerate occasional misclassification; indeed, it thrives on noise. You’re allowed to fool it once, but you can’t fool it too often.
After six months of development, the NLP was tested on 1930 reports to see if it flagged the radiographic descriptors correctly. The reports, all 1930 of them, were manually checked by radiologists blinded to the NLP’s answers. The NLP performed respectably, with sensitivities and specificities for the descriptors ranging from 93 % to 100 %.
For “normal”, the most important radiological diagnosis, the NLP had a specificity of 100 %. This means that of 10,000 reports in which the radiologist called or implied abnormal, none would be falsely extracted by the NLP as “normal.” The NLP’s sensitivity for “normal” was 94 %. This means that of 10,000 reports in which the radiologist called or implied normal, 600 would be falsely extracted by the NLP as “abnormal.” The NLP’s accuracy reflected ambiguity in language, which is a proxy for the radiologist’s uncertainty. Radiologists are less certain, and use more weasel words, when they believe the radiograph is normal.
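As a quick check of that arithmetic – the report counts below are hypothetical round numbers; the sensitivity and specificity are the figures quoted above:

```python
# Hypothetical report counts; sensitivity and specificity are the quoted figures.
n_abnormal_reports = 10_000    # reports the radiologist called or implied abnormal
n_normal_reports = 10_000      # reports the radiologist called or implied normal

specificity_for_normal = 1.00  # NLP never labels an abnormal report as "normal"
sensitivity_for_normal = 0.94  # NLP catches 94% of truly normal reports

false_normals = round(n_abnormal_reports * (1 - specificity_for_normal))   # abnormal mislabelled "normal"
missed_normals = round(n_normal_reports * (1 - sensitivity_for_normal))    # normal mislabelled "abnormal"
print(false_normals, missed_normals)   # 0 600
```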
Algorithm Academy
After deep learning’s success using ImageNet to spot cats and dogs, prominent computer scientists prophesied the extinction of radiologists. If AI could tell cats apart from dogs, it could surely read CAT scans. They missed a minor point. The typical image resolution in ImageNet is 64 x 64 pixels. The resolution of chest radiographs can be as high as 4096 x 4096 pixels. Lung nodules on chest radiographs are needles in haystacks. Even cats are hard to find.
The other point missed is more subtle. When AI is trying to classify a cat in a picture of a cat on the sofa, the background is irrelevant. AI can focus on the cat and ignore the sofa and the writing on the wall. On chest radiographs the background is both the canvas and the paint. You can’t ignore the left upper lobe just because there’s an opacity in the right lower lobe. Radiologists can’t afford the luxury of satisfaction of search. All lungs must be searched with unyielding visual diligence.
Radiologists may be awkward people, imminently replaceable, but the human retina is a remarkable feat of engineering, evolutionarily extinction-proof, which can discern a lot more than fifty shades of gray. For the neural network, 4096 x 4096 pixels is too much information. Chest radiographs had to be downsampled to 256 x 256 pixels. The reduced resolution makes pulmonary arteries look like nodules. Radiologists should be humbled that AI starts at a disadvantage.
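A minimal sketch of that kind of down-sampling, using the Pillow imaging library; the file names and interpolation choice are mine, not necessarily Qure’s:

```python
from PIL import Image

# Hypothetical file; chest radiographs are often stored at up to 4096 x 4096 pixels.
radiograph = Image.open("chest_xray.png").convert("L")   # "L" = single-channel grayscale

# Down-sample to the 256 x 256 input the network expects.
# Fine detail (small nodules, thin vessel walls) is inevitably blurred away.
downsampled = radiograph.resize((256, 256), resample=Image.BILINEAR)
downsampled.save("chest_xray_256.png")
```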
Unlike radiologists, AI doesn’t take bathroom breaks or check Twitter. It’s indefatigable. Very quickly, it trained on 50,000 chest radiographs. Soon AI was ready for the end-of-semester exam. The validation cases come from the same source as the training cases. Training-validation is a loop. Data scientists look at AI’s performance on validation cases, make tweaks, and give it more cases to train on, check its performance again, make tweaks, and so on.
When asked “is there consolidation?”, AI doesn’t talk but expresses itself in a dimensionless number known as confidence score – which runs between 0 and 1. How AI arrives at a particular confidence score, such as 0.5, no one really understands. The score isn’t a measure of probability though it probably incorporates some probability. Nor does it strictly measure confidence, though it’s certainly a measure of belief, which is a measure of confidence. It’s like asking a radiologist – “how certain are you that this patient has pulmonary edema – throw me a number?” The number the radiologist throws isn’t empirical but is still information.
The confidence score is mysterious but not meaningless. For one, you can literally turn the score’s dial, like adjusting the brightness or contrast of an image, and see the trade-off between sensitivity and specificity. It’s quite a sight. It’s like seeing the full tapestry of radiologists, from the swashbuckling undercaller to the “afraid of my shadow” overcaller. The confidence score can be chosen to maximize sensitivity or specificity or, using Youden’s index, to balance both.
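A toy sketch of what turning that dial means: sweep a threshold over the confidence scores, compute sensitivity and specificity at each setting, and let Youden’s index (sensitivity + specificity − 1) pick the balance point. The scores and labels below are made up.

```python
import numpy as np

# Made-up confidence scores and ground-truth labels (1 = abnormal, 0 = normal).
scores = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])
truth  = np.array([1,    1,    0,    1,    0,    1,    0,    0   ])

best = None
for threshold in np.arange(0.05, 1.0, 0.05):
    called_abnormal = scores >= threshold
    sensitivity = (called_abnormal & (truth == 1)).sum() / (truth == 1).sum()
    specificity = (~called_abnormal & (truth == 0)).sum() / (truth == 0).sum()
    j = sensitivity + specificity - 1          # Youden's index
    if best is None or j > best[0]:
        best = (j, threshold, sensitivity, specificity)

print(f"Best threshold {best[1]:.2f}: sensitivity {best[2]:.2f}, specificity {best[3]:.2f}")
```

Lower the threshold and the algorithm becomes the overcaller; raise it and it becomes the undercaller.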
To correct poor sensitivity and specificity, the scientists looked at cases where the confidence scores were at the extremes, where the algorithm was either nervous or overconfident. AI’s weaknesses were radiologists’ blind spots, such as the lung apices, the crowded bazaar of the hila, and behind the ribs. It can be fooled by symmetry. When the algorithm made a mistake, its reward function, also known as the loss function, was changed so that it was punished if it made the same mistake and rewarded when it didn’t. The algorithms, which have feelings too, responded like Pavlov’s dogs and kept improving.
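One common way to re-shape the loss so that a repeated mistake costs more – a sketch in PyTorch, and an assumption on my part rather than Qure’s actual recipe – is to weight the loss toward the class the model keeps missing:

```python
import torch
import torch.nn as nn

# Sketch: penalize errors on the under-detected class more heavily by giving
# that class a larger weight in the cross-entropy loss. Values are illustrative.
class_weights = torch.tensor([1.0, 3.0])          # [normal, abnormal]
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.tensor([[2.0, 0.5], [0.3, 1.2]])   # model outputs for two radiographs
targets = torch.tensor([1, 1])                    # both radiographs are actually abnormal
print(loss_fn(logits, targets))                   # mistakes on "abnormal" now cost 3x as much
```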
The Board Exam
After eighteen months of training-validation, and seeing over a million radiographs, the second moment of reckoning arrived: the test – the real test, not the mock exam. This part of algorithm development must be rigorous, because if the test is too easy the algorithm can perform deceptively well. Qure.ai wanted their algorithms validated by independent researchers and that validation published in peer-reviewed journals. But it wasn’t Reviewer 2 they feared.
“You want to find and fix the algorithm’s weaknesses before deployment. Because if our customers discover its weaknesses instead of us, we lose credibility,” explained Preetham.
Preetham was alluding to the inevitable drop in performance when algorithms are deployed in new hospitals. A small drop in AUC, such as 1-2 %, which doesn’t change clinical management, is fine; a massive drop, such as 20 %, is embarrassing. What’s even more embarrassing is if AI misses an obvious finding, such as a bleedingly obvious consolidation. If radiologists miss obvious findings, they could be sued. If the algorithm missed an obvious finding, it could lose its job, and Qure.ai could lose future contracts. A single drastic error can undo months of hard work. Healthcare is an unforgiving market.
Early in training, AI missed a 6 cm opacity in the lung, which even a toddler could see. Qure’s scientists were puzzled, afraid, and despondent. It turned out that the algorithm had mistaken the large opacity for a pacemaker. Originally, the data scientists had excluded radiographs with devices so as not to confuse AI. When the algorithm saw what it thought was a pacemaker, it remembered the rule, “no devices”, and so denied seeing anything. The scientists realized that in their attempt not to confuse AI, they had confused it even more. There was no gain in mollycoddling AI. It needed to see the real world to grow up.
The test cases came from new sources – hospitals in Calcutta, Pune and Mysore. The ground truth was made more stringent. Three radiologists read the radiographs independently. If two called “consolidation” and the third didn’t, the majority prevailed and the ground truth was “consolidation.” If two radiologists didn’t flag a nodule and a third did, the ground truth was “no nodule.” For both the validation and the test cases, radiologists were the ground truth – AI was a prisoner to radiologists’ whims – but by using three radiologists as the ground truth for the test cases, interobserver variability was reduced. The truth, in a sense, was a golden mean rather than a single radiologist’s say-so.
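The majority rule is simple enough to write in a couple of lines; a sketch assuming each radiologist’s read is recorded as 1 (finding present) or 0 (finding absent):

```python
def majority_ground_truth(reads: list[int]) -> int:
    """Three independent radiologist reads in, one ground-truth label out."""
    return 1 if sum(reads) >= 2 else 0

print(majority_ground_truth([1, 1, 0]))  # two call consolidation, one doesn't -> consolidation (1)
print(majority_ground_truth([0, 0, 1]))  # two miss the nodule, one flags it   -> no nodule (0)
```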
What’s the minimum number of abnormalities AI needs to see – its number needed to learn (NNL)? This depends on several factors: how sensitive you think the algorithm will be, the desired tightness of the confidence interval, the desired precision (paucity of false positives) and, crucially, the rarity of the abnormality. The rarer the abnormality, the more radiographs AI needs to see. To be confident of seeing eighty cases of a specific finding – the NNL was derived from a presumed sensitivity of 80 % – AI would have to see 15,000 radiographs. NNL wasn’t a problem in either training or validation – recall, there were 100,000 radiographs for validation, which is a feast even for training. But gathering test cases was onerous and expensive. Radiologists aren’t known to work for free.
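To give a flavor of the calculation, here is a rough sketch of the simplest version of the question: how many radiographs must be drawn before we can be, say, 95 % sure of collecting at least eighty cases of a finding? The prevalence and confidence level are my illustrative assumptions, and the sketch ignores the presumed sensitivity, so it won’t reproduce the 15,000 figure exactly.

```python
from scipy.stats import binom

def radiographs_needed(min_cases: int = 80, prevalence: float = 0.008, confidence: float = 0.95) -> int:
    """Smallest N with P(at least `min_cases` abnormals among N radiographs) >= confidence.
    Prevalence and confidence are illustrative assumptions, not figures from the article."""
    n = min_cases
    while binom.sf(min_cases - 1, n, prevalence) < confidence:  # P(X >= min_cases)
        n += 100
    return n

print(radiographs_needed())   # on the order of 10,000-plus radiographs for a rare finding
```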
Qure’s homegrown NLP flagged chest radiographs with the radiology descriptors in the new hospitals. There were normals, too, randomly distributed through the test, but the frequency of abnormalities differed from the training cases, where the frequency reflected the actual prevalence of radiographic abnormalities. Natural prevalences don’t guarantee sufficient abnormals in a sample of two thousand. Through a process called “enrichment”, the frequency of each abnormality in the test pool was increased, so that 80 cases each of opacity, nodule, consolidation, and so on were guaranteed in the test.
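In code terms, enrichment is just stratified sampling: draw a fixed quota of each abnormality rather than sampling at natural prevalence. A sketch, with hypothetical pools of radiograph IDs:

```python
import random

def enrich(pools: dict[str, list], per_finding: int = 80, n_normals: int = 1000) -> list:
    """Build a test set with a fixed quota of each abnormality plus a pool of normals.
    `pools` maps finding name -> list of candidate radiograph IDs (hypothetical)."""
    test_set = []
    for finding, candidates in pools.items():
        quota = n_normals if finding == "normal" else per_finding
        test_set.extend(random.sample(candidates, quota))   # sample without replacement
    random.shuffle(test_set)
    return test_set
```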
The abnormals in the test were more frequent than in real life. Contrived? Yes. Unfair? No. In the American board examination, radiologists are shown only abnormal cases.
Like anxious parents, Qure’s scientists waited for the exam result, the AUC.
“We expected sensitivities of 80 %. That’s how we calculated our sample size. A few radiologists advised us not to develop algorithms for chest radiographs, saying that it was a fool’s errand because radiographs are so subjective. We could still hear their warnings,” Preetham recalled with subdued nostalgia.
The AUC for detecting an abnormal chest radiograph was 0.92. Individual radiologists, unsurprisingly, did better – they were part of the truth, after all. As expected, the degree of agreement between radiologists, the inter-observer variability, affected AI’s performance, which was highest when radiologists were most in agreement, such as when calling cardiomegaly. The radiologists had been instructed to call “cardiomegaly” when the cardiothoracic ratio was greater than 0.5. For this finding, the radiologists agreed 92 % of the time. For normal, radiologists agreed 85 % of the time. For cardiomegaly, the algorithm’s AUC was 0.96. Given the push to make radiology more quantitative and less subjective, these statistics should be borne in mind.
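For readers unfamiliar with the rule the radiologists were given: the cardiothoracic ratio is the widest horizontal span of the heart divided by the widest internal span of the thorax on a frontal radiograph, and a ratio above 0.5 is conventionally called cardiomegaly. A trivial sketch, with hypothetical measurements:

```python
def cardiothoracic_ratio(cardiac_width_mm: float, thoracic_width_mm: float) -> float:
    """Maximal horizontal cardiac width over maximal internal thoracic width."""
    return cardiac_width_mm / thoracic_width_mm

ctr = cardiothoracic_ratio(152.0, 290.0)   # hypothetical measurements
print(round(ctr, 2), "cardiomegaly" if ctr > 0.5 else "within normal limits")   # 0.52 cardiomegaly
```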
For all abnormalities, both measures of diagnostic performance were over 90 %. The algorithm got straight As. In fact, the algorithm performed better on the test (AUC 0.92) than on the validation cases (AUC 0.86) at discerning normal – a testament not to a less-is-more philosophy but to the fact that the test sample had fewer gray-zone abnormalities, such as calcification of the aortic knob, the type of “abnormality” that some radiologists report and others ignore. This meant that AI’s performance had reached an asymptote which couldn’t be overcome by more data, because the more radiographs it saw, the more gray-zone abnormalities it’d see. This curious phenomenon mirrors radiologists’ performance. The more chest radiographs we see, the better we get. But we get worse, too, because we know what we don’t know and become more uncertain. After a while there’s little net gain in performance from seeing more radiographs.
Nearly three years after the company was conceived, after several dead ends, and morale-lowering frustrations with the metadata, the chest radiograph algorithm had matured. It was actually not a single algorithm but a bunch of algorithms which helped each other and could be combined into a meta-algorithm. The algorithms moved like bees but functioned like a platoon.
As the team was about to open the champagne, Ammar Jagirdar, Product Manager, had news.
“Guys, the local health authority in Baran, Rajasthan, is interested in our TB algorithm.”
Ammar, a former dentist with a second degree in engineering, also from IIT, isn’t someone you can easily impress. He gave up his lucrative dental practice for a second career because he found shining teeth intellectually bland.
“I was happy with the algorithm performance,” said Ammar, “but having worked in start-ups, I knew that building the product is only 20 % of the task. 80 % is deployment.”
Ammar had underestimated deployment. He had viewed it as an engineering challenge. He anticipated mismatched IT systems which could be fixed by clever code or iPhone apps. Rajasthan would teach him that the biggest challenge to deploying algorithms wasn’t the AUC, or clever statisticians arguing endlessly on Twitter about which outcome measure captures the value of AI, or overfitting. It was a culture of doubt. A culture which didn’t so much fear change as couldn’t be bothered changing. Qure’s youthful scientists, who looked like characters from a Netflix college movie, would have to labor to be taken seriously.
Saurabh Jha (aka @RogueRad) is a contributing editor to The Health Care Blog. This is Part 2 of a 3-part series.