AI is often held up as a miracle worker in medicine, especially in screening, where machine learning models boast expert-level performance at detecting problems. But as with so many technologies, performing well in the lab is one thing and performing in real life quite another – as Google researchers learned in a humbling test at clinics in rural Thailand.
Google Health created a deep learning system that examines images of the eye for evidence of diabetic retinopathy, one of the leading causes of vision loss worldwide. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating patients and nurses alike with inconsistent results and a generally poor fit with local practices.
To begin with, it must be said that the lessons learned here were hard ones, but that conducting this kind of test is a necessary and responsible step, and it is commendable that Google published these less-than-flattering results. The documentation makes clear that the team has already taken the findings to heart (though the blog post puts a rather sunny interpretation on events). Still, it is equally clear that the deployment was attempted with a lack of understanding that would be funny were the setting not so serious.
The research paper documents the deployment of a tool meant to augment the existing process by which patients at several clinics in Thailand are screened for diabetic retinopathy (DR). Essentially, nurses see diabetic patients one at a time, photograph their eyes (a “fundus photo”), and send the images in batches to ophthalmologists, who evaluate them and return results, usually at least four to five weeks later owing to high demand.
The Google system was intended to provide ophthalmologist-like expertise in seconds. In internal tests it identified signs of DR with about 90 percent accuracy; the nurses could then make a preliminary recommendation for referral or further testing in a minute or so rather than a month (the automated decisions were audited for accuracy by an ophthalmologist within a week). Sounds great – in theory.
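The intended nurse-facing workflow can be sketched in a few lines. Everything here, the function name, the grade labels, and the referral rule, is a hypothetical illustration rather than Google's actual system:

```python
# Hypothetical sketch of the intended same-visit workflow; grade labels
# and the referral rule are illustrative assumptions, not Google's code.

def triage(dr_grade: str) -> str:
    """Turn a model's DR severity grade into a same-visit recommendation."""
    referable = {"moderate", "severe", "proliferative"}
    if dr_grade in referable:
        return "refer to ophthalmologist"  # escalate to specialist care
    return "routine annual screening"

# Instead of waiting 4-5 weeks for a manual grade, the nurse gets this
# recommendation moments after the fundus photo is graded.
print(triage("severe"))  # refer to ophthalmologist
print(triage("none"))    # routine annual screening
```

The point of the design is that the scarce resource (the ophthalmologist) moves from the critical path to an after-the-fact audit role.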
However, that theory fell apart as soon as the study's authors hit the ground. As the study describes it:
We observed a high degree of variation in the eye-screening process across the 11 clinics in our study. The processes of capturing and grading images were consistent across clinics, but nurses had a high degree of autonomy in how they organized the screening workflow, and the resources available varied from clinic to clinic.
The environments and locations where eye screenings took place also varied widely across clinics. Only two clinics had a dedicated screening room that could be darkened to ensure patients' pupils were dilated enough to take a high-quality fundus photo.
This patchwork of conditions and processes meant that images being sent to the server often failed to meet the algorithm's exacting standards:
The deep learning system has stringent guidelines regarding the images it will assess. If an image is a bit blurry or dark, for instance, the system rejects it, even if it could make a strong prediction from it. The system's high standards for image quality were at odds with the consistency and quality of the images the nurses routinely captured under clinic conditions, and this mismatch caused frustration and extra work.
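The gating behavior the quote describes can be sketched roughly like this; the attribute names and thresholds are assumptions made for illustration, not values from the deployed system:

```python
# Rough sketch of quality gating: a photo below the quality bar is rejected
# before grading, even if the model might still have predicted confidently.
# Attribute names and thresholds are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class FundusPhoto:
    sharpness: float   # 0.0-1.0, higher = more in focus
    brightness: float  # 0.0-1.0, higher = better lit

def accept_for_grading(photo: FundusPhoto,
                       min_sharpness: float = 0.8,
                       min_brightness: float = 0.6) -> bool:
    """Return True only if the photo clears every quality threshold."""
    return (photo.sharpness >= min_sharpness
            and photo.brightness >= min_brightness)

# A photo taken without a darkened screening room comes out slightly dark
# and is rejected outright, sending the patient back to the manual queue.
print(accept_for_grading(FundusPhoto(sharpness=0.9, brightness=0.5)))  # False
```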
Images showing obvious DR but of poor quality would be rejected by the system, complicating and prolonging the process. And that is assuming the images could be uploaded to the system in the first place:
With a strong internet connection, these results appear within a few seconds. However, the clinics in our study often had slower and less reliable connections. This caused some images to take 60–90 seconds to upload, slowing the screening queue and limiting the number of patients who could be screened in a day. At one clinic, the internet went out for two hours during eye screening, reducing the number of patients screened from 200 to only 100.
“First, do no harm” is arguably at stake here: in that instance, fewer people were screened because of the attempt to use this technology. The nurses tried various workarounds, but the inconsistency and other factors led them to discourage some patients from participating in the study at all.
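The throughput cost of those slow uploads is easy to estimate. A minimal back-of-the-envelope sketch, assuming a seven-hour screening day and roughly two minutes of non-upload handling per patient (both assumptions, chosen only to make the arithmetic concrete):

```python
# Back-of-the-envelope math for the upload bottleneck: how many patients
# one screening queue can process in a day. The 7-hour day and 2 minutes
# of per-patient handling time are illustrative assumptions.

def patients_per_day(hours: float, handling_s: float, upload_s: float) -> int:
    """Patients a single screening queue can process in one day."""
    return int(hours * 3600 // (handling_s + upload_s))

fast = patients_per_day(7, 120, 5)    # near-instant upload on a good link
slow = patients_per_day(7, 120, 90)   # 90-second uploads on a poor link
print(fast, slow)  # 201 120
```

Under these assumptions, 90-second uploads alone cut daily capacity by roughly 40 percent, which is consistent in spirit with the clinic numbers the study reports.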
Even the best-case scenario had unforeseen consequences. Patients were not prepared to receive an instant evaluation and have a follow-up appointment scheduled the moment their image was submitted.
Due to the prospective study protocol design and the need to make plans for a referral hospital visit on the spot, we observed nurses in clinics 4 and 5 dissuading patients from participating in the prospective study, out of concern that it could cause unnecessary hardship.
As one of these nurses put it:
“(Patients) are not concerned about accuracy, but how the experience will be – will it waste my time if I have to go to the hospital? I assure them they don't have to go to the hospital. They ask, ‘Will it take longer?’, ‘Do I have to go somewhere else?’ Some people aren't ready to go, so they don't join the research. Forty to fifty percent don't participate because they think they have to go to the hospital.”
Of course, it's not all bad news. The problem is not that AI has nothing to offer a busy Thai clinic, but that the solution must be tailored to the problem and the place. The instant, easy-to-understand automated evaluation was appreciated by patients and nurses alike when it worked well, and at times it helped convey that a condition was serious and needed prompt attention. And of course, the fundamental benefit of reduced reliance on a severely limited resource (local ophthalmologists) is potentially transformative.
Still, the study's authors seem clear-eyed in their assessment of this early and partial deployment of their AI system. As they put it:
When introducing new technologies, planners, policy makers, and technology designers often do not account for the dynamic and emergent nature of issues arising in complex healthcare programs. The authors argue that attending to people (their motivations, values, professional identities, and the current norms and routines that shape their work) is vital when planning deployments.
The paper is worth reading, both as an introduction to how AI tools are deployed in clinical settings and as a catalog of the obstacles they face there, technological and human alike.