According to a new study, ChatGPT is mediocre at diagnosing medical conditions, getting it right only 49% of the time. The researchers say their findings show that AI should not be the only source of medical information and highlight the importance of maintaining the human element in healthcare.

The ease of access to online technology means that some people forgo seeing a medical professional, choosing instead to Google their symptoms. While being proactive about one’s health is not a bad thing, ‘Dr. Google’ is not especially accurate. A 2020 Australian study that reviewed 36 international mobile and web-based symptom checkers found that the correct diagnosis was listed first only 36 percent of the time.

Of course, AI has improved since 2020. OpenAI’s ChatGPT has progressed in leaps and bounds – it can pass the US Medical Licensing Exam, after all. But does that make it better than Dr. Google in terms of diagnostic accuracy? That is the question researchers from Canada’s Western University sought to answer in a new study.

Using ChatGPT 3.5, a large language model (LLM) trained on a dataset of more than 400 billion words from the internet, including books, articles and websites, the researchers conducted a qualitative analysis of the medical information the chatbot provided when answering Medscape Case Challenges.

Medscape Case Challenges are complex medical cases that test a medical professional’s knowledge and diagnostic skills. Medical professionals must choose from four multiple-choice answers to make a diagnosis or select an appropriate treatment plan for the case. The researchers chose Medscape’s Case Challenges because they are open source and freely accessible. To rule out the possibility that ChatGPT had prior knowledge of the cases, only cases authored after model 3.5’s training cut-off in August 2021 were included.

A total of 150 Medscape cases were analyzed. With four multiple-choice answers per case, this means there were 600 possible answers in total, with only one correct answer per case. The cases analyzed covered a wide range of medical problems, with titles such as “35-year-old asthmatic with nasal obstruction from beer, aspirin”, “Gastro case challenge: a 33-year-old man who cannot swallow his own saliva”, “A 27-year-old woman with persistent headaches who is too tired to party”, “Pediatric case challenge: a 7-year-old boy with a limp and obesity who collapsed in the street”, and “An aerobics-loving accountant with hiccups and incoordination”. Cases that relied on visual assets, such as clinical images, medical photography, and graphs, were excluded.

An example of a standard prompt fed to ChatGPT

Hadi et al.

To ensure consistency in the input provided to ChatGPT, each case challenge was converted into a standardized prompt, including a script for the output the chatbot was to provide. All cases were evaluated by at least two independent raters, medical trainees, blinded to each other’s responses. They assessed ChatGPT’s responses for diagnostic accuracy, cognitive load (i.e., the complexity and clarity of the information provided, from low to high) and quality of medical information (including whether it was complete and relevant).
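The article does not reproduce the study’s actual prompt template, but a standardized prompt of this kind might look roughly like the sketch below; the wording, layout, and example case are illustrative assumptions, not the researchers’ material.

```python
# Illustrative sketch only: the study's real prompt template and cases are not
# reproduced in the article, so the wording and example below are assumptions.

def build_case_prompt(vignette: str, options: list[str]) -> str:
    """Convert one Medscape-style case challenge into a single standardized prompt."""
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    return (
        "You are answering a multiple-choice medical case challenge.\n\n"
        f"Case:\n{vignette}\n\n"
        f"Options:\n{numbered}\n\n"
        # The 'script for the output': ask for the same response structure every time
        "Reply with the number of the single best option, then a brief rationale."
    )

# Hypothetical case, not taken from the study
print(build_case_prompt(
    "A 33-year-old man presents with progressive dysphagia and is unable to swallow his saliva.",
    ["Achalasia", "Esophageal stricture", "Eosinophilic esophagitis", "Globus sensation"],
))
```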

Out of the 150 Medscape cases analyzed, ChatGPT provided correct answers in 49% of cases. However, the chatbot demonstrated an overall accuracy of 74%, reflecting its ability to identify and reject incorrect multiple-choice options.

“This higher value is due to ChatGPT’s ability to identify true negatives (incorrect options), which significantly contributes to overall accuracy and enhances its usefulness in eliminating incorrect choices,” the researchers explain. “This difference highlights ChatGPT’s high specificity, indicating its strong performance in ruling out incorrect diagnoses. However, it needs improvement in precision and sensitivity to reliably identify the correct diagnosis.”

In addition, ChatGPT produced false positives (13%) and false negatives (13%), which has implications for its use as a diagnostic tool. Just over half (52%) of the responses provided were complete and relevant, while 43% were incomplete but still relevant. ChatGPT tended to generate responses with low (51%) to moderate (41%) cognitive load, making them easy for users to understand. However, the researchers point out that this ease of understanding, combined with the potential for incorrect or irrelevant information, could lead to misunderstandings, especially if ChatGPT is being used as a medical education tool.
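As a rough back-of-the-envelope check (an illustration, not a calculation from the paper itself), those figures hang together if each of the four options in every case is treated as a separate accept/reject judgment: a correctly answered case yields one true positive and three true negatives, while an incorrectly answered case yields one false positive, one false negative and two true negatives.

```python
# Back-of-the-envelope sketch (an assumption, not the study's stated method):
# reconcile the 49% per-case accuracy with the 74% overall accuracy and the
# 13% error rates, assuming one correct option per four-choice case.
cases = 150
correct_cases = round(0.49 * cases)   # ~74 cases answered correctly
wrong_cases = cases - correct_cases   # ~76 cases answered incorrectly

total_judgments = cases * 4           # 600 option-level accept/reject judgments
# Correct case:   1 true positive + 3 true negatives -> 4 correct judgments
# Incorrect case: 1 false positive + 1 false negative + 2 true negatives -> 2 correct
correct_judgments = correct_cases * 4 + wrong_cases * 2

print(f"overall accuracy ~ {correct_judgments / total_judgments:.0%}")  # ~75%, near the reported 74%
print(f"false positives  ~ {wrong_cases / total_judgments:.0%}")        # ~13%
print(f"false negatives  ~ {wrong_cases / total_judgments:.0%}")        # ~13%
```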

“ChatGPT also struggled to differentiate between diseases with subtly different presentations, and the model occasionally generated incorrect or implausible information, known as AI hallucinations, emphasizing the risk of relying solely on ChatGPT for clinical guidance and the need for human expertise in the diagnostic process,” the researchers said.

Researchers say AI should be used as a tool to augment, rather than replace, the human element of medicine.

Of course – and the researchers cite this as a limitation of the study – ChatGPT 3.5 is only one AI model, may not be representative of other models, and is bound to improve in future iterations, which could raise its accuracy. Also, the Medscape cases analyzed by ChatGPT primarily focused on differential diagnosis, where medical professionals must distinguish between two or more conditions with similar signs or symptoms.

Although future research should assess the accuracy of different AI models using a wider range of case sources, the results of the current study are instructive.

“The combination of high relevance with relatively low accuracy advises against relying on ChatGPT for medical counsel, as it can present important information that may be misleading,” the researchers said. “While our results indicate that ChatGPT consistently delivers the same information to different users, demonstrating substantial interrater reliability, it also reveals the tool’s shortcomings in delivering factually correct medical information, as evident [sic] by its low diagnostic accuracy.”

The study was published in the journal PLoS One.




