Building Multilingual LLMs to Transform Global Healthcare
25-11-2024 | By Liam Critchley
Large language models (LLMs) have grown massively in the last few years, ever since ChatGPT took the world by storm. Beyond answering queries from the general public across a multitude of knowledge areas, LLMs have shown a lot of promise in the healthcare sector.
Key Things to Know:
- Open-source multilingual medical LLMs like MMedLlama 3 aim to bridge language barriers in global healthcare systems.
- The innovative three-phase approach includes creating a medical corpus, establishing a question-answer benchmark, and evaluating leading language models.
- With over 25.5 billion medical tokens, the Multilingual Medical Corpus (MMedC) is a groundbreaking dataset spanning six major languages.
- MMedLlama 3 demonstrates unparalleled performance on both English and multilingual benchmarks, setting a new standard for medical AI.
Closed-source LLMs such as GPT-4 and MedPaLM 2 have shown a lot of promise in the healthcare space and have passed the United States Medical Licensing Examination (USMLE). Open-source models such as Llama 2 have also been used to develop specialist LLMs for the medical sector, such as PMC-Llama, MedAlpaca, MEDITRON and ChatDoctor.
However, even though there has been a lot of progress in developing LLMs for the medical sector, these LLMs have primarily focused on English-language queries, meaning that their reach and potential use are limited to those who speak English to a good standard. Developing an open-source multilingual LLM could benefit a wider range of people from different regions of the world.
The challenge with existing open-source multilingual LLMs is that they still perform poorly on non-English medical queries, despite being trained on diverse multilingual datasets, because those datasets contain too little medical content. Researchers have now tried to address these issues by developing a new open-source multilingual LLM specifically for the healthcare sector, trained on specialist medical terminology data.
Developing a New Medical Multilingual LLM Model
The researchers tackled these challenges via a three-phase approach. The first phase was to create a medical corpus designed for auto-regressive training, building a foundation that encompasses the linguistic diversity of the medical sector. The second phase was to create a new medical question-answering benchmark built around multiple-choice questions and answers (Q&A), while the third phase tested and evaluated a wide range of existing LLMs to identify the best base for a new model.
Phase One: Building the Medical Corpus
For the auto-regressive training, the researchers developed a Multilingual Medical Corpus (MMedC) with over 25.5 billion medical tokens across six languages: English, Russian, Spanish, Japanese, Chinese and French. The data was drawn from four different sources.
The first was an automatic pipeline that extracts medical content from a general multilingual corpus, ensuring that all the data in the training dataset is relevant and medically focused. Secondly, the researchers curated multiple medical textbooks across various languages and used pre-processing techniques (optical character recognition, heuristic data filtering) to convert the information into text for the LLM.
The third source came from medical knowledge contained within open-source medical websites, allowing texts to be incorporated from a range of authoritative sites. The fourth came from integrating multiple small-scale medical corpora already in the public domain to enhance the depth of the corpus. The final result was MMedC, the first corpus specifically targeted at the multilingual medical domain.
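The paper's exact filtering pipeline is not reproduced here, but a minimal sketch of a heuristic, keyword-density filter of the kind described above might look like the following (the seed vocabularies, the density threshold, and all function names are illustrative assumptions, not the researchers' actual code):

```python
# Minimal sketch of a heuristic medical-content filter for a multilingual
# corpus. The keyword lists, density threshold, and function names are
# illustrative assumptions, not the authors' actual pipeline.
from typing import Iterable, Iterator

# Tiny per-language seed vocabularies; a real pipeline would use far
# larger curated term lists (e.g. translated medical vocabularies).
MEDICAL_TERMS = {
    "en": {"patient", "diagnosis", "therapy", "symptom", "clinical"},
    "es": {"paciente", "diagnóstico", "terapia", "síntoma", "clínico"},
    "fr": {"patient", "diagnostic", "thérapie", "symptôme", "clinique"},
}

def medical_term_density(text: str, lang: str) -> float:
    """Fraction of tokens that match the language's medical seed terms."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    vocab = MEDICAL_TERMS.get(lang, set())
    hits = sum(1 for t in tokens if t.strip(".,;:()") in vocab)
    return hits / len(tokens)

def filter_medical(docs: Iterable[tuple[str, str]],
                   threshold: float = 0.02) -> Iterator[tuple[str, str]]:
    """Yield (lang, text) pairs whose medical-term density clears the bar."""
    for lang, text in docs:
        if medical_term_density(text, lang) >= threshold:
            yield lang, text

if __name__ == "__main__":
    sample = [
        ("en", "The patient presented with symptom onset after therapy."),
        ("en", "The football match ended in a draw last night."),
    ]
    for lang, text in filter_medical(sample):
        print(lang, "->", text)  # only the medical sentence survives
```

In practice, a density threshold like this is a crude first pass; it would typically be combined with a trained classifier before text enters the corpus.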
Ensuring Quality and Relevance in Data Curation
The development of MMedC also underscores the importance of data curation standards. Leveraging optical character recognition (OCR) and heuristic filtering ensures that the input data not only maintains medical relevance but also upholds ethical considerations around patient confidentiality and dataset transparency. This focus on robust data practices aligns with broader principles of trustworthy AI development in sensitive fields such as healthcare.
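As a rough illustration of the textbook digitisation step, the sketch below runs OCR on a scanned page and applies a simple heuristic cleanup; the cleanup rules and file name are assumptions, since the paper only states that OCR and heuristic filtering were used:

```python
# Sketch of the textbook digitisation step: OCR a scanned page, then apply
# simple heuristic cleanup. The cleanup rules are illustrative assumptions.
from PIL import Image
import pytesseract

def ocr_page(path: str, lang: str = "eng") -> str:
    """Run Tesseract OCR on one scanned textbook page."""
    return pytesseract.image_to_string(Image.open(path), lang=lang)

def clean_ocr_text(raw: str, min_line_chars: int = 20) -> str:
    """Drop short, noisy lines (page numbers, running headers) and rejoin."""
    lines = [ln.strip() for ln in raw.splitlines()]
    kept = [ln for ln in lines if len(ln) >= min_line_chars]
    return " ".join(kept)

if __name__ == "__main__":
    # "textbook_page_042.png" is a hypothetical scanned-page file.
    text = clean_ocr_text(ocr_page("textbook_page_042.png"))
    print(text[:200])
```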
Phase Two: Establishing the Benchmark
The benchmark curation phase involved aggregating existing medical multiple-choice Q&A datasets across the six languages in MMedC. GPT-4 was also used to enrich the datasets by providing extra information that supports the correct answers. The enriched database contained 55,556 Q&A pairs across the six languages, offering not only multiple-choice Q&A but also an accompanying rationale explaining why each answer is correct.
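A minimal sketch of what an enriched record and its rationale-generation prompt could look like is shown below; the schema, field names, and prompt wording are illustrative assumptions rather than the researchers' actual format:

```python
# Sketch of the kind of record a rationale-enriched benchmark could hold.
# The schema and prompt wording are illustrative assumptions; the paper
# used GPT-4 to generate rationales supporting the correct answers.
from dataclasses import dataclass

@dataclass
class QAPair:
    language: str             # one of the six MMedC languages
    question: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "..."}
    answer: str               # correct option key, e.g. "B"
    rationale: str = ""       # model-generated explanation, added later

RATIONALE_PROMPT = (
    "You are a medical expert. Given the multiple-choice question below, "
    "explain step by step why option {answer} is correct.\n\n"
    "Question: {question}\nOptions: {options}"
)

def build_prompt(pair: QAPair) -> str:
    """Fill the rationale-generation prompt for one Q&A pair."""
    return RATIONALE_PROMPT.format(
        answer=pair.answer, question=pair.question, options=pair.options
    )

if __name__ == "__main__":
    pair = QAPair(
        language="en",
        question="Which vitamin deficiency causes scurvy?",
        options={"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D"},
        answer="B",
    )
    print(build_prompt(pair))
```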
Enhancing Benchmark Complexity
Expanding MMedBench with more complex question formats, such as scenario-based reasoning or case studies, could elevate the benchmark's ability to evaluate LLMs' contextual understanding. For example, multi-turn dialogues and decision-tree questions reflecting real-world clinical reasoning can better simulate practical healthcare challenges. This addition would make MMedBench a more comprehensive tool for assessing both reasoning and decision-making capabilities in medical LLMs.
The collection of Q&A pairs spanned 21 medical fields, including internal medicine, pharmacology, psychiatry, and biochemistry. This benchmark collection across the different domains was coined the Multilingual Medical Benchmark (MMedBench). The Q&A pairs were divided into 45,048 training pairs and 8,518 testing pairs, with the training pairs being used to fine-tune the LLMs after domain-specific training had been performed.
The researchers used the test set (all 8,518 test pairs) to evaluate multiple-choice accuracy. A subset of 1,136 Q&A pairs, accompanied by manually verified rationale sentences, was selected to provide a more specialised benchmark for evaluating the reasoning capabilities of the models.
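For illustration, a minimal accuracy harness for the multiple-choice evaluation might look like the sketch below; the answer-letter extraction and record layout are assumptions, as the paper's evaluation code is not reproduced here:

```python
# Minimal sketch of multiple-choice accuracy scoring over a test split.
# The option-letter extraction is an assumption; a real harness would
# parse model output far more robustly.
import re

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone option letter (A-E) from model output."""
    match = re.search(r"\b([A-E])\b", model_output)
    return match.group(1) if match else None

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions whose extracted letter matches the key."""
    assert len(predictions) == len(gold)
    correct = sum(
        1 for pred, key in zip(predictions, gold)
        if extract_choice(pred) == key
    )
    return correct / len(gold)

if __name__ == "__main__":
    preds = ["The answer is B because ...", "A", "I would pick C."]
    keys = ["B", "A", "D"]
    print(f"accuracy = {accuracy(preds, keys):.2f}")  # 0.67
```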
Phase Three: Evaluating Existing LLMs
The final phase of model development was the evaluation phase, which involved benchmarking thirteen existing LLMs with multilingual support: GPT-4, GPT-3.5, BLOOM, InternLM, InternLM 2, Gemini-1.0 Pro, Llama 2, Llama 3, MedAlpaca, ChatDoctor, MEDITRON, Mistral, and BioMistral. The most promising were further trained with MMedC.
The models were evaluated in three different settings—zero-shot, parameter-efficient fine-tuning (PEFT), and full fine-tuning—and human rating scores were used in the analysis. This approach enabled each model's performance to be scrutinised based on the correlation between automated metrics and human judgement, allowing the most reliable model with the best reasoning ability to be identified. All the models that showed promise underwent further autoregressive training with MMedC, and all showed improved performance over their base model, demonstrating the value of an effective multilingual corpus.
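As an example of the PEFT setting, the sketch below applies LoRA adapters to a base model via Hugging Face's peft library; the checkpoint name, rank, and target modules are illustrative assumptions, not the paper's configuration:

```python
# Sketch of the parameter-efficient fine-tuning (PEFT) setting using LoRA
# via Hugging Face's peft library. The checkpoint, hyperparameters, and
# target modules are illustrative assumptions, not the paper's config.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# LoRA trains a small set of low-rank matrices instead of all weights,
# which is what makes the PEFT evaluation setting cheap to run per model.
lora_config = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically under 1% of total weights
# From here, fine-tune on the MMedBench training pairs with a standard
# supervised fine-tuning loop; only the LoRA adapter weights are updated.
```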
The most accurate was found to be Llama 3, so this was the base LLM that was chosen to be enhanced and developed further.
Optimising Transformer Architectures
The choice of Llama 3 as the foundational LLM stems from its robust architecture and scalability. However, ongoing improvements in transformer-based architectures, such as efficient attention mechanisms and modular pre-training layers, could further optimise its performance. Introducing sparse computations and adaptive learning rates during autoregressive training could also enhance the model's processing speed and accuracy, particularly when managing extensive multilingual datasets.
Addressing Bias and Inclusion
Another pivotal consideration during this phase is ensuring that the training datasets address biases inherent in existing medical literature. For instance, ensuring representation across demographics, regions, and socio-economic conditions helps models like Llama 3 generalise effectively. Such inclusivity not only improves model performance but also aligns with ethical frameworks emphasising equitable healthcare solutions.
The Final Model: MMedLlama 3
The final model, MMedLlama 3, showed the best performance on both English and multilingual benchmarks. The researchers have said that they will publicly release the dataset, codebase, human rating results for individual cases, and the trained models, so that the models can be developed further and used in clinical settings.
Clinical Impacts of the New LLM Model
Beyond pure research efforts, open-source multilingual medical LLMs such as MMedLlama 3 offer a number of clinical benefits that centre on easing the language barrier in healthcare systems.
Firstly, language barriers between patients and healthcare providers can prevent the effective communication of ailments, leading to misunderstandings, wrong diagnoses and poor care. Multilingual medical LLMs can provide real-time translation and interpretation so that patients can describe their symptoms effectively and clinicians can diagnose and treat them accurately.
Secondly, multilingual medical LLMs can also address cultural and legal nuances in different countries during healthcare interactions. Training an LLM to understand cultural sensitivities and legal differences across regions of the world could help it deliver better health outcomes to more people in different locations.
Finally, multilingual medical LLMs can help to improve medical education. They can be customised for educational purposes in regions of the world that lack medical educators and learning resources. Using multilingual medical LLMs as a platform for providing educational materials and simulations in different languages could help to standardise medical training around the world, ensuring that every region has access to a consistently high quality of care.