Large Language Models (LLMs) have marked a fundamental shift in AI research and development. However, despite their widespread impact, they remain fundamentally limited.
Namely, LLMs can only process and generate text, making them blind to other modalities such as images, video, and audio. This is a major limitation, since some tasks rely on non-textual data, for example, analyzing engineering blueprints, reading body language or speech tonality, and interpreting plots and infographics.
This has motivated efforts to extend LLM functionality to multiple modalities.
A multimodal model (MM) is an AI system that can process multiple data modalities as input, output, or both [1]. Below are a few examples.
- GPT-4o – Input: text, images, audio. Output: text.
- Flux – Input: text. Output: images.
- Suno – Input: text. Output: audio.
Although there are many ways to build a model that can process multiple data modalities, a recent line of research attempts to use LLMs as the core reasoning engine of a multimodal system [2]. Such models are called multimodal large language models (or large multimodal models) [2][3].
One key benefit of using an existing LLM as a starting point for MMs is that they have demonstrated a strong ability to acquire world knowledge through large-scale pre-training, which can be exploited to process concepts appearing in non-textual representations.
Here, I will focus on multimodal models developed from LLMs. Three popular approaches are described below.
- LLM + Tools: Augment LLMs with pre-built components
- LLM + Adapters: Augment LLMs with multimodal encoders or decoders, which are aligned via adapter fine-tuning
- Unified Models: Expand the LLM architecture to fuse modalities at pre-training
The simplest way to make an LLM multimodal is by adding external modules that can readily translate between text and an arbitrary modality. For example, a transcription model (e.g. Whisper) can be connected to an LLM to translate input speech into text, or a text-to-image model can generate images based on LLM outputs.
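To make this concrete, here is a minimal sketch of such a tool pipeline, assuming the openai-whisper and ollama Python packages are installed, an Ollama server is running locally, and a local audio file named speech.wav exists (the file name, prompt, and model choices are illustrative, not from the original setup; response access may vary slightly by ollama library version).

```python
import whisper  # openai-whisper package
import ollama   # client for a locally running Ollama server

# Tool 1: a speech-to-text module translates the audio modality into text
transcriber = whisper.load_model("base")
transcript = transcriber.transcribe("speech.wav")["text"]

# Tool 2: the LLM reasons over the text-only description of the audio
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": f"Summarize this voice memo:\n{transcript}"}],
)
print(response["message"]["content"])
```

Notice that the LLM never sees the audio itself, only Whisper's textual rendering of it, which is exactly where the "telephone problem" described below comes from.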
The key benefit of such an approach is simplicity. Tools can be assembled quickly without any additional model training.
The downside, however, is that the quality of such systems may be limited. Just like in a game of telephone, where messages mutate as they are passed from person to person, information can degrade as it moves from module to module via text-only descriptions.
One way to mitigate the "telephone problem" is to align the representations of a new modality with the LLM's internal concept space, for example, ensuring that an image of a dog and a text description of one look similar to the LLM.
This is possible through the use of an adapter, a relatively small set of parameters that appropriately translate a dense vector representation for a downstream model [2][4][5].
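As a rough illustration (not the architecture of any specific model mentioned here), an adapter can be as simple as a small projection network that maps image-encoder features into the LLM's embedding dimension. The dimensions below are made up for the example.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Small trainable bridge: image-encoder features -> LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(image_features)

# The projected "image tokens" can then be prepended to the text token
# embeddings and passed through the (typically frozen) LLM.
adapter = VisionAdapter()
fake_image_features = torch.randn(1, 256, 1024)  # placeholder encoder output
image_tokens = adapter(fake_image_features)      # shape: (1, 256, 4096)
```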
For example, an adapter can be trained using image-caption pairs, where the adapter learns to translate image encodings into a representation compatible with the LLM [2][4][6]. One way to achieve this is via contrastive learning [2], which I will discuss more in the next article of this series.
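For readers who want a preview, below is a generic CLIP-style contrastive objective. This is a common way to align image and text representations, sketched here for illustration; it is not necessarily the exact training recipe covered later in the series.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull matching image/caption pairs together, push mismatched pairs apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))            # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```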
The upside of using adapters to augment LLMs is improved alignment between novel modality representations and text in a data-efficient way. Since many pre-trained embedding, language, and diffusion models are available in today's AI landscape, one can readily fuse models based on their needs. Notable examples from the open-source community are LLaVA, LLaMA 3.2 Vision, Flamingo, MiniGPT4, Janus, Mini-Omni2, and IDEFICS [3][5][7][8].
However, this data efficiency comes at a price. Just as adapter-based fine-tuning approaches (e.g. LoRA) can only nudge an LLM so far, the same holds true in this context. Additionally, bolting various encoders and decoders onto an LLM may result in overly complicated model architectures.
The final way to make an LLM multimodal is by incorporating multiple modalities at the pre-training stage. This works by adding modality-specific tokenizers (rather than pre-trained encoder/decoder models) to the model architecture and expanding the embedding layer to accommodate the new modalities [9].
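As a simplified sketch of this idea (not the actual architecture of any model named here), discrete image tokens produced by a modality-specific tokenizer can be appended to the text vocabulary, with the embedding layer resized accordingly. All sizes below are toy numbers chosen for illustration.

```python
import torch
import torch.nn as nn

text_vocab_size = 32_000     # original LLM vocabulary (illustrative)
image_codebook_size = 8_192  # discrete codes from an image tokenizer (illustrative)
d_model = 512                # toy embedding size; real LLMs use e.g. 4096

# One shared embedding table covers both text tokens and image-code tokens,
# so both modalities live in the same sequence and the same concept space.
embedding = nn.Embedding(text_vocab_size + image_codebook_size, d_model)

text_ids = torch.tensor([[10, 57, 3012]])                   # text tokens
image_ids = torch.tensor([[5, 77, 123]]) + text_vocab_size  # offset image codes
sequence = torch.cat([image_ids, text_ids], dim=1)          # interleaved modalities
hidden = embedding(sequence)  # (1, 6, 512), ready for the transformer stack
```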
While this approach comes with significantly greater technical challenges and computational requirements, it enables the seamless integration of multiple modalities into a shared concept space, unlocking better reasoning capabilities and efficiencies [10].
The prime example of this unified approach is (presumably) GPT-4o, which processes text, image, and audio inputs to enable expanded reasoning capabilities at faster inference times than previous versions of GPT-4. Other models that follow this approach include Gemini, Emu3, BLIP, and Chameleon [9][10].
Training these models typically involves multi-step pre-training on a set of (multimodal) tasks, such as language modeling, text-image contrastive learning, text-to-video generation, and others [7][9][10].
With a basic understanding of how LLM-based multimodal models work, let’s see what we can do with them. Here, I will use LLaMA 3.2 Vision to perform various image-to-text tasks.
To run this example, download Ollama and its Python library. This enables the model to run locally, i.e. no need for external API calls.
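Here is a minimal sketch of an image-to-text call, assuming the Ollama server is running, the model has been pulled (e.g. via `ollama pull llama3.2-vision`), and a local file named image.jpg exists; the file name and prompt are placeholders rather than the exact ones used in the example code.

```python
import ollama

# Send a local image plus a text prompt to LLaMA 3.2 Vision running locally
response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "Describe what you see in this image.",
        "images": ["image.jpg"],  # path to a local image file
    }],
)
print(response["message"]["content"])
```

The same pattern works for other image-to-text tasks, such as answering questions about a plot or extracting text from a screenshot, simply by changing the prompt.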
The example code is freely available on GitHub.