Multimodal AI: How Models Process Images, Text, and Audio

Key takeaways

  • Multimodal models process two or more input types — typically images and text, often audio and video — inside a single architecture.
  • CLIP established the modern recipe: train an image encoder and a text encoder to map matching pairs to nearby points using a contrastive loss over hundreds of millions of image-caption pairs.
  • Instruction-tuned systems like LLaVA, GPT-4V, Claude vision, and Qwen-VL bolt a vision encoder onto a pretrained language model through a projection layer, then fine-tune on visual instruction data.
  • GPT-4o and Gemini push toward “native” multimodality — a single model trained on interleaved image, audio, and text tokens from the start rather than stitched together after the fact.
  • Real deployments cover document understanding, visual question answering, accessibility tools, UI agents, and perception stacks for robotics — each with different latency, grounding, and hallucination tradeoffs.

What makes a model multimodal

A single-modality model takes one kind of input — pixels, or tokens, or audio samples — and produces one kind of output. A multimodal model ingests at least two. The hard part is not reading each modality in isolation; encoders for that have existed for years. The hard part is fusing them so the representation of “a red car in the left lane” built from pixels aligns with the representation built from the sentence “a red car in the left lane”.


There are three dominant architectural recipes in 2026. Dual-encoder contrastive models (CLIP, SigLIP, ALIGN) train two towers to agree on matched pairs. Adapter-style vision-language models (LLaVA, MiniGPT-4, Qwen-VL, early GPT-4V) wire a pretrained vision encoder into a pretrained language model and teach the language model to read image tokens. Native multimodal models (GPT-4o, Gemini, Chameleon) are trained end-to-end from scratch on interleaved streams of every modality. Each recipe makes different tradeoffs between training cost, retrieval quality, and open-ended reasoning.

Vision encoders: the ViT foundation

Nearly every modern multimodal system uses a Vision Transformer (ViT) or a close variant as the image encoder. A ViT slices an image into fixed-size patches — typically 14×14 or 16×16 pixels — linearly projects each patch into a vector, adds a positional embedding, and runs the sequence through a standard transformer stack. The output is a sequence of patch embeddings plus a global embedding that can be consumed by downstream layers. For the underlying attention machinery, see the transformers primer.
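The patch-to-token step can be sketched in plain Python. This is bookkeeping only, with toy dimensions; a real ViT implements it as a strided convolution followed by a learned linear projection:

```python
def patchify(image, patch=16):
    """Split a nested-list image [H][W][C] into flattened patch vectors,
    the raw inputs to the ViT's linear projection."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    patches = []
    for i in range(0, h - h % patch, patch):
        for j in range(0, w - w % patch, patch):
            # Flatten one patch across all channels into a single vector
            patches.append([image[i + di][j + dj][ch]
                            for di in range(patch)
                            for dj in range(patch)
                            for ch in range(c)])
    return patches

# A 32x32 RGB image with 16x16 patches yields 4 tokens of length 16*16*3
img = [[[0.0] * 3 for _ in range(32)] for _ in range(32)]
tokens = patchify(img)
print(len(tokens), len(tokens[0]))  # 4 768
```

Each flattened vector would then be multiplied by a learned projection matrix and summed with a positional embedding before entering the transformer stack.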

Two things made ViTs the default choice for multimodal work. First, they share a computational substrate with language models, which makes joining the two straightforward — image patches and text tokens are just two flavors of the same sequence. Second, they scale well: unlike CNNs, which bake translation priors into the architecture, ViTs have fewer built-in assumptions, so they keep improving as you add data and parameters. For broader background on pixel-level tasks, see the computer vision explainer.

Resolution, tiling, and token budgets

A practical constraint rarely discussed in papers is the image-token budget. A 224×224 image with 14×14 patches produces 256 tokens; a 1024×1024 document page produces several thousand. Every token consumed by the image is a token not available for text context, and attention cost grows quadratically. Production systems like GPT-4V and Claude vision use dynamic tiling: the image is split into multiple crops at different resolutions, each encoded separately, and a low-resolution thumbnail preserves global context. This is why document-understanding quality depends so heavily on the tiling strategy, not just the encoder.
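The budget arithmetic is easy to make concrete. The sketch below assumes a hypothetical 448-pixel tile size and a 224-pixel global thumbnail; real systems choose these numbers empirically and keep them proprietary:

```python
import math

def image_tokens(height, width, patch=14):
    """Tokens one ViT crop produces (floor division per axis)."""
    return (height // patch) * (width // patch)

def tiled_budget(height, width, tile=448, patch=14):
    """Dynamic-tiling budget: one low-resolution thumbnail for global
    context plus one full-resolution crop per tile."""
    tiles = math.ceil(height / tile) * math.ceil(width / tile)
    return image_tokens(224, 224, patch) + tiles * image_tokens(tile, tile, patch)

print(image_tokens(224, 224))    # 256, matching the figure in the text
print(tiled_budget(1024, 1024))  # 256 + 9 tiles * 1024 = 9472
```

The jump from 256 tokens to nearly 10,000 for one document page is why tiling strategy, not encoder quality, often dominates document-understanding cost and accuracy.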

Training approach 1: contrastive pretraining (CLIP)

CLIP, published by OpenAI in 2021, was the first widely adopted contrastive vision-language model and still underpins a surprising fraction of multimodal pipelines. The setup is simple. Collect hundreds of millions of image-caption pairs from the public web. For each training batch of N pairs, compute image embeddings through one tower and text embeddings through another. Compute an N×N similarity matrix. Apply a cross-entropy loss that pushes the diagonal (matched pairs) up and the off-diagonal (mismatched pairs) down.
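That symmetric contrastive loss can be sketched in pure Python with toy two-dimensional embeddings; production code uses a tensor library, learns the temperature, and runs at batch sizes in the tens of thousands:

```python
import math

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over the N x N similarity matrix: the diagonal
    (matched pairs) is the target class in both directions."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    I, T = [norm(v) for v in img_emb], [norm(v) for v in txt_emb]
    n = len(I)
    sim = [[sum(a * b for a, b in zip(I[i], T[j])) / temperature
            for j in range(n)] for i in range(n)]

    def xent(rows):  # cross-entropy with row index i as the true class
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            logsum = m + math.log(sum(math.exp(x - m) for x in row))
            total += logsum - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*sim)]  # text -> image direction
    return 0.5 * (xent(sim) + xent(cols))

matched = clip_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(matched < shuffled)  # True: aligned diagonals are rewarded
```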

The result is a shared embedding space where semantically related images and texts land near each other regardless of modality. CLIP’s killer feature is zero-shot classification: to label an image as “cat” or “dog”, encode both strings and the image, and pick whichever text embedding sits closer. No labelled classification data required. CLIP embeddings also became the default text conditioner for diffusion-based image generators including early Stable Diffusion, and they remain a common retrieval backbone for multimodal RAG systems.
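Zero-shot classification is then just a nearest-neighbor lookup in the shared space. The embeddings below are hypothetical stand-ins for real encoder outputs:

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def zero_shot_label(image_emb, label_embs):
    """Pick the label whose text embedding sits closest to the image."""
    return max(label_embs, key=lambda name: cosine(image_emb, label_embs[name]))

# Hypothetical 2-d embeddings standing in for encoder outputs
labels = {"a photo of a cat": [0.9, 0.1], "a photo of a dog": [0.1, 0.9]}
print(zero_shot_label([0.8, 0.2], labels))  # a photo of a cat
```

In practice the label strings are wrapped in prompt templates like "a photo of a {label}", which measurably improves accuracy over bare class names.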

Successors: SigLIP, EVA-CLIP, and data curation

CLIP’s contrastive loss has known pathologies — it’s sensitive to batch size and treats all negatives as equally wrong. Google’s SigLIP replaced the softmax with a pairwise sigmoid loss, making training more stable at large batch sizes and slightly improving transfer quality. Parallel work (EVA-CLIP, DFN, MetaCLIP) focused on data curation: filtering noisy web pairs, deduplicating, and rebalancing. The lesson from 2021-2025 is that dataset quality moves benchmark numbers more reliably than architectural tweaks.
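The softmax-to-sigmoid shift can be sketched as follows: each (i, j) cell becomes an independent binary decision, so the loss no longer depends on a batch-wide normalization. The temperature t and bias b are learned in the real model; they are fixed here purely for illustration:

```python
import math

def siglip_style_loss(sim, t=10.0, b=-10.0):
    """Pairwise sigmoid loss sketch: target +1 on the diagonal, -1 off it.
    Every cell is scored independently, unlike CLIP's softmax."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            logit = z * (sim[i][j] * t + b)
            total += math.log(1.0 + math.exp(-logit))  # -log sigmoid(logit)
    return total / n

aligned = siglip_style_loss([[1.0, 0.0], [0.0, 1.0]])
shuffled = siglip_style_loss([[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

Because no row-wise softmax couples the cells together, the loss behaves the same whether the batch holds a thousand pairs or a million, which is the stability property the text describes.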

Training approach 2: visual instruction tuning (LLaVA and kin)

Contrastive models are excellent at matching but terrible at open-ended questions. “Describe what’s unusual about this photograph” is not something a similarity score can answer. The dominant recipe for question-answering multimodal systems, introduced by LLaVA in 2023 and followed by InstructBLIP, Qwen-VL, InternVL, and many others, glues a vision encoder to a pretrained language model and fine-tunes on instruction data.

The mechanics: take a ViT (often CLIP’s image tower) and freeze it. Take a pretrained language model — for deeper context see the large language models overview — and keep it trainable or partially frozen. Insert a projection module — sometimes a single linear layer, sometimes an MLP, sometimes a cross-attention “resampler” as in Flamingo — that maps image-patch embeddings into the language model’s token-embedding space. Now the language model reads images as if they were exotic tokens.
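A minimal sketch of the simplest projector choice, a single linear layer. Dimensions are toy-sized here; an actual LLaVA-style stack might map 1024-dimensional ViT outputs into a 4096-dimensional LM embedding space, and the matrix is learned rather than random:

```python
import random

def make_projector(d_vision, d_model, seed=0):
    """One linear layer W mapping a patch embedding (size d_vision)
    into the LM's token-embedding space (size d_model)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 0.02) for _ in range(d_vision)]
         for _ in range(d_model)]
    def project(patch_vec):
        return [sum(w * x for w, x in zip(row, patch_vec)) for row in W]
    return project

proj = make_projector(d_vision=64, d_model=128)
image_token = proj([0.1] * 64)
print(len(image_token))  # 128: now the same width as a text embedding
```

Once projected, the image tokens are simply prepended or interleaved with the text tokens and the language model attends over both.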

Stage one pretrains only the projector on caption data so the language model learns to interpret the new token type. Stage two instruction-tunes on generated or human-curated visual question-answer pairs — “given this image, answer this question” — so the combined system learns to ground responses in pixels rather than ignoring them. LLaVA’s original instruction data was itself generated by prompting GPT-4 with image captions and bounding-box metadata, a bootstrapping trick that has been reused across nearly every open VLM since.
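The two-stage schedule reduces to a freeze mask over three modules. The module names below are illustrative, not any specific library's API:

```python
def trainable_mask(stage):
    """Stage 1: only the projector updates. Stage 2: projector and
    language model update; the vision tower stays frozen throughout."""
    modules = ("vision_encoder", "projector", "language_model")
    if stage == 1:
        return {m: m == "projector" for m in modules}
    return {m: m != "vision_encoder" for m in modules}

print(trainable_mask(1))  # only 'projector' is True
print(trainable_mask(2))  # 'projector' and 'language_model' are True
```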

Training approach 3: native multimodality

By 2024 the frontier labs moved to end-to-end training across all modalities simultaneously. GPT-4o (the “o” for “omni”) processes text, audio, and vision through a single model rather than piping audio through a separate speech recognizer and image through a separate vision adapter. Gemini was described by Google at launch as “natively multimodal” — pretrained on interleaved image, video, audio, and text from the start. Meta’s Chameleon paper pushed the same approach to an early-fusion transformer that tokenizes images and text into a shared vocabulary.

The practical payoff of native training is most visible in latency and tone. GPT-4o can respond to speech in roughly 320 milliseconds because there is no ASR-then-LLM-then-TTS pipeline; the model sees audio tokens directly. It can also preserve paralinguistic features — laughter, sighs, tone of voice — that a transcription-first pipeline throws away. The cost is engineering complexity and the fact that native-multimodal models are much harder to open-source because the training corpora are harder to assemble and the model is harder to ablate.

The model landscape in 2026

The current practitioner options, grouped by how open they are: closed frontier models (GPT-4o and 4.1, Claude Opus vision, Gemini 2.x) lead most benchmarks and offer production APIs. Open-weight strong contenders (Alibaba's Qwen2-VL and Qwen2.5-VL, successors to the original Qwen-VL; InternVL; Molmo; Mistral's Pixtral; Llama 3.2 Vision) have closed much of the quality gap and are deployable on-prem. Contrastive backbones for retrieval and grounding (CLIP, SigLIP, EVA-CLIP) remain the dominant choice for embedding-space applications. Choosing between them is usually a question of whether the workload is retrieval, grounded VQA, or open-ended reasoning, and whether self-hosting is a requirement.

Cross-modal attention and fusion

Three fusion patterns dominate. Early fusion interleaves image and text tokens in the same sequence, letting standard self-attention mix them at every layer — Chameleon and Fuyu take this route, and it’s what “native multimodal” usually means in practice. Late fusion encodes each modality separately and combines only at the final similarity or classification step — CLIP is the canonical example. Cross-attention fusion keeps modalities separate but lets one attend to the other through dedicated cross-attention layers — Flamingo, BLIP-2, and some LLaVA variants use this pattern because it is parameter-efficient and preserves a frozen language backbone.
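Early fusion is, mechanically, sequence splicing. A token-level sketch with placeholder strings standing in for real embeddings:

```python
def early_fuse(text_tokens, image_tokens, insert_at):
    """Splice projected image tokens into the text sequence so standard
    self-attention mixes both modalities at every layer."""
    return text_tokens[:insert_at] + image_tokens + text_tokens[insert_at:]

text = ["What", "is", "in", "this", "image", "?"]
fused = early_fuse(text, ["<img0>", "<img1>", "<img2>"], insert_at=5)
print(fused)
```

Late fusion never builds such a mixed sequence (each tower runs to completion alone), and cross-attention fusion keeps two sequences but adds layers where text queries attend to image keys and values.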

Which pattern wins depends on the task. For grounded VQA and OCR-heavy document work, early fusion with high-resolution tiling produces the highest accuracy. For retrieval and zero-shot classification, late-fusion contrastive models remain state of the art. For resource-constrained fine-tuning on top of large frozen LLMs, cross-attention adapters offer the best quality-per-compute.

Applications that actually ship

Document understanding is the largest enterprise deployment today. Multimodal models read PDFs, scanned invoices, forms, and handwritten notes in one shot — extracting fields, answering natural-language questions about tables, and routing by content. Claude, GPT-4o, and Gemini all handle multi-page documents directly without a separate OCR pipeline, though OCR-plus-LLM hybrids still win on pure text accuracy.

Accessibility tools use VLMs to describe images for blind users, read signs and menus through a phone camera, and generate alt text at scale. Be My Eyes’ integration with GPT-4V in 2023 was an early demonstration; the capability is now commoditized. UI agents — systems that control a web browser or a desktop by looking at screenshots and issuing clicks — depend on VLMs that can ground coordinates in pixels; Anthropic’s computer-use and OpenAI’s Operator both rest on this stack.

Robotics and perception use multimodal representations as a semantic layer on top of geometric sensing. Google’s RT-2 and its successors fold vision-language pretraining into robot policies so the robot can generalize from “pick up the apple” to “pick up the thing you’d eat for breakfast”. Visual question answering remains the canonical benchmark but real products rarely look like a benchmark; they look like a customer service flow that happens to accept screenshots, or an industrial inspection tool that takes images and structured queries together.

Open problems

Three weaknesses show up consistently across systems. Fine-grained spatial reasoning — counting objects, comparing sizes, reading clock hands — remains well below human performance even on frontier models. Hallucinations specific to vision (“the image shows a dog” when it doesn’t) are hard to eliminate because grounding signals during training are noisy. And efficiency is genuinely hard: a 2K-by-2K screenshot is expensive to process at every turn of an agent loop, and the cost scales badly with context length. Expect the next round of progress to come from better tiling, sparse attention over image tokens, and joint audio-video training rather than from dramatic new architectures.

Frequently asked questions

What is the difference between a vision-language model and a multimodal model?
A vision-language model (VLM) handles images and text. A multimodal model handles two or more modalities, which may include audio, video, depth, or sensor streams in addition to images and text. In common usage the terms are often swapped because vision-language is by far the most deployed multimodal combination. GPT-4o, Gemini, and Claude go further by adding audio as a first-class input and in some cases output, which makes them multimodal in the stricter sense. For most 2026 applications the practical distinction is whether audio is handled natively or through a separate pipeline.

Do multimodal models actually “see” or just pattern-match on captions?
Both, and the balance depends on training. Pure contrastive models like CLIP operate almost entirely on caption-level patterns — they match image distributions to text distributions and can fail on anything under-represented in web captions. Instruction-tuned VLMs develop more fine-grained grounding because their training data forces answers to specific spatial and compositional questions. Native multimodal models trained end-to-end tend to ground better still, though every current system can be fooled by adversarial images or unusual compositions. Calling it “seeing” in the human sense overstates the case; calling it “just pattern-matching” understates what a well-trained VLM can do on OCR, chart reading, and spatial QA.

Which multimodal model should a team evaluate first for a new product?
For a closed-API production system where quality-per-dollar matters, start with GPT-4o and Claude 3.5 Sonnet or Opus vision, benchmarking both on your actual data rather than on public leaderboards. For self-hosted deployment, Qwen2.5-VL and Llama 3.2 Vision are the strongest open-weight starting points in 2026, with InternVL and Molmo as alternatives. For retrieval or embedding-based applications, SigLIP or a recent CLIP variant is usually the right choice. The most common mistake is picking a model by benchmark rather than by a realistic eval set from the target domain — document AI, screenshot grounding, and chart reasoning all reward different models.

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.