What Are Foundation Models? The Base of Modern AI Explained

A foundation model is a single large model pretrained on broad data at scale and then adapted to a wide range of downstream tasks rather than built for one job.
The term was coined in 2021 by Stanford’s Center for Research on Foundation Models (CRFM) in a 200-plus-page report led by Rishi Bommasani and Percy Liang with over 100 co-authors.
Two ideas power them: large-scale self-supervised pretraining that learns general structure from unlabeled data, and cheap adaptation via fine-tuning or prompting.
Foundation models span modalities — text (GPT, Claude, Llama), images (CLIP, Stable Diffusion), audio, code, and increasingly multimodal systems that handle several at once.
Their biggest systemic risk is homogenization: when thousands of applications inherit the strengths, biases, and failure modes of a handful of base models, defects propagate everywhere downstream.

What a foundation model actually is

A foundation model is a large machine-learning model trained on a broad sweep of data, usually with self-supervision, that serves as a reusable base for many different tasks. Instead of training a fresh model per problem, you train one general model once and adapt it cheaply. Stanford’s CRFM defined the term in 2021 to name this emerging paradigm shift, choosing “foundation” deliberately to signal both centrality and incompleteness.

In the report On the Opportunities and Risks of Foundation Models, the authors wrote that a foundation model is “any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks.” They were careful about the metaphor: a foundation is load-bearing but unfinished. You build on it; you do not live in it. Most well-known systems, including large language models, image generators, and code assistants, are foundation models or are built directly on top of one.

Why a new word was needed

The CRFM team argued that older labels like “large language model” were too narrow because the same recipe was spreading to vision, audio, code, and robotics. According to the 2021 report, the shift is defined by emergence (capabilities appear that were not explicitly designed) and homogenization (a few base models underpin nearly everything). “Foundation model” names the shared base regardless of modality.

Pretraining: learning structure from raw data

Pretraining is the stage where a foundation model learns general patterns from enormous amounts of unlabeled data, before it is ever pointed at a specific task. The dominant technique is self-supervised learning: the model is given part of an input and asked to predict the rest — the next word in a sentence, a masked image patch, or a missing audio segment. No human labels are required, which is what lets training scale to trillions of tokens.
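
To make the "no human labels" point concrete, here is a minimal sketch of how next-word prediction manufactures its own training targets from raw text. The function name next_token_pairs, the whitespace tokenizer, and the context_size parameter are illustrative assumptions; real systems use subword tokenizers and far longer contexts.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, target) pairs without any human labels."""
    # Toy whitespace tokenizer (an assumption); real models use subword tokenizers.
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        # The context is the preceding tokens; the target is simply the next token.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Every target here comes from the text itself, so the amount of training signal grows with the amount of raw data rather than with annotation effort.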

This scale is the whole point. Modern foundation models are typically built on the transformer architecture, which parallelizes well across thousands of accelerators and handles long-range dependencies in sequences. Training a frontier model can cost millions of dollars in compute and consume weeks of GPU time, which is why only a handful of organizations train them from scratch while everyone else adapts the results.

Scaling laws and the data question

Research from DeepMind’s 2022 Chinchilla paper (Hoffmann et al.) showed that many large models were undertrained — that for a fixed compute budget, model size and training-data size should grow roughly in proportion. The practical lesson was that a smaller model trained on far more data can beat a larger model trained on less. Chinchilla, at 70 billion parameters, outperformed the 280-billion-parameter Gopher by training on about four times more data, reshaping how labs allocate compute.
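
The "grow in proportion" idea can be sketched as arithmetic. The example below rests on two assumptions that are not stated in this article: the widely used approximation that training compute is about 6 × parameters × tokens (in FLOPs), and the rule of thumb of roughly 20 training tokens per parameter that is commonly associated with the Chinchilla result. The 1.4 trillion token figure used for Chinchilla's training set is likewise brought in from the paper, not from the text above.

```python
import math

# Both constants are assumptions, labeled as such (see the paragraph above).
FLOPS_PER_PARAM_PER_TOKEN = 6   # approximation: compute ≈ 6 * params * tokens
TOKENS_PER_PARAM = 20           # rule of thumb: about 20 tokens per parameter


def compute_optimal_split(compute_flops):
    """Return (parameters, training tokens) for a compute budget under the assumptions above."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Approximate budget of a 70B-parameter model trained on 1.4T tokens.
budget = FLOPS_PER_PARAM_PER_TOKEN * 70e9 * 1.4e12
params, tokens = compute_optimal_split(budget)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")

# Quadrupling the budget under the same assumptions.
bigger_params, bigger_tokens = compute_optimal_split(4 * budget)
print(f"{bigger_params / params:.1f}x parameters, {bigger_tokens / tokens:.1f}x tokens")
```

Under these assumptions, quadrupling compute doubles both the parameter count and the token count, which is the proportional growth the paper argued for. Treat the constants as rough guides, not exact laws.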

Emergence: capabilities nobody trained for

Emergence describes abilities that appear in large models but are absent in smaller ones trained the same way — they cannot be predicted by extrapolating from small-scale performance. Wei et al. argued in their 2022 paper Emergent Abilities of Large Language Models that skills like multi-step arithmetic, instruction following, and in-context learning switch on sharply once a model crosses a scale threshold.

The most striking example is in-context learning: you show a model a few examples inside the prompt and it performs the task without any weight updates. The model was never explicitly trained to learn from examples in its context window, yet the behavior emerges. Whether emergence is a genuine phase change or partly an artifact of how performance is measured remains actively debated, but the practical effect is real — bigger models can do things smaller ones simply cannot.
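
In practice, in-context learning is mostly a matter of how the prompt is assembled. The sketch below builds a few-shot prompt as plain text; the function name build_few_shot_prompt and the "Input:/Output:" layout are illustrative assumptions, not a format any particular model requires, and no model is called here.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt: optional instruction, worked examples, then the new input."""
    blocks = []
    if instruction:
        blocks.append(instruction)
    for text, label in examples:
        blocks.append(f"Input: {text}\nOutput: {label}")
    # The final block leaves the output blank for the model to complete.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


examples = [
    ("The battery died after an hour.", "negative"),
    ("Setup took two minutes and it just works.", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    "The screen is gorgeous.",
    instruction="Classify the sentiment of each review.",
)
print(prompt)
```

The model's weights never change. The examples live only in the prompt, and the task is inferred from them at inference time.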

Adaptation: turning one base into many tools

Adaptation is how a single pretrained foundation model is specialized for a particular task or domain. There are two broad families. Fine-tuning continues training the model’s weights on a smaller, task-specific dataset. Prompting leaves the weights frozen and instead shapes behavior through carefully written instructions and in-context examples. Both are far cheaper than pretraining from scratch.

Fine-tuning vs prompting

Fine-tuning gives the deepest specialization and often the best performance on narrow tasks, but it produces a separate copy of the model and needs labeled data and compute. Parameter-efficient methods like LoRA reduce that cost by training only small adapter matrices. Prompting and retrieval-augmented generation (RAG) keep one shared model and steer it at inference time, which is far more flexible, though usually less precise on highly specialized tasks. The trade-off between these approaches is covered in our guide to fine-tuning vs. RAG for customizing an LLM.
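
To see why LoRA-style adapters are cheap, it helps to count parameters. The sketch below is a simplified, pure-Python illustration of the low-rank idea (a frozen weight matrix W plus a trainable product B·A scaled by alpha / rank). The layer size, rank, and function names are assumptions chosen for illustration, and this is not the API of any LoRA library.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable parameters: full fine-tuning of one matrix vs. a low-rank adapter."""
    full = d_out * d_in               # every entry of W is trainable
    adapter = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, adapter


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")

# With B initialized to zeros, the adapted layer starts out identical to the base layer.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
B_zero = [[0], [0]]
x = [2, 3]
print(lora_forward(W, A, B_zero, x, alpha=1, rank=1))
```

For this hypothetical 4096 × 4096 layer at rank 8, the adapter holds well under one percent of the parameters of the full matrix, which is why many task-specific adapters can share one frozen base model.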

Alignment as adaptation

A crucial adaptation step is alignment — making a raw pretrained model helpful and safe to talk to. Techniques like reinforcement learning from human feedback (RLHF) and instruction tuning turn a next-token predictor into an assistant that follows directions and refuses harmful requests. The base model supplies raw capability; alignment supplies usability.
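
One ingredient of RLHF as it is commonly described in published work is a reward model trained on human preference pairs: given two responses, it should score the one people preferred more highly. The sketch below shows only that pairwise loss, -log(sigmoid(r_chosen - r_rejected)), on hand-written scores. It is an illustration of the idea under those assumptions, not the full RLHF pipeline or any lab's actual implementation.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(margin)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for a preferred and a rejected response.
print(preference_loss(2.0, 0.0))   # preferred response scored higher: small loss
print(preference_loss(0.0, 2.0))   # preferred response scored lower: large loss
print(preference_loss(0.0, 0.0))   # no preference expressed: log(2)
```

Minimizing this loss pushes the reward model to rank preferred responses above rejected ones; the trained reward model is then used as the signal for further tuning the assistant.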

Foundation models across modalities

Foundation models are no longer just about text. The same pretrain-then-adapt recipe now spans images, audio, video, code, and combinations of all of these. OpenAI’s CLIP learned a shared text-image space; diffusion models like Stable Diffusion generate images from prompts; Whisper handles speech; and code models like Codex power programming assistants. Each is a base others build on.
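
A shared text-image space is what lets a CLIP-style model classify images it was never explicitly trained to label: embed the image, embed a caption for each candidate label, and pick the closest caption. The sketch below shows that comparison step with cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and no actual model is loaded.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_label(image_vec, text_vecs):
    """Return the caption whose embedding is most similar to the image embedding."""
    return max(text_vecs, key=lambda caption: cosine(image_vec, text_vecs[caption]))


# Hypothetical embeddings; a real system would get these from trained encoders.
text_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_vec = [0.8, 0.3, 0.1]

label = zero_shot_label(image_vec, text_vecs)
print(label)
```

Changing the set of labels only requires writing new captions, which is one reason a model like this works well as a base for others to build on.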

The frontier is multimodal: single models that accept and produce several modalities at once, such as text plus images plus audio. These systems can describe a photo, read a chart, or transcribe speech within one network. Adaptation also runs in the other direction: large base models are compressed into compact, efficient versions, including small on-device models that bring foundation-model capability to phones and laptops.

The homogenization risk

Homogenization is the central systemic risk the Stanford CRFM report flagged: because so many applications are built on a small number of base models, any flaw in a base model — a bias, a security weakness, a blind spot — is inherited by everything downstream. The report warned that this concentration creates “single points of failure” across society. A defect that would be isolated in a bespoke system instead propagates to thousands of products at once.

This concentration is also economic and political. Training frontier foundation models requires capital, data, and compute that few organizations possess, which centralizes influence over a general-purpose technology in a handful of labs. The CRFM authors framed foundation models as a chance to study and govern this concentration deliberately rather than let it form by default. The same homogenization that makes the technology efficient is what makes its failures correlated.

Frequently asked questions

Is a large language model the same as a foundation model?
Not exactly — a large language model is one type of foundation model, specifically a text-based one. Foundation model is the broader category that also covers image, audio, video, and multimodal systems built with the same pretrain-then-adapt recipe. Every modern LLM is a foundation model, but not every foundation model is a language model. The term was coined precisely to capture what these different systems share.

Who came up with the term “foundation model”?
It was introduced in 2021 by Stanford’s Center for Research on Foundation Models, part of the Stanford Institute for Human-Centered AI, in a report led by Rishi Bommasani and Percy Liang with more than 100 co-authors. They argued that existing terms were too narrow for a paradigm spreading across modalities. The word “foundation” was chosen to convey that these models are a central base to build on, yet deliberately incomplete on their own.

Why not just train a custom model for each task?
Because pretraining a capable model from scratch is enormously expensive: frontier models can cost millions of dollars in compute and take weeks to train. Foundation models amortize that cost: you pay for pretraining once, then adapt cheaply through fine-tuning or prompting for many tasks. A small team can build a strong product on top of a base model without the data or compute to train one. The trade-off is dependence on the base model’s quality, biases, and availability.