Key takeaways
- The transformer is a neural-network architecture introduced in 2017 by Google researchers in the paper “Attention Is All You Need”.
- Its core innovation is self-attention: every token in a sequence can directly attend to every other token, regardless of distance.
- Transformers replaced recurrent networks (RNNs, LSTMs) as the dominant architecture for language because they parallelize well on GPUs and capture long-range dependencies more effectively.
- Every major large language model — GPT-5, Claude, Gemini, Llama, Mistral — is a transformer, varying in scale, training data, and fine-tuning rather than architecture.
- A transformer is built from repeating blocks of multi-head attention plus feedforward networks, glued together with residual connections and layer normalization.
Why the transformer replaced everything
Before 2017, language models were dominated by recurrent neural networks (RNNs) and their variants like LSTMs and GRUs. These process sequences one token at a time, carrying a hidden state forward. This worked, but it had two fatal limits: training could not be parallelized across a sequence (you had to wait for step N to compute step N+1), and long-range dependencies degraded — the model would “forget” context from earlier in the input.

The 2017 paper from Google’s research group — Vaswani et al., “Attention Is All You Need” — proposed replacing recurrence entirely with attention. The title is a mission statement: the authors argued that the recurrent machinery was unnecessary if you had a powerful enough attention mechanism. They were right. Within three years, transformers dominated machine translation, then language modelling, then image generation, then protein folding. Background on this turning point is in our attention is all you need coverage.
Self-attention in plain language
Imagine reading the sentence “The bank raised interest rates this quarter”. To understand “bank”, you need to connect it with “interest rates” — not with the word next to it, but with a word further away. Self-attention is the mechanism that lets the network do exactly that.
For every token in the input, self-attention computes a weighted combination of every other token, where the weights reflect how relevant each other token is. The machinery uses three learned projections of each token — called query, key, and value — and computes attention scores as dot products between queries and keys, scaled by the square root of the key dimension and softmaxed across the sequence. The weighted sum of values becomes the token’s new representation.
Crucially, all these computations for all tokens happen in parallel. That is what makes transformers fast to train on GPUs — you can shovel a whole sequence into a matrix multiplication instead of stepping through it serially.
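The mechanism described above can be sketched in a few lines. This is a toy, single-head version in NumPy — no masking, batching, or learned scale parameters, and all variable names are illustrative, not from any real library:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token representations.
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices.
    """
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers
    v = x @ w_v  # values: what each token contributes
    d_k = q.shape[-1]
    # Score every token against every other token, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # Softmax across the sequence turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a relevance-weighted mix of all value vectors.
    return weights @ v
```

Note that the entire computation is three matrix multiplications plus a softmax — which is exactly why it parallelizes so well on GPUs.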
Multi-head attention
A single attention head learns one way to weight tokens. But language has many kinds of relationships — syntactic, semantic, coreference, discourse. So transformers use multi-head attention: several attention heads run in parallel, each with its own learned query/key/value projections. Each head specializes in different patterns, and the outputs are concatenated and mixed. A typical large model uses 32 to 128 heads per layer.
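Extending the single-head mechanism to multiple heads is mostly bookkeeping: run each head with its own projections, concatenate, and mix. A minimal NumPy sketch, with illustrative toy dimensions and names:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, w_out):
    """x: (seq_len, d_model).
    heads: list of (w_q, w_k, w_v) triples, one per head.
    w_out: (n_heads * d_k, d_model) output mixing matrix.
    """
    per_head = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        per_head.append(softmax(scores) @ v)  # each head attends independently
    # Concatenate head outputs and project back to model width.
    return np.concatenate(per_head, axis=-1) @ w_out
```

In production implementations the per-head loop is fused into a single batched matrix multiplication, but the logic is the same.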
The full transformer block
A transformer is a stack of identical blocks. Each block contains, in order: multi-head self-attention, a residual connection and layer normalization, a feedforward network (a two-layer fully-connected network applied position-wise), and another residual connection and layer norm. Most modern models apply the layer norm before each sub-layer rather than after, but the ingredients are the same. Stack 32, 64, or 96 such blocks and you have GPT, Claude, or Gemini.
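Putting the pieces together, one block can be sketched as below — NumPy, a single attention head for brevity, the post-norm ordering of the original paper, and illustrative weight names:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(x, w_q, w_k, w_v, w1, b1, w2, b2):
    """x: (seq_len, d); w_q/w_k/w_v: (d, d); w1: (d, 4d), w2: (4d, d)."""
    # Sub-layer 1: self-attention, then residual add and layer norm.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = layer_norm(x + attn)
    # Sub-layer 2: position-wise feedforward (two layers with ReLU),
    # again followed by a residual add and layer norm.
    ff = np.maximum(x @ w1 + b1, 0) @ w2 + b2
    return layer_norm(x + ff)
```

A full model is little more than this function applied dozens of times in sequence, each call with its own weights.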
Positional encoding
Attention by itself is order-agnostic — swap two tokens and the outputs are the same. But “dog bites man” and “man bites dog” should mean different things. Transformers solve this with positional encoding: each token gets additional information about its position, either added to the input (sinusoidal or learned embeddings) or applied during attention itself (rotary position embedding, ALiBi, and newer variants).
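The original sinusoidal scheme is simple enough to show in full — each position gets a vector of sines and cosines at geometrically spaced wavelengths, so nearby positions get similar vectors and any offset is a learnable pattern. A NumPy sketch (assumes an even model dimension):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding from the 2017 paper.

    Even dimensions get sin, odd dimensions get cos, with wavelengths
    spaced geometrically from 2*pi up to 10000*2*pi.
    """
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))      # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to the token embeddings before the first block
```

Rotary embeddings and ALiBi work differently — they modify the attention computation itself rather than the inputs — but serve the same purpose: breaking the order-blindness of raw attention.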
Layer normalization and residuals
Two tricks make deep transformers trainable. Residual connections add the input of each sub-layer to its output, so gradients can flow backward through 96 layers without vanishing. Layer normalization rescales activations at each layer to keep them in a stable range. Neither is the “interesting” part of the architecture, but without them, training a deep transformer does not converge.
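Both tricks amount to a small wrapper applied around every sub-layer. A minimal post-norm sketch in NumPy (omitting the learned scale and shift parameters real layer norm carries):

```python
import numpy as np

def sublayer(x, f):
    """Wrap sub-layer f with a residual connection and layer norm."""
    y = x + f(x)  # residual: gradients can flow around f, not just through it
    mu = y.mean(axis=-1, keepdims=True)
    sd = y.std(axis=-1, keepdims=True)
    return (y - mu) / (sd + 1e-5)  # layer norm: stable per-position scale
```

Because of the residual term, even if `f` contributes nothing useful early in training, the block still passes its input through — which is what lets optimization get started in a 96-layer stack.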
Encoder, decoder, or both
The original 2017 paper described an encoder-decoder transformer for machine translation — one stack reads the source language, another stack produces the target. Subsequent models specialize:
- Encoder-only (BERT and descendants): produce representations of input text for classification, search, and retrieval.
- Decoder-only (GPT family, Claude, Gemini in its core): autoregressive language modelling — given a prefix, predict the next token. This is the dominant pattern for large language models today.
- Encoder-decoder (T5, original translation models): still used for some sequence-to-sequence tasks.
For the broader LLM picture, see our large language models explainer.
Why scale mattered
The transformer architecture is deceptively simple. The papers that changed AI from 2018 onward did not invent fundamentally new architectures — they scaled the same transformer from millions of parameters to billions, and eventually hundreds of billions. OpenAI’s GPT-3 paper in 2020 showed that with enough scale, transformers developed emergent abilities — few-shot learning, basic reasoning — that smaller versions simply lacked.
As a result, frontier AI research for the last five years has been as much an engineering story as a scientific one: building the data pipelines, distributed training infrastructure, and GPU supply chains needed to train ever-larger transformers. The architecture itself is remarkably stable. For a foundational view of the neural networks behind transformers, see our neural networks primer.
Known limits
Transformers are expensive. Naive self-attention scales as O(n²) in sequence length — double the context window, quadruple the compute. This is why long-context models use techniques like sliding-window attention, sparse attention, or linear-attention variants.
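A back-of-envelope count makes the quadratic term concrete. The numbers below are illustrative, not any real model’s FLOP budget — they count only the multiply-adds needed to form the attention-score matrix:

```python
def attention_score_flops(seq_len, d_k):
    """Multiply-adds to build the (seq_len x seq_len) score matrix.

    This n-squared term is what "double the context, quadruple the
    compute" refers to; the rest of the block scales linearly in n.
    """
    return 2 * seq_len ** 2 * d_k

# Doubling the sequence length quadruples the score-matrix cost:
ratio = attention_score_flops(8192, 128) / attention_score_flops(4096, 128)
# ratio == 4.0
```

Sliding-window and sparse attention attack exactly this term by having each token attend to a subset of positions instead of all of them.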
They are also data-hungry. Transformers shine when trained on billions of tokens; they are often outperformed by simpler models when data is scarce. And they inherit every flaw of their training data — biases, inaccuracies, and gaps.
Frequently asked questions
Is GPT a transformer?
Yes. The “T” in GPT stands for Transformer. GPT (Generative Pre-trained Transformer) is a decoder-only transformer trained to predict the next token in a sequence. Every version from GPT-1 through GPT-5 follows the same basic architecture, differing mainly in size, training data, and post-training techniques like reinforcement learning from human feedback. The same is true of Claude, Gemini, Llama, and nearly every other large language model you will encounter.
Why is the transformer called “Attention Is All You Need”?
The paper’s title was a deliberate provocation. Before 2017, attention was used as an add-on inside recurrent networks to help them handle long sequences. The paper demonstrated that attention alone, without recurrence or convolution, was sufficient to set a new state-of-the-art in machine translation. The title signalled that the complex recurrent machinery was not needed — attention was not a feature, it was the architecture.
Are there any alternatives to transformers?
Several research directions aim to replace or complement transformers. State-space models like Mamba offer linear-time sequence processing. Mixture-of-experts (MoE) architectures are technically still transformers but with sparse activation to save compute. Retentive networks, linear transformers, and newer attention variants each offer tradeoffs. So far none have dethroned the standard transformer at frontier scale, but the research community remains active.