Key takeaways
- Training an AI model is the process of adjusting its parameters so its predictions match desired outputs on labelled or structured data.
- The workflow has five recurring stages: data preparation, model selection, training loop, evaluation, and deployment.
- The training loop itself is a tight cycle of making predictions, measuring errors, and updating weights using gradient descent.
- Guard rails like train/validation/test splits and regularization exist to detect and prevent overfitting — when a model memorizes training data instead of learning patterns.
- Most production AI today is built by fine-tuning a pre-trained base model, not by training from scratch — a technique called transfer learning.
Stage 1: data preparation
Training an AI model starts with data, and data is almost always the hardest part of the project. Raw data arrives messy, inconsistent, and biased. Before anything else, engineers clean it: remove duplicates, fix encoding issues, handle missing values, normalize formats. For supervised learning, they also need labels: for each training example, the correct answer the model should learn to produce.

Labelling is often outsourced to human annotators. For large-scale datasets, labelling alone can take months and cost more than model development. Automated labelling, weak supervision, and self-supervised learning (training on implicit labels derived from the data itself) have all grown in popularity as workarounds.
Before training, the dataset is split into three subsets. Training data is what the model learns from. Validation data is used during training to tune hyperparameters and detect overfitting. Test data is held back until the very end to give an unbiased estimate of real-world performance. A typical split is 80/10/10 or 70/15/15.
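The split itself is mechanically simple; what matters is that it happens before any tuning. A minimal pure-Python sketch of an 80/10/10 split (the function name and fractions are illustrative, not from any particular library):

```python
import random

def three_way_split(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split a dataset into train/validation/test subsets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, roughly 10% here
    return train, val, test

train, val, test = three_way_split(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```

Real pipelines often stratify the split so that class proportions match across the three subsets, but the principle is the same: the test set is set aside once and never used for tuning.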
Stage 2: model selection
Next, choose the model architecture. For tabular data, tree-based models (XGBoost, LightGBM) are often the strongest choice. For images, convolutional neural networks or vision transformers. For text, transformers. For sequences with strict temporal ordering, recurrent networks may still apply. See our neural networks primer for the underlying theory.
In practice, most teams do not start from zero. They pick a pre-trained base model — a network already trained on a large, generic dataset — and fine-tune it on their specific task. This is transfer learning, and it dramatically reduces data requirements and training time. A vision classifier fine-tuned from ImageNet-pretrained ResNet or from a CLIP backbone often matches the accuracy of a from-scratch model with a fraction of the labels.
Stage 3: the training loop
This is where the model actually learns. The loop repeats three steps for every batch of training examples.
Forward pass
The batch of inputs flows through the model. The current weights determine the model’s prediction. Early in training these predictions are essentially random.
Loss computation
A loss function measures how wrong the predictions are compared to the labels. Cross-entropy for classification, mean squared error for regression, and a zoo of specialized losses for ranking, sequence generation, and structured prediction. The loss is a single number per batch.
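As a concrete illustration, the two most common losses fit in a few lines of plain Python (these helper names are ours, not from any framework):

```python
import math

def mse(preds, targets):
    """Mean squared error: the regression loss averaged over a batch."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class for one classification
    example. `probs` is the model's predicted distribution over classes."""
    return -math.log(probs[label])

# A confident correct prediction gives a small loss...
low = cross_entropy([0.05, 0.90, 0.05], 1)
# ...a confident wrong one gives a large loss.
high = cross_entropy([0.90, 0.05, 0.05], 1)
```

Frameworks compute these in a numerically stabler way (e.g. from raw logits rather than probabilities), but the quantity being minimized is the same.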
Backward pass and weight update
Using calculus (automatic differentiation in modern frameworks like PyTorch and TensorFlow), the system computes the gradient of the loss with respect to every weight. An optimizer — typically Adam or a variant — uses the gradients to nudge each weight in the direction that reduces the loss. The size of the nudge is controlled by the learning rate, one of the most important hyperparameters.
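To make the three steps concrete, here is a hand-rolled sketch that fits a one-feature linear model y = w*x + b by gradient descent. The gradients are derived by hand for this toy case; PyTorch or TensorFlow would compute them via automatic differentiation, and a real optimizer like Adam would adapt the step size per weight:

```python
# Ground truth is y = 2x + 1; the model must recover w=2, b=1 from data.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
w, b = 0.0, 0.0
lr = 0.01                                # learning rate

for step in range(2000):
    # Forward pass: predictions under the current weights.
    preds = [w * x + b for x, _ in data]
    # Loss computation: per-example errors for mean squared error.
    errors = [p - y for p, (_, y) in zip(preds, data)]
    # Backward pass: dL/dw and dL/db for the MSE loss.
    grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, data)) / len(data)
    grad_b = 2 * sum(errors) / len(data)
    # Weight update: nudge each weight against its gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))  # approaches 2.0 and 1.0
```

Every neural network training loop is this pattern, scaled up: millions of weights instead of two, and gradients computed automatically instead of by hand.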
Epochs and batches
The training data is usually too large to process in one pass. It is broken into batches (say, 64 or 256 examples each), and each pass through the entire dataset is called an epoch. A large model might train for anywhere from a few epochs to hundreds, depending on dataset size and task difficulty.
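A minimal batching helper might look like this (illustrative, not from any library); each pass of the outer loop is one epoch:

```python
import random

def batches(dataset, batch_size, seed):
    """Yield shuffled mini-batches; one full pass over the dataset = one epoch."""
    rng = random.Random(seed)
    order = list(range(len(dataset)))
    rng.shuffle(order)                  # fresh shuffle each epoch
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]

dataset = list(range(1000))
for epoch in range(3):                  # three epochs over the data
    for batch in batches(dataset, batch_size=64, seed=epoch):
        pass  # forward pass, loss, backward pass, and update go here
```

Reshuffling each epoch matters: it decorrelates consecutive batches so the gradient estimates stay noisy in a useful way rather than repeating the same order.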
Stage 4: evaluation
After each epoch, the model is evaluated on the validation set. The metric depends on the task — accuracy for balanced classification, F1 for imbalanced, BLEU or ROUGE for translation and summarization, AUC for ranking, and so on. Tracking validation metrics over epochs is how engineers spot overfitting: training loss keeps dropping while validation loss starts rising. This means the model is memorizing training data rather than generalizing.
Regularization techniques combat overfitting: dropout randomly zeroes parts of the network during training, weight decay penalizes large weights, data augmentation creates synthetic variants of training examples. Early stopping — halting training when validation loss stops improving — is another widely used safeguard.
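Early stopping is easy to sketch in code. The helper below is a simplification (real trainers also checkpoint weights at the best epoch and restore them); it stops once validation loss has failed to improve for `patience` consecutive epochs:

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch whose checkpoint should be kept: the last epoch
    that improved validation loss before `patience` epochs of no progress."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch            # roll back to the best checkpoint
    return best_epoch

# Validation loss improves, then rises as overfitting sets in:
print(early_stopping([1.0, 0.8, 0.7, 0.72, 0.75, 0.9]))  # → 2
```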
Only once validation-based tuning is complete is the test set touched. The test-set metric is the number that goes in the report or paper. Repeatedly tuning on the test set (overfitting to the test set, a form of evaluation contamination) invalidates the whole evaluation, and it is a surprisingly common mistake even in published research.
Stage 5: deployment
A trained model sitting on a researcher's laptop is not yet useful. Deployment means packaging the model so it can serve predictions in production: typically behind an API, in a batch job, or on an edge device. For large models, this often involves quantization (compressing weights to smaller numeric types), distillation (training a smaller model to mimic a larger one), and careful runtime engineering to meet latency and throughput requirements. See our MLOps coverage for the operational side.
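As an illustration of quantization, symmetric linear int8 quantization can be sketched in a few lines. This is a simplification: production toolchains use per-channel scales, calibration data, and sometimes quantization-aware training.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of float weights to int8.
    Each weight is stored as round(w / scale); the scale maps max |w| to 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return [v * scale for v in q]

w = [0.31, -1.27, 0.05, 0.9]
q, s = quantize_int8(w)
approx = dequantize(q, s)  # close to w, at a quarter of float32 storage
```

The reconstruction error per weight is bounded by half the scale, which is why quantization usually costs little accuracy while cutting memory and bandwidth by 4x relative to float32.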
Deployment is not the end. Production models need monitoring for data drift (the real-world input distribution shifts away from training data), performance degradation, and edge-case failures. Most mature ML teams have some form of continuous retraining pipeline.
Pre-training vs. fine-tuning
For large foundation models (LLMs, vision transformers, multimodal models) there are two distinct training phases. Pre-training happens once, using self-supervised learning on massive corpora (trillions of text tokens, billions of images). This is extraordinarily expensive: the largest frontier models cost tens to hundreds of millions of dollars to pre-train. Fine-tuning adapts the pre-trained model to a specific task or domain with a much smaller, task-specific dataset, and can often be done for hundreds to low thousands of dollars. Most product teams only do fine-tuning; pre-training frontier models is the business of a small number of well-funded labs. See our fine-tuning explainer.
What can go wrong
- Bad data: biased labels, mislabelled examples, or a dataset that does not reflect the real-world distribution. No amount of clever modeling fixes this.
- Overfitting: model fits training set perfectly, fails on new data.
- Underfitting: model is too simple to capture the patterns. More data and a bigger model may help.
- Distribution shift: the world changes after the model was trained. User behaviour, language use, and sensor readings all drift.
- Label leakage: a feature in the training data accidentally reveals the label. The model looks great until deployed, then fails.
Frequently asked questions
How long does it take to train a model?
From seconds for a small classifier on a CPU to months for a frontier large language model on tens of thousands of GPUs. A small transformer fine-tune might take a few hours on a single GPU. A vision classifier for a custom product inspection task might take a day on one or two GPUs. The largest current models — GPT-5, Claude Opus 4, Gemini 3 — involve training runs that consume the equivalent of thousands of GPUs for several months.
Do I need to train a model from scratch, or can I use an existing one?
For nearly every commercial use case, use an existing one. Pre-trained foundation models are available for free (Hugging Face has tens of thousands) or as APIs (OpenAI, Anthropic, Google). Fine-tuning one of these is faster, cheaper, and usually more accurate than training from scratch unless you have proprietary data at a scale that matches or exceeds what the pre-training used — which is rarely the case.
What hardware do I need?
For experimenting with small models and fine-tuning, a single consumer GPU (RTX 4090, 5090) is plenty. Cloud GPUs from AWS, Google Cloud, Azure, or specialized providers like Lambda Labs and Modal let you rent bigger hardware by the hour. Training large models at scale requires orchestration across many GPUs with fast interconnects — typically Nvidia H100 or B200 GPUs in clusters of 8 or more — which is why only well-funded labs pre-train at frontier scale.