Fine-Tuning vs RAG: When to Customize Your LLM

Key takeaways

  • Fine-tuning continues training an existing LLM on your data, baking new behaviour into the model weights.
  • RAG leaves the model weights alone and instead provides fresh knowledge at query time via retrieved context.
  • Use fine-tuning to teach style, format, tone, or specialized output patterns. Use RAG to ground answers in fresh or private factual data.
  • Many production systems use both — a fine-tuned model that also retrieves from a knowledge base.
  • Parameter-efficient fine-tuning methods like LoRA have made fine-tuning dramatically cheaper and are the default choice today.

The core difference

Both fine-tuning and RAG aim at the same outcome — get an LLM to produce answers useful for your specific case — but they do it in opposite ways. Fine-tuning changes the model. RAG changes what the model sees at inference time.

Comparison concept, illustrating the tradeoffs between fine-tuning and RAG
Photo by cottonbro studio on Pexels

A fine-tuned model has been retrained on your examples. It now has new “intuitions” baked in — maybe it always responds in your company’s brand voice, maybe it handles your niche technical vocabulary, maybe it outputs exactly the JSON schema your frontend expects.

A RAG system, by contrast, uses an unmodified base model but pairs it with a vector database full of your documents. When a query comes in, the system retrieves relevant chunks and pastes them into the prompt. The model answers using that retrieved context. See our RAG explainer for the full pipeline.
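The retrieval step can be sketched in a few lines of plain Python. This is a toy illustration, not a production pipeline: the three-dimensional vectors stand in for real embedding-model output, and the in-memory list stands in for a vector database.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=2):
    # Rank stored chunks by similarity to the query embedding, keep the top k.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

def build_prompt(question, chunks):
    # Paste the retrieved context into the prompt ahead of the question.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy index: in practice the vectors come from an embedding model.
index = [
    {"text": "Refunds are processed within 5 business days.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.", "vec": [0.1, 0.9, 0.0]},
]
prompt = build_prompt("How long do refunds take?", retrieve([0.8, 0.2, 0.1], index, k=1))
```

The model never sees the whole knowledge base, only the top-k chunks, which is why retrieval quality matters as much as the model itself.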

When to fine-tune

Teaching a style or format

If you want consistent tone, brand voice, or output structure — always returning valid JSON in a specific schema, always writing in your support team’s register, always following a proprietary document template — fine-tuning is the right tool. Prompt engineering can approximate this, but fine-tuning makes it the model’s default behaviour.
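To make that concrete, here is what training data for a format-teaching fine-tune might look like, using the chat-style JSONL layout that hosted fine-tuning services commonly accept. The schema and the example conversation are invented for illustration.

```python
import json

# Hypothetical training examples teaching a fixed JSON output schema.
examples = [
    {
        "messages": [
            {"role": "system",
             "content": 'Reply only with JSON: {"sentiment": ..., "topic": ...}'},
            {"role": "user",
             "content": "The checkout page keeps timing out."},
            {"role": "assistant",
             "content": '{"sentiment": "negative", "topic": "checkout"}'},
        ]
    },
]

# Fine-tuning services typically ingest one JSON object per line (JSONL).
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

Every example shows the model the exact output you want; after training, that schema becomes the default behaviour rather than something the prompt has to enforce on every request.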

Narrow, high-volume tasks

If you have thousands of queries per day for the same narrow task — classifying emails into 50 categories, extracting named entities from legal contracts, summarizing call transcripts in one fixed format — a fine-tuned smaller model often matches or beats a larger base model at a fraction of the cost per query.

Specialized vocabulary or domain

Medical, legal, financial, and scientific domains have vocabulary, reasoning patterns, and conventions that benefit from exposure during training. Fine-tuning on in-domain text can produce a model that handles the specialized content more fluently than base models.

Model compression

Distillation is a form of fine-tuning where a small model is trained to mimic a large one. The resulting smaller model is much cheaper to serve while retaining much of the large model’s quality on the tasks you care about.
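The core distillation objective is compact: the student is trained to match the teacher's temperature-softened output distribution. A minimal NumPy sketch of that loss, with illustrative logits and temperature:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T softens the teacher's distribution.
    s = z / T - np.max(z / T)  # shift for numerical stability
    e = np.exp(s)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    # Cross-entropy of the student against the teacher's soft targets,
    # the standard knowledge-distillation objective.
    p_teacher = softmax(teacher_logits, T)
    log_q_student = np.log(softmax(student_logits, T))
    return -np.sum(p_teacher * log_q_student)
```

The loss is minimized when the student reproduces the teacher's distribution exactly, so training pushes the small model toward the large model's behaviour on the training inputs.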

When to use RAG

Answering questions about private data

RAG is the simplest way to make an LLM answer questions about your company’s documents, customer-support history, or any proprietary knowledge base. You index the documents once, then queries retrieve them on demand. No training required.

Fresh or frequently updated information

LLMs have a training cutoff. If you need the model to answer about this week’s earnings call, yesterday’s policy change, or real-time inventory, RAG is the right mechanism. Fine-tuning cannot keep up with frequent data changes.

Citable answers

When users need to verify or trust answers — legal research, medical queries, financial compliance — RAG lets you show the retrieved source documents alongside the answer. Fine-tuning produces outputs that look confident but cannot be traced to a specific source.

Large, changing knowledge bases

If your knowledge base has millions of documents that churn frequently, fine-tuning is impractical — retraining every time data changes would be absurdly expensive. A vector index in a vector database handles updates much more gracefully.

Cost comparison

Fine-tuning a small model (7B parameters) with modern parameter-efficient methods like LoRA typically costs tens to low hundreds of dollars per run. Fine-tuning a larger model (70B+) costs thousands. The trained model is cheap to serve but expensive to retrain.

RAG has no training cost, but has ongoing retrieval costs — embedding models, vector database queries, and the tokens consumed when retrieved context is pasted into the prompt. For a high-volume production system, retrieved context easily doubles or triples the token cost per query.

The cost tradeoff inverts with scale: RAG is cheaper to set up but more expensive per query, while fine-tuning is more expensive to set up but cheaper per query. Which wins depends on volume.
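The break-even point is simple arithmetic. A sketch with illustrative numbers only — substitute your own measured costs:

```python
# Illustrative figures, not benchmarks; substitute your own measurements.
finetune_setup_cost = 200.0       # one-off LoRA run, dollars
finetuned_cost_per_query = 0.001  # small tuned model, short prompt
rag_setup_cost = 20.0             # one-off embedding and indexing
rag_cost_per_query = 0.004        # base model plus retrieved-context tokens

def total_cost(setup, per_query, n_queries):
    # Total spend after n queries under one approach.
    return setup + per_query * n_queries

# Query volume at which fine-tuning becomes cheaper overall:
breakeven = (finetune_setup_cost - rag_setup_cost) / (
    rag_cost_per_query - finetuned_cost_per_query
)
```

With these particular numbers the crossover sits around 60,000 queries; below that volume RAG's lower setup cost wins, above it the fine-tuned model's cheaper queries dominate.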

Parameter-efficient fine-tuning and LoRA

Fine-tuning all the weights of a large model is expensive — you need to compute and store gradients for every parameter. LoRA (low-rank adaptation), introduced by Microsoft Research in 2021, changed the economics of fine-tuning. Instead of updating every weight, LoRA inserts small trainable matrices alongside the frozen original weights. Only the small matrices are trained, cutting memory and compute requirements by 10x or more.
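The mechanism is easy to show in NumPy: the frozen weight matrix W stays fixed, and the adapter contributes a low-rank update through two small matrices. Initializing B to zero (standard in LoRA) means the adapted layer starts out identical to the base layer. The sizes here are toy values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                          # hidden size, LoRA rank (r << d)
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # small trainable matrix
B = np.zeros((d, r))                 # zero-initialized, so the adapter starts as a no-op

def lora_forward(x):
    # y = x W^T + (x A^T) B^T : only A and B would receive gradients.
    return x @ W.T + (x @ A.T) @ B.T

x = rng.normal(size=(1, d))
# Trainable parameters: 2*r*d for the adapter vs d*d for full fine-tuning.
```

Even at this toy scale the adapter trains 32 parameters instead of 64; at real model sizes the ratio is far more dramatic, which is where the memory savings come from.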

LoRA and its descendants — QLoRA, DoRA, adapters — have made fine-tuning accessible. A 70B-parameter model can be LoRA-fine-tuned on a single high-end consumer GPU. Multiple LoRA adapters can be stored and swapped at inference time, so one base model can serve many tuned variants.

Using both together

In production, fine-tuning and RAG are often complementary. A typical pattern: fine-tune the base model to follow your output format and tone, then wire it to a RAG pipeline for factual grounding. The fine-tune handles “how to answer”; RAG handles “what to answer with”. Many customer-support chatbots, internal knowledge tools, and vertical AI products combine the two.

For more on the underlying models, see our large language models explainer.

A decision guide

  • Want the model to always output in a specific format, tone, or domain style? → Fine-tune.
  • Need to answer questions about constantly changing or private documents? → RAG.
  • Need citable sources for each answer? → RAG.
  • Running a very high query volume where cost per query matters more than setup cost? → Fine-tune (possibly distill to a smaller model).
  • Facts change daily or weekly? → RAG.
  • Need a consistent persona plus access to your docs? → Both.
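The checklist above can be encoded as a toy decision function. This is a deliberate oversimplification — a real decision also weighs the cost arithmetic from earlier — but it captures the shape of the choice:

```python
def choose_approach(needs_fixed_format, data_changes_often,
                    needs_citations, high_volume):
    # Encodes the decision guide: format/volume pull toward fine-tuning,
    # freshness/citations pull toward RAG; both signals together mean both.
    use_finetune = needs_fixed_format or high_volume
    use_rag = data_changes_often or needs_citations
    if use_finetune and use_rag:
        return "both"
    if use_finetune:
        return "fine-tune"
    if use_rag:
        return "RAG"
    return "prompt engineering"
```

Note the default: when none of the signals fire, the cheapest answer is no customization at all, just careful prompting.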

Frequently asked questions

Is fine-tuning better than prompt engineering?
Not necessarily. Prompt engineering — carefully crafting the system prompt and few-shot examples — can get you very far, especially with modern large models. Many tasks that seemed to require fine-tuning two years ago can now be solved with a well-designed prompt. Start with prompt engineering. Move to fine-tuning only if prompts cannot produce the consistency or quality you need, or if prompt length makes per-query costs unacceptable.

Can I fine-tune GPT-5 or Claude?
Fine-tuning availability varies by provider. OpenAI offers fine-tuning for selected models including some GPT-4 variants. Anthropic offers fine-tuning for Claude through enterprise arrangements and Bedrock. Google offers fine-tuning for Gemini. Open-weights models (Llama, Mistral, Qwen) can be fine-tuned freely on your own hardware or via services like Together, Fireworks, or Modal. Check provider documentation for current offerings, as these change frequently.

How much data do I need for fine-tuning?
Much less than pre-training. For instruction-following fine-tunes, a few hundred to a few thousand high-quality examples often suffice. For style or format learning, even fewer. The key is quality over quantity — a hundred carefully curated examples typically beat ten thousand noisy ones. LoRA and similar methods are especially effective in low-data regimes.

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.