AI Code Generation: How Copilot and Cursor Write Code

Key takeaways

  • AI code assistants like GitHub Copilot, Cursor, Claude Code, Windsurf, and Cody are built on large language models trained on public code repositories plus general text.
  • Code completion uses a fill-in-the-middle training objective so the model can predict text given both the code before and after the cursor, not just what comes next.
  • Quality on isolated function problems is measured by HumanEval; quality on realistic repository tasks is measured by SWE-bench and its Verified variant.
  • A controlled study of GitHub Copilot found developers using it completed an HTTP-server task 55.8% faster than a control group, though later field studies show mixed productivity effects on real code.
  • Context handling — which files the tool sends to the model, and in what order — now matters more than raw benchmark score for day-to-day usefulness.

How code LLMs learn to program

An AI code generator is a large language model whose training data includes a substantial share of source code — typically scraped from public GitHub repositories, package registries, and curated corpora such as The Stack. The model learns statistical patterns across programming languages, idioms, API usage, documentation, and the natural-language comments that surround code. It does not execute the code during pretraining; it simply learns to predict the next token in its training examples.


Because code is heavily structured and has automated verification (tests, type checks, compilers), it is an unusually good training signal. Modern frontier models are trained on code even when the product target is general assistance, because code exposure appears to improve reasoning across unrelated domains. Coding-specific checkpoints are then further refined with supervised fine-tuning on instruction-response pairs and reinforcement learning from feedback — human and automated.

Fill-in-the-middle training

Standard language models predict left-to-right: given a prefix, produce the next token. That is not how programmers work. A developer usually writes partial code, jumps back to the middle, and wants the assistant to fill a hole between an existing prefix and suffix. Bavarian et al. at OpenAI showed that models can learn this objective — called fill-in-the-middle (FIM) — by randomly splitting training documents into prefix, middle, suffix and rearranging them so the model learns to generate the middle given the other two pieces. Critically, FIM training does not degrade standard left-to-right performance. Every major code model released since late 2022 uses some variant of this approach.
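The transformation itself is simple to sketch. The snippet below shows one common arrangement (prefix, suffix, then middle as the prediction target); the sentinel token strings are hypothetical — each model family reserves its own.

```python
import random

# Hypothetical sentinel tokens; real models reserve their own special tokens.
PRE, MID, SUF = "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Split a training document into (prefix, middle, suffix) at two
    random cut points, then rearrange so the middle comes last. The model
    is then trained left-to-right on this string, which teaches it to
    generate the middle given both prefix and suffix."""
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # PSM ordering: prefix, suffix, then middle as the target to predict.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

doc = "def add(a, b):\n    return a + b\n"
example = to_fim_example(doc, random.Random(0))
```

Because the pieces are only rearranged, never altered, the original document can always be reconstructed — which is why mixing FIM examples into training does not cost left-to-right performance.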

Context window utilization

A model can only consider text that fits in its context window. Early Codex had 2,048 tokens; current coding models support 128k-1M tokens. But raw window size is only part of the story. What actually reaches the model depends on the tool’s retrieval strategy — which files it reads from the repository, in what order, and how it summarizes them. Cursor and Claude Code build repository indexes and pull in likely-relevant files. Copilot relies on a mix of open editor tabs and retrieval from the workspace. Retrieval quality, not context length, is now the main differentiator on real codebases. For background on the underlying models, see our large language models primer.
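No vendor publishes its exact retrieval pipeline; real systems use embedding indexes and AST-aware chunking. But the core idea — score repository files against the current task and send only the best few — can be illustrated with a deliberately naive bag-of-words ranker (all names here are made up for the sketch):

```python
def rank_files(query: str, files: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank repository files by keyword overlap with the query and return
    the top_k paths. A toy stand-in for the embedding-based retrieval
    that production tools actually use."""
    query_terms = set(query.lower().split())

    def score(text: str) -> int:
        return len(query_terms & set(text.lower().split()))

    return sorted(files, key=lambda path: score(files[path]), reverse=True)[:top_k]

repo = {
    "auth/login.py": "def login(user, password): check password hash",
    "billing/invoice.py": "def render_invoice(order): total tax",
    "auth/session.py": "def create_session(user): token expiry",
}
# A query about passwords should surface the login module first.
best = rank_files("fix password hash check", repo, top_k=2)
```

Even this crude version shows why retrieval dominates: if the ranking step picks the wrong files, a million-token window full of irrelevant code does not help.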

Code completion versus code generation

Inline completion

Inline completion — the grey “ghost text” that appears as you type — is the oldest and most widely used mode. The model is called on every keystroke (debounced), receives the surrounding code as prompt, and produces a short suggestion the user can accept with Tab. Latency budgets are tight: a suggestion that arrives 800ms after the user has already typed the next character is worse than no suggestion at all. Inline completion rewards small, fast, FIM-trained models.
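The debounce logic is simple to sketch. The class below is an illustrative stand-in, not any editor's actual implementation; the 75ms default is an arbitrary assumption:

```python
import time

class Debouncer:
    """Suppress model calls until the user pauses typing. Each keystroke
    resets the timer; a completion request is allowed only once quiet_ms
    have elapsed since the last keystroke."""

    def __init__(self, quiet_ms: float = 75.0):
        self.quiet_s = quiet_ms / 1000.0
        self.last_keystroke = 0.0

    def keystroke(self) -> None:
        self.last_keystroke = time.monotonic()

    def should_request(self) -> bool:
        return time.monotonic() - self.last_keystroke >= self.quiet_s

d = Debouncer(quiet_ms=50)
d.keystroke()
assert not d.should_request()  # too soon after the keystroke
time.sleep(0.06)
assert d.should_request()      # the user paused; safe to call the model
```

Real editors add a second mechanism on top: cancelling an in-flight request the moment the next keystroke invalidates its prompt.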

Chat and edit modes

Chat mode lets the developer describe a change in natural language; the assistant proposes an edit or writes new code. Edit mode (Cursor’s “Composer”, Copilot’s “Edits”, Claude Code’s default loop) goes further: the model is given tools to read files, run commands, and apply multi-file patches. These modes run stronger, slower models and tolerate higher latency because the developer is waiting for a meaningful change rather than a character-by-character completion.

Agentic coding

The newest mode hands the model a task and lets it loop — plan, edit, run tests, read output, iterate — with limited human supervision. SWE-bench was designed for exactly this setting. Claude Code, Cursor Agent, Copilot Workspace, and Devin all operate here. Agentic coding puts more weight on the quality of tool use and reflection than on raw token prediction. Good prompt engineering of the system and tool descriptions matters as much as model choice.
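The loop structure shared by these systems can be sketched in a few lines. Everything here is schematic: `apply_edit` stands in for a model call with editing tools, and `run_tests` for a shell invocation of the project's test suite.

```python
def agentic_loop(task: str, apply_edit, run_tests, max_iters: int = 5) -> int:
    """Minimal agent loop: propose an edit, run the tests, feed the failure
    output back to the model, and stop when the suite passes. Returns the
    number of iterations used."""
    feedback = task
    for i in range(max_iters):
        apply_edit(feedback)            # model edits files based on feedback
        passed, output = run_tests()    # ground truth from the test suite
        if passed:
            return i + 1
        feedback = output               # next turn sees the failing output
    raise RuntimeError("gave up after max_iters iterations")

# Toy harness: the "codebase" needs three edits before the tests pass.
state = {"edits": 0}
def apply_edit(_feedback): state["edits"] += 1
def run_tests(): return (state["edits"] >= 3, f"failing after {state['edits']} edits")

iterations = agentic_loop("fix the reported bug", apply_edit, run_tests)
```

The key design point is the feedback edge: the model's next attempt is conditioned on real test output, not on its own guess about whether the edit worked.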

The main tools and what distinguishes them

GitHub Copilot

GitHub Copilot launched in technical preview in June 2021 and reached general availability for individual developers in June 2022, according to GitHub’s own documentation. It was originally powered by OpenAI Codex, a GPT derivative fine-tuned on public code. Today Copilot supports multiple model backends — including Anthropic, OpenAI, and Google models — selectable by the user. It is integrated into Visual Studio, VS Code, JetBrains IDEs, Neovim, and GitHub.com itself. Its strength is distribution and IDE integration; its historical weakness has been repository-wide context handling.

Cursor

Cursor is a VS Code fork built around AI-first workflows. It pioneered tight integration between chat, inline completion, and multi-file edits, and invests heavily in codebase indexing. Cursor uses frontier models from Anthropic, OpenAI, and others under the hood. It became widely adopted in 2024-2025 among developers who wanted deeper AI features than Copilot shipped at the time.

Claude Code

Claude Code, Anthropic’s official CLI, runs Claude directly against a local repository with tool access — file reading, editing, shell execution. It targets agentic workflows: describe a task, let the model iterate. It is widely used for larger refactors and exploratory work where chat-in-editor would be too slow.

Windsurf

Windsurf (formerly Codeium’s IDE) is another AI-first editor, differentiating on its “Cascade” agent that can plan and execute multi-step edits. Codeium also ships a free extension for existing IDEs.

Cody

Cody is Sourcegraph’s assistant, built around Sourcegraph’s code-search index. Its distinctive advantage is large-codebase context: it can retrieve across many repositories, which matters for enterprise monorepos and multi-service architectures.

Benchmarks: HumanEval, MBPP, SWE-bench

HumanEval and the pass@k metric

HumanEval, introduced alongside Codex in Chen et al. 2021, is a hand-crafted set of 164 Python programming problems. Each problem has a signature, docstring, and hidden unit tests. The model sees signature and docstring, produces a function body, and is scored on whether the unit tests pass. The pass@k metric reports the probability that at least one of k sampled completions passes. Early Codex scored 28.8% pass@1; current frontier models exceed 90% on HumanEval. That ceiling is why HumanEval is now considered saturated and researchers have moved to harder benchmarks.

MBPP, MultiPL-E, and LiveCodeBench

MBPP (Mostly Basic Python Problems) extends HumanEval-style evaluation to around 1,000 problems. MultiPL-E translates HumanEval to 18+ programming languages. LiveCodeBench uses competitive-programming problems released after the model’s training cutoff to reduce data contamination. None of these, however, resemble real software-engineering work.

SWE-bench

SWE-bench, introduced in Jimenez et al. 2023, is the closest benchmark to real work. It contains 2,294 issue-plus-pull-request pairs from 12 popular Python repositories. The model is given the issue text and the repository state before the fix, and must produce a patch that resolves the issue and passes the project’s existing test suite. Early models scored under 5%. OpenAI’s SWE-bench Verified subset (a human-validated 500-problem slice) is the de facto leaderboard, and agent-based systems now pass 60-70% on it — a dramatic shift in two years, and the clearest signal that agentic coding is no longer a research demo.
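The scoring rule is two-sided: the patch must fix the reported behavior without regressing anything else. SWE-bench calls these test groups FAIL_TO_PASS and PASS_TO_PASS; the harness sketched below (function names, toy lambdas) is an illustration, not the benchmark's actual code.

```python
def resolves_issue(apply_patch, run_test, fail_to_pass, pass_to_pass) -> bool:
    """Simplified SWE-bench-style check: after applying the candidate
    patch, the tests that reproduced the issue must now pass AND the
    previously passing tests must still pass."""
    apply_patch()
    return (all(run_test(t) for t in fail_to_pass)
            and all(run_test(t) for t in pass_to_pass))

# Toy run: the patch fixes the bug and keeps existing behavior intact.
results_after_patch = {"test_reported_bug": True, "test_existing_feature": True}
ok = resolves_issue(
    apply_patch=lambda: None,                       # stands in for `git apply`
    run_test=lambda name: results_after_patch[name],
    fail_to_pass=["test_reported_bug"],
    pass_to_pass=["test_existing_feature"],
)
```

The PASS_TO_PASS half is what makes the benchmark hard: a patch that fixes the issue by breaking an unrelated feature scores zero.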

What studies say about productivity

Controlled experiments

The most cited result is Peng et al. 2023: 95 developers were asked to implement an HTTP server in JavaScript. The treatment group had Copilot access and finished 55.8% faster than the control group (P=0.0017, 95% CI [21%, 89%]). Less-experienced developers benefited most. The task was narrow and boilerplate-heavy, which is exactly where AI completion excels.

Field evidence

Larger field studies paint a more complex picture. Some organizations report meaningful productivity gains; others report flat or negative effects once the metric shifts from “lines produced” to “defects”, “time-to-merge”, or “review burden”. A 2024 study covered by industry press found that AI-generated code correlated with higher defect rates and more code churn in some samples. The honest summary: AI coding tools clearly speed up scaffolding and boilerplate; whether they speed up a given team on a given codebase depends on task mix, code quality standards, and review discipline.

Where gains are real

Reasonably well-established use cases: writing tests, generating boilerplate (CRUD endpoints, data-class conversions, regex), explaining unfamiliar code, translating between languages, drafting configuration files, and large mechanical refactors. Weak areas: novel algorithmic work, reasoning about concurrency, and judgement calls about architecture. For broader context on how these tools fit into the industry, see our AI industry coverage.

Limitations and risks

Hallucinated APIs

Models sometimes invent function names, import paths, or method signatures that do not exist. This is especially common for smaller libraries or recent API changes the model did not see during training. The defence is running the code — type checkers, tests, linters — rather than trusting the suggestion.
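One cheap automated check is to parse a generated snippet and confirm its imports actually resolve in the current environment. The sketch below uses the standard library for this; it catches invented packages but not invented attributes on real ones:

```python
import ast
import importlib.util

def missing_imports(source: str) -> list[str]:
    """Parse generated Python and return top-level modules that cannot be
    found in the current environment — a first-pass filter for
    hallucinated packages. It will not catch invented functions or
    methods on modules that do exist; tests and type checkers do that."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)

snippet = "import json\nimport totally_made_up_pkg\n"
print(missing_imports(snippet))  # ['totally_made_up_pkg']
```

Running this before accepting a suggestion is the automated version of the advice above: verify, don't trust.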

License and provenance

Because training data includes open-source code, suggestions can occasionally reproduce copyrighted snippets verbatim. Vendors have added filters to detect and suppress near-exact matches, and some offer indemnification for paid customers, but the legal situation is still evolving.

Security

Studies have shown AI-generated code can introduce well-known vulnerability patterns (SQL injection, hard-coded secrets, insecure defaults) if the prompt does not steer away from them. Treat AI output as code from a fast but unreliable collaborator: review it.
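A lightweight pre-review scan can flag the most mechanical of these patterns before a human ever reads the diff. The patterns below are illustrative and far from exhaustive — real SAST tools and linters go much deeper:

```python
import re

# Illustrative patterns only; a real scanner covers many more cases.
RISK_PATTERNS = {
    "hard-coded secret": re.compile(
        r"""(?i)(password|api_key|secret)\s*=\s*["'][^"']+["']"""),
    "SQL built via f-string": re.compile(
        r"""f["'].*(SELECT|INSERT|UPDATE|DELETE)\b.*\{"""),
}

def flag_risks(source: str) -> list[str]:
    """Return the names of risk patterns found in the code. A crude
    pre-review filter, not a substitute for review or real SAST tooling."""
    return [name for name, pattern in RISK_PATTERNS.items()
            if pattern.search(source)]

bad = 'api_key = "sk-123"\nquery = f"SELECT * FROM users WHERE id = {uid}"'
print(flag_risks(bad))  # ['hard-coded secret', 'SQL built via f-string']
```

The point is not that regexes solve the problem — it is that AI output should pass through the same (or stricter) automated gates as human-written code.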

Frequently asked questions

Do I need to pick one tool, or can I use several?
Most developers end up using more than one. A common pattern is Copilot or a lightweight completion tool for inline suggestions in the IDE, plus an agentic tool like Claude Code or Cursor’s Composer for larger tasks. The tools are not mutually exclusive, and because they connect to similar underlying models, the differences lie mostly in IDE integration, context handling, and workflow ergonomics rather than raw code quality. Try two or three for a week each before committing.

Are benchmark scores a reliable way to choose a coding assistant?
Only partly. HumanEval is saturated and tells you little. SWE-bench Verified is more informative, but it measures agentic problem-solving on Python issues, which may not match your stack or workflow. In practice the limiting factor is often the tool’s context retrieval — how well it picks which files to send the model — rather than model capability. Test a tool on your actual codebase for a realistic signal.

Will AI code generation replace programmers?
Not on current evidence, though it is reshaping the work. Tasks that were expensive — writing tests, porting code, building scaffolds, explaining unfamiliar systems — are now cheap. Tasks that were the core of senior engineering — architecture, trade-off analysis, debugging production incidents, working with stakeholders — remain human work, sometimes augmented by AI. The likely direction is fewer developers writing repetitive code and more developers reviewing, directing, and integrating AI output, with skill premiums shifting toward judgement and systems thinking.

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.