Key takeaways
- An AI agent is a software system that uses a large language model as its reasoning core, combined with tools, memory, and a control loop that lets it plan and take actions toward a goal.
- The dominant pattern is ReAct — interleaved reasoning steps and tool calls — introduced by Yao et al. in 2022.
- Planning strategies range from simple chain-of-thought to tree of thoughts and Monte Carlo-style search, trading latency and cost for reliability on harder problems.
- Agent frameworks — LangChain, LlamaIndex, AutoGen, CrewAI, LangGraph — standardize the plumbing around tool schemas, memory, and multi-agent orchestration.
- The Model Context Protocol (MCP) is an emerging open standard for connecting agents to external tools and data without per-framework adapters.
What an AI agent actually is
The classical definition of an intelligent agent predates LLMs: any entity that perceives its environment and acts on it to achieve goals. The Wikipedia entry on intelligent agents traces this framing back to Russell and Norvig’s textbook. What changed in 2022-2023 was the arrival of language models capable enough to serve as the agent’s reasoning core — able to read a task description, decide which tool to call, inspect the result, and iterate.

A modern LLM agent has four moving parts: a model (the policy), a set of tools (functions the model can call), a memory (short-term context plus, optionally, long-term storage), and a control loop (the program that repeatedly prompts the model, parses its output, executes tool calls, and feeds results back). Remove any of these and the system is something narrower — a chatbot, a RAG pipeline, a prompt template. Together, they become an agent. Readers new to the underlying models may want to start with the large language models primer.
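The four parts can be made concrete in a few lines. This is a minimal sketch, not any framework's API: the model is a callable that maps messages to a reply, tools are plain functions, memory is a message list, and `run` is the control loop.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    model: Callable[[list[dict]], dict]       # the policy: messages -> reply dict
    tools: dict[str, Callable[..., str]]      # tool name -> executable function
    memory: list[dict] = field(default_factory=list)  # short-term context

    def run(self, task: str, max_steps: int = 10) -> str:
        """The control loop: prompt, parse, execute tool calls, feed back."""
        self.memory.append({"role": "user", "content": task})
        for _ in range(max_steps):
            reply = self.model(self.memory)
            self.memory.append(reply)
            if "tool" not in reply:           # no action requested: final answer
                return reply["content"]
            result = self.tools[reply["tool"]](**reply.get("args", {}))
            self.memory.append({"role": "tool", "content": result})
        return "step budget exhausted"
```

Swapping in a real provider client for `model` and real functions for `tools` turns this skeleton into a working single-loop agent.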
Why agents, not just prompts
A single prompt-and-response exchange is fixed in scope: whatever the model knows and whatever fits in one context window. An agent extends the model’s reach. It can look things up, run code, query databases, call APIs, and incorporate the results into its next step. This changes the failure mode from “the model hallucinated an answer” to “the model reasoned over real data it fetched.”
The ReAct pattern
ReAct — short for Reasoning and Acting — is the foundational pattern for modern LLM agents. The loop is simple: the model produces a Thought (natural-language reasoning about what to do next), then an Action (a tool call with arguments), and receives an Observation (the tool’s output). It repeats until it emits a final answer instead of a new action.
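The classic free-text form of the loop looks roughly like this. The `Thought:`/`Action:`/`Observation:` labels follow the paper's format; the regex and function names are illustrative, not from any particular library.

```python
import re

# Matches the paper's "Action: tool_name[argument]" convention.
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")

def react_loop(model, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model continues the transcript with either a Thought + Action
        # or a final answer.
        step = model(transcript)
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        match = ACTION_RE.search(step)
        if match:
            name, arg = match.groups()
            observation = tools[name](arg)        # execute the tool call
            transcript += f"Observation: {observation}\n"
    return None
```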
The original paper by Yao and colleagues showed that interleaving reasoning traces with tool use outperforms both pure chain-of-thought (no tools) and pure action (no reasoning) on question-answering, fact verification, and interactive decision-making tasks. Nearly every production agent today uses a descendant of this pattern, often wrapped in structured function-calling APIs from providers like OpenAI, Anthropic, and Google.
Function calling as a ReAct substrate
Instead of parsing free-text “Action: search[query]” strings, current agents use provider-native function calling. The developer declares tool schemas (name, description, JSON-schema arguments), and the model emits structured tool-call objects that the runtime executes. This removes a large class of parsing errors but does not change the underlying ReAct loop — just makes it more reliable. Good tool descriptions matter here; see the prompt engineering guide for how wording shapes model behaviour.
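A tool declaration and dispatch might look like the following. The schema uses the JSON-Schema style common to provider function-calling APIs, though exact field names vary by provider; `get_weather` and `dispatch` are illustrative.

```python
import json

# Tool schema: name, description, and JSON-Schema argument spec.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current temperature for a city, in Celsius.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(tool_call: dict, registry: dict) -> str:
    """Execute a structured tool-call object emitted by the model."""
    fn = registry[tool_call["name"]]
    # Providers typically return arguments as a JSON-encoded string.
    args = json.loads(tool_call["arguments"])
    return fn(**args)
```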
Planning strategies
Chain-of-thought
The simplest form of planning is getting the model to think step by step in natural language before acting. Chain-of-thought prompting, introduced by Wei et al. in 2022, reliably improves performance on multi-step problems. In agents, it manifests as the Thought step in each ReAct iteration. Cheap, fast, usually enough for tasks with short horizons.
Tree of thoughts
For harder problems, a single linear trace is not enough. Tree of Thoughts (Yao et al. 2023) generalizes chain-of-thought into a search tree: the model generates multiple candidate next-steps, evaluates them, and explores the most promising branches. The cost is higher — more model calls per task — but the hit rate on puzzles like Game of 24 and creative writing rises substantially.
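The search structure can be sketched as a small beam search, assuming two model-backed functions that are placeholders here: `propose`, which generates candidate next steps from a partial solution, and `score`, which evaluates them.

```python
def tree_of_thoughts(root, propose, score, depth=3, beam=2):
    """Beam-search sketch of the Tree of Thoughts idea: propose several
    candidate next steps, score them, keep only the best branches."""
    frontier = [root]                  # current set of partial solutions
    for _ in range(depth):
        candidates = [c for state in frontier for c in propose(state)]
        if not candidates:
            break
        # Keep the `beam` highest-scoring branches for the next round.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)
```

The cost multiplier is visible in the code: each level makes `beam` times as many `propose` and `score` calls as a single linear trace would.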
MCTS-style search
A further generalization borrows from game-playing AI: Monte Carlo Tree Search, where the agent simulates multiple action sequences, estimates their value, and prefers high-value branches. Research systems like Reflexion, Language Agent Tree Search, and AlphaLLM apply this to code generation, math, and reasoning benchmarks. Production deployment is still rare because of latency and token cost, but the technique sets the upper bound on what agent reasoning can currently do.
Plan-then-execute
A pragmatic middle ground: the agent first produces an explicit multi-step plan (often in structured form), then executes the plan step by step, re-planning only when steps fail. This separates the “figure out what to do” cost from the “do it” cost and tends to produce more predictable behaviour than pure ReAct on long tasks.
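The structure separates cleanly into two calls. In this sketch, `plan` and `execute` are placeholders for the model call and the tool-executing step runner; re-planning on failure is capped to avoid loops.

```python
def plan_then_execute(task, plan, execute, max_replans=2):
    """Plan once up front, execute step by step, re-plan only on failure."""
    steps = plan(task)                 # explicit multi-step plan
    results = []
    while steps:
        step = steps.pop(0)
        ok, output = execute(step)     # execute returns (success, output)
        if ok:
            results.append(output)
        elif max_replans > 0:
            max_replans -= 1
            steps = plan(task)         # re-plan from scratch on failure
            results.clear()
        else:
            raise RuntimeError(f"step failed with no replans left: {step}")
    return results
```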
Memory systems
Short-term context
The model’s context window is the default short-term memory — everything the agent has said, seen, and done in the current session. For short tasks this suffices. For longer ones, the context fills up, and the system must decide what to keep, what to summarize, and what to drop. Rolling summaries, message-window strategies, and selective pruning are all common patterns.
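A message-window strategy combined with a rolling summary can be sketched as follows, with `summarize` standing in for a model call that compresses the dropped messages.

```python
def trim_context(messages, summarize, window=4):
    """Keep the last `window` messages verbatim; fold older ones into a
    single summary message at the front of the context."""
    if len(messages) <= window:
        return messages
    old, recent = messages[:-window], messages[-window:]
    summary = {"role": "system", "content": "Summary so far: " + summarize(old)}
    return [summary] + recent
```

Selective pruning works the same way structurally; the difference is that `old` is chosen by relevance rather than by position.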
Long-term vector memory
When an agent needs to remember across sessions or work with large corpora, short-term context is not enough. The standard approach is embedding-based retrieval: prior conversations, documents, and facts are stored as vectors in a vector database, and relevant items are retrieved at query time and injected into context. This is the same retrieval-augmented generation pattern used for knowledge grounding, applied to the agent’s own history.
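The retrieval step reduces to embed-and-rank. This toy sketch uses a bag-of-words `embed` as a stand-in for a real embedding model and a plain list in place of a vector database, but the shape of the operation is the same.

```python
import math

def embed(text: str) -> dict[str, float]:
    """Toy embedding: word-count vector (a real system calls an embedding model)."""
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, memory: list[str], k: int = 2) -> list[str]:
    """Return the k stored items most similar to the query, for injection
    into the agent's context."""
    q = embed(query)
    return sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]
```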
Structured memory stores
More recent systems combine vector memory with structured stores — key-value caches for user preferences, graph databases for relationship memory, SQL tables for tracked state. The motivation is that pure semantic retrieval fails on factual questions like “what did the user ask me to do last Tuesday?” where the answer requires precise recall, not nearest-neighbour similarity.
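A question like that maps naturally onto an exact query against a structured store. A minimal sketch with SQLite, where the table and column names are illustrative:

```python
import sqlite3

# Structured memory: exact recall via SQL, alongside any semantic index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (day TEXT, request TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [("monday", "draft the report"), ("tuesday", "book the flight")],
)

def asked_on(day: str) -> list[str]:
    """Precise recall: what did the user ask for on a given day?"""
    rows = conn.execute("SELECT request FROM tasks WHERE day = ?", (day,))
    return [r[0] for r in rows]
```

Nearest-neighbour search over embeddings might return any vaguely task-shaped memory here; the SQL query returns exactly the Tuesday entries and nothing else.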
Agent frameworks
A handful of frameworks dominate agent development as of 2026. Each takes a slightly different stance on abstraction, control flow, and multi-agent patterns.
LangChain and LangGraph
LangChain is the most widely adopted framework, offering tool wrappers, memory abstractions, and a large integration library. Its sibling LangGraph adds explicit graph-based control flow — nodes and edges that define the agent’s state machine — which tends to produce more debuggable long-running agents than the loosely-structured chains of earlier LangChain versions.
LlamaIndex
LlamaIndex started as a RAG-focused library and has expanded into agent territory. Strong on data connectors, query engines, and retrieval patterns; popular for agents whose primary job is reasoning over private data.
AutoGen
Microsoft’s AutoGen focuses on multi-agent conversations — scenarios where several specialized agents (planner, coder, critic) exchange messages to solve a problem. The framework provides group-chat orchestration and role-based agent definitions.
CrewAI
CrewAI emphasizes role-based multi-agent teams with explicit task delegation. Agents are defined by role, goal, and backstory; a “crew” object coordinates them on a shared mission. Lightweight, opinionated, popular for quick prototypes of multi-agent systems.
Model Context Protocol
The Model Context Protocol (MCP), proposed by Anthropic in late 2024, has emerged as an open standard for connecting LLM agents to external tools and data. It defines a client-server protocol so that any MCP-compliant agent can use any MCP-compliant tool server without per-framework glue code. Servers expose resources (data), tools (actions), and prompts (templates); clients, the agents, consume them through a common JSON-RPC interface.
MCP matters because tool integration has been a major tax on agent development. Every framework invented its own way to declare tools, handle auth, stream results, and manage permissions. A shared protocol lets tool providers ship once and run in many agent runtimes, and lets agent builders pick up new capabilities without writing adapter code.
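On the wire, an MCP tool invocation is a JSON-RPC 2.0 request. The sketch below shows the general shape of a `tools/call` request; it is simplified, the tool name is illustrative, and the full envelope (initialization handshake, capabilities, result format) is defined in the MCP specification.

```python
import json

# Simplified shape of an MCP tool invocation over JSON-RPC 2.0.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_docs",                    # illustrative tool name
        "arguments": {"query": "refund policy"},
    },
}
wire = json.dumps(request)   # what the client sends to the server
```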
Current limits
Agents work impressively in demos and increasingly well in production for bounded tasks, but several failure modes remain open problems.
Reliability over long horizons
Each step has some probability of error, and errors compound multiplicatively. A 95% per-step success rate compounds to roughly 60% over ten steps and under 40% over twenty. Current systems handle short-horizon tasks well and struggle with genuinely long ones.
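The compounding arithmetic is simply the per-step success probability raised to the number of steps:

```python
def horizon_success(p: float, n: int) -> float:
    """End-to-end success probability for n independent steps,
    each succeeding with probability p."""
    return p ** n
```

This assumes independent failures; correlated errors (the agent repeating the same mistake) can make long-horizon reliability worse than the formula suggests.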
Error recovery
When a tool call fails, returns unexpected output, or produces a wrong answer the agent trusts, recovery is hard. The model may loop on the same mistake, invent a workaround that makes things worse, or confidently report success despite failure. Robust retry logic, structured error signals, and explicit verification steps help, but there is no clean solution.
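One common mitigation is to wrap tool execution in bounded retries and return a structured error signal the model can reason about, rather than crashing or failing silently. A sketch, with `call_tool` as a placeholder for the real tool function:

```python
def call_with_retry(call_tool, args, retries=3):
    """Retry a tool call a bounded number of times; on exhaustion, return
    a structured error record instead of raising, so the agent sees why
    the call failed and can change course."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return {"ok": True, "result": call_tool(**args)}
        except Exception as exc:
            last_error = {"ok": False, "attempt": attempt, "error": str(exc)}
    return last_error
```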
Security and prompt injection
Agents that read external content and take actions are the highest-risk category for prompt injection attacks. A malicious web page or document can embed instructions that hijack the agent’s behaviour. Privilege separation, output validation, and human-in-the-loop gates for destructive actions are standard mitigations, but the threat model is not fully solved at the model layer.
Cost and latency
Planning strategies like tree of thoughts or MCTS multiply model calls per task, long-context memory grows token counts, and multi-agent systems add conversational overhead. A capable agent is often 10-100x more expensive than a single prompt and can take minutes instead of seconds.
Frequently asked questions
How is an AI agent different from a chatbot?
A chatbot produces text in response to user messages and typically has no ability to take external actions. An AI agent is built around an LLM but adds tools, memory, and a control loop that lets it fetch data, run code, call APIs, and iterate on a task autonomously. The same base model can serve both roles; the difference is the scaffolding around it. Most production “chatbots” today are actually lightweight agents, because almost every useful assistant needs at least retrieval or function calling to stay grounded.
Do AI agents require special models, or does any LLM work?
Any capable instruction-following LLM can in principle drive an agent, but models trained with tool-use and function-calling examples perform dramatically better on ReAct loops, planning, and structured output. Frontier models from major labs ship with native function-calling APIs and agentic post-training, and open-weight models increasingly include similar capabilities. The practical floor is a model that reliably produces well-formed JSON and follows multi-turn instructions; below that, agent behaviour becomes too unreliable to deploy.
Should developers build agents from scratch or use a framework?
It depends on the complexity and the team’s experience. For simple single-loop ReAct agents with a handful of tools, a custom implementation of about a hundred lines of code is often clearer than pulling in a large framework. For multi-agent orchestration, complex memory patterns, or production-scale observability, frameworks like LangGraph, AutoGen, or CrewAI save real work. The trade-off is framework lock-in against boilerplate — a choice worth making deliberately rather than by default.






