Long-Horizon Agents

A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives.

A single model call answers one prompt. An agent loop extends this into a process:

$$ \text{observe} \rightarrow \text{plan} \rightarrow \text{act} \rightarrow \text{observe} \rightarrow \cdots $$

The word “long-horizon” means the task cannot be solved reliably in one step. The agent must preserve intent across time.

Examples include:

Task	Why it is long-horizon
Building a software feature	Requires reading code, editing files, testing, debugging
Researching a topic	Requires search, source selection, synthesis, citation
Planning a trip	Requires constraints, availability, routes, booking options
Running an experiment	Requires setup, execution, measurement, analysis
Robot manipulation	Requires perception, motion, feedback, correction

The central problem is control. The agent must decide what to do next, not merely predict the next token.

Agent State

An agent maintains state across steps. State includes the task objective, observations, tool results, partial outputs, memory, and constraints.

A minimal agent state can be written as:

$$ s_t = (g, o_{\leq t}, a_{<t}, m_t), $$

where:

Symbol	Meaning
$g$	Goal
$o_{\leq t}$	Observations up to time $t$
$a_{<t}$	Previous actions
$m_t$	Memory at time $t$

The next action is selected from this state:

$$ a_t \sim \pi_\theta(a \mid s_t). $$

Here $\pi_\theta$ is the model’s policy. In an LLM agent, the policy is usually implemented by a language model prompted with the current state and available tools.
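The state tuple above can be sketched as a small data structure. This is an illustration only, not a prescribed design: the names `AgentState`, `record`, and `render_prompt` are hypothetical, and memory is assumed to be a plain list of text notes.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Hypothetical container mirroring s_t = (g, o_{<=t}, a_{<t}, m_t)."""
    goal: str                                          # g
    observations: list = field(default_factory=list)   # o_{<=t}
    actions: list = field(default_factory=list)        # a_{<t}
    memory: list = field(default_factory=list)         # m_t

    def record(self, action: str, observation: str) -> None:
        # Append the action taken and the observation it produced.
        self.actions.append(action)
        self.observations.append(observation)

    def render_prompt(self) -> str:
        # Serialize the state into text that a language model could condition on.
        lines = [f"Goal: {self.goal}"]
        for a, o in zip(self.actions, self.observations):
            lines.append(f"Action: {a}")
            lines.append(f"Observation: {o}")
        for note in self.memory:
            lines.append(f"Memory: {note}")
        return "\n".join(lines)


state = AgentState(goal="Fix the failing test")
state.record("run_tests", "1 failed: test_login")
prompt = state.render_prompt()
```

Rendering the state as text reflects the common case in which the policy is a prompted language model; other encodings are possible.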

Agent Loop

A basic agent loop has four stages.

Stage	Role
Observe	Read current environment state
Decide	Select next action
Act	Execute tool call or produce output
Update	Store result and revise state

In pseudocode:

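def run_agent(goal, policy, environment, max_steps):
    state = init_state(goal)

    for step in range(max_steps):
        action = policy(state)

        # A "final" action ends the episode and returns the answer.
        if action.type == "final":
            return action.output

        observation = environment.step(action)
        state = update_state(state, action, observation)

    # Step budget exhausted without a final answer.
    return None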

This loop is simple. Real systems add validation, tool schemas, error handling, budgets, retries, and safety checks.
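Some of these additions can be sketched by extending the loop. The following is a minimal illustration under stated assumptions: `policy` is a hypothetical callable that returns a dict, `tools` is a hypothetical dict of plain Python functions, and `run_agent_with_checks` is not taken from any agent framework.

```python
def run_agent_with_checks(policy, tools, goal, max_steps=5):
    """Run a bounded agent loop with basic validation and error handling."""
    history = []  # list of (action, observation) pairs
    for _ in range(max_steps):
        action = policy(goal, history)
        if action["type"] == "final":
            return action["output"], history
        tool = tools.get(action["name"])
        if tool is None:
            # Validation: an unknown tool becomes an observation, not a crash.
            observation = f"error: unknown tool {action['name']!r}"
        else:
            try:
                observation = tool(**action["arguments"])
            except Exception as exc:  # error handling: report and continue
                observation = f"error: {exc}"
        history.append((action, observation))
    return None, history  # budget exhausted without a final answer


# Hypothetical scripted policy: call one tool, then finish.
def scripted_policy(goal, history):
    if not history:
        return {"type": "tool", "name": "add", "arguments": {"a": 2, "b": 3}}
    return {"type": "final", "output": f"result={history[-1][1]}"}


output, trace = run_agent_with_checks(
    scripted_policy, {"add": lambda a, b: a + b}, "add numbers"
)
```

Feeding errors back as observations lets the policy react to them on the next step instead of ending the run.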

Planning

Planning decomposes a goal into steps.

For example, a coding task may become:

Inspect repository structure.
Locate relevant files.
Read interfaces.
Edit implementation.
Run tests.
Fix failures.
Summarize changes.

The plan gives structure, but it should not be rigid. Tool results may reveal that the original plan was wrong.

A good agent uses a plan as a working hypothesis.

Planning may be explicit (stored as text) or implicit (represented in the model’s hidden state). Explicit plans are easier to inspect and revise.
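An explicit plan of the kind described above can be held as an ordered list of steps with a status per step. The sketch below is one possible representation; the `Plan` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical explicit plan: ordered steps plus a status per step."""
    steps: list
    status: dict = field(default_factory=dict)  # step -> "done" or "failed"

    def next_step(self):
        # Return the first step that has not been marked done.
        for step in self.steps:
            if self.status.get(step) != "done":
                return step
        return None

    def revise(self, old_step, replacements):
        # Replace one step with new steps when an observation invalidates it.
        i = self.steps.index(old_step)
        self.steps[i:i + 1] = replacements


plan = Plan(["inspect repo", "edit auth.py", "run tests"])
plan.status["inspect repo"] = "done"
plan.revise("edit auth.py", ["locate auth module", "edit auth module"])
upcoming = plan.next_step()
```

Because the plan is ordinary data, it can be logged, shown to a user, or edited between steps, which is what makes it a working hypothesis and not a fixed script.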

Replanning

Long-horizon tasks rarely follow the first plan exactly.

Replanning occurs when:

Trigger	Example
Observation contradicts assumption	File name differs from expected
Tool call fails	API returns an error
Task becomes underspecified	Multiple valid interpretations appear
New constraint appears	User adds a deadline
Partial result is poor	Test failure exposes a bug

Replanning needs no new formalism. The agent samples from the same policy, conditioned on the updated state:

$$ a_t \sim \pi_\theta(a \mid s_t), $$

where the current state $s_t$ includes the latest observations. The agent does not choose actions from the initial prompt alone.

Tools

Tools allow an agent to affect the world or inspect external state.

Common tool types include:

Tool	Purpose
Search	Retrieve external information
Code execution	Run programs and tests
File editing	Modify project state
Database query	Read structured records
Browser	Inspect web pages
Calendar or email	Operate personal workflows
Robot controller	Move in physical space

A tool call usually has a schema:

{
    "name": "search",
    "arguments": {
        "query": "PyTorch distributed data parallel tutorial"
    }
}

Schemas constrain actions. They make tool use easier to validate and safer to execute.
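Validation against a schema can be sketched in a few lines of plain Python. The registry format below (tool name mapped to required argument names and types) is an assumption made for illustration; production systems often use a fuller schema language.

```python
# Hypothetical schema registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "search": {"query": str},
}


def validate_call(call: dict) -> list:
    """Return a list of problems; an empty list means the call is valid."""
    schema = TOOL_SCHEMAS.get(call.get("name"))
    if schema is None:
        return [f"unknown tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    problems = []
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    for key in args:
        if key not in schema:
            problems.append(f"unexpected argument: {key}")
    return problems


good = {"name": "search", "arguments": {"query": "PyTorch distributed data parallel tutorial"}}
bad = {"name": "search", "arguments": {"q": "PyTorch"}}
good_problems = validate_call(good)
bad_problems = validate_call(bad)
```

Returning the list of problems, instead of raising, lets the loop pass them back to the model as an observation so it can correct the call.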

Tool Selection

Tool selection is a decision problem. The agent must choose whether to answer directly, retrieve information, run code, ask for clarification, or stop.

A weak agent may overuse tools or underuse them. Both errors matter.

Error	Consequence
Tool overuse	Slow, noisy, expensive
Tool underuse	Stale or unsupported answers
Wrong tool	Irrelevant observation
Wrong arguments	Failed or misleading result

Tool selection improves when the system has clear action descriptions, examples, and feedback from tool results.

Memory

Long-horizon agents need memory because the full history may exceed the context window.

Memory can be divided into several kinds.

Memory type	Description
Working memory	Current task state
Episodic memory	Previous events and actions
Semantic memory	Stable facts and knowledge
Procedural memory	Reusable methods and policies
External memory	Documents, databases, vector stores

Working memory is usually included directly in the prompt. External memory is retrieved when relevant.

A memory write should be selective. Storing everything creates noise. Storing too little causes forgetting.
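Selective writes and relevance-based retrieval can be sketched as follows. This is a toy: the `NoteMemory` class is hypothetical, the importance score is assumed to be supplied by the caller, and word overlap stands in for the embedding similarity a vector store would use.

```python
class NoteMemory:
    """Hypothetical external memory with selective writes and keyword retrieval."""

    def __init__(self, min_importance=0.5):
        self.min_importance = min_importance
        self.notes = []

    def write(self, text, importance):
        # Selective write: skip low-importance notes to limit noise.
        if importance >= self.min_importance:
            self.notes.append(text)
            return True
        return False

    def retrieve(self, query, k=2):
        # Rank notes by the number of words they share with the query.
        words = set(query.lower().split())
        scored = [(len(words & set(n.lower().split())), n) for n in self.notes]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [n for _, n in scored[:k]]


memory = NoteMemory()
memory.write("user requires python 3.10 compatibility", importance=0.9)
memory.write("listed directory contents", importance=0.1)
memory.write("tests live in the tests folder", importance=0.6)
hits = memory.retrieve("which python version is required")
```

The threshold controls the trade-off named above: raising it reduces noise, lowering it reduces forgetting.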

Reflection and Self-Evaluation

Many agent systems include a reflection step. The agent reviews its own intermediate result and decides whether it is sufficient.

Example checks:

Task	Self-evaluation question
Coding	Did tests pass?
Research	Are claims supported by sources?
Planning	Are constraints satisfied?
Math	Does substitution verify the answer?
Writing	Does the output match the requested style?

A reflection step may produce:

The current answer lacks source citations. Retrieve primary sources before finalizing.

Reflection is useful only when connected to action. A critique that does not change behavior adds cost without improving the result.
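One way to keep reflection tied to action is to have the critic return actions, not free-form commentary. The sketch below assumes a hypothetical `[source: ...]` citation marker and a hypothetical reviser function; neither comes from a real system.

```python
def reflect(answer: str) -> list:
    """Hypothetical critic: return follow-up actions, or [] if the answer passes."""
    actions = []
    if "[source:" not in answer:
        actions.append("retrieve_sources")
    return actions


def finalize(answer: str, revise, max_rounds=2) -> str:
    # Each critique maps to an action; stop as soon as no critique remains.
    for _ in range(max_rounds):
        actions = reflect(answer)
        if not actions:
            break
        answer = revise(answer, actions)
    return answer


# Hypothetical reviser that appends a citation when asked to retrieve sources.
def add_citation(answer, actions):
    if "retrieve_sources" in actions:
        return answer + " [source: primary-doc]"
    return answer


final = finalize("The API is rate limited.", add_citation)
```

The round limit bounds the cost of reflection when the reviser cannot satisfy the critic.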

Verification

Verification checks whether the agent’s output satisfies external criteria.

Domain	Verification method
Code	Unit tests, type checks, linters
Math	Proof checking, substitution
Retrieval	Source citation and quote matching
Data analysis	Recomputed statistics
Tool use	API response validation

Verification separates plausible generation from reliable execution.

For code agents, the test suite is often more valuable than model self-confidence. For research agents, citations and source inspection are more valuable than fluent prose.
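Two of the methods in the table, substitution and quote matching, are simple enough to sketch directly. The function names are hypothetical, and the quote check assumes that case and whitespace differences should be ignored.

```python
def verify_root(coeffs, x, tol=1e-9):
    """Check by substitution that x is a root of a polynomial.

    coeffs are ordered from the highest-degree term to the constant term.
    """
    value = 0.0
    for c in coeffs:
        value = value * x + c  # Horner's rule
    return abs(value) <= tol


def _normalize(text):
    # Lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def verify_quote(quote, source_text):
    """Check that a quoted passage appears in the source text."""
    return _normalize(quote) in _normalize(source_text)


# x^2 - 5x + 6 has roots 2 and 3; 4 is a wrong answer.
root_ok = verify_root([1, -5, 6], 3)
root_bad = verify_root([1, -5, 6], 4)
quote_ok = verify_quote("Schemas constrain  actions", "schemas constrain actions. They help.")
```

Both checks are independent of the model that produced the answer, which is the property that distinguishes verification from self-evaluation.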

Credit Assignment

Long-horizon tasks make learning difficult because success or failure may depend on actions taken many steps earlier.

If an agent fails at step 40, the cause may be:

a bad assumption at step 3
poor retrieval at step 9
a wrong edit at step 18
missing validation at step 30

This is a credit assignment problem.

In reinforcement learning terms, the return depends on a trajectory:

$$ \tau = (s_0,a_0,r_0,s_1,a_1,r_1,\ldots,s_T). $$

The objective is to find parameters $\theta$ that maximize the expected return $R(\tau)$ of trajectories sampled from the policy:

$$ \max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]. $$

Long trajectories make estimates of this objective and its gradient high variance, because a single return must account for many decisions. Practical agent training often uses shorter supervised traces, preference data, tool-use demonstrations, or process rewards.
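The difficulty can be seen in a small worked example. A standard device in reinforcement learning is the reward-to-go, which credits each step only with the rewards that follow it. The sketch below computes it for a hypothetical five-step episode with one terminal reward; the discount factor `gamma` is an added assumption and does not appear in the objective above.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum over k >= t of gamma^(k - t) * r_k, for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A 5-step episode with a single sparse reward at the end.
sparse = [0.0, 0.0, 0.0, 0.0, 1.0]
undiscounted = rewards_to_go(sparse)            # [1.0, 1.0, 1.0, 1.0, 1.0]
discounted = rewards_to_go(sparse, gamma=0.5)   # [0.0625, 0.125, 0.25, 0.5, 1.0]
```

With a single terminal reward and no discounting, every step receives identical credit, so the signal cannot distinguish a bad assumption at an early step from a sound action at a late one.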

Process Supervision

Outcome supervision judges only the final answer. Process supervision gives feedback on intermediate steps.

For long-horizon agents, process supervision is often more informative.

Supervision type	Signal
Outcome supervision	Final task success
Process supervision	Quality of steps
Tool supervision	Correct tool call
Verification supervision	Test or checker result

A coding agent can learn from traces where each step is labeled as useful or harmful. A research agent can learn whether a citation actually supports a claim.
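The difference between the two signals can be made concrete with a labeled trace. The trace and its labels below are invented for illustration; in practice the labels might come from human annotators or an automated checker.

```python
# Hypothetical labeled trace: each step is marked +1 (useful) or -1 (harmful).
trace = [
    {"step": "read failing test", "label": 1},
    {"step": "edit unrelated file", "label": -1},
    {"step": "fix off-by-one bug", "label": 1},
    {"step": "run tests", "label": 1},
]
task_succeeded = True


def outcome_signal(trace, success):
    # Outcome supervision: every step inherits the final result.
    return [1 if success else -1 for _ in trace]


def process_signal(trace):
    # Process supervision: each step carries its own label.
    return [s["label"] for s in trace]


outcome = outcome_signal(trace, task_succeeded)  # [1, 1, 1, 1]
process = process_signal(trace)                  # [1, -1, 1, 1]
```

Under outcome supervision the harmful edit is rewarded because the task eventually succeeded; the process signal singles it out.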

Hierarchical Agents

A long task can be decomposed hierarchically.

A high-level planner chooses subgoals. Low-level workers execute them.

$$ \text{Goal} \rightarrow \text{Subgoal} \rightarrow \text{Action}. $$

For example:

Level	Coding agent example
High-level	Implement authentication
Mid-level	Add middleware
Low-level	Edit `auth.py`

Hierarchical control reduces the complexity of each decision: each layer operates at its own temporal scale.
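The three levels in the table can be sketched as two lookup steps. The tables below are hard-coded stand-ins for what a planner model would produce, and `routes.py` and the "add login route" subgoal are hypothetical additions to the example.

```python
# Hypothetical decomposition tables; a real system would query a model instead.
SUBGOALS = {"implement authentication": ["add middleware", "add login route"]}
ACTIONS = {
    "add middleware": ["edit auth.py"],
    "add login route": ["edit routes.py", "run tests"],
}


def expand(goal):
    """Flatten goal -> subgoals -> actions into ordered (subgoal, action) pairs."""
    steps = []
    for subgoal in SUBGOALS.get(goal, []):
        for action in ACTIONS.get(subgoal, []):
            steps.append((subgoal, action))
    return steps


steps = expand("implement authentication")
```

Keeping the subgoal alongside each action lets a low-level worker report failure against a specific subgoal, so the planner can revise only that branch.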

Multi-Agent Systems

Some systems use multiple agents with distinct roles.

Agent role	Function
Planner	Break down the task
Researcher	Gather evidence
Coder	Modify implementation
Critic	Review output
Executor	Run tests or tools

Multi-agent systems can improve coverage, but they introduce coordination costs. Agents may duplicate work, disagree, or amplify errors.

A multi-agent design should have clear responsibilities and a final arbitration mechanism.
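A final arbitration mechanism can be as simple as a majority vote with a designated tie-breaker. The sketch below is one possible rule, not a recommendation; the role names follow the table above and the `arbitrate` function is hypothetical.

```python
from collections import Counter


def arbitrate(proposals, arbiter="critic"):
    """Pick the majority proposal; on a tie, defer to the arbiter's proposal."""
    counts = Counter(proposals.values())
    (top, top_n), *rest = counts.most_common()
    if rest and rest[0][1] == top_n:
        return proposals[arbiter]
    return top


decision = arbitrate({"planner": "patch A", "coder": "patch B", "critic": "patch B"})
tie = arbitrate({"planner": "patch A", "critic": "patch B"})
```

This sketch assumes the arbiter always submits a proposal; a fuller design would also handle a missing arbiter and proposals that differ only superficially.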

Agent Environments

An agent environment defines what actions are possible and what observations are returned.

Examples:

Environment	Actions
Shell	Run commands
Browser	Open, click, search
Codebase	Read, patch, test
Game	Move, inspect, interact
Robot	Sense, move, grasp

The environment determines the agent’s effective capabilities.

A language model without tools can describe actions. A language model with tools can execute actions.

Safety Constraints

Long-horizon agents require stricter safety controls than single-turn systems because they can act repeatedly.

Important constraints include:

Constraint	Purpose
Permission boundaries	Prevent unauthorized actions
Tool allowlists	Limit available operations
Budget limits	Bound cost and time
Human approval	Gate sensitive actions
Sandboxing	Contain execution
Logging	Support auditability

The longer the horizon, the more opportunities exist for compounding errors.
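Several of these constraints can be combined in a single gate that every action passes through before execution. The `ActionGuard` class and the tool names below are hypothetical, and the approval flag stands in for a real human-in-the-loop step.

```python
class ActionGuard:
    """Hypothetical gate applying an allowlist, a call budget, and human approval."""

    def __init__(self, allowed, needs_approval, max_calls):
        self.allowed = set(allowed)
        self.needs_approval = set(needs_approval)
        self.remaining = max_calls
        self.log = []  # audit trail of (tool, decision)

    def check(self, tool, approved=False):
        if self.remaining <= 0:
            decision = "deny: budget exhausted"
        elif tool not in self.allowed:
            decision = "deny: not allowlisted"
        elif tool in self.needs_approval and not approved:
            decision = "deny: approval required"
        else:
            decision = "allow"
            self.remaining -= 1
        self.log.append((tool, decision))
        return decision == "allow"


guard = ActionGuard(allowed=["read_file", "send_email"],
                    needs_approval=["send_email"], max_calls=2)
results = [
    guard.check("read_file"),                  # allowed
    guard.check("delete_repo"),                # not on the allowlist
    guard.check("send_email"),                 # needs approval
    guard.check("send_email", approved=True),  # allowed, budget now spent
    guard.check("read_file"),                  # budget exhausted
]
```

Denied attempts are logged as well as allowed ones, since repeated denials are themselves a useful signal when auditing a run.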

Failure Modes

Long-horizon agents fail in characteristic ways.

Failure mode	Description
Goal drift	Agent gradually departs from the user’s objective
Looping	Agent repeats similar actions
Premature stopping	Agent finishes before verification
Tool hallucination	Agent assumes tool results that did not occur
Context loss	Important constraints disappear
Overplanning	Agent spends effort planning instead of acting
Error accumulation	Small mistakes compound

The best practical defense is state discipline: preserve constraints, record observations, verify outputs, and stop when the objective is satisfied.
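Some of these failure modes can be detected mechanically from the recorded state. As one example, the sketch below flags looping when the most recent actions are identical; the function name, the window size, and the action strings are assumptions, and it would miss loops that alternate between several actions.

```python
def is_looping(actions, window=3):
    """Return True if the last `window` actions are all identical."""
    if len(actions) < window:
        return False
    recent = actions[-window:]
    return all(a == recent[0] for a in recent)


history = ["search:auth bug", "open:auth.py", "run_tests", "run_tests", "run_tests"]
looping = is_looping(history)
```

A detector like this only helps if the loop reacts to it, for example by forcing a replan or stopping the run.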

PyTorch View of Agents

An agent is not usually a single PyTorch module. It is a system around a model.

Still, the policy model can be represented abstractly:

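import torch

class AgentPolicy(torch.nn.Module):
    # Constructor arguments are illustrative assumptions, not a fixed interface.
    def __init__(self, backbone, hidden_dim, num_actions):
        super().__init__()
        self.backbone = backbone  # maps state tokens to (batch, seq, hidden_dim)
        self.action_head = torch.nn.Linear(hidden_dim, num_actions)

    def forward(self, state_tokens):
        hidden = self.backbone(state_tokens)
        action_logits = self.action_head(hidden[:, -1])  # score actions from the last position
        return action_logits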

A tool-using system wraps this model in an execution loop:

state = encode_task(user_goal)

for _ in range(max_steps):
    action = decode_action(policy(state))

    # Stop before executing anything: a final action is an answer, not a tool call.
    if action.is_final:
        break

    observation = run_tool(action)
    state = append_observation(state, action, observation)

In practice, modern agents often use pretrained foundation models rather than training an agent policy from scratch. The important concept is the separation between model prediction and environment interaction.

Evaluation

Long-horizon agents are evaluated by task success rather than next-token accuracy.

Useful metrics include:

Metric	Meaning
Success rate	Fraction of tasks completed successfully
Step count	Efficiency
Tool error rate	Quality of tool use
Verification pass rate	Objective correctness
Cost	Tokens, compute, API calls
Human intervention rate	Need for assistance

A good benchmark should include tasks with hidden tests or independent verification. Otherwise, the agent may produce plausible but incorrect outputs.
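Several of the metrics in the table can be computed directly from per-episode records. The records below are invented for illustration, and the field names are hypothetical.

```python
# Hypothetical episode records from an evaluation run.
episodes = [
    {"success": True,  "steps": 12, "tool_calls": 10, "tool_errors": 1, "human_help": False},
    {"success": False, "steps": 30, "tool_calls": 25, "tool_errors": 5, "human_help": True},
    {"success": True,  "steps": 8,  "tool_calls": 5,  "tool_errors": 0, "human_help": False},
    {"success": True,  "steps": 10, "tool_calls": 10, "tool_errors": 4, "human_help": True},
]


def summarize(episodes):
    """Aggregate per-episode records into benchmark-level metrics."""
    n = len(episodes)
    calls = sum(e["tool_calls"] for e in episodes)
    return {
        "success_rate": sum(e["success"] for e in episodes) / n,
        "mean_steps": sum(e["steps"] for e in episodes) / n,
        "tool_error_rate": sum(e["tool_errors"] for e in episodes) / calls,
        "human_intervention_rate": sum(e["human_help"] for e in episodes) / n,
    }


metrics = summarize(episodes)
```

Here the tool error rate is pooled over all calls, not averaged per episode, so long episodes weigh more; either convention is defensible as long as it is stated.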

Summary

Long-horizon agents extend foundation models into goal-directed systems. They maintain state, plan, use tools, update memory, verify results, and revise behavior over many steps.

The main theoretical ideas are policy learning, state representation, planning, tool use, memory, credit assignment, and process supervision. The main engineering problems are context management, tool reliability, verification, safety boundaries, and cost control.

In PyTorch terms, the neural model supplies a policy over actions. The full agent is the loop that connects this policy to tools, memory, observations, and external verification.