Long-Horizon Agents

A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives.

A single model call answers one prompt. An agent loop extends this into a process:

$$ \text{observe} \rightarrow \text{plan} \rightarrow \text{act} \rightarrow \text{observe} \rightarrow \cdots $$

The word “long-horizon” means the task cannot be solved reliably in one step. The agent must preserve intent across time.

Examples include:

| Task | Why it is long-horizon |
|---|---|
| Building a software feature | Requires reading code, editing files, testing, debugging |
| Researching a topic | Requires search, source selection, synthesis, citation |
| Planning a trip | Requires constraints, availability, routes, booking options |
| Running an experiment | Requires setup, execution, measurement, analysis |
| Robot manipulation | Requires perception, motion, feedback, correction |

The central problem is control. The agent must decide what to do next, not merely predict the next token.

Agent State

An agent maintains state across steps. State includes the task objective, observations, tool results, partial outputs, memory, and constraints.

A minimal agent state can be written as:

$$ s_t = (g, o_{\leq t}, a_{<t}, m_t), $$

where:

| Symbol | Meaning |
|---|---|
| $g$ | Goal |
| $o_{\leq t}$ | Observations up to time $t$ |
| $a_{<t}$ | Previous actions |
| $m_t$ | Memory at time $t$ |

The next action is selected from this state:

$$ a_t \sim \pi_\theta(a \mid s_t). $$

Here $\pi_\theta$ is the model’s policy. In an LLM agent, the policy is usually implemented by a language model prompted with the current state and available tools.
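The state tuple above can be sketched as a small data structure. This is a minimal illustration, not a standard API: the class name `AgentState`, its field names, and the `to_prompt` method are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentState:
    """Illustrative container for s_t = (g, o_{<=t}, a_{<t}, m_t)."""
    goal: str                                               # g
    observations: list[str] = field(default_factory=list)   # o_{<=t}
    actions: list[str] = field(default_factory=list)        # a_{<t}
    memory: dict[str, Any] = field(default_factory=dict)    # m_t

    def to_prompt(self) -> str:
        """Serialize the state into text that a language-model policy could read."""
        lines = [f"Goal: {self.goal}"]
        for i, obs in enumerate(self.observations):
            lines.append(f"Observation {i}: {obs}")
        for i, act in enumerate(self.actions):
            lines.append(f"Action {i}: {act}")
        for key, value in self.memory.items():
            lines.append(f"Memory[{key}]: {value}")
        return "\n".join(lines)


# Hypothetical usage: the goal and observation text are made up for illustration.
state = AgentState(goal="Fix the failing test")
state.observations.append("tests/test_auth.py fails with KeyError")
state.memory["suspect_file"] = "auth.py"
prompt = state.to_prompt()
```

Serializing the state to text is one plausible way to condition a language model on $s_t$; real systems typically use structured chat messages and tool-result records instead of a flat string.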

Agent Loop

A basic agent loop has four stages.

| Stage | Role |
|---|---|
| Observe | Read current environment state |
| Decide | Select next action |
| Act | Execute tool call or produce output |
| Update | Store result and revise state |

In pseudocode:

state = init_state(goal)

for step in range(max_steps):
    action = policy(state)

    if action.type == "final":
        return action.output

    observation = environment.step(action)
    state = update_state(state, action, observation)

This loop is simple. Real systems add validation, tool schemas, error handling, budgets, retries, and safety checks.
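To make the loop concrete, here is a runnable sketch with a toy environment and a rule-based stand-in for the policy. Everything named here (`Action`, `CounterEnvironment`, `run_agent`, the dictionary-based state) is a hypothetical construction for illustration. A real agent would call a language model where `policy` appears.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    type: str          # "tool" or "final"
    output: str = ""   # final answer, used only when type == "final"
    argument: int = 0  # tool argument, used only when type == "tool"


class CounterEnvironment:
    """Toy environment: each tool call adds the argument to a running total."""

    def __init__(self) -> None:
        self.total = 0

    def step(self, action: Action) -> int:
        self.total += action.argument
        return self.total


def policy(state: dict) -> Action:
    """Stand-in for a model: add 1 until the observed total reaches the goal."""
    if state["last_observation"] >= state["goal"]:
        return Action(type="final", output=f"reached {state['last_observation']}")
    return Action(type="tool", argument=1)


def run_agent(goal: int, max_steps: int) -> Optional[str]:
    environment = CounterEnvironment()
    state = {"goal": goal, "last_observation": 0, "history": []}
    for _ in range(max_steps):
        action = policy(state)
        if action.type == "final":
            return action.output
        observation = environment.step(action)
        # Update: record the step and expose the newest observation to the policy.
        state["history"].append((action, observation))
        state["last_observation"] = observation
    return None  # step budget exhausted without a final answer


result = run_agent(goal=3, max_steps=10)
```

Returning `None` when the budget runs out makes the failure explicit, so the caller can distinguish "no answer" from a final answer.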

Planning

Planning decomposes a goal into steps.

For example, a coding task may become:

  1. Inspect repository structure.
  2. Locate relevant files.
  3. Read interfaces.
  4. Edit implementation.
  5. Run tests.
  6. Fix failures.
  7. Summarize changes.

The plan gives structure, but it should not be rigid. Tool results may reveal that the original plan was wrong.

A good agent uses a plan as a working hypothesis.

Planning may be explicit (stored as text) or implicit (represented inside the hidden state of the model). Explicit plans are easier to inspect and revise.

Replanning

Long-horizon tasks rarely follow the first plan exactly.

Replanning occurs when:

| Trigger | Example |
|---|---|
| Observation contradicts assumption | File name differs from expected |
| Tool call fails | API returns an error |
| Task becomes underspecified | Multiple valid interpretations appear |
| New constraint appears | User adds a deadline |
| Partial result is poor | Test failure exposes a bug |

Replanning can be represented as:

$$ a_t \sim \pi_\theta(a \mid s_t), $$

where the current state $s_t$ includes the latest observations. The agent does not choose actions from the initial prompt alone.
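The following sketch shows replanning as a function of the latest observation. It is a rule-based stand-in, assuming a plan stored as a list of text steps. The trigger strings and repair steps are hypothetical examples, and in a real agent the model itself would usually produce the revised plan.

```python
def replan(plan: list[str], observation: str) -> list[str]:
    """Revise the remaining plan when an observation contradicts it."""
    if "file not found" in observation:
        # An assumption was wrong: locate the file before continuing.
        return ["Search repository for the file"] + plan
    if "test failed" in observation:
        # A partial result is poor: debug before moving on.
        return ["Read failure output", "Fix failing code", "Run tests"] + plan
    return plan  # the observation is consistent with the plan, so keep it


plan = ["Edit auth.py", "Run tests", "Summarize changes"]
plan_after_error = replan(plan, "error: file not found: auth.py")
plan_unchanged = replan(plan, "ok")
```

The original steps stay at the end of the revised plan, so the agent repairs the broken assumption first and then resumes the remaining work.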

Tools

Tools allow an agent to affect the world or inspect external state.

Common tool types include:

| Tool | Purpose |
|---|---|
| Search | Retrieve external information |
| Code execution | Run programs and tests |
| File editing | Modify project state |
| Database query | Read structured records |
| Browser | Inspect web pages |
| Calendar or email | Operate personal workflows |
| Robot controller | Move in physical space |

A tool call usually has a schema:

{
    "name": "search",
    "arguments": {
        "query": "PyTorch distributed data parallel tutorial"
    }
}

Schemas constrain actions. They make tool use easier to validate and safer to execute.
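As an illustration of how a schema supports validation, the sketch below checks a proposed tool call before it is executed. The registry format (`TOOL_SCHEMAS`), the `run_tests` tool, and the function `validate_tool_call` are assumptions for this example. Production systems often use a schema language such as JSON Schema instead of a hand-written check.

```python
# Hypothetical registry: tool name -> required argument names and their types.
TOOL_SCHEMAS = {
    "search": {"query": str},
    "run_tests": {"path": str, "timeout_seconds": int},
}


def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    name = call.get("name")
    if name not in TOOL_SCHEMAS:
        return [f"unknown tool: {name!r}"]
    schema = TOOL_SCHEMAS[name]
    arguments = call.get("arguments", {})
    problems = []
    for arg_name, arg_type in schema.items():
        if arg_name not in arguments:
            problems.append(f"missing argument: {arg_name}")
        elif not isinstance(arguments[arg_name], arg_type):
            problems.append(f"wrong type for {arg_name}: expected {arg_type.__name__}")
    for arg_name in arguments:
        if arg_name not in schema:
            problems.append(f"unexpected argument: {arg_name}")
    return problems


good_call = {"name": "search", "arguments": {"query": "PyTorch distributed data parallel tutorial"}}
bad_call = {"name": "search", "arguments": {"q": "PyTorch"}}
good_problems = validate_tool_call(good_call)
bad_problems = validate_tool_call(bad_call)
```

Returning the list of problems, rather than a boolean, lets the agent loop feed the errors back to the model as an observation so it can correct the call.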

Tool Selection

Tool selection is a decision problem. The agent must choose whether to answer directly, retrieve information, run code, ask for clarification, or stop.

A weak agent may overuse tools or underuse them. Both errors matter.

| Error | Consequence |
|---|---|
| Tool overuse | Slow, noisy, expensive |
| Tool underuse | Stale or unsupported answers |
| Wrong tool | Irrelevant observation |
| Wrong arguments | Failed or misleading result |

Tool selection improves when the system has clear action descriptions, examples, and feedback from tool results.

Memory

Long-horizon agents need memory because the full history may exceed the context window.

Memory can be divided into several kinds.

| Memory type | Description |
|---|---|
| Working memory | Current task state |
| Episodic memory | Previous events and actions |
| Semantic memory | Stable facts and knowledge |
| Procedural memory | Reusable methods and policies |
| External memory | Documents, databases, vector stores |

Working memory is usually included directly in the prompt. External memory is retrieved when relevant.

A memory write should be selective. Storing everything creates noise. Storing too little causes forgetting.
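A selective write rule and relevance-based retrieval can be sketched as follows. This is a toy model under stated assumptions: the `importance` score is supplied by the caller (in practice it might come from a model or heuristic), and retrieval uses simple word overlap where a real system would more likely use embeddings. The class `ExternalMemory` is hypothetical.

```python
class ExternalMemory:
    """Toy external memory with a selective write rule and keyword retrieval."""

    def __init__(self, min_importance: float = 0.5) -> None:
        self.min_importance = min_importance
        self.entries: list[tuple[str, float]] = []

    def write(self, text: str, importance: float) -> bool:
        """Store the entry only if it is important enough and not a duplicate."""
        if importance < self.min_importance:
            return False
        if any(text == stored for stored, _ in self.entries):
            return False
        self.entries.append((text, importance))
        return True

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return up to k entries that share words with the query, most overlap first."""
        query_words = set(query.lower().split())
        scored = []
        for text, importance in self.entries:
            overlap = len(query_words & set(text.lower().split()))
            if overlap > 0:
                scored.append((overlap, importance, text))
        scored.sort(reverse=True)
        return [text for _, _, text in scored[:k]]


memory = ExternalMemory()
kept = memory.write("User requires Python 3.10 compatibility", importance=0.9)
dropped = memory.write("Listed directory contents", importance=0.1)
memory.write("Tests live in the tests directory", importance=0.6)
hits = memory.retrieve("which python version")
```

The threshold trades noise against forgetting: raising `min_importance` stores less and risks dropping a constraint, while lowering it stores more and makes retrieval noisier.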

Reflection and Self-Evaluation

Many agent systems include a reflection step. The agent reviews its own intermediate result and decides whether it is sufficient.

Example checks:

| Task | Self-evaluation question |
|---|---|
| Coding | Did tests pass? |
| Research | Are claims supported by sources? |
| Planning | Are constraints satisfied? |
| Math | Does substitution verify the answer? |
| Writing | Does the output match the requested style? |

A reflection step may produce:

The current answer lacks source citations. Retrieve primary sources before finalizing.

Reflection is useful only when connected to action. A critique that does not change behavior adds cost without improving the result.

Verification

Verification checks whether the agent’s output satisfies external criteria.

| Domain | Verification method |
|---|---|
| Code | Unit tests, type checks, linters |
| Math | Proof checking, substitution |
| Retrieval | Source citation and quote matching |
| Data analysis | Recomputed statistics |
| Tool use | API response validation |

Verification separates plausible generation from reliable execution.

For code agents, the test suite is often more valuable than model self-confidence. For research agents, citations and source inspection are more valuable than fluent prose.
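As a small example of an external check, the sketch below verifies a claimed polynomial root by substitution, independent of how the answer was generated. The function name `verify_root`, the coefficient ordering, and the tolerance are choices made for this illustration.

```python
def verify_root(coefficients: list[float], candidate: float, tolerance: float = 1e-9) -> bool:
    """Check by substitution that candidate is a root of the polynomial.

    Coefficients are ordered from the highest degree down to the constant term.
    """
    value = 0.0
    for c in coefficients:
        value = value * candidate + c  # Horner's rule evaluates the polynomial
    return abs(value) <= tolerance


# x^2 - 5x + 6 has roots 2 and 3.
claimed_correct = verify_root([1.0, -5.0, 6.0], 2.0)
claimed_wrong = verify_root([1.0, -5.0, 6.0], 2.5)
```

The checker never inspects the model's reasoning. It accepts or rejects the output against a criterion the model cannot influence, which is what distinguishes verification from self-evaluation.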

Credit Assignment

Long-horizon tasks make learning difficult because success or failure may depend on actions taken many steps earlier.

If an agent fails at step 40, the cause may be:

  • a bad assumption at step 3
  • poor retrieval at step 9
  • a wrong edit at step 18
  • missing validation at step 30

This is a credit assignment problem.

In reinforcement learning terms, the return depends on a trajectory:

$$ \tau = (s_0,a_0,r_0,s_1,a_1,r_1,\ldots,s_T). $$

The objective is to improve the policy:

$$ \max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]. $$

Long trajectories make estimates of this objective, and of its gradient, high variance. Practical agent training often uses shorter supervised traces, preference data, tool-use demonstrations, or process rewards.
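The difficulty can be seen by computing the return credited to each step of a trajectory whose only reward arrives at the end. The sketch below assumes a discount factor `gamma`, which does not appear in the objective above and is introduced here only for illustration. The function name `returns_to_go` is likewise an assumption.

```python
def returns_to_go(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for each step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A four-step episode that fails only at the end (sparse outcome reward).
sparse_rewards = [0.0, 0.0, 0.0, -1.0]
undiscounted = returns_to_go(sparse_rewards)
discounted = returns_to_go(sparse_rewards, gamma=0.5)
```

With `gamma=1.0` every step receives the same return, so the signal cannot tell a bad early assumption from a harmless later action. Discounting only shifts blame toward the most recent steps, which is wrong when the real cause was early.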

Process Supervision

Outcome supervision judges only the final answer. Process supervision gives feedback on intermediate steps.

For long-horizon agents, process supervision is often more informative.

| Supervision type | Signal |
|---|---|
| Outcome supervision | Final task success |
| Process supervision | Quality of steps |
| Tool supervision | Correct tool call |
| Verification supervision | Test or checker result |

A coding agent can learn from traces where each step is labeled as useful or harmful. A research agent can learn whether a citation actually supports a claim.

Hierarchical Agents

A long task can be decomposed hierarchically.

A high-level planner chooses subgoals. Low-level workers execute them.

$$ \text{Goal} \rightarrow \text{Subgoal} \rightarrow \text{Action}. $$

For example:

| Level | Coding agent example |
|---|---|
| High-level | Implement authentication |
| Mid-level | Add middleware |
| Low-level | Edit auth.py |

Hierarchical control reduces complexity. Each layer operates at a different temporal scale.
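The goal, subgoal, action decomposition can be sketched with two lookup tables. This is a deliberately static illustration: the tables `SUBGOALS` and `ACTIONS`, the "Add login route" subgoal, and the file `routes.py` are hypothetical, and a real planner would generate each level with a model instead of reading it from a dictionary.

```python
# Hypothetical decomposition tables standing in for a learned planner.
SUBGOALS = {
    "Implement authentication": ["Add middleware", "Add login route"],
}
ACTIONS = {
    "Add middleware": ["Edit auth.py", "Run tests"],
    "Add login route": ["Edit routes.py", "Run tests"],
}


def expand(goal: str) -> list[tuple[str, str]]:
    """Expand a goal into (subgoal, action) pairs in execution order."""
    steps = []
    for subgoal in SUBGOALS.get(goal, []):
        for action in ACTIONS.get(subgoal, []):
            steps.append((subgoal, action))
    return steps


steps = expand("Implement authentication")
```

Keeping the subgoal attached to each action gives the low-level worker the context it needs, and lets the high-level planner replace one subgoal without touching the others.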

Multi-Agent Systems

Some systems use multiple agents with distinct roles.

| Agent role | Function |
|---|---|
| Planner | Break down the task |
| Researcher | Gather evidence |
| Coder | Modify implementation |
| Critic | Review output |
| Executor | Run tests or tools |

Multi-agent systems can improve coverage, but they introduce coordination costs. Agents may duplicate work, disagree, or amplify errors.

A multi-agent design should have clear responsibilities and a final arbitration mechanism.

Agent Environments

An agent environment defines what actions are possible and what observations are returned.

Examples:

| Environment | Actions |
|---|---|
| Shell | Run commands |
| Browser | Open, click, search |
| Codebase | Read, patch, test |
| Game | Move, inspect, interact |
| Robot | Sense, move, grasp |

The environment determines the agent’s effective capabilities.

A language model without tools can describe actions. A language model with tools can execute actions.

Safety Constraints

Long-horizon agents require stricter safety controls than single-turn systems because they can act repeatedly.

Important constraints include:

| Constraint | Purpose |
|---|---|
| Permission boundaries | Prevent unauthorized actions |
| Tool allowlists | Limit available operations |
| Budget limits | Bound cost and time |
| Human approval | Gate sensitive actions |
| Sandboxing | Contain execution |
| Logging | Support auditability |
Logging Support auditability

The longer the horizon, the more opportunities exist for compounding errors.
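Several of these constraints can be enforced at a single checkpoint that every tool call must pass. The sketch below combines an allowlist, a call budget, a human-approval gate, and an audit log. The class `ActionGuard`, its parameters, and the tool names are assumptions for illustration. Sandboxing is not modeled here because it depends on the execution environment.

```python
class ActionGuard:
    """Checks each proposed tool call against an allowlist, a budget, and an approval rule."""

    def __init__(self, allowed_tools: set[str], sensitive_tools: set[str], max_calls: int) -> None:
        self.allowed_tools = allowed_tools
        self.sensitive_tools = sensitive_tools
        self.max_calls = max_calls
        self.calls_made = 0
        self.log: list[str] = []  # audit trail of every decision, allowed or denied

    def check(self, tool: str, approved_by_human: bool = False) -> bool:
        if tool not in self.allowed_tools:
            decision = "denied: not on allowlist"
        elif self.calls_made >= self.max_calls:
            decision = "denied: budget exhausted"
        elif tool in self.sensitive_tools and not approved_by_human:
            decision = "denied: needs human approval"
        else:
            decision = "allowed"
            self.calls_made += 1  # only allowed calls consume budget
        self.log.append(f"{tool}: {decision}")
        return decision == "allowed"


guard = ActionGuard(
    allowed_tools={"search", "send_email"},
    sensitive_tools={"send_email"},
    max_calls=2,
)
search_ok = guard.check("search")
email_blocked = guard.check("send_email")
```

Denied calls are logged as well as allowed ones, since an audit trail is most useful when it shows what the agent attempted and was prevented from doing.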

Failure Modes

Long-horizon agents fail in characteristic ways.

| Failure mode | Description |
|---|---|
| Goal drift | Agent gradually departs from the user’s objective |
| Looping | Agent repeats similar actions |
| Premature stopping | Agent finishes before verification |
| Tool hallucination | Agent assumes tool results that did not occur |
| Context loss | Important constraints disappear |
| Overplanning | Agent spends effort planning instead of acting |
| Error accumulation | Small mistakes compound |

The best practical defense is state discipline: preserve constraints, record observations, verify outputs, and stop when the objective is satisfied.
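Some of these failures can be detected mechanically from the recorded state. As one example, the sketch below flags looping by counting repeated actions in a recent window. The function `is_looping` and its default thresholds are illustrative assumptions, and exact string matching will miss loops made of similar but not identical actions.

```python
from collections import Counter


def is_looping(actions: list[str], window: int = 6, max_repeats: int = 3) -> bool:
    """Flag a loop when any action occurs max_repeats or more times in the last window actions."""
    recent = actions[-window:]
    counts = Counter(recent)
    return any(count >= max_repeats for count in counts.values())


stuck = is_looping(["read a", "run tests", "read a", "run tests", "read a"])
progressing = is_looping(["read a", "edit b", "run tests"])
```

When the check fires, a reasonable response is to force a replanning step or stop and report, instead of letting the agent spend its remaining budget on the same actions.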

PyTorch View of Agents

An agent is not usually a single PyTorch module. It is a system around a model.

Still, the policy model can be represented abstractly:

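import torch

class AgentPolicy(torch.nn.Module):
    def __init__(self, backbone, action_head):
        super().__init__()
        self.backbone = backbone        # state tokens -> (batch, seq_len, hidden_dim)
        self.action_head = action_head  # hidden vector -> logits over actions
    def forward(self, state_tokens):
        hidden = self.backbone(state_tokens)
        return self.action_head(hidden[:, -1])  # score actions from the last position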

A tool-using system wraps this model in an execution loop:

state = encode_task(user_goal)

for _ in range(max_steps):
    action = decode_action(policy(state))

    # Stop before touching the environment: a final action carries the answer, not a tool call.
    if action.is_final:
        break

    observation = run_tool(action)
    state = append_observation(state, action, observation)

In practice, modern agents often use pretrained foundation models rather than training an agent policy from scratch. The important concept is the separation between model prediction and environment interaction.

Evaluation

Long-horizon agents are evaluated by task success rather than next-token accuracy.

Useful metrics include:

| Metric | Meaning |
|---|---|
| Success rate | Fraction of completed tasks |
| Step count | Efficiency |
| Tool error rate | Quality of tool use |
| Verification pass rate | Objective correctness |
| Cost | Tokens, compute, API calls |
| Human intervention rate | Need for assistance |

A good benchmark should include tasks with hidden tests or independent verification. Otherwise, the agent may produce plausible but incorrect outputs.
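Several of the metrics above can be computed directly from per-episode records. The sketch below assumes a hypothetical `Episode` record whose fields are chosen for this example, and it assumes `succeeded` was determined by an independent checker, not by the agent's own report.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    succeeded: bool           # as judged by an independent verifier
    steps: int
    tool_calls: int
    tool_errors: int
    human_interventions: int


def summarize(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate per-episode records into benchmark-level metrics."""
    if not episodes:
        raise ValueError("at least one episode is required")
    n = len(episodes)
    total_calls = sum(e.tool_calls for e in episodes)
    total_errors = sum(e.tool_errors for e in episodes)
    return {
        "success_rate": sum(e.succeeded for e in episodes) / n,
        "mean_steps": sum(e.steps for e in episodes) / n,
        # Errors per tool call, pooled across episodes; 0.0 if no tools were called.
        "tool_error_rate": total_errors / total_calls if total_calls else 0.0,
        # Fraction of episodes that needed any human help.
        "human_intervention_rate": sum(e.human_interventions > 0 for e in episodes) / n,
    }


metrics = summarize([
    Episode(succeeded=True, steps=10, tool_calls=4, tool_errors=1, human_interventions=0),
    Episode(succeeded=False, steps=20, tool_calls=6, tool_errors=1, human_interventions=1),
])
```

Reporting step count and intervention rate alongside success rate matters because an agent can raise its success rate simply by taking more steps or asking for more help.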

Summary

Long-horizon agents extend foundation models into goal-directed systems. They maintain state, plan, use tools, update memory, verify results, and revise behavior over many steps.

The main theoretical ideas are policy learning, state representation, planning, tool use, memory, credit assignment, and process supervision. The main engineering problems are context management, tool reliability, verification, safety boundaries, and cost control.

In PyTorch terms, the neural model supplies a policy over actions. The full agent is the loop that connects this policy to tools, memory, observations, and external verification.