Over the last two years, the center of gravity in AI discussions has clearly shifted.
At first, people mostly cared about whether a model could answer questions, write code, or summarize a document. Now the question is different: can it actually get work done?
That is why “agent” became a serious engineering topic. People started building skills, workflows, multi-agent systems, memory layers, tool use, and increasingly long prompts that try to hold everything together.
But if models themselves keep getting stronger, then a more fundamental question matters:
In the agents we build today, what becomes long-term product and engineering value, and what is just temporary scaffolding?
Models will keep improving. That part is close to consensus. The real differentiator will not be who writes cleverer prompts. It will be who puts the model inside a system that can do work, accumulate context, and be governed reliably.
My current answer is straightforward:
The agents that last will not be the ones that keep “making the model smarter.” They will be the ones that place the model inside a system that can execute, accumulate, and govern.
This is really a boundary question: what belongs to the model, and what belongs to the system?
1. Start with the mistake: an agent is not a more complicated prompt
Many agent systems drift in the same direction.
The team starts by adding a few rules to make the model steadier. Then a few more steps. Then fallback instructions. Then tool ordering, exception handling, formatting rules, and fixed wording.
What you get at the end looks “smart,” but it is fragile.
The problem is not that prompts do not matter. They do.
The problem is that prompts are being asked to do too many jobs at once:
- define the role
- provide the knowledge
- encode the workflow
- constrain the tools
- enforce the evaluation logic
Prompts are useful for direction, framing, boundaries, and tone. They are much worse as a knowledge layer, process engine, tool policy layer, and reliability framework all at the same time.
Once everything gets pushed into one prompt, the system becomes hard to reason about:
- maintenance gets expensive
- local changes have system-wide side effects
- failure analysis becomes guesswork
That is why many teams become disappointed with “agents” after the first wave of experimentation. The issue is usually not the direction. It is the lack of layering from day one.
Conclusion: From Anthropic’s definition of agents to OpenAI’s breakdown of agent building, the prompt is only one part of an agent, not the system itself. If you push knowledge, workflow, tool policy, and evaluation into the prompt, the system gets more brittle, not more durable. (Sources: Anthropic, Building effective agents, OpenAI, Agent Builder)
2. Draw the line: the first thing stronger models eat is “generic cognition patching”
If models keep improving, the first category they compress is usually the layer of generic cognition patching.
That includes things like:
- rewriting user input into cleaner task statements
- pulling the key points out of noisy information
- producing decent summaries and reports
- doing shallow decomposition and planning
- making acceptable judgments in common cases
Those are exactly the abilities models are supposed to keep improving at.
So if most of an agent’s value still comes from:
- helping the model understand a little more
- rephrasing user intent
- wrapping a thin planning step around the model
- “correcting” the model into basic common sense with prompt tricks
then the shelf life is probably short.
This is not a hard theorem. It is an engineering inference from the direction of model progress.
Recent model releases increasingly emphasize stronger reasoning, coding, multi-step execution, and agentic workflows. That means the more your product is just “generic cognition patching,” the more likely it is to be absorbed by the next model generation.
Conclusion: If an agent’s core value is still “help the model understand better, rewrite, or do shallow planning,” it is more exposed to model progress than to durable product differentiation. That is an engineering inference from the public trajectory of GPT-5 and Claude Opus. (Sources: OpenAI, Introducing GPT-5 for developers, Anthropic, Claude Opus)
3. What does not go away is the system layer
The opposite category is the system layer.
These are the problems that do not disappear just because the model gets smarter:
- where information comes from
- which context is relevant right now
- which tools are available
- which tools are read-only versus destructive
- which actions require approval
- how state is preserved across steps
- how handoffs happen
- how failures are replayed and audited
- how the same class of task gets more reliable over time
These are not intelligence problems. They are system design problems.
And stronger models make them more important, not less important.
Why? Because once models are good enough to be plugged into real environments, the surrounding requirements get stricter:
- permissions matter more
- auditability matters more
- approval boundaries matter more
- traces, replay, and evals matter more
So the real job of an agent system is not to build a larger brain. It is to build everything around the brain that lets it act safely and repeatably in the real world.
Conclusion: In real-world use, information access, permission boundaries, state management, approvals, traces, replay, and evaluation are system problems. They do not vanish just because the model gets better. (Sources: Anthropic, Building effective agents, OpenAI, Agents SDK, OpenAI, Safety best practices for building agents, Anthropic, Demystifying AI agent evals)
4. Break an agent apart and the confusion starts to clear
A useful way to look at an agent system is to split it into at least eight layers:
- Model: understands, reasons, judges, and writes
- Context: supplies what the model should see for the current task
- Knowledge: provides facts, rules, background, and experience
- Skills: capture reusable ways of doing recurring work
- Tools: perform the actual reads, writes, calls, and queries
- Workflow: determines the task sequence
- Orchestration: controls switching, routing, approval, and termination
- Evaluation and feedback: turns one-off success into repeatable reliability
The model is one layer among several. It is important, but it is not the whole system.
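The layer split can be made concrete with a small sketch. This is illustrative only, not any particular framework's API; every name here is made up. The point is that the model is one swappable component among several, and orchestration is what wires them together:

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative layer split: each layer is a separate, swappable component,
# and the model is just one of them.
@dataclass
class AgentSystem:
    model: Callable[[str], str]               # understands, reasons, writes
    context: Callable[[str], str]             # selects what the model sees now
    knowledge: dict[str, str] = field(default_factory=dict)    # facts and rules
    skills: dict[str, Callable] = field(default_factory=dict)  # reusable methods
    tools: dict[str, Callable] = field(default_factory=dict)   # minimal actions

    def run(self, task: str) -> str:
        # Orchestration in miniature: build the context, then ask the model.
        return self.model(self.context(task))

# A stub model makes the wiring visible without any API calls.
system = AgentSystem(
    model=lambda prompt: f"answer for: {prompt}",
    context=lambda task: f"[relevant context] {task}",
)
print(system.run("diagnose slow query"))
```

Swapping in a stronger model changes one field; the context, knowledge, skill, and tool layers keep their value unchanged.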
Once the layers are separated, a lot of design questions become easier:
- should this live in the prompt or in the knowledge layer?
- is this a tool or a skill?
- is this a workflow issue or an orchestration issue?
- should the model decide this, or should the system enforce it?
```mermaid
flowchart TB
    U["User task"] --> G["Goal<br/>What does 'done' mean here?"]
    G --> M["Model<br/>Understand, reason, judge, express"]
    M --> C["Context<br/>Task-relevant information"]
    M --> Memory["Memory<br/>History and intermediate state"]
    M --> K["Knowledge base<br/>Rules, experience, docs"]
    C --> O["Orchestration<br/>Routing, switching, approval, tracking"]
    Memory --> O
    K --> O
    O --> W["Workflow<br/>Task sequence"]
    O --> S["Skills<br/>Reusable method packages"]
    O --> B["Boundaries<br/>Permissions and risk constraints"]
    W --> T["Tools<br/>Logs, DBs, APIs, terminals"]
    S --> T
    B --> T
    T --> E["Real environment<br/>Code, systems, data, users"]
    E --> F["Evaluation and feedback<br/>Replay, regression, improvement"]
```

Conclusion: Once you separate model, context, knowledge, skills, tools, workflow, orchestration, and evaluation, many debates about “what should go in the prompt” versus “what should go in the system” resolve themselves. (Sources: Anthropic, Effective context engineering for AI agents, Anthropic, Building effective agents, OpenAI, Agent Builder)
5. The three pairs people mix up most often
Knowledge is not skill
The simplest distinction is:
- Knowledge answers “what is true?”
- Skill answers “how should this kind of task be done?”
Knowledge is closer to material. Skill is closer to method.
In database troubleshooting:
- “What do TiDB, TiKV, and PD each do?” is knowledge
- “For slow queries, check the slow log, then the plan, then attribute the cause” is skill
Knowledge provides judgment material. Skill organizes the action path.
A tool is not a skill
A useful engineering abstraction is:
- Tools are minimal actions
- Skills are reusable method bundles built around a task type
For example:
- “Fetch the last 30 minutes of logs” is a tool-level action
- “Run performance triage” is a skill-level package
The first is a move. The second is a structured way of working.
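The distinction can be shown in a few lines. The function names below are hypothetical stubs, not a real monitoring API: each tool is one minimal, auditable action, and the skill is the structured method that composes them:

```python
# A tool is a minimal action; a skill is a reusable bundle that composes tools.
# All functions here are hypothetical stubs, not a real monitoring API.

def fetch_slow_log(minutes: int) -> list[str]:
    """Tool: one minimal, auditable read."""
    return [f"slow query observed in last {minutes} min"]  # stubbed data

def fetch_execution_plan(query_id: str) -> str:
    """Tool: another minimal read."""
    return f"plan for {query_id}: full table scan"  # stubbed data

def performance_triage(query_id: str) -> dict:
    """Skill: a structured way of working, built from tools."""
    evidence = {
        "slow_log": fetch_slow_log(30),
        "plan": fetch_execution_plan(query_id),
    }
    # The skill encodes the attribution step too, not just the reads.
    cause = "missing index" if "full table scan" in evidence["plan"] else "unknown"
    return {"evidence": evidence, "suspected_cause": cause}

print(performance_triage("q42")["suspected_cause"])  # → missing index
```

The tools stay stable as the model improves; the skill is where the team's accumulated method lives.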
Workflow exists at two levels
When people say “workflow,” they are often referring to two different things.
One is the task-level workflow:
- receive the issue
- triage
- collect evidence
- test hypotheses
- produce a conclusion
- escalate or request approval if needed
The other is the skill-internal sequence:
- inspect slow logs
- inspect the execution plan
- determine whether the cause is statistics or index-related
- produce the finding
So workflow is not a single layer. There is a global sequence and there are local method steps inside individual skills.
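The two levels can be sketched together. In this illustrative stub, the task-level workflow owns triage, conclusion, and escalation, while the skill keeps its internal steps to itself:

```python
# Two workflow levels in one sketch: a task-level sequence that owns triage
# and escalation, and a skill whose internal steps stay local to it.
# All functions are illustrative stubs.

def inspect_slow_log() -> str:
    return "slow log inspected"

def inspect_plan() -> str:
    return "plan inspected"

def slow_query_skill() -> list[str]:
    # Skill-internal sequence: local method steps, invisible to the outer flow.
    return [inspect_slow_log(), inspect_plan(), "cause attributed"]

def handle_incident(issue: str, needs_approval: bool = False) -> list[str]:
    # Task-level workflow: the global sequence the organization cares about.
    steps = [f"received: {issue}", "triaged"]
    steps += slow_query_skill()           # the whole skill is one step up here
    steps.append("conclusion produced")
    if needs_approval:
        steps.append("escalated for approval")
    return steps

print(handle_incident("query latency spike"))
```

Keeping the levels separate means a skill can be improved or replaced without touching the global sequence, and vice versa.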
Conclusion: Knowledge is closer to material, skill is closer to method; tools are closer to minimal actions, while skills are reusable task packages; workflow exists both as task-level sequencing and as internal skill steps. This is not semantics for its own sake. It is a practical abstraction grounded in RAG, tool-use, and orchestration patterns. (Sources: RAG, ReAct, Toolformer, Anthropic, Writing effective tools for AI agents, OpenAI, A practical guide to building AI agents)
6. Why context engineering becomes more important
Many people used to think a better first prompt would naturally make the system more stable.
That stops working as soon as the task becomes long-running, multi-step, or tool-driven.
What matters is not just how the system starts. It is whether the model keeps getting the right information as the task unfolds:
- what should be in context now
- what should be deferred
- what history should be compressed
- what state must be preserved
- what tools are currently available
- what counts as “done”
Prompt engineering is about what to say.
Context engineering is about building the workbench correctly.
And once the workbench is wrong, failure modes show up fast:
- too much information and no focus
- critical information buried in the middle
- irrelevant context entering too early and creating noise
That is why good agent systems do not just increase context size. They manage context shape.
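Managing context shape can be as simple as deciding what stays verbatim and what gets compressed. This is a minimal sketch with illustrative budgets: the goal and the freshest evidence sit at the edges, and older middle turns are summarized, since that middle region is where long-context models tend to lose information (the "Lost in the Middle" effect):

```python
# A minimal sketch of managing context shape rather than size: keep the goal
# and the freshest evidence at the edges, and compress the middle, where
# long-context models tend to lose information. The budget is illustrative.

def build_context(goal: str, history: list[str], budget: int = 5) -> list[str]:
    if len(history) <= budget:
        middle = history
    else:
        # Compress older middle turns into one summary line; keep the
        # most recent turns verbatim.
        keep = budget - 1
        middle = [f"[summary of {len(history) - keep} earlier steps]"]
        middle += history[-keep:]
    # Goal first, freshest material last: the positions models attend to best.
    return [f"GOAL: {goal}"] + middle + ["What is the next step?"]

ctx = build_context("find slow query cause",
                    [f"step {i}" for i in range(1, 11)], budget=4)
print(ctx)
```

A real system would summarize with the model itself rather than a placeholder string, but the shaping decision happens in the system layer either way.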
Conclusion: In multi-step execution, the decisive factor is often not how elegant the opening prompt is, but whether the model keeps receiving the right information over time. Context is not better when it is larger. It is better when it stays aligned with the task. (Sources: Anthropic, Effective context engineering for AI agents, Lost in the Middle)
7. Why organizational knowledge still has to stay externalized
Models will absorb more general knowledge. That part is obvious.
But the most important knowledge inside an organization is often not “general intelligence.” It is operational state:
- internal runbooks
- the newest internal rules
- customer-specific environments
- version-specific known issues
- approval policies
- evidence and citation requirements
That kind of information still has to be:
- updated
- tracked
- audited
- permissioned
As long as those requirements exist, it should not live only inside model weights.
So stronger models do not make externalized organizational knowledge less important. They make it more important to move from “the model might know this” to “the system explicitly provides this.”
Conclusion: General knowledge will increasingly be absorbed into models, but organizational knowledge will not disappear with it. If information needs updating, tracking, auditing, or permissioning, it should not exist only in weights. (Sources: RAG, Anthropic, Effective context engineering for AI agents, OpenAI, Agent Builder)
8. In a database troubleshooting agent, the boundary gets clearer
Database troubleshooting is a good example because the split is easy to see.
Models are well-suited for many parts of the task:
- understanding alerts and incident descriptions
- doing initial categorization
- extracting signal from logs, metrics, and execution plans
- drafting an incident summary
- proposing likely directions
Those are strong candidates for collaboration between model, knowledge, and skill layers.
But the surrounding system questions do not solve themselves:
- which environments are queryable
- which tools are read-only
- which actions are destructive
- which steps require human approval
- which conclusions must include evidence
- how experience from one incident gets reused in the next one
So if you are building a database troubleshooting agent, the long-term asset is not a giant paragraph that says “think like a senior DBA.”
The long-term asset is:
- tool interfaces and permission models
- state handling
- organizational knowledge supply
- approval boundaries
- reusable skills
- replay and evaluation
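A tool permission model can be sketched in a few lines. The registry shape and tool names below are illustrative, but the principle is the one from the list above: risk class is declared per tool, and the system, not the model, enforces the approval boundary:

```python
# Sketch of a tool permission model: every tool is registered with an explicit
# risk class, and destructive actions cannot run without recorded approval.
# The registry shape and tool names are illustrative.

TOOLS = {
    "fetch_logs":   {"risk": "read_only"},
    "explain_plan": {"risk": "read_only"},
    "kill_session": {"risk": "destructive"},
    "drop_index":   {"risk": "destructive"},
}

def invoke(tool: str, approved: bool = False) -> str:
    meta = TOOLS.get(tool)
    if meta is None:
        raise KeyError(f"unknown tool: {tool}")
    if meta["risk"] == "destructive" and not approved:
        # The system enforces the boundary; the model does not get to decide.
        return f"BLOCKED: {tool} requires human approval"
    return f"ran {tool}"

print(invoke("fetch_logs"))                 # read-only: runs freely
print(invoke("drop_index"))                 # destructive: blocked
print(invoke("drop_index", approved=True))  # runs only after approval
```

This is exactly the kind of asset that survives model upgrades: the next model generation plugs into the same registry and inherits the same boundaries.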
Conclusion: In domain agents, the durable asset is not elaborate persona text like “act like a senior DBA.” It is tool access, state management, organizational knowledge, approval boundaries, reusable skills, and evaluation/replay. (Sources: Anthropic, Writing effective tools for AI agents, OpenAI, Safety best practices for building agents, Anthropic, Demystifying AI agent evals)
9. If you are building agents today, these are the easiest ways to go wrong
The four failure modes I worry about most are straightforward.
First: spending too much effort patching model weaknesses
That layer is exposed to rapid model progress.
Second: pushing everything into the prompt
It looks fast in the short term and turns into a black box in the medium term.
Third: mistaking a provider’s product UX for your own architecture
You can learn from ChatGPT, Claude, or any other provider. But their surface UX reflects internal runtime choices. It should not automatically become your system design.
Fourth: starting with multi-agent complexity and no eval loop
Once complexity rises, if you do not have traces, replay, and evaluation, you quickly end up with a system that looks active but is impossible to debug.
The steadier order is usually:
- start with a simple agent or workflow
- get tools and context right first
- add workflow and orchestration where the task boundary demands it
- use evals to decide whether more complexity is justified
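"Let evals justify complexity" can be operationalized with a fixed case set and a regression gate. This is a toy sketch with stubbed agents; a real eval would replay recorded traces, but the decision rule is the same:

```python
# Sketch of letting evals gate complexity: replay a fixed case set against the
# current agent, and accept a change only if the pass rate does not regress.
# Agents and cases here are toy stubs.

def pass_rate(agent, cases: list[tuple[str, str]]) -> float:
    hits = sum(1 for task, expected in cases if agent(task) == expected)
    return hits / len(cases)

cases = [("2+2", "4"), ("capital of France", "Paris"), ("ping", "pong")]

def simple_agent(task: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(task, "?")

def complex_agent(task: str) -> str:
    # More moving parts does not automatically mean better results.
    return {"2+2": "4"}.get(task, "?")

baseline = pass_rate(simple_agent, cases)
candidate = pass_rate(complex_agent, cases)
# The eval, not intuition, decides whether the added complexity is justified.
print(f"keep complex version: {candidate >= baseline}")  # → False here
```

Without this loop, "we added a second agent" and "the system got better" are indistinguishable claims.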
Conclusion: The stable default is not “start with many agents.” It is “start simple, then let evals justify additional complexity.” Otherwise you are likely to build a system that is more expensive and harder to reason about than the task requires. (Sources: Anthropic, Building effective agents, OpenAI, A practical guide to building AI agents, Anthropic, Demystifying AI agent evals)
10. The closing point: the durable part is the system that makes models reliably useful
Models will keep improving.
So the important thing in agent design is not chasing every model generation with patch-style upgrades. It is deciding, clearly and early:
- what should belong to the model
- what should still belong to the system even if models get much better
Once that line is clear, the rest gets simpler.
Skills, workflows, knowledge, context engineering, orchestration—these ideas only become useful once they stop being buzzwords and start being a division of labor.
If I had to compress the whole argument into one sentence, it would be this:
What will last is not the bag of tricks that makes the model look smarter. It is the system capability that makes the model reliably useful.
And that is why the final distinction matters:
The model is the brain. The agent has to be the body.
Final conclusion: Across Anthropic and OpenAI’s engineering guidance, and across the deeper work on retrieval, tool use, and evaluation, the durable direction converges on the same idea: the long-term value is not in endlessly patching the model. It is in placing the model inside a system that can execute, accumulate, and govern. (Sources: Anthropic, Building effective agents, Anthropic, Effective context engineering for AI agents, OpenAI, Agents SDK, RAG)
References
- Anthropic, Building effective agents
- Anthropic, Effective context engineering for AI agents
- Anthropic, Writing effective tools for AI agents
- Anthropic, Demystifying AI agent evals
- OpenAI, Agent Builder
- OpenAI, Agents SDK
- OpenAI, Safety best practices for building agents
- OpenAI, A practical guide to building AI agents
- OpenAI, Introducing GPT-5 for developers
- Anthropic, Claude Opus
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- ReAct: Synergizing Reasoning and Acting in Language Models
- Toolformer: Language Models Can Teach Themselves to Use Tools
- Lost in the Middle: How Language Models Use Long Contexts