Over the last two years, the center of gravity in AI discussions has clearly shifted.
At first, people mostly cared about whether a model could answer questions, write code, or summarize a document. Now the question is different: can it actually get work done?
That is why “agent” became a serious engineering topic. People started building skills, workflows, multi-agent systems, memory layers, tool use, and increasingly long prompts that try to hold everything together.
But if models themselves keep getting stronger, then a more fundamental question matters:
In the agents we build today, what becomes long-term product and engineering value, and what is just temporary scaffolding?
Models will keep improving. That part is close to consensus. The real differentiator will not be who writes cleverer prompts. It will be who puts the model inside a system that can do work, accumulate context, and be governed reliably.
My current answer is straightforward:
The agents that last will not be the ones that keep “making the model smarter.” They will be the ones that place the model inside a system that can execute, accumulate, and govern.
This is really a boundary question: what belongs to the model, and what belongs to the system?
1. Start with the mistake: an agent is not a more complicated prompt
Many agent systems drift in the same direction.
The team starts by adding a few rules to make the model steadier. Then a few more steps. Then fallback instructions. Then tool ordering, exception handling, formatting rules, and fixed wording.
What you get at the end looks “smart,” but it is fragile.
The problem is not that prompts do not matter. They do.
The problem is that prompts are being asked to do too many jobs at once:
- define the role
- provide the knowledge
- encode the workflow
- constrain the tools
- enforce the evaluation logic
Prompts are useful for direction, framing, boundaries, and tone. They are much worse as a knowledge layer, process engine, tool policy layer, and reliability framework all at the same time.
Once everything gets pushed into one prompt, the system becomes hard to reason about:
- maintenance gets expensive
- local changes have system-wide side effects
- failure analysis becomes guesswork
That is why many teams become disappointed with “agents” after the first wave of experimentation. The issue is usually not the direction. It is the lack of layering from day one.
Conclusion: From Anthropic’s definition of agents to OpenAI’s breakdown of agent building, the prompt is only one part of an agent, not the system itself. If you push knowledge, workflow, tool policy, and evaluation into the prompt, the system gets more brittle, not more durable. (Sources: Anthropic, Building effective agents, OpenAI, Agent Builder)
2. Draw the line: the first thing stronger models eat is “generic cognition patching”
If models keep improving, the first category they compress is usually the layer of generic cognition patching.
That includes things like:
- rewriting user input into cleaner task statements
- pulling the key points out of noisy information
- producing decent summaries and reports
- doing shallow decomposition and planning
- making acceptable judgments in common cases
Those are exactly the abilities models are supposed to keep improving at.
So if most of an agent’s value still comes from:
- helping the model understand a little more
- rephrasing user intent
- wrapping a thin planning step around the model
- “correcting” the model into basic common sense with prompt tricks
then the shelf life is probably short.
This is not a hard theorem. It is an engineering inference from the direction of model progress.
Recent model releases increasingly emphasize stronger reasoning, coding, multi-step execution, and agentic workflows. That means the more your product is just “generic cognition patching,” the more likely it is to be absorbed by the next model generation.
Conclusion: If an agent’s core value is still “help the model understand better, rewrite, or do shallow planning,” it is more exposed to model progress than to durable product differentiation. That is an engineering inference from the public trajectory of GPT-5 and Claude Opus. (Sources: OpenAI, Introducing GPT-5 for developers, Anthropic, Claude Opus)
3. What does not go away is the system layer
The opposite category is the system layer.
These are the problems that do not disappear just because the model gets smarter:
- where information comes from
- which context is relevant right now
- which tools are available
- which tools are read-only versus destructive
- which actions require approval
- how state is preserved across steps
- how handoffs happen
- how failures are replayed and audited
- how the same class of task gets more reliable over time
These are not intelligence problems. They are system design problems.
And stronger models make them more important, not less important.
Why? Because once models are good enough to be plugged into real environments, the surrounding requirements get stricter:
- permissions matter more
- auditability matters more
- approval boundaries matter more
- traces, replay, and evals matter more
So the real job of an agent system is not to build a larger brain. It is to build everything around the brain that lets it act safely and repeatably in the real world.
Conclusion: In real-world use, information access, permission boundaries, state management, approvals, traces, replay, and evaluation are system problems. They do not vanish just because the model gets better. (Sources: Anthropic, Building effective agents, OpenAI, Agents SDK, OpenAI, Safety best practices for building agents, Anthropic, Demystifying AI agent evals)
4. Break an agent apart and the confusion starts to clear
A useful way to look at an agent system is to split it into at least eight layers:
- Model: understands, reasons, judges, and writes
- Context: supplies what the model should see for the current task
- Knowledge: provides facts, rules, background, and experience
- Skills: capture reusable ways of doing recurring work
- Tools: perform the actual reads, writes, calls, and queries
- Workflow: determines the task sequence
- Orchestration: controls switching, routing, approval, and termination
- Evaluation and feedback: turns one-off success into repeatable reliability
The model is one layer among several. It is important, but it is not the whole system.
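The layer split can be made concrete with a small sketch. This is illustrative only, not any particular framework's API; every name here is made up. The point is that the model is one swappable component among several, and orchestration is what wires them together:

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative layer split: each layer is a separate, swappable component,
# and the model is just one of them.
@dataclass
class AgentSystem:
    model: Callable[[str], str]               # understands, reasons, writes
    context: Callable[[str], str]             # selects what the model sees now
    knowledge: dict[str, str] = field(default_factory=dict)    # facts and rules
    skills: dict[str, Callable] = field(default_factory=dict)  # reusable methods
    tools: dict[str, Callable] = field(default_factory=dict)   # minimal actions

    def run(self, task: str) -> str:
        # Orchestration in miniature: build the context, then ask the model.
        return self.model(self.context(task))

# A stub model makes the wiring visible without any API calls.
system = AgentSystem(
    model=lambda prompt: f"answer for: {prompt}",
    context=lambda task: f"[relevant context] {task}",
)
print(system.run("diagnose slow query"))
```

Swapping in a stronger model changes one field; the context, knowledge, skill, and tool layers keep their value unchanged.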
Once the layers are separated, a lot of design questions become easier:
- should this live in the prompt or in the knowledge layer?
- is this a tool or a skill?
- is this a workflow issue or an orchestration issue?
- should the model decide this, or should the system enforce it?
```mermaid
flowchart TB
    U["User task"] --> G["Goal<br/>What does 'done' mean here?"]
    G --> M["Model<br/>Understand, reason, judge, express"]
    M --> C["Context<br/>Task-relevant information"]
    M --> Memory["Memory<br/>History and intermediate state"]
    M --> K["Knowledge base<br/>Rules, experience, docs"]
    C --> O["Orchestration<br/>Routing, switching, approval, tracking"]
    Memory --> O
    K --> O
    O --> W["Workflow<br/>Task sequence"]
    O --> S["Skills<br/>Reusable method packages"]
    O --> B["Boundaries<br/>Permissions and risk constraints"]
    W --> T["Tools<br/>Logs, DBs, APIs, terminals"]
    S --> T
    B --> T
    T --> E["Real environment<br/>Code, systems, data, users"]
    E --> F["Evaluation and feedback<br/>Replay, regression, improvement"]
```

Conclusion: Once you separate model, context, knowledge, skills, tools, workflow, orchestration, and evaluation, many debates about “what should go in the prompt” versus “what should go in the system” resolve themselves. (Sources: Anthropic, Effective context engineering for AI agents, Anthropic, Building effective agents, OpenAI, Agent Builder)
5. The three pairs people mix up most often
Knowledge is not skill
The simplest distinction is:
- Knowledge answers “what is true?”
- Skill answers “how should this kind of task be done?”
Knowledge is closer to material. Skill is closer to method.
In database troubleshooting:
- “What do TiDB, TiKV, and PD each do?” is knowledge
- “For slow queries, check the slow log, then the plan, then attribute the cause” is skill
Knowledge provides judgment material. Skill organizes the action path.
A tool is not a skill
A useful engineering abstraction is:
- Tools are minimal actions
- Skills are reusable method bundles built around a task type
For example:
- “Fetch the last 30 minutes of logs” is a tool-level action
- “Run performance triage” is a skill-level package
The first is a move. The second is a structured way of working.
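The distinction can be shown in a few lines. The function names below are hypothetical stubs, not a real monitoring API: each tool is one minimal, auditable action, and the skill is the structured method that composes them:

```python
# A tool is a minimal action; a skill is a reusable bundle that composes tools.
# All functions here are hypothetical stubs, not a real monitoring API.

def fetch_slow_log(minutes: int) -> list[str]:
    """Tool: one minimal, auditable read."""
    return [f"slow query observed in last {minutes} min"]  # stubbed data

def fetch_execution_plan(query_id: str) -> str:
    """Tool: another minimal read."""
    return f"plan for {query_id}: full table scan"  # stubbed data

def performance_triage(query_id: str) -> dict:
    """Skill: a structured way of working, built from tools."""
    evidence = {
        "slow_log": fetch_slow_log(30),
        "plan": fetch_execution_plan(query_id),
    }
    # The skill encodes the attribution step too, not just the reads.
    cause = "missing index" if "full table scan" in evidence["plan"] else "unknown"
    return {"evidence": evidence, "suspected_cause": cause}

print(performance_triage("q42")["suspected_cause"])  # → missing index
```

The tools stay stable as the model improves; the skill is where the team's accumulated method lives.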
Workflow exists at two levels
When people say “workflow,” they are often referring to two different things.
One is the task-level workflow:
- receive the issue
- triage
- collect evidence
- test hypotheses
- produce a conclusion
- escalate or request approval if needed
The other is the skill-internal sequence:
- inspect slow logs
- inspect the execution plan
- determine whether the cause is statistics or index-related
- produce the finding
So workflow is not a single layer. There is a global sequence and there are local method steps inside individual skills.
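The two levels can be sketched together. In this illustrative stub, the task-level workflow owns triage, conclusion, and escalation, while the skill keeps its internal steps to itself:

```python
# Two workflow levels in one sketch: a task-level sequence that owns triage
# and escalation, and a skill whose internal steps stay local to it.
# All functions are illustrative stubs.

def inspect_slow_log() -> str:
    return "slow log inspected"

def inspect_plan() -> str:
    return "plan inspected"

def slow_query_skill() -> list[str]:
    # Skill-internal sequence: local method steps, invisible to the outer flow.
    return [inspect_slow_log(), inspect_plan(), "cause attributed"]

def handle_incident(issue: str, needs_approval: bool = False) -> list[str]:
    # Task-level workflow: the global sequence the organization cares about.
    steps = [f"received: {issue}", "triaged"]
    steps += slow_query_skill()           # the whole skill is one step up here
    steps.append("conclusion produced")
    if needs_approval:
        steps.append("escalated for approval")
    return steps

print(handle_incident("query latency spike"))
```

Keeping the levels separate means a skill can be improved or replaced without touching the global sequence, and vice versa.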
Conclusion: Knowledge is closer to material, skill is closer to method; tools are closer to minimal actions, while skills are reusable task packages; workflow exists both as task-level sequencing and as internal skill steps. This is not semantics for its own sake. It is a practical abstraction grounded in RAG, tool-use, and orchestration patterns. (Sources: RAG, ReAct, Toolformer, Anthropic, Writing effective tools for AI agents, OpenAI, A practical guide to building AI agents)
6. Why context engineering becomes more important
Many people used to think a better first prompt would naturally make the system more stable.
That stops working as soon as the task becomes long-running, multi-step, or tool-driven.
What matters is not just how the system starts. It is whether the model keeps getting the right information as the task unfolds:
- what should be in context now
- what should be deferred
- what history should be compressed
- what state must be preserved
- what tools are currently available
- what counts as “done”
Prompt engineering is about what to say.
Context engineering is about building the workbench correctly.
And once the workbench is wrong, failure modes show up fast:
- too much information and no focus
- critical information buried in the middle
- irrelevant context entering too early and creating noise
That is why good agent systems do not just increase context size. They manage context shape.
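Managing context shape can be as simple as deciding what stays verbatim and what gets compressed. This is a minimal sketch with illustrative budgets: the goal and the freshest evidence sit at the edges, and older middle turns are summarized, since that middle region is where long-context models tend to lose information (the "Lost in the Middle" effect):

```python
# A minimal sketch of managing context shape rather than size: keep the goal
# and the freshest evidence at the edges, and compress the middle, where
# long-context models tend to lose information. The budget is illustrative.

def build_context(goal: str, history: list[str], budget: int = 5) -> list[str]:
    if len(history) <= budget:
        middle = history
    else:
        # Compress older middle turns into one summary line; keep the
        # most recent turns verbatim.
        keep = budget - 1
        middle = [f"[summary of {len(history) - keep} earlier steps]"]
        middle += history[-keep:]
    # Goal first, freshest material last: the positions models attend to best.
    return [f"GOAL: {goal}"] + middle + ["What is the next step?"]

ctx = build_context("find slow query cause",
                    [f"step {i}" for i in range(1, 11)], budget=4)
print(ctx)
```

A real system would summarize with the model itself rather than a placeholder string, but the shaping decision happens in the system layer either way.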
Conclusion: In multi-step execution, the decisive factor is often not how elegant the opening prompt is, but whether the model keeps receiving the right information over time. Context is not better when it is larger. It is better when it stays aligned with the task. (Sources: Anthropic, Effective context engineering for AI agents, Lost in the Middle)
7. Why organizational knowledge still has to stay externalized
Models will absorb more general knowledge. That part is obvious.
But the most important knowledge inside an organization is often not “general intelligence.” It is operational state:
- internal runbooks
- the newest internal rules
- customer-specific environments
- version-specific known issues
- approval policies
- evidence and citation requirements
That kind of information still has to be:
- updated
- tracked
- audited
- permissioned
As long as those requirements exist, it should not live only inside model weights.
So stronger models do not make externalized organizational knowledge less important. They make it more important to move from “the model might know this” to “the system explicitly provides this.”
Conclusion: General knowledge will increasingly be absorbed into models, but organizational knowledge will not disappear with it. If information needs updating, tracking, auditing, or permissioning, it should not exist only in weights. (Sources: RAG, Anthropic, Effective context engineering for AI agents, OpenAI, Agent Builder)
8. In a database troubleshooting agent, the boundary gets clearer
Database troubleshooting is a good example because the split is easy to see.
Models are well-suited for many parts of the task:
- understanding alerts and incident descriptions
- doing initial categorization
- extracting signal from logs, metrics, and execution plans
- drafting an incident summary
- proposing likely directions
Those are strong candidates for collaboration between model, knowledge, and skill layers.
But the surrounding system questions do not solve themselves:
- which environments are queryable
- which tools are read-only
- which actions are destructive
- which steps require human approval
- which conclusions must include evidence
- how experience from one incident gets reused in the next one
So if you are building a database troubleshooting agent, the long-term asset is not a giant paragraph that says “think like a senior DBA.”
The long-term asset is:
- tool interfaces and permission models
- state handling
- organizational knowledge supply
- approval boundaries
- reusable skills
- replay and evaluation
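A tool permission model can be sketched in a few lines. The registry shape and tool names below are illustrative, but the principle is the one from the list above: risk class is declared per tool, and the system, not the model, enforces the approval boundary:

```python
# Sketch of a tool permission model: every tool is registered with an explicit
# risk class, and destructive actions cannot run without recorded approval.
# The registry shape and tool names are illustrative.

TOOLS = {
    "fetch_logs":   {"risk": "read_only"},
    "explain_plan": {"risk": "read_only"},
    "kill_session": {"risk": "destructive"},
    "drop_index":   {"risk": "destructive"},
}

def invoke(tool: str, approved: bool = False) -> str:
    meta = TOOLS.get(tool)
    if meta is None:
        raise KeyError(f"unknown tool: {tool}")
    if meta["risk"] == "destructive" and not approved:
        # The system enforces the boundary; the model does not get to decide.
        return f"BLOCKED: {tool} requires human approval"
    return f"ran {tool}"

print(invoke("fetch_logs"))                 # read-only: runs freely
print(invoke("drop_index"))                 # destructive: blocked
print(invoke("drop_index", approved=True))  # runs only after approval
```

This is exactly the kind of asset that survives model upgrades: the next model generation plugs into the same registry and inherits the same boundaries.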
Conclusion: In domain agents, the durable asset is not elaborate persona text like “act like a senior DBA.” It is tool access, state management, organizational knowledge, approval boundaries, reusable skills, and evaluation/replay. (Sources: Anthropic, Writing effective tools for AI agents, OpenAI, Safety best practices for building agents, Anthropic, Demystifying AI agent evals)
9. If you are building agents today, these are the easiest ways to go wrong
The four failure modes I worry about most are straightforward.
First: spending too much effort patching model weaknesses
That layer is exposed to rapid model progress.
Second: pushing everything into the prompt
It looks fast in the short term and turns into a black box in the medium term.
Third: mistaking a provider’s product UX for your own architecture
You can learn from ChatGPT, Claude, or any other provider. But their surface UX reflects internal runtime choices. It should not automatically become your system design.
Fourth: starting with multi-agent complexity and no eval loop
Once complexity rises, if you do not have traces, replay, and evaluation, you quickly end up with a system that looks active but is impossible to debug.
The steadier order is usually:
- start with a simple agent or workflow
- get tools and context right first
- add workflow and orchestration where the task boundary demands it
- use evals to decide whether more complexity is justified
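"Let evals justify complexity" can be operationalized with a fixed case set and a regression gate. This is a toy sketch with stubbed agents; a real eval would replay recorded traces, but the decision rule is the same:

```python
# Sketch of letting evals gate complexity: replay a fixed case set against the
# current agent, and accept a change only if the pass rate does not regress.
# Agents and cases here are toy stubs.

def pass_rate(agent, cases: list[tuple[str, str]]) -> float:
    hits = sum(1 for task, expected in cases if agent(task) == expected)
    return hits / len(cases)

cases = [("2+2", "4"), ("capital of France", "Paris"), ("ping", "pong")]

def simple_agent(task: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(task, "?")

def complex_agent(task: str) -> str:
    # More moving parts does not automatically mean better results.
    return {"2+2": "4"}.get(task, "?")

baseline = pass_rate(simple_agent, cases)
candidate = pass_rate(complex_agent, cases)
# The eval, not intuition, decides whether the added complexity is justified.
print(f"keep complex version: {candidate >= baseline}")  # → False here
```

Without this loop, "we added a second agent" and "the system got better" are indistinguishable claims.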
Conclusion: The stable default is not “start with many agents.” It is “start simple, then let evals justify additional complexity.” Otherwise you are likely to build a system that is more expensive and harder to reason about than the task requires. (Sources: Anthropic, Building effective agents, OpenAI, A practical guide to building AI agents, Anthropic, Demystifying AI agent evals)
10. The closing point: the durable part is the system that makes models reliably useful
Models will keep improving.
So the important thing in agent design is not chasing every model generation with patch-style upgrades. It is deciding, clearly and early:
- what should belong to the model
- what should still belong to the system even if models get much better
Once that line is clear, the rest gets simpler.
Skills, workflows, knowledge, context engineering, orchestration—these ideas only become useful once they stop being buzzwords and start being a division of labor.
If I had to compress the whole argument into one sentence, it would be this:
What will last is not the bag of tricks that makes the model look smarter. It is the system capability that makes the model reliably useful.
And that is why the final distinction matters:
The model is the brain. The agent has to be the body.
Final conclusion: Across Anthropic and OpenAI’s engineering guidance, and across the deeper work on retrieval, tool use, and evaluation, the durable direction converges on the same idea: the long-term value is not in endlessly patching the model. It is in placing the model inside a system that can execute, accumulate, and govern. (Sources: Anthropic, Building effective agents, Anthropic, Effective context engineering for AI agents, OpenAI, Agents SDK, RAG)
References
- Anthropic, Building effective agents
- Anthropic, Effective context engineering for AI agents
- Anthropic, Writing effective tools for AI agents
- Anthropic, Demystifying AI agent evals
- OpenAI, Agent Builder
- OpenAI, Agents SDK
- OpenAI, Safety best practices for building agents
- OpenAI, A practical guide to building AI agents
- OpenAI, Introducing GPT-5 for developers
- Anthropic, Claude Opus
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- ReAct: Synergizing Reasoning and Acting in Language Models
- Toolformer: Language Models Can Teach Themselves to Use Tools
- Lost in the Middle: How Language Models Use Long Contexts