There's a war happening in quiet Slack channels and conference rooms across Silicon Valley, and it doesn't involve any of the AI models themselves. It's a philosophical disagreement between the engineers building them: should an AI agent work alone, or should it delegate?
On one side, Cognition AI — makers of Devin — has staked its entire architecture on the single-agent model. The conviction is straightforward: a "super-agent" that maintains one unified context window across thousands of files is more reliable than a committee. When you fragment a task across multiple sub-agents, you risk losing the thread. The left hand stops knowing what the right hand is doing.
Anthropic takes the opposite view. Their architecture bets on multi-agent orchestration: a coordinator dispatches specialized sub-agents that work in parallel, each operating on scoped context. The benefit is raw throughput — enterprise research tasks that would take a single agent hours can complete in minutes when properly parallelized. The risk is coordination overhead and what engineers quietly call context fragmentation: the moment a sub-agent's partial view of the world leads it to a subtly wrong decision whose effects compound as results flow back up to the orchestrator.
The problem is deeper than it sounds. When you have five sub-agents collaborating on a 10,000-line codebase, each agent's context window is necessarily a cropped slice of reality. Agent A refactors an auth module. Agent B, unaware of that refactor mid-flight, extends a method that no longer exists in main. The orchestrator reconciles the merge conflict — or worse, it doesn't notice one exists.
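That failure mode can be made concrete with a toy simulation. Everything here is hypothetical (a dictionary standing in for a repo, a deliberately naive last-writer-wins merge), but it shows how per-agent snapshots can lose a change the orchestrator never flags as semantically broken:

```python
# Hypothetical sketch: stale per-agent snapshots producing a silent conflict.
import copy

repo = {"auth.py": "def issue_token(user): ..."}

# Each sub-agent works on a snapshot taken at dispatch time.
snapshot_a = copy.deepcopy(repo)
snapshot_b = copy.deepcopy(repo)

# Agent A refactors: renames issue_token -> mint_token.
snapshot_a["auth.py"] = "def mint_token(user): ..."

# Agent B, unaware of the rename mid-flight, extends the old symbol.
snapshot_b["auth.py"] = (
    "def issue_token(user): ...\n\ndef issue_token_v2(user): ..."
)

def merge(base, a, b):
    """Naive merge: flags a conflict only when both agents touched the
    same file, then keeps B's version. It cannot see semantic breakage."""
    merged, conflicts = dict(base), []
    for path in set(a) | set(b):
        if a.get(path) != base.get(path) and b.get(path) != base.get(path):
            conflicts.append(path)           # both edited auth.py
        merged[path] = b.get(path, a.get(path))
    return merged, conflicts

merged, conflicts = merge(repo, snapshot_a, snapshot_b)
print(conflicts)                             # ['auth.py']: the textual conflict
print("mint_token" in merged["auth.py"])     # False: A's refactor silently lost
```

A real orchestrator is far more sophisticated, but the structural issue is the same: the merge layer sees text, not intent, so Agent A's refactor can vanish without any error being raised.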
Context is not just memory — it's coherence. And coherence, once broken across a swarm of agents, is extraordinarily difficult to reconstruct.
The honest answer is that neither side has definitively won yet. Single-agent architectures win on coherence for deep, sequential tasks — the kind where understanding the why behind a design decision made 300 commits ago matters. Multi-agent architectures win on parallelism for breadth-first, research-heavy workflows where independent subtasks genuinely don't interfere with each other.
The dangerous assumption is treating either as universally correct. The real engineering challenge is knowing which tool to reach for — and building the scaffolding to switch gracefully between them.
Why is the foundation builder asking the app builder for a grade?
Here's an interesting asymmetry: Anthropic — one of the most well-resourced AI safety and research labs on the planet — uses Cognition's "junior dev" benchmarks, alongside SWE-bench-style agentic evaluations, to validate their own models before release. The company that makes the underlying model is asking the company that built a product on top of it how well that model performs.
Think about that for a moment. It would be a bit like Intel using a mid-tier laptop manufacturer's performance benchmarks to certify their next-generation processor. The dynamic reveals something important: evaluating agentic behavior is fundamentally harder than evaluating raw model capability.
Standard benchmarks (MMLU, HumanEval, GSM8K) measure a model's ability to answer a question in isolation. Cognition's benchmarks measure whether an AI can execute a multi-step coding task autonomously — handling tool use, error recovery, test execution, and code review — without a human holding its hand at every turn. It's a categorically different evaluation.
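The categorical difference can be sketched in a few lines. The function names and loop below are purely illustrative, not any real harness's API: a static benchmark grades one isolated answer, while an agentic benchmark grades a whole trajectory of act, observe, and recover.

```python
# Illustrative contrast (all functions hypothetical), not a real eval harness.

def static_eval(model, question, reference):
    # One shot, no tools, no recovery: HumanEval/MMLU-style grading.
    return model(question) == reference

def agentic_eval(agent, task, run_tests, max_steps=10):
    # The agent acts, observes real feedback, and must recover from its
    # own mistakes; success is measured over the whole trajectory.
    observation = task
    for _ in range(max_steps):
        patch = agent(observation)        # propose an edit or tool call
        passed, log = run_tests(patch)    # execute in a sandbox
        if passed:
            return True
        observation = log                 # error output becomes new context
    return False
```

Notice that `agentic_eval` can fail even when every individual `agent` call is "smart": if the agent cannot use the error log to change course, the loop exhausts its budget. That recovery behavior is exactly what static benchmarks never exercise.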
The benchmark gap exposes an uncomfortable truth in AI development: capability and agency are not the same thing. A model can score impressively on MMLU while still failing spectacularly at "go debug this authentication bug that spans four microservices." The former tests knowledge retrieval. The latter tests sustained autonomous reasoning under ambiguous, evolving conditions.
Cognition has specialized in building the evaluation infrastructure for the latter. They've essentially created the rubric that the industry uses to grade "can this AI do real work?" — which gives them a structural advantage that has nothing to do with the raw intelligence of their underlying model, and everything to do with the difficulty of defining what "good" looks like in agentic contexts.
For engineers evaluating AI tooling, the takeaway is this: look past the benchmark scores on a vendor's landing page. Ask which benchmarks they're using, who designed them, and whether those benchmarks measure the type of work your team actually does.
Is AI helping your engineers code better, or just helping them stop learning?
Anthropic published a striking figure in their research: 79% of Claude Code interactions are pure automation — the AI is doing the task, not helping someone do it. Compare that to the standard Claude.ai chat interface, where automation accounts for closer to 49% of interactions. Engineers aren't using Claude Code as a copilot. They're using it as a replacement pilot.
At the individual level, the productivity case is clear. Engineers ship faster. PRs close quicker. Sprints finish early. The metrics that show up in Jira look fantastic.
The problem Anthropic's own research surfaces is what they call the competence trap: junior engineers who rely heavily on AI tooling early in their careers are spending more time querying than coding. They're developing fluency in prompt engineering instead of fluency in the underlying systems they're building on top of. They can generate a working OAuth flow from a prompt. Ask them to debug the token refresh race condition at 2am when the AI tool is rate-limited? That's a different story.
The engineers who will be most valuable in five years aren't the ones who can get AI to write code the fastest. They're the ones who understand the code well enough to know when the AI got it wrong.
For tech leads and CTOs, this is a genuine talent development dilemma. The productivity gains from AI automation are real and measurable. But the atrophy of foundational engineering intuition is slower, harder to measure, and much more expensive when it eventually manifests in your architecture decisions.
The most thoughtful engineering orgs I've observed are threading this needle deliberately: using AI automation for velocity on well-defined, repeatable tasks, while being intentional about creating space for junior engineers to struggle productively with ambiguous, high-stakes problems without AI as a safety net.
The goal shouldn't be to maximize the percentage of code written by AI. It should be to maximize the engineering judgment in the humans who still own the system.
Consider a team policy that separates "AI-assisted sprints" (high velocity, greenfield work) from "diagnostic sprints" (debugging, architecture reviews, incident retrospectives) where AI tooling is explicitly scoped back. Preserving the latter protects the cognitive muscle that makes your engineers irreplaceable.
The last 5%: who wins the long-horizon race?
AI models are, at their core, System 1 thinkers — fast, pattern-matching, fluent at the familiar. Give Claude or GPT-4 a well-scoped task, and the first 95% will be accomplished at a speed that still feels slightly surreal. The code compiles. The logic is sound. The tests pass. You start to think the problem is solved.
Then comes the last 5%.
The final 5% of a complex engineering task is almost always where the hard things live: the edge case that only manifests in production with real data, the architectural decision that requires understanding three months of prior context, the security implication that a pattern-matching system would never catch because it requires adversarial imagination rather than completion instinct.
This is the core frontier that both Cognition and Anthropic are racing toward: long-horizon coherence. The ability to maintain a consistent, accurate mental model of a complex system across not just thousands of tokens, but across hours and days of continuous autonomous operation. The ability to revisit a decision made six subtasks ago, recognize it was incorrect given new information, and propagate that correction forward without breaking everything else.
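One way to picture that correction-propagation problem is a decision log with dependency edges, where revising an early decision marks everything built on it for re-examination rather than leaving stale work in place. This is a hypothetical sketch of the bookkeeping involved, not either company's actual mechanism:

```python
# Hypothetical decision log: revising an early decision flags every
# downstream decision that depended on it, transitively.

decisions = {}   # id -> {"choice": ..., "depends_on": [...], "stale": False}

def record(dec_id, choice, depends_on=()):
    decisions[dec_id] = {
        "choice": choice, "depends_on": list(depends_on), "stale": False,
    }

def mark_stale(dec_id):
    # Walk the dependency graph forward, flagging dependents once each.
    for other_id, other in decisions.items():
        if dec_id in other["depends_on"] and not other["stale"]:
            other["stale"] = True
            mark_stale(other_id)

def revise(dec_id, new_choice):
    decisions[dec_id]["choice"] = new_choice
    mark_stale(dec_id)

record("d1", "use JWT sessions")
record("d2", "store refresh token in Redis", depends_on=["d1"])
record("d3", "add Redis healthcheck", depends_on=["d2"])

# New information arrives six subtasks later: the original choice was wrong.
revise("d1", "use opaque DB-backed sessions")
print([d for d, v in decisions.items() if v["stale"]])   # ['d2', 'd3']
```

The hard part, of course, is not the graph walk; it's that real agents rarely record their decisions and dependencies this explicitly, so the stale work is invisible until something breaks.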
Cognition's bet is that their single-agent, deep-context architecture is better suited to this challenge. If your agent never fragments its understanding across sub-agents, it never has to reconcile conflicting world-states. The coherence problem is simpler when there's only one mind holding the thread.
Anthropic's bet is that context engineering — the deliberate design of what information is passed between agents, when, and in what form — can solve the coherence problem at scale. The challenge isn't maintaining a single context; it's designing information flows that preserve the right context at each decision point without overwhelming any individual agent's window.
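What context engineering might look like mechanically is easier to see in code. This is a minimal sketch under my own assumptions: the packet fields, the file-mention heuristic, and the token budget are all invented for illustration, not drawn from Anthropic's implementation.

```python
# Hedged sketch of "context engineering": the orchestrator assembles a
# scoped packet per sub-agent instead of forwarding the full shared history.
from dataclasses import dataclass

@dataclass
class ContextPacket:
    task: str
    relevant_files: dict       # only the files this subtask touches
    decision_log: list         # prior decisions the subtask must honor
    token_budget: int = 8_000  # illustrative cap, not a real limit

def build_packet(task, repo, decisions, touches):
    # Scope the world: include the files this subtask touches, plus any
    # recorded decision that mentions one of those files.
    files = {path: repo[path] for path in touches if path in repo}
    log = [d for d in decisions if any(path in d for path in touches)]
    return ContextPacket(task=task, relevant_files=files, decision_log=log)

repo = {"auth.py": "...", "billing.py": "...", "ui.tsx": "..."}
decisions = [
    "auth.py: tokens are opaque, not JWTs",
    "ui.tsx: use server components",
]
packet = build_packet("extend token refresh", repo, decisions,
                      touches=["auth.py"])
print(sorted(packet.relevant_files))   # ['auth.py']
print(packet.decision_log)             # ['auth.py: tokens are opaque, not JWTs']
```

The design question hiding in `build_packet` is the whole game: a heuristic that's too narrow reproduces the fragmentation problem, while one that's too broad overwhelms the sub-agent's window, and no one has a provably right answer yet.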
The 95% is already table stakes. The companies and teams that figure out the last 5% will redefine what "software engineering" means — and who gets to do it.
Where does this leave engineering teams today? In a useful intermediate state. You can extract enormous value from AI agents on well-bounded, high-definition tasks — the System 1 work that currently consumes a disproportionate share of engineering hours. For the tasks in the last 5% — the ambiguous, the high-stakes, the architecturally novel — human judgment remains both necessary and differentiating.
The risk is in confusing the velocity of the 95% for evidence that the 5% is also handled. It isn't. Not yet. And a team that builds production systems as if it is will discover that gap at the worst possible moment.
The race has barely started
We are still in the earliest innings of genuinely autonomous AI agents in software engineering. The architectures being debated today — single vs. multi-agent, deep context vs. distributed orchestration — will look as primitive to engineers in 2030 as jQuery looks to us now.
What won't change is the underlying challenge: building systems that fail gracefully, that preserve coherence under uncertainty, and that earn the trust of the humans who are ultimately accountable for what they produce. That challenge belongs to engineers, regardless of how much of the code an agent writes.
The agents are getting better. The question is whether we are too.
The agentic landscape is evolving faster than any single essay can capture. If you're navigating these decisions on your engineering team, I'd love to hear how you're thinking about it — reach out on LinkedIn.