Framework Matters 7x More Than Model: Braintrust’s 1,781 Production Traces Reveal the Real AI Agent Battleground

Who is winning the AI agent race? Based on 1,781 real production runs analyzed by AI evaluation platform Braintrust, the answer might surprise you. Keeping the model fixed and switching only the “agent harness” (the framework wrapping the model) can move success rates from 12% to 92%, an 80-percentage-point swing that dwarfs most model-to-model gaps.

The Core Finding: Framework Impact Is 7x Greater Than Model

Braintrust’s regression analysis quantifies this intuition with hard numbers. After controlling for benchmark and model variables, the agent harness explains about 5.3% of the variance in success rate, while the model explains only 0.7%. By that measure (5.3 / 0.7 ≈ 7.6), switching frameworks has over 7x the impact of switching models.

Crucially, switching frameworks costs almost nothing in tokens: different harnesses consume roughly the same number of tokens for the same task. Now that the performance gap between six mainstream models on coding tasks has narrowed to single-digit percentages, “which model to choose” is no longer the decisive variable. “What tool you use to deploy the model to production” and “how low you can push the inference cost per successful task” are becoming the real differentiators.

For AI startups, model-layer commoditization is happening faster than most people realize. Continuing to build a moat on “which latest model we integrated” is a strategy with rapidly evaporating value.

Harness: The Biggest Lever, Worth 80+ Percentage Points

Braintrust tested five architecturally distinct agent harnesses:

  • claude_code: Anthropic’s native agent loop, using an XML-like format for autonomous tool management
  • smolagents_code: Allows the model to write Python code to chain operations
  • tool_calling: Standard structured JSON function calling, one tool at a time
  • tool_calling_with_shortlisting: Pre-filters available tools each round
  • openai_solo: Thinnest possible OpenAI wrapper

The data from switching harnesses under the same model and task is staggering:

Model | Task | Harness | Success Rate
Claude | SWE-bench | claude_code | 100%
Claude | SWE-bench | tool_calling | 14%
Kimi | AppWorld | smolagents_code | 92%
Kimi | AppWorld | tool_calling | 12%
GPT-4.1 | Telco Support | smolagents_code | 51%
GPT-4.1 | Telco Support | claude_code | 18%

Each pair of rows uses the same model on the same task; only the harness changes. Seemingly small harness design choices (letting the model manage its own context versus constraining each step with fixed templates, or allowing code-chained tool calls versus one-at-a-time JSON) widen the gap to roughly 7x in the two largest cases.
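As a quick check on the size of these swings, the sketch below recomputes the gap for each model and task pair from the success rates listed above. The dictionary layout and the function name `harness_swing` are illustrative choices, not part of Braintrust's tooling.

```python
# Success rates (in percent) copied from the harness comparison above.
# Keys are (model, task); values map harness name -> success rate.
results = {
    ("Claude", "SWE-bench"): {"claude_code": 100, "tool_calling": 14},
    ("Kimi", "AppWorld"): {"smolagents_code": 92, "tool_calling": 12},
    ("GPT-4.1", "Telco Support"): {"smolagents_code": 51, "claude_code": 18},
}


def harness_swing(by_harness):
    """Return (best harness, worst harness, gap in percentage points)."""
    best = max(by_harness, key=by_harness.get)
    worst = min(by_harness, key=by_harness.get)
    return best, worst, by_harness[best] - by_harness[worst]


swings = {pair: harness_swing(rates) for pair, rates in results.items()}

for (model, task), (best, worst, gap) in swings.items():
    print(f"{model} on {task}: {best} beats {worst} by {gap} points")
```

On these three pairs the gaps come out to 86, 80, and 33 percentage points, with the model held constant in every case.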

Open-Source Models’ Cost Book: $0.73 Per Success

On the SWE-bench coding benchmark, open-source models compete head-to-head with top closed-source models: DeepSeek V3.2 at 96% success, Kimi K2.5 at 94%, Claude Opus 4.5 at 100%, GPT-5.2 at 93%, Gemini 3 Pro at 87%.

The real watershed is on the cost side. Braintrust priced each run using LiteLLM’s actual token rates, then divided by success rate to get cost per success:

Task | Model | Harness | Cost per Success (USD)
SWE-bench | Kimi K2.5 | claude_code | $0.73
SWE-bench | DeepSeek V3.2 | claude_code | $1.27
SWE-bench | Claude Opus 4.5 | claude_code | $4.28
AppWorld | Kimi | smolagents_code | $0.40
AppWorld | Claude | claude_code | $84.33

Open-source models also have a structural cost advantage that closed-source models lack: they can be self-hosted, which removes per-call fees and exposure to API price hikes. For companies deploying agents at scale, this is a cost moat that short-term token price cuts can’t erase.

“Cheapest Tokens” ≠ “Highest Efficiency”

GPT-4.1 is the textbook cautionary tale in this analysis. Its token bill looks astonishingly good on paper: 10 to 100x cheaper than other models on equivalent tasks. But when Braintrust dissected each run trace, it found that GPT-4.1’s failure rate on SWE-bench and AppWorld ranges from 53% to 90%. It’s “cheap” because it fails faster.

A cost metric that ignores success rate isn’t an efficiency metric; it only measures how cheaply a run can fail. The right measure of efficiency is cost per success: average cost per run divided by success rate. This metric completely reshuffles the ranking.

On coding tasks, open-source models occupy the optimal frontier of cost-efficiency. On conversational customer service tasks, the picture flips entirely — GPT-4.1 leads at $0.02–$0.03 per success vs. Claude’s $1.95.

Key insight: There is no one-size-fits-all “cheapest model.” Coding tasks call for DeepSeek or Kimi self-hosted; customer service tasks favor GPT-4.1. Different task families map to entirely different cost-optimal solutions.
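One way to act on this is a per-task routing table that picks the cheapest configuration by cost per success. The sketch below uses only the SWE-bench and AppWorld figures from the cost table earlier in this article; the function name `cheapest_per_task` and the tuple layout are illustrative assumptions.

```python
# (task, model, harness, cost per success in USD), copied from the cost table.
rows = [
    ("SWE-bench", "Kimi K2.5", "claude_code", 0.73),
    ("SWE-bench", "DeepSeek V3.2", "claude_code", 1.27),
    ("SWE-bench", "Claude Opus 4.5", "claude_code", 4.28),
    ("AppWorld", "Kimi", "smolagents_code", 0.40),
    ("AppWorld", "Claude", "claude_code", 84.33),
]


def cheapest_per_task(rows):
    """Map each task to its lowest cost-per-success (model, harness, cost)."""
    best = {}
    for task, model, harness, cost in rows:
        if task not in best or cost < best[task][2]:
            best[task] = (model, harness, cost)
    return best


routing = cheapest_per_task(rows)

for task, (model, harness, cost) in routing.items():
    print(f"{task}: {model} via {harness} at ${cost:.2f} per success")
```

The winning harness differs between the two tasks even though the winning model family is the same, which is why the lookup is keyed on task rather than on model.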

No Universal Model, Only Task-Optimal Solutions

Six benchmarks, four different champions. Claude wins SWE-bench, BrowseComp+, and TAU2 retail/telco support. Gemini takes TAU2 airline support at 100%. DeepSeek and Kimi lead significantly on AppWorld multi-app orchestration.

Even within the same harness, different models perform drastically differently. In AppWorld, Claude under its own claude_code harness achieves only 26%, far below DeepSeek’s 80% and Kimi’s 78% under the same harness. Model-task fit and model-harness synergy predict final performance far better than absolute model parameter size.

Braintrust also found that a high average success rate can mask a severe collapse on an individual task type. For startups, the lesson is not to bet on a single model but to build a differentiated model-harness matrix by task type.
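A small sketch shows how an average can hide a collapse. The per-task rates below are hypothetical, except that the `multi_app` value mirrors the 26% AppWorld figure cited above; the 50% floor is likewise an assumed cutoff, not one taken from the report.

```python
from statistics import mean

# Hypothetical per-task success rates for a single model/harness pairing.
success_by_task = {
    "coding": 0.98,
    "browsing": 0.95,
    "retail_support": 0.93,
    "airline_support": 0.97,
    "telco_support": 0.96,
    "multi_app": 0.26,  # mirrors the 26% AppWorld result mentioned above
}


def find_collapses(rates, floor=0.5):
    """Return the tasks whose success rate falls below the floor."""
    return sorted(task for task, rate in rates.items() if rate < floor)


average = mean(success_by_task.values())
collapsed = find_collapses(success_by_task)

print(f"average success rate: {average:.0%}")
print(f"tasks below floor: {collapsed}")
```

The average here stays above 80% while one task family fails roughly three times out of four, so a dashboard that reports only the mean would miss it.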

Two Failure Modes, Two Opposite Monitoring Strategies

Agents fail in opposite ways on coding and conversational tasks:

  • Coding/multi-app tasks: Failure comes with “turbulence.” Agents make more LLM calls, consume more tokens, and run longer than successful peers. BrowseComp+ failed runs consume 2.3x the tokens of successful runs. These tasks need token usage ceiling alerts to cut losses when agents get stuck in loops.
  • Conversational tasks: The pattern reverses entirely. Failed agents make fewer calls, use fewer tokens, and finish faster — no turbulence, just confidently delivering a wrong answer. These tasks need floor alerts to catch “too-smooth wrong deliveries.”

A single threshold applied to both task types catches one failure mode and misses the other.
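The two alert directions can be expressed as per-task-type token bounds. This is a minimal sketch under stated assumptions: the task-type labels, the threshold values, and the `check_run` helper are hypothetical, and in practice the bounds would likely be tuned from the token distribution of known-good runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenBounds:
    floor: Optional[int] = None    # alert when a run uses fewer tokens
    ceiling: Optional[int] = None  # alert when a run uses more tokens


# Hypothetical thresholds, chosen only to illustrate the two directions.
BOUNDS = {
    "coding": TokenBounds(ceiling=400_000),      # guard against runaway loops
    "conversational": TokenBounds(floor=2_000),  # guard against too-quick answers
}


def check_run(task_type: str, tokens_used: int) -> Optional[str]:
    """Return an alert name if the run falls outside its task type's bounds."""
    bounds = BOUNDS[task_type]
    if bounds.ceiling is not None and tokens_used > bounds.ceiling:
        return "ceiling_alert"
    if bounds.floor is not None and tokens_used < bounds.floor:
        return "floor_alert"
    return None


print(check_run("coding", 500_000))       # run exceeded the coding ceiling
print(check_run("conversational", 500))   # run finished suspiciously fast
```

Keeping the bounds in a table keyed by task type means a cheap conversational run is never judged against the coding ceiling, and a long coding run is never judged against the conversational floor.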

Three Action Items for AI Agent Builders

Braintrust’s data tells a more fundamental story than “whose model scores higher.” What is starting to separate winners are three capabilities beyond the model:

  1. Match the optimal harness per task type. Don’t default to “official SDK” or “most popular framework.” Run A/B tests on real tasks.
  2. Measure efficiency by cost per success. Don’t be fooled by per-token costs. Success rate is the denominator that determines real cost.
  3. Build differentiated failure monitoring. Coding tasks guard against “turbulence”; conversational tasks guard against “confident errors.”

The narrative is no longer about whose model is better — it’s about who delivers what success rate at what cost structure, in what task scenario. The battleground has shifted from models to cost, efficiency, and engineering.


Based on Braintrust’s analysis of 1,781 production traces from Hugging Face, covering 6 models across 6 task categories. Read the full report on Braintrust.
