yieldOS benchmark dashboard
yieldOS turns risky agent output into an executable commit boundary
Across public and private real-repo deterministic runs, every tested unsafe control commit landed without yieldOS, and every matching yieldOS-gated commit was stopped before it could land. The calibration layer keeps an important nuance visible: not every realistic security issue is an instant deterministic stop today.
Benchmark story
Strong guardrail, honest limits
The presentation should lead with prevention and friction: unsafe changes that would have landed without yieldOS are blocked, benign commits are allowed, and live model runs show the guardrail operating at the point where generated code would enter the repo.
This dashboard is local-review evidence for product calibration. Run npm run evidence:verify before using any report as public proof. Stating these limits makes the benchmark more credible: yieldOS has a measured safety boundary today and a clear path to expand coverage tomorrow.
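As a minimal sketch of how the verification step could gate publishing, the helper below runs a command and reports whether it exited cleanly. The function name evidence_verified is hypothetical, and the sketch assumes that npm run evidence:verify exits with a non-zero status when verification fails; the source does not describe the script's behavior beyond its name.

```python
import subprocess

# Command named in the text above. Running it needs npm and the project's
# package.json; on some platforms the executable may be named differently.
VERIFY_CMD = ["npm", "run", "evidence:verify"]


def evidence_verified(cmd=VERIFY_CMD):
    """Return True only if the verification command exits with status 0.

    Assumption: a failed verification is signalled by a non-zero exit code.
    """
    try:
        result = subprocess.run(cmd, check=False)
    except FileNotFoundError:
        # The command could not be started, so nothing was verified.
        return False
    return result.returncode == 0
```

A publishing script could call evidence_verified() first and stop when it returns False, so that an unverified report is never shared as public proof.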
Coverage calibration
Balanced cases: prevent known risks, allow safe work, identify deeper review
Core evidence
Prevention without broad false positives
Deterministic replay
Known unsafe commits in disposable real-repo clones
Review cost route
Calibration set: deterministic stop plus agent-assisted escalation
Delta: $4.68 across this calibration set. This figure models routing cost; it does not claim the deeper-review cases were automatically repaired.
False-positive replay
Benign public commits
Dollar values are assumption-based and intentionally small-scope. They model avoided review passes for this benchmark set, not total company-wide savings.
Live model workflow
Expanded frontier slice: outcomes by task
Safety charts include only evaluable model patches. The point is not to rank model intelligence; it is to show what happens when generated code reaches an executable commit boundary.
Expanded run outcomes by task
Evaluable generated patches only
Model economics
More expensive models still need a boundary
| Model arm | Cases | Accepted | Prevented | Provider cost |
|---|---|---|---|---|
| openai:gpt-5.5 / raw-agent | 16 | 12 | 4 | $0.27 |
| openai:gpt-5.5 / yieldos-guided-agent | 11 | 7 | 4 | $0.41 |
| anthropic:claude-opus-4-7 / raw-agent | 16 | 12 | 4 | $0.76 |
| anthropic:claude-opus-4-7 / yieldos-guided-agent | 14 | 11 | 3 | $1.48 |
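The table can be read more easily as rates. The sketch below copies the four rows and derives the prevented share and the provider cost per case for each arm. Two things here are assumptions rather than statements from the source: that the Provider cost column is the total for all cases in that arm, and that accepted plus prevented always equals the case count (which does hold for the rows shown). The name summarize is illustrative.

```python
# Rows copied from the table above:
# (model arm, cases, accepted, prevented, provider cost in USD).
ROWS = [
    ("openai:gpt-5.5 / raw-agent", 16, 12, 4, 0.27),
    ("openai:gpt-5.5 / yieldos-guided-agent", 11, 7, 4, 0.41),
    ("anthropic:claude-opus-4-7 / raw-agent", 16, 12, 4, 0.76),
    ("anthropic:claude-opus-4-7 / yieldos-guided-agent", 14, 11, 3, 1.48),
]


def summarize(rows):
    """Derive the prevented share and cost per case for each model arm."""
    summary = []
    for arm, cases, accepted, prevented, cost in rows:
        # Consistency check: each case is counted as accepted or prevented.
        if accepted + prevented != cases:
            raise ValueError(f"row does not add up: {arm}")
        summary.append({
            "arm": arm,
            "prevented_share": prevented / cases,
            # Assumes the cost column is the arm's total, not a per-case value.
            "cost_per_case": cost / cases,
        })
    return summary


for row in summarize(ROWS):
    print(f"{row['arm']}: prevented {row['prevented_share']:.0%}, "
          f"${row['cost_per_case']:.3f} per case")
```

Because the arms have different case counts (16, 11, 16, 14), shares and per-case costs are safer to compare than the raw totals.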
Premium spotcheck outcomes
Evaluable patches on pinned express
Frontier models can be slower and more expensive, so safety has to be enforced at the workflow boundary instead of assumed from model choice.
Provider costs use measured token usage for live runs and the assumptions file for review-cost comparison. Refresh provider pricing before using this as public billing proof.