yieldOS benchmark dashboard

yieldOS turns risky agent output into an executable commit boundary

Across public and private real-repo deterministic runs, every tested unsafe control commit landed without yieldOS and every matching yieldOS-gated commit was stopped before commit. The calibration layer keeps the important nuance visible: not every realistic security issue is an instant deterministic stop today.

32/32 Known unsafe replayed commits stopped before commit
83% Calibration cases handled immediately and correctly
17% Realistic deeper-review candidates kept visible
0/27 Benign public commits blocked in false-positive replay

Benchmark story

Strong guardrail, honest limits

Defensible claim: yieldOS stops known unsafe patterns before commit and lets safe controls pass. It is a workflow harness, not a claim that all possible bugs disappear.
Honest limit: the calibration set keeps a small slice of deeper cases that should become future oracles or agent-assisted review escalations.

The presentation should lead with prevention and friction: unsafe changes that would have landed without yieldOS are blocked, benign commits are allowed, and live model runs show the guardrail operating at the point where generated code would enter the repo.
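The three outcomes named on this dashboard (prevented, accepted, deeper review) can be pictured as a small decision function at the commit boundary. The following is a minimal sketch under stated assumptions, not yieldOS's implementation: the `Rule` type, the `gate` function, and both regex rules are hypothetical and exist only to illustrate a deterministic stop alongside an escalation route.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accepted"         # safe control: the commit proceeds
    PREVENT = "prevented"       # known unsafe pattern: the commit is stopped
    ESCALATE = "deeper-review"  # no deterministic oracle yet: route to review


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    decision: Decision


# Hypothetical rules for illustration only; these are not yieldOS's checks.
RULES = [
    Rule("sql-string-concat", re.compile(r"SELECT .*\+\s*\w+"), Decision.PREVENT),
    Rule("auth-change", re.compile(r"\b(isAdmin|requireAuth)\b"), Decision.ESCALATE),
]


def gate(added_lines: list[str]) -> Decision:
    """Return the strictest decision triggered by any added line."""
    result = Decision.ACCEPT
    for line in added_lines:
        for rule in RULES:
            if rule.pattern.search(line):
                if rule.decision is Decision.PREVENT:
                    return Decision.PREVENT  # a deterministic stop wins outright
                result = Decision.ESCALATE
    return result
```

In this sketch a single matching stop rule blocks the whole change, while an escalation match only marks the change for review. That ordering is a design assumption chosen to mirror the "deterministic stop plus agent-assisted escalation" wording used in the calibration section.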

This dashboard is local-review evidence for product calibration. Run the command npm run evidence:verify before using any report as public proof. Keeping the limits visible makes the benchmark more credible: yieldOS has a measured safety boundary today and a clear path to expand coverage tomorrow.

Coverage calibration

Balanced cases: prevent known risks, allow safe work, identify deeper review

Calibration set: 12 cases
Legend: Prevented, Accepted, Deeper review
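The dashboard reports the calibration result as percentages (83% and 17%) over a set of 12 cases, without listing the underlying counts. The short sketch below recovers the counts that are consistent with those rounded percentages. It assumes both percentages are taken over the same 12 cases and rounded to the nearest whole percent; the helper name `counts_matching` is illustrative.

```python
CALIBRATION_CASES = 12  # calibration set size shown on the dashboard


def counts_matching(percent: int, total: int) -> list[int]:
    """All case counts k for which round(100 * k / total) equals percent."""
    return [k for k in range(total + 1) if round(100 * k / total) == percent]


handled = counts_matching(83, CALIBRATION_CASES)  # cases handled immediately
deeper = counts_matching(17, CALIBRATION_CASES)   # deeper-review candidates

print(handled, deeper)
```

Under those assumptions only one split fits: 10 cases handled immediately and 2 kept as deeper-review candidates.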

Core evidence

Prevention without broad false positives

Deterministic replay

Known unsafe commits in disposable real-repo clones

Public repos: 16
Local/private repos: 16
Legend: Prevented

Review cost route

Calibration set: deterministic stop plus agent-assisted escalation

Without yieldOS: $5.40
With yieldOS: $0.72

Delta: $4.68 across this calibration model. This models routing cost; it does not claim the deeper-review cases were automatically repaired.
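The delta can be reproduced directly from the two figures shown. The sketch below uses only the dashboard's own numbers; the relative reduction is a derived value that does not appear on the dashboard, and the variable names are illustrative.

```python
from decimal import Decimal

# Modeled review cost for the calibration set, as shown on the dashboard.
cost_without = Decimal("5.40")
cost_with = Decimal("0.72")

delta = cost_without - cost_with
# Relative reduction, rounded to one decimal place (derived, not reported).
reduction_pct = (delta / cost_without * 100).quantize(Decimal("0.1"))

print(f"delta=${delta}, reduction={reduction_pct}%")
```

Decimal arithmetic is used here so that the currency values subtract exactly rather than picking up binary floating-point error.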

False-positive replay

Benign public commits

Allowed benign commits: 27
Blocked benign commits: 0
Legend: Accepted, Blocked
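A result of 0 blocked out of 27 is a small sample, so it helps to state what false-positive rate it can rule out. The sketch below computes the standard one-sided exact binomial upper bound for the zero-event case, which solves (1 - p)^n = 1 - confidence for p. It assumes the 27 replayed commits behave like independent draws from the population of benign commits; that is a modeling assumption, not something the dashboard states.

```python
def upper_bound_zero_events(n: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on an event rate when 0 of n trials showed the event."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


bound = upper_bound_zero_events(27)
print(f"95% upper bound on the false-positive rate: {bound:.1%}")
```

For n = 27 the bound comes out at roughly 10.5%, close to the "rule of three" approximation of 3/n. A larger benign replay set would tighten it.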

Dollar values are assumption-based and intentionally small-scope. They model avoided review passes for this benchmark set, not total company-wide savings.

Live model workflow

Expanded frontier slice: outcomes by task

Safety charts include only evaluable model patches. The point is not to rank model intelligence; it is to show what happens when generated code reaches an executable commit boundary.

Expanded run outcomes by task

Evaluable generated patches only

admin-users-route: 16
webhook-importer: 11
sql-search-endpoint: 16
public-profile-read: 14
Legend: Accepted, Prevented
57 Evaluable generated patches in expanded run
15 Generated changes stopped by yieldOS
42 Generated changes accepted by yieldOS
$4.06 Measured provider usage in the expanded run

Model economics

More expensive models still need a boundary

Model arm                                         Cases  Accepted  Prevented  Provider cost
openai:gpt-5.5 / raw-agent                        16     12        4          $0.27
openai:gpt-5.5 / yieldos-guided-agent             11     7         4          $0.41
anthropic:claude-opus-4-7 / raw-agent             16     12        4          $0.76
anthropic:claude-opus-4-7 / yieldos-guided-agent  14     11        3          $1.48
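The per-arm rows can be cross-checked against the summary cards for the expanded run (57 evaluable patches, 42 accepted, 15 prevented). The sketch below assumes each row reads as cases, accepted, prevented, in that order, which is the only split of the digits where accepted plus prevented equals cases in every row. The variable names are illustrative.

```python
# (model arm, cases, accepted, prevented), transcribed from the table above.
ARMS = [
    ("openai:gpt-5.5 / raw-agent", 16, 12, 4),
    ("openai:gpt-5.5 / yieldos-guided-agent", 11, 7, 4),
    ("anthropic:claude-opus-4-7 / raw-agent", 16, 12, 4),
    ("anthropic:claude-opus-4-7 / yieldos-guided-agent", 14, 11, 3),
]

# Every row should be internally consistent before totals are trusted.
rows_consistent = all(acc + prev == cases for _, cases, acc, prev in ARMS)

total_cases = sum(cases for _, cases, _, _ in ARMS)
total_accepted = sum(acc for _, _, acc, _ in ARMS)
total_prevented = sum(prev for _, _, _, prev in ARMS)

# Share of evaluable patches stopped at the commit boundary, per arm.
prevented_share = {name: prev / cases for name, cases, _, prev in ARMS}

print(total_cases, total_accepted, total_prevented)
for name, share in prevented_share.items():
    print(f"{name}: {share:.0%} prevented")
```

The totals match the summary cards, and every arm has at least one prevented patch, which is consistent with the section's point that model choice alone does not remove the need for a boundary.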

Premium spotcheck outcomes

Evaluable patches on pinned express

openai:gpt-5.5 / raw-agent: 2
openai:gpt-5.5 / yieldos-guided-agent: 2
openai:gpt-5.5-pro / raw-agent: 2
anthropic:claude-sonnet-4-6 / raw-agent: 2
anthropic:claude-sonnet-4-6 / yieldos-guided-agent: 2
anthropic:claude-opus-4-7 / raw-agent: 2
anthropic:claude-opus-4-7 / yieldos-guided-agent: 2
Legend: Accepted, Prevented
2m 42s Premium spotcheck p95 runtime

Frontier models can be slower and more expensive, so safety has to be enforced at the workflow boundary instead of assumed from model choice.

Provider costs use measured token usage for live runs and the assumptions file for review-cost comparison. Refresh provider pricing before using this as public billing proof.