yieldOS benchmark dashboard

yieldOS turns risky agent output into an executable commit boundary

Across public and private real-repo deterministic runs, every tested unsafe control commit landed without yieldOS and every matching yieldOS-gated commit was stopped before commit. The calibration layer keeps the important nuance visible: not every realistic security issue is an instant deterministic stop today.

32/32: Known unsafe replayed commits stopped before commit

83%: Calibration cases handled immediately and correctly

17%: Realistic deeper-review candidates kept visible

0/27: Benign public commits blocked in false-positive replay
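
The two replay headlines reduce to simple ratios. A minimal sketch, assuming only the counts shown above; the helper name `rate` is illustrative and not part of yieldOS:

```python
from fractions import Fraction


def rate(hits: int, total: int) -> Fraction:
    """Return hits/total as an exact fraction; reject an empty sample."""
    if total <= 0:
        raise ValueError("total must be positive")
    return Fraction(hits, total)


# Counts taken from the dashboard headlines.
unsafe_stopped = rate(32, 32)  # unsafe replayed commits stopped before commit
benign_blocked = rate(0, 27)   # benign public commits blocked in replay

print(f"unsafe stop rate: {float(unsafe_stopped):.0%}")
print(f"benign block rate: {float(benign_blocked):.0%}")
```

Both samples are small (32 and 27 commits), so these ratios describe this benchmark set rather than a general rate.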

Benchmark story

Strong guardrail, honest limits

Defensible claim: yieldOS stops known unsafe patterns before commit and lets safe controls pass. It is a workflow harness, not a claim that all possible bugs disappear.

Honest limit: the calibration set keeps a small slice of deeper cases that should become future oracles or agent-assisted review escalations.
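
The three outcomes described here (deterministic stop, pass, deeper review) can be pictured as a small routing function. This is a sketch under stated assumptions, not yieldOS's implementation: the `Verdict` enum, the `route` function, and every rule pattern below are hypothetical and chosen only to make the idea concrete.

```python
import re
from enum import Enum


class Verdict(Enum):
    PREVENTED = "prevented"
    ACCEPTED = "accepted"
    DEEPER_REVIEW = "deeper review"


# Hypothetical rules for illustration; not yieldOS's actual rule set.
BLOCK_RULES = [
    re.compile(r"\beval\("),              # dynamic code execution
    re.compile(r"SELECT\b.*[\"']\s*\+"),  # SQL built by string concatenation
]
REVIEW_RULES = [
    re.compile(r"\b(auth|role|admin)\b", re.IGNORECASE),  # access-control changes
]


def route(added_lines: list[str]) -> Verdict:
    """Classify the added lines of a patch into one of three outcomes."""
    # A known unsafe pattern is a deterministic stop.
    if any(rule.search(line) for line in added_lines for rule in BLOCK_RULES):
        return Verdict.PREVENTED
    # Sensitive but not provably unsafe changes stay visible for review.
    if any(rule.search(line) for line in added_lines for rule in REVIEW_RULES):
        return Verdict.DEEPER_REVIEW
    return Verdict.ACCEPTED
```

Block rules are checked first so that a patch matching both kinds of rule is stopped rather than escalated.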

The presentation should lead with prevention and low friction: unsafe changes that would have landed without yieldOS are blocked, benign commits are allowed, and live model runs show the guardrail operating at the point where generated code would enter the repo.

This dashboard is local-review evidence for product calibration. Before using any report as public proof, run the command npm run evidence:verify. Stating these limits makes the benchmark more credible: yieldOS has a measured safety boundary today and a clear path to expand coverage tomorrow.

Coverage calibration

Balanced cases: prevent known risks, allow safe work, identify deeper review

Chart: Calibration set (legend: Prevented, Accepted, Deeper review)

Core evidence

Prevention without broad false positives

Deterministic replay

Chart: Known unsafe commits in disposable real-repo clones (categories: Public repos, Local/private repos; legend: Prevented)

Review cost route

Calibration set: deterministic stop plus agent-assisted escalation

Without yieldOS: $5.40

With yieldOS: $0.72

Delta: $4.68 across this calibration model. This models routing cost; it does not claim the deeper-review cases were automatically repaired.
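
The delta follows directly from the two modeled totals. A minimal sketch using only the figures shown in this section; the variable names are illustrative, and the per-review assumptions behind each total are not reproduced here:

```python
from decimal import Decimal

# Modeled review-routing cost for the calibration set, from the dashboard.
without_yieldos = Decimal("5.40")
with_yieldos = Decimal("0.72")

# Absolute and relative difference between the two routes.
delta = without_yieldos - with_yieldos
reduction = delta / without_yieldos

print(f"delta: ${delta}")
print(f"relative reduction: {float(reduction):.1%}")
```

Decimal arithmetic keeps the dollar amounts exact, which avoids float rounding in a two-decimal currency comparison.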

False-positive replay

Chart: Benign public commits (categories: Allowed benign commits, Blocked benign commits; legend: Accepted, Blocked)

Dollar values are assumption-based and intentionally small-scope. They model avoided review passes for this benchmark set, not total company-wide savings.

Live model workflow

Expanded frontier slice: outcomes by task

Safety charts include only evaluable model patches. The point is not to rank model intelligence; it is to show what happens when generated code reaches an executable commit boundary.

Chart: Expanded run outcomes by task, evaluable generated patches only (tasks: admin-users-route, webhook-importer, sql-search-endpoint, public-profile-read; legend: Accepted, Prevented)

57: Evaluable generated patches in expanded run

15: Generated changes stopped by yieldOS

42: Generated changes accepted by yieldOS

$4.06: Measured provider usage in the expanded run

Model economics

More expensive models still need a boundary

Model arm	Cases	Accepted	Prevented	Provider cost
openai:gpt-5.5 / raw-agent	16	12	4	$0.27
openai:gpt-5.5 / yieldos-guided-agent	11	7	4	$0.41
anthropic:claude-opus-4-7 / raw-agent	16	12	4	$0.76
anthropic:claude-opus-4-7 / yieldos-guided-agent	14	11	3	$1.48
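
The per-arm rows can be cross-checked against the headline counts. A minimal sketch, assuming the table values above are transcribed as tuples; costs are held in integer cents to avoid float rounding, and the variable names are illustrative:

```python
# (model arm, cases, accepted, prevented, provider cost in cents)
rows = [
    ("openai:gpt-5.5 / raw-agent", 16, 12, 4, 27),
    ("openai:gpt-5.5 / yieldos-guided-agent", 11, 7, 4, 41),
    ("anthropic:claude-opus-4-7 / raw-agent", 16, 12, 4, 76),
    ("anthropic:claude-opus-4-7 / yieldos-guided-agent", 14, 11, 3, 148),
]

# Each row should be internally consistent: accepted + prevented == cases.
for arm, cases, accepted, prevented, _cents in rows:
    assert accepted + prevented == cases, arm

total_cases = sum(r[1] for r in rows)
total_accepted = sum(r[2] for r in rows)
total_prevented = sum(r[3] for r in rows)
total_cents = sum(r[4] for r in rows)

print(f"cases={total_cases} accepted={total_accepted} prevented={total_prevented}")
print(f"provider cost for listed arms: ${total_cents / 100:.2f}")
```

The case totals match the 57, 42, and 15 headline figures. The four listed arms sum to $2.92 in provider cost; the table does not itemize the difference from the $4.06 headline.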

Premium spotcheck outcomes

Chart: Evaluable patches on pinned express (arms: openai:gpt-5.5 / raw-agent, openai:gpt-5.5 / yieldos-guided-agent, openai:gpt-5.5-pro / raw-agent, anthropic:claude-sonnet-4-6 / raw-agent, anthropic:claude-sonnet-4-6 / yieldos-guided-agent, anthropic:claude-opus-4-7 / raw-agent, anthropic:claude-opus-4-7 / yieldos-guided-agent; legend: Accepted, Prevented)

2m 42s: Premium spotcheck p95 runtime

Frontier models can be slower and more expensive, so safety has to be enforced at the workflow boundary instead of assumed from model choice.

Provider costs use measured token usage for live runs and the assumptions file for review-cost comparison. Refresh provider pricing before using this as public billing proof.