Coding efficiency

AI Coding Assistant Token-Efficiency Benchmark

A paired-run benchmark that measures whether XFlowIQ reduces total AI token consumption while maintaining or improving accepted engineering outcomes.

Evidence Contract

Every comparison must carry the same proof fields.

required

XFlowIQ build number and commit SHA

Codex or model version identifiers

Judge or verifier model identifiers

Repository commit, Node version, package lock hash, and OS build
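
One way to make these provenance fields hard to omit is a frozen record that can report which values are blank. The following is a minimal sketch only: the class name, field names, and the missing_fields helper are assumptions for illustration, not XFlowIQ's actual schema.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RunProvenance:
    """Hypothetical provenance record; all field names are illustrative."""

    xflowiq_build: str
    xflowiq_commit_sha: str
    model_version: str
    judge_model_version: str
    repo_commit: str
    node_version: str
    package_lock_hash: str
    os_build: str

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A comparison could then be rejected whenever missing_fields() returns a non-empty list for either run.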

required

Same repository state for baseline and XFlow-assisted run

Same task prompt, acceptance threshold, and available tools

Same environment settings, with no hidden manual fixes
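
The pairing rules above can be checked mechanically before any tokens are compared. This sketch assumes each run is described by a plain dictionary; the key names in PAIRED_KEYS are hypothetical stand-ins for however a real log records these conditions.

```python
# Hypothetical keys that must match between the baseline and the XFlow-assisted run.
PAIRED_KEYS = (
    "repo_commit",
    "task_prompt",
    "acceptance_threshold",
    "tools",
    "environment",
)


def pairing_mismatches(baseline: dict, assisted: dict) -> list:
    """Return the keys on which the two runs differ (a missing key counts as None)."""
    return [key for key in PAIRED_KEYS if baseline.get(key) != assisted.get(key)]
```

An empty result means the two runs share the same recorded conditions. It cannot detect a hidden manual fix, which still has to be ruled out by review.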

required

Single-assistant baseline using the same model family where practical

XFlow four-AI lane with the same acceptance threshold

Both lanes count input and output tokens, system prompts, compression, tool calls, retries, failed runs, inter-AI messages, and judge calls
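
To keep the two lanes comparable, the total can be computed from an explicit list of buckets so that none is silently dropped. The sketch below is an assumption about how usage might be recorded (one integer token count per bucket); the bucket key names are illustrative.

```python
# Bucket names mirror the list above; the exact keys are hypothetical.
COUNTED_BUCKETS = (
    "input",
    "output",
    "system_prompts",
    "compression",
    "tool_calls",
    "retries",
    "failed_runs",
    "inter_ai_messages",
    "judge_calls",
)


def total_tokens(usage: dict) -> int:
    """Sum tokens across every counted bucket, refusing to skip any of them."""
    missing = [bucket for bucket in COUNTED_BUCKETS if bucket not in usage]
    if missing:
        raise ValueError(f"uncounted buckets: {missing}")
    return sum(usage[bucket] for bucket in COUNTED_BUCKETS)
```

Under this design a single-assistant baseline still reports inter_ai_messages, as an explicit 0, so both lanes are summed over the same buckets.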

required

Report median savings and the p10/p25/p75/p90 distribution

Report pass rate, total tokens, cost, latency, human corrections, and time to accepted output

Report failed or rejected paired runs
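
The savings distribution can be derived directly from per-pair token totals. This is a sketch under stated assumptions: savings is defined here as 1 - xflow_tokens / baseline_tokens (a negative value means the XFlow lane used more), and percentiles use linear interpolation between ranks. Neither choice is confirmed by the benchmark itself.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in 0..100) over pre-sorted values."""
    if not sorted_values:
        raise ValueError("no paired runs to summarize")
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def savings_summary(pairs):
    """Summarize fractional token savings for (baseline_tokens, xflow_tokens) pairs."""
    savings = sorted(1 - xflow / baseline for baseline, xflow in pairs)
    return {f"p{p}": percentile(savings, p) for p in (10, 25, 50, 75, 90)}
```

Here p50 is the median. Pairs that failed or were rejected should be passed in as well, with their full token totals, so the distribution is not computed over wins only.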

required

A single assistant may be faster or cheaper on small tasks

XFlowIQ must prove value on complex, multi-step, evidence-heavy work where coordination matters

required

Current fixture is sample-only until real paired logs are attached

Savings claims are locked until multiple task families are covered

A token win without equal or better pass rate is not a product win
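
The last rule can be expressed as a simple gate. The function and parameter names below are hypothetical; the sketch only encodes the stated condition that a token reduction counts when the pass rate is equal or better.

```python
def is_product_win(baseline_tokens, xflow_tokens, baseline_pass_rate, xflow_pass_rate):
    """A token saving counts only if the pass rate did not drop."""
    saves_tokens = xflow_tokens < baseline_tokens
    holds_quality = xflow_pass_rate >= baseline_pass_rate
    return saves_tokens and holds_quality
```

For example, is_product_win(1000, 800, 0.9, 0.8) is False: the run used fewer tokens, but the pass rate fell.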

required

Benchmark JSON from /api/xflowiq/engineering-token-benchmark

Accepted diffs, tests, receipts, and verifier results

Public methodology explaining all counted buckets

Independence

Public pages should keep this line visible so comparison SEO stays clean, honest, and reviewable.