Erabot Eval Report — v5.0

What this measures

Call-site detection. Given a source file, does the scanner correctly identify every LLM API call site — OpenAI, Anthropic, Gemini, LangChain, LlamaIndex — and extract the model, streaming flag, tools, and token usage? This is the precondition for every downstream finding.
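To make "call-site detection" concrete, here is a toy sketch that uses Python's standard `ast` module. It is not Erabot's scanner: the `CallSite` record, the `find_call_sites` helper, and the matching rule (any `.create(...)` call that passes `model=`) are assumptions chosen for illustration, and it covers only the model and streaming flag.

```python
import ast
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CallSite:
    """Hypothetical record for one detected LLM call."""
    line: int
    model: Optional[str]  # None when the model is not a string literal
    stream: bool


def find_call_sites(source: str) -> List[CallSite]:
    """Toy rule: any `<something>.create(...)` call with a `model=` keyword."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Attribute) and func.attr == "create"):
            continue
        kwargs = {kw.arg: kw.value for kw in node.keywords if kw.arg}
        if "model" not in kwargs:
            continue
        model_node = kwargs["model"]
        # Only literal model names can be resolved statically in this sketch.
        is_literal = isinstance(model_node, ast.Constant) and isinstance(model_node.value, str)
        model = model_node.value if is_literal else None
        stream_node = kwargs.get("stream")
        stream = isinstance(stream_node, ast.Constant) and stream_node.value is True
        sites.append(CallSite(node.lineno, model, stream))
    return sorted(sites, key=lambda s: s.line)


SAMPLE = '''\
from openai import OpenAI
client = OpenAI()
resp = client.chat.completions.create(model="gpt-4", stream=True, messages=[])
'''

print(find_call_sites(SAMPLE))
# -> [CallSite(line=3, model='gpt-4', stream=True)]
```

A call whose model comes from a variable would be reported with `model=None` here, which is the kind of case the `dynamic_model` pattern stratum exists to track.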

What it does not measure. Whether the findings the auditor LLM writes on top of those detections are high-quality recommendations. That is a separate dimension, reported on its own and labeled clearly: see the static-analysis confidence bands in each finding's report card (high / medium / low / directional only).

A customer who asks "when Erabot says there's a GPT-4 call at line 47, is it right?" should read this page. A customer who asks "are Erabot's fix recommendations worth applying?" should run a scan against their own code and judge the output.

Methodology

This report evaluates the Erabot code scanner against ground-truth .expected.yaml manifests for 105 real-world open-source repositories plus ~80 synthetic fixtures, stratified by language, LLM provider, and call complexity. The auditor model is gemini-2.5-flash.
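As a rough illustration of how detections could be scored against a ground-truth manifest, the sketch below treats each call site as a `(file, line)` pair and compares sets. The manifest schema is not shown in this report, so the `Site` key and the `score` helper are assumptions, not the harness's actual matching logic.

```python
from typing import Set, Tuple

Site = Tuple[str, int]  # (file path, line number); a hypothetical match key


def score(expected: Set[Site], detected: Set[Site]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one repository."""
    tp = len(expected & detected)   # found and in the manifest
    fp = len(detected - expected)   # found but not in the manifest
    fn = len(expected - detected)   # in the manifest but missed
    return tp, fp, fn


# Made-up example data, not taken from the eval set.
expected = {("app.py", 47), ("app.py", 90)}
detected = {("app.py", 47), ("worker.py", 12)}
print(score(expected, detected))  # -> (1, 1, 1)
```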

Scan config: scan_mode=standard, completion_ratio=0.5, calls_per_month=1000

Per-stratum F1 metrics include a bootstrap 95% CI, computed via the percentile method with n=1000 resamples and a fixed seed=42 for reproducibility. Strata with fewer than 15 repos are merged into an "Other" bucket. Buckets whose CI width exceeds 0.15 are annotated with a warning marker.
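A minimal sketch of a percentile bootstrap with these parameters follows. It assumes repositories are the resampling unit and that F1 is micro-averaged over the resampled TP/FP/FN counts. The report does not confirm either choice, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (tp, fp, fn) for one repository


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    # Assumption: an empty sample scores 0.0; the real harness may differ.
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(repos: List[Counts], n_resamples: int = 1000,
                    seed: int = 42, alpha: float = 0.05) -> Tuple[float, float]:
    """Percentile bootstrap CI for micro-averaged F1, resampling repos."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    stats = []
    for _ in range(n_resamples):
        sample = [rng.choice(repos) for _ in repos]  # draw with replacement
        tp = sum(r[0] for r in sample)
        fp = sum(r[1] for r in sample)
        fn = sum(r[2] for r in sample)
        stats.append(f1(tp, fp, fn))
    stats.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]


def needs_warning(lo: float, hi: float, max_width: float = 0.15) -> bool:
    """True when the interval is wider than the warning threshold."""
    return (hi - lo) > max_width


perfect = [(5, 0, 0)] * 10  # made-up counts: every repo fully correct
print(bootstrap_f1_ci(perfect))  # -> (1.0, 1.0)
```

When every repository in a stratum has zero false positives and zero false negatives, every resample is also perfect, so the interval collapses to [1.0000, 1.0000].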

Git SHA: d905cbe7

Overall F1

| Metric    | Value  |
|-----------|--------|
| Precision | 1.0000 |
| Recall    | 1.0000 |
| F1        | 1.0000 |
| TP        | 595    |
| FP        | 0      |
| FN        | 0      |
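The precision, recall, and F1 values follow from the TP/FP/FN counts by the standard definitions. A short check (the helper name is ours, not the harness's):

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; a ratio with a zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts from the Overall F1 table: TP=595, FP=0, FN=0.
print(precision_recall_f1(595, 0, 0))  # -> (1.0, 1.0, 1.0)
```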

Confusion Matrix

| Metric | Count |
|--------|-------|
| TP     | 595   |
| FP     | 0     |
| FN     | 0     |

Per-Language F1

Model: gemini-2.5-flash | Bootstrap 95% CI (n=1000, seed=42)

| Language | n  | F1     | Bootstrap 95% CI | Notes                          |
|----------|---:|-------:|------------------|--------------------------------|
| Other    | 8  | 1.0000 | [1.0000, 1.0000] | Merged: javascript, typescript |
| python   | 97 | 1.0000 | [1.0000, 1.0000] |                                |
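The "Other" row comes from the merge rule in Methodology: strata with fewer than 15 repos are pooled. A minimal sketch of that rule is below. The function name and the example counts are hypothetical, since the report gives only the merged totals.

```python
from typing import Dict, List, Tuple


def merge_small_strata(counts: Dict[str, int], min_n: int = 15,
                       other: str = "Other") -> Tuple[Dict[str, int], List[str]]:
    """Pool strata with fewer than `min_n` repos into a single bucket."""
    merged: Dict[str, int] = {}
    small: List[str] = []
    for name, n in counts.items():
        if n < min_n:
            small.append(name)
        else:
            merged[name] = n
    if small:
        merged[other] = sum(counts[name] for name in small)
    return merged, sorted(small)  # the names feed the "Merged: ..." note


# Made-up counts for illustration only.
example = {"lang_a": 40, "lang_b": 9, "lang_c": 4}
print(merge_small_strata(example))
# -> ({'lang_a': 40, 'Other': 13}, ['lang_b', 'lang_c'])
```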

Per-Provider F1

| Provider  | n  | F1     | Bootstrap 95% CI | Notes                                                 |
|-----------|---:|-------:|------------------|-------------------------------------------------------|
| Other     | 23 | 1.0000 | [1.0000, 1.0000] | Merged: bedrock, google, langchain, together, unknown |
| anthropic | 33 | 1.0000 | [1.0000, 1.0000] |                                                       |
| openai    | 49 | 1.0000 | [1.0000, 1.0000] |                                                       |

Per-Complexity F1

| Complexity | n  | F1     | Bootstrap 95% CI | Notes         |
|------------|---:|-------:|------------------|---------------|
| Other      | 3  | 1.0000 | [1.0000, 1.0000] | Merged: large |
| medium     | 45 | 1.0000 | [1.0000, 1.0000] |               |
| small      | 57 | 1.0000 | [1.0000, 1.0000] |               |

Per-Pattern-Type F1

| Pattern            | n  | F1     | Bootstrap 95% CI | Notes |
|--------------------|---:|-------:|------------------|-------|
| direct_sdk         | 15 | 1.0000 | [1.0000, 1.0000] |       |
| dynamic_model      | 37 | 1.0000 | [1.0000, 1.0000] |       |
| orchestrator_chain | 33 | 1.0000 | [1.0000, 1.0000] |       |
| unknown            | 20 | 1.0000 | [1.0000, 1.0000] |       |

Reproduction

To reproduce these results:

# Step 1: Run the detection eval harness
cd tests/eval-harness
PYTHONPATH=.:../../backend python3 run_eval.py --mode detection

# Step 2: Publish the eval report (runs derivation + gates internally)
PYTHONPATH=.:../../backend python3 scripts/publish_eval_report.py

# Validate only (for CI gate):
PYTHONPATH=.:../../backend python3 scripts/publish_eval_report.py --validate