Open-sourcing erabot: the CLI + scanner for LLM cost waste

Today we are releasing the core of erabot.ai as an MIT-licensed Python package you can install with pip:

pip install erabot
erabot scan ./src

The code is at github.com/rohan3008/erabot-cli. The tree-sitter detector that finds every LLM API call in your codebase, the Typer CLI that drives the scan + audit + apply loop, and the test corpus that validates detection accuracy — all open.

This post explains what we released, what we deliberately kept closed, how the two halves talk to each other, and why we think that split is the right one for an LLM-cost tool in 2026.

If you just want to try it, the README has a three-minute quickstart. If you want to know whether it makes sense to route through erabot.ai at all, keep reading.

The problem we are solving

Teams building on the OpenAI, Anthropic, and Gemini APIs mostly know their bill is too high. What they do not know is why. Observability tools like Helicone and Langfuse show you how much each endpoint costs. That is useful. It is also not a fix.

Here is the pattern we see constantly in customer code:

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": VERY_LONG_SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ],
    temperature=0,
)

Nothing about that request needs GPT-4. A classifier prompt at temperature=0 with a short context is the perfect shape for GPT-4o-mini, which is roughly 26x cheaper and delivers equivalent quality on 73% of real-world prompts in our benchmark corpus. But the developer typed model="gpt-4" six months ago and has not revisited it since. The observability tool cannot tell them that. The observability tool can only tell them they are spending money.

erabot reads the code, sees the call pattern, and writes a patch that Claude Code can apply in one command. That is the loop: detect → fix → re-measure. Open-sourcing the detect half lets you run it on your own code without signing up for anything, without trusting us with your source, and without a single network call.

What is open, what is closed

Here is the boundary, explicit so nobody has to guess:

| Component | Status | Where it lives |
| --- | --- | --- |
| Tree-sitter scanner | Open (MIT) | packages/scanner/ |
| erabot CLI (scan / audit / apply) | Open (MIT) | packages/cli/ |
| Audit engine (Gemini + RAG) | Closed | api.erabot.ai |
| Pricing tables + downgrade map | Closed | api.erabot.ai |
| Helicone / Langfuse runtime importers | Closed | api.erabot.ai |
| Dashboard, billing, team features | Closed | erabot.ai |

The scanner is commodity infrastructure. Writing tree-sitter queries against openai.chat.completions.create is the kind of work that gets better with community contribution, not worse. Somebody is going to submit a PR next week adding Ruby support. Another will add a SARIF output format so the findings can be uploaded to GitHub's code-scanning UI. We already seeded seven good-first-issues to make starting easy.

The audit engine is different. The Gemini prompt library took us nine months of iteration to reach the precision we have today: 100% detection F1 across 105 OSS repositories in our eval corpus (published at erabot.ai/eval). The RAG knowledge base is a curated set of pattern-to-recommendation mappings that took about as long to build. Those stay closed because giving them away does not help developers; it helps competitors skip the iteration.

Being upfront about that boundary is the entire point. We would rather tell you exactly what you are installing than pretend the whole thing is open.

How the two halves talk

When you run erabot scan without an API key, nothing leaves your machine:

$ erabot scan ./src
                 Detected 3 LLM call site(s)
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ File               ┃ Line ┃ Provider  ┃ Model                 ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ app.py             │    8 │ openai    │ gpt-4                 │
│ app.py             │   20 │ openai    │ text-embedding-3-small│
│ chains.py          │   14 │ anthropic │ claude-3-opus-...     │
└────────────────────┴──────┴───────────┴───────────────────────┘

That is pure static analysis. Tree-sitter parses your source, walks the AST, and matches call patterns. Zero LLM involvement. Zero network.
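To make the parse, walk, match sequence concrete, here is a minimal sketch. It uses Python's standard-library `ast` module as a stand-in for tree-sitter, so it only handles Python source, and the names `find_llm_calls`, `dotted_name`, and `CALL_SUFFIXES` are hypothetical. It is an illustration of the idea, not erabot's actual query patterns.

```python
import ast

# Hypothetical table: attribute-chain suffix -> provider it suggests.
CALL_SUFFIXES = {
    "chat.completions.create": "openai",
    "embeddings.create": "openai",
    "messages.create": "anthropic",
}


def dotted_name(node):
    """Return 'a.b.c' for a chain of Name/Attribute nodes, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_llm_calls(source, path="<string>"):
    """Parse source, walk every Call node, and match known call suffixes."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue
        for suffix, provider in CALL_SUFFIXES.items():
            if name == suffix or name.endswith("." + suffix):
                # Record the model only when it is a string literal.
                model = None
                for kw in node.keywords:
                    if (kw.arg == "model"
                            and isinstance(kw.value, ast.Constant)
                            and isinstance(kw.value.value, str)):
                        model = kw.value.value
                findings.append({
                    "file_path": path,
                    "line": node.lineno,
                    "provider": provider,
                    "model": model,
                })
                break
    return findings


sample = '''
import openai
r = openai.chat.completions.create(model="gpt-4", messages=[])
'''
for f in find_llm_calls(sample, "app.py"):
    print(f)
```

Nothing in this sketch calls a model or opens a socket; the only input is the source text.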

When you run erabot audit, the CLI concatenates your source, POSTs it to api.erabot.ai/api/scans/submit, polls until the audit completes, and downloads an agent-instructions.md file:

$ erabot audit ./src
⠋ Scanning (job 4f7a-...)…
✓ Wrote 8,432 bytes to agent-instructions.md
Apply with: erabot apply agent-instructions.md
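The submit, poll, download sequence can be sketched as below. Only the overall shape comes from the description above. The `submit`, `get_status`, and `download` callables, the `"done"` and `"failed"` status strings, and the timing parameters are all hypothetical placeholders, not the real erabot API or its response schema.

```python
import time


def run_audit(source_blob, submit, get_status, download,
              poll_interval=2.0, timeout=300.0, sleep=time.sleep):
    """Submit a job, poll until it finishes, then return the downloaded report.

    submit/get_status/download are injected callables (hypothetical) so the
    loop can be exercised without a network.
    """
    job_id = submit(source_blob)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        if status == "done":
            return download(job_id)
        if status == "failed":
            raise RuntimeError(f"audit job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"audit job {job_id} timed out")
        sleep(poll_interval)


# Offline demonstration with fake transport functions.
_statuses = iter(["queued", "running", "done"])
report = run_audit(
    "print('hello')",
    submit=lambda blob: "job-1",
    get_status=lambda job_id: next(_statuses),
    download=lambda job_id: "## Finding: example\n",
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(report)
```

Injecting the transport functions keeps the control flow testable offline; a real client would replace them with HTTP calls.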

The audit runs in our cloud because that is where the Gemini calls happen and where the pricing + downgrade heuristics live. The agent-instructions.md file is human-readable and structured for coding agents:

## Finding: GPT-4 used for prompt classification (app.py:8)

**Current monthly cost:** $420
**Projected savings:** $403/mo

### Action

Switch `model="gpt-4"` to `model="gpt-4o-mini"`. The classification
prompts are at temperature=0 with messages under 500 tokens; they do
not benefit from GPT-4's reasoning. Run the shadow eval below before
promoting.

Claude Code (or Cursor, or any agent that applies markdown-specified diffs) can read that file and apply the fix directly. The loop closes.
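Because the file is plain markdown, a script can also read the headline numbers out of it. The sketch below assumes every finding follows the shape of the excerpt above (a `## Finding:` heading ending in `(file:line)`, followed by bold cost lines). The real file may contain other sections, and `parse_findings` is a hypothetical helper, not part of the erabot CLI.

```python
import re

# Patterns modelled on the excerpt only; the real format may differ.
HEADING = re.compile(
    r"^## Finding: (?P<title>.+) \((?P<file>[^():]+):(?P<line>\d+)\)\s*$"
)
MONEY = re.compile(
    r"^\*\*(?P<label>[^*]+):\*\*\s*\$(?P<amount>[\d,]+(?:\.\d+)?)"
)


def parse_findings(markdown):
    """Collect title, location, and dollar figures for each finding."""
    findings = []
    for raw in markdown.splitlines():
        line = raw.strip()
        heading = HEADING.match(line)
        if heading:
            findings.append({
                "title": heading["title"],
                "file": heading["file"],
                "line": int(heading["line"]),
                "amounts": {},
            })
            continue
        money = MONEY.match(line)
        if money and findings:
            amount = float(money["amount"].replace(",", ""))
            findings[-1]["amounts"][money["label"]] = amount
    return findings


excerpt = """\
## Finding: GPT-4 used for prompt classification (app.py:8)

**Current monthly cost:** $420
**Projected savings:** $403/mo
"""
for finding in parse_findings(excerpt):
    current = finding["amounts"]["Current monthly cost"]
    saved = finding["amounts"]["Projected savings"]
    print(finding["file"], finding["line"], f"remaining ${current - saved:.0f}/mo")
```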

Install and first scan

Minimum path:

pip install erabot
erabot scan ./your-project

Full audit (requires a free API key from erabot.ai):

export ERABOT_API_KEY="erabot_..."
erabot audit ./your-project
erabot apply

The CLI uses Typer + Rich so the help output is readable and the progress indicators are informative rather than ugly. Source is at packages/cli/src/erabot/.

If you want the detector alone — no CLI, no API key, just import erabot_scanner in your own code — install the smaller package:

pip install erabot-scanner

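from pathlib import Path

from erabot_scanner import scan_files

# Each input is a dict holding the file's path and its full text.
path = Path("app.py")
findings = scan_files([
    {"path": str(path), "content": path.read_text(encoding="utf-8")},
])

# .get() tolerates findings that lack a provider or model key.
for f in findings:
    print(f["file_path"], f["line"], f.get("provider"), f.get("model"))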

This is handy if you are building a custom cost-analysis pipeline, a code-review bot, or anything else that needs to answer "is this file making LLM calls?"
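For a whole directory, the input list can be built with `pathlib`. In this sketch `collect_sources` and the `SUFFIXES` set are hypothetical helpers; only the `{"path": ..., "content": ...}` input shape and the `scan_files` call come from the snippet above. The import is guarded so the helper still runs where `erabot-scanner` is not installed.

```python
import tempfile
from pathlib import Path

# Assumed extension set, based on the languages listed in the eval section.
SUFFIXES = {".py", ".ts", ".js"}


def collect_sources(root, suffixes=SUFFIXES):
    """Return [{"path", "content"}] for every matching file under root."""
    sources = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            content = path.read_text(encoding="utf-8", errors="replace")
            sources.append({"path": str(path), "content": content})
    return sources


# Demonstrate on a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "app.py").write_text("import openai\n", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("not source\n", encoding="utf-8")
    sources = collect_sources(tmp)

print([Path(s["path"]).name for s in sources])

try:
    from erabot_scanner import scan_files
except ImportError:
    scan_files = None  # scanner not installed; the file list is still usable

if scan_files is not None:
    for f in scan_files(sources):
        print(f["file_path"], f["line"], f.get("provider"), f.get("model"))
```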

Detection accuracy: the number we are willing to publish

Every tool claims to be accurate. Few publish the F1. We do:

Call-site detection F1: 1.00 across 105 real-world OSS repositories plus 80 synthetic fixtures, stratified by language, provider, and call complexity.
Bootstrap 95% confidence interval: [1.00, 1.00] (n=1000 resamples, fixed seed for reproducibility).
Per-language: Python 1.00 · TypeScript 1.00 · JavaScript 1.00
Per-provider: OpenAI 1.00 · Anthropic 1.00 · Google/LangChain/Bedrock/Together/Unknown (merged) 1.00

The methodology, corpus composition, and full per-stratum table are at erabot.ai/eval.
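For readers who want to see how an interval like [1.00, 1.00] arises, here is a minimal sketch of F1 with a percentile bootstrap. It is not the evaluation harness behind the numbers above: the per-item outcome labels, the item-level resampling, and the function names are assumptions made for illustration, and the toy data is made up.

```python
import random


def f1_score(outcomes):
    """F1 from a list of 'tp' / 'fp' / 'fn' labels, one per call site."""
    tp = outcomes.count("tp")
    fp = outcomes.count("fp")
    fn = outcomes.count("fn")
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(outcomes, n_resamples=1000, seed=0, alpha=0.05):
    """Percentile bootstrap interval for F1, resampling items with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    scores = sorted(
        f1_score(rng.choices(outcomes, k=len(outcomes)))
        for _ in range(n_resamples)
    )
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up toy data, not the erabot corpus.
perfect = ["tp"] * 200
noisy = ["tp"] * 190 + ["fp"] * 5 + ["fn"] * 5

print(f1_score(perfect), bootstrap_ci(perfect))
print(round(f1_score(noisy), 3), bootstrap_ci(noisy))
```

With no errors in the sample, every resample also scores 1.0, so the interval collapses to a single point. As soon as the sample contains a few misses or false alarms, the interval widens.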

One important caveat, stated once loudly so nobody has to discover it the hard way: this F1 measures whether the scanner correctly identifies LLM call sites, not whether the findings the audit engine writes on top are high-quality recommendations. Those are different questions. The detection number is exactly that — detection. The quality of the recommendation attached to each finding is a separate dimension we label per-finding with confidence bands (high / medium / low / directional only) rather than a single headline number. A scanner that lies about its own limits is a scanner that sells you the wrong thing.

Why MIT and why a monorepo

MIT because the goal is adoption. Apache 2.0 is defensible but the patent-grant overhead is unearned in our case — we are not patent holders worrying about downstream litigation. MIT is the standard that Python developers expect when they run pip install something-from-github. When in doubt, pick the license that removes friction.

Monorepo because the two packages ship together on the same release tag. Splitting them into separate repos would mean synchronizing two CHANGELOGs, two version numbers, two CI pipelines, and two sets of contributors who now have to know which repo their bug belongs in. One repo with two top-level packages under packages/ keeps the story simple: git clone, pip install -e packages/scanner -e packages/cli, pytest. If we ever outgrow the monorepo we can split later; we will not outgrow it this year.

How to contribute

Obvious first targets are in the "good first issue" label:

Add Ruby language support (the tree-sitter-language-pack already bundles the grammar; you write the query patterns)
Add Go language support (same story)
Add --exclude and --languages filter flags to erabot scan
Add a SARIF output format for GitHub code-scanning
Add a fixture for LlamaIndex's ServiceContext detection

Each ticket has the reference file path and the acceptance criterion. The goal is that a first-time contributor can land a PR without having to read the whole repo first. CONTRIBUTING.md covers setup.
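As a starting point for the SARIF item, here is a rough sketch of mapping scanner findings onto a SARIF 2.1.0 log. The finding keys (`file_path`, `line`, `provider`, `model`) follow the scanner snippet earlier in this post. The function name `to_sarif`, the rule id `llm-call-site`, the `note` level, and the message wording are hypothetical, and the ticket's acceptance criterion takes precedence over anything here.

```python
import json


def to_sarif(findings, tool_name="erabot"):
    """Wrap scanner findings in a minimal SARIF 2.1.0 log (one run, one rule)."""
    results = []
    for f in findings:
        provider = f.get("provider") or "unknown"
        model = f.get("model") or "unknown"
        results.append({
            "ruleId": "llm-call-site",  # hypothetical rule id
            "level": "note",
            "message": {"text": f"LLM call: provider={provider}, model={model}"},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["file_path"]},
                    "region": {"startLine": f["line"]},
                }
            }],
        })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {
                "name": tool_name,
                "rules": [{"id": "llm-call-site", "name": "LlmCallSite"}],
            }},
            "results": results,
        }],
    }


log = to_sarif([
    {"file_path": "app.py", "line": 8, "provider": "openai", "model": "gpt-4"},
])
print(json.dumps(log, indent=2))
```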

The scanner is deliberately small — roughly 400 lines of Python around tree-sitter queries. You can read the whole thing in one sitting. That is by design. Tools whose source you cannot read in an afternoon are tools you cannot trust.

What comes next

This is v0.1.0. The next two weeks are for stabilization: fixing whatever edge cases show up in the wild, merging the first batch of community PRs, and ideally getting Ruby + Go patterns in before the broader launch post goes out on 2026-05-06.

That broader post is "We scanned 105 open-source AI repositories. Here is where the money leaks." It is a data-driven analysis of real-world LLM cost-waste patterns. When it ships, every finding in it will be reproducible by anyone with pip install erabot and a GitHub checkout. That is the whole point of having the scanner in the open.

If you try erabot and hit a detection miss, file an issue. If you try it and the detection was right but the audit recommendation was wrong, email me directly at rohan@erabot.ai — that is a bug in the closed component and I want to know about it.
