The same harness. A silicon engine.

claudberghini points Claude Code at Taalas's Llama 3.1 8B — a model baked into a chip that decodes at ~14,500 tokens/sec, roughly 1,000× faster than Claude Opus. You keep the full agentic loop; the engine behind it is now hardware.

  1. 1.0 The speed.

    Taalas bakes AI models directly into custom silicon — "the model is the computer," ~1000× more efficient than GPUs. Their public demo serves Llama 3.1 8B etched into a chip. We benchmarked the same task — "write a basic HTML page" — across engines:

  2. 2.0 The story.

    We couldn't get an API key — the Taalas demo is a chat box on a website, no key on offer. So we opened the browser dev tools, watched it talk, and rebuilt the API from the network traffic: POST /api/chat, raw-text streaming, a <|stats|> trailer carrying the token-rate telemetry. We didn't get access — we made access.
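
A minimal TypeScript sketch of how a client might separate the model's text from the telemetry once the raw-text stream ends. It assumes the trailer is everything after the last `<|stats|>` marker; the payload format is not described here, so it is returned as an opaque string. The name `splitStatsTrailer` and the sample payload are hypothetical.

```typescript
// Marker named in the text; everything after it is treated as telemetry.
const STATS_MARKER = "<|stats|>";

interface SplitReply {
  text: string;          // the model's visible output
  stats: string | null;  // raw trailer payload, or null if no marker was seen
}

// Joins the streamed chunks first, because the marker may straddle two chunks,
// then splits at the last occurrence of the marker.
function splitStatsTrailer(chunks: string[]): SplitReply {
  const raw = chunks.join("");
  const at = raw.lastIndexOf(STATS_MARKER);
  if (at === -1) {
    return { text: raw, stats: null };
  }
  return {
    text: raw.slice(0, at),
    stats: raw.slice(at + STATS_MARKER.length).trim(),
  };
}

// Illustrative chunks only; the real payload shape is an assumption.
const reply = splitStatsTrailer(["<h1>Hi</h1>", "<|sta", "ts|>decode=14500"]);
console.log(reply.text);  // <h1>Hi</h1>
console.log(reply.stats); // decode=14500
```

Splitting on the joined text keeps the sketch simple; a streaming client that wants to render tokens as they arrive would need to hold back a few trailing characters in case they turn out to be the start of the marker.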

    Then the hard part. An 8B model is a mediocre tool-follower: it rambles, invents filenames, botches JSON, and once tried to sudo rm a system file when we just said "hi." So we built an eval harness over real Claude Code agent loops and tuned iteratively against it: a hill-climbing workflow proposed prompts, scored each, and looped until ten rounds passed with no gain.
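
The tuning loop described above can be sketched as generic hill climbing with a patience counter. This is an illustration, not the project's actual workflow: `propose` and `score` are hypothetical callbacks standing in for "generate a prompt variant" and "run the eval harness", and the toy usage replaces prompts with numbers.

```typescript
// Hill climbing that stops after `patience` consecutive rounds without gain.
// propose: derives a candidate from the current best (hypothetical callback).
// score: evaluates a candidate; higher is better (hypothetical callback).
function hillClimb<T>(
  initial: T,
  propose: (best: T, round: number) => T,
  score: (candidate: T) => number,
  patience: number = 10
): { best: T; bestScore: number; rounds: number } {
  let best = initial;
  let bestScore = score(initial);
  let stale = 0;  // consecutive rounds without improvement
  let rounds = 0; // total candidates evaluated

  while (stale < patience) {
    const candidate = propose(best, rounds);
    const candidateScore = score(candidate);
    rounds += 1;
    if (candidateScore > bestScore) {
      best = candidate;
      bestScore = candidateScore;
      stale = 0;
    } else {
      stale += 1;
    }
  }
  return { best, bestScore, rounds };
}

// Toy usage: "prompts" are numbers and the score peaks at 7.
const result = hillClimb<number>(0, (n) => n + 1, (n) => -Math.abs(n - 7));
console.log(result.best, result.rounds); // 7 17
```

The toy run takes seven improving rounds to reach 7, then ten stale rounds before stopping, which is where the 17 comes from.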

  3. 3.0 How it works.

    Claude Code talks to the proxy (via deep-claude, which isolates your real Anthropic login). The proxy turns a fast-but-weak model into a coding agent:

    1. swaps Claude Code's 120 KB system prompt for a compact, tuned one,
    2. injects a <tool_call> format and parses the model's text back into tool_use blocks,
    3. best-of-N: re-samples until a valid tool call parses,
    4. grounded best-of-N: picks the answer most supported by tool output, and
    5. guards against destructive shell commands.
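
Steps 2 to 5 above can be sketched in TypeScript as follows. Everything here is an assumption made for illustration: the JSON shape inside `<tool_call>` (`name` plus `arguments`), the word-overlap measure of "support", the deny-list patterns, and all function names. The sampler is synchronous to keep the sketch short; a real one would await a model call.

```typescript
interface ToolUse {
  type: "tool_use";
  name: string;
  input: Record<string, unknown>;
}

// Step 2: pull the first <tool_call>...</tool_call> span out of free text and
// parse its JSON body. Returns null on anything malformed.
function parseToolCall(text: string): ToolUse | null {
  const match = /<tool_call>([\s\S]*?)<\/tool_call>/.exec(text);
  if (match === null) return null;
  try {
    const body = JSON.parse(match[1] || "");
    if (typeof body !== "object" || body === null) return null;
    if (typeof body.name !== "string") return null;
    const hasArgs = typeof body.arguments === "object" && body.arguments !== null;
    const input: Record<string, unknown> = hasArgs ? body.arguments : {};
    return { type: "tool_use", name: body.name, input: input };
  } catch (_err) {
    return null; // botched JSON
  }
}

// Step 3: re-sample until a draw parses, giving up after maxAttempts.
// `sample` is a hypothetical stand-in for one model completion.
function sampleToolCall(sample: (attempt: number) => string, maxAttempts: number = 5): ToolUse | null {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const call = parseToolCall(sample(attempt));
    if (call !== null) return call;
  }
  return null;
}

function words(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9_.\/-]+/g) || [];
}

// Step 4: keep the candidate with the largest share of words that also appear
// in the tool output. Word overlap is an assumed proxy for "supported".
function pickGrounded(candidates: string[], toolOutput: string): string | null {
  const evidence = words(toolOutput);
  let best: string | null = null;
  let bestSupport = -1;
  for (const candidate of candidates) {
    const tokens = words(candidate);
    const hits = tokens.filter((token) => evidence.indexOf(token) !== -1).length;
    const support = tokens.length === 0 ? 0 : hits / tokens.length;
    if (support > bestSupport) {
      best = candidate;
      bestSupport = support;
    }
  }
  return best;
}

// Step 5: a deliberately small deny-list; a real guard would need more care.
const DESTRUCTIVE_PATTERNS: RegExp[] = [/\bsudo\b/, /\brm\s+-[a-zA-Z]*[rf]/, /\bmkfs\b/, /\bdd\s+if=/];

function isDestructive(command: string): boolean {
  return DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(command));
}

// Toy usage: the first draw rambles, the second contains a parseable call.
const call = sampleToolCall((attempt) =>
  attempt === 0 ? "Sure! Let me help." : '<tool_call>{"name": "Read", "arguments": {"file_path": "index.html"}}</tool_call>'
);
console.log(call);
console.log(isDestructive("sudo rm /etc/hosts")); // true
```

Returning `null` rather than throwing keeps the retry loop trivial: any failure mode, from a missing tag to invalid JSON, simply triggers another draw.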

  4. 4.0 The setup.

    You'll need Claude Code, deep-claude, and Node 18+. Clone, build, and register the endpoint:

    ```sh
    git clone https://github.com/dennisonbertram/claudberghini
    cd claudberghini
    npm install && npm run build

    # register the endpoint with deep-claude (one time)
    deep-claude endpoints add claudberghini http://localhost:3000
    ```

  5. 5.0 The run.

    One launcher does everything — it auto-starts the proxy and opens a clean Claude Code session on the silicon. Run it from the project directory you want to work in.

    ```console
    $ cd your-project
    $ claudberghini
    …a clean Claude Code session, on silicon @ ~14,500 tok/s…

    $ claudberghini -p "create an index.html with a Hello World heading"
    …done in ~1.5s…
    ```

  6. 6.0 The quality & the guardrails.

    On the real Claude Code agent loop the core four operations — read, edit, create, grep — land 5/5. Harder multi-step and multi-file tasks reach ≈0.70 on the Taalas demo's quantized instance (1.0 on OpenRouter's). It's an 8B model: superb for focused file ops, weaker on tangled logic.

The reference.

Override anything through the environment.

Run

| Command | Description |
| --- | --- |
| claudberghini [args…] | Clean Claude Code session on Taalas silicon. Args pass through to claude. |
| claudberghini -p "task" | One-shot print mode. |
| ./eval/real-path-eval.sh N | Score the four core tasks, N runs each. |
| ./eval/speed-benchmark.sh | Decode rate + end-to-end timing. |

Configure (env)

| Variable | Description |
| --- | --- |
| BACKEND | claudberghini (default) or openrouter. |
| TOOL_SAMPLE_ATTEMPTS | best-of-N draws for a valid tool call (default 5). |
| ANSWER_SAMPLE_ATTEMPTS | grounded answer candidates (default 3). |
| CLAUDBERGHINI_API_URL | the Taalas demo endpoint. |
| MAX_SYSTEM_BYTES | trim the prompt to the ~24 KB ceiling. |
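
A hedged TypeScript sketch of how a proxy might read these variables with their documented defaults. The `loadConfig` and `positiveInt` names and the `ProxyConfig` shape are hypothetical, and the 24 KB fallback for `MAX_SYSTEM_BYTES` is an assumption drawn from the "~24 KB ceiling" wording rather than a confirmed default. The environment is passed in as a plain object so the function stays easy to test.

```typescript
type Env = Record<string, string | undefined>;

interface ProxyConfig {
  backend: "claudberghini" | "openrouter";
  toolSampleAttempts: number;
  answerSampleAttempts: number;
  apiUrl: string | undefined;
  maxSystemBytes: number;
}

// Parses a positive integer, falling back when the variable is unset or invalid.
function positiveInt(value: string | undefined, fallback: number): number {
  const parsed = value === undefined ? NaN : parseInt(value, 10);
  return isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function loadConfig(env: Env): ProxyConfig {
  return {
    // Any value other than "openrouter" falls back to the default backend.
    backend: env["BACKEND"] === "openrouter" ? "openrouter" : "claudberghini",
    toolSampleAttempts: positiveInt(env["TOOL_SAMPLE_ATTEMPTS"], 5),
    answerSampleAttempts: positiveInt(env["ANSWER_SAMPLE_ATTEMPTS"], 3),
    apiUrl: env["CLAUDBERGHINI_API_URL"],
    // Assumed default: 24 KB, taken from the "~24 KB ceiling" wording.
    maxSystemBytes: positiveInt(env["MAX_SYSTEM_BYTES"], 24 * 1024),
  };
}

const config = loadConfig({ BACKEND: "openrouter", TOOL_SAMPLE_ATTEMPTS: "8" });
console.log(config.backend, config.toolSampleAttempts, config.answerSampleAttempts); // openrouter 8 3
```

In a Node process the same function would be called as `loadConfig(process.env)`.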