The Problem
After getting Qwen3.6 35B running at 40 TPS locally, the next goal was obvious: use it as the backend for real coding tools. Claude Code, Cline, GitHub Copilot — all the tools I was already using daily.
The problem: these tools speak different API formats.
Claude Code → Anthropic Messages API (POST /v1/messages)
Cline / Cursor → OpenAI Chat API (POST /v1/chat/completions)
llama-server → OpenAI Chat API (POST /v1/chat/completions)
Cline works out of the box — just point it at http://localhost:8080/v1. Claude Code doesn't. It sends Anthropic-format requests that llama-server can't understand.
The solution: a thin proxy that sits in between.
Research Is Never Static
By the time this was written, llama-server had added native Claude API support via the --jinja flag. It can speak Anthropic's Messages API out of the box — no proxy required for Claude Code.
This is the nature of research: your artifact captures a point in time. Literature review exists precisely because the field moves faster than publication. The value isn't in claiming first; it's in the experiment, the real data, the failures, and the architecture you build on top.
This proxy's purpose has shifted. The Anthropic→OpenAI translation is now redundant. What remains is what the proxy actually does better than scraping Prometheus metrics by hand:
- Accumulates lifetime token totals across llama-server restarts (counter-resilient)
- Serves a live dashboard — no third-party monitoring stack required
- Provides a single forwarding layer ready for federated backends: route to different models based on context size, cost, or latency (see the sketch after this list)
The proxy isn't the endpoint. It's infrastructure for the next experiments.
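To make that routing idea concrete, here is a hypothetical rule. Nothing like it ships in the proxy yet; the thresholds and backend URLs are placeholders:

```python
# Hypothetical federated-routing rule (illustrative only; not in the proxy).
def pick_backend(prompt_tokens: int) -> str:
    """Send short prompts to a fast small model, long prompts to the big
    local model, and overflow to a remote API. Thresholds are made up."""
    if prompt_tokens <= 8_000:
        return "http://localhost:8081"   # hypothetical small, fast model
    if prompt_tokens <= 32_000:
        return "http://localhost:8080"   # Qwen3.6 35B via llama-server
    return "https://api.example.com"     # hypothetical remote fallback
```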
The Architecture
Claude Code / Cline
│
▼
llama-cpp-claude-code-proxy (port 9090)
├── Claude mode: translates Anthropic → OpenAI
├── OpenAI mode: transparent passthrough
├── polls /metrics every 10s
├── persists lifetime totals to metrics.json
└── serves dashboard at /dashboard
│
▼
llama-server (port 8080)
└── Qwen3.6 35B at 57 TPS
Two modes. One proxy. Both tools work.
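To make the two modes concrete, here is a heavily simplified stdlib-Python sketch. The real proxy is a single ~600-line file and is structured differently; this version omits streaming, metrics, and the response-side translation back to Anthropic format:

```python
# Simplified two-mode dispatch (sketch; not the proxy's actual code).
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080"

def anthropic_to_openai(body: bytes) -> bytes:
    """Barebones request translation; the full shape is shown later."""
    a = json.loads(body)
    msgs = [{"role": "system", "content": a["system"]}] if a.get("system") else []
    msgs += a["messages"]
    return json.dumps({"messages": msgs,
                       "max_tokens": a.get("max_tokens", 1024)}).encode()

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("content-length", 0)))
        if self.path == "/v1/messages":
            # Claude mode: translate Anthropic -> OpenAI, retarget the path.
            body, path = anthropic_to_openai(body), "/v1/chat/completions"
        else:
            # OpenAI mode: transparent passthrough, same path.
            path = self.path
        req = urllib.request.Request(UPSTREAM + path, data=body,
                                     headers={"content-type": "application/json"})
        out = urllib.request.urlopen(req).read()
        # Response translation back to Anthropic format is omitted here;
        # the SSE re-wrapping is sketched in the Claude-mode section below.
        self.send_response(200)
        self.send_header("content-type", "application/json")
        self.end_headers()
        self.wfile.write(out)

if __name__ == "__main__":
    HTTPServer(("localhost", 9090), Proxy).serve_forever()
```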
Setup
Step 1 — Start llama-server
The --metrics flag is required — the proxy polls this endpoint to collect TPS and token counts. Add --jinja if you want llama-server to speak Claude's API natively (function calling, tool use, Anthropic format).
./llama-server \
-m Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
-ngl 999 -ncmoe 11 \
--flash-attn on \
-ctk q8_0 -ctv q8_0 \
--no-mmap --mlock \
-t 9 -tb 16 -c 32768 \
--host 0.0.0.0 --port 8080 \
--metrics \
--jinja
Step 2 — Start the proxy
git clone https://github.com/compiledthoughts/llama-cpp-claude-code-proxy
cd llama-cpp-claude-code-proxy
pip install -r requirements.txt
# For Claude Code
python proxy.py --mode claude
# For Cline / Cursor / Continue.dev
python proxy.py
The startup banner confirms everything is connected:
==========================================================
llama-cpp-claude-code-proxy
==========================================================
Mode : claude
Proxy URL : http://localhost:9090
llama-server : http://localhost:8080
Dashboard : http://localhost:9090/dashboard
==========================================================
[proxy] llama-server ONLINE (http://localhost:8080)
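Before wiring up a tool, a quick end-to-end check helps. This snippet is hypothetical (not part of the repo); it assumes claude mode is running and that the proxy accepts any API key:

```python
# Hypothetical smoke test for claude mode (not part of the repo).
import json, urllib.request

req = urllib.request.Request(
    "http://localhost:9090/v1/messages",
    data=json.dumps({
        "model": "claude-sonnet-4-5",   # passed through; not used for routing
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "Say hi"}],
    }).encode(),
    headers={"content-type": "application/json", "x-api-key": "localkey"},
)
resp = json.loads(urllib.request.urlopen(req).read())
print(resp["content"][0]["text"])       # Anthropic-shaped response body
```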
Connecting Your Tools
Claude Code
bash / zsh / macOS
export ANTHROPIC_BASE_URL=http://localhost:9090
export ANTHROPIC_API_KEY=localkey
claude
PowerShell / Windows
$env:ANTHROPIC_BASE_URL="http://localhost:9090"
$env:ANTHROPIC_API_KEY="localkey"
claude
That's it. Claude Code sends Anthropic Messages API requests. The proxy translates them to OpenAI format, calls llama-server, translates the response back — including full SSE streaming.
Cline (VS Code)
Open Cline settings:
- Provider → OpenAI Compatible
- Base URL → http://localhost:9090/v1
- API Key → localkey
The Claude Mode Translation
The interesting part of the proxy is the Anthropic → OpenAI translation. Claude Code sends requests like this:
POST /v1/messages
{
"model": "claude-sonnet-4-5",
"system": "You are a helpful assistant",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}
The proxy converts this to OpenAI format, calls llama-server, then wraps the response back into Anthropic's SSE event format:
event: message_start
event: content_block_start
event: content_block_delta ← token by token
event: content_block_stop
event: message_delta
event: message_stop
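The streaming side is the fiddly part. Here is a sketch of how a stream of OpenAI-style text deltas can be re-wrapped into that event sequence; the payloads are trimmed to the fields Claude Code needs, and the real proxy may emit more:

```python
# Re-wrapping an OpenAI-style token stream into Anthropic SSE events (sketch).
import json

def sse(event: str, data: dict) -> str:
    """Format one server-sent event."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def wrap_stream(openai_deltas):
    """openai_deltas yields text fragments from llama-server's streamed reply."""
    yield sse("message_start", {"type": "message_start",
              "message": {"role": "assistant", "content": []}})
    yield sse("content_block_start", {"type": "content_block_start", "index": 0,
              "content_block": {"type": "text", "text": ""}})
    for text in openai_deltas:
        yield sse("content_block_delta", {"type": "content_block_delta",
                  "index": 0, "delta": {"type": "text_delta", "text": text}})
    yield sse("content_block_stop", {"type": "content_block_stop", "index": 0})
    yield sse("message_delta", {"type": "message_delta",
              "delta": {"stop_reason": "end_turn"}})
    yield sse("message_stop", {"type": "message_stop"})

# Usage: for chunk in wrap_stream(["Hel", "lo"]): write chunk to the client.
```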
Claude Code never knows it's talking to Qwen3.6. It just sees a compliant Anthropic API.
Lifetime Metrics
The proxy polls http://localhost:8080/metrics every 10 seconds and accumulates lifetime totals in metrics.json. This survives llama-server restarts — counter resets are detected automatically.
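The reset handling is plain monotonic-counter bookkeeping. A minimal sketch, assuming the scraped Prometheus counter value is already in hand as `current` (the proxy's real field names and persistence format may differ):

```python
# Counter-reset-resilient accumulation (sketch; real field names may differ).
import json, os

STATE = "metrics.json"

def accumulate(metric: str, current: float) -> float:
    state = json.load(open(STATE)) if os.path.exists(STATE) else {}
    last = state.get(f"{metric}_last", 0.0)
    # A counter that moved backwards means llama-server restarted and its
    # counters reset to zero, so the whole current value is new work.
    delta = current if current < last else current - last
    state[f"{metric}_total"] = state.get(f"{metric}_total", 0.0) + delta
    state[f"{metric}_last"] = current
    with open(STATE, "w") as f:
        json.dump(state, f)
    return state[f"{metric}_total"]
```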
187.2K prompt tokens · 10K generated tokens · 57 requests tracked since May 12, 2026
The dashboard at http://localhost:9090/dashboard reads from metrics.json and auto-refreshes every 5 seconds, showing current TPS, context usage, KV cache ratio, and token history charts.
Real Results
| Tool | Mode | Context | TPS | Status |
|---|---|---|---|---|
| Claude Code | claude | 32k | 57 | ✅ Tested |
| Cline (VS Code) | openai | 32k | 57 | ✅ Tested |
| Cline (VS Code) | openai | 64k actual work | 27 | ✅ Tested |
Both were tested independently — one at a time. The 27 TPS figure at 64k context is from real coding use, not a synthetic benchmark. Cline accumulates context across a session as it reads files, runs commands, and tracks conversation history.
Context size is the real TPS variable. 32k → 57 TPS. 64k actual work → 27 TPS. The KV cache competes with expert weights for VRAM. TurboQuant KV compression is the next step — Experiment 04.
After 2 Hours of Real Use
244.2K prompt tokens · 19.6K generated tokens · 69 requests · ~27 TPS average
The prompt-to-generated ratio is about 12:1. Cline sends large context (files, history, tool results) and receives focused responses. This is normal for agentic coding tools — they read a lot, write precisely.
Source
github.com/compiledthoughts/llama-cpp-claude-code-proxy — MIT license. Single file. ~600 lines.