Experiment 02 — Local Tooling

Using Claude Code and Cline with a Local LLM

May 12, 2026 · Qwen3.6 35B · RTX 5060 Ti · llama-cpp-claude-code-proxy
Claude Code speaks Anthropic's API. llama-server speaks OpenAI's API. A 600-line proxy bridges the two — and adds lifetime metrics tracking. Here's how it works and how to set it up.

The Problem

After getting Qwen3.6 35B running at 40 TPS locally, the next goal was obvious: use it as the backend for real coding tools. Claude Code, Cline, GitHub Copilot — all the tools I was already using daily.

The problem: these tools speak different API formats.

Claude Code → Anthropic Messages API (POST /v1/messages)
Cline / Cursor → OpenAI Chat API (POST /v1/chat/completions)
llama-server → OpenAI Chat API (POST /v1/chat/completions)

Cline works out of the box — just point it at http://localhost:8080/v1. Claude Code doesn't. It sends Anthropic-format requests that llama-server can't understand.
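
To make the mismatch concrete, this is the request shape llama-server accepts natively; the model name is illustrative, since llama-server serves whichever model it loaded. Claude Code's Anthropic-format requests (shown later in this post) don't fit this endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'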

The solution: a thin proxy that sits in between.

Research Is Never Static

By the time this was written, llama-server had added native Claude API support via the --jinja flag. It can speak Anthropic's Messages API out of the box; no proxy is required for Claude Code.
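
If you want to try the native route, the same environment-variable trick used below for the proxy should work pointed straight at llama-server. Untested in this experiment, which goes through the proxy throughout:

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=localkey
claude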

This is the nature of research: your artifact captures a point in time. Literature review exists precisely because the field moves faster than publication. The value isn't in claiming first; it's in the experiment, the real data, the failures, and the architecture you build on top.

This proxy's purpose has shifted. The Anthropic→OpenAI translation is now redundant. What remains is what the proxy does better than scraping Prometheus metrics by hand: lifetime totals that survive restarts, automatic counter-reset detection, and a live dashboard.

The proxy isn't the endpoint. It's infrastructure for the next experiments.

The Architecture

Claude Code / Cline / Cursor / Copilot / any OpenAI SDK
            │
            ▼
llama-cpp-claude-code-proxy (port 9090)
    ├── Claude mode: translates Anthropic → OpenAI
    ├── OpenAI mode: transparent passthrough
    ├── polls /metrics every 10s
    ├── persists lifetime totals to metrics.json
    └── serves dashboard at /dashboard
            │
            ▼
llama-server (port 8080)
    └── Qwen3.6 35B at 57 TPS

Two modes. One proxy. Both tools work.

Setup

Step 1 — Start llama-server

The --metrics flag is required — the proxy polls this endpoint to collect TPS and token counts. Add --jinja if you want llama-server to speak Claude's API natively (function calling, tool use, Anthropic format).

./llama-server \
  -m Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
  -ngl 999 -ncmoe 11 \
  --flash-attn on \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap --mlock \
  -t 9 -tb 16 -c 32768 \
  --host 0.0.0.0 --port 8080 \
  --metrics \
  --jinja
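
Before starting the proxy, it's worth confirming the metrics endpoint is live. The exact counter names vary across llama.cpp versions, but you should see Prometheus-style counters for prompt and generated tokens:

curl -s http://localhost:8080/metrics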

Step 2 — Start the proxy

git clone https://github.com/compiledthoughts/llama-cpp-claude-code-proxy
cd llama-cpp-claude-code-proxy
pip install -r requirements.txt

# For Claude Code
python proxy.py --mode claude

# For Cline / Cursor / Continue.dev
python proxy.py

Startup confirms everything is connected:

==========================================================
  llama-cpp-claude-code-proxy
==========================================================
  Mode         : claude
  Proxy URL    : http://localhost:9090
  llama-server : http://localhost:8080
  Dashboard    : http://localhost:9090/dashboard
==========================================================
[proxy] llama-server ONLINE (http://localhost:8080)

Connecting Your Tools

Claude Code

bash / zsh / macOS

export ANTHROPIC_BASE_URL=http://localhost:9090
export ANTHROPIC_API_KEY=localkey
claude

PowerShell / Windows

$env:ANTHROPIC_BASE_URL="http://localhost:9090"
$env:ANTHROPIC_API_KEY="localkey"
claude

That's it. Claude Code sends Anthropic Messages API requests. The proxy translates them to OpenAI format, calls llama-server, translates the response back — including full SSE streaming.

Cline (VS Code)

Open Cline settings and point it at the proxy: select the OpenAI Compatible provider, set the base URL to http://localhost:9090/v1, enter any non-empty API key, and use the model ID llama-server reports.

The Claude Mode Translation

The interesting part of the proxy is the Anthropic → OpenAI translation. Claude Code sends requests like this:

POST /v1/messages
{
  "model": "claude-sonnet-4-5",
  "system": "You are a helpful assistant",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true
}
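
The request-side mapping is mostly key shuffling: Anthropic carries the system prompt as a top-level field, while OpenAI expects it as the first message. A minimal sketch of the idea, not the proxy's actual code, ignoring tool use and multi-part content blocks:

def anthropic_to_openai(req: dict) -> dict:
    """Map an Anthropic Messages API request to an OpenAI Chat request."""
    messages = []
    # Anthropic: system prompt is a top-level field.
    # OpenAI: it's the first message in the list.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    # user/assistant turns carry over directly when content is a plain string
    messages.extend(req["messages"])
    return {
        "model": req["model"],  # llama-server serves its loaded model regardless
        "messages": messages,
        "stream": req.get("stream", False),
        "max_tokens": req.get("max_tokens", 1024),
    }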

The proxy converts this to OpenAI format, calls llama-server, then wraps the response back into Anthropic's SSE event format:

event: message_start
event: content_block_start
event: content_block_delta   ← token by token
event: content_block_stop
event: message_delta
event: message_stop

Claude Code never knows it's talking to Qwen3.6. It just sees a compliant Anthropic API.
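
The streaming direction is the fiddlier half. A simplified sketch of the event framing, with error handling and usage accounting omitted, and the helper names mine rather than the proxy's:

import json

def emit(event: str, data: dict) -> str:
    # one Anthropic-style SSE frame: event line, data line, blank line
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def wrap_stream(openai_text_deltas):
    """Re-emit decoded OpenAI delta strings as Anthropic Messages events."""
    yield emit("message_start", {"type": "message_start",
                                 "message": {"role": "assistant", "content": []}})
    yield emit("content_block_start", {"type": "content_block_start", "index": 0,
                                       "content_block": {"type": "text", "text": ""}})
    for text in openai_text_deltas:
        yield emit("content_block_delta", {"type": "content_block_delta", "index": 0,
                                           "delta": {"type": "text_delta", "text": text}})
    yield emit("content_block_stop", {"type": "content_block_stop", "index": 0})
    yield emit("message_delta", {"type": "message_delta",
                                 "delta": {"stop_reason": "end_turn"}})
    yield emit("message_stop", {"type": "message_stop"})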

Lifetime Metrics

The proxy polls http://localhost:8080/metrics every 10 seconds and accumulates lifetime totals in metrics.json. This survives llama-server restarts — counter resets are detected automatically.
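
Reset detection is the one non-obvious part: Prometheus counters drop to zero when llama-server restarts, so the poller accumulates deltas instead of storing raw values. A sketch of the idea, with illustrative field and counter names rather than the proxy's actual schema:

import json
import re
import urllib.request

def read_counter(url: str, name: str) -> float:
    # scrape one Prometheus counter from llama-server's /metrics text
    text = urllib.request.urlopen(url).read().decode()
    m = re.search(rf"{re.escape(name)}\s+([0-9.eE+]+)", text)
    return float(m.group(1)) if m else 0.0

state = {"lifetime_prompt_tokens": 0.0, "last_raw": 0.0}

def poll_once(url="http://localhost:8080/metrics"):
    raw = read_counter(url, "prompt_tokens_total")  # name varies by version
    # a value below the last sample means the counter reset on restart:
    # count the whole new value as fresh growth, not a negative delta
    delta = raw - state["last_raw"] if raw >= state["last_raw"] else raw
    state["lifetime_prompt_tokens"] += delta
    state["last_raw"] = raw
    with open("metrics.json", "w") as f:
        json.dump(state, f)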

After 3 days of use

187.2K prompt tokens · 10K generated tokens · 57 requests tracked since May 12, 2026

The dashboard at http://localhost:9090/dashboard reads from metrics.json and auto-refreshes every 5 seconds. Shows current TPS, context usage, KV cache ratio, and token history charts.

Real Results

Tool               Mode     Context           TPS   Status
Claude Code        claude   32k               57    ✅ Tested
Cline (VS Code)    openai   32k               57    ✅ Tested
Cline (VS Code)    openai   64k actual work   27    ✅ Tested

Both were tested independently — one at a time. The 27 TPS figure at 64k context is from real coding use, not a synthetic benchmark. Cline accumulates context across a session as it reads files, runs commands, and tracks conversation history.

Context size is the real TPS variable. 32k → 57 TPS. 64k actual work → 27 TPS. The KV cache competes with expert weights for VRAM. TurboQuant KV compression is the next step — Experiment 04.
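
A back-of-envelope shows why. Per-token KV cache cost is roughly 2 × layers × kv_heads × head_dim × bytes_per_value. The architecture numbers below are placeholders, not Qwen3.6 35B's actual config, but the scaling is the point:

# placeholder architecture values, NOT Qwen3.6 35B's real config
layers, kv_heads, head_dim = 48, 8, 128
bytes_per_value = 1  # q8_0 K/V cache is roughly one byte per value
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
for ctx in (32_768, 65_536):
    print(f"{ctx:>6} tokens -> {ctx * per_token / 2**30:.1f} GiB KV cache")

Doubling the context doubles the cache, and every gigabyte it takes is a gigabyte that can't hold expert weights on the GPU.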

After 2 Hours of Real Use

Lifetime stats — 2 hours of actual coding

244.2K prompt tokens · 19.6K generated tokens · 69 requests · ~27 TPS average

The prompt:generated ratio is 12:1. Cline sends large context (files, history, tool results) and receives focused responses. This is normal for agentic coding tools — they read a lot, write precisely.

Source

github.com/compiledthoughts/llama-cpp-claude-code-proxy — MIT license. Single file. ~600 lines.