The Problem
After getting Qwen3.6 35B running at 40 TPS locally, the next goal was obvious: use it as the backend for real coding tools. Claude Code, Cline, GitHub Copilot — all the tools I was already using daily.
The problem: these tools speak different API formats.
Claude Code → Anthropic Messages API (POST /v1/messages)
Cline / Cursor → OpenAI Chat API (POST /v1/chat/completions)
llama-server → OpenAI Chat API (POST /v1/chat/completions)
Cline works out of the box — just point it at http://localhost:8080/v1. Claude Code doesn't. It sends Anthropic-format requests that llama-server can't understand.
The solution: a thin proxy that sits in between.
Research Is Never Static
By the time this was written, llama-server had added native Claude API support via the --jinja flag. It can speak Anthropic's Messages API out of the box — no proxy required for Claude Code.
This is the nature of research: your artifact captures a point in time. Literature review exists precisely because the field moves faster than publication. The value isn't in claiming first; it's in the experiment, the real data, the failures, and the architecture you build on top.
This proxy's purpose has shifted. The Anthropic→OpenAI translation is now redundant. What remains is what the proxy actually does better than scraping Prometheus metrics by hand:
- Accumulates lifetime token totals across llama-server restarts (counter-resilient)
- Serves a live dashboard — no third-party monitoring stack required
- Provides a single forwarding layer ready for federated backends: route to different models based on context size, cost, or latency (see the sketch after this list)
The proxy isn't the endpoint. It's infrastructure for the next experiments.
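To make that routing idea concrete, here is a hypothetical rule. Nothing like it ships in the proxy yet; the thresholds and backend URLs are placeholders:

```python
# Hypothetical federated-routing rule (illustrative only; not in the proxy).
def pick_backend(prompt_tokens: int) -> str:
    """Send short prompts to a fast small model, long prompts to the big
    local model, and overflow to a remote API. Thresholds are made up."""
    if prompt_tokens <= 8_000:
        return "http://localhost:8081"   # hypothetical small, fast model
    if prompt_tokens <= 32_000:
        return "http://localhost:8080"   # Qwen3.6 35B via llama-server
    return "https://api.example.com"     # hypothetical remote fallback
```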
The Architecture
Claude Code / Cline
│
▼
llama-cpp-claude-code-proxy (port 9090)
├── Claude mode: translates Anthropic → OpenAI
├── OpenAI mode: transparent passthrough
├── polls /metrics every 10s
├── persists lifetime totals to metrics.json
└── serves dashboard at /dashboard
│
▼
llama-server (port 8080)
└── Qwen3.6 35B at 57 TPS
Two modes. One proxy. Both tools work.
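To make the two modes concrete, here is a heavily simplified stdlib-Python sketch. The real proxy is a single ~600-line file and is structured differently; this version omits streaming, metrics, and the response-side translation back to Anthropic format:

```python
# Simplified two-mode dispatch (sketch; not the proxy's actual code).
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080"

def anthropic_to_openai(body: bytes) -> bytes:
    """Barebones request translation; the full shape is shown later."""
    a = json.loads(body)
    msgs = [{"role": "system", "content": a["system"]}] if a.get("system") else []
    msgs += a["messages"]
    return json.dumps({"messages": msgs,
                       "max_tokens": a.get("max_tokens", 1024)}).encode()

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("content-length", 0)))
        if self.path == "/v1/messages":
            # Claude mode: translate Anthropic -> OpenAI, retarget the path.
            body, path = anthropic_to_openai(body), "/v1/chat/completions"
        else:
            # OpenAI mode: transparent passthrough, same path.
            path = self.path
        req = urllib.request.Request(UPSTREAM + path, data=body,
                                     headers={"content-type": "application/json"})
        out = urllib.request.urlopen(req).read()
        # Response translation back to Anthropic format is omitted here;
        # the SSE re-wrapping is sketched in the Claude-mode section below.
        self.send_response(200)
        self.send_header("content-type", "application/json")
        self.end_headers()
        self.wfile.write(out)

if __name__ == "__main__":
    HTTPServer(("localhost", 9090), Proxy).serve_forever()
```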
Setup
Step 1 — Start llama-server
The --metrics flag is required — the proxy polls this endpoint to collect TPS and token counts. Add --jinja if you want llama-server to speak Claude's API natively (function calling, tool use, Anthropic format).
./llama-server \
-m Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
-ngl 999 -ncmoe 11 \
--flash-attn on \
-ctk q8_0 -ctv q8_0 \
--no-mmap --mlock \
-t 9 -tb 16 -c 32768 \
--host 0.0.0.0 --port 8080 \
--metrics \
--jinja
Step 2 — Start the proxy
git clone https://github.com/compiledthoughts/llama-cpp-claude-code-proxy
cd llama-cpp-claude-code-proxy
pip install -r requirements.txt
# For Claude Code
python proxy.py --mode claude
# For Cline / Cursor / Continue.dev
python proxy.py
The startup banner confirms everything is connected:
==========================================================
llama-cpp-claude-code-proxy
==========================================================
Mode : claude
Proxy URL : http://localhost:9090
llama-server : http://localhost:8080
Dashboard : http://localhost:9090/dashboard
==========================================================
[proxy] llama-server ONLINE (http://localhost:8080)
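Before wiring up a tool, a quick end-to-end check helps. This snippet is hypothetical (not part of the repo); it assumes claude mode is running and that the proxy accepts any API key:

```python
# Hypothetical smoke test for claude mode (not part of the repo).
import json, urllib.request

req = urllib.request.Request(
    "http://localhost:9090/v1/messages",
    data=json.dumps({
        "model": "claude-sonnet-4-5",   # passed through; not used for routing
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "Say hi"}],
    }).encode(),
    headers={"content-type": "application/json", "x-api-key": "localkey"},
)
resp = json.loads(urllib.request.urlopen(req).read())
print(resp["content"][0]["text"])       # Anthropic-shaped response body
```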
Connecting Your Tools
Claude Code
bash / zsh / macOS
export ANTHROPIC_BASE_URL=http://localhost:9090
export ANTHROPIC_API_KEY=localkey
claude
PowerShell / Windows
$env:ANTHROPIC_BASE_URL="http://localhost:9090"
$env:ANTHROPIC_API_KEY="localkey"
claude
That's it. Claude Code sends Anthropic Messages API requests. The proxy translates them to OpenAI format, calls llama-server, translates the response back — including full SSE streaming.
Cline (VS Code)
Open Cline settings:
- Provider → OpenAI Compatible
- Base URL → http://localhost:9090/v1
- API Key → localkey
The Claude Mode Translation
The interesting part of the proxy is the Anthropic → OpenAI translation. Claude Code sends requests like this:
POST /v1/messages
{
"model": "claude-sonnet-4-5",
"system": "You are a helpful assistant",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}
The proxy converts this to OpenAI format, calls llama-server, then wraps the response back into Anthropic's SSE event format:
event: message_start
event: content_block_start
event: content_block_delta ← token by token
event: content_block_stop
event: message_delta
event: message_stop
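The streaming side is the fiddly part. Here is a sketch of how a stream of OpenAI-style text deltas can be re-wrapped into that event sequence; the payloads are trimmed to the fields Claude Code needs, and the real proxy may emit more:

```python
# Re-wrapping an OpenAI-style token stream into Anthropic SSE events (sketch).
import json

def sse(event: str, data: dict) -> str:
    """Format one server-sent event."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def wrap_stream(openai_deltas):
    """openai_deltas yields text fragments from llama-server's streamed reply."""
    yield sse("message_start", {"type": "message_start",
              "message": {"role": "assistant", "content": []}})
    yield sse("content_block_start", {"type": "content_block_start", "index": 0,
              "content_block": {"type": "text", "text": ""}})
    for text in openai_deltas:
        yield sse("content_block_delta", {"type": "content_block_delta",
                  "index": 0, "delta": {"type": "text_delta", "text": text}})
    yield sse("content_block_stop", {"type": "content_block_stop", "index": 0})
    yield sse("message_delta", {"type": "message_delta",
              "delta": {"stop_reason": "end_turn"}})
    yield sse("message_stop", {"type": "message_stop"})

# Usage: for chunk in wrap_stream(["Hel", "lo"]): write chunk to the client.
```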
Claude Code never knows it's talking to Qwen3.6. It just sees a compliant Anthropic API.
Lifetime Metrics
The proxy polls http://localhost:8080/metrics every 10 seconds and accumulates lifetime totals in metrics.json. This survives llama-server restarts — counter resets are detected automatically.
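The reset handling is plain monotonic-counter bookkeeping. A minimal sketch, assuming the scraped Prometheus counter value is already in hand as `current` (the proxy's real field names and persistence format may differ):

```python
# Counter-reset-resilient accumulation (sketch; real field names may differ).
import json, os

STATE = "metrics.json"

def accumulate(metric: str, current: float) -> float:
    state = json.load(open(STATE)) if os.path.exists(STATE) else {}
    last = state.get(f"{metric}_last", 0.0)
    # A counter that moved backwards means llama-server restarted and its
    # counters reset to zero, so the whole current value is new work.
    delta = current if current < last else current - last
    state[f"{metric}_total"] = state.get(f"{metric}_total", 0.0) + delta
    state[f"{metric}_last"] = current
    with open(STATE, "w") as f:
        json.dump(state, f)
    return state[f"{metric}_total"]
```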
187.2K prompt tokens · 10K generated tokens · 57 requests tracked since May 12, 2026
The dashboard at http://localhost:9090/dashboard reads from metrics.json and auto-refreshes every 5 seconds, showing current TPS, context usage, KV cache ratio, and token history charts.
Real Results
| Tool | Mode | Context | TPS | Status |
|---|---|---|---|---|
| Claude Code | claude | 32k | 57 | ✅ Tested |
| Cline (VS Code) | openai | 32k | 57 | ✅ Tested |
| Cline (VS Code) | openai | 64k actual work | 27 | ✅ Tested |
Both were tested independently — one at a time. The 27 TPS figure at 64k context is from real coding use, not a synthetic benchmark. Cline accumulates context across a session as it reads files, runs commands, and tracks conversation history.
Context size is the real TPS variable. 32k → 57 TPS. 64k actual work → 27 TPS. The KV cache competes with expert weights for VRAM. TurboQuant KV compression is the next step — Experiment 04.
After 2 Hours of Real Use
244.2K prompt tokens · 19.6K generated tokens · 69 requests · ~27 TPS average
The prompt-to-generated ratio is about 12:1. Cline sends large context (files, history, tool results) and receives focused responses. This is normal for agentic coding tools — they read a lot, write precisely.
Source
github.com/compiledthoughts/llama-cpp-claude-code-proxy — MIT license. Single file. ~600 lines.