compiled
The Mission

Run the largest possible frontier AI model locally, without the complexity.

The goal is simple: run the biggest open-weight model possible at good tokens per second, entirely on local, consumer-grade hardware. No cloud. No API keys. Just your machine.

Follow the build → GitHub
Progress · Qwen3.6 35B-A3B · RTX 5060 Ti
64 t/s · + MTP · 131k ctx · Experiment 03 →
57 t/s · llama-server · LM cache · Experiment 02 →
36 t/s · llama-cli · -ncmoe 11 · Experiment 01 →
18 t/s · Ollama, default settings · baseline
Latest Posts · 5 published
Reaching 64 t/s: LM Cache, KV Checkpointing, and MTP

The proxy showed 57 TPS, but llama-cli gave 36. Same hardware. This post answers why — then adds MTP to reach 64 TPS at 131k context.

Running Qwen3.6 35B at 40 TPS on Consumer Hardware

Ollama leaves 2.2× performance on the table for MoE models. A deep dive into memory bandwidth hierarchy and why GPU utilization % is a misleading metric.

Using Claude Code and Cline with a Local LLM

A 600-line proxy bridges Anthropic's API and llama-server — with lifetime metrics tracking and a live dashboard.

Change Default Directory for Ollama

How to change the default directory for Ollama models on Windows using an environment variable.

What Are LLMs (Large Language Models)?

A clear breakdown of what large language models are, how they work, and why they matter for local inference.

What this is

Compiled Thoughts is a public build log with one goal: run the biggest open-weight model possible at good tokens per second, entirely on local hardware.

Open-weight models are getting bigger and better fast. But actually running them — on your own machine, without cloud APIs or expensive subscriptions — is still harder than it should be. This blog is about closing that gap.

Every post is a step in the build: benchmarks, tooling, configuration, failures, and breakthroughs. All numbers are real and measured.
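
To put the progress figures in perspective, here is a minimal sketch that computes each step's speedup over the Ollama baseline. The throughput values are copied from the progress list on this page; the names `PROGRESS` and `speedups` are illustrative and not taken from the blog's own tooling.

```python
# Throughput figures (tokens per second) copied from this page's progress list.
# Labels are shortened for display; the helper names are hypothetical.
PROGRESS = [
    ("Ollama, default settings (baseline)", 18),
    ("Experiment 01: llama-cli, -ncmoe 11", 36),
    ("Experiment 02: llama-server, LM cache", 57),
    ("Experiment 03: + MTP, 131k ctx", 64),
]


def speedups(progress):
    """Return (label, t/s, speedup over the first entry) for each step."""
    baseline = progress[0][1]
    return [(label, tps, tps / baseline) for label, tps in progress]


if __name__ == "__main__":
    for label, tps, ratio in speedups(PROGRESS):
        print(f"{tps:>3} t/s  {ratio:.2f}x  {label}")
```

Dividing by the first entry keeps the comparison tied to the default Ollama run, so any later experiment can be appended to the list without changing the helper.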

More posts

Browse all articles, or follow the build from the beginning.

View all posts →