Hardware
Everything in this experiment runs on a single consumer workstation. No cloud. No rented GPU. The specs matter because the entire optimization strategy is built around its memory hierarchy.
The memory hierarchy
The key to understanding everything that follows is this bandwidth table. These numbers determine every optimization decision.
| Tier | Bandwidth | Relative speed |
|---|---|---|
| Dedicated VRAM | 448 GB/s | 1× (baseline) |
| DDR5 RAM (CPU) | 96 GB/s | ~4.7× slower |
| Shared GPU memory (PCIe) | 32 GB/s | 14× slower |
| NVMe SSD | 7 GB/s | 64× slower |
The Model
Qwen3.6 35B-A3B is a Mixture-of-Experts (MoE) model. The numbers mean: 35 billion total parameters, 3 billion active per token. During inference, a router selects which experts (sub-networks) to activate. The rest sit idle in memory.
This is the property that makes consumer hardware viable: you get 35B of knowledge at 3B inference cost. A dense 35B model would be far slower. The tradeoff is memory management complexity — which is exactly what this experiment is about.
We used the IQ4_XS GGUF variant (~20GB). It fits in 16GB VRAM + RAM with no NVMe spillover.
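A back-of-envelope sketch (not part of the original benchmarks) makes "far slower" concrete: ~20GB for 35B weights works out to roughly 0.57 bytes per parameter, so each token reads about 1.7GB of expert weights with 3B active, versus ~20GB if all 35B were dense. Dividing by each tier's bandwidth gives a rough ceiling on tokens per second:

```bash
# Per-token weight traffic and the throughput ceiling each memory tier
# imposes. Pure bandwidth math: compute time and CPU/GPU overlap ignored.
BYTES_PER_PARAM=$(awk 'BEGIN { print 20 / 35 }')   # ~0.57 bytes/param (IQ4_XS: ~20GB / 35B params)

for tier in "VRAM 448" "DDR5 96" "Shared 32" "NVMe 7"; do
  set -- $tier   # split "name bandwidth-in-GB/s"
  awk -v name="$1" -v bw="$2" -v bpp="$BYTES_PER_PARAM" 'BEGIN {
    moe   = 3e9  * bpp / 1e9      # GB read per token with 3B active params
    dense = 35e9 * bpp / 1e9      # GB read per token if all 35B were active
    printf "%-6s  MoE ceiling: %6.1f tok/s   dense ceiling: %5.1f tok/s\n", name, bw/moe, bw/dense
  }'
done
```

These ceilings ignore compute and overlap, but they already explain the shape of the results below: token generation is bandwidth-bound, so everything hinges on which tier the active expert weights are read from.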
Baseline: Ollama Default
The starting point: run the model with Ollama's default configuration and measure tokens per second.
```bash
ollama run qwen3.6:35b-a3b
# Prompt: "Explain how transformers work, step by step."
# Output: 500 tokens
```
CPU/GPU split: 48% CPU / 52% GPU. GPU utilization: 17%.
GPU at 17% utilization. Nearly half the model running on slow CPU paths. Ollama doesn't understand MoE architecture — it naively splits layers without knowing which should go where.
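Both numbers are visible from Ollama's own tooling, no profiler needed (exact output wording varies by Ollama version):

```bash
# --verbose makes `ollama run` print timing stats after the response,
# including an "eval rate" line in tokens per second.
ollama run --verbose qwen3.6:35b-a3b "Explain how transformers work, step by step."

# While the model is loaded, `ollama ps` shows where it lives:
# the PROCESSOR column reads like "48%/52% CPU/GPU".
ollama ps
```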
The MoE Problem
A MoE model has two distinct computation types per layer:
Attention layers — dense, always active every token. Small, fast, GPU loves them.
Expert FFN layers — sparse. Only 3B of 35B activates per token. Large, but most of it sits idle.
Ollama doesn't make this distinction. It splits layers generically. The result: expert weights land in VRAM, fill it up, overflow into shared memory (DDR5 accessed via PCIe at 32 GB/s), and GPU throughput collapses.
Shared GPU memory is DDR5 wearing a GPU mask. It looks like VRAM but runs at PCIe speed — 32 GB/s instead of 448 GB/s. A 14× bandwidth penalty on every expert weight access that hits it.
The Fix: llama.cpp with MoE-aware flags
llama.cpp exposes a flag that Ollama never uses: -ncmoe N — keep the first N MoE layers' expert weights on CPU RAM instead of GPU.
The insight: CPU RAM at 96 GB/s beats shared GPU memory at 32 GB/s by 3×. Keeping experts in RAM is faster than letting them spill into shared VRAM.
Optimal flags
```bash
# Launch config; flags are collected in a bash array so the inline
# comments don't break the command when copy-pasted.
args=(
  -m Qwen3.6-35B-A3B-IQ4_XS.gguf
  -ngl 999           # all layers → GPU
  -ncmoe 11          # 11/41 MoE layers → CPU RAM (sweet spot)
  --flash-attn on    # memory-efficient attention kernel
  -ctk q8_0          # KV cache keys → 8-bit
  -ctv q8_0          # KV cache values → 8-bit
  --no-mmap          # load fully into RAM upfront
  --mlock            # pin model in RAM, prevent OS swap
  -t 9 -tb 16        # CPU threads: 9 gen, 16 batch
  --prio 2           # high process priority
  --poll 100         # reduce CPU↔GPU handoff latency
  --port 8080
)
./llama-server "${args[@]}"
```
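Once the server is up, throughput is easiest to measure over its HTTP API. A minimal check, assuming a recent llama-server build whose native /completion endpoint returns a timings block (the OpenAI-compatible /v1/chat/completions route works as well):

```bash
# Request ~500 tokens and read the generation speed from the server's
# timings block. Field names can differ across llama.cpp versions.
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain how transformers work, step by step.", "n_predict": 500}' \
  | jq '.timings.predicted_per_second'
```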
Tuning the -ncmoe Parameter
The model has 41 layers total. We swept -ncmoe values to find the optimal split. The result shows two distinct cliffs — one from too many experts on CPU, one from VRAM overflow into shared memory.
At the sweet spot (-ncmoe 11): GPU utilization 33% · shared VRAM 4.7GB · 11/41 MoE layers on CPU (27%).
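The sweep itself is easy to script. A sketch, assuming llama-cli accepts the same -ncmoe flag as llama-server (it is a shared argument in current llama.cpp builds) and taking TPS from the perf summary printed at the end of each run:

```bash
# Sweep -ncmoe and pull generation speed from llama-cli's perf summary
# (the "eval time ... tokens per second" line printed on stderr).
for n in 5 10 11 12 15 20; do
  echo "=== -ncmoe $n ==="
  ./llama-cli -m Qwen3.6-35B-A3B-IQ4_XS.gguf \
    -ngl 999 -ncmoe "$n" --flash-attn on -ctk q8_0 -ctv q8_0 \
    --no-mmap --mlock -t 9 -tb 16 \
    -n 200 -p "Explain how transformers work, step by step." \
    2>&1 | grep "eval time"
done
```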
Why the two cliffs exist
Left cliff (ncmoe 20): Too many experts on CPU. CPU matrix multiply is ~500× slower than GPU. GPU finishes its layers and sits idle waiting for CPU — hence 20% GPU utilization and only 26.8 TPS.
Right cliff (ncmoe 5 → 10): Too many experts pushed to GPU. VRAM fills up and overflows into shared memory. GPU now reads some weights at 32 GB/s instead of 448 GB/s — a 14× penalty. GPU appears "busy" at 86% but TPS collapses to 18.
The sweet spot (ncmoe 11): GPU and CPU finish their respective work at roughly the same time. No waiting. No overflow. Neither is the bottleneck.
High GPU% ≠ fast inference. At 86% GPU utilization we got 18 TPS. At 33% GPU utilization we got 40 TPS. The metric that matters is TPS, not utilization.
Results
| Config | TPS | GPU% | Shared VRAM |
|---|---|---|---|
| Ollama default | 18.15 | 17% | — |
| llama.cpp -ncmoe 20 | 26.8 | 20% | 4.0GB |
| llama.cpp -ncmoe 15 | 31.9 | 26% | 6.5GB |
| llama.cpp -ncmoe 12 | 36.3 | 31% | 5.2GB |
| llama.cpp -ncmoe 11 ★ | 40.4 | 33% | 4.7GB |
| llama.cpp -ncmoe 10 | 27.3 | 60% | 4.2GB |
| llama.cpp -ncmoe 5 | 18.1 | 86% | 8.7GB |
40.4 TPS, a 2.2× improvement over Ollama with zero hardware changes. A handful of flags and one tuned parameter recovered performance that Ollama was leaving on the table.
What's Next
This experiment established the baseline and the optimization methodology. Context window was kept small for benchmarking. The next experiment attacks the real constraint: long context.
At 32k context, KV cache grows significantly and eats into the VRAM budget we're currently using for experts. TurboQuant KV cache compression (5–7× reduction) is the tool for this. The goal is to run at 32k–200k context without sacrificing TPS.
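For scale, the KV cache grows linearly with context: bytes ≈ 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. A rough calculator, where the 41-layer count comes from this model but the KV head count and head dimension are placeholders to be replaced with the values in the GGUF metadata, and q8_0 is taken as ~8.5 bits per element:

```bash
# Rough KV cache size. LAYERS matches this model; kv_heads and head_dim
# are placeholder values, read the real ones from the GGUF metadata.
awk -v layers=41 -v kv_heads=4 -v head_dim=128 -v ctx=32768 -v bpe=1.0625 'BEGIN {
  bytes = 2 * layers * kv_heads * head_dim * ctx * bpe   # K and V, q8_0 at ~8.5 bits/elem
  printf "KV cache at %d tokens (q8_0): %.2f GB\n", ctx, bytes / 1e9
}'
```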
Additionally: the MTP (Multi-Token Prediction) variant of this model has a built-in speculative decoding head. BeeLLaMA.cpp (a performance-focused llama.cpp fork) supports it natively — early tests suggest a 2–3× TPS multiplier. That's the path to 80+ TPS on this hardware.