Hardware
Everything in this experiment runs on a single consumer workstation. No cloud. No rented GPU. The specs matter because the entire optimization strategy is built around its memory hierarchy.
The memory hierarchy
The key to understanding everything that follows is this bandwidth table. These numbers determine every optimization decision.
| Tier | Bandwidth | Relative speed |
|---|---|---|
| Dedicated VRAM | 448 GB/s | 1× (baseline) |
| DDR5 RAM (CPU) | 96 GB/s | ~4.7× slower |
| Shared GPU memory (PCIe) | 32 GB/s | 14× slower |
| NVMe SSD | 7 GB/s | 64× slower |
The Model
Qwen3.6 35B-A3B is a Mixture-of-Experts (MoE) model. The numbers mean: 35 billion total parameters, 3 billion active per token. During inference, a router selects which experts (sub-networks) to activate. The rest sit idle in memory.
This is the property that makes consumer hardware viable: you get 35B of knowledge at 3B inference cost. A dense 35B model would be far slower. The tradeoff is memory management complexity — which is exactly what this experiment is about.
We used the IQ4_XS GGUF variant (~20GB). It fits in 16GB VRAM + RAM with no NVMe spillover.
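A back-of-envelope sketch (not part of the original benchmarks) makes "far slower" concrete: ~20GB for 35B weights works out to roughly 0.57 bytes per parameter, so each token reads about 1.7GB of expert weights with 3B active, versus ~20GB if all 35B were dense. Dividing by each tier's bandwidth gives a rough ceiling on tokens per second:

```bash
# Per-token weight traffic and the throughput ceiling each memory tier
# imposes. Pure bandwidth math: compute time and CPU/GPU overlap ignored.
BYTES_PER_PARAM=$(awk 'BEGIN { print 20 / 35 }')   # ~0.57 bytes/param (IQ4_XS: ~20GB / 35B params)

for tier in "VRAM 448" "DDR5 96" "Shared 32" "NVMe 7"; do
  set -- $tier   # split "name bandwidth-in-GB/s"
  awk -v name="$1" -v bw="$2" -v bpp="$BYTES_PER_PARAM" 'BEGIN {
    moe   = 3e9  * bpp / 1e9      # GB read per token with 3B active params
    dense = 35e9 * bpp / 1e9      # GB read per token if all 35B were active
    printf "%-6s  MoE ceiling: %6.1f tok/s   dense ceiling: %5.1f tok/s\n", name, bw/moe, bw/dense
  }'
done
```

These ceilings ignore compute and overlap, but they already explain the shape of the results below: token generation is bandwidth-bound, so everything hinges on which tier the active expert weights are read from.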
Baseline: Ollama Default
The starting point: run the model with Ollama's default configuration and measure tokens per second.
```bash
ollama run qwen3.6:35b-a3b
# Prompt: "Explain how transformers work, step by step."
# Output: 500 tokens
```
CPU/GPU split: 48% CPU / 52% GPU. GPU utilization: 17%.
GPU at 17% utilization. Nearly half the model running on slow CPU paths. Ollama doesn't understand MoE architecture — it naively splits layers without knowing which should go where.
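Both numbers are visible from Ollama's own tooling, no profiler needed (exact output wording varies by Ollama version):

```bash
# --verbose makes `ollama run` print timing stats after the response,
# including an "eval rate" line in tokens per second.
ollama run --verbose qwen3.6:35b-a3b "Explain how transformers work, step by step."

# While the model is loaded, `ollama ps` shows where it lives:
# the PROCESSOR column reads like "48%/52% CPU/GPU".
ollama ps
```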
The MoE Problem
A MoE model has two distinct computation types per layer:
Attention layers — dense, always active every token. Small, fast, GPU loves them.
Expert FFN layers — sparse. Only 3B of 35B activates per token. Large, but most of it sits idle.
Ollama doesn't make this distinction. It splits layers generically. The result: expert weights land in VRAM, fill it up, overflow into shared memory (DDR5 accessed via PCIe at 32 GB/s), and GPU throughput collapses.
Shared GPU memory is DDR5 wearing a GPU mask. It looks like VRAM but runs at PCIe speed — 32 GB/s instead of 448 GB/s. A 14× bandwidth penalty on every expert weight access that hits it.
The Fix: llama.cpp with MoE-aware flags
llama.cpp exposes a flag that Ollama never uses: -ncmoe N — keep the first N MoE layers' expert weights on CPU RAM instead of GPU.
The insight: CPU RAM at 96 GB/s beats shared GPU memory at 32 GB/s by 3×. Keeping experts in RAM is faster than letting them spill into shared VRAM.
Optimal flags
```bash
# Launch config; flags are collected in a bash array so the inline
# comments don't break the command when copy-pasted.
args=(
  -m Qwen3.6-35B-A3B-IQ4_XS.gguf
  -ngl 999           # all layers → GPU
  -ncmoe 11          # 11/41 MoE layers → CPU RAM (sweet spot)
  --flash-attn on    # memory-efficient attention kernel
  -ctk q8_0          # KV cache keys → 8-bit
  -ctv q8_0          # KV cache values → 8-bit
  --no-mmap          # load fully into RAM upfront
  --mlock            # pin model in RAM, prevent OS swap
  -t 9 -tb 16        # CPU threads: 9 gen, 16 batch
  --prio 2           # high process priority
  --poll 100         # reduce CPU↔GPU handoff latency
  --port 8080
)
./llama-server "${args[@]}"
```
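Once the server is up, throughput is easiest to measure over its HTTP API. A minimal check, assuming a recent llama-server build whose native /completion endpoint returns a timings block (the OpenAI-compatible /v1/chat/completions route works as well):

```bash
# Request ~500 tokens and read the generation speed from the server's
# timings block. Field names can differ across llama.cpp versions.
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain how transformers work, step by step.", "n_predict": 500}' \
  | jq '.timings.predicted_per_second'
```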
Tuning the -ncmoe Parameter
The model has 41 layers total. We swept -ncmoe values to find the optimal split. The result shows two distinct cliffs — one from too many experts on CPU, one from VRAM overflow into shared memory.
At the sweet spot (-ncmoe 11): GPU utilization 33% · shared VRAM 4.7GB · 11/41 MoE layers on CPU (27%).
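The sweep itself is easy to script. A sketch, assuming llama-cli accepts the same -ncmoe flag as llama-server (it is a shared argument in current llama.cpp builds) and taking TPS from the perf summary printed at the end of each run:

```bash
# Sweep -ncmoe and pull generation speed from llama-cli's perf summary
# (the "eval time ... tokens per second" line printed on stderr).
for n in 5 10 11 12 15 20; do
  echo "=== -ncmoe $n ==="
  ./llama-cli -m Qwen3.6-35B-A3B-IQ4_XS.gguf \
    -ngl 999 -ncmoe "$n" --flash-attn on -ctk q8_0 -ctv q8_0 \
    --no-mmap --mlock -t 9 -tb 16 \
    -n 200 -p "Explain how transformers work, step by step." \
    2>&1 | grep "eval time"
done
```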
Why the two cliffs exist
Left cliff (ncmoe 20): Too many experts on CPU. CPU matrix multiply is ~500× slower than GPU. GPU finishes its layers and sits idle waiting for CPU — hence 20% GPU utilization and only 26.8 TPS.
Right cliff (ncmoe 5 → 10): Too many experts pushed to GPU. VRAM fills up and overflows into shared memory. GPU now reads some weights at 32 GB/s instead of 448 GB/s — a 14× penalty. GPU appears "busy" at 86% but TPS collapses to 18.
The sweet spot (ncmoe 11): GPU and CPU finish their respective work at roughly the same time. No waiting. No overflow. Neither is the bottleneck.
High GPU% ≠ fast inference. At 86% GPU utilization we got 18 TPS. At 33% GPU utilization we got 40 TPS. The metric that matters is TPS, not utilization.
Results
| Config | TPS | GPU% | Shared VRAM |
|---|---|---|---|
| Ollama default | 18.15 | 17% | — |
| llama.cpp -ncmoe 20 | 26.8 | 20% | 4.0GB |
| llama.cpp -ncmoe 15 | 31.9 | 26% | 6.5GB |
| llama.cpp -ncmoe 12 | 36.3 | 31% | 5.2GB |
| llama.cpp -ncmoe 11 ★ | 40.4 | 33% | 4.7GB |
| llama.cpp -ncmoe 10 | 27.3 | 60% | 4.2GB |
| llama.cpp -ncmoe 5 | 18.1 | 86% | 8.7GB |
40.4 TPS, a 2.2× improvement over Ollama with zero hardware changes. A handful of flags and one tuned parameter recovered performance that Ollama was leaving on the table.
What's Next
This experiment established the baseline and the optimization methodology. Context window was kept small for benchmarking. The next experiment attacks the real constraint: long context.
At 32k context, KV cache grows significantly and eats into the VRAM budget we're currently using for experts. TurboQuant KV cache compression (5–7× reduction) is the tool for this. The goal is to run at 32k–200k context without sacrificing TPS.
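For scale, the KV cache grows linearly with context: bytes ≈ 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. A rough calculator, where the 41-layer count comes from this model but the KV head count and head dimension are placeholders to be replaced with the values in the GGUF metadata, and q8_0 is taken as ~8.5 bits per element:

```bash
# Rough KV cache size. LAYERS matches this model; kv_heads and head_dim
# are placeholder values, read the real ones from the GGUF metadata.
awk -v layers=41 -v kv_heads=4 -v head_dim=128 -v ctx=32768 -v bpe=1.0625 'BEGIN {
  bytes = 2 * layers * kv_heads * head_dim * ctx * bpe   # K and V, q8_0 at ~8.5 bits/elem
  printf "KV cache at %d tokens (q8_0): %.2f GB\n", ctx, bytes / 1e9
}'
```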
Additionally: the MTP (Multi-Token Prediction) variant of this model has a built-in speculative decoding head. BeeLLaMA.cpp (a performance-focused llama.cpp fork) supports it natively — early tests suggest a 2–3× TPS multiplier. That's the path to 80+ TPS on this hardware.