Experiment 01 — MoE Inference

Running Qwen3.6 35B at 40 TPS on Consumer Hardware

May 9, 2026 · RTX 5060 Ti · 32GB DDR5 · llama.cpp b9091
Ollama leaves 2.2× performance on the table for MoE models. This post documents exactly why — and how we recovered it using five flags, one tuning parameter, and a key insight about memory bandwidth hierarchy.

Hardware

Everything in this experiment runs on a single consumer workstation. No cloud. No rented GPU. The specs matter because the entire optimization strategy is built around its memory hierarchy.

GPU: RTX 5060 Ti (16GB GDDR7 · 448 GB/s · Blackwell)
System RAM: 32GB DDR5 (~96 GB/s bandwidth)
CPU: Ryzen 9700X (8 cores / 16 threads · AVX2)
PCIe bus: PCIe 5.0 x8 (~32 GB/s GPU↔RAM)
NVMe SSD: Gen4 NVMe (~7 GB/s sequential read)
Total fast memory: ~41 GB (16GB VRAM + ~25GB usable RAM)

The memory hierarchy

The key to understanding everything that follows is this bandwidth table. These numbers determine every optimization decision.

Tier                              Bandwidth   Relative to NVMe
Dedicated VRAM (GDDR7)            448 GB/s    64×
DDR5 RAM (CPU)                    96 GB/s     ~14×
Shared GPU memory (PCIe 5.0 x8)   32 GB/s     ~4.6×
NVMe SSD                          7 GB/s      1×
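These ratios drive every number that follows; a quick sketch of the arithmetic:

```python
# The bandwidth ratios behind the optimization decisions below.
tiers = {
    "VRAM (GDDR7)": 448,                      # GB/s
    "DDR5 RAM": 96,
    "Shared GPU memory (PCIe 5.0 x8)": 32,
    "NVMe SSD": 7,
}
shared = tiers["Shared GPU memory (PCIe 5.0 x8)"]
print(f"VRAM vs shared : {tiers['VRAM (GDDR7)'] / shared:.0f}x")  # 14x
print(f"RAM  vs shared : {tiers['DDR5 RAM'] / shared:.0f}x")      # 3x
```

The 14× and 3× figures are the two numbers this whole experiment turns on.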

The Model

Qwen3.6 35B-A3B is a Mixture-of-Experts (MoE) model. The numbers mean: 35 billion total parameters, 3 billion active per token. During inference, a router selects which experts (sub-networks) to activate. The rest sit idle in memory.

This is the property that makes consumer hardware viable: you get 35B of knowledge at 3B inference cost. A dense 35B model would be far slower. The tradeoff is memory management complexity — which is exactly what this experiment is about.
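A minimal sketch of the routing step, with toy NumPy weights (the expert count, dimensions, and top-k value here are illustrative, not Qwen's real configuration):

```python
import numpy as np

# Toy MoE layer: a router scores all experts per token, only the top-k
# run, and their outputs are gate-weighted. All sizes are illustrative.
rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2
router_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    scores = x @ router_w                    # one logit per expert
    chosen = np.argsort(scores)[-top_k:]     # indices of the top-k experts
    gates = np.exp(scores[chosen])
    gates /= gates.sum()                     # softmax over chosen experts
    # Only the chosen experts' weights are touched; the rest sit idle.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

y = moe_layer(rng.normal(size=d_model))
print(y.shape)  # (16,)
```

With 8 experts and top-2 routing, only a quarter of the expert weights are read per token; that sparsity is what the optimization below exploits.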

We used the IQ4_XS GGUF variant (~20GB). It fits in 16GB VRAM + RAM with no NVMe spillover.
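A back-of-envelope check on that file size, assuming ~4.25 bits per weight for IQ4_XS (an approximation; real GGUF files mix quantization types across tensors):

```python
# Rough in-memory size of a 35B-parameter model at IQ4_XS quantization.
params = 35e9
bits_per_weight = 4.25   # approximate average for IQ4_XS
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB")  # ~18.6 GB, close to the observed ~20GB
```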

Baseline: Ollama Default

The starting point: run the model with Ollama's default configuration and measure tokens per second.

ollama run qwen3.6:35b-a3b
# Prompt: "Explain how transformers work, step by step."
# Output: 500 tokens
Baseline Result
18.15 tokens/sec

CPU/GPU split: 48% CPU / 52% GPU. GPU utilization: 17%.

GPU at 17% utilization. Nearly half the model running on slow CPU paths. Ollama doesn't understand MoE architecture — it naively splits layers without knowing which should go where.

The MoE Problem

A MoE model has two distinct computation types per layer:

Attention layers — dense, always active every token. Small, fast, GPU loves them.

Expert FFN layers — sparse. Only 3B of 35B activates per token. Large, but most of it sits idle.

Ollama doesn't make this distinction. It splits layers generically. The result: expert weights land in VRAM, fill it up, overflow into shared memory (DDR5 accessed via PCIe at 32 GB/s), and GPU throughput collapses.

Shared GPU memory is DDR5 wearing a GPU mask. It looks like VRAM but runs at PCIe speed — 32 GB/s instead of 448 GB/s. A 14× bandwidth penalty on every expert weight access that hits it.

The Fix: llama.cpp with MoE-aware flags

llama.cpp exposes a flag that Ollama never uses: -ncmoe N (long form --n-cpu-moe) — keep the first N MoE layers' expert weights in CPU RAM instead of VRAM.

The insight: CPU RAM at 96 GB/s beats shared GPU memory at 32 GB/s by 3×. Keeping experts in RAM is faster than letting them spill into shared VRAM.
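A bandwidth-ceiling estimate makes the stakes concrete. Assuming ~3B active parameters at ~4.25 bits/weight are read each token and that weight traffic dominates (this ignores compute time, KV cache reads, and CPU/GPU overlap, so these are loose upper bounds, not predictions):

```python
# How fast each tier could possibly serve the active-expert reads per token.
active_params = 3e9
gb_per_token = active_params * 4.25 / 8 / 1e9   # ~1.6 GB read per token

for tier, bw in [("DDR5 RAM", 96), ("Shared GPU memory", 32)]:
    print(f"{tier}: <= {bw / gb_per_token:.0f} tokens/sec")
# DDR5 RAM: <= 60 tokens/sec
# Shared GPU memory: <= 20 tokens/sec
```

The ~20 TPS shared-memory ceiling lines up with the 18 TPS measured when experts spill into it.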

Optimal flags

./llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS.gguf \
  -ngl 999 \
  -ncmoe 11 \
  --flash-attn on \
  -ctk q8_0 \
  -ctv q8_0 \
  --no-mmap \
  --mlock \
  -t 9 -tb 16 \
  --prio 2 \
  --poll 100 \
  --port 8080

# -ngl 999         all layers → GPU
# -ncmoe 11        11/41 MoE layers' experts → CPU RAM (sweet spot)
# --flash-attn on  memory-efficient attention kernel
# -ctk/-ctv q8_0   KV cache keys/values → 8-bit
# --no-mmap        load fully into RAM upfront
# --mlock          pin model in RAM, prevent OS swap
# -t 9 -tb 16      CPU threads: 9 generation, 16 batch
# --prio 2         high process priority
# --poll 100       reduce CPU↔GPU handoff latency

Tuning the -ncmoe Parameter

The model has 41 layers total. We swept -ncmoe values to find the optimal split. The result shows two distinct cliffs — one from too many experts on CPU, one from VRAM overflow into shared memory.

[Chart: -ncmoe sweep — generation TPS, GPU utilization %, and shared VRAM (GB) across -ncmoe values]
Sweet Spot — -ncmoe 11
40.4 tokens/sec

GPU 33% · Shared VRAM 4.7GB · 11/41 layers on CPU (27%)

Why the two cliffs exist

Left cliff (ncmoe 20): Too many experts on CPU. CPU matrix multiply is ~500× slower than GPU. GPU finishes its layers and sits idle waiting for CPU — hence 20% GPU utilization and only 26.8 TPS.

Right cliff (ncmoe 5 → 10): Too many experts pushed to GPU. VRAM fills up and overflows into shared memory. GPU now reads some weights at 32 GB/s instead of 448 GB/s — a 14× penalty. GPU appears "busy" at 86% but TPS collapses to 18.

The sweet spot (ncmoe 11): GPU and CPU finish their respective work at roughly the same time. No waiting. No overflow. Neither is the bottleneck.

High GPU% ≠ fast inference. At 86% GPU utilization we got 18 TPS. At 33% GPU utilization we got 40 TPS. The metric that matters is TPS, not utilization.
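The balance argument can be captured in a toy model where per-token time is max(CPU time, GPU time) plus a spill penalty. Every constant below is an illustrative assumption (weights and reads spread evenly over 41 layers, ~13GB of VRAM free for experts, an effective CPU bandwidth that folds in slower CPU matmul), so it reproduces the shape of the sweep, two cliffs around a middle sweet spot, not the measured numbers:

```python
# Toy model of the -ncmoe sweep: per-token time is max(CPU, GPU) time,
# since whichever side finishes last sets the pace. All constants are
# illustrative assumptions, not measurements.
def model_tps(ncmoe, layers=41, weights_gb=20.0, read_gb=1.6, vram_gb=13.0,
              bw_gpu=448.0, bw_shared=32.0, bw_cpu_eff=48.0):
    read_per_layer = read_gb / layers         # GB actually read per token
    weights_per_layer = weights_gb / layers   # GB resident per layer
    gpu_layers = layers - ncmoe
    gpu_resident = gpu_layers * weights_per_layer
    # Fraction of GPU-side expert weights that overflow into shared memory
    spill = max(0.0, gpu_resident - vram_gb) / max(gpu_resident, 1e-9)
    gpu_read = gpu_layers * read_per_layer
    t_gpu = gpu_read * (1 - spill) / bw_gpu + gpu_read * spill / bw_shared
    t_cpu = ncmoe * read_per_layer / bw_cpu_eff
    return 1.0 / max(t_cpu, t_gpu)

best = max(range(42), key=model_tps)
print(f"toy optimum: -ncmoe {best}")  # lands near the measured sweet spot
```

Push ncmoe up and t_cpu dominates (left cliff); push it down and the spill term blows up t_gpu (right cliff).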

Results

Baseline vs Optimized
Config                   TPS     GPU%   Shared VRAM
Ollama default           18.15   17%    —
llama.cpp -ncmoe 20      26.8    20%    4.0GB
llama.cpp -ncmoe 15      31.9    26%    6.5GB
llama.cpp -ncmoe 12      36.3    31%    5.2GB
llama.cpp -ncmoe 11 ★    40.4    33%    4.7GB
llama.cpp -ncmoe 10      27.3    60%    4.2GB
llama.cpp -ncmoe 5       18.1    86%    8.7GB
Final Result

40.4 TPS — 2.2× improvement over Ollama with zero hardware changes. Five flags and one tuning parameter recovered performance that Ollama was leaving on the table.

What's Next

This experiment established the baseline and the optimization methodology. Context window was kept small for benchmarking. The next experiment attacks the real constraint: long context.

At 32k context, KV cache grows significantly and eats into the VRAM budget we're currently using for experts. TurboQuant KV cache compression (5–7× reduction) is the tool for this. The goal is to run at 32k–200k context without sacrificing TPS.
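For sizing that budget, the standard KV cache formula helps. The layer count comes from this post; the GQA head configuration below is an assumed placeholder (read the real values from the GGUF metadata):

```python
# KV cache footprint: 2 (K and V) x layers x KV heads x head dim x context
# x bytes per element. 41 layers is from the post; 4 KV heads x 128 head
# dim are assumed placeholders.
def kv_cache_gb(ctx, layers=41, kv_heads=4, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

print(f"32k fp16 : {kv_cache_gb(32768):.2f} GB")                    # 2.75 GB
print(f"32k q8_0 : {kv_cache_gb(32768, bytes_per_elem=1):.2f} GB")  # 1.38 GB
```

(q8_0 adds a small per-block overhead on top of 1 byte/element, so the real q8_0 figure is slightly higher.) At 200k context the fp16 figure grows past 16GB, which is why cache compression becomes mandatory.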

Additionally: the MTP (Multi-Token Prediction) variant of this model has a built-in speculative decoding head. BeeLLaMA.cpp (a performance-focused llama.cpp fork) supports it natively — early tests suggest a 2–3× TPS multiplier. That's the path to 80+ TPS on this hardware.