Impact of hardware on locally hosted LLMs

Hardware⌗
GMK-TEK EVO-X1 AI Mini
AMD Ryzen™ AI 9 HX 370 | Radeon 890M
32GB LPDDR5X 7500 MT/s
Model used: Unsloth/Qwen3.5-9B-UD-Q4_K_XL.gguf
Maximum theoretical token throughput⌗
32 GB of RAM: 4 channels of 8 GB, each 32 bits wide (128-bit bus in total)
Theoretical peak bandwidth:
7500 MT/s × 128 bits ÷ 8 bits/byte
= 7500 × 16 bytes
= 120 000 MB/s
= 120 GB/s
Bandwidth: 120 GB/s
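The calculation above can be sketched as a small helper (the 7500 MT/s and 128-bit figures come from the hardware specs above):

```python
def peak_bandwidth_gbs(mt_per_s: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s:
    transfer rate (MT/s) x bus width (bits) / 8 bits per byte,
    then MB/s -> GB/s."""
    return mt_per_s * bus_width_bits / 8 / 1000

print(peak_bandwidth_gbs(7500, 128))  # 120.0
```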
Theoretical ceiling:⌗
Weight calculation based on:
- FP16 ≈ 2.0 bytes/param
- INT8 ≈ 1.0 byte/param
- Q4 ≈ 0.5 byte/param
Example: the model has 9B parameters, quantized at Q4.
$$ \text{Weight size} = 9 \times 10^9 \times 0.5 = 4.5\ \text{GB} $$
Then we compute the maximum tokens per second (MTok/s). During generation, every token requires reading all the weights from memory once, so throughput is bandwidth-bound:
$$ \text{MTok/s} = \frac{\text{bandwidth (GB/s)}}{\text{weight size (GB)}} $$
On the GMKtek:
$$ \text{MTok/s} = \frac{120}{4.5} \approx 26.7 $$
Observed output: 14 tok/s
There is therefore a significant discrepancy between the theoretical MTok/s and the observed token rate: roughly half of the ceiling.
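The ceiling and the resulting efficiency can be computed directly from the figures above (9B parameters, Q4 at 0.5 byte/param, 120 GB/s, 14 tok/s observed):

```python
def weight_size_gb(params_billions: float, bytes_per_param: float) -> float:
    """Model weight size in GB (1e9 params * bytes / 1e9 = GB)."""
    return params_billions * bytes_per_param

def max_tok_s(bandwidth_gbs: float, weight_gb: float) -> float:
    """Bandwidth-bound ceiling: each token reads all weights once."""
    return bandwidth_gbs / weight_gb

weights = weight_size_gb(9, 0.5)    # 4.5 GB at Q4
ceiling = max_tok_s(120, weights)   # ~26.7 tok/s
observed = 14
print(f"ceiling {ceiling:.1f} tok/s, efficiency {observed / ceiling * 100:.1f}%")
```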
Possible causes and improvements:⌗
Prefill⌗
During prefill, the prompt is fed to the LLM, which computes and stores the KV cache for every prompt token. This can be a bottleneck for reactivity when the context is large (RAG, code, long documents), because no token is generated until prefill completes.
$$ \text{Prefill duration} \approx \frac{\text{number of prompt tokens}}{\text{prompt tokens per second}} $$
On the GMKtek, llama.cpp reports a prompt-processing speed of 43 tok/s.
- 1000 tokens → 1000 / 43 ≈ 23 s
- 2000 tokens → 2000 / 43 ≈ 47 s
- 3000 tokens → 3000 / 43 ≈ 70 s
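The same estimate as a one-liner, using the 43 tok/s prompt-processing rate measured above:

```python
def prefill_seconds(prompt_tokens: int, prompt_tok_s: float) -> float:
    """Approximate time before the first generated token appears."""
    return prompt_tokens / prompt_tok_s

for n in (1000, 2000, 3000):
    print(f"{n} tokens -> {prefill_seconds(n, 43):.1f} s")
```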
KV cache⌗
During prefill and generation, the model stores in memory a Key (K) and a Value (V) vector for each processed token, at every attention layer; this is the KV cache.
To generate the next token, the model refers back to these stored K/V values instead of recomputing the entire context.
Accessing the cache is significantly faster than recalculating the full attention over all previous tokens.
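The memory cost of this cache grows linearly with context length. A rough sketch of the size estimate, using illustrative architecture numbers (the layer count, KV-head count, and head dimension below are assumptions for the sake of the example, not the actual Qwen config):

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: one K and one V vector (factor 2)
    per token, per layer, per KV head, in FP16 by default."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

# Illustrative config: 36 layers, 8 KV heads (GQA), head_dim 128,
# FP16 cache, 8192-token context.
print(f"{kv_cache_gb(8192, 36, 8, 128):.2f} GB")  # ~1.21 GB
```

This is why large contexts are costly on a 32 GB machine: the cache competes with the model weights for the same memory and bandwidth.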
Software overhead (llama.cpp, Ollama, vLLM)⌗
Software overheads during dequantization and KV‑cache reads can reduce throughput.