Repository: localai
License: gemma

Google Gemma 4 E2B-IT served by SGLang with Multi-Token Prediction (MTP) speculative decoding. The companion drafter google/gemma-4-E2B-it-assistant lets the target model accept several tokens per decoding step. The flags are a 1:1 transcription of the SGLang cookbook's MTP command (NEXTN algorithm, num_steps=5, num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The E2B variant has 5B total / 2B effective parameters and targets the smaller end of consumer GPUs.
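A minimal launch-command sketch under these flags, assuming SGLang's standard speculative-decoding CLI options and a google/gemma-4-E2B-it target path inferred from the drafter's name (the cookbook's exact invocation may differ):

    # MTP speculative decoding with a separate drafter (model paths assumed)
    python -m sglang.launch_server \
      --model-path google/gemma-4-E2B-it \
      --speculative-draft-model-path google/gemma-4-E2B-it-assistant \
      --speculative-algorithm NEXTN \
      --speculative-num-steps 5 \
      --speculative-num-draft-tokens 6 \
      --speculative-eagle-topk 1 \
      --mem-fraction-static 0.85
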
Repository: localai
License: gemma

Google Gemma 4 E4B-IT served by SGLang with Multi-Token Prediction (MTP) speculative decoding. The companion drafter google/gemma-4-E4B-it-assistant lets the target model accept several tokens per decoding step. The flags are a 1:1 transcription of the SGLang cookbook's MTP command (NEXTN algorithm, num_steps=5, num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The E4B variant has 8B total / 4B effective parameters, making it the natural pick for consumer GPUs in the 16–24 GB range.
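Once up, the server exposes SGLang's OpenAI-compatible API (port 30000 by default), and speculative decoding is transparent to clients, so a plain chat request doubles as a smoke test. The model name below mirrors the drafter's naming and is an assumption:

    curl http://localhost:30000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "google/gemma-4-E4B-it",
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 64
          }'
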
Repository: localai
License: mit

Xiaomi MiMo-7B-RL served by SGLang with built-in Multi-Token Prediction (MTP) heads (no separate drafter needed), plus online fp8 weight quantization to fit on a 16 GB consumer GPU. The model card reports a ~90% draft-token acceptance rate. Verified end-to-end at ~88 tok/s on an RTX 5070 Ti (16 GB). Note: mem_fraction_static is lowered to 0.7 (vs SGLang's 0.85 default) because the MTP draft worker's vocab embedding is loaded unquantized (~1.2 GiB) and would otherwise OOM the static memory reservation.
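A minimal sketch of the corresponding launch command, limited to the settings this entry names; the XiaomiMiMo/MiMo-7B-RL path and --trust-remote-code are assumptions, and the MTP-specific flags are omitted because the entry does not spell them out:

    # built-in MTP heads, so no separate --speculative-draft-model-path is needed;
    # the Hugging Face path and --trust-remote-code are assumptions
    python -m sglang.launch_server \
      --model-path XiaomiMiMo/MiMo-7B-RL \
      --quantization fp8 \
      --mem-fraction-static 0.7 \
      --trust-remote-code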