pegainfer (2): Adding a Sampler to the Inference Engine
In the last post, we built the skeleton of pegainfer from scratch: it runs Qwen3 4B and matches HuggingFace's numerical precision, but it only supports greedy decoding. The output is deterministic, which is a poor fit for creative tasks. So this post is about the Sampler: teaching the model to think a little more freely.
That said, a Sampler is just a Sampler. This post involves no novel optimizations — it is an ordinary Sampler, as ordinary as it gets.
At the same time, this is also a record of how I collaborate with LLMs during development. 100% of pegainfer’s code is generated by Opus 4.6, but “generated” does not mean “hands-off” — how to guide it, how to review it, and how to make decisions: that is the main thing I want to share in this post.
Code is cheap; let me show you my talk.
Learning Phase: Understanding What a Sampler Is
My understanding of Samplers was that they are parameters that constrain how freely the model generates. But I did not know the specifics of how the constraints work or how they are applied.
So before writing any code, I used the API to systematically learn about sampling strategies with Opus 4.6. Below is the conversation we had. My questions are preserved as-is; the model’s responses are condensed.
Me: What is a Sampler in a large language model? My understanding is that it just predicts the next token — wouldn’t picking the one with the highest probability be enough?
At each step, the model does not output a single token — it outputs a probability distribution over the entire vocabulary, with a score (logit) for every token. Picking the highest-probability token is called greedy decoding, and the problem is that the output tends to loop and repeat, and it can miss paths that are not the highest probability at the current step but are better overall. It is like navigating a maze by always moving toward the exit — sometimes that leads you into a dead end.
Me: So the idea is to introduce some randomness?
Right, but the key question is how much randomness and what kind. Complete randomness produces nonsense; no randomness produces boring repetition. A Sampler is a set of strategies that controls this balance, sitting between the model’s output distribution and the final chosen token.
The most basic parameter is Temperature — it controls whether the probability distribution is “sharp” or “flat”:
Low temperature (0.1) → distribution becomes sharp, almost always picks the top token, close to greedy
High temperature (2.0) → distribution becomes flat, every token has a reasonable chance of being picked, more random
Mathematically it is one line of division: logits[i] /= temperature, then softmax. Temperature = 0 is equivalent to greedy.
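For concreteness, a minimal Rust sketch of that step (`logits` here is just the vocabulary-sized score vector; this is an illustration, not pegainfer's code):

```rust
// Minimal sketch: divide the logits by the temperature, then softmax.
// Subtracting the max before exp() is the usual numerical-stability trick.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```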
Me: Got it. Keep going — what are top_k and top_p?
Top-k takes a completely different approach: it just cuts. It keeps only the k highest-probability tokens and ignores the rest entirely, then samples from what remains. This eliminates the absurd low-probability tokens, but k is fixed — sometimes the model is confident there is only one reasonable option, other times ten candidates are roughly equal. A fixed cutoff lacks flexibility.
Top-p (Nucleus Sampling) solves this: instead of fixing the count, it sorts tokens by probability from highest to lowest, accumulates probabilities, and stops when the cumulative total reaches p — everything beyond that is cut. When the model is confident, only a few tokens survive; when it is uncertain, more survive. It is adaptive.
Me: How do these work together in practice? It looks like three hyperparameters that can be tuned. Why can’t this be handled during training?
The three are applied sequentially, like a pipeline:
Model outputs logits
↓
① Temperature: adjust sharpness of distribution
↓
② Top-k: cut everything outside the top k
↓
③ Top-p: apply another cumulative-probability cut on what remains
↓
Sample randomly from the final surviving tokens
As for why this is not handled during training — training and generation are two different tasks. During training, the model’s goal is to learn an accurate probability distribution (an objective knowledge problem). During generation, we need to pick a token from that distribution based on the situation (a subjective choice problem). Writing code calls for low temperature and precision; writing fiction calls for high temperature and creativity. This preference should not be baked into the model weights.
In one sentence: training solves “knowing what is correct”; the Sampler solves “choosing how to use that knowledge based on context.”
Me: Let’s talk about the interface and implementation. In Rust terms, would this be something like `fn sample(Vec<(Token, float)>, t, top_k, top_p) -> Token`?
That is roughly right. Two improvements though: first, the input should be logits rather than probabilities, since Temperature is more naturally applied directly to logits (divide first, then softmax, rather than the reverse); second, the processing can be split into a LogitProcessor trait pipeline, where each processor is independently testable and composable. For our basic version though, a single function is sufficient.
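To make that concrete, here is a minimal single-function sketch along those lines. It is a naive CPU version for illustration, not the exact code that landed in pegainfer; the signature, the `Rng` parameter, and the convention that top_k = 0 means "disabled" are illustrative choices.

```rust
// Naive CPU sampler sketch: temperature -> top-k -> top-p -> weighted draw.
// Returns the index of the chosen token. Assumes finite logits.
use rand::prelude::*;

fn sample(logits: &[f32], temperature: f32, top_k: usize, top_p: f32, rng: &mut impl Rng) -> usize {
    assert!(!logits.is_empty(), "logits must not be empty");

    // temperature <= 0 degenerates to greedy decoding (argmax).
    if temperature <= 0.0 {
        return argmax(logits);
    }

    // ① Temperature: scale the logits, then softmax (stabilized by subtracting the max).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|&e| e / sum).collect();

    // Candidate indices sorted by probability, highest first.
    let mut candidates: Vec<usize> = (0..probs.len()).collect();
    candidates.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    // ② Top-k: keep only the k most probable tokens (0 = disabled).
    if top_k > 0 && top_k < candidates.len() {
        candidates.truncate(top_k);
    }

    // ③ Top-p: keep the smallest prefix whose cumulative probability reaches p.
    let mut cumulative = 0.0;
    let mut cutoff = candidates.len();
    for (i, &idx) in candidates.iter().enumerate() {
        cumulative += probs[idx];
        if cumulative >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    candidates.truncate(cutoff);

    // Weighted draw over the survivors (renormalization happens implicitly).
    let total: f32 = candidates.iter().map(|&i| probs[i]).sum();
    let mut r = rng.random::<f32>() * total; // rand 0.9+ API; `gen()` on older versions
    for &idx in &candidates {
        r -= probs[idx];
        if r <= 0.0 {
            return idx;
        }
    }
    *candidates.last().unwrap() // fallback for floating-point rounding
}

fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}
```

The full sort is the deliberately naive part; whether that matters on CPU is exactly the next question.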
Me: During inference, where does top_k typically run — GPU or CPU? GPUs are parallel, but top_k seems like it has limited parallelism.
Most frameworks do it on CPU. The input is just a float vector of vocabulary size (tens of thousands to a hundred-thousand-plus), and on CPU a fast-selection algorithm finishes in microseconds, while a single forward pass takes tens of milliseconds — the cost of sampling is negligible compared to the forward pass. High-throughput systems (vLLM, TensorRT-LLM) keep sampling on GPU, but mainly not because top-k needs GPU acceleration — it is to avoid repeated GPU-to-CPU data transfers.
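As a sketch of what "fast selection" can look like on the CPU, Rust's standard library exposes quickselect directly (illustrative helper, assuming finite logits and 1 ≤ k ≤ vocabulary size):

```rust
// Quickselect-based top-k on CPU: average O(n), no full sort.
// Returns the indices of the k largest logits, in no particular order.
fn top_k_indices(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    // Partition so that idx[..k] holds the k largest logits (descending comparator).
    idx.select_nth_unstable_by(k - 1, |&a, &b| {
        logits[b].partial_cmp(&logits[a]).unwrap()
    });
    idx.truncate(k);
    idx
}
```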
After learning the core concepts, I explored a few more topics I was curious about. These are not directly relevant to the current pegainfer implementation, but they helped me build a more complete picture:
Me: Is anyone doing research on Samplers? The ROI seems pretty high.
There is, and it is growing. A few directions: Mirostat (adaptive sampling that dynamically adjusts the cutoff to maintain stable “surprise level”, already in llama.cpp), Contrastive Decoding (run a large and a small model simultaneously, keep tokens that the large model likes but the small model does not, amplifying the large model’s deeper capabilities), Speculative Decoding (small model drafts, large model verifies — does not change quality, only speed). Swapping sampling strategies on the same model makes a big perceptual difference at nearly zero cost, so inference frameworks are offering more and more sampling options.
Me: Tell me more about top-k on GPU.
Three main approaches: Bitonic Sort (classic parallel sort, GPU-friendly but full sort is wasteful), Radix Select (determines the k-th largest value bit by bit — each round is a parallel count, O(N), the actual implementation behind PyTorch’s torch.topk), Heap (each thread maintains a min-heap of size k, suited for small k).
Me: Tell me more about Radix Select. DSA’s top_k=2048 is fairly large.
The core idea: instead of sorting, determine the threshold from the highest bit to the lowest bit of float32 — each round asks “how many candidates have a 1 in this bit position?” (a single parallel count), then uses that count and the remaining k to decide whether this bit is 0 or 1, while narrowing the candidate set. After 32 rounds, the k-th largest value is precisely locked in, then one more pass collects the results. A prerequisite step converts float to ordered u32 (flip the sign bit / negate for negatives) so binary magnitude order matches floating-point magnitude order.
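That float-to-ordered-u32 conversion is small enough to write out (a standard trick, assuming no NaNs among the logits):

```rust
// Map an f32 to a u32 whose unsigned ordering matches the float ordering:
// negatives get all bits flipped (reversing their order and placing them lowest),
// non-negatives just get the sign bit set (placing them above all negatives).
fn f32_to_ordered_u32(x: f32) -> u32 {
    let bits = x.to_bits();
    if bits & 0x8000_0000 != 0 {
        !bits
    } else {
        bits | 0x8000_0000
    }
}
```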
Me: I think there is room to optimize the starting position. LLM logits have a narrow distribution — you probably do not need to start from the very first bit, right?
Your intuition is correct. LLM logits typically concentrate around [-10, 30], so the high bits (sign bit + high exponent bits) of float32 are nearly identical across all values — scanning from bit 31 wastes several rounds. A parallel min/max pass can determine a valid starting bit, saving roughly 25% of rounds. PyTorch’s general-purpose topk does not do this (it cannot assume a data distribution), but a dedicated path for LLM inference absolutely could. A more aggressive idea: exploit the continuity of autoregression — use the previous step’s threshold as the initial guess for the current step, then just verify and fine-tune.
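A sketch of that starting-bit computation, built on the ordered-u32 mapping above (hypothetical helper, not something any framework ships as-is):

```rust
// Given the min and max logits already mapped to ordered u32s, the highest bit
// where they differ is the first bit radix select actually needs to examine;
// all higher bits are identical across the whole vocabulary.
fn first_useful_bit(min_ord: u32, max_ord: u32) -> u32 {
    let diff = min_ord ^ max_ord;
    if diff == 0 { 0 } else { 31 - diff.leading_zeros() }
}
```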
Development Phase: Collaborating with an LLM
With the concepts learned, time to implement. I used Claude Code + Opus 4.6 High, working conversationally in the terminal. Below I document my prompts to Claude Code and the key decisions, organized into three phases: design, implementation, and wrap-up.
Design Phase
CLAUDE CODE prompt:
### `DESIGN` — Design First
Discuss tradeoffs and propose options before writing code.
- Give 2–3 realistic options with pros and cons; focus on real constraints (performance, complexity, maintainability).
- Short code sketches or API stubs are fine, but do not write "final" production code.
- Prioritize concise, high-information reasoning over long tutorials. The user is an experienced engineer.
Enter design phase. We want to add a Sampler to the project, so you may need to help me nail down some implementation details. Native support for t, top_k, top_p. No aggressive optimizations — a naive CPU implementation is fine for the basic version.
Claude Code gave two options: a simple single function, and a LogitProcessor trait pipeline — justified by future extensibility.
Future extensibility — so what do you think might change about this sampling in the future that we would need to support? Option A can also support a fixed seed. The two do not seem fundamentally different.
Add top_k. Seed can be fixed at system startup. For the temperature default value, check mini sglang.
Sure, make it work.
The general principle: no premature optimization, just get a naive correct version running.
Implementation Phase
After finalizing the design, I told Claude Code to implement it.
Once it finished writing, I started reviewing the code.
I see you wrote this branch here — will greedy actually be fast on GPU? Also, the `/ t` — what happens if t is 0? And top_k of 0 — is that an illegal argument? Though t=0 is legal and meaningful.
Under what circumstances would these two unwraps trigger? Does using `or` here potentially mask the real problem? We could use assert to enforce input validity constraints.
Then I kept checking the code and moved toward wrap-up.
Wrap-up and Validation Phase
Ok, let’s add some unit tests for the Sampler.
Ok, give me two curl commands — one greedy, one with some randomness — and after I `cargo run -r` I will curl them myself and compare.
Opus 4.6 generated a set of unit tests with solid coverage. After reviewing them, I thought the test cases were well chosen. (Back in the Sonnet era, LLMs would often generate tests that existed just for the sake of having tests.)
| Test | What it verifies |
|---|---|
| `greedy_defaults` | Default parameters → greedy |
| `temperature_zero_returns_argmax` | t=0 → argmax |
| `negative_temperature_is_greedy` | t<0 → argmax |
| `top_k_1_picks_argmax` | top_k=1 → must pick the maximum |
| `deterministic_with_seed` | Same seed → same result |
| `top_k_restricts_candidates` | top_k=3 → only 3 tokens appear |
| `respects_top_p` | Dominant logit + low top_p → must pick it |
| `top_k_and_top_p_combined` | Both filters applied together |
| `low_temperature_concentrates` | t=0.01 → always picks argmax |
| `high_temperature_spreads` | t=100 → all tokens appear |
| `equal_logits_uniform` | Equal logits → approximately uniform distribution |
| `empty_logits_panics` | assert fires, should_panic |
| `negative_logits_argmax` | All-negative logits → picks correct maximum |
| `single_logit` | Single-element edge case |
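For the shape of these tests, here is a sketch of one of them, written against the hypothetical `sample` signature from the earlier sketch (not the actual pegainfer test code):

```rust
use rand::{rngs::StdRng, SeedableRng};

// t = 0 must reduce to greedy (argmax), regardless of top_k, top_p, or seed.
#[test]
fn temperature_zero_returns_argmax() {
    let logits = vec![0.1_f32, 2.5, -1.0, 0.3];
    let mut rng = StdRng::seed_from_u64(42);
    let token = sample(&logits, 0.0, 0, 1.0, &mut rng);
    assert_eq!(token, 1); // index of the largest logit
}
```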
The LLM writes the tests; I run them, curl the server, and check the actual output. Then I did a full code scan in Zed (excluding tests) and found a few things to fix:
Upgrade rand to 0.10, which was released recently. (Checked the release notes — confirmed, the upgrade makes sense.)
The seed stays fixed at 42 for now; no need for configurability yet, skipping.
Then I tested it by hand: the greedy output and the randomized output were indeed different, as expected. Switched branches, committed, and submitted the PR.
Side Note: Streaming Responses
After finishing the Sampler, I added streaming response support while I was at it. The design is straightforward: spawn a tokio task per request, return a receiver, and push tokens to the client one by one via SSE.
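For reference, a minimal sketch of that shape, assuming an axum-style handler and a tokio mpsc channel (the names and the placeholder decode loop are illustrative, not pegainfer's actual server code):

```rust
use axum::response::{sse::{Event, Sse}, IntoResponse};
use tokio::sync::mpsc;
use tokio_stream::wrappers::UnboundedReceiverStream;

async fn generate_stream(/* request */) -> impl IntoResponse {
    let (tx, rx) = mpsc::unbounded_channel();

    // One tokio task per request: run the decode loop and push each sampled
    // token to the client as its own SSE event.
    tokio::spawn(async move {
        for token_text in ["Hello", ",", " world"] { // placeholder for the real decode loop
            let _ = tx.send(Ok::<_, std::convert::Infallible>(Event::default().data(token_text)));
        }
    });

    // The receiver side becomes the SSE body; events flow as the task produces them.
    Sse::new(UnboundedReceiverStream::new(rx))
}
```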
One detail worth noting: after looking at mini sglang’s implementation, it has a detokenizer layer for word-level buffering — BPE splits words into subword pieces (e.g., unfortunately → ['un', 'fortunately']), so if the current token decodes to an incomplete subword, it is held until the next token arrives and they are sent together. This means the actual number of SSE events is lower than the number of tokens, saving bandwidth. I left a TODO for this and will add it later.
The PR is here: https://github.com/xiaguan/pegainfer/pull/1
Next Steps
With the Sampler and streaming in place, pegainfer can now run as a basically usable inference service. The next directions are:
Build a proper benchmarking framework, or reuse vllm’s and sglang’s bench_serving tools — but the workload needs to be diverse.
The next chapter should be: pegainfer attempts to match vllm and sglang performance. Of course, without CUDA graph support, decode will be somewhat slower.