Large Language Models (LLMs) can produce different outputs for the same input even with the temperature set to zero, which makes sampling greedy and should, in principle, eliminate variance in the output. This has commonly been attributed to two main sources:

  1. floating point arithmetic, and
  2. parallelism on the GPU.

The usual story is that floating point arithmetic is non-associative, so the order in which values are combined changes the result:

$$a + (b + c) \neq (a + b) + c$$

The intuition is that GPU cores computing partial results in parallel can finish, and be combined, in a different order from run to run; because floating point addition is order-sensitive, the final value can differ.
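A two-line example makes the order sensitivity concrete (plain Python floats, nothing GPU-specific):

```python
a, b, c = 1e20, -1e20, 1.0

print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- b + c rounds to -1e20, so c is lost
```

When parallel workers hand back partial sums in whichever order they happen to finish, which grouping you effectively get is out of your hands, and so is the final bit pattern.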

Horace He [1] of Thinking Machines Lab argues that the non-associativity of floating point arithmetic, while technically true, is not the main cause of nondeterminism during inference. That explanation usually assumes atomic operations, which let threads update the same memory location concurrently and therefore in an unpredictable order. Modern inference kernels in highly optimized libraries like PyTorch, however, use deterministic vectorized reductions rather than atomics in the forward pass, so non-associativity alone does not make individual kernels nondeterministic. He instead points to batch size as the true culprit: during inference, requests are batched according to the current load, so the same prompt can be processed with different batch sizes from one run to the next.
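To see what atomics-style nondeterminism looks like, here is a small experiment (my own, not from the post) using `index_add_`, which does use atomic adds on CUDA; it needs a CUDA GPU, and some hardware/driver combinations may happen to agree across runs:

```python
import torch

idx = torch.zeros(1_000_000, dtype=torch.long, device="cuda")
vals = torch.randn(1_000_000, device="cuda")

# Every element is atomically added into the same slot; the commit order
# depends on which threads get there first.
out1 = torch.zeros(1, device="cuda").index_add_(0, idx, vals)
out2 = torch.zeros(1, device="cuda").index_add_(0, idx, vals)

print((out1 - out2).abs().item())  # often nonzero from run to run
```

The forward-pass kernels used for inference avoid this by reducing in a fixed order, which is exactly why He looks elsewhere for the source of nondeterminism.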

Each kernel's output can be deterministic for a given batch size yet differ across batch sizes.
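This is easy to check with a plain matmul (a variation on an experiment in the post; the exact difference depends on your GPU, dtype, and library versions, and may be zero on some setups):

```python
import torch

torch.manual_seed(0)
A = torch.randn(2048, 2048, device="cuda")
B = torch.randn(2048, 2048, device="cuda")

# Mathematically identical, but the kernel sees different batch sizes.
row_alone   = torch.mm(A[:1], B)     # first row multiplied on its own
row_in_full = torch.mm(A, B)[:1]     # full batch, then take the first row

print((row_alone - row_in_full).abs().max().item())  # often nonzero
```

Each of the two calls is perfectly repeatable on its own; the numbers only drift when the batch size changes.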

He identifies three main operations whose numerics are affected by batch size: RMSNorm, attention, and matrix multiplication (matmul).

Let us look at the code for a batched RMSNorm.

import torch

# x: [batch_size, hidden_dim]
# weight: [hidden_dim]
def rms_norm(x, weight):
    # Reduce each row over the hidden dimension, then rescale by the learned weight.
    return x * torch.rsqrt(torch.mean(x ** 2, dim=-1, keepdim=True)) * weight

Let us decode what is happening here. At a high level, each row of x is reduced by its own thread block (data parallelism). In practice, though, the kernel takes the GPU load into account: when the batch is small, there are not enough rows to keep all the cores busy, so it may split a single row's reduction across several thread blocks to improve utilization. That split changes the order in which the partial sums are accumulated, so the floating point result depends on the batch size.
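Here is a toy, CPU-side illustration of why that matters (not the real kernel; the chunk sizes are made up): summing a row in one pass versus summing per-chunk partials and then combining them groups the additions differently, so the floating point results can disagree.

```python
import torch

torch.manual_seed(0)
row = torch.randn(4096, dtype=torch.float32)

# One block reduces the whole row in a single pass.
one_pass = row.sum()

# A split reduction: eight "blocks" each reduce a chunk,
# then the partial sums are combined at the end.
split = row.view(8, 512).sum(dim=-1).sum()

print(one_pass.item(), split.item())
print("difference:", (one_pass - split).abs().item())  # often small but nonzero
```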

The fix is to use batch-invariant kernels that always reduce each row on a single thread block, even if it leaves some GPU cores idle.
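A rough sketch of the idea in plain PyTorch (the function name and fixed chunk size are mine, and a real implementation lives at the kernel level, not in Python): commit to one accumulation scheme per row and never adapt it to the batch size.

```python
import torch

CHUNK = 512  # fixed once; never tuned to the batch size

def batch_invariant_row_sum(x: torch.Tensor) -> torch.Tensor:
    """Reduce each row with the same fixed chunking, whatever the batch size."""
    batch, hidden = x.shape
    assert hidden % CHUNK == 0, "illustration assumes hidden is a multiple of CHUNK"
    # The accumulation tree depends only on hidden and CHUNK, so a given row
    # produces the same bits whether it arrives alone or inside a large batch.
    partials = x.view(batch, hidden // CHUNK, CHUNK).sum(dim=-1)
    return partials.sum(dim=-1)

x = torch.randn(32, 4096)
print(torch.equal(batch_invariant_row_sum(x)[:1],
                  batch_invariant_row_sum(x[:1])))  # expected: True
```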

Horace He notes that the performance penalty is usually low when the real batch size is close to the fixed size the kernels assume, but it can be substantial when the real batch size is much smaller.

In short, LLM nondeterminism at temperature 0 comes not from random sampling or floating-point chaos, but from batch-size–dependent parallel reduction strategies. Making kernels batch-invariant restores determinism at the cost of some throughput.

References:


  1. He, Horace, and Thinking Machines Lab. "Defeating Nondeterminism in LLM Inference." Thinking Machines Lab: Connectionism, September 2025.