Error Entropy

Introduction

https://arxiv.org/pdf/2510.04067

The paper analyzes the phenomenon that the cross-entropy (CE) loss scales more slowly as models are scaled larger. It breaks the CE loss into three parts: Error-Entropy, Self-Alignment, and Confidence.

The authors argue that Error-Entropy (EE) is the part that actually scales, and they show that the decomposition agrees with their experiments.

Definition

Let $i$ be the index of an input sample and $y_i$ be the correct token for that sample. $V$ is the vocabulary, and $p_i(v)$ is the model's output probability score for token $v \in V$.

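They define a "random variable", the rank-based error (RBE), that counts how many tokens the model scores above the correct token: $\mathrm{RBE}_i = |\{v \in V : p_i(v) > p_i(y_i)\}|$.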

Original Def:

Examples:

Prefix: I would like to eat some

Token        apple   banana
Probability  0.2     0.8

The correct token is banana.

So the RBE = 0 since banana is ranked first.
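The example above can be reproduced with a minimal sketch. It assumes the RBE counts the tokens scored strictly above the correct token (how ties are handled is an assumption here), and `rank_based_error` is a hypothetical helper name, not something taken from the paper.

```python
def rank_based_error(probs, correct):
    """Count the tokens scored strictly above the correct token (0 means ranked first)."""
    p_correct = probs[correct]
    return sum(1 for p in probs.values() if p > p_correct)


# Prefix: "I would like to eat some"
probs = {"apple": 0.2, "banana": 0.8}

rbe_banana = rank_based_error(probs, "banana")  # correct token is ranked first
rbe_apple = rank_based_error(probs, "apple")    # banana is scored above apple
print(rbe_banana, rbe_apple)
```

If the correct token were apple instead, one token (banana) would be ranked above it, so the RBE would be 1.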

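Next, they define a probability distribution function over ranks, $P(r)$:

$$P(r) = \frac{1}{|D|} \sum_{i \in D} \mathbb{1}[\mathrm{RBE}_i = r]$$

Here $D$ is the collection of all ground-truth (GT) output tokens in the corpus, and $\mathrm{RBE}_i$ is the rank-based error of the $i$-th token.

Remark: this is the probability form of averaging over the tokens.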

This definition focuses on the next token and asks for the distribution of its rank in the model's prediction.
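As a hedged illustration, the rank distribution can be estimated by counting how often each RBE value occurs over the corpus. The RBE values below are made up, the helper names are hypothetical, and reading Error-Entropy as the Shannon entropy of this distribution is an assumption based on the name rather than something stated in this note.

```python
import math
from collections import Counter


def rank_distribution(rbes):
    """Empirical distribution of RBE values: the fraction of tokens at each rank."""
    counts = Counter(rbes)
    total = len(rbes)
    return {r: c / total for r, c in sorted(counts.items())}


def entropy(dist):
    """Shannon entropy in nats of a discrete distribution given as {value: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


rbes = [0, 0, 1, 0, 2, 0, 1, 0]  # hypothetical RBE of each ground-truth token
dist = rank_distribution(rbes)
h = entropy(dist)
print(dist, h)
```

A model that always ranks the correct token first would put all mass on rank 0 and the entropy would be 0.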

On the other hand, we can define an alternative distribution that fixes a vocabulary token and observes its ranking.

The paper's definition:

Explanation and Decomposition

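Formally, let $y$ be a random variable determined by the context $x$ through a conditional distribution $p^*(y \mid x)$ (the ideal model, i.e. the real distribution), and let $p(y \mid x)$ be the model distribution.

Consider the original CE loss

$$\mathrm{CE} = -\mathbb{E}_{x}\,\mathbb{E}_{y \sim p^*(\cdot \mid x)}\big[\log p(y \mid x)\big]$$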

Continuing with the decomposition:
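The intermediate steps are not reproduced in this note. As a hedged sketch only, grouping tokens by rank gives an exact three-term identity whose pieces line up with the names Error-Entropy, Self-Alignment and Confidence. The symbols $D_r$, $\bar{E}_r$, $Q(r)$ and $\mathcal{Z}$ are assumptions introduced here for illustration, and the paper's own definitions may differ (for instance, by defining the energy from logits rather than log-probabilities). Let $D_r = \{i \in D : \mathrm{RBE}_i = r\}$ and $P(r) = |D_r| / |D|$, with all sums taken over ranks that actually occur:

$$
\begin{aligned}
\mathrm{CE} &= \frac{1}{|D|}\sum_{i \in D} -\log p_i(y_i) = \sum_{r} P(r)\,\bar{E}_r,
\qquad \bar{E}_r = \frac{1}{|D_r|}\sum_{i \in D_r} -\log p_i(y_i), \\
Q(r) &= \frac{e^{-\bar{E}_r}}{\mathcal{Z}}, \qquad \mathcal{Z} = \sum_{r} e^{-\bar{E}_r}
\quad\Longrightarrow\quad \bar{E}_r = -\log Q(r) - \log \mathcal{Z}, \\
\mathrm{CE} &= -\sum_{r} P(r)\log Q(r) - \log \mathcal{Z}
= H(P) + \mathrm{KL}(P \,\|\, Q) - \log \mathcal{Z}.
\end{aligned}
$$

Under this reading, $H(P)$ would play the role of Error-Entropy, $\mathrm{KL}(P \,\|\, Q)$ that of Self-Alignment, and $-\log \mathcal{Z}$ that of Confidence. That mapping is an assumption based on the three names.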

Comparison with Energy-based Model

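Given the softmax distribution written in Boltzmann form

$$p_i(v) = \frac{e^{-E_i(v)}}{Z_i}, \qquad Z_i = \sum_{v' \in V} e^{-E_i(v')}$$

Remark: $-E_i(v)$ is the logit here.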
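The softmax-as-Boltzmann reading can be checked with a small sketch: treating the negated logit as an energy at temperature 1 reproduces the softmax probabilities. The function names, the example logits and the `temperature` parameter are assumptions for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = np.asarray(logits, dtype=float)
    shifted = shifted - shifted.max()
    weights = np.exp(shifted)
    return weights / weights.sum()


def boltzmann(energies, temperature=1.0):
    """Boltzmann probabilities exp(-E / T) / Z, computed stably."""
    scaled = -np.asarray(energies, dtype=float) / temperature
    scaled = scaled - scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()


logits = np.array([2.0, 0.5, -1.0])   # hypothetical logits for three tokens
p_softmax = softmax(logits)
p_boltzmann = boltzmann(-logits)      # energy = -logit, temperature = 1
print(p_softmax, p_boltzmann)
```

The partition function of this Boltzmann distribution is the softmax denominator for that token position.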

Now we coarse-grain further by rank $r$:

Microstate Macrostate

We see that the rank distribution defined earlier is the empirical macrostate distribution.

Now, conditioning on the rank $r$, we find the energy of the macrostate as

We further assume that the macrostate itself is generated by a Boltzmann distribution.

where , .

$Z_i$ is the softmax denominator in the $i$-th token generation.

Gives:

  1. The Boltzmann probability indicated by the macro energy.

  2. The partition function of the macrostate's Boltzmann distribution.

relates to the temperature indicated by microstate.

Decomposing the CE now reveals a mismatch between the empirical macrostate distribution and the macro Boltzmann distribution.
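To make the mismatch concrete, the following sketch checks numerically that the mean CE equals the entropy of the empirical rank distribution, plus a KL term measuring its mismatch with a rank-level Boltzmann distribution, plus a negative log partition function. It is an illustration under stated assumptions: the macro energy is taken to be the mean per-token loss at each rank, the logits and targets are random, and `decompose_ce` is a hypothetical helper rather than the paper's code.

```python
import numpy as np


def decompose_ce(probs, targets):
    """Split mean CE into entropy, KL and -log Z terms by grouping tokens by rank."""
    n = len(targets)
    p_true = probs[np.arange(n), targets]          # probability given to each correct token
    ranks = (probs > p_true[:, None]).sum(axis=1)  # tokens scored above the correct one
    loss = -np.log(p_true)                         # per-token cross-entropy

    observed = np.unique(ranks)                    # only ranks that actually occur
    P = np.array([(ranks == r).mean() for r in observed])      # empirical rank distribution
    E = np.array([loss[ranks == r].mean() for r in observed])  # assumed macro energy per rank

    Z = np.exp(-E).sum()                           # macro partition function
    Q = np.exp(-E) / Z                             # macro Boltzmann distribution

    entropy = -(P * np.log(P)).sum()
    kl = (P * np.log(P / Q)).sum()                 # mismatch between P and Q
    neg_log_z = -np.log(Z)
    return loss.mean(), entropy, kl, neg_log_z


rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 50))               # hypothetical model outputs
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
targets = rng.integers(0, 50, size=1000)           # hypothetical correct tokens

ce, entropy, kl, neg_log_z = decompose_ce(probs, targets)
print(ce, entropy + kl + neg_log_z)
```

The identity holds exactly for any inputs, so the two printed numbers agree up to floating-point error. The KL term vanishes only when the empirical rank distribution coincides with the Boltzmann one.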

If the system were at grand canonical equilibrium, the two distributions would match.

But the model cannot directly allocate mass across rank states, because it only sees the prefix of each sample.


https://notdesigned.github.io/2026/04/20/Error-Entropy/
Author: Luocheng Liang
Posted on: April 20, 2026