Error Entropy

Introduction

The paper analyzes why the cross-entropy (CE) loss decreases more slowly as the model is scaled up. It breaks the CE loss into three parts: Error Entropy, Self Alignment, and Confidence.

The authors argue that Error Entropy (EE) is the part that actually scales, and show that the decomposition agrees with experiments.

Definition

Let \(i\) index the input data and \(v_i\) be the correct token for sample \(i\). \(\mathcal{V}\) is the vocabulary. \(s_v\) is the model’s output probability score for token \(v\), and we write \(s_i := s_{v_i}\) for the score of the correct token.

They define a “random variable” \(R\):

\[ R=\text{the rank of the ground truth token in model's next token prediction} \]

Original definition in the paper: \[ \mathrm{RBE}(v_i)=\sum_{v\in \mathcal{V}} \mathbf{1}\braces{s_v>s_{v_i}}. \]

Examples:

Prefix: I would like to eat some

Token	apple	banana
Probability	0.2	0.8

The correct token is banana.

So RBE = 0, since banana is ranked first (no token has a higher score).
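The count in this example can be reproduced with a minimal Python sketch. The function name `rbe` and the dictionary layout are illustrative assumptions, not something taken from the paper.

```python
def rbe(scores, correct):
    """Count the tokens whose score is strictly higher than the correct token's."""
    target = scores[correct]
    return sum(1 for s in scores.values() if s > target)

# Scores from the example prefix "I would like to eat some"
scores = {"apple": 0.2, "banana": 0.8}

print(rbe(scores, "banana"))  # 0: nothing outranks banana
print(rbe(scores, "apple"))   # 1: banana outranks apple
```

Because the comparison is strict, tied tokens do not increase the rank, which matches the indicator \(s_v > s_{v_i}\) in the definition.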

Next, they define a probability distribution function \(p_e\):

\[ p_e = \mathrm{Pr}\bracket{RBE(v_i)=e|v_i\in \mathcal D} \]

Here \(\mathcal D\) is the collection of all ground-truth (GT) output tokens in the corpus.

Remark: \(p_e\) is the distribution of \(RBE\) over the tokens of the corpus, i.e. the fraction of tokens whose ground truth has rank \(e\).

In other words, to define \(p_e\) we look at each next token and ask how its rank in the model's prediction is distributed.
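As a rough illustration, \(p_e\) can be estimated as a normalized histogram of ranks. The list `ranks` below is made-up toy data, and `rank_distribution` is a hypothetical helper name.

```python
from collections import Counter

def rank_distribution(ranks):
    """Empirical p_e: fraction of ground-truth tokens observed at each rank e."""
    counts = Counter(ranks)
    n = len(ranks)
    return {e: counts[e] / n for e in sorted(counts)}

# Toy ranks of the ground-truth token over 8 predictions (made-up data)
ranks = [0, 0, 1, 0, 2, 1, 0, 0]
p = rank_distribution(ranks)
print(p)  # {0: 0.625, 1: 0.25, 2: 0.125}
```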

On the other hand, we can define \(q_e\), which looks at each rank \(e\) and asks how much probability the model assigns to the ground-truth tokens found at that rank.

The paper's definition:

\[ Q_e= \text{GeoMean}(\braces{s_{v_i}|RBE(v_i)=e}), q_e=\frac{Q_e}{C}, C=\sum_{e} Q_e. \]
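A small sketch of this definition, assuming each sample is given as a pair (rank, probability of the correct token). The data and the helper name `geo_mean_by_rank` are illustrative only.

```python
import math
from collections import defaultdict

def geo_mean_by_rank(samples):
    """Q_e: geometric mean of the correct-token probabilities at each rank e."""
    logs = defaultdict(list)
    for e, s in samples:
        logs[e].append(math.log(s))
    # exp(mean of logs) is the geometric mean
    return {e: math.exp(sum(v) / len(v)) for e, v in sorted(logs.items())}

# (rank, probability of the correct token); made-up values
samples = [(0, 0.8), (0, 0.5), (1, 0.2), (1, 0.05), (2, 0.1)]

Q = geo_mean_by_rank(samples)
C = sum(Q.values())                          # normalizer C = sum_e Q_e
q = {e: Q_e / C for e, Q_e in Q.items()}     # q_e = Q_e / C
```

Working in log space avoids underflow when many small probabilities are multiplied together.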

Explanation and Decomposition

Formally, let \(R\) be a random variable taking values in \(\{0,1,\dots,|\mathcal{V}|-1\}\). It is determined by the context \((x_{<t},x_t)\), drawn from the true conditional distribution \(q(x_t | x_{<t})\) (the ideal model, also written \(q^*\)), and by the model distribution \(p_{\theta}(x_t|x_{<t})\).

\[ R(x_t \mid x_{<t}) := \sum_{v \in \mathcal{V}} \mathbf{1}\{p_\theta(v \mid x_{<t}) > p_\theta(x_t \mid x_{<t})\} \]

\[ p_e := \Pr_{(x_{<t}, x_t) \sim q^*}[R = e] \]

\[ \log Q_e = \mathbb E_{(x_{<t},x_t)\sim q^*, R(x_t| x_{<t})=e} [\log p_{\theta}(x_t|x_{<t})] \]

Consider the original CE loss \[ CE=-\mathbb E_{(x_t,x_{<t})\sim q}[\log p_{\theta}(x_t|x_{<t})] \]

Empirically, over \(N\) samples, let \(n_e\) be the number of samples with rank \(R_i=e\) (so \(p_e = n_e/N\)) and \(s_i\) the probability of the correct token of sample \(i\): \[ \begin{aligned} CE&= -\frac{1}{N} \sum_{i} \log s_i\\ &= -\frac{1}{N} \sum_{i} \log p_{\theta}(x_t^i|x_{<t}^i) \\ &= -\frac 1N\sum_{e} \sum_{i:R_i=e} \log p_{\theta}(x_t^i|x_{<t}^i) \\ &= -\sum_{e} \frac{n_e}{N} \frac{1}{n_e} \sum_{i:R_i=e} \log p_{\theta}(x_t^i|x_{<t}^i) \\ &= -\sum_{e} \frac{n_e}{N} \log \paren{\prod_{i:R_i=e} p_{\theta}(x^{i}_t|x^i_{<t})}^{1/n_e} \\ &= -\sum_{e} p_e \log Q_e \end{aligned} \]

Continuing the decomposition, using \(Q_e = C q_e\), i.e. \(\log Q_e = \log q_e + \log C\): \[ \begin{aligned} CE&=-\sum_{e}p_e \log Q_e\\ &=-\sum_{e} p_e \log q_e - \log C\\ &=\underbrace{-\sum_{e} p_e \log p_e}_{\text{Error Entropy}} +\underbrace{\mathrm{KL}(p_e\| q_e)}_{\text{Self Alignment}} \underbrace{- \log C}_{\text{Confidence}} \end{aligned} \]
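The identity can be checked numerically. The sketch below assumes toy (rank, probability) pairs and takes the Confidence term as \(-\log C\), which is the sign implied by \(Q_e = C q_e\); the function name `decompose` is hypothetical.

```python
import math
from collections import defaultdict

def decompose(samples):
    """Return (CE, error entropy, self alignment KL, confidence = -log C)."""
    n = len(samples)
    groups = defaultdict(list)
    for e, s in samples:
        groups[e].append(math.log(s))

    p = {e: len(v) / n for e, v in groups.items()}             # p_e = n_e / N
    log_Q = {e: sum(v) / len(v) for e, v in groups.items()}    # log of geometric mean
    C = sum(math.exp(lq) for lq in log_Q.values())
    q = {e: math.exp(log_Q[e]) / C for e in groups}            # q_e = Q_e / C

    ce = -sum(math.log(s) for _, s in samples) / n
    entropy = -sum(p[e] * math.log(p[e]) for e in p)
    kl = sum(p[e] * math.log(p[e] / q[e]) for e in p)
    confidence = -math.log(C)
    return ce, entropy, kl, confidence

# (rank, probability of the correct token); made-up values
samples = [(0, 0.8), (0, 0.5), (0, 0.9), (1, 0.2), (1, 0.05), (2, 0.1)]
ce, entropy, kl, confidence = decompose(samples)

# The three terms add up to the cross-entropy (up to floating-point error)
print(abs(ce - (entropy + kl + confidence)) < 1e-12)  # True
```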

Comparison with Energy-based Model

Given a distribution \(p(x)\) of Boltzmann form \[ p(x) \propto \exp(-\beta E(x)) \]

Remark: here \(-E(x)\) plays the role of the logit, and \(\beta\) that of the inverse temperature.

\[ Z= \int \exp(-\beta E(x)) dx \]

\[ p(x) = \frac{\exp(-\beta E(x))}{Z} \]

\[ -\log p(x) = \beta E(x) + \log Z \]

\[ CE(q_{data},p)=\beta E_{q}[E(x)] + \log Z \]
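The last identity can be checked on a finite state space, where the integral for \(Z\) becomes a sum. The energies, the data distribution, and the value of \(\beta\) below are arbitrary assumptions chosen for illustration.

```python
import math

def boltzmann(energies, beta):
    """Return (probabilities, log Z) for p(x) = exp(-beta * E(x)) / Z on a finite set."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], math.log(z)

energies = [0.0, 1.0, 2.5]   # hypothetical energies E(x) of three states
q_data = [0.7, 0.2, 0.1]     # hypothetical data distribution
beta = 2.0

p, log_z = boltzmann(energies, beta)

# Cross-entropy computed directly from the probabilities ...
ce_direct = -sum(q * math.log(px) for q, px in zip(q_data, p))
# ... and from the energy form: beta * E_q[E(x)] + log Z
ce_energy = beta * sum(q * e for q, e in zip(q_data, energies)) + log_z

print(abs(ce_direct - ce_energy) < 1e-12)  # True
```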

Now we further coarse-grain by the rank \(R\):

Microstate \(x = (x_{<t}, x_t)\sim q\) \(\mapsto\) Macrostate \(e = R_\theta(x_t \mid x_{<t}) \sim p_e\)

We see that \(p_e\) is the empirical macrostate distribution.

Now, conditioning on \(R=e\), we find that the energy of the macrostate is \(-\log Q_e\), the average negative log-probability of the samples at that rank.

We further assume that the macrostate itself is generated by a Boltzmann distribution.

\[ q_e \propto \exp(-\beta_2 E_e^{\text{macro}}) = \exp(-\beta_2 [\beta_1 \bar{E}_e + \bar{A}_e]) \] where \(\bar{A}_e := \frac{1}{n_e}\sum_{i: R_i = e} \log Z_{\text{soft},i}\), \(\bar{E}_e := \frac{1}{n_e}\sum_{i: R_i = e} E(x_t^{(i)}, x_{<t}^{(i)})\).

\(Z_{\text{soft},i}\) is the softmax denominator in the \(i\)-th token generation.

This gives:

\(q_e\): the Boltzmann probability implied by the macro energy.
\(C\): the partition function of the macrostate’s Boltzmann distribution.
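As a sanity check on the macro energy, the sketch below verifies that \(\beta_1 \bar{E}_e + \bar{A}_e\) equals the average negative log-probability of the correct tokens at a given rank, so that \(\beta_2 = 1\) gives \(q_e \propto Q_e\). The per-token energies (negated logits) are made-up values, and the helper names are hypothetical.

```python
import math

def macro_energy(samples, beta1):
    """beta1 * mean energy of the correct token + mean log softmax denominator."""
    n = len(samples)
    e_bar = sum(energies[t] for energies, t in samples) / n
    a_bar = sum(
        math.log(sum(math.exp(-beta1 * e) for e in energies))
        for energies, _ in samples
    ) / n
    return beta1 * e_bar + a_bar

def log_geo_mean_prob(samples, beta1):
    """log Q_e: mean log softmax probability of the correct token."""
    total = 0.0
    for energies, t in samples:
        z = sum(math.exp(-beta1 * e) for e in energies)
        total += math.log(math.exp(-beta1 * energies[t]) / z)
    return total / len(samples)

# Each sample: (energies over a 3-token vocabulary, index of the correct token).
# In both samples exactly one token has lower energy than the correct one, so R = 1.
samples = [([-2.0, -1.0, 0.0], 1), ([-0.5, -3.0, -1.0], 2)]

gap = macro_energy(samples, 1.0) + log_geo_mean_prob(samples, 1.0)
print(abs(gap) < 1e-12)  # True: macro energy equals -log Q_e
```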

\[ CE=\mathbb E_q[\beta_1 E(x) + \log Z_{\text{soft}}] = \sum_{e} p_e \paren{\beta_1 \bar{E}_e + \bar{A}_e} = -\mathbb E_{p_e}[\log Q_e] \]

\(\beta_1\) is the inverse temperature of the microstate (softmax) distribution.

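Decomposing the CE, we find a mismatch between the empirical macrostate distribution \(p_e\) and the macro Boltzmann distribution \(q_e\).

\[ CE=\beta S + \underbrace{\mathrm{KL}(p_e\| q_e)}_{\text{How far we deviate from Grand Canonical Balance}} - \log C \]

Here \(S=-\sum_{e} p_e \log p_e\) is the entropy of the macrostate distribution, i.e. the Error Entropy. With \(q_e = Q_e/C\) as defined earlier, the prefactor is \(\beta = 1\).

If \(\mathrm{KL}=0\), the macrostates are at grand canonical equilibrium.

But the model cannot directly reallocate probability mass between rank states, since for each sample it only sees the prefix.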

https://adscn.dev/2026/04/20/Error-Entropy/

Author

Luocheng Liang

Posted on

April 20, 2026
