The paper analyzes a phenomenon in which the cross-entropy (CE) loss
scales more slowly as the model is scaled up. The authors break the CE
loss into three parts: Error Entropy (EE), Self Alignment, and Confidence.
They argue that EE is the part that actually scales, and show that the
decomposition agrees with experiments.
Definition
Let be the index of input
data, be the correct token for
this data. is the
vocabulary. is the model’s
output probability score for
They define a “random variable” that
Original Def:
Examples:
Prefix: I would like to eat some

Token     Probability
apple     0.2
banana    0.8

The correct token is banana.
So the RBE = 0, since banana is ranked first.
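The apple/banana example can be checked with a minimal sketch. It assumes that RBE is the zero-based rank of the correct token when tokens are sorted by descending probability, which is consistent with the example but is an assumption here. The function name `rank_of_correct_token` is hypothetical.

```python
def rank_of_correct_token(probs, correct):
    """Zero-based rank of `correct` under descending probability (assumed RBE)."""
    p_correct = probs[correct]
    # Count the tokens that the model scores strictly higher than the correct one.
    return sum(1 for p in probs.values() if p > p_correct)


# Model output for the prefix "I would like to eat some".
probs = {"apple": 0.2, "banana": 0.8}

rbe_banana = rank_of_correct_token(probs, "banana")  # banana is ranked first
rbe_apple = rank_of_correct_token(probs, "apple")    # apple is ranked second
print(rbe_banana, rbe_apple)
```

If the correct token were apple instead, the same output distribution would give a nonzero value, since banana outranks it.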
Next, they define a probability distribution function :
here is the
collection of all output GT tokens in the corpus.
Remark: is the probability
form of averaging over the
tokens.
This definition focuses on the next token and seeks the distribution
of its rank (in the prediction).
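A minimal sketch of this rank distribution, assuming it is the empirical frequency of each rank over all ground-truth tokens in the corpus (the "probability form of averaging" in the remark). The input list `ranks` is made-up illustrative data, and `rank_distribution` is a hypothetical helper name.

```python
from collections import Counter


def rank_distribution(ranks):
    """Empirical distribution of ranks: fraction of tokens observed at each rank."""
    counts = Counter(ranks)
    n = len(ranks)
    return {r: c / n for r, c in sorted(counts.items())}


# Hypothetical ranks of the correct token at five positions in a corpus.
ranks = [0, 0, 1, 0, 2]
dist = rank_distribution(ranks)
print(dist)
```

Averaging any function of the rank over tokens is then the same as taking its expectation under `dist`.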
On the other hand, we can define , which focuses on a vocabulary entry and
observes its ranking.
The definition of paper:
Explanation and Decomposition
Formally, let be a random variable determined by the context by a conditional
distribution
(ideal model, real distribution) and model distribution .
Consider the original CE loss
Continue for decomposition
Comparison with Energy-based Model
Given distribution
Remark: being logit
here.
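The energy-based reading can be sketched as follows, assuming the distribution is the usual softmax over logits, with the energy of each token taken as the negative logit and the temperature set to 1. Both choices are assumptions made for illustration, and the logit values are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(energies, temperature=1.0):
    """Boltzmann distribution exp(-E/T)/Z; returns (probabilities, Z)."""
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], z


logits = [2.0, 0.5, -1.0]              # illustrative values only
p_softmax = softmax(logits)
# Assumed convention: energy = -logit, temperature = 1.
p_boltz, partition = boltzmann([-z for z in logits])
print(p_softmax, p_boltz, partition)
```

Under this convention the partition function is exactly the softmax denominator, the sum of the exponentiated logits.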
Now we further coarse-grain over the ranking :
Microstate Macrostate
We see is the empirical
macrostate distribution.
Now, conditioning on , we find
the energy of the macrostate as
We further assume that the macrostate itself is generated by a Boltzmann
distribution.
where , .
is the softmax
denominator in the -th token
generation.
Gives:
, Boltzmann probability
indicated by the macro energy.
, partition function of
the macrostate’s Boltzmann distribution.
relates to the
temperature indicated by microstate.
Now we decompose the CE and find that there is a mismatch between the
empirical macrostate distribution and the macro Boltzmann
distribution.
If grand
canonical equilibrium
But the model cannot directly allocate probability between rank states (it
only sees the prefix within a sample).