MIROSTAT: A NEURAL TEXT DECODING ALGORITHM THAT DIRECTLY CONTROLS PERPLEXITY

Sourya Basu
Jul 30, 2020

This is a brief overview of our work at ICLR 2021. Code is available here. The code is easy to run on Google Colab using the example provided in the repository. Try it out yourself for fun! :)

Tl;dr: We provide a new text decoding algorithm that directly controls the statistics of the generated text and hence produces more human-like text.

Surprise in (text) data is something like sugar in your tea: you need just the right amount for the perfect taste. Too little sugar makes your tea bland or uninteresting, whereas too much ruins its natural flavor. Similarly, too little surprise makes your text dull, boring, and repetitive, whereas too much surprise makes it confusing and may even obscure the actual message. Here I will provide a brief overview of our work, where we propose a text decoding algorithm, mirostat, which generates text with a predetermined amount of surprise.

In this paper, we work with an information-theoretic notion of surprise (defined in the paper; it is closely related to cross-entropy and perplexity). We analyze the surprise content in texts generated using popular decoding algorithms, relate it to repetitions and incoherence in the generated texts, and propose a text decoding algorithm that provides control over the surprise content of the text, allowing the user to generate texts with a desirable amount of surprise in them.
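To make these quantities concrete, here is a tiny sketch of how they relate. This is only an illustration in log base 2; the paper's exact definitions and log base may differ:

```python
import numpy as np

def surprise(p_token: float) -> float:
    """Surprise of a single token: the negative log-probability the model assigned to it (in bits)."""
    return -np.log2(p_token)

def cross_entropy(token_probs) -> float:
    """Average surprise per token of a generated sequence; perplexity is 2 ** cross_entropy."""
    return float(np.mean([surprise(p) for p in token_probs]))

# Example: tokens assigned probabilities 0.5, 0.25, 0.125 have surprises of 1, 2, and 3 bits,
# so the cross-entropy is 2 bits and the perplexity is 2 ** 2 = 4.
```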

Why control perplexity?

A short answer, apart from the fact that it helps control repetitions and incoherence, is that the log of perplexity correlates well with human quality judgments, as was found in this work by Zhang et al. Let's take a look at a plot from that paper.

Fig.1: Human judgment vs. log p(x) (Zhang et al.).

The plot shows that human judgment is maximized for a certain range of log p(x). Note that this plot considered sentences of 30 tokens, so the magnitude of log p(x) is 30 times the cross-entropy (the average surprise per token). Hence, a certain range of cross-entropy maximizes human judgment. Combined with the fact that existing text decoding algorithms do not provide good control over the statistics of their output, this suggests that mirostat can provide high-quality text when set to an appropriate target surprise value.

We shall later verify this claim with human experiments on texts generated using Mirostat and top-p (nucleus) sampling.

The relation between surprise and quality of the generated text

Without getting lost in the technical details, let's get an intuition about the relation between surprise in generated texts and some of their attributes, such as repetitions and incoherence. In particular, let's first take a look at the boredom and confusion traps. We fed a GPT-2 model (117M parameters) with some context, used various decoding algorithms to generate the texts below, and analyzed the surprise content of each generated word (token) conditioned on the previously generated text and the context; this is plotted next to each generated text.
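As an illustration of what is being plotted, here is one way such per-token surprise values could be computed with the Hugging Face transformers library. This is a sketch, not the code used for the paper; the "gpt2" checkpoint and the base-2 convention are assumptions on my part:

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # smallest GPT-2 checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def per_token_surprise(context: str, continuation: str):
    """Surprise (in bits) of each continuation token, conditioned on the context
    and on all previously generated tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    gen_ids = tokenizer(continuation, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, gen_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    surprises = []
    for pos in range(ctx_ids.shape[1], ids.shape[1]):
        # The distribution for the token at `pos` comes from the logits at position pos - 1.
        logp = log_probs[0, pos - 1, ids[0, pos]]
        surprises.append(-logp.item() / math.log(2))
    return surprises
```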

Context = “For two months early in 1943, Shannon came into contact with the leading British mathematician Alan Turing. Turing had been posted to Washington to share with the U.S. Navy’s cryptanalytic service the methods used by the British Government Code and Cypher School at Bletchley Park to break the ciphers used by the Kriegsmarine U-boats in the north Atlantic Ocean. He was also interested in the encipherment of speech and to this end spent time at Bell Labs.”

Fig.2: Surprise values vs. token indices with top-p sampling for p = 0.4.

Generated text = “Turing’s work on the cryptography of the Kriegsmarine was based on the cipher used by the Kriegsmarine’s famous E-2. The Kriegsmarine code was based on the Cryptographic Algorithm (CAT) of the Royal Navy. It was a new method of solving the problem of encryption, which was a mathematical problem which was the basis of the cryptanalysis of the Japanese naval cipher.

Turing’s work on the cipher was based on the algorithm used by the Kriegsmarine’s famous E-2. The cipher was based on the algorithm used by the Kriegsmarine’s famous E-2.

Turing’s work on the cipher was based on the algorithm used by the Kriegsmarine’s famous E-2.

Turing’s work on the cipher was based on the algorithm used by the Kriegsmarine’s famous E-2.

Turing’s work on the cipher was based on”

The text above was generated using top-p (nucleus) sampling with p = 0.4. We can clearly see from the plot that the surprise content of the generated text drops suddenly around token index 100, which is roughly where the excessive repetitions start. Based on numerous such experiments, we hypothesize that repetitions are correlated with small surprise values. We call this the boredom trap. Next, let's take a look at some generated text with excessive surprise content; in particular, we use top-p sampling with p = 1.0.
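For reference, this is roughly how top-p (nucleus) sampling chooses each token: keep only the smallest set of most-probable tokens whose total probability reaches p, renormalize, and sample from that set. The sketch below is generic and not tied to any particular implementation:

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float, rng=np.random.default_rng()) -> int:
    """Sample a token id from the nucleus: the smallest set of most-probable tokens
    whose cumulative probability is at least p."""
    order = np.argsort(probs)[::-1]                      # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, len(probs))   # nucleus size
    nucleus = probs[order[:cutoff]] / probs[order[:cutoff]].sum()       # renormalize
    return int(order[rng.choice(cutoff, p=nucleus)])
```

Small p (like 0.4 above) keeps only a few high-probability tokens, which pushes the text toward the boredom trap; p = 1.0 keeps the full distribution, as in the next example.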

Fig.3: Surprise values vs. token indices with top-p sampling for p = 1.0.

Generated text = “In January of 1942, in an open letter delivered to 8 persons and addressed to Sherwood, Shannon took the opportunity to address skeptics about the little out-of-print $128,000 unreported race as “cryptanalysis apparatus of extraterrestrial intelligence.” His reply was also reprinted in 6 South Carolina publications (Atlanta Press, Fall Publishing, Miami, New House Press, Yorkshire Press, Sydney Press, Weatherby Press, Winter Press, and Thames & Hudson, to name a few), and eventually accepted as the final subscribed FHAT article. (Summary, were whom)

Birmingham City Police concludes that [Turing’s][new replace Robert] Symington is not guilty of breach of OKF law. Further, his suggestions for preventing his sentencing Should he reside anywhere near the city he defies extradition to Bavaria closer to dawn minimal? Lylimud Simpson (Ladies & Gentlemen, Lawrence Livermore University Press, Philadelphia): Jim Gilmore and its wife, Eva Civilsky,”

On reading the above text, it is not clear what its message is; in a sense, the text is confusing. We call this the confusion trap: several decoding algorithms show a tendency for the surprise content to increase as the text gets longer, producing confusing and incoherent text. Moreover, we show that existing decoding algorithms do not have good control over the statistics of the generated text. In this paper, we introduce mirostat sampling, which provides control over the average surprise content of the generated text across varying text lengths. (The word mirostat is derived from mirum, Latin for surprise, and stat, meaning control.)

Fig.4: Surprise values vs. token indices with mirostat sampling.

Generated text = “Turing learned of two cryptanalysts, the British mathematician and cryptographer, who had worked on the Kriegsmarine exploits, whom he called “the greatest cryptologists of our time”. They were the man who discovered the cipher code for Kriegsmarine’s Enigma code, the cipher in which the British government codes the code for the U-boat codes for the U.S. Navy’s aircraft carriers. Graham and Shannon met in the early 1950’s and worked on the Kriegsmarine exploit for the 18th century. They worked for the U.S., but also for the British. Turing testified that the Kriegsmarine report for Enigma’s code, written by the cryptanalytical team, was “a very good piece of evidence” that the Kriegsmarine took advantage of the Enigma code or that of its cipher. In the early 1950’s, Graham described how the cipher code was used by the crypt”

The above text not only looks plausible and related to the context but also avoids repetition and confusion. It was generated using mirostat with a target surprise of 3.0, which also happens to be the observed surprise value.

So far we have seen that both small and large surprise values are associated with poor-quality text generation. Based on a large-scale human evaluation of generated texts, a recent work found that human evaluation scores are maximized for a certain range of surprise values. Existing decoding algorithms do not provide good control over the surprise in the generated text; hence, mirostat plays a crucial role in generating high-quality texts of varying lengths.

How to control surprise in the generated text?

We will take an intuitive look at how mirostat works and leave the technical details to the paper. Mirostat is a simple feedback-based control algorithm: it observes the average surprise of the text generated so far (technically, the log of perplexity), then uses this feedback to perform top-k decoding with an appropriate value of k, computed using approximations derived in the paper under the assumption that words follow Zipfian statistics.

Step 1: Estimation

The first step of this algorithm is to assume that word probabilities follow a Zipfian distribution and to estimate its exponent s from the model's predicted probabilities.

Step 2: Approximation

The next step is to approximately compute k based on the estimated s and the target surprise value.

Step 3: Feedback

Update the surprise value used in Step 2 based on the difference between the observed surprise and the desired target surprise.

More technically, these three steps combine into a simple feedback loop.
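Below is a minimal Python sketch of such a loop, written from the description above. It is not the official implementation; the top-m cutoff for the Zipf estimate, the learning rate eta, and the exact form of the k-approximation are simplifications on my part:

```python
import numpy as np

def estimate_zipf_exponent(probs: np.ndarray, m: int = 100) -> float:
    """Step 1: least-squares estimate of the Zipf exponent s from the top-m probabilities."""
    p = np.sort(probs)[::-1]
    p = p[p > 0][:m]
    i = np.arange(1, len(p))
    t = np.log((i + 1) / i)          # log of consecutive rank ratios
    b = np.log(p[:-1] / p[1:])       # log of consecutive probability ratios
    return float(np.sum(t * b) / np.sum(t * t))

def mirostat_step(probs, mu, tau, eta=0.1, rng=np.random.default_rng()):
    """One decoding step. probs: next-token distribution from the model,
    mu: current surprise threshold, tau: target surprise (bits), eta: learning rate."""
    probs = np.asarray(probs, dtype=float)
    n_vocab = len(probs)
    s_hat = estimate_zipf_exponent(probs)
    # Step 2: approximate k from s_hat and mu (sketch of the paper's approximation;
    # assumes s_hat > 1, which holds for typical language-model distributions).
    eps = s_hat - 1
    k = int(((eps * 2 ** mu) / (1 - n_vocab ** (-eps))) ** (1 / s_hat))
    k = max(1, min(k, n_vocab))
    # Top-k sampling over the k most probable tokens.
    top = np.argsort(probs)[::-1][:k]
    token = top[rng.choice(k, p=probs[top] / probs[top].sum())]
    # Step 3: feedback -- compare the observed surprise to the target and update mu.
    observed = -np.log2(probs[token])
    mu = mu - eta * (observed - tau)
    return int(token), mu
```

A typical way to use this sketch would be to initialize mu = 2 * tau and call mirostat_step once per token on the model's softmax output, feeding the sampled token back into the model.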

Results: controlled cross-entropy

Here we look at some results showing that top-k and top-p decoding do not provide good control over cross-entropy (the log of perplexity), while mirostat provides very good control. In the paper, we also show theoretically that cross-entropy is nonlinear in k for top-k sampling and near-linear in p for top-p sampling, which can be observed from the plots.

Fig.5: Cross-entropy rate in top-k sampling.
Fig.6: Cross-entropy rate in top-p sampling.
Fig.7: Cross-entropy rate in mirostat sampling.

Results: repetitions and cross-entropy

We find that word-level repetitions decrease near-linearly with increasing cross-entropy rate. Moreover, this curve is essentially independent of the sampling method used, as illustrated below. Hence, having control over cross-entropy gives indirect control over repetitions.

Fig.8: Percentage repetition vs. observed cross-entropy rate for different sampling methods.
Fig.9: Percentage repetition vs. observed cross-entropy rate for different temperature values, T.

We also find that larger models produce fewer repetitions at the same cross-entropy rate, as shown below.

Fig.10: Percentage repetition vs. observed cross-entropy rate for different language models.

Also, note that human-generated text often contains common pronouns and conjunctions that are essential and frequently repeated, so we do not expect a good sampling algorithm to have absolutely zero word-level repetition. But we do expect a good sampling algorithm to have minimal sentence-level repetition, which all the sampling algorithms seem to show beyond a cross-entropy threshold of around 2.5 for the GPT-2 language model with 117M parameters, as shown below.

Fig.11: Percentage repetition vs. observed cross-entropy for n-gram tokens and different sampling methods.
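For concreteness, the kind of repetition statistic plotted here can be computed roughly as follows. This is an illustrative metric; the paper's exact definition of percentage n-gram repetition may differ:

```python
def percent_ngram_repetition(tokens, n=1):
    """Percentage of n-grams that already appeared earlier in the token sequence.
    n = 1 roughly captures word-level repetition; larger n approaches sentence-level repetition."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    seen, repeated = set(), 0
    for g in ngrams:
        if g in seen:
            repeated += 1
        seen.add(g)
    return 100.0 * repeated / max(len(ngrams), 1)
```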

Results: more on boredom and confusion traps

We observe that small values of the input parameters in top-k or top-p sampling fall into the boredom trap: their cross-entropy decreases with increasing length, which implies an increasing percentage of repetitions in the text.

Fig.12: Boredom trap for small values of k and p.

On the other hand, for large values of the input parameters in top-k or top-p sampling, we find that the cross-entropy of the generated text keeps increasing with the length of the text. High cross-entropy is associated with incoherence in the generated text.

Fig.13: Confusion trap for large values of k and p.

However, human-generated texts show more controlled cross-entropy, which top-k and top-p sampling also exhibit for moderate values of their input parameters, as shown below. But it is unclear beforehand what value of k or p produces such controlled text.

Fig.14: Human-like cross-entropy rate for moderate k and p.

Mirostat shows good control over cross-entropy and hence falls into neither the boredom trap nor the confusion trap for a wide range of target surprise values.

Fig.15: Controlled text generation over varying text lengths in mirostat sampling.

Human Experiments

We generated 300 tokens (words) using GPT-2 from a fixed context with average cross-entropy rate τ ∈ {2.5, 3, 4, 5} using both mirostat and top-p sampling. We presented these texts, together with a human-written 300-word continuation of the context, to 43 participants. Participants were not told how the texts were generated and rated each text on 1-to-7 Likert scales for fluency, coherence, and overall quality. They also guessed whether each text was AI- or human-generated. More details of the experiment can be found in the paper.

The human evaluation results below, for both top-p and mirostat sampling, show that texts with cross-entropy rate τ = 3 received the best ratings from human participants for fluency, coherence, and overall quality.

Fig.16: Human evaluation ratings for top-p sampling.
Fig.17: Human evaluation ratings for mirostat sampling.

Further, for τ = 3, more than half of the raters mistakenly guessed the AI-generated text to be human-generated.

Note that with mirostat we can choose τ beforehand to control the quality of the generated text, whereas with other sampling methods like top-p or top-k, we get considerable variation in cross-entropy rates even for fixed input parameters p and k, respectively. Also note how sensitive these human evaluations are to changes in cross-entropy rate. This implies that top-p and top-k sampling can be less reliable than mirostat in controlling the quality of the generated text.

Thus, by controlling the cross-entropy rate, mirostat controls the quality of the generated text as well.

Check the full paper here for detailed theoretical and experimental results on perplexity and repetitions in neural text generation.

The code for mirostat sampling is available here.


Sourya Basu is a PhD candidate in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign.