Creativity, Fan Fiction, and Compression

I’ve written before about the relationship between creativity and compressibility. To recap, a creative work is one that violates expectations, while a compressible statement is one that’s expected.

For instance, consider two sentences:

I suspect you find the second more creative.

Three more examples of creative sentences:

Given that less predictable sentences are more creative, and less predictable sentences are less compressible, creative works ought to be less compressible than non-creative ones. And, indeed, I found some evidence for this in a previous experiment.

But that was not too compelling as it compared technical, repetitive works to novels. This time, I decided to compare very creative writing to normal creative writing.


The idea then is to compare the compressibility of amateur creative writing with that of experts. To accomplish this, I took 95 of the top 100 most downloaded works from Project Gutenberg. I figure that these count as very creative works given that they’re still popular now, ~100 years later. For amateur writing, I downloaded 107 fanfiction novels listed as “extraordinary” from

I then selected the strongest open source text compression algorithm, as ranked by Matt Mahoney’s compression benchmarkpaq8pxd. I ran each work through the strongest level of compression, and then compared the ratio of compressed to uncompressed space for each work.

Analysis and Results

I plotted the data and examined the outliers, which turned out to be compressed files that my script incorrectly grabbed from Project Gutenberg. I removed these from the analysis, and produced this:


Here the red dots are fanfiction novels, while the blue-ish ones are classic works of literature. If the hypothesis were true, we’d expect them to fall into distinct clusters. They don’t.

Comparing compressibility alone produces this:


Again, no clear grouping.

Finally, I applied a Student’s t test to the data, which should tell us if the two data sets are distinguishable mathematically. Based on the graphs, intuition says it won’t, and indeed it doesn’t:

Welch Two Sample t-test

data:  dat$RATIO by dat$CLASS
t = -1.3614, df = 144.26, p-value = 0.1755
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.009924785  0.001828882
sample estimates:
    mean in group  FANFIC mean in group  LITERATURE 
                0.2230189                 0.2270668

The p-value here is 0.1755, which is not statistical significance. The code and data necessary to reproduce this are available on GitHub.


I must admit a certain amount of disappointment that we weren’t able to distinguish between literature and fanfiction by compressiblity. That would have been pretty neat.

So, what does this failure mean? There at least six hypothesis that get a boost based on this evidence:

And so on and, of course, it’s possible that I messed up the analysis somewhere.

Of all of these, my preferred explanation is that compression technology (and hardware) are not yet good enough. Consider, again, the difference between a creative and a not-creative sentence:

The first is boring, right? Why? Because we’ve heard it before. It’s compressible — but how’s a compression algorithm supposed to know that? Well, maybe if we trained it on a corpus of the English language, gave it the sort of experience that we have, then it might be able to identify a cliche.

But that’s not how compression works right now. I mean, sure, some have certain models of language, but nothing approaching the memory that a human has, which is where “human as computer compression algorithm” breaks down. Even with the right algorithm — maybe we already know it — the hardware isn’t there.

Scientific American estimates that the brain has a storage capacity of about 2.5 petabytes, which is sort of hand-wavy and I’d bet on the high side, but every estimate I’ve seen puts the brain at more than 4 gigabytes, by at least a couple orders of magnitude. I don’t know of any compressors that use memory anywhere near that, and certainly none that use anything like 2.5 petabytes. At the very least, we’re limited by hardware here.

But don’t just listen to me. Make up your own mind.

Further Reading

You've read this far⁠—want more?

Subscribe and I'll e-mail you updates along with the ideas that I don't share anywhere else.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.