Fixing VirtIO Code 39

I’m recording this here in case anyone else is unfortunate enough to encounter this Code 39 message, so that they can avoid wasting several hours of their life attempting to fix it, by instead Googling it and reading this.

Alas, it’s too late for me.

If you’re attempting to install Red Hat’s VirtIO drivers onto a virtualized Windows box, and the whole thing seems to go okay, but then you receive the message “Windows cannot load the device driver for this hardware. The driver may be corrupted or missing (Code 39),” at which point, the device doesn’t work, and you have to remove the driver and do the entire thing over again…

Well, it’s because you’re installing the wrong VirtIO version. If you’re on a Windows Server 2008 R2 box like me, and you’re thinking “Oh, I’ll use the drivers in the Win8 folder, because there aren’t any drivers explicitly for Server 2008,” you have gone astray. Windows Server 2008 R2 uses the Windows 7 drivers, not the Windows 8 ones.

It would be nice if, you know, Windows would tell you this, instead of giving you this less-than-helpful-to-put-it-politely message, but it’s not like people pay them for this software or anything, so what do you really expect? Or, uh, wait.

Tiger Petting, Not That Dangerous

I was not born to be forced.
—Henry David Thoreau, On the Duty of Civil Disobedience

I’m going to run for president, and I’ve found my platform. (Note that this post follows in the very very long tradition of guy complaining on a blog.)

New York bureaucrats (I prefer the more technical term, “human trash”) have passed a bill banning “hugging, patting, or otherwise touching tigers at fairs or circuses.” Assemblywoman Linda Rosenthal, who proposed the bill, explained that her goal was to increase safety.

Oh, and it’s going to kill off all of these great pictures:

…which one intrepid Tumblrizen has been collecting from the dating app Tinder. This, too, I understand was, if not an explicit goal of the legislature, an unexpected “benefit.” Of course, infringing on the tiger-patting liberties of the populace is a very dubious sort of benefit, but this doesn’t prevent our popularly elected nannies from relishing it all the same: The Washington Post reports that one of the assemblywoman’s staffers joked, “I feel bad now. We’re killing bros’ dreams and chances of being laid!”

Before I tear into this one, note that 1) everyone who voted for this bill should be, if not hanged, barred from public office forevermore and 2) the governor has not yet signed it into law.

Wherein we calculate tiger petting risks

The internet’s reaction has been to repeatedly make the same lame joke: “…you’ve eliminated yet another way for Mother Nature to eliminate dumbasses from the planet.”

But I would like to paint you a different picture, one where people learn to distinguish minuscule from real risks and to avoid scope insensitivity. Guys who manage to set themselves apart in a hypercompetitive dating market by posing with tigers are not bros or dumbasses, but creative geniuses. Or, at least, sorta clever.

Especially when they are in what is, to a first approximation, zero danger.

There are somewhere between 5,000 and 10,000 tigers in the United States right now, a small subset of which are public zoo tigers. Many more are private property, owned by circuses, fairs, eccentric people, or small guy-with-a-backyard-full-of-cages level private zoos.

To make this as convincing as possible, we’ll assume the lower bound: there are a mere 5,000 tigers within the United States.

Now, how often does someone pet, pose with, or otherwise touch one of these beasts? Well, it’s hard to say for certain, but we can say something about it. It’s got to be higher than zero — we have the photos, after all. On the other hand, the average tiger is probably not being touched 1,000 times per day, so that’s an upper bound.

As a conservative estimate, it seems reasonable (at least to me) that the average tiger might be “hugged, patted, or otherwise touched” once a month. Sure, there are some unfriendly ones that no one touches, and others on the circus circuit who are very friendly, but the average tiger gets patted on the head, or whatever, at least once a month.

Once a month is 12 touches per year; round that down to 10, and we get a lower bound of 50,000 tiger petting incidents in the United States per year.

Now, how often is someone killed in the United States by a tiger? Over the last 23 years, there have been a total of 15 tiger-related fatalities, if we generously count ligers as tigers — about 0.7 tiger fatalities per year.

I’ll also note that during the last 23 years, there have been zero tiger-related fatalities in New York. In fact, I can only find one New York tiger fatality on record, and that was the 1985 death of a zookeeper — something this new law would not have prevented. It’s unclear what tiger attack epidemic, exactly, this new law is supposed to be preventing.

Indeed, only one person has ever been killed as a direct result of posing with a tiger, at least in the last 23 years — a 17-year-old volunteer in Kansas.

So, if you believe that there are at least 50,000 tiger petting incidents in the United States per year, then your risk of dying from petting a tiger is about 1 in 70,000. And remember, that incident count is a lower bound, so the true risk is, if anything, even smaller. (Although note that your chances of being somehow harmed by a tiger are higher, probably by about a factor of 5 to 10, but fatality rates are easier to compare.)
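
The back-of-the-envelope math above can be written out explicitly. A minimal sketch using the post’s own estimates (the rounding choices match the text):

```python
# Tiger-petting fatality risk, using the post's lower-bound numbers.

tigers_in_us = 5_000            # the post's lower-bound tiger count
touches_per_tiger_year = 10     # "once a month," rounded down
pettings_per_year = tigers_in_us * touches_per_tiger_year   # 50,000

fatalities_per_year = 0.7       # 15 deaths over 23 years, rounded up

risk_per_petting = fatalities_per_year / pettings_per_year
odds = round(1 / risk_per_petting)
print(f"about 1 in {odds:,}")   # about 1 in 71,429, i.e. roughly 1 in 70,000
```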

In which I compare tiger risks to other risks

But 1 in 70,000 is just a number. What else is about as risky as death from petting a tiger?

The odds of dying from touching a tiger are about the same as:

Petting a tiger is less risky than:

So go forth, pet tigers, hold snakes, skydive. Don’t smoke or deal any crack — there’s a lot to live for: I know I for one am looking forward to Rage Against the Machine’s next single, “Fuck you, I’m petting that tiger.”

Results From The First Split Test

Measuring gives you a leg up on experts who are too good to measure.
—Walter Bright

If you’ve been following along via email (and if not, you should be: subscribe here), you understand that my current philosophy of site growth is to focus on long-term readers. Social traffic is easy to obtain, but oh-so-fleeting. By concentrating on returning visitors, I can build a sort of relationship with you, the gentle reader, and focus on consistent, steady growth, instead of the occasional dizzying flurry of activity when something takes off on Reddit. (Plus repeated interaction makes people nicer.)

Not that there’s anything wrong with these flurries of social traffic. They’re sort of like steroid injections. They keep the site strong but, at anything except the most unreasonable levels of abuse (Mike Tyson anyone?), the bulk of performance is the result of typical training activities.

Or something. Not my best metaphor ever.

As I see it, there are two ways that readers consistently engage with a site’s content: via an email subscription or an RSS feed. Well, okay, maybe two was a lie. I bet there are at least six: Those two just mentioned, people who follow the site via Twitter, those depraved individuals who have adopted the strategy of visit-once-in-a-while-and-refresh, and so on.

But primarily, there are the two means: RSS and email. And people, thus far anyways, engage a lot more with email than with RSS, plus some dozen-ish other benefits: Much better analytics, I can make emails a bit more personal, I can assume that email subscribers have a bit better idea of who I am and what I’m about, and so on. RSS feeds are more limited.

So, in pursuit of this goal of converting strangers into friends, I’m running a series of split (“AB”) tests to see whether different site changes improve the rate at which people subscribe via email. The basic idea: some people see one version of the site, other people see another. After I’ve gathered enough data, I should have a pretty rigorous idea about which is better.
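
The “some people see one version, other people see another” mechanics usually come down to bucketing visitors deterministically, so a returning reader always sees the same variant. A minimal sketch (the function name and id format here are mine, not the author’s actual setup):

```python
import hashlib

def assign_variant(visitor_id: str) -> str:
    """Deterministically bucket a visitor into variant A or B.

    Hashing the id (rather than flipping a coin on every pageview)
    means the same visitor always sees the same version of the site.
    """
    digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# A visitor keeps their variant across visits:
assert assign_variant("visitor-42") == assign_variant("visitor-42")
```

Because the hash is effectively uniform, the two groups end up roughly the same size without any coordination or stored state.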

And, given that we’re all good empiricists here (as I mentioned before), with a firm belief in the virtue of actually, you know, testing our beliefs, AB testing is like the bare minimum we ought to be doing in order to persuade ourselves that we’re not fit to be crowned the king of hypocrites.

The First Test

The popular wisdom, which I’ve had shunted into my brain via Patrick McKenzie’s podcasts (recommended), is that you ought to trade an incentive for someone’s email. I figure there are two ways to frame it: one is sorta predatory, like putting some cheese beneath a box propped up by a stick and catching an angry badger; the more noble way of thinking about it is as a free exchange of value between parties. You receive updates and other valuables, I get an email address.

Or maybe it’s more like when your mom would promise a trip to the toy store if you’d just swallow some damn cold medicine.

So, anyways, on the heels of the success of the 100+ Interesting Data Sets For Statistics post, which is the most popular thing I’ve written yet, I figured, hey, maybe the bad drawings resonated with people. I’ll promise people some more of those if they sign up.

And, off I went, spending way too much time figuring out how to set up a split test with Jekyll, and debugging some not-quite-right analytics code. (The bug boiled down to forgetting to copy and paste my user ID. Whoops. I look forward to the day that we obsolete humans.)

And, finally, ladies and gentlemen of the jury (but, judging by the site analytics, mostly gentlemen), I got to the point where users would see either exhibit A or exhibit B.



So, what happened?

First, I gained a more intuitive appreciation of the phenomenon where, when studying small effects, you need a lot of data to figure anything out. Over the past two weeks, about 15,000 unique visitors viewed each option, which still wasn’t enough to reach statistical significance.
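
There’s a standard rule of thumb that makes the “you need a lot of data” point concrete: to detect a difference δ between two conversion rates at roughly 80% power and α = 0.05, you need about n ≈ 16·p(1−p)/δ² visitors per arm. A sketch with illustrative numbers (the ~0.25% baseline rate and 20% relative lift are my assumptions, not measurements from this site):

```python
# Rule-of-thumb per-arm sample size: n ≈ 16 * p * (1 - p) / delta**2
# (roughly 80% power at alpha = 0.05). Baseline and lift are assumed.

p_baseline = 0.0025                 # assumed ~0.25% conversion rate
relative_lift = 0.20                # assumed 20% relative improvement
p_variant = p_baseline * (1 + relative_lift)

delta = p_variant - p_baseline
p_avg = (p_baseline + p_variant) / 2

n_per_arm = 16 * p_avg * (1 - p_avg) / delta**2
print(f"~{n_per_arm:,.0f} visitors per arm")   # on the order of 175,000
```

With rates this low, detecting a modest lift takes roughly ten times the 15,000 visitors per arm the test actually got.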

I terminated the test anyways. With 43 subscribes to exhibit A and 35 to exhibit B, promising people bonus drawings increased conversions by about 23%, probably. This calculator spits out a confidence value of 82% and, given that our prior belief says that promising people stuff = more conversions, we can be a little more confident than that.
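
For the curious, that 82% figure can be reproduced with a one-sided two-proportion z-test on the raw counts (a standard method; I’m assuming that’s roughly what the linked calculator does):

```python
from statistics import NormalDist

# One-sided two-proportion z-test on the raw numbers from the post.
visitors = 15_000            # approximate uniques per variant
subs_a, subs_b = 43, 35      # exhibit A (bonus-drawings promise) vs. exhibit B

p_a, p_b = subs_a / visitors, subs_b / visitors
lift = p_a / p_b - 1                      # ~23% relative improvement
pooled = (subs_a + subs_b) / (2 * visitors)
se = (pooled * (1 - pooled) * (2 / visitors)) ** 0.5
z = (p_a - p_b) / se

confidence = NormalDist().cdf(z)          # one-sided "confidence," ~82%
print(f"lift ≈ {lift:.0%}, confidence ≈ {confidence:.0%}")
```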

So I’m convinced. On to the next set of tests!

Further Reading

How the 2013 Boston Marathon Bombings Affected This Year’s Attendance

A year ago, I logged a prediction (at 60% confidence) that this year’s Boston Marathon attendance would be lower than the previous year’s as a result of the 2013 bombings.

Well, the numbers are in, and I wasn’t even close: Last year, 26,839 people entered the race. This year? 35,671 runners, about 33% growth. (In hindsight, what was I even thinking?)

I wanted to quantify just what sort of effect the bombings had on attendance, so I gathered all the data that’s readily available online, and plotted it:


You’ll notice that 2014 seems like a pretty clear outlier. Fitting a line to the data allows us to quantify what normal growth probably would have looked like in an alternate universe where there were no bombings:


Running the numbers, the difference between the predicted turnout and the observed turnout is an additional 6,087 runners. You might wonder: what kind of economic windfall is that? Well, the 2012 marathon generated $137.5 million in revenue, some $5,123 per runner. This means that the additional 6,087 runners should generate an additional $31.2 million in revenue, or about a tenth of the cost of the bombings’ damage (at least according to one NBC estimate).
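
The windfall figure multiplies out as follows (assuming, as the post implicitly does, that the 2012 field was about the same size as 2013’s ~26,839 entrants when computing revenue per runner):

```python
# Reproducing the revenue arithmetic. The $137.5M figure is the post's
# 2012 revenue estimate; the runner count is an assumption borrowed
# from the 2013 field size.

revenue_2012 = 137.5e6
runners = 26_839
per_runner = revenue_2012 / runners              # ~$5,123

extra_runners = 6_087                            # observed minus trend-predicted
extra_revenue = extra_runners * per_runner
print(f"${extra_revenue / 1e6:.1f} million")     # ~$31.2 million
```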

On Bad Publicity

But we should go back and ask, “What was wrong with my intuition a year ago such that I expected marathon attendance to decrease?” I suspect I underestimated just how compelling a message like, “2014: Let’s race against terror” or “Racing in loving memory of Martin William Richard” would be.

I think I was thinking along the lines of, “Well, people died. People will think it’s dangerous, so they won’t go.” But of course that didn’t happen. Maybe I overestimated just how irrational people are. They probably figured the odds of a second attack were tiny.

Or maybe the sort of media coverage you get when someone attacks your marathon is just surprisingly effective advertising. The Boston Marathon was not even on my radar a year ago, but I’m sitting here and talking about it now, and I’m confident I wouldn’t be otherwise, so it certainly got my attention.

The most relevant comparison I can think of is the 2012 theater shooting, which marred the release of The Dark Knight Rises. Given the growth in Boston Marathon attendance, we might expect — perversely — that the shooting would be good for ticket sales.

This doesn’t appear to be the case. The movie brought in $30 million less in sales than expected, and a 2013 analysis in the Journal of Criminal Analysis reports that the “Aurora theater shooting resulted in striking declines for Cinemark (the targeted theater) as well as major US competitors, but had no impact on overseas theater chains.”

The most salient difference between the two is, I expect, timeframe. A year’s passing makes the bombings feel distant (at least to someone not directly involved) while the bulk of the expected ticket sales took place soon after the theater shootings.

Finally, you might wonder: is there any truth to this whole notion of “no such thing as bad publicity”? Well, sorta: a 2004 study found that any reviews, positive and negative, increased book sales. A 2010 sorta replication found that negative reviews increased sales, but only of mostly unknown authors. Bad reviews of well-known authors, in contrast, hurt sales.

The paper itself offers a few interesting tidbits, too, including:

A wine described “as redolent of stinky socks,” for example, saw its sales increase by 5% after it was reviewed by a prominent wine website (O’Connell 2006). Similarly, although the movie Borat made relentless fun of the country of Kazakhstan, [the country] reported a “300 percent increase in requests for information about the country” after the film was released (Yabroff 2006, p. 8).

But, in general, bad publicity is bad publicity and we should stop paying too much attention to questionable adages:

Negative publicity often hurts. When a rumor circulated that McDonald’s used worm meat in its hamburgers, sales decreased by more than 25% (Greene 1978). Coverage of musician Michael Jackson’s bizarre behavior and brushes with the law destroyed his career. Viacom Inc. Chairman Sumner Redstone estimated that negative publicity cost Mission Impossible 3 more than $100 million in ticket sales (Burrough 2006), and film pundits have suggested that it is “almost impossible to recover from bad buzz” (James 2006).

Academic research corroborates this sentiment and casts further doubt on the old adage that “any publicity is good publicity.” Negative publicity about a product has been shown to hurt everything from product and brand evaluation (Tybout et al. 1981, Wyatt and Badger 1984) to firm net present value and sales (Goldenberg et al. 2007, Reinstein and Snyder 2005). Negative movie reviews, for example, decrease box office receipts (Basuroy et al. 2003).

Further Reading

  • If you want to use the data for anything, it’s available here.

100+ Interesting Data Sets for Statistics

Edit: Hey guys! This has proved to be one of the most popular articles on the site, so I’ve created a supplemental download on the 5 biggest statistics mistakes beginners make and how to avoid them. Enter your email below (or on any of the forms scattered around the site), and I’ll send it to you, along with ~2 emails per week on research-backed techniques for achieving anything.

Here’s the form:


If we have data, let’s look at data. If all we have are opinions, let’s go with mine.
—Jim Barksdale

I’m not too fond of the phrase “information age.” It sounds like someone sat down and was like, “Hey, there’s a ton of information today… what should we call it? How about the information age?”

First of all, that’s just lazy and, second of all, it doesn’t capture how overwhelming it all is, the sort of angst and helplessness you feel when confronted with… everything. Just all of it.

A phrase that captures it a bit better is “drinking from the firehose.” I haven’t ever tried to drink from an actual firehose, but the metaphor certainly seems apt.

Maybe instead of information age, we could call it the saturation age, you know, because our brains are full to bursting. Or maybe just the overload age. Or how about the age of inundation?

One thing is certain, anyways. Some of us are drowning in data, most of us are oblivious, and some lucky few are surfing on it.  We can do things that we couldn’t in the past (e.g. without Project Gutenberg, neither of my two analyses of the relationship between creativity and compression would have been possible.)

And that got me wondering: just what other interesting data sets are out there? As part of my research, I decided to put together this sort of guided tour, a curated list if you will — adding a bit of structure to the firehose’s deluge.

Here’s my attempt at making it all just a bit more manageable.

Interesting Data Sets

  • If, tomorrow, you get an email congratulating you on your new status as future Jeopardy contestant, how are you going to prepare? Well, one approach might be to download this archive of 216,930 past Jeopardy questions and plug them into your favorite spaced repetition system. Combine that with reading up on Jeopardy betting strategies, and you’re well on your way to becoming the next Arthur Chu (except hopefully nicer).

  • Ever get a morbid curiosity about what it’s like to be on death row? (Yeah, me neither.) But in case you ever do, Texas has graciously placed the last words of every inmate executed since 1984 online. So… sentiment analysis, anyone? (“How upbeat are death row inmates days before execution? With a little help from some data, we found out!”)

  • Speaking of prison, there’s more data on prisoners, including information about “their current offense and sentence, criminal history, family background and personal characteristics, prior drug and alcohol use and treatment programs, gun possession and use, and prison activities, programs, and services” available here.

  • How about reading other people’s emails? Ever wanted to do that, but can’t be bothered to train l33t hacking skills (and never mind the legality of it)? (Okay, this one I have thought about.) Well, I’ve got you covered. Check out the Enron corpus. It contains more than half a million emails from about 150 users, mostly senior management of Enron, organized into folders. Wikipedia calls it “unique in that it is one of the only publicly available mass collections of ‘real’ emails easily available for study.” Business idea: figure out what sort of information gets leaked in the emails that will later harm the execs at trial or whatever, then build a software system to automatically mine those out of real email. Either sell it to law enforcement or to corporate executives as the finest cover-your-ass email system.

  • Wondering what the internet really cares about? Well, I don’t know about that, but you could answer an easier question: What does Reddit care about? Someone has scraped the top 2.5 million Reddit posts and then placed them on GitHub. Now you can figure out (with data!) just how much Redditors love cats. Or how about a data backed equivalent of r/circlejerk? (The original use case was determining what domains are the most popular.)

  • Speaking of cats, here are 10,000 annotated images of cats. This ought to come in handy whenever I get around to training a robot to exterminate all non-cat lifeforms. (Or, if you’re Google, you could just train a cat recognition algorithm and then send those users cat-specific advertising.)


  • If you’re interested in building financial algorithms or, really, just predicting arbitrage opportunities for one of America’s largest cash crops, check out this data set, which tracks the price of marijuana from September 2nd, 2010 until about the present.

  • Who’s using what drugs and how often?

  • The earliest recorded chess match dates back to the 10th century, played between a historian from Baghdad and a student. Since then, it’s become a tradition for moves to be recorded – especially if a game has some significance, like a showdown between two strong players. As a consequence, today, students of the game benefit from one of the richest data sets of any game or sport. Perhaps the best freely available data set of games is known as the “Million Base,” boasting some 2.2 million matches. You can download it here. I can imagine an app that calculates your chess fingerprint, letting you know what grandmaster your play is most similar to, or an analysis of how play style has changed over time.

  • On the topic of games, for soccer fans, I recently came across this freely available data set of soccer games, players, teams, goals, and more. If that’s not enough, you can grab even more data via this Soccermetrics API python wrapper. I imagine that this could come in handy for coaches attempting to get an edge over opponent teams and, more generally, for that cross-section between geeks and gamblers attempting to build analytic models to make better bets.

  • Google has made all their Google Books n-gram data freely available. An n-gram is an n-word phrase, and the data set includes 1-grams through 5-grams. The data set is “based originally on 5.2 million books published between 1500 and 2008.” I can imagine using it to determine the most overused, clichéd phrases, and those phrases that are in danger of becoming clichéd. (Quick! Someone register the domain!)

  • Amazon has a number of freely available data sets (although I think you need to run your analysis on top of their cloud, AWS), including more than 2.8 billion webpages courtesy of Common Crawl. The possibilities are endless, but here’s an old business idea I had: analyze the Common Crawl data and determine cheap or not-currently-registered domains which are, for whatever reason, linked to by many websites. Buy these up and then resell them to people involved in SEO. (Or you could, you know, try to build the next Google.)

  • How well do minorities do on the computer science advanced placement exam? You can find out and tell me.

  • There’s the Million Song data set, which contains information about a million different songs, including a metric “danceability.” Might be nice to pair that with a media player specialized for parties — start with “conversation” music, and slowly shift to more danceable stuff as the night drags on. The data could also be used for a clustering algorithm (automatic genre detection, maybe), but I’m not sure how useful that’d be. A number of people have tried to build recommendation algorithms based on the data, including Kagglers and a team from Cornell. One possible use: analyzing music by year — How danceable, fast, etc. were the 70s? 80s? 90s? (Or how about looking for a follow-the-leader effect. If one song goes viral with a unique style, do a bunch of copycats follow?)

  • Speaking of music data sets, Last.fm has music data available. Collected from ~360,000 users, it’s in the form of “user, artist, ## of plays”. This would be good for clustering algorithms that automatically determine genre, or for recommender systems. (Even a “this artist is most similar to” thing would be sorta cool.)

  • When I think geeks, I think math and computer geeks, but there are many more. Terry Pratchett geeks (dated one!), Whovians, anime geeks, theater geeks and, with some relevance to this next data set, comic book geeks. Cesc Rosselló, Ricardo Alberich, and Joe Miro have put together a “social graph” of the Marvel Universe, and the data is freely available. Ideas for use: Maybe it could be overlaid on Facebook’s social graph to produce a new take on the “What superhero are you?” quiz.

  • Yelp has a freely available subset of their data, including restaurant rankings and reviews. One business idea: use tweets to predict restaurant star ratings. This would enable you to build out a Yelp competitor without requiring an active user base — you could just mine Twitter for data!

  • If you’re interested in data about data (metadata!), Jürgen Schwärzler, a statistician from Google’s public data team, has put together a list of the most frequently searched for data. The top 5 are school comparisons, unemployment, population, sales tax, and salaries. I was surprised that school comparisons were number 1 but, then again, I don’t have any brats running around (yet?). This list would be a good first step in researching what sort of data comparisons people actually care about.

  • Some of my readers are, no doubt, evil geniuses. Others want to save the world. There’s a subset of both of these groups who are interested in superintelligent robots. But to build such a robot, you’re going to have to teach it facts. All the things we take for granted, like that every person has one father. It would be a pain to insert those 10 million facts by hand (and, at a fact a minute, it would take more than 19 years). Thankfully, Freebase has done part of the job for you, making more than 1.9 billion facts freely available.

  • Maybe your plans are slightly less ambitious. You don’t want to build a superintelligent machine, just one smarter than your run of the mill mathematician. If that’s the case, you’re going to need to teach your machine a lot about mathematics, probably in the form of proofs and theorems. In that case, check out the Mizar project, which has formalized more than 9,400 definitions and 49,000 theorems.

  • And let’s say you build this mathematician and, sure, it can help you with proofs, but so what? You long for someone you can connect with on a deeper level. Someone who can summarize any topic imaginable. In that case, you might want to feed your robot on Wikipedia data. While all of Wikipedia is freely available, DBpedia is an attempt to synthesize it into a more structured format.

  • Now, you get tired of mathematics and Wikipedia. It turns out that proofs don’t pay the bills, so instead you decide to become a software engineer. Somehow, though, you’ve managed to build these machines without ever gaining a rudimentary understanding of programming, and you want a machine that will teach it to you. But where to find the data for such a thing? You might start with downloading all 7.3 million StackOverflow questions. (Actually, all the StackExchange data is freely available, so you could feed it more math information from both MathOverflow and the other math stackexchange. Plus statistics from Cross Validated, and so on.)

  • Ever wanted to study true friendship? (C’mon! Free your inner <s>child</s> social scientist.) Y’know, genuine, platonic love, like the kind embodied by dolphins? Well, now you can! All thanks to your humble author and Mark Newman, who’s placed a network of “frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand.” Business idea: Flippr. It’s like Facebook, but for dolphins, with plans to expand into emerging whale and sea turtle markets. Most revenue will come from sardine sales.

  • Do left-leaning blogs more often link to other left-leaning blogs than right-leaning ones? Well, I don’t know, but it sounds reasonable. And, thanks to permission from Lada Adamic, you can download her network of hyperlinks between weblogs on US politics, recorded in 2005. (Or you could just read her paper. Spoilers: conservatives more freely link to other conservatives than liberals link to liberals so, if you’re interested in link building, maybe you should register Republican.)<a href="#citation-1"><sup>1</sup></a>

  • Who’s friendlier: the average jazz musician or the average dolphin? You could find out by combining the dolphin data set mentioned earlier with Pablo M. Gleiser and Leon Danon’s jazz musicians network data set.

  • What about 1930s southern women or prisoners? Who’s friendlier? How about fraternity members or HAM radio operators? All this and more can be figured out with these network data sets.

  • How about dolphins or Slashdotters?

  • Web 2.0 websites (like Reddit) are sometimes gamed by “voting rings,” which are groups of people that intentionally vote up each other’s content, regardless of quality. I’ve often wondered if the same thing happens in academic circles. Like, you know, one night during your first year in grad school, you’re kidnapped in the middle of the night and made to swear a blood oath that you’ll cite every other member of the club. Or something. Well, Stanford has put online Arxiv’s High Energy Physics paper citation network, so you could find out.

  • You read this blog, so you’re pretty smart, right? And maybe you’d like to be rich, you know, so you can found the next Bill and Melinda Gates Foundation and save the world. (Because that’s why you want to be rich, right?) Well, then maybe you ought to develop some new-fangled trading algorithm and pick up like a trillion pennies from in front of the metaphorical steam-roller that is the market. (Quantitative finance!) But, in such a case, you’d better at least test your strategy on historical market data. Market data which you can get here.

  • The Open Product Data website aims to make barcode data available for every brand for free. Business idea: a specialty tattoo parlor that only does barcode tattoos, but lets customers pick whatever product they want. Think about it: “What’s your tattoo mean?” “It’s a Twinkie barcode, because Twinkies last forever, man, just like my faith.”

  • The European Center for Medium-Range Weather Forecasts has an impressive looking collection of weather data. Why, you ask, does the weather matter? The economic incentives for predicting the weather are absurd. When should you plant crops? Plan a big event? Launch a space shuttle? Go deep sea fishing? But I want to talk about the most fun application of weather data I’m aware of: The financial industry. I have a lot of respect for finance, mostly because of the crazy stuff they do. The only practical application of neutrinos I’ve heard of, for instance, is “because finance.” Should your algorithm buy Indonesian sesame seed futures? With weather data, it might know.

  • If you need nutrition data about food, the USDA has you covered. Business idea: A phone application called, “Am I allergic to that?” Then, lobby for your state to pass some law regulating each school into buying a license of it for every student.

  • For a wordsmith, a good dictionary is indispensable, and when it comes to word data, you could do a lot worse than check out the freely available WordNet. WordNet has significant advantages over your run of the mill dictionary as it focuses on the structure of language, grouping words into “sets of cognitive synonyms (synsets), each expressing a distinct concept.” It also has some information about relationships, such as “a chair has legs.”

  • We’ve already established that some of you are evil geniuses, in which case, where are you going to build your secret lair? I mean, a volcano is pretty cool, but is it evil and genius enough for competing in today’s modern world? You know what the other evil geniuses don’t have? A secret base on a planet outside of the solar system. With NASA’s list, you can get busy commissioning someone to build you a base on KOI-3284.01.<a href="#citation-2"><sup>2</sup></a>

  • The Federal Railroad Administration keeps a list of “railroad safety information including accidents and incidents, inventory and highway-rail crossing data.” Someone (like the NY Times) could overlay this on a map of the United States and figure out if people in poor regions are more likely to be hit by trains or something.

  • If you need a database of comprehensive book data, perhaps to build a competitor to Goodreads or an online digital library, the Open Library allows people to freely download their entire database.

  • Who is the United States killing with drones? If you’re content with Pakistan-specific data, there is a list of drone strikes available here.

  • If you’re interested in building a Papers2 competitor with support for automatically importing citation data (please do this), CrossRef metadata search might be a good place to check out.

  • Mnemosyne is a virtual flash card program that takes advantage of spaced repetition to maximize learning. (As you might recall, I’m a big fan of spaced repetition.) The project has been collecting user data for years, and gwern has graciously agreed to freely host the data for a few months. Perhaps one could run some sort of unsupervised learning algorithm over it and try to discover heretofore unknown information about human memory.

  • How much would it cost to hire Justin Bieber to play at your wedding? The fine lads at Priceonomics have figured out how much it would cost to hire your favorite band. You could take this data and calculate some sort of popularity-to-price ratio — what’s the most fame for your buck?

  • I’ve mentioned in a few of the other data sets just how lucrative it is to be able to better predict the stock market than everyone else. In 2011, researchers found that they could use data from Twitter to do just that: they went through tweets, found ones related to publicly traded companies, and then calculated a mood score. With this, they write, “We find an accuracy of 86.7% in predicting the daily up and down changes in the closing values of the DJIA.” A number of Twitter data sets are freely available here.

  • A 2014 paper by Clifford Winston and Fred Mannering reports that vehicle traffic costs the United States 100 billion dollars each year.<a href="#citation-3"><sup>3</sup></a> There’s money to be made, then, in routing traffic more efficiently. One way to do this would be to feed an algorithm historical traffic data and then use that to predict hotspots, which you would route people around. Lots of that data is available online.

  • On the other hand, if you were building an app to track current traffic data, you’ll need a different data source.

  • If you want to launch a spam-fighting service, or maybe just analyze what type of emails spammers are sending, you’ll need data. UC Irvine has you covered.

  • But maybe you want to extend your spam-fighting service to text messages. Still got you covered.

  • There is a wealth of data sets available for R and all you have to do is install a package. Ecdat is one of those packages, containing gobs of econometric data. How about an analysis of how math levels correlate with number of cigarettes smoked? I’d read that.

  • Ever noticed how one person will sit on the boards of several companies, and it’s like, hey, maybe Condoleezza Rice, with her ties to government surveillance, isn’t the best choice for Dropbox? What if you could analyze those connections? Well, with this data set, you can. But only for Norway — it’s a network of the board members of public companies in Norway.

  • Ever seen a TV show where a government determines that someone is a terrorist based on their social ties? I always figured that data would be locked down tight somewhere, y’know, classified. But it turns out it isn’t. You, too, can analyze the social networks of terrorists.

  • There’s been a fair bit of controversy around all the bureaucracy of Wikipedia. But how does one become a bona fide Wikipedia big shot? Who’s the ideal Wikipedia administrator? Well, they’re voted for, and the data is available for download.

  • Harvard has opened up its set of “over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.”

  • If you need small data sets for students, check out DASL. One at random: does sterilizing dominant males in a wild mustang population reduce the population?

  • GET-Evidence has put up public genomes for download. I think Steven Pinker’s data is in there somewhere. Maybe you could make yourself a clone?

  • Oh, and speaking of genomes, the 1000 Genomes project has made ~260 terabytes of genome data downloadable.

  • In what is the smallest data set on this list: the survival rates of men and women on the Titanic. Female passengers were ~4 times more likely to survive than male passengers.

  • Want a super-specific breakdown of the contents of your food? You’re in luck. (Thanks, Canada!)

  • There’s a similar database of the metabolites in the human body. I’m not sure what you could do with it, but it might come in handy in some sort of dystopian future where humans are raised like cattle for their nutrients. (Maybe someone could use this to build a viral marketing campaign along the lines of, “How nutritious is your mom?”)

  • The Reference Energy Disaggregation Data Set has about 500 GB of compressed data on home energy use. Obvious use candidates: improving home efficiency or creating a visualization of just where people’s energy bills are going.

  • Did you know that you can download all the PDFs on Arxiv? Once we manage to teach machines natural language, we can just have a computer read it all and give us the cliff notes (and the scientific breakthroughs).

  • If you need economic census data on any industry, check out the Census Bureau’s industry statistics portal. If finance is really evil, you ought to be able to find something damning in the data.

  • For those unfamiliar with Usenet, it’s sort of like a huge, text-only forum. It was much more popular before the rise of the world wide web. Anyways, you can download a huge data set of postings to Usenet here. It might be pretty good for some kind of textual analysis project or for training a machine learning algorithm (maybe a spellchecker?). You could use the data to build out a Google Groups competitor, too.

  • Nick Bostrom has a very interesting paper called “Existential Risk Prevention as Global Priority.” The basic intuition is that preventing even small risks of human extinction is worthwhile if we consider all the human generations it would save. One way to start saving all those future lives might be by digging into this data set of every recorded meteor impact on Earth from 2500 BCE to 2012.

  • How do gender and mental illness affect crime? This data set was collected explicitly with that question in mind.

  • Speaking of mental health, if you’re interested in how it affects minorities specifically, try this.

  • There are a lot of lonely men and women out there, and some of those lonely men and women have excellent analytical skills. For those lonely people, I suggest using this data set, which “surveyed how Americans met their spouses and romantic partners, and compared traditional to non-traditional couples” to determine the best way to meet that special someone.


  • Tons of data on what is called “adolescent health” is available here, though the set actually covers more, including a bunch of relationship data and biomarkers. (Not creatine levels, unfortunately.)

  • Here’s a question: Are modern jobs worse than those of the past? My grandparents built tires at Firestone. Today, people rarely have that level of control and visceral experience of the finished product of their work. This set of five surveys regarding how different groups experience employment could answer that question. I can see the article now — “Is everything getting slightly worse? We found out.”

  • Stanford has 35 million Amazon reviews available for download. Lots of stuff you could do with this: use it to improve recommendation algorithms, or figure out whether or not there’s a follow-the-leader effect with reviews (i.e., do early positive reviews beget more positive reviews?).

  • Based on some of my research prior to writing this, the Google keyword “data sets on serial killers” is 1) really specific and 2) weirdly popular, but I guess there’s no accounting for taste. And, of course, we’ve got data for that, thanks to the Serial Killer Information Center.<a href="#citation-4"><sup>4</sup></a>

  • In this gruesome vein, the University of Maryland has a “Global Terrorism Database,” which is a set of more than 113,000 terror incidents. You can download it after filling out a form. Ideas for use: visualization of terror incidents by location over time, predicting and preventing terror attacks, and creating early alert systems for vulnerable areas.

  • The MNIST Database is a classic in the field of machine learning. It’s a set of labeled hand-written digits, exactly the sort of training data OCR algorithms need. Today, some algorithms are actually more accurate than human judges! This would have been nice to have back when I was in grade school. I distinctly recall once arguing with a teacher over missing a question because she insisted that I had written the letter j when it was clearly a d. In the future, we’ll let the machines decide.

  • UCI has a poker hand data set available. My poker-fu is fairly weak, but I’m sure there’s some interesting analysis to be done there. I’ve heard secondhand that humans still maintain some advantage over machines when it comes to poker, but I’m unable to verify that via Google. Machines have won in at least one tournament.

  • Another data set from UCI: images labeled as either advertisements or non-advertisements. This is good for building classification algorithms that decide whether or not a new image is an ad, which might be useful for, say, automatic ad blocking or spam detection. Or maybe a Google Glass application that filters out real-life advertisements. That’d be cool. Look at a billboard and instead see a virtual extension of the natural landscape.

  • Remember the whole Star Wars Kid debacle? Wikipedia informs me that Attack of the Show rated it the number 1 viral video of all time. Andy Baio, one of the guys who was in on it before it was cool and who coined the phrase “Star Wars Kid,” has made his server logs from the time publicly available. Someone could take this data and produce a visualization of who saw it when via maps, along with annotations of where the traffic was coming from.

  • Who’s linking to who (and what) on WordPress? (Tidbit: most of the links to this site come from WordPress blogs.) With this WordPress crawl, you can find out. Visualizing the network might be sorta cool, but it’d be cooler still to uncover some information about “supernodes” that either are linked to often or put out a lot of links (or maybe both). Or maybe clustering people by interest.

  • Is Obama in bed with big oil? Or extremist environmentalists? Or the corn lobbies? And who was backing that Herman Cain dude, anyways? The 2012 Presidential Campaign Finance data is available for download. It would be neat to see an analysis of what industries prefer what candidates.

  • Which private colleges are the best value?

  • Which public colleges are the best value?

  • Cigarette data by state. Kentucky smokes the most, with West Virginia as a close second. Given the massive social harm of tobacco, a good analysis could very well save a lot of lives.

  • On December 5th of 2008, what was being downloaded from The Pirate Bay?
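To make the Twitter mood-score idea above concrete, here is a toy, lexicon-based scorer. The word lists and scoring rule are invented for illustration; the actual 2011 study used established mood-measurement tools, not a hand-rolled lexicon like this.

```python
# Toy sketch of a lexicon-based "mood score" over a batch of tweets.
# The word lists are invented for illustration; the 2011 study used
# established mood-measurement tools, not a hand-rolled lexicon.

POSITIVE = {"great", "happy", "up", "win", "good", "bullish"}
NEGATIVE = {"bad", "sad", "down", "lose", "terrible", "bearish"}

def mood_score(tweets):
    """Return (positive hits - negative hits) / total words."""
    pos = neg = total = 0
    for tweet in tweets:
        for raw in tweet.lower().split():
            word = raw.strip(".,!?")  # cheap punctuation stripping
            total += 1
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    return (pos - neg) / total if total else 0.0

sample = ["Great earnings, stock is up, feeling happy",
          "Terrible guidance, this is bad news"]
print(mood_score(sample))
```

A real pipeline would then correlate a daily series of these scores against the next day’s index movement.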
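And for the MNIST entry above, here is the kind of baseline classifier that data set is used to benchmark: a minimal 1-nearest-neighbor sketch over flattened pixel vectors. The four-pixel “images” are stand-ins of my own invention; real MNIST digits are 28×28 grayscale arrays.

```python
# Minimal 1-nearest-neighbor classifier, a classic MNIST-style baseline.
# The tiny 4-pixel "images" below are stand-ins for real 28x28 digits.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(train, query):
    """Return the label of the training example nearest to query;
    train is a list of (pixel_vector, label) pairs."""
    return min(train, key=lambda example: squared_distance(example[0], query))[1]

train = [
    ([0, 0, 1, 1], "1"),  # toy image labeled as the digit 1
    ([1, 1, 0, 0], "7"),  # toy image labeled as the digit 7
]
print(classify(train, [0, 0, 1, 0]))
```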

Further Reading


    1. With apologies to JFK: “Let us seek not the Democrat link or the Republican link, but the right link.”

    2. Wikipedia says: “KOI-3284.01 is believed to be the most Earth like exoplanet to be found so far by the Kepler space probe. It is predicted to have a radius 1.5 times that of Earth’s. It is predicted to be located at the proper distance from the sun to sustain liquid water.”

    3. “The Texas Transportation Institute’s latest Urban Mobility Report puts the annual cost of congestion to the nation, including both travel delays and expenditures on fuel, at more than $100 billion.”

    4. If that’s not enough, there seems to be a fair amount of research around “murder topology,” which is not, as you might naively expect, a super badass branch of mathematics, but rather concerned with the movement patterns of serial killers.

Surprisingly Dangerous Jobs In America

You can’t avoid danger.
—Jeannette Walls, Half Broke Horses

Yeah, you can. Don’t get one of these jobs, for instance.

David Henderson has rightly earned the title contrarian with his latest post which, to kick off National Police Week, points out that it’s more dangerous to be a farmer than a policeman — “For every 100,000 police, the annual fatality rate is 20. For every 100,000 farmers, it is 40% higher, at 28.” (Source.)

Now, on this blog, we’re good empiricists, and nothing warms the heart of an empiricist more than refuting a well-known, common sense “truth” with, you know, observations and data.

So that got me thinking: What jobs are more (or less) dangerous than one might naively suspect?

I present to you this delightful graph, taken from the Bureau of Labor Statistics:

Notice that the data here for farmers agrees with David’s. He has 28 per 100,000 versus the chart’s 25.3 per 100,000. (And given the endemic underreporting, the 28 number might well be more accurate.) Police officer isn’t included on the chart, but David’s data would make it about as dangerous as… taxi driver. That’s right, folks. The brave folks keeping the peace of our nation? Just as brave as your local cabby. (Actually, given that police deaths dropped 20 percent in 2012, cab drivers might be braver.)

I’m going to propose we replace Police Week with Fisherman Week, because it’s about 6 times more dangerous to be a fisherman than a police officer. (And who doesn’t love a good tuna steak?)

Or maybe we should keep Police Week, but dedicate 6 weeks to celebrating fishermen. It’s only fair. And, of course, three weeks to pilots and people involved with flying, along with a solid two weeks for garbage men.

Some other fun facts

Digging a bit further into the data, we find this somewhat troubling statistic: 92% of workplace fatalities are men. (Do we blame the patriarchy for this one?)

And if you were wondering what state is the most dangerous: North Dakota. It’s about as dangerous to work in North Dakota as it is to be a police officer. From the AFL-CIO’s “Death on the Job” report:

Among all of the states, North Dakota stands out as an exceptionally dangerous and deadly place to work. The state’s job fatality rate of 17.7/100,000 workers is alarming. It is more than five times the national average and is one of the highest state job fatality rates ever reported for any state.

So probably we should have a week celebrating North Dakotans, too.

Further Reading

What I’m Watching

From most recommended to least (roughly):

Hard Books Are Overrated

Hot air balloons take people on adventures. Books do, too.

Widely used calculus books must be mediocre.
— W. Rudin

I’ve noticed a phenomenon, especially in mathematics, where anyone asking for book recommendations is invariably pointed toward the least-gentle-but-still-reasonable textbook imaginable. A high school student might ask for an introduction to calculus, and someone will tell them to read Principles of Mathematical Analysis. Looking for an introduction to programming? Hey, try Structure and Interpretation of Computer Programs — or just read Knuth!

Several possible explanations spring to mind:

  • By recommending a difficult book to someone, you’re signalling your own intelligence.
  • Struggling against a text results in a sort of Stockholm syndrome, where one becomes fonder of it the more hours one sinks into it.
  • Recommending an easy book can be socially costly as the receiver might interpret it as an insult. It’s safer to recommend a hard book.
  • When asked for a book recommendation, people mentally grasp the book most representative of that subject, and this leads to a bias towards not-beginner-friendly texts.
  • Hard texts are that much better than anything else.
  • Familiarity with a subject damages one’s ability to judge whether or not a text is appropriate for someone without the same background.
  • Those recommending hard textbooks have not, generally, tried to self-study them, but rather sat through a class that required the text and completed 10% of the problems as homework.

All of these could be true, but what we really want to know is: are hard books systematically overvalued?

Comparing the utility of books

Now, I have no doubt that reading one hard book is more valuable than reading one easy book, but that’s not a fair fight. I could get through The Design of Everyday Things in a couple of days, while working through Knuth’s Concrete Mathematics would take me the better part of a year. If you accept that a book like Concrete Mathematics can take 200 hours while an easy book might take 4, the question becomes: Is one hard book worth fifty easy ones?


We can consider taking this completely literally. Consider perhaps the most straightforward application of microeconomic theory ever: If two men are selling rugs, but one man is able to sell his rugs for ten times as much, well, customers are deriving more value from his rugs.

50 easy books will, most of the time, cost more than 1 hard book. We might expect, then, that 50 easy books will provide more value to the consumer. I actually have painstakingly collected statistics on my own reading, but there’s no correlation between my rating of a book and its price (alas).
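The check itself is just a correlation coefficient over (price, rating) pairs. A sketch in plain Python, with made-up sample numbers standing in for the actual reading log:

```python
# Pearson correlation between book prices and ratings. The sample data
# below is invented for illustration, not taken from the real log.

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

prices = [7.99, 24.99, 12.99, 9.99, 34.99]   # hypothetical prices
ratings = [5, 3, 4, 4, 3]                    # hypothetical 1-5 ratings
print(round(pearson(prices, ratings), 2))
```

A value near zero, as reported above, means price tells you roughly nothing about how much the book was liked.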

A stroll through the New York Times’s best-seller list also reveals no obvious relationship between how interesting a book looks and its price. A look at computer science and programming books is a bit more compelling. The obviously more valuable books (e.g. The Art of Computer Programming) are selling for more than the nth Adobe Photoshop manual.

So easy books win here, but it’s not clear if this win is worth anything.

Amount learned

Instead, we can take the tack of estimating the amount learned from a book. When going through The Design of Everyday Things, I learned maybe 10 things. Today, I started reading Petzold’s CODE and have taken about 50 separate notes. At this rate, I should end up with something like 300. A significant variance here, but let’s say the average easy text can teach someone 35 new things — more if you’re careful with book selection.

If I consider, on the other hand, a harder text, like Artificial Intelligence: A Modern Approach, the amount available to learn is about an order of magnitude greater. Depending on the number of exercises one wades through, anywhere between one and three thousand seems reasonable.

This one could really go either way. A broad selection of fifty popular science texts will teach a person more than one really hard book, but one hard book will teach you more than fifty so-so spiritual or diet books.

And, of course, all learning is not created equal. Some things really are more important than others.

Possible heuristic: popular science is great for building a broad understanding, while difficult works are great for pushing one to the next level via deliberate practice.

Revealed preference

I can count the number of people I know who routinely fight their way through hard books on one hand. It’s not a normal thing that people do, even curious people. Indeed, curious people are notable for being more willing than most to read a wide variety of texts, not for the difficulty of those texts.

University courses

University professors typically assign more difficult textbooks (although often not truly difficult, more of a middle ground). Still, consider the popularity of a book like CLRS versus The Algorithm Design Manual. The first is more popular, the second more readable.

I’m not sure who wins this round: on the one hand, university textbooks are harder to read than what’s on the New York Times best-seller list. On the other hand, most university courses don’t require books like Structure and Interpretation of Computer Programs, instead opting for something gentler.

Plus, these medium difficulty textbooks are often little more than props for a class, something to accompany lectures and provide exercises. Still, on the whole, it seems more honest to call this a win for difficult books.

Comprehension, amount retained

On difficult texts, one Less Wrong commenter wrote this:

I found that when a text requires a second or third reading, taking a lot of notes, etc., I won’t be able to master it at the level of the material that I know well, and it won’t be retained as reliably, for example I won’t be able to re-generate most of the key constructions and theorems without looking them up a couple of years later (this applies even if more advanced topics are practiced in the meantime, as they usually don’t involve systematic review of the basics). Thus, there is a triple penalty for working on challenging material: it takes more effort and time to process, the resulting understanding is less fluent, and it gets forgotten faster and to a greater extent.


If we lived in a world where hard books were clearly superior to easier ones, I would expect the reading habits of successful people to center on difficult books.

This isn’t the impression that I get, in general. CEOs fill their reading lists with biographies, not textbooks. Look at the reading habits of Bill Gates and you’ll see a list filled with popular non-fiction rather than hefty technical works.

Closing remarks

All of this suggests a heuristic: to decide what to read, ask yourself, “What’s the easiest book I could read that will teach me a lot?” Or, alternatively, “What’s the easiest book that will move me towards my goals?”

For reading broadly, popular works seem like a clear win. I think it’s best to save difficult works in a field until you’ve reached diminishing returns on easier texts. If you’re not learning much, then you ought to move on to something harder.

Ideally, the process of reading through easy texts on a subject before tackling harder ones transforms once difficult books into easy ones. This pattern of reading looks more like a slow progression than a violent struggle — small steps instead of leaps.

Further Reading

Response to BasicBookReader

How can e-readers be improved? This is a response to Austin G. Walters’s BasicBookReader project.

Features For Authors

Better feedback for authors. When seeking feedback on a draft of something, authors want analytics. Where does someone stop reading? If someone puts down your book at a certain page, that’s a page that ought to be rewritten. You’ve bored them enough that they’ve decided to stop reading.

After all, an author — at least of fiction — is aiming at creating an addicting product. You want to write something that people can’t put down, the sort of book that people stay awake late into the night reading. Notice that this is what popular websites (Facebook, Reddit) have in common: they’re addicting.

I expect, though, for an author to get really useful feedback, they’re going to need quite a few “test” readers. Otherwise, the signal will get drowned in noise — maybe someone places a book down because the phone rings, etc. (Heh, noisy data caused by literal noise.) But with lots and lots of test readers, the law of large numbers should kick in, and you’ll be able to objectively identify boring passages.
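With enough readers, the aggregation itself is simple. Here is a sketch of the stop-page detection described above; the flagging threshold is an arbitrary assumption, and a real system would tune it against the noise floor.

```python
# Sketch of stop-page detection: flag pages where an unusually large
# share of test readers gave up. The 5% default threshold is arbitrary.

from collections import Counter

def boring_pages(last_page_per_reader, threshold=0.05):
    """Return pages (sorted) where more than `threshold` of readers stopped.
    last_page_per_reader holds the last page each reader reached."""
    n = len(last_page_per_reader)
    stops = Counter(last_page_per_reader)
    return sorted(page for page, count in stops.items() if count / n > threshold)

# Six test readers; four of them put the book down at page 10.
print(boring_pages([10, 10, 10, 10, 45, 200], threshold=0.4))
```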

Extending online analytics

The current generation of cutting-edge analytics is being applied to the web. These could, in principle, be extended to books read on a screen. Take A/B testing, for instance. Maybe some readers get a version of the original book. Others get a version with the “stopping points” rewritten. Which version are people more likely to finish?
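Here is a minimal sketch of evaluating such a trial. The completion counts are invented, and the standard two-proportion z-test stands in for whatever analysis a real analytics pipeline would run:

```python
# Two-proportion z-test comparing completion rates of the original
# manuscript (A) against the revision (B). The counts are invented.
import math

def two_proportion_z(finished_a, n_a, finished_b, n_b):
    """z-statistic for H0: both versions have the same completion rate."""
    p_a, p_b = finished_a / n_a, finished_b / n_b
    p_pool = (finished_a + finished_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical trial: 120 of 1000 readers finished version A,
# 150 of 1000 finished version B.
z = two_proportion_z(120, 1000, 150, 1000)
print(round(z, 2))  # |z| > 1.96 is significant at the 5% level
```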

With enough such trials, a mediocre book can be pounded and shaped, like a blacksmith would iron, into something great.

I suspect there are other techniques that could similarly be lifted from website analytics.

Features for Readers

Export highlights as text

One of the more powerful features of Skim is the ability to export highlights from PDFs as text. I then save these annotations and can search through them via grep whenever I’m looking for a piece of evidence that I vaguely recall.



I’ve spoken with people who say that they’re able to read more if the page automatically scrolls — this forces them to maintain attention of the book in front of them. I’ve been unable to duplicate this success, but it seems plausible.


A useful feature in academic texts is the ability to save one’s current position in order to look up a citation or footnote. You might need to skip ahead 100 or so pages or whatever, so you want to be able to quickly jump back and forth.

Allowing a “split screen” where you can view two pages at once is similarly useful.

Analytics for readers

Are there any useful analytics for readers? It’s hard to think of any. You might correlate time of day with pages per minute, or something. These analytics could help identify when one’s at an intellectual peak. Words per minute might be similarly useful.

It’s hard to think of any other information someone might find useful. The Kindle has a feature that automatically highlights popular segments of a text purchased through them. It’s entertaining, but I’m not sure how useful.

Graphs of pages per day, total pages, and books read are, at the very least, inspiring.


One could build a business around taking ugly websites and making them beautiful, eh? This seems like a place where a newcomer could win over established players. When given text or an ePub, the reader could lay it out in some pleasing form.

The styling could be tweaked via some form of A/B testing, figuring out what’s most conducive to long reading sessions.

Saving book locations

If the app crashes or my computer does, I’d like to be able to restore exactly where I left off reading. It’s also useful to have a “farthest page read” marker in case something goes wrong. I find I’ll often press the wrong key on the Kindle and lose my place.

Fast dictionary lookup

A means to quickly figure out what a word means would be useful.

Table of contents sidebar

The largest advantage print books still have over electronic ones is that it’s easy to figure out the structure of a print book at any time. You keep one finger stuck in the page you’re on and flip to the table of contents, or to the next chapter.

As far as I know, no software has successfully replicated this yet. Skim contains a sidebar, like so:


Unfortunately, this feature breaks down when a PDF doesn’t ship with table of contents metadata.


Too Smart To Understand

Here is a meme I would very much like to see die forever. I’ll be reading book reviews and come across people gushing about how great the book was — and they know it’s great because they couldn’t understand any of it.

In “Greatly Exaggerated” he is so fucking smart that I couldn’t even read the essay, because I am not, and never will be, his intellectual equal.
<span id="quote-attribute">—from a review of <em>A Supposedly Fun Thing I’ll Never Do Again: Essays and Arguments</em></span>

So many 5-star reviews describe a book as incomprehensible.

I want to shake them. You’ve been tricked! Intelligence isn’t about hiding your ideas in an impenetrable shroud! It’s about laying insights bare so that anyone can understand them. Great writing is a combination of interdisciplinary mastery and clarity.

Where do people get this notion of intelligence-as-obfuscation from, anyways?