Prolonged Eye Contact and Attraction: What The Science Tells Us





Belladonna means “beautiful woman” in Italian, but it’s also the name of a type of plant. The origins of the term belladonna are uncertain, but date back to at least 1554.

It’s been suggested (and this is my favorite theory) that the name might be related to belladonna’s use as a cosmetic. Women would consume the plant in order to dilate their pupils, in an attempt to enhance beauty.

The only problem? Belladonna (sometimes called nightshade) is poisonous.

Richard Pultney’s 1757 paper, “A brief botanical and medical history of the Solanum Lethale, Bella-donna, or Deadly Nightshade,” recounts this tale:

Its relaxing quality is very surprising, as appears by that memorable case… of a lady’s applying a leaf of it to a little ulcer, suspected to be of the cancerous kind, a little below her eye, which rendered the pupil so paralytic, that it lost all its motion for some time afterward: and that this event was really owing to that application, appears from the experiment’s being repeated with the same effect three times.

But they were really onto something! This is the craziest part of the whole thing. (Suffering for fashion is passé.) Hess (1965) took two pictures of the same woman, presented them to male subjects, and asked them to describe the woman in the picture. The researchers altered the photos so that one had slightly larger pupils. By and large, the male subjects preferred the woman with the larger pupils.

Try it:


(The one on the bottom is the one that you’re supposed to find more attractive, although I’ve just terrifically biased you by telling you that.)

This has since been replicated at least five times.

Let’s just take a minute and reflect on this. Women in 16th century Italy anticipated the findings of modern scientific research by about 400 years. They not only discovered that belladonna reliably increases pupil size, but they also noticed that men were attracted to that.

I propose a hypothesis similar to the efficient markets hypothesis. We’ll call it the efficient beauty hypothesis: if a beauty-increasing cosmetic intervention exists, some enterprising individual somewhere will discover it.

You might wonder, then: are women interested in men with large pupils? Tombs and Silverman’s 2004 paper, “Pupillometry: A sexual selection approach” tried to answer this question. The paper includes this graph:

The relationship between prolonged eye contact and attraction.

You’ll notice that women find average pupil sizes (on men) the most attractive, while men subscribe to the Texan, bigger-is-better philosophy. The authors additionally report that, “Further investigation revealed that females attracted by large pupils also reported preferences for proverbial bad boys as dating partners.”

At this point, you might wonder why men find large pupils attractive. And, of course, evolution has good reason for that, as confirmed by a 2007 study:

We found an increase in mean pupil diameter for sexually significant stimuli during the fertile phase and this pupillary change was also specific to pictures of the participants’ actual sexual partners. Moreover, this effect was only seen for women who did not use oral contraceptives. These findings confirm that women’s attention for sexually significant stimuli is higher during their fertile phase of the menstrual cycle, and that changes in sexual interest are implicitly measurable using pupillometry.

Or, in plain English, fertile women tend to have larger pupils.


In Elana Clift’s Honors thesis, “Picking Up and Acting Out: Politics of Masculinity in the Seduction Community,” she argues that the “pick up artist” movement is the result of the lack of available dating scripts for young men. Back in, say, Victorian England, everyone knew how this whole relationship thing worked. Today, we’re all horribly confused.

I was sorta convinced by that for a while, and I think that explains some of it, but now I’m plagued by doubt. Lots of pick-up strikes me as actively toxic. I mean, yeah, especially to women (there are a disproportionate number of vocal misogynists associated with the “manosphere” generally), but I mean to men, too: Pick-up is an advertiser’s wet dream. Nothing sells better than insecurity, and what more poignant insecurity than masculine identity and status anxiety about attractiveness? (Whenever you hear the phrase “real” men, ask what they’re selling.)

Of course, my concerns here are hardly limited to men, although I’m more familiar with the struggles of young men everywhere. Cosmopolitan magazine is the female-equivalent of pick-up, telling young women that they need to fit into some sort of mold in order to attract a guy — that they shouldn’t answer the phone on the first ring or whatever — and I’m sure lots more nonsense which isn’t even on my radar, but probably ought to be.

Which brings me to the topic at hand: eye contact. These unsavory actors sell prolonged eye contact as some sort of panacea. An actual example I found with 10 seconds of googling: “Master These Eye Contact Techniques To Create Powerful Attraction,” complete with tips that the author promised “will blow my mind.” (Hint: they didn’t.) Another blog targeted at “Helping men reclaim their masculinity and their relationships,” (gag) includes this gem: “…strong eye contact is difficult to maintain if you do not have the confidence to back it up (thus making it an honest signal).”

Yeah, right. Because if you don’t maintain strong eye contact, it’s because you lack confidence, and definitely not because you haven’t yet mastered the serial killer’s thousand yard stare.

Frankly, this all smacks of the purest bullshit. Evolution has spent billions of years and computational cycles optimizing male-female relations. If maintaining eye contact with your crush is so effective, why don’t people just do it naturally? Could advising people to maintain strong eye contact be harmful? Maybe unnaturally strong eye contact just comes off as creepy.

I decided to find out.

The Evidence on Prolonged Eye Contact

An interlude during which the author does a lot of research.

My (somewhat begrudging) subjective feeling after reading through 5 or 6 relevant papers is that, yes, the pick-up artists are right, the majority of men ought to be making more eye contact. The case for women is less clear. As far as I can tell, too much eye contact is always better than too little, and eye contact combined with a smile is difficult to get wrong.

My neat evolution-has-optimized-eye-contact argument has at least one damning flaw: children learn the association between eye contact and liking. It’s not innate.

The association between gaze and liking appears to be learned. Children do not use eye contact to judge affiliation and friendship until about age 6 (Abramovitch & Daly, 1978; Post & Hetherington, 1974).

Now, is there such a thing as too much gaze? Yes. Moderate gaze is better than constant gaze:

Gaze also influences people’s liking for each other, with moderate amounts of gaze generally preferred over constant or no gaze (Argyle, Lefebvre, & Cook, 1974; Exline, 1971).

Bu-u-u-ut constant eye contact is still better than no eye contact:

British college students rated a same-sex peer they met in an experiment as more pleasant and less nervous when the person gazed at them continuously rather than not at all (Cook & Smith, 1975).

Compare that with a mock interview study, which had students either exhibit low, natural, or high gaze. Notably, researchers defined high gaze here as near-constant eye contact. They found no difference in likability between normal and high gaze:

High-levels of gaze do not differ from normative gaze patterns in earning more favorable endorsements for hiring from an interviewer, in conferring greater credibility, in increasing attraction and in receiving favorable relational communication interpretations.

Indeed, there were even some benefits to near-constant gaze. Interviewers labeled near-constant gazers (not to be confused with goats) as more attractive, more intimate, and more dominant than those who displayed normal levels of eye contact. So, again, more evidence that too much eye contact is way better than too little.

Those who make lots of eye contact are even judged to be more intelligent (!):

Wheeler, Baron, Michell, and Ginsburg (1979) reported a positive correlation between an interviewee’s eye contact with an interviewer and estimates made by observers of the interviewee’s intelligence.

And it’s not even confined to those you look at. If someone sees you making a lot of eye contact with someone, they’ll like you more than if you didn’t:

The positive feelings associated with gaze generalize to observers, who favor people when they gaze at moderate rather than low levels while approaching others (Gary, 1978a) or in social interactions (Abele, 1981; Shrout & Fiske, 1981).

Of course, people like it most of all when you look at them, which a 2005 study, “The look of love: gaze shifts and person perception,” verified.

Ratings of likability were elevated when social attention was directed toward rather than away from the raters.

In the same study, men rated women who paid attention to them not only as more likable, but more attractive, too:

Whereas gaze cues elevated ratings of likability among both male and female participants, only the men displayed gaze-related effects on person evaluation when the physical attractiveness of the targets was assessed.

Here’s another belief I held that turns out to be wrong. I’ve observed that people look at the speaker while listening, and look away while speaking. But this turns out to be totally okay to violate (surprise!) and you can stare all the time if you want (or, at least, high status people do it):

Equivalent amounts of gazing while speaking and listening were found with research participants who were given high status or who were discussing issues on which they had expertise (Ellyson, Dovidio, & Corson, 1981; Ellyson, Dovidio, Corson, & Vinicur, 1980).

And more eye contact makes you more powerful:

Dovidio and Ellyson (1982) reported that high gazing-while-speaking ratios were directly related to ratings of power in an interaction.

Want to make friends? Have you tried staring at people?

College women gazed more at a female confederate when they were trying to make friends (Pellegrini, Hicks, & Gordon, 1970), and college men gazed more at a woman when they wanted to interest her in a social conversation (Lefebvre, 1975).

It even holds for imaginary friends!

Mehrabian (1968a, 1968b) reported that research participants gazed more when they approached an imaginary person they liked rather than disliked.

And real ones, too:

Russo (1975) reported greater amounts of eye contact between elementary school children who were friends rather than nonfriends.

What does eye contact mean, though?

While doing keyword research for this, I noticed that a lot of men and women are confused about what prolonged eye contact means. Does it indicate sexual interest? Well, it definitely can!:

Participants in a study by Griffitt, May, and Veitch (1974) gazed more at opposite-sex peers when they had previously been exposed to sexually arousing slides.

It might even imply that you’re smokin’ hot (and trust me, gentle reader, you totally are):

Coutts and Schneider (1975) reported positive correlations between gaze directed by research participants toward opposite-sex peers and experimenter ratings of the peers’ physical attractiveness.

But not always. People will look at you more even if you’re just plain nice to them:

People gazed more after receiving positive evaluations (Coutts, Schneider, & Montgomery, 1980; Exline & Winters, 1965; Walsh et al., 1977) or warm nonverbal responses (Ho & Mitchell, 1982).

Is eye contact ever bad?

Even if you’re hitchhiking, more eye contact is better:

Drivers were more likely to stop for gazing hitchhikers (M. Snyder, Grether, & Keller, 1974), pedestrians were more likely to help a gazing experimenter pick up dropped coins (Valentine, 1980) and dropped questionnaires (Goldman & Fordyce, 1983), and bystanders were more likely to help an injured gazing jogger (Shetland & Johnson, 1978).

Or when you’re buying cereal, according to the 2014 study, “Why Is Cap’n Crunch Looking Down at My Child?”:

We showed that eye contact with cereal spokes-characters increased feelings of trust and connection to the brand, as well as choice of the brand over competitors

Now, you might wonder: are there ever times where you shouldn’t make so much eye contact? Well, when waiting for a green light:

Ellsworth et al., (1972) and Greenbaum and Rosenfeld (1978) had experimenters stand on street corners and gaze constantly or not at all at pedestrians and motorists who were waiting for a red light. When the light changed to green, pedestrians and drivers crossed the intersection significantly faster when they had received constant gaze from the experimenter.

But just dress nice and you’re okay:

For example, pedestrians did not cross the street as fast to escape a staring experimenter when the experimenter was dressed and made up to be physically attractive (Kmiecik, Mausar, & Banziger, 1979)

Or add a smile:

People were also less likely to avoid a staring experimenter when the experimenter smiled (Elman, Schulte, & Bukoff, 1977).

Sex Differences

It turns out, though, that there are sex differences. Women (on average) respond positively to lots of eye contact, while men prefer less. For instance, if you want a female friend to reveal all her secrets, eye contact is good:

Female speakers disclosed more personal information about themselves to listeners who gazed. Female speakers also liked gazing listeners more than nongazing listeners. (Ellsworth and Ross 1975)

For men, though, the opposite is true:

Male speakers, in contrast, disclosed more and felt greater liking when the listener did not gaze.

A similar phenomenon holds with asking for help when picking up coins:

For example, women gave more help in picking up dropped coins to a female experimenter who gazed at them (Valentine & Ehrlichman, 1979). Men gave more help to a male experimenter who did not gaze at them.

Women even like it when they’re told that a man looked at them an unusually high amount:

Kleinke et al. (1973) introduced college men and women in pairs and left them in a room to get acquainted. After their conversation, an experimenter told participants that one person (whose gaze was supposedly recorded through a one-way mirror) had gazed at the other person an unusually high, an average, or an unusually low amount of the time. Women were most favorable toward men whose gaze had ostensibly been high.

But not men:

Men’s reactions were exactly opposite. Men were most favorable toward women when they were told the woman’s gaze or their own gaze had been low.

I wonder if this is just male insecurity? If I was told some chick had been staring at me, I might wonder, “Is there something wrong with my hair? Has one of my legs grown two legs and walked off of its own volition?”

Does eye contact cause love?

To see is to devour.
—Victor Hugo, Les Misérables

Finally, though, what you really want to know: if I maintain eye contact with my crush, will they fall madly and deeply in love with me? Well, sorta. If you convince someone to maintain eye contact with you for ~2 minutes, they’ll (on average) be more attracted to you. The experimenters in this study told their subjects to maintain eye contact in order to “tune their extra-sensory abilities” and, afterwards, they rated their partners as significantly more attractive than controls. Hey, worth a shot, right?

Actually, it turns out, just tricking your crush into thinking they look at you a lot is enough. (“Hey, Maria, why do you keep looking at me? Is it because you’re in lo-o-o-ove with me?”)

In one of these, Kleinke, Bustos, Meeker, and Staneski (1973) did not actually induce their subjects to gaze at their partners. Instead the subjects were told that they had done so. This produced modest increases in attraction for the partner.

Further Reading

  • If you want to settle down with a book on relationships, the best scientific overview I’ve read is the Handbook of Relationship Initiation. For lighter fare, The Moral Animal is pretty entertaining.
  • If you liked this, you’ll love the Social Issues Research Centre’s “Guide to Flirting.”
  • If you want to dive into the original sources for yourself (or look up references), start with “Gaze and eye contact: a research review,” which is where the bulk of this information came from. (Where it didn’t, I’ve indicated in the text.)
  • One of the most useful bits of research to come out of the study of human relationships is the notion of the “mere exposure effect” which suggests that the more you see someone (or something), the more you’ll come to like them.

100+ Interesting Data Sets for Statistics



If we have data, let’s look at data. If all we have are opinions, let’s go with mine.
—Jim Barksdale

I’m not too fond of the phrase “information age.” It sounds like someone sat down and was like, “Hey, there’s a ton of information today… what should we call it? How about the information age?”

First of all, that’s just lazy and, second of all, it doesn’t capture how overwhelming it all is, the sort of angst and helplessness you feel when confronted with… everything. Just all of it.

A phrase that captures it a bit better is “drinking from the firehose.” I haven’t ever tried to drink from an actual firehose, but the metaphor certainly seems apt.

Maybe instead of information age, we could call it the saturation age, you know, because our brains are full to bursting. Or maybe just the overload age. Or how about the age of inundation?

One thing is certain, anyways. Some of us are drowning in data, most of us are oblivious, and some lucky few are surfing on it. We can do things that we couldn’t in the past (e.g., without Project Gutenberg, neither of my two analyses of the relationship between creativity and compression would have been possible).

And that got me wondering: just what other interesting data sets are out there? As part of my research, I decided to put together this sort of guided tour, a curated list if you will — adding a bit of structure to the firehose’s deluge.

Here’s my attempt at making it all just a bit more manageable.

Interesting Data Sets

  • If, tomorrow, you get an email congratulating you on your new status as future Jeopardy contestant, how are you going to prepare? Well, one approach might be to download this archive of 216,930 past Jeopardy questions and plug them into your favorite spaced repetition system. Combine that with reading up on Jeopardy betting strategies, and you’re well on your way to becoming the next Arthur Chu (except hopefully nicer).

  • Ever get a morbid curiosity about what it’s like to be on death row? (Yeah, me neither.) But in case you ever have, Texas has graciously placed the last words of every inmate executed since 1984 online. So… sentiment analysis, anyone? (“How upbeat are death row inmates days before execution? With a little help from some data, we found out!”)

  • Speaking of prison, there’s more data on prisoners, including information about “their current offense and sentence, criminal history, family background and personal characteristics, prior drug and alcohol use and treatment programs, gun possession and use, and prison activities, programs, and services” available here.

  • How about reading other people’s emails? Ever wanted to do that, but can’t be bothered to train l33t hacking skills (and never mind the legality of it)? (Okay, this one I have thought about.) Well, I’ve got you covered. Check out the Enron corpus. It contains more than half a million emails from about 150 users, mostly senior management of Enron, organized into folders. Wikipedia calls it “unique in that it is one of the only publicly available mass collections of ‘real’ emails easily available for study.” Business idea: figure out what sort of information gets leaked in the emails that will later harm the execs at trial or whatever, then build a software system to automatically mine those out of real email. Either sell it to law enforcement or to corporate executives as the finest cover-your-ass email system.

  • Wondering what the internet really cares about? Well, I don’t know about that, but you could answer an easier question: What does Reddit care about? Someone has scraped the top 2.5 million Reddit posts and then placed them on GitHub. Now you can figure out (with data!) just how much Redditors love cats. Or how about a data backed equivalent of r/circlejerk? (The original use case was determining what domains are the most popular.)

  • Speaking of cats, here are 10,000 annotated images of cats. This ought to come in handy whenever I get around to training a robot to exterminate all non-cat lifeforms. (Or, if you’re Google, you could just train a cat recognition algorithm and then send those users cat-specific advertising.)


  • If you’re interested in building financial algorithms or, really, just predicting arbitrage opportunities for one of America’s largest cash crops, check out this data set, which tracks the price of marijuana from September 2nd, 2010 until about the present.

  • Who’s using what drugs and how often?

  • The earliest recorded chess match dates back to the 10th century, played between a historian from Baghdad and a student. Since then, it’s become a tradition for moves to be recorded – especially if a game has some significance, like a showdown between two strong players. As a consequence, today, students of the game benefit from one of the richest data sets of any game or sport. Perhaps the best freely available data set of games is known as the “Million Base,” boasting some 2.2 million matches. You can download it here. I can imagine an app that calculates your chess fingerprint, letting you know what grandmaster your play is most similar to, or an analysis of how play style has changed over time.

  • On the topic of games, for soccer fans, I recently came across this freely available data set of soccer games, players, teams, goals, and more. If that’s not enough, you can grab even more data via this Soccermetrics API python wrapper. I imagine that this could come in handy for coaches attempting to get an edge over opponent teams and, more generally, for that cross-section between geeks and gamblers attempting to build analytic models to make better bets.

  • Google has made all of its Google Books n-gram data freely available. An n-gram is a phrase of n consecutive words, and the data set includes 1-grams through 5-grams. The data set is “based originally on 5.2 million books published between 1500 and 2008.” I can imagine using it to determine the most overused, cliche phrases, and those phrases that are in danger of becoming cliched. (Quick! Someone register the domain!)

  • Amazon has a number of freely available data sets (although I think you need to run your analysis on top of their cloud, AWS), including more than 2.8 billion webpages courtesy of Common Crawl. The possibilities are endless, but an old business idea I had: analyze the Common Crawl data and determine cheap or not-currently-registered domains which are, for whatever reason, linked to by many websites. Buy these up and then resell them to people involved in SEO. (Or you could, you know, try to build the next Google.)

  • How well do minorities do on the computer science advanced placement exam? You can find out and tell me.

  • There’s the Million Song data set, which contains information about a million different songs, including a metric “danceability.” Might be nice to pair that with a media player specialized for parties — start with “conversation” music, and slowly shift to more danceable stuff as the night drags on. The data could also be used for a clustering algorithm (automatic genre detection, maybe), but I’m not sure how useful that’d be. A number of people have tried to build recommendation algorithms based on the data, including Kagglers and a team from Cornell. One possible use: analyzing music by year — How danceable, fast, etc. were the 70s? 80s? 90s? (Or how about looking for a follow-the-leader effect. If one song goes viral with a unique style, do a bunch of copycats follow?)

  • Speaking of music data sets, has music data available. Collected from ~360,000 users, it’s in the form of “user, artists, ## of plays”. This would be good for clustering algorithms that automatically determine label genre or recommender systems. (Even a “this artist is most similar to” thing would be sorta cool.)

  • When I think geeks, I think math and computer geeks, but there are many more. Terry Pratchett geeks (dated one!), Whovians, anime geeks, theater geeks and, with some relevance to this next data set, comic book geeks. Cesc Rosselló, Ricardo Alberich, and Joe Miro have put together a “social graph” of the Marvel Universe, and the data is freely available. Ideas for use: Maybe it could be overlaid on Facebook’s social graph to produce a new take on the “What superhero are you?” quiz.

  • Yelp has a freely available subset of their data, including restaurant rankings and reviews. One business idea: use tweets to predict restaurant star ratings. This would enable you to build out a Yelp competitor without requiring an active user base — you could just mine Twitter for data!

  • If you’re interested in data about data (metadata!), Jürgen Schwärzler, a statistician from Google’s public data team, has put together a list of the most frequently searched for data. The top 5 are school comparisons, unemployment, population, sales tax, and salaries. I was surprised that school comparisons were number 1 but, then again, I don’t have any brats running around (yet?). This list would be a good first step in researching what sort of data comparisons people actually care about.

  • Some of my readers are, no doubt, evil geniuses. Others want to save the world. There’s a subset of both of these groups who are interested in superintelligent robots. But to build such a robot, you’re going to have to teach it facts. All the things we take for granted, like that every person has one father. It would be a pain to insert those 10 million facts by hand (and, at a fact a minute, take more than 19 years). Thankfully, Freebase has done part of the job for you, making more than 1.9 billion facts freely available.

  • Maybe your plans are slightly less ambitious. You don’t want to build a superintelligent machine, just one smarter than your run of the mill mathematician. If that’s the case, you’re going to need to teach your machine a lot about mathematics, probably in the form of proofs and theorems. In that case, check out the Mizar project, which has formalized more than 9400 definitions and 49000 theorems.

  • And let’s say you build this mathematician and, sure, it can help you with proofs, but so what? You long for someone you can connect with on a deeper level. Someone who can summarize any topic imaginable. In that case, you might want to feed your robot on Wikipedia data. While all of Wikipedia is freely available, DBpedia is an attempt to synthesize it into a more structured format.

  • Now, you get tired of mathematics and Wikipedia. It turns out that proofs don’t pay the bills, so instead you decide to become a software engineer. Somehow, though, you’ve managed to build these machines without even a rudimentary understanding of programming, and you want a machine that will teach it to you. But where to find the data for such a thing? You might start with downloading all 7.3 million StackOverflow questions. (Actually, all the StackExchange data is freely available, so you could feed it more math information from both MathOverflow and the other math stackexchange. Plus statistics from Cross Validated, and so on.)

  • Ever wanted to study true friendship? (C’mon! Free your inner <s>child</s> social scientist.) Y’know, genuine, platonic love, like the kind embodied by dolphins? Well, now you can! All thanks to your humble author and Mark Newman, who’s placed a network of “frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand.” Business idea: Flippr. It’s like Facebook, but for dolphins, with plans to expand into emerging whale and sea turtle markets. Most revenue will come from sardine sales.

  • Do left-leaning blogs more often link to other left-leaning blogs than right-leaning ones? Well, I don’t know, but it sounds reasonable. And, thanks to permission from Lada Adamic, you can download her network of hyperlinks between weblogs on US politics, recorded in 2005. (Or you could just read her paper. Spoilers: conservatives more freely link to other conservatives than liberals link to liberals so, if you’re interested in link building, maybe you should register Republican.) [1]

  • Who’s friendlier: the average jazz musician or the average dolphin? You could find out by combining the dolphin data set mentioned earlier with Pablo M. Gleiser and Leon Danon’s jazz musicians network data set.

  • What about 1930s southern women or prisoners? Who’s friendlier? How about fraternity members or HAM radio operators? All this and more can be figured out with these network data sets.

  • How about dolphins or Slashdotters?

  • Web 2.0 websites (like Reddit) are sometimes gamed by “voting rings,” which are groups of people that intentionally vote up each other’s content, regardless of quality. I’ve often wondered if the same thing happens in academic circles. Like, you know, one night during your first year in grad school, you’re kidnapped in the middle of the night and made to swear a blood oath that you’ll cite every other member of the club. Or something. Well, Stanford has put online Arxiv’s High Energy Physics paper citation network, so you could find out.

  • You read this blog, so you’re pretty smart, right? And maybe you’d like to be rich, you know, so you can found the next Bill and Melinda Gates Foundation and save the world. (Because that’s why you want to be rich, right?) Well, then maybe you ought to develop some new-fangled trading algorithm and pick up like a trillion pennies from in front of the metaphorical steam-roller that is the market. (Quantitative finance!) But, in such a case, you’d better at least test your strategy on historical market data. Market data which you can get here.

  • The Open Product Data website aims to make barcode data available for every brand for free. Business idea: a specialty tattoo parlor that only does barcode tattoos, but lets customers pick whatever product they want. Think about it: “What’s your tattoo mean?” “It’s a Twinkie barcode, because Twinkies last forever, man, just like my faith.”

  • The European Centre for Medium-Range Weather Forecasts has an impressive-looking collection of weather data. Why, you ask, does the weather matter? The economic incentives for predicting the weather are absurd. When should you plant crops? Plan a big event? Launch a space shuttle? Go deep sea fishing? But I want to talk about the most fun application of weather data I’m aware of: The financial industry. I have a lot of respect for finance, mostly because of the crazy stuff they do. The only practical application of neutrinos I’ve heard of, for instance, is “because finance.” Should your algorithm buy Indonesian sesame seed futures? With weather data, it might know.

  • If you need nutrition data about food, the USDA has you covered. Business idea: A phone application called “Am I allergic to that?” Then, lobby for your state to pass some law requiring each school to buy a license of it for every student.

  • For a wordsmith, a good dictionary is indispensable, and when it comes to word data, you could do a lot worse than check out the freely available WordNet. WordNet has significant advantages over your run of the mill dictionary as it focuses on the structure of language, grouping words into “sets of cognitive synonyms (synsets), each expressing a distinct concept.” It also has some information about relationships, such as “a chair has legs.”

  • We’ve already established that some of you are evil geniuses, in which case, where are you going to build your secret lair? I mean, a volcano is pretty cool, but is it evil and genius enough for competing in today’s modern world? You know what the other evil geniuses don’t have? A secret base on a planet outside of the solar system. With NASA’s list, you can get busy commissioning someone to build you a base on KOI-3284.01.[2]

  • The Federal Railroad administration keeps a list of “railroad safety information including accidents and incidents, inventory and highway-rail crossing data.” Someone (like the NY Times) could overlay this on a map of the United States and figure out if people in poor regions are more likely to be hit by trains or something.

  • If you need a database of comprehensive book data, perhaps to build a competitor to Goodreads or an online digital library, the Open Library allows people to freely download their entire database.

  • Who is the United States killing with drones? If you’re content with Pakistan specific data, there is a list of drone strikes available here.

  • If you’re interested in building a Papers2 competitor with support for automatically importing citation data (please do this), CrossRef metadata search might be a good place to check out.

  • Mnemosyne is a virtual flash card program that takes advantage of spaced repetition to maximize learning. (As you might recall, I’m a big fan of spaced repetition.) The project has been collecting user data for years, and gwern has graciously agreed to freely host the data for a few months. Perhaps one could run some sort of unsupervised learning algorithm over it and try to discover heretofore unknown information about human memory.

  • How much would it cost to hire Justin Bieber to play at your wedding? The fine lads at Priceconomics have figured out how much it would cost to hire your favorite band. You could take this data and calculate some sort of popularity to price ratio — What’s the most fame for your buck?

  • I’ve mentioned in a few of the other data sets just how lucrative it is to be able to predict the stock market better than everyone else. In 2011, researchers found that they could use data from Twitter to do just that: they went through tweets, found ones related to publicly traded companies, and then calculated a mood score. With this, they write, “We find an accuracy of 86.7% in predicting the daily up and down changes in the closing values of the DJIA.” A number of Twitter data sets are freely available here.

  • A 2014 paper by Clifford Winston and Fred Mannering reports that vehicle traffic costs the United States 100 billion dollars each year.[3] There’s money to be made, then, in routing traffic more efficiently. One way to do this would be to feed an algorithm historical traffic data and then use that to predict hotspots, which you would route people around. Lots of that data is available on

  • On the other hand, if you were building an app to track current traffic data, you’ll need a different data source.

  • If you want to launch a spam-fighting service, or maybe just analyze what type of emails spammers are sending, you’ll need data. UC Irvine has you covered.

  • But maybe you want to extend your spam-fighting service to text messages. Still got you covered.

  • There is a wealth of data sets available for R and all you have to do is install a package. Ecdat is one of those packages, containing gobs of econometric data. How about an analysis of how math levels correlate with number of cigarettes smoked? I’d read that.

  • Ever wondered about how one person will be on the board of directors of several companies and it’s like, hey, maybe Condoleezza Rice with her ties to government surveillance isn’t the best choice for Dropbox? What if you could analyze those connections? Well, with this data set, you can. But only for Norway: it’s a network of the board members of public companies in Norway.

  • Ever seen a TV show where a government determines that someone is a terrorist based on their social ties? I always figured that data would be locked down tight somewhere, y’know, classified. But it turns out it isn’t. You, too, can analyze the social networks of terrorists.

  • There’s been a fair bit of controversy around all the bureaucracy of Wikipedia. But how does one become a bona fide Wikipedia big shot? Who’s the ideal Wikipedia administrator? Well, they’re voted for, and the data is available for download.

  • Harvard has opened up its set of “over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.”

  • If you need small data sets for students, check out DASL. One at random: does sterilizing dominant males in a wild mustang population reduce the population?

  • GET-Evidence has put up public genomes for download. I think Steven Pinker’s data is in there somewhere. Maybe you could make yourself a clone?

  • Oh, and speaking of genomes, the 1000 Genomes project has made ~260 terabytes of genome data downloadable.

  • The smallest data set on this list: the survival rates of men and women on the Titanic. Female passengers were ~4 times more likely to survive than male passengers.

  • Want a super-specific breakdown of the contents of your food? You’re in luck. (Thanks Canada!)

  • There’s a similar database of the metabolites in the human body. I’m not sure what you could do with it, but it might come in handy in some sort of dystopian future where humans are raised like cattle for their nutrients. (Maybe someone could use this to build a viral marketing campaign along the lines of, “How nutritious is your mom?”)

  • The Reference Energy Disaggregation Data Set has about 500 GB of compressed data on home energy use. Obvious use candidates: improving home efficiency or creating a visualization of just where people’s energy bills are going.

  • Did you know that you can download all the PDFs on Arxiv? Once we manage to teach machines natural language, we can just have a computer read it all and give us the cliff notes (and the scientific breakthroughs).

  • If you need economic census data on any industry, check out’s industry statistics portal. If finance is really evil, you ought to be able to find something damning in the data.

  • For those unfamiliar with Usenet, it’s sort of like a huge, text-only forum. It was much more popular before the rise of the world wide web. Anyways, you can download a huge data set of postings to Usenet here. It might be pretty good for some kind of textual analysis project or training a machine learning algorithm (maybe a spellchecker?) You could use the data to build out a Google Groups competitor, too.

  • Nick Bostrom has a very interesting paper called “Existential Risk Prevention as Global Priority.” The basic intuition is that preventing even small risks of human extinction is worthwhile if we consider all the human generations it would save. One way to start saving all those future lives might be by digging into this data set of every recorded meteor impact on Earth from 2500 BCE to 2012.

  • How do gender and mental illness affect crime? This data set was collected explicitly with that question in mind.

  • Speaking of mental health, if you’re interested in how it affects minorities specifically, try this.

  • There are a lot of lonely men and women out there, and some of those lonely men and women have excellent analytical skills. For those lonely people, I suggest using this data set, which “surveyed how Americans met their spouses and romantic partners, and compared traditional to non-traditional couples” to determine the best way to meet that special someone.


  • Tons of data on what is called “adolescent health” is available here, though it actually covers more than that, including a bunch of relationship data and biomarkers. (Not creatine levels, unfortunately.)

  • Here’s a question: Are modern jobs worse than those of the past? My grandparents built tires at Firestone. Today, people rarely have that level of control and visceral experience of the finished product of their work. This set of five surveys regarding how different groups experience employment could answer that question. I can see the article now — “Is everything getting slightly worse? We found out.”

  • Stanford has 35 million Amazon reviews available for download. Lots of stuff you could do with this: use it to improve recommendation algorithms, or figure out whether or not there’s a follow-the-leader effect with reviews (i.e., do early positive reviews beget more positive reviews?).

  • Based on some of my research prior to writing this, the Google keyword “data sets on serial killers” is 1) really specific and 2) weirdly popular, but I guess there’s no accounting for taste. And, of course, we’ve got data for that, thanks to the Serial Killer Information Center.[4]

  • In this gruesome vein, the University of Maryland has a “Global Terrorism Database,” which is a set of more than 113,000 terror incidents. You can download it after filling out a form. Ideas for use: visualization of terror incidents by location over time, predicting and preventing terror attacks, and creating early alert systems for vulnerable areas.

  • The MNIST Database is a classic in the field of machine learning. It’s a set of labeled handwritten digits, the kind of data needed to train OCR algorithms. Today, some algorithms are actually more accurate than human judges! This would have been nice to have back when I was in grade school. I distinctly recall once arguing with a teacher over missing a question because she insisted that I had written the letter j when it was clearly a d. In the future, we’ll let the machines decide.

  • UCI has a poker hand data set available. My poker-fu is fairly weak, but I’m sure there’s some interesting analysis to be done there. I’ve heard second hand that humans still maintain some advantage over machines when it comes to poker, but I’m unable to verify that via Google. Machines have won in at least one tournament.

  • Another data set from UCI: images labeled as either advertisements or non-advertisements. This is good for building classification algorithms that decide whether or not a new image is an ad, which might be good for, say, automatic ad blocking or spam detection. Or maybe a Google Glass application that filters out real life advertisements. That’d be cool. Look at a billboard and instead see a virtual extension of the natural landscape.

  • Remember the whole Star Wars Kid debacle? Wikipedia informs me that Attack of the Show rated it the number 1 viral video of all time. Andy Baio, one of the guys who was in on it before it was cool and who coined the phrase “Star Wars Kid,” has made his server logs from the time publicly available. Someone could take this data and produce a visualization of who saw it when via maps, along with annotations of where the traffic was coming from.

  • Who’s linking to who (and what) on WordPress? (Tidbit: most of the links to this site come from WordPress blogs.) With this WordPress crawl, you can find out. Visualizing the network might be sorta cool, but it’d be cooler still to uncover some information about “supernodes” that either are linked to often or put out a lot of links (or maybe both). Or maybe clustering people by interest.

  • Is Obama in bed with big oil? Or extremist environmentalists? Or the corn lobbies? And who was backing that Herman Cain dude, anyways? The 2012 Presidential Campaign Finance data is available for download. It would be neat to see an analysis of what industries prefer what candidates.

  • Which private colleges are the best value?

  • Which public colleges are the best value?

  • Cigarette data by state. Kentucky smokes the most, with West Virginia as a close second. Given the massive social harm of tobacco, a good analysis could very well save a lot of lives.

  • On December 5th of 2008, what was being downloaded from The Pirate Bay?

Further Reading


    1. With apologies to JFK: “Let us seek not the Democrat link or the Republican link, but the right link.”

    2. Wikipedia says: “KOI-3284.01 is believed to be the most Earth like exoplanet to be found so far by the Kepler space probe. It is predicted to have a radius 1.5 times that of Earth’s. It is predicted to be located at the proper distance from the sun to sustain liquid water.”

    3. “The Texas Transportation Institute’s latest Urban Mobility Report puts the annual cost of congestion to the nation, including both travel delays and expenditures on fuel, at more than $100 billion.”

    4. If that’s not enough, there seems to be a fair amount of research around “murder topology” which is not, as you might naively expect, a super badass branch of mathematics, but rather concerned with the movement patterns of serial killers.

Surprisingly Dangerous Jobs In America

You can’t avoid danger.
—Jeannette Walls, Half Broke Horses

Yeah, you can. Don’t get one of these jobs, for instance.

David Henderson has rightly earned the title contrarian with his latest post which, to kick off National Police Week, points out that it’s more dangerous to be a farmer than a policeman — “For every 100,000 police, the annual fatality rate is 20. For every 100,000 farmers, it is 40% higher, at 28.” (Source.)

Now, on this blog, we’re good empiricists, and nothing warms the heart of an empiricist more than refuting a well-known, common sense “truth” with, you know, observations and data.

So that got me thinking: What jobs are more (or less) dangerous than one might naively suspect?

I present to you this delightful graph, taken from the Bureau of Labor Statistics:

Notice that the data here for farmers agrees with David’s. He has 28 per 100,000 versus the chart’s 25.3 per 100,000. (And given the endemic underreporting, the 28 number might well be more accurate.) Police officer isn’t included on the chart, but David’s data would make it about as dangerous as… taxi driver. That’s right, folks. The brave folks keeping the peace of our nation? Just as brave as your local cabby. (Actually, given that police deaths dropped 20 percent in 2012, cab drivers might be braver.)

I’m going to propose we replace Police Week with Fisherman Week, because it’s about 6 times more dangerous to be a fisherman than a police officer. (And who doesn’t love a good tuna steak?)

Or maybe we should keep Police Week, but dedicate 6 weeks to celebrating fishermen. It’s only fair. And, of course, three weeks to pilots and people involved with flying, along with a solid two weeks for garbage men.

Some other fun facts

Digging a bit further into the data, we find this somewhat troubling statistic: 92% of workplace fatalities are men. (Do we blame the patriarchy for this one?)

And if you were wondering what state is the most dangerous: North Dakota. It’s about as dangerous to work in North Dakota as it is to be a police officer. From the AFL-CIO’s “Death on the Job” report:

Among all of the states, North Dakota stands out as an exceptionally dangerous and deadly place to work. The state’s job fatality rate of 17.7/100,000 workers is alarming. It is more than five times the national average and is one of the highest state job fatality rates ever reported for any state.

So probably we should have a week celebrating North Dakotans, too.
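All of the comparisons in this post are plain ratios of fatality rates per 100,000 workers. Here is a quick sketch that reproduces them from the figures quoted above. The variable names and the `relative_risk` helper are mine, and the national average is not quoted directly, so the last line is only the upper bound implied by “more than five times.”

```python
# Annual fatality rates per 100,000 workers, as quoted in this post.
police = 20.0          # David Henderson's figure
farmers = 28.0         # David Henderson's figure
north_dakota = 17.7    # AFL-CIO "Death on the Job" report

def relative_risk(rate, baseline):
    """How many times as dangerous `rate` is compared with `baseline`."""
    return rate / baseline

farmer_vs_police = relative_risk(farmers, police)
nd_vs_police = relative_risk(north_dakota, police)

# "More than five times the national average" caps that average from above.
national_average_upper_bound = north_dakota / 5

print(f"Farmers vs. police: {farmer_vs_police:.2f}x")
print(f"North Dakota vs. police: {nd_vs_police:.2f}x")
print(f"Implied national average: below {national_average_upper_bound:.2f} per 100,000")
```

The farmer ratio comes out to 1.4, which is the “40% higher” in David’s post, and North Dakota lands a little under the police rate.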

Further Reading

Web Roundup: More Links For May

Curiosity is only vanity. Most frequently we wish to know but to talk.
—Blaise Pascal, Pensées

The Unreasonable Effectiveness of Checklists

Dr. Peter Pronovost had a problem. People were dying and, to borrow a line from Fight Club, not in the Sylvia Plath, Tibetan Buddhist, we’re-all-dying-so-get-used-to-it sense of the word.

No, this is the hospital kind of death we’re talking about. I mean death in all of its macabre horror. You know, the horror we cover up with euphemisms like “passing away” and pretend that white sheets and a sterile environment somehow make the notion of oblivion no longer panic-inducing. That kind of death.

And not the inevitable sort. Not of the “his body just gave out” or “there was nothing we could do” kind of death, although I’m sure plenty of attendings fell back on that convenient cliche. No, I mean preventable death. Death of the there-but-for-the-grace-of-unwashed-hands-now-I’m-dead kind. I mean the kind where you’re in the hospital for a routine procedure and some dumbass with 15 years of schooling forgets to wear gloves so now you’re profoundly, absolutely dead. That kind. The sort of death where if the average doctor had one more percentile of conscientiousness you wouldn’t be dead because he wouldn’t have killed you.

The sort of deaths that define why hospitals are a dangerous place.

That sort of death was Dr. Pronovost’s problem. Mistakes were killing people at his hospital. Not some podunk care center, either, but critical care at Johns Hopkins.

So he did the obvious, boring thing. He implemented a checklist for one basic-but-still-error-prone-and-infectious procedure, inserting a central venous catheter, and everyone had to follow it. And this checklist of his wasn’t complicated. These weren’t instructions where, in order to understand them, you need to rack up the equivalent of the GDP of a small nation state in medical school debt. There were five whole things, and they boil down to two: clean yourself and the patient, wear a mask and gloves. Not super tricky, only-clever-people-know steps.

These were the hospital equivalent of brushing your teeth before bed and wearing deodorant. The absolute basics. Stuff everyone is supposed to do, but sometimes people forget. Except when you forget to wear deodorant at a hospital, it’s a lot worse than spending a day fretting over whether or not your crush has discovered that your natural smell is not coffee-cinnamon-woodland, but something decidedly funky. When you forget at a hospital, someone catches Legionnaires’ disease and dies forever.

And maybe you’re skeptical: “A checklist for five things? I can remember five things no problem. How many mistakes could doctors possibly be making?” (And that’s how I know you’re not a programmer.) But you wouldn’t be alone. Dr. Gawande, a surgeon at Brigham and Women’s Hospital in Boston, told the New York Times, “It seemed silly to make a checklist for something so obvious.”

Except, you know, this stupid checklist of five whole things totally worked. After a year, the rate of infection on this specific procedure dropped from 11 percent to zero. By two years, it had saved the hospital 2 million dollars, prevented 8 unnecessary deaths, and avoided 43 infections. Consequently, the hospital implemented still more checklists – reducing the average ICU stay by half and saving 21 lives.

A 2009 study duplicated this success in 8 other hospitals: “its use improved compliance with standards of care by 65% and reduced the death rate following surgery by nearly 50%.”

Checklists are awesome.

How awesome are they? A brief review

The bulk of the evidence for the effectiveness of checklists comes from medicine and is relatively recent. While other disciplines, such as aviation and engineering, have long used checklists, they haven’t bothered to actually vet that they work. A 2002 study puts it this way:

Aviation safety … was not built on evidence that certain practices reduced the frequency of crashes (but) relied on the widespread implementation of hundreds of small changes in procedures, equipment and organization (to produce) an incredibly strong safety culture and amazingly effective practices. These changes made sense; were usually based on sound principles, technical theory or experience; and addressed real-life problems, but few were subjected to controlled experiments

This is less surprising when we consider how long ago the pre-flight checklist was introduced: it has been a constant in the airline industry since 1937.

Even newer and emerging disciplines, like software engineering and quality assurance, have done little to empirically verify the effectiveness of checklists. A 2007 paper, “Best Practices in Code Inspection for Safety-Critical Software,” is a typical example. Though it focuses solely on the use of checklists to improve software quality, it presents no evidence on the actual effectiveness of checklists. Similarly, a 1999 review further calls using checklists to inspect software a “best practice,” but again assumes their efficacy.

Reassuringly, though, the evidence from medicine is near overwhelming. Checklists have been found effective in scenarios as diverse as oxytocin administration to pregnant mothers (which decreased the rate of cesarean delivery by a quarter), actually giving patients medication, measuring lipid levels, screening stroke patients, and improving the reporting quality of randomized controlled trials.

This and more is captured in a 2012 review and meta-analysis, which finds that checklists cut the risk of mortality by more than 40 percent:

This review shows that with the use of the checklist the relative risk for mortality is 0.57 and for any complications 0.63.
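A relative risk is the risk with the checklist divided by the risk without it, so the reduction is one minus that number. A small sketch of the arithmetic, using only the two figures from the quote (the function name is mine):

```python
def relative_risk_reduction(relative_risk):
    """Fraction of risk removed, given relative risk (with / without checklist)."""
    return 1.0 - relative_risk

mortality_rr = 0.57      # from the review quoted above
complications_rr = 0.63  # from the review quoted above

mortality_reduction = relative_risk_reduction(mortality_rr)
complications_reduction = relative_risk_reduction(complications_rr)

print(f"Mortality: {mortality_reduction:.0%} lower risk")
print(f"Complications: {complications_reduction:.0%} lower risk")
```

So a relative risk of 0.57 means about 43 percent fewer deaths, and 0.63 means about 37 percent fewer complications.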

It should come as no surprise, then, that checklists are cost-effective, with the ability to save hospitals anywhere between $103,829 and $2,671,253.

If you’re still not convinced, not only do checklists save lives and money, but they may also improve process efficiency and productivity:

Use of a “preflight checklist” in Kaiser Permanente Southern California’s operating rooms resulted in improved nurse retention as turnover decreased from 23% to 7%. Also, after implementation of Kaiser Permanente’s checklist, there was a decrease in the number of operative cases that were canceled or delayed.

So, if the question is “How awesome are checklists?” I’d say: pretty awesome.

Further Reading

Web Roundup: Links for May

Why Category Theory Matters

I hope most mathematicians continue to fear and despise category theory, so I can continue to maintain a certain advantage over them.
—John Baez


The above is a graph of the number of times the phrase “category theory” has been used in books, from about 1950 through the present. It speaks for itself.

But why? What’s the big deal? Why does category theory matter? I’m about a quarter of the way through Conceptual Mathematics: A First Introduction to Categories and still not sure why I’m bothering with fleshing out all this theory. Is this just set theory for hipsters?

What category theory is about

Category theory is, essentially, the study of mathematical structure. It’s the study of things and the mappings between those things, the translations of these objects. These are usually called objects and morphisms (or arrows, if you prefer). Objects can be thought of as sets and arrows as functions, though they are not limited to this interpretation.


The subject’s major insight is, in order to understand something, focus on the structure preserving mappings of that something — the legal translations.
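If that sounds too airy, here is a minimal sketch in Python, treating Python types as objects and ordinary functions as arrows. The names `compose`, `identity`, and the sample arrows are mine, purely for illustration, and checking a few sample inputs is a sanity check, not a proof.

```python
def compose(g, f):
    """Return the composite arrow g after f: apply f first, then g."""
    return lambda x: g(f(x))

def identity(x):
    """The identity arrow, which leaves its input untouched."""
    return x

# Three arrows between plain Python types (the objects here).
f = len                  # str -> int
g = lambda n: n * 2      # int -> int
h = str                  # int -> str

samples = ["", "cat", "category"]

# Category law 1: composition is associative.
associative = all(
    compose(h, compose(g, f))(s) == compose(compose(h, g), f)(s) for s in samples
)

# Category law 2: composing with the identity changes nothing.
unital = all(
    compose(f, identity)(s) == f(s) == compose(identity, f)(s) for s in samples
)

# A structure-preserving map: len sends string concatenation to integer
# addition, and the empty string to 0.
preserves_structure = len("") == 0 and all(
    len(a + b) == len(a) + len(b) for a in samples for b in samples
)

print(associative, unital, preserves_structure)
```

The last check is the “legal translation” idea in miniature: `len` forgets what the letters are but keeps how strings combine.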

What the excitement is about

The vast applicability and expressiveness of category theory leads to the observation that most structures in mathematics are best understood from a category theoretic or higher category theoretic viewpoint.

Category theory is one of, if not the most, abstract fields of mathematics. It’s even been dubbed, as one might tease a younger sibling, “abstract nonsense.” After all, the field throws out all the specific properties of objects and instead focuses solely on their translations.

This extreme generality of category theory means that it can say something about anything, but nothing too specific. In other words, part of the growth of category theory is probably because you can use it to talk about damn near anything. (See the applications below for examples.)

In this respect, category theory is like set theory. The popularity of set theory is a result of the fact that, hey, it’s a pretty good language for talking about a lot of different types of mathematics. Most things can be formalized as a combination of sets and first order logic, and it’s not that unnatural to think in terms of sets so, bam, popularity.

In Categories for Software Engineering, the authors put it this way: “The way we like to present category theory is as a toolbox similar to set theory: as a kind of mathematical lingua franca in the sense that it can be used for formalizing concepts that arise in our day-to-day activity.”

This generality mirrors the difference between strong and weak methods in artificial intelligence. General methods, while widely applicable, don’t typically scale up to hard problems. Instead, specialized tools are necessary. In the same way, category theory is more of a tool for elucidating connections between mathematical structures than for solving problems — in contrast with something like linear algebra, or really any field of applied math.

Benefits of category theory over set theory

God made the integers, all the rest is the work of man.
—Leopold Kronecker

To be honest, I don’t like set theory. It’s artificial — the axioms aren’t obviously true, but rather the product of a search for a paradox-free foundation for mathematics. It’s sort of complicated, maybe not at the lowest levels, but definitely once you try to build up something like the real numbers. (A 2008 issue of the AMS reports, “…to expand the definition of the number 1 fully in terms of Bourbaki primitives requires over 4 trillion symbols.”)

The whole enterprise is bizarre. As humans, we didn’t start out with sets and then build out mathematics. No, the Egyptians did arithmetic and some algebra. (The oldest extant mathematical records deal with the Pythagorean theorem.) Animals have some notion of magnitude and many can even count. Set theory, rather than a natural extension of mathematical enterprise, seems more like something forced — the difference between English and Esperanto.

As far as I can tell, the mathematical community agrees with me. Paul Cohen is the only person to ever win a Fields Medal for work on foundations, and today “foundations of mathematics” is code not for mathematics, but philosophy.

So, immediately, category theory has an advantage over set theory in that it’s a less artificial construction, given that it stems directly from work in algebraic topology.

But, beyond that, is there anything else exciting about category theory? The main draw is its ability to connect otherwise disparate fields, a sort of skeleton to hang other knowledge on. Mike Stay and John Baez write about this in their “Physics, topology, logic and computation: a Rosetta Stone”, where they use category theoretic constructs to speak about the similarities between (you guessed it) physics, topology, logic, and computation.

Jocelyn Ireson-Paine puts it this way, “category theory is a great source of unifying concepts and organizing principles.” This is the benefit of all the abstraction — by throwing away all the details, an object’s structure reveals itself.

As a concrete example, consider one of the most profound mathematical achievements, Descartes’s discovery of analytic geometry: the realization that geometry can be translated into Cartesian coordinates and, thus, the power of algebra can be brought to bear on the subject.

With category theory, this discovery can be expressed in what has to be one of the most satisfying formulas of all time:

\( P \xrightarrow{\quad f \quad} \mathbb{R}^{2} \)
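To make that arrow a little more concrete, here is a rough sketch in Python. The point names, their coordinates, and the helpers `collinear` and `distance` are all made up for illustration. The only idea is that once `f` assigns each point a pair of real numbers, geometric questions turn into arithmetic.

```python
# f : P -> R^2, written as a lookup table from named points to coordinates.
# The points and their coordinates are hypothetical.
f = {
    "A": (0.0, 0.0),
    "B": (2.0, 1.0),
    "C": (4.0, 2.0),
    "D": (1.0, 3.0),
}

def collinear(p, q, r):
    """Do p, q, r lie on one line? They do exactly when the cross product
    of the vectors q - p and r - p is zero."""
    (x1, y1), (x2, y2), (x3, y3) = f[p], f[q], f[r]
    cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
    return abs(cross) < 1e-12

def distance(p, q):
    """Length of the segment pq, via the Pythagorean theorem on coordinates."""
    (x1, y1), (x2, y2) = f[p], f[q]
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5

print(collinear("A", "B", "C"))
print(collinear("A", "B", "D"))
print(distance("A", "C"))
```

No ruler or compass appears anywhere: every answer comes out of multiplication, subtraction, and a square root.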

Applications of category theory

The above is nice and all, but it’s still just sort of hey-take-my-word-for-it, which is not so satisfying. Here are some actual examples:

I will leave you with the following:

[Category theory] does not itself solve hard problems in topology or algebra. It clears away tangled multitudes of individually trivial problems. It puts the hard problems in clear relief and makes their solution possible.
“The Last Mathematician from Hilbert’s Gottingen”

Further Reading

What I’m Watching

From most recommended to least (roughly):

Hard Books Are Overrated

Hot air balloons take people on adventures. Books do, too.

Widely used calculus books must be mediocre.
— W. Rudin

I’ve noticed a phenomenon, especially in mathematics, where anyone asking for book recommendations invariably is recommended the least-gentle-but-still-reasonable textbook imaginable. A high school student might ask for an introduction to calculus and someone will tell them to read Principles of Mathematical Analysis. Looking for an introduction to programming? Hey, try Structure and Interpretation of Computer Programs — or just read Knuth!

Several possible explanations spring to mind:

  • By recommending a difficult book to someone, you’re signalling your own intelligence.
  • Struggling against a text results in a sort of Stockholm syndrome, where one becomes more fond of it as they sink more hours into it.
  • Recommending an easy book can be socially costly as the receiver might interpret it as an insult. It’s safer to recommend a hard book.
  • When asked for a book recommendation, people mentally grasp the book most representative of that subject, and this leads to a bias towards not-beginner-friendly texts.
  • Hard texts are that much better than anything else.
  • Familiarity with a subject damages one’s ability to judge whether or not a text is appropriate for someone without the same background.
  • Those recommending hard textbooks have not, generally, tried to self-study them, but rather sat through a class that required the text and completed 10% of the problems as homework.

All of these could be true, but what we really want to know is: are hard books systematically overvalued?

Comparing the utility of books

Now, I have no doubt that reading one hard book is more valuable than reading one easy book, but that’s not a fair fight. I could get through The Design of Everyday Things in a couple of days, while working through Knuth’s Concrete Mathematics would take me the better part of a year. If you accept that a book like Concrete Mathematics can take 200 hours while an easy book might take 4, the question becomes: Is one hard book worth fifty easy ones?


We can consider taking this completely literally. Consider perhaps the most straightforward application of microeconomic theory ever: If two men are selling rugs, but one man is able to sell his rugs for ten times as much, well, customers are deriving more value from his rugs.

50 easy books will, most of the time, cost more than 1 hard book. We might expect, then, that 50 easy books will provide more value to the consumer. I actually have painstakingly collected statistics on my own reading, but there’s no correlation between my rating of a book and its price (alas).

A stroll through the New York Times’s best seller list also reveals no obvious relationship between how interesting a book looks and its price. A look at computer science and programming books is a bit more compelling. The obviously more valuable books (e.g. The Art of Computer Programming) are selling for more than the nth Adobe Photoshop manual.

So easy books win here, but it’s not clear if this win is worth anything.

Amount learned

Instead, we can take the tack of estimating the amount learned from a book. When going through The Design of Everyday Things, I learned maybe 10 things. Today, I started reading Petzold’s CODE and have taken about 50 separate notes. At this rate, I should end up with something like 300. A significant variance here, but let’s say the average easy text can teach someone 35 new things — more if you’re careful with book selection.

If I consider, on the other hand, a harder text, like Artificial Intelligence: A Modern Approach, the amount available to learn is about an order of magnitude greater. Depending on the number of exercises one wades through, anywhere between one thousand and three thousand things seems reasonable.

This one could really go either way. A broad selection of fifty popular science texts will teach a person more than one really hard book, but one hard book will teach you more than fifty so-so spiritual or diet books.
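To see why this is so close, the rough figures above can be put on a per-hour basis. What follows is a minimal sketch that uses only the guesses already made in this post (about 4 hours and 35 things learned for an easy book, about 200 hours and 1,000 to 3,000 things for a hard one). The constant and function names are made up for illustration, and the numbers are estimates, not measurements.

```python
# Back-of-the-envelope comparison using the rough estimates from this post.
# Every figure here is a guess taken from the text, not measured data.

EASY_HOURS = 4            # time to read one easy book
EASY_THINGS = 35          # things learned from an average easy book
HARD_HOURS = 200          # time to work through one hard book
HARD_THINGS_LOW = 1000    # low estimate of things learned from a hard book
HARD_THINGS_HIGH = 3000   # high estimate of things learned from a hard book


def things_per_hour(things, hours):
    """Return the number of things learned per hour of reading."""
    return things / hours


# How many easy books fit into the time one hard book takes.
easy_books_per_hard = HARD_HOURS // EASY_HOURS

# Total things learned if that time is spent on easy books instead.
easy_total = easy_books_per_hard * EASY_THINGS

easy_rate = things_per_hour(EASY_THINGS, EASY_HOURS)
hard_rate_low = things_per_hour(HARD_THINGS_LOW, HARD_HOURS)
hard_rate_high = things_per_hour(HARD_THINGS_HIGH, HARD_HOURS)

print(f"Easy books per hard book: {easy_books_per_hard}")
print(f"Things learned from those easy books: {easy_total}")
print(f"Easy rate: {easy_rate:.2f} things per hour")
print(f"Hard rate: {hard_rate_low:.2f} to {hard_rate_high:.2f} things per hour")
```

Under these guesses, fifty easy books come to about 1,750 things, which falls inside the 1,000 to 3,000 range for the single hard book. The result depends on which end of that range you believe and on how well you pick the easy books.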

And, of course, all learning is not created equal. Some things really are more important than others.

Possible heuristic: popular science is great for building a broad understanding, while difficult works are great for pushing one to the next level via deliberate practice.

Revealed preference

I can count the number of people I know who routinely fight their way through hard books on one hand. It’s not a normal thing that people do, even curious people. Indeed, curious people are notable for being more willing than most to read a wide variety of texts, not for the difficulty of those texts.

University courses

University professors typically assign more difficult textbooks (although often not truly difficult ones, more of a middle ground). Still, consider the popularity of a book like CLRS versus The Algorithm Design Manual. The first is more popular, the second more readable.

I’m not sure who wins this round: on the one hand, university textbooks are harder to read than what’s on the New York Times best seller list. On the other hand, most university courses do not require books like Structure and Interpretation of Computer Programs, instead opting for something gentler.

Plus, these medium difficulty textbooks are often little more than props for a class, something to accompany lectures and provide exercises. Still, on the whole, it seems more honest to call this a win for difficult books.

Comprehension, amount retained

On difficult texts, one Less Wrong commenter wrote this:

I found that when a text requires a second or third reading, taking a lot of notes, etc., I won’t be able to master it at the level of the material that I know well, and it won’t be retained as reliably, for example I won’t be able to re-generate most of the key constructions and theorems without looking them up a couple of years later (this applies even if more advanced topics are practiced in the meantime, as they usually don’t involve systematic review of the basics). Thus, there is a triple penalty for working on challenging material: it takes more effort and time to process, the resulting understanding is less fluent, and it gets forgotten faster and to a greater extent.


If we lived in a world where hard books were clearly superior to easier ones, I would expect the reading habits of successful people to center around difficult books.

This isn’t the impression that I get, in general. CEOs fill their reading lists with biographies, not textbooks. If you look at Bill Gates’s reading list, you’ll see that it’s filled with popular non-fiction rather than hefty technical works.

Closing remarks

All of this suggests a heuristic: to decide what to read, ask yourself, “What’s the easiest book I could read that will teach me a lot?” Or, alternatively, “What’s the easiest book that will move me towards my goals?”

For reading broadly, popular works seem like a clear win. I think it’s best to save difficult works in a field until you’ve reached diminishing returns on easier texts. If you’re not learning much, then you ought to move on to something harder.

Ideally, the process of reading through easy texts on a subject before tackling harder ones transforms once difficult books into easy ones. This pattern of reading looks more like a slow progression than a violent struggle — small steps instead of leaps.
