By Ben Blatt
In 1964, Mosteller and Wallace published Inference and Disputed Authorship: The Federalist. The paper used statistical analysis to try to determine whether James Madison, Alexander Hamilton, or John Jay was the author of the unattributed essays that were part of The Federalist Papers. They approached this historical mystery by using differences in word frequencies and Bayesian statistics. While controversial, similar methods have been used to investigate other authorship debates, such as those over Shakespeare’s sonnets and plays. The same can be done for sports articles. While it would certainly be easier to look at the author’s name right underneath the title than to perform a statistical analysis of authors, I thought it would be fun anyways.
I started by picking three sportswriters of national prominence who don’t limit their writing to a single sport. I chose Bill Simmons, Rick Reilly, and Jason Whitlock. I downloaded all of their columns from November 1st 2009 to October 31st 2010 and ran a word frequency macro in Microsoft Word to determine the number of times each word appeared. The counts of each of the 22,633 unique words are not that interesting by themselves. The most common words are basic words such as “in”, “I”, or “was”. What is more interesting, and more useful for a Bayesian analysis, is to determine P(Author|Word ‘a’): the probability that, if you saw only one word from an article, the article belonged to a particular author.
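The counting step can be sketched in a few lines of Python. This is only an illustration, not the Word macro used for the post: the tokenizing rule (lowercase everything, keep internal apostrophes so that “he’d” stays one word) and the sample sentence are assumptions.

```python
import re
from collections import Counter

def word_counts(text):
    """Count case-insensitive word occurrences, keeping internal apostrophes."""
    # Assumed tokenizer: runs of letters, optionally joined by ' or ’
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(words)

# Hypothetical stand-in text; the real input would be a year of one author's columns.
sample = "He'd say the trade was big. The trade happens, and he'd say it again."
counts = word_counts(sample)
print(counts.most_common(3))
```

Running this once per author gives the three count tables that the rest of the analysis needs.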
After using $P(\text{Author}_i \mid a) = \frac{P(a \mid \text{Author}_i)\,P(\text{Author}_i)}{\sum_j P(a \mid \text{Author}_j)\,P(\text{Author}_j)}$, a direct use of Bayes’ Theorem and the Law of Total Probability, I determined P(Author|Word ‘a’) for each word and each author. Here are the results, ordered by the words with the highest values of P for each author. These words are the best indicators of authorship. I only included words that were used by each author a minimum of three times, as infrequently used words are a rare case from which it would be difficult to draw conclusions.
Bill Simmons | Jason Whitlock | Rick Reilly |
Boston | Favre | Anybody |
Following | Elway | Says |
Movie | Vs | Tour |
Everyone | Quarterback | Beer |
Happens | Journalist | He’d |
Picks | Brett | Tattoo |
Scene | Tiger’s | Mom |
Suns | Moss | Somebody |
Trade | Media | Buzz |
Biggest | Offseason | PGA |
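The P(Author|Word ‘a’) calculation behind a table like this can be sketched as follows. Two things here are assumptions, not details from the post: P(Word|Author) is estimated as the word’s share of that author’s total word count, and each author gets the same prior. The counts are made up for illustration.

```python
from collections import Counter

def author_given_word(counts_by_author, min_count=3):
    """Return {word: {author: P(author | word)}} via Bayes' Theorem."""
    totals = {a: sum(c.values()) for a, c in counts_by_author.items()}
    prior = 1.0 / len(counts_by_author)  # assumption: equal priors
    # Keep only words that every author used at least min_count times.
    shared = set.intersection(*(
        {w for w, n in c.items() if n >= min_count}
        for c in counts_by_author.values()
    ))
    table = {}
    for w in shared:
        # Numerator of Bayes' Theorem: P(word | author) * P(author)
        joint = {a: prior * c[w] / totals[a] for a, c in counts_by_author.items()}
        # Law of Total Probability: P(word) is the sum over all authors
        evidence = sum(joint.values())
        table[w] = {a: j / evidence for a, j in joint.items()}
    return table

# Hypothetical counts, 1,000 words per author.
counts_by_author = {
    "Simmons": Counter({"boston": 30, "media": 5, "suns": 2, "the": 963}),
    "Whitlock": Counter({"boston": 5, "media": 30, "the": 965}),
    "Reilly": Counter({"boston": 5, "media": 5, "the": 990}),
}
table = author_given_word(counts_by_author)
print(table["boston"])
```

In this toy data “suns” is dropped because it fails the three-use minimum, and “boston” points strongly to Simmons while “the” is close to uninformative.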
Many of the top words are names or sports-specific terms, which could be a side effect of a single column on a particular player or team. It might be more interesting to look at the most-used adjectives and adverbs that are not directly tied to sports, to identify a unique writing style. Here are the results.
Simmons | Whitlock | Reilly |
Biggest | Spoiled | Tiny |
Excited | Several | Large |
Eventually | Allegedly | Very |
Almost | Particular | Nice |
Low | Important | Dumbest |
The use of statistical methods to determine authorship is controversial. To a reader, simply the frequency of each word might not seem like the defining writing feature of an author. The original paper by Mosteller on the Federalist Papers was controversial because it had no results that could be tested.
To test the validity of such methods, I used a Bayesian method to determine the author of six articles. I chose the first article written after October 31st 2010 and the last article written before November 1st 2009 by each of the three authors. These articles were not included in the original data set used to determine word frequencies. I used only 150 words (the 50 most distinctive words from each author).
The formula is based on treating each occurrence of a word as an independent one-word article. The only values entered were the number of times each word appeared in the article and P(Author|Word ‘a’), which was calculated previously. Author1, Author2, and Author3 each correspond to Simmons, Reilly, or Whitlock, and the prior probability of each author is assumed to be equal, given no prior knowledge. P(Author1|Word ‘a’) is written as P1(a) for simplicity, and the formula also uses the number of times word ‘a’ appeared in the article.
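One way to read that description, sketched in Python: with equal priors, the score for each author is the product over words of P(Author|Word ‘a’) raised to the number of times word ‘a’ appears, and the scores are then normalized to sum to 1. This is a reconstruction from the prose, not the original formula, and the probabilities and counts below are hypothetical. The sum is done in log space because products of many small numbers underflow.

```python
import math

def classify(article_counts, table):
    """Combine per-word P(author | word) values, one factor per occurrence."""
    authors = list(next(iter(table.values())))
    log_scores = {a: 0.0 for a in authors}
    for word, n in article_counts.items():
        if word not in table:
            continue  # words outside the chosen word list are ignored
        for a in authors:
            # Assumes no probability is exactly 0, which holds when every
            # listed word was used by every author.
            log_scores[a] += n * math.log(table[word][a])
    # Normalize: subtract the largest log score before exponentiating.
    top = max(log_scores.values())
    weights = {a: math.exp(s - top) for a, s in log_scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Hypothetical P(author | word) values and article word counts.
table = {
    "boston": {"Simmons": 0.75, "Whitlock": 0.125, "Reilly": 0.125},
    "media": {"Simmons": 0.125, "Whitlock": 0.75, "Reilly": 0.125},
}
article_counts = {"boston": 4, "media": 1, "unlisted": 7}
result = classify(article_counts, table)
print(max(result, key=result.get), result)
```

Because every occurrence multiplies in another factor, a long article with a few hundred listed words pushes the normalized values very close to 0 or 1.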
Here are the results showing the probability of authorship each article was assigned.
Article | P(Author=Simmons) | P(Author=Whitlock) | P(Author=Reilly) |
Simmons November 5th, 2010 | ~1 | 1.34×10^-25 | 4.70×10^-43 |
Simmons October 31st, 2009 | ~1 | 5.03×10^-27 | 6.76×10^-26 |
Whitlock November 2nd, 2010 | 4.65×10^-37 | ~1 | 1.95×10^-23 |
Whitlock October 29th, 2009 | 2.28×10^-28 | ~1 | 6.00×10^-27 |
Reilly November 3rd, 2010 | 2.92×10^-7 | 2.00×10^-11 | ~1 |
Reilly October 28th, 2009 | 1.82×10^-9 | 9.00×10^-8 | ~1 |
As you can see, the calculations determined the correct author in all six of the articles tested. Simmons’ most recent article had about a 1 in 2 tredecillion chance of being written by Reilly (1 divided by 4.70×10^-43 is roughly 2×10^42). These results demonstrated accuracy far greater than I had anticipated. This shows not only that it is theoretically possible to determine the author using word frequency, but also that the methods outlined previously are valid. I was surprised to see the values so close to 0 or to 1. While I believe this method is accurate, I doubt the true probabilities are as extreme as calculated. The extremes most likely reflect the fact that each word was treated as an independent event and that so many word frequencies entered the calculations.
Word frequency can identify authorship, but it is certainly not the only way to do so. For instance, with absolutely no statistical tests and only gut feeling that I am correct, I have determined with 100% probability that I am the author of this post.
Ben Blatt can be contacted at bbblatt@gmail.com.
Ben:
This is awesome. Especially the Reilly words. That’s all.
Were there any dental-related words for Reilly in the top 10 or 20? I bet there are some gems in the top hundred for each.
Agreed. This is awesome
this post is great Ben – very entertaining
I second that
Thanks guys. As for dental-related words, David, the highest is ‘mouth’, which is Reilly’s 85th most indicative word. Other highlights in the top 100 include ‘myself’ as the 68th most indicative for Simmons and ‘idiot’ as the 93rd most indicative for Whitlock.
Amazing job, Ben. Thoroughly impressive and entertaining.
This is excellent. My favorite part is inferring from the adjectives/adverbs chart what topic the writer focuses on. It seems that Simmons is most concerned with the sport as a whole and Whitlock with the individual players, while Reilly seems to be working on a children’s book.
You guys don’t really mean that you think Rick Reilly is cool? He’s the absolute worst writer ever!
Also, while all of the incorrect author probabilities are quite low, Simmons and Whitlock have a much, much higher chance of having written Reilly’s articles than the other way around. Is this a sample-size issue, as Reilly’s columns were shorter?
That is definitely true, David, although it does not explain the whole picture. Reilly’s columns are the shortest, but within the same range as Whitlock’s. However, a Simmons column is 5000-6000 words, which is unusually long. While that difference explains the difference in values for Reilly and Simmons, it does not explain why Whitlock’s values are in the same order of magnitude as those for Simmons. My guess would be that Whitlock has a more distinct vocabulary compared to Simmons and Reilly.
Crazy stuff
Were the articles that you checked outside of the testing set? If not, that might have skewed your results.
“These articles were not included in the original data set used to determine word frequencies.”
Love it! A very entertaining use of Bayesian analysis, much better than anything we did in b-school.
Couple of points.
One, Mosteller and Wallace’s book represents one of the first truly Bayesian analyses done at any scale. It profusely thanks the calculators using index cards and slide rules. They used a negative-binomial (overdispersed Poisson) rather than the simple multinomial you’re using, which provides much better probability estimates.
What you’re using is often called the naive Bayes method. The naivete is in assuming the words are independent, and is what leads the probability estimates to be so skewed toward 0 or 1. But naive Bayes isn’t a Bayesian method in that it only uses Bayes rule on observable data, which is kosher with the frequentists, too.
Here’s a blog post on the Bayesian form of naive Bayes:
http://lingpipe-blog.com/2009/10/02/bayesian-naive-bayes-aka-dirichlet-multinomial-classifiers/
Here’s a general rundown on what it means to be Bayesian:
http://lingpipe-blog.com/2009/09/09/what-is-bayesian-statistical-inference/
which was a preface to my series on Bayesian stats which uses batting average estimation as the first substantive example:
http://lingpipe-blog.com/2009/09/11/batting-averages-bayesian-vs-mle-estimate/
http://lingpipe-blog.com/2009/09/15/moment-matching-empirical-bayes-beta-priors-batting-average/
with the final Bayesian analysis in:
http://lingpipe-blog.com/2009/09/23/bayesian-estimators-for-the-beta-binomial-model-of-batting-ability/
“I thought it would be fun anyways?” Ben, that’s hardly the best verbal vantage point from which to critique the writing of others.
As the editor of the post, I probably should have fixed that. However, given the jocular nature of his explanation, I decided to leave it (rather than replace it with the more formal “anywho”).
Very interesting analysis. I find myself using the same words in my writing all the time.
Interesting that the odds were only in the billions/trillions that Simmons and Whitlock could’ve written like Reilly, but the odds that Reilly could write like either of them were in the billions of billions of trillions.
Are there many words that are used by only 1 of the 3 authors? If so, the estimated probability of an author using that word in the testing phase will be either 1 or 0.
In maximum likelihood estimation, it is common to place a floor on all probabilities, so a 0 probability event is treated as 10^-10.
Did you do this? Do you think words that are used by only one author in your data may be driving the probabilities that are so close to 0 or 1?
As I stated in the post before the first table of words, only words that were used by each author a minimum of three times were included in the data set. While some words were used much more than others, none of the individual p-values approached 10^-10. That being said, each author had a set of words which they used much more frequently than others which drove the final probabilities to extremes.