By Ben Blatt
In 1964, Mosteller and Wallace published Inference and Disputed Authorship: The Federalist. The paper used statistical analysis to try to determine whether James Madison, Alexander Hamilton, or John Jay was the author of the unattributed essays in The Federalist Papers. They approached this historical mystery using differences in word frequencies and Bayesian statistics. While controversial, similar methods have been used to investigate other authorship debates, such as those over Shakespeare’s sonnets and plays. The same can be done for sports articles. While it would certainly be easier to look at the author’s name right underneath the title than to perform a statistical analysis, I thought it would be fun anyway.
I started by picking three sportswriters of national prominence who don’t limit their writing to a single sport. I chose Bill Simmons, Rick Reilly, and Jason Whitlock. I downloaded all of their columns from November 1st, 2009 to October 31st, 2010 and ran a word frequency macro in Microsoft Word to determine the number of times each word appeared. The counts of each of the 22,633 unique words are not that interesting by themselves. The most common words are basic words such as “in”, “I”, or “was”. What is more interesting, and more useful for a Bayesian analysis, is P(Author|Word ‘a’): the probability that, if you saw only one word from an article, the article belonged to a particular author.
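For anyone who would rather not use a Word macro, the counting step can be sketched in a few lines of Python. This is only a rough stand-in for the macro, not what I actually ran: the `count_words` helper, the lowercase tokenizer, and the sample sentence are all assumptions made up for illustration, and real columns would need more careful handling of curly apostrophes, numbers, and hyphens.

```python
import re
from collections import Counter


def count_words(text):
    """Count how often each word appears in a block of text.

    Assumption: a "word" is a run of lowercase letters and straight
    apostrophes after lowercasing, which is a simplification.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Invented sample text, standing in for a downloaded column.
sample = "I was in the game. The game was close."
counts = count_words(sample)
print(counts.most_common(3))
```

Running this once per author over all of that author's columns would give one table of counts per author, which is the input the Bayesian step needs.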
After using P(Author|Word ‘a’) = P(Word ‘a’|Author) P(Author) / [P(Word ‘a’|Author1) P(Author1) + P(Word ‘a’|Author2) P(Author2) + P(Word ‘a’|Author3) P(Author3)], a direct use of Bayes’ Theorem and the Law of Total Probability, I determined P(Author|Word ‘a’) for each word and each author. Here are the results, ordered by the words with the highest values of P for each author. These words are the best indicators of authorship. I only included words that each author used a minimum of three times, since it is difficult to draw conclusions from rarely used words.
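Here is a minimal Python sketch of that calculation. It assumes P(Word|Author) is estimated as the word's count divided by the author's total word count, and that each author gets an equal prior unless told otherwise; the function name `author_given_word` and the tiny counts are invented for illustration and are not the real data.

```python
def author_given_word(counts_by_author, priors=None):
    """Return P(author | word) for every word seen by any author.

    counts_by_author maps author -> {word: count}.
    Assumption: P(word | author) = count / total words by that author.
    """
    authors = list(counts_by_author)
    if priors is None:
        # No prior knowledge: every author is equally likely.
        priors = {a: 1.0 / len(authors) for a in authors}
    totals = {a: sum(counts_by_author[a].values()) for a in authors}
    vocab = set()
    for a in authors:
        vocab.update(counts_by_author[a])

    table = {}
    for word in vocab:
        # Numerator of Bayes' Theorem for each author.
        joint = {
            a: priors[a] * counts_by_author[a].get(word, 0) / totals[a]
            for a in authors
        }
        # Law of Total Probability: P(word) summed over the authors.
        evidence = sum(joint.values())
        table[word] = {a: joint[a] / evidence for a in authors}
    return table


# Invented toy counts, four words per author.
toy_counts = {
    "Simmons": {"the": 2, "celtics": 2},
    "Reilly": {"the": 3, "golf": 1},
    "Whitlock": {"the": 1, "chiefs": 3},
}
posterior = author_given_word(toy_counts)
print(posterior["the"])
print(posterior["celtics"])
```

In the toy data, a word used by only one author gets a probability of 1 for that author, which is part of why a minimum usage threshold is worth having.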
Many of the top words are names or sports-specific terms, which could be a side effect of a single column on a particular player or team. To identify a unique writing style, it might be more interesting to look at the most-used adjectives and adverbs that are not directly tied to sports. Here are the results.
The use of statistical methods to determine authorship is controversial. To a reader, simply the frequency of each word might not seem like the defining writing feature of an author. The original paper by Mosteller on the Federalist Papers was controversial because it had no results that could be tested.
To test the validity of such methods, I used a Bayesian method to determine the author of six articles. I chose the first article written after October 31st, 2010 and the last article written before November 1st, 2009 by each of the three authors. These articles were not included in the original data set used to determine word frequencies. I used only 150 words (the 50 most distinctive words from each author).
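Picking the most distinctive words for each author can be sketched as a filter followed by a sort. This is a hedged illustration: the `top_words` helper and the toy numbers are invented, and it assumes the three-use minimum applies to the author being ranked, which is one possible reading of the rule.

```python
def top_words(posterior, counts_by_author, author, k=50, min_count=3):
    """Return up to k words with the highest P(author | word).

    Assumption: a word qualifies only if this author used it at
    least min_count times in the training columns.
    """
    eligible = [
        word for word in posterior
        if counts_by_author[author].get(word, 0) >= min_count
    ]
    # Highest probability first; ties broken alphabetically.
    eligible.sort(key=lambda w: (-posterior[w][author], w))
    return eligible[:k]


# Invented toy values, not the real tables.
toy_posterior = {
    "celtics": {"Simmons": 0.9, "Reilly": 0.1},
    "the": {"Simmons": 0.5, "Reilly": 0.5},
    "vegas": {"Simmons": 0.95, "Reilly": 0.05},
}
toy_counts = {
    "Simmons": {"celtics": 5, "the": 40, "vegas": 2},
    "Reilly": {"celtics": 1, "the": 38},
}
simmons_words = top_words(toy_posterior, toy_counts, "Simmons", k=2)
print(simmons_words)
```

Here "vegas" has the highest probability for Simmons but is dropped because it appears only twice, which is exactly the kind of rare word the threshold is meant to exclude.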
The formula is below. It is based on treating each appearance of a word as an independent one-word article. The only values entered were the number of times each word appeared in the article and P(Author|Word ‘a’), which was calculated previously. Author1, Author2, and Author3 each correspond to one of Simmons, Reilly, or Whitlock, and with no prior knowledge each author is assumed to be equally likely. P(Author1|Word ‘a’) is written as P1(a) below for simplicity, and the formula also uses the number of times word ‘a’ appeared in the article.
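One way to read the description above is that each author's score is the product of P(Author|Word) over every appearance of every tracked word, divided by the sum of the three products. The Python sketch below implements that reading, which is an assumption on my part and may differ in detail from the formula I used. The `classify` helper, the `floor` value that keeps a zero probability from wiping out an author, and the toy table are all hypothetical.

```python
import math


def classify(word_counts, posterior, authors, floor=1e-6):
    """Return P(author | article), treating each word appearance as
    an independent one-word article with equal priors.

    word_counts maps word -> times it appears in the test article.
    posterior maps word -> {author: P(author | word)}.
    floor is an assumed lower bound to avoid taking log(0).
    """
    log_scores = {a: 0.0 for a in authors}
    for word, n in word_counts.items():
        if word not in posterior:
            continue  # only the tracked distinctive words count
        for a in authors:
            p = max(posterior[word][a], floor)
            log_scores[a] += n * math.log(p)
    # Work in log space and subtract the peak so exp() cannot underflow
    # for every author at once.
    peak = max(log_scores.values())
    weights = {a: math.exp(s - peak) for a, s in log_scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Invented toy table and article counts.
authors = ["Simmons", "Reilly", "Whitlock"]
toy_posterior = {
    "celtics": {"Simmons": 0.8, "Reilly": 0.1, "Whitlock": 0.1},
    "golf": {"Simmons": 0.1, "Reilly": 0.7, "Whitlock": 0.2},
}
article = {"celtics": 2, "golf": 1, "unknown": 4}
result = classify(article, toy_posterior, authors)
print(result)
```

Because every appearance multiplies in another factor, an article with hundreds of tracked words pushes the result toward 0 or 1 very quickly, which is why the sketch works with logarithms rather than raw products.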
Here are the results showing the probability of authorship each article was assigned.
|Simmons November 5th, 2010
|Simmons October 31st, 2009
|Whitlock November 2nd, 2010
|Whitlock October 29th, 2009
|Reilly November 3rd, 2010
|Reilly October 28th, 2009
As you can see, the calculations identified the correct author for all six of the articles tested. Simmons’ most recent article had about a 1 in 50 tredecillion chance of being written by Reilly. These results demonstrated accuracy far greater than I had anticipated. This shows not only that it is theoretically possible to determine the author using word frequency, but also that the method outlined above is a valid one. I was surprised to see the values so close to 0 or 1. While I believe this method is accurate, I doubt the true probabilities are as extreme as calculated. The extreme values most likely reflect the fact that each word was treated as an independent event and that so many word frequencies entered the calculations.
Word frequency can identify authorship, but it is certainly not the only way to do so. For instance, with absolutely no statistical tests and only a gut feeling that I am correct, I have determined with 100% probability that I am the author of this post.
Ben Blatt can be contacted at firstname.lastname@example.org.