Unnecessary Inference and Undisputed Authorship: Bill Simmons, Jason Whitlock, and Rick Reilly

November 10, 2010

By Ben Blatt

In 1964, Mosteller and Wallace published Inference and Disputed Authorship: The Federalist. The paper used statistical analysis to try to determine whether James Madison, Alexander Hamilton, or John Jay was the author of the unattributed essays that were part of The Federalist Papers. They approached this historical mystery using differences in word frequencies and Bayesian statistics. While controversial, similar methods have been used to investigate other authorship debates, such as those over Shakespeare’s sonnets and plays. The same can be done for sports articles. While it would certainly be easier to look at the author’s name right underneath the title than to perform a statistical analysis of authors, I thought it would be fun anyways.

I started by picking three sportswriters of national prominence who don’t limit their writing to a single sport. I chose Bill Simmons, Rick Reilly, and Jason Whitlock. I downloaded all of their columns from November 1st, 2009 to October 31st, 2010 and ran a word frequency macro in Microsoft Word to determine the number of times each word appeared. The counts of each of the 22,633 unique words are not that interesting by themselves. The most common words are basic words such as “in”, “I”, or “was”. What is more interesting, and more useful for a Bayesian analysis, is to determine P(Author|Word ‘a’): the probability that an article belongs to a particular author if you saw only one word from it.
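For readers who want to try the counting step without a Word macro, here is a minimal Python sketch. It is a stand-in, not the macro I used: the tokenizing rule (lowercase letters, with internal apostrophes kept so that “he’d” stays one word) and the sample text are assumptions for illustration only.

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Count case-insensitive word occurrences, keeping internal apostrophes."""
    # Assumed tokenizing rule: runs of letters, optionally joined by ' or ’
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(words)

# Hypothetical stand-in for one author's collected columns.
sample = "He'd say anybody would. Anybody says that, he'd say."
counts = word_counts(sample)
print(counts.most_common(3))
```

Running this once per author gives one table of counts for each writer, which is all the later steps need.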

I applied Bayes’ Theorem together with the Law of Total Probability,

$$P(\text{Author}_i \mid a) = \frac{P(a \mid \text{Author}_i)\,P(\text{Author}_i)}{\sum_{j=1}^{3} P(a \mid \text{Author}_j)\,P(\text{Author}_j)},$$

to determine P(Author|Word ‘a’) for each word and each author. Here are the results, ordered by the words with the highest values of P for each author. These words are the best indicators of authorship. I only included words that were used by each author a minimum of three times, since it would be difficult to draw conclusions from rarely used words.

Bill Simmons	Jason Whitlock	Rick Reilly
Boston	Favre	Anybody
Following	Elway	Says
Movie	Vs	Tour
Everyone	Quarterback	Beer
Happens	Journalist	He’d
Picks	Brett	Tattoo
Scene	Tiger’s	Mom
Suns	Moss	Somebody
Trade	Media	Buzz
Biggest	Offseason	PGA
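The ranking behind a table like this can be sketched in Python as follows. This is an illustration under stated assumptions rather than my exact spreadsheet: it assumes P(Word|Author) is the word’s count divided by that author’s total word count, it assumes a uniform prior of 1/3 per author, and the tiny counts are made up.

```python
from collections import Counter

def author_given_word(counts_by_author: dict[str, Counter], min_count: int = 3):
    """Return {word: {author: P(author | word)}} for words every author used often enough."""
    authors = list(counts_by_author)
    prior = 1.0 / len(authors)  # assumption: no prior knowledge of the author
    totals = {a: sum(c.values()) for a, c in counts_by_author.items()}
    # Keep only words that each author used at least min_count times.
    shared = set.intersection(*(
        {w for w, n in c.items() if n >= min_count}
        for c in counts_by_author.values()
    ))
    posterior = {}
    for w in shared:
        likelihood = {a: counts_by_author[a][w] / totals[a] for a in authors}
        evidence = sum(likelihood[a] * prior for a in authors)  # law of total probability
        posterior[w] = {a: likelihood[a] * prior / evidence for a in authors}  # Bayes
    return posterior

def top_words(posterior, author: str, k: int = 10):
    """Words sorted by how strongly they point to the given author."""
    return sorted(posterior, key=lambda w: posterior[w][author], reverse=True)[:k]

# Made-up counts, 100 words per author, purely for illustration.
toy = {
    "Simmons": Counter({"boston": 6, "says": 3, "the": 91}),
    "Whitlock": Counter({"boston": 3, "says": 3, "the": 94}),
    "Reilly": Counter({"boston": 3, "says": 12, "the": 85}),
}
post = author_given_word(toy)
print(top_words(post, "Simmons", 1), top_words(post, "Reilly", 1))
```

In the toy data, “boston” is twice as common for Simmons as for the other two, so seeing it alone gives Simmons a probability of one half.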

Many of the top words are names or sports-specific terms, which could be a side effect of a single column on a particular player or team. It might be more interesting to look at the most distinctive adjectives and adverbs that are not directly tied to sports, to get at each writer’s unique style. Here are the results.

Simmons	Whitlock	Reilly
Biggest	Spoiled	Tiny
Excited	Several	Large
Eventually	Allegedly	Very
Almost	Particular	Nice
Low	Important	Dumbest

The use of statistical methods to determine authorship is controversial. To a reader, simply the frequency of each word might not seem like the defining writing feature of an author. The original paper by Mosteller on the Federalist Papers was controversial because it had no results that could be tested.

To test the validity of such methods, I used a Bayesian method to determine the author of six articles. I chose the first article written after October 31st, 2010 and the last article written before November 1st, 2009 by each of the three authors. These articles were not included in the original data set used to determine word frequencies. I used only 150 words (the 50 most distinctive words from each author).

The formula is below. It is based on treating each time a word appears as an independent one-word article. The only values entered were the number of times each word appeared in the article and P(Author|Word ‘a’), which was calculated previously. Author₁, Author₂, and Author₃ each correspond to one of Simmons, Reilly, or Whitlock, and the prior probability of any author is assumed to be 1/3, given no prior knowledge. P(Author₁|Word ‘a’) is written as P₁(a) below for simplicity, and n_a indicates the number of times word ‘a’ appeared in the article.

$$P(\text{Author}_1 \mid \text{Words}_{a,b,c,\ldots}) = \frac{\prod_{a} P_1(a)^{n_a}}{\sum_{k=1}^{3} \prod_{a} P_k(a)^{n_a}}$$
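The combination step described above, multiplying P(Author|Word) once per occurrence and then normalizing across the three authors under a uniform prior, can be sketched in Python. This is a hedged illustration, not my original calculation: the function name and the toy numbers are hypothetical, and the sum of logarithms is a swapped-in numerical trick to keep products of many small probabilities from underflowing to zero.

```python
import math

def classify(article_counts: dict[str, int],
             posterior: dict[str, dict[str, float]]) -> dict[str, float]:
    """Combine per-word P(author | word) values, one factor per occurrence."""
    authors = list(next(iter(posterior.values())))
    log_score = {a: 0.0 for a in authors}
    for word, n in article_counts.items():
        if word not in posterior:
            continue  # words outside the chosen vocabulary are ignored
        for a in authors:
            # n occurrences contribute P_a(word) ** n, i.e. n * log P_a(word)
            log_score[a] += n * math.log(posterior[word][a])
    # Normalize so the three values sum to 1 (log-sum-exp for stability).
    m = max(log_score.values())
    z = sum(math.exp(s - m) for s in log_score.values())
    return {a: math.exp(s - m) / z for a, s in log_score.items()}

# Hypothetical per-word probabilities, for illustration only.
toy_posterior = {
    "boston": {"Simmons": 0.5, "Whitlock": 0.25, "Reilly": 0.25},
    "says": {"Simmons": 0.2, "Whitlock": 0.2, "Reilly": 0.6},
}
result = classify({"boston": 2, "says": 1, "unknown": 4}, toy_posterior)
print(result)
```

Because every word in the vocabulary was used by all three authors, none of the per-word probabilities is zero, so the logarithm is always defined.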

Here are the results showing the probability of authorship each article was assigned.

Article	P(Author=Simmons)	P(Author=Whitlock)	P(Author=Reilly)
Simmons, November 5th, 2010	~1	1.34×10^-25	4.70×10^-43
Simmons, October 31st, 2009	~1	5.03×10^-27	6.76×10^-26
Whitlock, November 2nd, 2010	4.65×10^-37	~1	1.95×10^-23
Whitlock, October 29th, 2009	2.28×10^-28	~1	6.00×10^-27
Reilly, November 3rd, 2010	2.92×10^-7	2.00×10^-11	~1
Reilly, October 28th, 2009	1.82×10^-9	9.00×10^-8	~1

As you can see, the calculations determined the correct author in all six of the articles tested. Simmons’ most recent article had about a 1 in 2 tredecillion chance (4.70×10^-43) of being written by Reilly. These results demonstrated accuracy far greater than I had anticipated. They show not only that it is theoretically possible to determine the author using word frequency, but also that the method outlined above is a valid one. I was surprised to see the values so close to 0 or to 1. While I believe this method is accurate, I doubt the true probabilities are as extreme as calculated. The extremes most likely reflect the fact that each word was treated as an independent event and that so many word frequencies entered the calculations.
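To see how the independence assumption produces such extreme values, here is a small Python sketch. The numbers are hypothetical: it supposes that every counted word occurrence favors one author over another by a modest ratio of 1.4 to 1, and simply compounds that ratio over many occurrences.

```python
import math

def log10_odds(ratio_per_word: float, occurrences: int) -> float:
    """log10 of the combined odds when each occurrence independently
    multiplies the odds by ratio_per_word."""
    return occurrences * math.log10(ratio_per_word)

# Hypothetical ratio (1.4) and occurrence counts, for illustration only.
for n in (10, 100, 300):
    print(n, "occurrences -> odds of about 10^%.1f" % log10_odds(1.4, n))
```

With 300 occurrences the compounded odds already exceed 10^43, even though no single word is strong evidence on its own. Real words in one article are not independent of each other, which is why the calculated values overstate the certainty.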

Word frequency can identify authorship, but it is certainly not the only way to do so. For instance, with absolutely no statistical tests and only gut feeling that I am correct, I have determined with 100% probability that I am the author of this post.

Ben Blatt can be contacted at bbblatt@gmail.com.


30 Comments

jezekowitz says:

November 10, 2010 at 10:49 am

Ben:
This is awesome. Especially the Reilly words. That’s all.

- David Roher says:
  
  November 10, 2010 at 10:51 am
  
  Were there any dental-related words for Reilly in the top 10 or 20? I bet there are some gems in the top hundred for each.
  
- Alex Koenig says:
  
  November 10, 2010 at 11:11 am
  
  Agreed. This is awesome
  
Jake says:

November 10, 2010 at 11:13 am

this post is great Ben – very entertaining

- stefan cheplick says:
  
  November 10, 2010 at 3:29 pm
  
  I second that
  
Ben Blatt says:

November 10, 2010 at 1:35 pm

Thanks guys. As for dental related words David, the highest is ‘mouth’ by Reilly at his 85th most indicative. Other highlights in the top 100 include ‘myself’ as the 68th most indicative for Simmons and ‘idiot’ as the 93rd most indicative for Whitlock.

AK says:

November 10, 2010 at 1:43 pm

Amazing job, Ben. Thoroughly impressive and entertaining.

Pingback: 11/10 Daily Links « OKC ThunderDome
James says:

November 10, 2010 at 2:58 pm

This is excellent. My favorite part is to infer from the adjectives/adverbs chart what topic the writer focuses on. It seems that Simmons is most concerned with the sport as a whole and Whitlock with the individual players, while Reilly seems to be working on a children’s book.

Tyler says:

November 10, 2010 at 2:59 pm

You guys don’t really mean that you think Rick Reilly is cool? He’s the absolute worst writer ever!

David Roher says:

November 10, 2010 at 6:10 pm

Also, while all of the incorrect author probabilities are quite low, Simmons and Whitlock have a much, much higher chance of having written Reilly’s articles than the other way around. Is this a sample-size issue, as Reilly’s columns were shorter?

- Ben Blatt says:
  
  November 10, 2010 at 9:09 pm
  
  That is definitely true, David, although it does not explain the whole picture. Reilly’s columns are the shortest, but within the same range as Whitlock’s. However, a Simmons column is 5000-6000 words, which is unusually long. While that difference explains the difference in values for Reilly and Simmons, it does not explain why Whitlock’s values are in the same order of magnitude as those for Simmons. My guess would be that Whitlock has a more distinct vocabulary compared to Simmons and Reilly.
  
Pingback: The Layup Line — Should Coach K being the SI Sportsman of the Year? | College Hoops Journal
Pingback: Rick Reilly Really Likes the Words “Tiny,” “Large,” “Very,” and “Nice” - SportsNewser
Mike says:

November 11, 2010 at 6:01 pm

Crazy stuff

Matt says:

November 11, 2010 at 10:10 pm

Were the articles that you checked outside of the testing set? If not, that might have skewed your results.

- jezekowitz says:
  
  November 11, 2010 at 10:17 pm
  
  “These articles were not included in the original data set used to determine word frequencies. “
  
John says:

November 12, 2010 at 12:36 pm

Love it! A very entertaining use of Bayesian analysis, much better than anything we did in b-school.

lingpipe says:

November 12, 2010 at 2:19 pm

Couple of points.

One, Mosteller and Wallace’s book represents one of the first truly Bayesian analyses done at any scale. It profusely thanks the calculators using index cards and slide rules. They used a negative-binomial (overdispersed Poisson) rather than the simple multinomial you’re using, which provides much better probability estimates.

What you’re using is often called the naive Bayes method. The naivete is in assuming the words are independent, and is what leads the probability estimates to be so skewed toward 0 or 1. But naive Bayes isn’t a Bayesian method in that it only uses Bayes rule on observable data, which is kosher with the frequentists, too.

Here’s a blog post on the Bayesian form of naive Bayes:

http://lingpipe-blog.com/2009/10/02/bayesian-naive-bayes-aka-dirichlet-multinomial-classifiers/

Here’s a general rundown on what it means to be Bayesian:

http://lingpipe-blog.com/2009/09/09/what-is-bayesian-statistical-inference/

which was a preface to my series on Bayesian stats which uses batting average estimation as the first substantive example:

http://lingpipe-blog.com/2009/09/11/batting-averages-bayesian-vs-mle-estimate/

http://lingpipe-blog.com/2009/09/15/moment-matching-empirical-bayes-beta-priors-batting-average/

with the final Bayesian analysis in:

http://lingpipe-blog.com/2009/09/23/bayesian-estimators-for-the-beta-binomial-model-of-batting-ability/

Sharon Normal says:

November 12, 2010 at 3:00 pm

“I thought it would be fun anyways?” Ben, that’s hardly the best verbal vantage point from which to critique the writing of others.

- David Roher says:
  
  November 12, 2010 at 3:38 pm
  
  As the editor of the post, I probably should have fixed that. However, given the jocular nature of his explanation, I decided to leave it (rather than replace it with the more formal “anywho”).
  
Basil says:

November 12, 2010 at 10:57 pm

Very interesting analysis. I find myself using the same words in my writing all the time.

Nate Rice says:

November 13, 2010 at 1:07 am

Interesting that the odds were only in the billions/trillions that Simmons and Whitlock could’ve written like Reilly, but the odds that Reilly could write like either of them were in the billions of billions of trillions.

Dan says:

November 16, 2010 at 11:34 am

Are there many words that are used by only 1 of the 3 authors? If so, the estimated probability of an author using that word in the testing phase will be either 1 or 0.

In maximum likelihood estimation, it is common to place a floor on all probabilities, so a 0 probability event is treated as 10^-10.

Did you do this? Do you think words that are used by only one author in your data may be driving the probabilities that are so close to 0 or 1?

- Ben Blatt says:
  
  November 16, 2010 at 12:49 pm
  
  As I stated in the post before the first table of words, only words that were used by each author a minimum of three times were included in the data set. While some words were used much more than others, none of the individual probabilities approached 10^-10. That being said, each author had a set of words which they used much more frequently than the others did, and those words drove the final probabilities to extremes.
  
Pingback: Rick Reilly Really Likes the Words "Tiny," "Large," "Very," and "Nice" - TVNewser
Pingback: The Reading Level of Sports Writing | The Harvard College Sports Analysis Collective
statsinthewild says:

May 23, 2012 at 4:17 pm

Reblogged this on Stats in the Wild.

Pingback: The Reading Level of Sports Writing and Skip Bayless. | gallopingael.net