Analyzing Inaugural Addresses

I went to the JFK Presidential Library the other day and spent quite a while listening to JFK's inaugural address. Presidential inaugural addresses capture an important part of history: the hopes, fears, obstacles, and plans of their moment. They touch on larger historical themes as well as specific plans for the next four years. So I decided to do a textual analysis of every presidential inaugural address to date.

To go directly to the visualization, click here.

For this analysis, I used several Python libraries: Scrapy, NumPy, and Matplotlib. For the visualization in this post, I used the D3 JavaScript library.

To see my source code alongside this post, go here.

Scraping the Addresses from the Web

I couldn't find the inaugural addresses in file form, so I used the Scrapy Python library to scrape the addresses from the University of California Santa Barbara online archives.

Scrapy was easy to install with pip, and a single call to scrapy startproject projectname sets up your file structure. I called mine ucsb.

Scrapy can send your data to various back ends using item pipelines, but I didn't have very much data, so I just had it dump everything as JSON into a file. I wanted each record stored in the following structure: {"year" : year, "text" : text}.

In order to get this output, we need to set up an item type to which Scrapy can map the data. So, add the following code to the file where the project's items are defined:

		import scrapy

		# One scraped address: the date it was delivered and its full text.
		class UcsbItem(scrapy.Item):
			year = scrapy.Field()
			text = scrapy.Field()

Next, you have to create a spider to crawl the website and load the information you want into the item. So, in the spiders directory, add a new file containing the following code:

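		# Import paths from the Scrapy release used here; newer releases expose
		# these as scrapy.spiders and scrapy.linkextractors instead.
		from scrapy.contrib.spiders import CrawlSpider, Rule
		from scrapy.contrib.linkextractors import LinkExtractor

		# Assumes the item class lives in the default items module of the ucsb project.
		from ucsb.items import UcsbItem

		class UcsbSpider(CrawlSpider):

		    name = 'UcsbSpider'
		    allowed_domains = ['']
		    start_urls = ['']
		    # Follow only links that look like an address page and parse each one.
		    rules = [Rule(LinkExtractor(allow=[r'/ws/index\.php\?pid=\d+']), 'parse_ucsb')]

		    def parse_ucsb(self, response):
		        i = UcsbItem()
		        # The attribute predicate attaches directly to the element name.
		        i['year'] = response.xpath("//span[@class='docdate']/text()").extract()
		        italics = response.xpath("//span[@class='displaytext']/i/text()").extract()
		        paragraphs = response.xpath("//span[@class='displaytext']/p/text()").extract()
		        i['text'] = " ".join(italics + paragraphs)
		        yield i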

What This Code Is Doing

First, we name the spider by setting the name variable inside our new UcsbSpider class. The name must be unique! We will be using it to run the spider from the command line.

Next, we list the allowed domains to which the spider can crawl, and then set the start URLs, where the spider will begin. Finally, we set the rules for following links and parsing the pages to which they lead. The rules variable is a list of scrapy.contrib.spiders.Rule objects, each of which takes a LinkExtractor and a few other, optional, arguments. The link extractor determines which links to follow and which to ignore. It takes several arguments, such as allow and deny. I only used allow, since that was all the functionality I needed, and I set it to the pattern that the part of the URL after the domain name follows on each of the inaugural address pages. All of them followed the same pattern, differing only in a number, written here as 'NUMBER'. Therefore, I set that pattern as my regular expression, escaping the period and question mark using '\' and replacing 'NUMBER' with '\d+'.
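
To sanity-check a pattern like this before handing it to the link extractor, you can try it with Python's re module. This is only a sketch: the helper name is_address_link and the sample paths are made up for illustration and are not real pages from the archive.

```python
import re

# The same pattern passed to LinkExtractor(allow=[...]) in the spider.
# The backslashes make '.' and '?' literal; '\d+' matches one or more digits.
ADDRESS_LINK = re.compile(r'/ws/index\.php\?pid=\d+')

def is_address_link(url_path):
    """Return True if the path contains the address-page pattern."""
    return ADDRESS_LINK.search(url_path) is not None

# Hypothetical paths, invented for this example.
print(is_address_link('/ws/index.php?pid=12345'))   # True
print(is_address_link('/ws/index.php?pid=abc'))     # False: no digits after pid=
print(is_address_link('/ws/indexXphp?pid=12345'))   # False: '.' must be a literal period
```

The third case shows why the escaping matters: without the backslash, the period would match any character.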

Finally, I define the parsing method that should be executed on the links that fit the rule I defined. Do not name this method parse! CrawlSpider uses a method of that name internally, so overriding it will break your spider. I know that the links that fit the rule will be parsed by parse_ucsb because I passed 'parse_ucsb' in as the second argument of the rule. In this parse_ucsb method, I create an item of the type we outlined earlier. Then, I set the fields of that item to the extracted data.

To extract this data, we use XPath to find the sections of the HTML we want. The argument of xpath is a path expression for navigating the document tree, written in the form '/parent/child'. You can use a double forward slash ('//') to match a node at any depth. You can also filter nodes by attribute using the following syntax: [@attribute='value']. Finally, you can get the text stored in a node by ending the path with /text(), and then put all of the results into a list using the .extract() method.
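
To get a feel for these path expressions without running a crawl, here is a rough sketch using the standard library's xml.etree.ElementTree in place of Scrapy's selectors. ElementTree supports only a subset of XPath (no /text() step, so .text is read from each element instead), and the HTML snippet, including its date and wording, is invented for the example.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed snippet shaped like the pages described in this post.
SAMPLE = """
<html><body>
  <span class="docdate">January 1, 1900</span>
  <span class="displaytext"><i>Fellow citizens:</i><p>First paragraph.</p><p>Second paragraph.</p></span>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# './/span' finds span elements at any depth; [@class='...'] filters on an attribute.
date_span = root.find(".//span[@class='docdate']")
print(date_span.text)

# ElementTree has no /text() step, so read .text from each matched element instead.
italics = [el.text for el in root.findall(".//span[@class='displaytext']/i")]
paragraphs = [el.text for el in root.findall(".//span[@class='displaytext']/p")]
print(" ".join(italics + paragraphs))
```

Real pages are rarely well-formed XML, which is one reason to use Scrapy's own selectors in the spider itself.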

I was specifically looking to connect all of the paragraphs and italicized sections in the element that holds the address text. Therefore, I concatenated the list of italicized sections with the list of paragraphs in that section, and joined them on a space.

Lastly, we yield this item. We yield instead of returning because a parsing method may need to produce more than one item.
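
As a small illustration of why yield suits this, here is a plain Python generator that produces several items from one call. The function name parse_many and the input records are hypothetical and have nothing to do with Scrapy itself.

```python
def parse_many(pages):
    """Yield one item per page instead of returning a single value."""
    for page in pages:
        # Each yield hands one item back to the caller, then resumes here.
        yield {"year": page["year"], "text": page["text"]}

# Hypothetical input, invented for this example.
pages = [
    {"year": "1900", "text": "first"},
    {"year": "1904", "text": "second"},
]
items = list(parse_many(pages))
print(len(items))  # 2
```

A return statement would have ended the function after the first page.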

Indexing the Results

Once the data about the inaugural addresses is in the JSON file, we have to read it into memory and index the words used. I decided to use word density instead of the number of times the words were used because certain inaugural addresses are much longer than others.

We also want to exclude very common words that carry little meaning on their own, called stop words. So, I downloaded a CSV file of English stop words that I found on Wikipedia. I first read this file into an array called stopwords.
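
Reading the stop words might look something like the sketch below. The layout of the downloaded file isn't shown in this post, so the code simply collects every cell of the CSV. The helper name load_stopwords and the sample contents are assumptions.

```python
import csv
import io

def load_stopwords(handle):
    """Read every cell of a CSV file into a set of lowercase stop words."""
    stopwords = set()
    for row in csv.reader(handle):
        for cell in row:
            word = cell.strip().lower()
            if word:
                stopwords.add(word)
    return stopwords

# Stand-in for the downloaded file; the real file's contents are not shown here.
sample_csv = io.StringIO("a,an,the\nand, of ,to\n")
stopwords = load_stopwords(sample_csv)
print(sorted(stopwords))  # ['a', 'an', 'and', 'of', 'the', 'to']
```

This sketch uses a set rather than an array because the membership test (word not in stopwords) runs once per word, and set lookups stay fast as the list grows. With a real file, you would pass in an open file handle instead of the StringIO object.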

Next, I went through the JSON file with the inaugural address data. For each address, I first read the information I wanted into memory: I parsed the date to extract just the year and encoded it as UTF-8, then extracted the text of the address (encoding it as well) and split it on spaces to get an array of words. I added the length of that array to the running total of words used in all inaugural addresses, and I also stored the number of words used in that year in another dictionary.
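
A minimal sketch of that bookkeeping, written for Python 3 (where strings are already Unicode, so the explicit encoding step is left out). It assumes each record's date is a string containing a four-digit year. The function name count_words_by_year and the sample records are invented for illustration.

```python
import json
import re

def count_words_by_year(addresses):
    """Return (total_words, words_per_year) for a list of {"year", "text"} records."""
    total_words = 0
    words_per_year = {}
    for address in addresses:
        # Assumption: the date string contains a four-digit year somewhere.
        year = re.search(r"\d{4}", address["year"]).group()
        # split() with no argument splits on any run of whitespace.
        raw_words = address["text"].split()
        total_words += len(raw_words)
        words_per_year[year] = words_per_year.get(year, 0) + len(raw_words)
    return total_words, words_per_year

# Made-up records in the {"year": ..., "text": ...} structure used by the spider.
sample_json = (
    '[{"year": "March 4, 1900", "text": "one two three"},'
    ' {"year": "March 4, 1904", "text": "four five"}]'
)
total, per_year = count_words_by_year(json.loads(sample_json))
print(total, per_year)  # 5 {'1900': 3, '1904': 2}
```

Note that these totals include stop words, since filtering happens later, word by word.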

But the words in the array still have punctuation attached to them. The word "second" and the word "second." would be mapped to different entries in our dictionary, which isn't what we want. So I used the translate function to strip the punctuation with the line raw_word.translate(None, string.punctuation) (this is the Python 2 form of the call). Then, I further normalized the word by converting it to lowercase and stripping any whitespace around the edges.
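
In Python 3, str.translate no longer accepts None as its first argument, so the same normalization needs a translation table built with str.maketrans. Here is one way it could look. The helper name normalize is made up for this example.

```python
import string

# Python 3 form: a translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def normalize(raw_word):
    """Remove punctuation, lowercase, and trim surrounding whitespace."""
    return raw_word.translate(_STRIP_PUNCT).lower().strip()

print(normalize('second.'))    # second
print(normalize(' Second, '))  # second
print(repr(normalize('--')))   # ''
```

Keep in mind that string.punctuation covers only ASCII characters and includes the apostrophe, so a contraction like "don't" becomes "dont", and typographic quotes or dashes pass through untouched.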

Finally, I add the word to the dictionary with the following code:

		if word not in stopwords and word != '':
			if word in words:
				# Seen before: bump the overall count, then the count for this year.
				words[word]['num_times'] = words[word]['num_times'] + 1
				if year in words[word]['years']:
					words[word]['years'][year] = words[word]['years'][year] + 1
				else:
					words[word]['years'][year] = 1
			else:
				# First occurrence of this word in any address.
				map_to = {}
				map_to['num_times'] = 1
				map_to['years'] = {year : 1}
				words[word] = map_to

At this point, we have two dictionaries: one stores the word, the total number of times it has been used in all inaugural addresses, and the number of times it was used in each year that it was used; another stores the total number of words used in each year.

Finding Text Density and Ranking

The final step before analyzing our results is to convert our data from raw counts to word density, meaning the number of times a word was used divided by the total number of words used.

To do this, we loop through all of the words in the dictionary and divide each word's total count by the total number of words used in all addresses. Then, for each year in which the word was used, we divide the number of times it was used that year by the total number of words used that year.
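
The conversion could be sketched as follows. This version builds a new dictionary rather than overwriting the counts, and the function name to_densities, the "density" key, and the sample numbers are all assumptions made for illustration.

```python
def to_densities(words, words_per_year, total_words):
    """Convert raw counts into densities, returning a new dictionary."""
    densities = {}
    for word, info in words.items():
        densities[word] = {
            # Overall density: uses of the word / all words in all addresses.
            "density": info["num_times"] / total_words,
            # Per-year density: uses that year / all words used that year.
            "years": {
                year: count / words_per_year[year]
                for year, count in info["years"].items()
            },
        }
    return densities

# Invented counts, shaped like the two dictionaries described in this post.
words = {"freedom": {"num_times": 3, "years": {"1900": 1, "1904": 2}}}
words_per_year = {"1900": 4, "1904": 8}
result = to_densities(words, words_per_year, total_words=12)
print(result["freedom"]["density"])        # 0.25
print(result["freedom"]["years"]["1904"])  # 0.25
```

This relies on Python 3's true division. In Python 2, dividing two integers with / would truncate to zero unless one side is converted to a float first.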

Finally, we are ready to analyze this information. To be safe, I dumped my dictionary of the data about each word into a file as a JSON object. I also printed out the top n most frequently used words, where n is a variable set at the top of the file. I set it to ten.

We find these words by sorting the entire dictionary by total word density, from highest to lowest, and taking the first n items of the sorted array. I then print that list using a for loop.
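
A ranking step along those lines might look like this sketch. The function name top_words, the "density" key, and the sample values are hypothetical and are not results from the actual analysis.

```python
def top_words(densities, n=10):
    """Return the n (word, info) pairs with the highest overall density."""
    ranked = sorted(
        densities.items(),
        key=lambda pair: pair[1]["density"],
        reverse=True,  # highest density first
    )
    return ranked[:n]

# Invented densities, for illustration only.
sample = {
    "government": {"density": 0.004},
    "people": {"density": 0.006},
    "world": {"density": 0.002},
}
for word, info in top_words(sample, n=2):
    print(word, info["density"])
```

Sorting the (word, info) pairs rather than the bare keys keeps each word's data attached to it, which makes the printing loop simple.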

Displaying the Results

Along with printing those words, I also grab each word's density and the years it was used in, and put them into two arrays that I can plot on a line graph using the Matplotlib library. To plot a line graph, you just call plot(x, y), where x is an array of the x values and y is an array of the y values. Calling show() displays the graph.
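
Building the two arrays and plotting them could be sketched like this. The helper names series_for_word and plot_word, the axis labels, and the sample densities are assumptions. The Matplotlib import sits inside plot_word so that the array-building part runs even where Matplotlib isn't installed.

```python
def series_for_word(word_info):
    """Return (years, densities) as two parallel lists sorted by year."""
    years = sorted(word_info["years"], key=int)
    values = [word_info["years"][year] for year in years]
    return [int(year) for year in years], values

def plot_word(word, word_info):
    """Draw a line graph of one word's density over time."""
    # Imported here so series_for_word works without Matplotlib installed.
    import matplotlib.pyplot as plt
    x, y = series_for_word(word_info)
    plt.plot(x, y)
    plt.title(word)
    plt.xlabel("year")
    plt.ylabel("density")
    plt.show()

# Invented per-year densities, deliberately out of order.
sample = {"years": {"1904": 0.002, "1900": 0.001, "1908": 0.004}}
x, y = series_for_word(sample)
print(x, y)  # [1900, 1904, 1908] [0.001, 0.002, 0.004]
```

Sorting by year first matters because plot connects the points in the order they are given, and dictionary keys are not guaranteed to arrive in chronological order. Calling plot_word("freedom", sample) would then display the graph.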

For the purposes of this post, I also created an interactive graphic of the top ten most frequently used words using D3.js. I first created a bar graph of how many times each of the words is used per 100,000 words (including stop words). Next, I made each bar clickable, so if you click on it, a line graph of the density of that word appears.

Top Ten Most Frequently Used Words in U.S. Presidential Inaugural Addresses

Hint: Click on the bars!