Friday, November 13, 2009

Hadoopification - Phase 1

We are interested in processing large amounts of text and doing non-trivial things with it. For example, we want to do part-of-speech tagging, entity tagging, parsing and extensive featurization of text from large text corpora. These sorts of operations can be both memory and CPU intensive. We can't wait a week for a job running on a standalone machine to finish processing multi-gigs of text. We also don't have easy access to a capitalized cluster. The best option for us is to push the data up to a cloud, and to use the map/reduce paradigm for parallelizing the processing. The Apache Hadoop project provides the framework for doing this, and Amazon's Elastic Map Reduce makes this easy and cheap. We have some preliminary benchmarks based on a small set of sample text data that demonstrate the value of this approach.

Our task was to run a part-of-speech tagger on sentences extracted from blogs. We implemented our map/reduce in Java. The mapper class used the Stanford POS tagger to get parts-of-speech for each word in a sentence, with each sentence assigned a key that consisted of a unique blog post id with the sentence's relative position in the document as a suffix. The reducer just wrote the results of the tagging of each sentence to a file with the key.

Example Input - Raw sentences


2272096_0 For the most part, the traditional news outlets lead and the blogs follow, typically by 2.
2272096_1 5 hours, according to a new computer analysis of news articles and commentary on the Web during the last three months of the 2008 presidential campaign.
2272096_2 Skip to next paragraph Multimedia Graphic Picturing the News Cycle The finding was one of several in a study that Internet experts say is the first time the Web has been used to track — and try to measure — the news cycle, the process by which information becomes news, competes for attention and fades.
2272096_3 Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.
2272096_4 6 million mainstream media sites and blogs.
2272096_5 Some 90 million articles and blog posts, which appeared from August through October, were scrutinized with their phrase-finding software.
2272096_6 Frequently repeated short phrases, according to the researchers, are the equivalent of “genetic signatures??
2272096_7 for ideas, or memes, and story lines.
2272096_8 The biggest text-snippet surge in the study was generated by “lipstick on a pig.



Example Output - Tagged Sentences


2272096_0 For/IN the/DT most/JJS part,/VBP the/DT traditional/JJ news/NN outlets/NNS lead/VBP and/CC the/DT blogs/NNS follow,/VBP typically/RB by/IN 2./CD
2272096_1 5/CD hours,/NN according/VBG to/TO a/DT new/JJ computer/NN analysis/NN of/IN news/NN articles/NNS and/CC commentary/NN on/IN the/DT Web/NNP during/IN the/DT last/JJ three/CD months/NNS of/IN the/DT 2008/CD presidential/JJ campaign./NN
2272096_2 Skip/VB to/TO next/JJ paragraph/NN Multimedia/NNP Graphic/NNP Picturing/VBG the/DT News/NN Cycle/NN The/DT finding/NN was/VBD one/CD of/IN several/JJ in/IN a/DT study/NN that/IN Internet/NNP experts/NNS say/VBP is/VBZ the/DT first/JJ time/NN the/DT Web/NNP has/VBZ been/VBN used/VBN to/TO track/VB �/NN and/CC try/VB to/TO measure/VB �/SYM the/DT news/NN cycle,/VBD the/DT process/NN by/IN which/WDT information/NN becomes/VBZ news,/NN competes/VBZ for/IN attention/NN and/CC fades./NN
2272096_3 Researchers/NNS at/IN Cornell,/NNP using/VBG powerful/JJ computers/NNS and/CC clever/JJ algorithms,/NN studied/VBD the/DT news/NN cycle/NN by/IN looking/VBG for/IN repeated/VBN phrases/NNS and/CC tracking/VBG their/PRP$ appearances/NNS on/IN 1./CD
2272096_4 6/CD million/CD mainstream/NN media/NNS sites/NNS and/CC blogs./VB
2272096_5 Some/DT 90/CD million/CD articles/NNS and/CC blog/NN posts,/VBP which/WDT appeared/VBD from/IN August/NNP through/IN October,/NNP were/VBD scrutinized/VBN with/IN their/PRP$ phrase-finding/JJ software./NN
2272096_6 Frequently/RB repeated/VBN short/JJ phrases,/NN according/VBG to/TO the/DT researchers,/NN are/VBP the/DT equivalent/NN of/IN �genetic/JJ signatures??/NN
2272096_7 for/IN ideas,/NN or/CC memes,/NN and/CC story/NN lines./NN
2272096_8 The/DT biggest/JJS text-snippet/NN surge/NN in/IN the/DT study/NN was/VBD generated/VBN by/IN �lipstick/CD on/IN a/DT pig./NN



The example text was from the post "Study Measures the Chatter of the News Cycle" by Allen Jenkins.

The results here are based on 40K sentences. This is not a lot of data. The blog corpus these sentences were taken from, as of October 6, 2008, had 450K posts and 4.9M sentences. In truth, that's not even a lot of data. All that said, the results are shown below. Note that an EC2MR small instance is $0.015 an hour and a medium instance is $0.03 and hour. They round up, so you pay on an hour basis.

1 Small Instance - 97 minutes. Cost, $0.03
4 Small Instances - 39 minutes. Cost, $0.06
10 Small Instances - 16 minutes. Cost, $0.15
20 Small Instances - 9 minutes. Cost, $0.30
20 Medium Instances - 4 minutes. Cost $1.20.

Assuming this is linear - and I'm not sure if that's a totally safe assumption - 5M sentences on 20 medium instances will take 8 hours, 33 minutes and cost $150.00. Given how expensive time is, being able to get processing results in a day which would otherwise take multiple days or weeks is a real plus.

We will keep you up to date as to how this all works out. So far, the results are encouraging.
 
Creative Commons License
finegameofnil by Clay Fink is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.