Our task was to run a part-of-speech tagger on sentences extracted from blogs. We implemented our map/reduce in Java. The mapper class used the Stanford POS tagger to get parts-of-speech for each word in a sentence, with each sentence assigned a key that consisted of a unique blog post id with the sentence's relative position in the document as a suffix. The reducer just wrote the results of the tagging of each sentence to a file with the key.
Example Input - Raw sentences
2272096_0 For the most part, the traditional news outlets lead and the blogs follow, typically by 2.
2272096_1 5 hours, according to a new computer analysis of news articles and commentary on the Web during the last three months of the 2008 presidential campaign.
2272096_2 Skip to next paragraph Multimedia Graphic Picturing the News Cycle The finding was one of several in a study that Internet experts say is the first time the Web has been used to track — and try to measure — the news cycle, the process by which information becomes news, competes for attention and fades.
2272096_3 Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.
2272096_4 6 million mainstream media sites and blogs.
2272096_5 Some 90 million articles and blog posts, which appeared from August through October, were scrutinized with their phrase-finding software.
2272096_6 Frequently repeated short phrases, according to the researchers, are the equivalent of “genetic signatures??
2272096_7 for ideas, or memes, and story lines.
2272096_8 The biggest text-snippet surge in the study was generated by “lipstick on a pig.
Example Output - Tagged Sentences
2272096_0 For/IN the/DT most/JJS part,/VBP the/DT traditional/JJ news/NN outlets/NNS lead/VBP and/CC the/DT blogs/NNS follow,/VBP typically/RB by/IN 2./CD
2272096_1 5/CD hours,/NN according/VBG to/TO a/DT new/JJ computer/NN analysis/NN of/IN news/NN articles/NNS and/CC commentary/NN on/IN the/DT Web/NNP during/IN the/DT last/JJ three/CD months/NNS of/IN the/DT 2008/CD presidential/JJ campaign./NN
2272096_2 Skip/VB to/TO next/JJ paragraph/NN Multimedia/NNP Graphic/NNP Picturing/VBG the/DT News/NN Cycle/NN The/DT finding/NN was/VBD one/CD of/IN several/JJ in/IN a/DT study/NN that/IN Internet/NNP experts/NNS say/VBP is/VBZ the/DT first/JJ time/NN the/DT Web/NNP has/VBZ been/VBN used/VBN to/TO track/VB �/NN and/CC try/VB to/TO measure/VB �/SYM the/DT news/NN cycle,/VBD the/DT process/NN by/IN which/WDT information/NN becomes/VBZ news,/NN competes/VBZ for/IN attention/NN and/CC fades./NN
2272096_3 Researchers/NNS at/IN Cornell,/NNP using/VBG powerful/JJ computers/NNS and/CC clever/JJ algorithms,/NN studied/VBD the/DT news/NN cycle/NN by/IN looking/VBG for/IN repeated/VBN phrases/NNS and/CC tracking/VBG their/PRP$ appearances/NNS on/IN 1./CD
2272096_4 6/CD million/CD mainstream/NN media/NNS sites/NNS and/CC blogs./VB
2272096_5 Some/DT 90/CD million/CD articles/NNS and/CC blog/NN posts,/VBP which/WDT appeared/VBD from/IN August/NNP through/IN October,/NNP were/VBD scrutinized/VBN with/IN their/PRP$ phrase-finding/JJ software./NN
2272096_6 Frequently/RB repeated/VBN short/JJ phrases,/NN according/VBG to/TO the/DT researchers,/NN are/VBP the/DT equivalent/NN of/IN �genetic/JJ signatures??/NN
2272096_7 for/IN ideas,/NN or/CC memes,/NN and/CC story/NN lines./NN
2272096_8 The/DT biggest/JJS text-snippet/NN surge/NN in/IN the/DT study/NN was/VBD generated/VBN by/IN �lipstick/CD on/IN a/DT pig./NN
The example text was from the post "Study Measures the Chatter of the News Cycle" by Allen Jenkins.
The results here are based on 40K sentences. This is not a lot of data. The blog corpus these sentences were taken from, as of October 6, 2008, had 450K posts and 4.9M sentences. In truth, that's not even a lot of data. All that said, the results are shown below. Note that an EC2MR small instance is $0.015 an hour and a medium instance is $0.03 and hour. They round up, so you pay on an hour basis.
1 Small Instance - 97 minutes. Cost, $0.03
4 Small Instances - 39 minutes. Cost, $0.06
10 Small Instances - 16 minutes. Cost, $0.15
20 Small Instances - 9 minutes. Cost, $0.30
20 Medium Instances - 4 minutes. Cost $1.20.
Assuming this is linear - and I'm not sure if that's a totally safe assumption - 5M sentences on 20 medium instances will take 8 hours, 33 minutes and cost $150.00. Given how expensive time is, being able to get processing results in a day which would otherwise take multiple days or weeks is a real plus.
We will keep you up to date as to how this all works out. So far, the results are encouraging.
1 comment:
So, how is it working out? I hope you had a Merry Christmas!
Post a Comment