Friday, November 13, 2009

Hadoopification - Phase 1

We are interested in processing large amounts of text and doing non-trivial things with it. For example, we want to do part-of-speech tagging, entity tagging, parsing and extensive featurization of text from large text corpora. These sorts of operations can be both memory and CPU intensive. We can't wait a week for a job running on a standalone machine to finish processing multiple gigabytes of text, and we don't have easy access to a dedicated cluster of our own. The best option for us is to push the data up to a cloud and use the map/reduce paradigm to parallelize the processing. The Apache Hadoop project provides the framework for doing this, and Amazon's Elastic MapReduce makes it easy and cheap. We have some preliminary benchmarks based on a small set of sample text data that demonstrate the value of this approach.

Our task was to run a part-of-speech tagger on sentences extracted from blogs. We implemented our map/reduce job in Java. The mapper used the Stanford POS tagger to get the part of speech for each word in a sentence, with each sentence assigned a key consisting of a unique blog post id with the sentence's relative position in the post as a suffix. The reducer simply wrote each tagged sentence out with its key.
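
Here is a rough sketch of the shape of the code. This is simplified and from memory: the model file name is a placeholder, the input format is assumed to deliver the key (e.g. 2272096_0) and the sentence as separate fields, and the exact Stanford tagger API and output separator may differ by version.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

// Sketch only. Assumes the job's input format hands the mapper the sentence
// key (e.g. 2272096_0) and the raw sentence text as separate Text fields.
public class PosTagJob {

    public static class TagMapper extends Mapper<Text, Text, Text, Text> {
        private MaxentTagger tagger;

        @Override
        protected void setup(Context context) throws IOException {
            try {
                // Placeholder model file; the tagger is loaded once per map task.
                tagger = new MaxentTagger("left3words-wsj.tagger");
            } catch (Exception e) {
                throw new IOException("Could not load POS tagger model", e);
            }
        }

        @Override
        protected void map(Text key, Text sentence, Context context)
                throws IOException, InterruptedException {
            // tagString() returns the sentence with a POS tag attached to each token.
            String tagged = tagger.tagString(sentence.toString());
            context.write(key, new Text(tagged));
        }
    }

    // The reducer does no aggregation; it just writes each tagged sentence
    // out under its key.
    public static class PassThroughReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text tagged : values) {
                context.write(key, tagged);
            }
        }
    }
}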

Example Input - Raw sentences


2272096_0 For the most part, the traditional news outlets lead and the blogs follow, typically by 2.
2272096_1 5 hours, according to a new computer analysis of news articles and commentary on the Web during the last three months of the 2008 presidential campaign.
2272096_2 Skip to next paragraph Multimedia Graphic Picturing the News Cycle The finding was one of several in a study that Internet experts say is the first time the Web has been used to track — and try to measure — the news cycle, the process by which information becomes news, competes for attention and fades.
2272096_3 Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.
2272096_4 6 million mainstream media sites and blogs.
2272096_5 Some 90 million articles and blog posts, which appeared from August through October, were scrutinized with their phrase-finding software.
2272096_6 Frequently repeated short phrases, according to the researchers, are the equivalent of “genetic signatures??
2272096_7 for ideas, or memes, and story lines.
2272096_8 The biggest text-snippet surge in the study was generated by “lipstick on a pig.



Example Output - Tagged Sentences


2272096_0 For/IN the/DT most/JJS part,/VBP the/DT traditional/JJ news/NN outlets/NNS lead/VBP and/CC the/DT blogs/NNS follow,/VBP typically/RB by/IN 2./CD
2272096_1 5/CD hours,/NN according/VBG to/TO a/DT new/JJ computer/NN analysis/NN of/IN news/NN articles/NNS and/CC commentary/NN on/IN the/DT Web/NNP during/IN the/DT last/JJ three/CD months/NNS of/IN the/DT 2008/CD presidential/JJ campaign./NN
2272096_2 Skip/VB to/TO next/JJ paragraph/NN Multimedia/NNP Graphic/NNP Picturing/VBG the/DT News/NN Cycle/NN The/DT finding/NN was/VBD one/CD of/IN several/JJ in/IN a/DT study/NN that/IN Internet/NNP experts/NNS say/VBP is/VBZ the/DT first/JJ time/NN the/DT Web/NNP has/VBZ been/VBN used/VBN to/TO track/VB �/NN and/CC try/VB to/TO measure/VB �/SYM the/DT news/NN cycle,/VBD the/DT process/NN by/IN which/WDT information/NN becomes/VBZ news,/NN competes/VBZ for/IN attention/NN and/CC fades./NN
2272096_3 Researchers/NNS at/IN Cornell,/NNP using/VBG powerful/JJ computers/NNS and/CC clever/JJ algorithms,/NN studied/VBD the/DT news/NN cycle/NN by/IN looking/VBG for/IN repeated/VBN phrases/NNS and/CC tracking/VBG their/PRP$ appearances/NNS on/IN 1./CD
2272096_4 6/CD million/CD mainstream/NN media/NNS sites/NNS and/CC blogs./VB
2272096_5 Some/DT 90/CD million/CD articles/NNS and/CC blog/NN posts,/VBP which/WDT appeared/VBD from/IN August/NNP through/IN October,/NNP were/VBD scrutinized/VBN with/IN their/PRP$ phrase-finding/JJ software./NN
2272096_6 Frequently/RB repeated/VBN short/JJ phrases,/NN according/VBG to/TO the/DT researchers,/NN are/VBP the/DT equivalent/NN of/IN �genetic/JJ signatures??/NN
2272096_7 for/IN ideas,/NN or/CC memes,/NN and/CC story/NN lines./NN
2272096_8 The/DT biggest/JJS text-snippet/NN surge/NN in/IN the/DT study/NN was/VBD generated/VBN by/IN �lipstick/CD on/IN a/DT pig./NN



The example text was from the post "Study Measures the Chatter of the News Cycle" by Allen Jenkins.

The results here are based on 40K sentences, which is not a lot of data. The blog corpus these sentences were taken from, as of October 6, 2008, had 450K posts and 4.9M sentences. In truth, even that isn't a lot of data. All that said, the results are shown below. Note that an Elastic MapReduce small instance is $0.015 an hour and a medium instance is $0.03 an hour. Usage is rounded up, so you pay on a per-hour basis.

1 Small Instance - 97 minutes. Cost, $0.03
4 Small Instances - 39 minutes. Cost, $0.06
10 Small Instances - 16 minutes. Cost, $0.15
20 Small Instances - 9 minutes. Cost, $0.30
20 Medium Instances - 4 minutes. Cost, $1.20

Assuming this scales linearly - and I'm not sure that's a totally safe assumption - 5M sentences on 20 medium instances would take about 8 hours, 33 minutes and cost $150.00. Given how expensive time is, being able to get in a day results that would otherwise take multiple days or weeks is a real plus.
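
As a rough sanity check on that extrapolation (ignoring job startup overhead and the per-instance-hour rounding), scaling the 40K-sentences-in-4-minutes benchmark linearly gives something in the same ballpark:

// Back-of-the-envelope check of the linear extrapolation, using the
// 20-medium-instance benchmark above (40K sentences in ~4 minutes).
public class Extrapolation {
    public static void main(String[] args) {
        double benchmarkSentences = 40000.0;
        double benchmarkMinutes = 4.0;
        double targetSentences = 5000000.0;  // roughly the full blog corpus

        double minutes = (targetSentences / benchmarkSentences) * benchmarkMinutes;
        // Prints roughly 500 minutes, i.e. a bit over 8 hours.
        System.out.printf("~%.0f minutes (~%.1f hours)%n", minutes, minutes / 60.0);
    }
}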

We will keep you up to date as to how this all works out. So far, the results are encouraging.

Wednesday, October 14, 2009

"Meme-tracking"

Jure Leskovec, Lars Backstrom and Jon Kleinberg published the paper "Meme-tracking and the Dynamics of the News Cycle" earlier this year. Their work revolves around how memes flow through online news sources and social media. The "meme" is a notion suggested by biologist Richard Dawkins that describes a basic unit of thought - an idea - as it is transmitted through human culture and is modified in ways similar to how genes change over time in biological systems. Leskovec et al. base their work on 90 million news stories and blog posts collected during the three months prior to the 2008 US presidential election. They extract memes by looking at short text phrases, and variations of those phrases, that appear with significant volume across news stories and blog posts. The most significant phrases from August through October, 2008 are shown in this visualization using the "ThemeRiver" technique:

Most mentioned phrases during the 2008 U.S. presidential campaign

The most significant meme found during this period was associated with then-candidate Barack Obama, when he compared John McCain's policies to those of President George W. Bush by saying, "But you know, you can -- you know, you can put lipstick on a pig; it's still a pig." They found some characteristic behavior for this and other memes. In the eight hours around the peak volume of a meme, the volume increases and decreases exponentially with time, with the decrease being somewhat slower. They also found that the peak volume in the blogosphere lags behind that of online media by about 2.5 hours. Another interesting phenomenon was how the media volume shows an additional peak after the blogs get hold of a story, and then another peak in blog interest as the story bounces back into the blogosphere. They were also able to detect a small number of memes - 3.5% - that originated in the blogosphere and then spread to the online media.

There's much that's important about this paper. Their methodology is impressive: they show how you can work with a large volume of messy data and use scalable approaches - in this case, a relatively simple graph partitioning heuristic - to get interesting results. They also know the right questions to ask of the data. Based on what they saw in the ThemeRiver visualization, they developed a simple model of the news cycle, based on the volume and recency of news stories, that basically reproduces the phenomenon captured in the visualization.
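
To give a flavor of what tracking phrase variants involves, here is a toy sketch that simply groups phrases by substring containment. This is only an illustration of the general idea; the paper's actual method builds a phrase graph and applies a graph partitioning heuristic.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: treat a shorter phrase as a variant of a longer,
// already-seen phrase that contains it. The real approach in the paper is
// considerably more robust than this.
public class PhraseClusters {

    public static Map<String, List<String>> cluster(List<String> phrases) {
        // Longest phrases first, so shorter variants attach to them.
        List<String> sorted = new ArrayList<String>(phrases);
        Collections.sort(sorted, new Comparator<String>() {
            public int compare(String a, String b) {
                return b.length() - a.length();
            }
        });

        Map<String, List<String>> clusters = new LinkedHashMap<String, List<String>>();
        for (String phrase : sorted) {
            String root = null;
            for (String candidate : clusters.keySet()) {
                if (candidate.contains(phrase)) {  // phrase looks like a variant
                    root = candidate;
                    break;
                }
            }
            if (root == null) {
                clusters.put(phrase, new ArrayList<String>());
            } else {
                clusters.get(root).add(phrase);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList(
                "you can put lipstick on a pig, it's still a pig",
                "lipstick on a pig",
                "put lipstick on a pig")));
    }
}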

Beyond these sorts of geeky accomplishments, the fact is that they demonstrated how you can extract openly available, online data and do quantitative analysis on cultural phenomena such as the news cycle. Analysis at this scale was not possible before the advent of the Web. The tools we have today, such as the map/reduce methodology and cloud computing, along with being able to build on years of work in NLP, graph theory and machine learning, make working with enormous amounts of text possible. What was previously the province of more qualitative disciplines such as Political Science, Sociology - not dissing you quantitative Soc guys - and Journalism, can now be integrated with more quantitative disciplines. They say it well in the paper:

Moving beyond qualitative analysis has proven difficult here, and
the intriguing assertions in the social-science work on this topic
form a significant part of the motivation for our current approach.
Specifically, the discussions in this area had largely left open the
question of whether the “news cycle” is primarily a metaphorical
construct that describes our perceptions of the news, or whether
it is something that one could actually observe and measure. We
show that by tracking essentially all news stories at the right level
of granularity, it is indeed possible to build structures that closely
match our intuitive picture of the news cycle, making it possible to
begin a more formal and quantitative study of its basic properties.


Check out the Meme-tracker site when you get a chance.

Tuesday, October 6, 2009

Hadoop World 2009

Hadoop is a Java framework for implementing the Map/Reduce programming model described in the 2004 paper by Jeffrey Dean and Sanjay Ghemawat of Google. Map/Reduce lets you parallelize a large task by breaking it into small map functions, each of which transforms a key/value pair, and then combining the mappers’ output with a reduce function. Using a cluster of nodes - either your own or those in a computing cloud like Amazon’s EC2 - you can do large scale parallel processing.
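
The canonical illustration is word counting: the map step emits a (word, 1) pair for every token it sees, and the reduce step sums the counts for each word. A minimal sketch against Hadoop’s newer Java API looks something like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count: the mapper emits (word, 1) for each token in a line of input,
// and the reducer sums the 1s to get a total count per word.
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);  // emit (word, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));  // emit (word, total)
        }
    }
}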


I attended the Hadoop World conference in NYC on October 1, 2009 and came back very excited about this technology. We have had an interest in exploring Map/Reduce to support our work analyzing blog text, but have not until now had the resources to devote to it. After getting a feel for what this technology makes possible, we are ready to dive in. A copy of Tom White’s “Hadoop: The Definitive Guide” was provided, so I am on my way.


The message I took away from this conference is that with the vast amount of data available today, from genomic and other biological data to data in online social networks, we can use new computational tools to help us better understand people at the micro and macro levels. We need processing capabilities that scale to the petabyte level to do this, however, and technologies like Hadoop, coupled with cloud computing, are one way to approach this problem. The notion that “more data beats better algorithms” may or may not hold in all cases, but “big data” is here and we now have usable and relatively cheap ways to process it.


Hadoop grew out of the Lucene/Nutch community and became an official Apache project in 2006. Yahoo! soon afterward adopted it for their Web crawls. Since then a number of subprojects have started up under Hadoop, allowing for querying, analysis and data management.


Cloudera (“Cloud-era”, get it?) was the major organizer of Hadoop World, and Christophe Bisciglia, one of the principals of Cloudera, started off by giving us a review of the history of Hadoop and describing a number of the subprojects it has spawned. Cloudera’s business is based on providing Linux packages for deploying Hadoop on private servers and maintaining Amazon EC2 instances for use in the cloud. Cloudera has also introduced a browser-based GUI for managing Hadoop clusters: the Cloudera Desktop.


Peter Sirota, the manager of Amazon’s Elastic MapReduce, was up next and discussed Pig, an Apache/Hadoop subproject that provides a high-level data analysis layer for Map/Reduce, and Karmasphere, a NetBeans-based GUI for managing EC2 instances and Map/Reduce jobs.


Eric Baldeschwieler of Yahoo! described their use of Hadoop. Yahoo!, he said, is the “engine room” of Hadoop development and is the largest contributor to the open source project. They use Hadoop to process the data for Yahoo! Web search, using a 10,000-core Linux cluster with 5 petabytes of raw disk space. They also used Hadoop to win the Jim Gray Sort Benchmark competition, sorting one terabyte in 62 seconds and one petabyte in 16 hours.


Other presenters included Ashish Thusoo of Facebook. The amount of new data added to Facebook on a daily basis is astounding. In March of 2008 it was a modest 200 gigabytes a day. In April of 2009 it was over two terabytes a day, and in October of ’09 it is over four terabytes per day. They have made Hadoop an integral part of their processing pipeline in order to deal with this rate of growth. One new Apache/Hadoop subproject that they have been using is Hive. Hive is a data warehouse infrastructure that allows for data analysis and querying of data in Hadoop files.


There were a number of talks in the afternoon portion of the conference. Two that interested me in particular were Jake Hofman’s talk on Yahoo! Research’s use of Hadoop for social network analysis, and Charles Ward’s talk on a joint effort by Stony Brook University and General Sentiment (a Stony Brook commercial spinoff) to analyze entity references and sentiment in blogs and online media. Another, from Paul Brown of Booz Allen, discussed how they used Hadoop to calculate protein alignments. He also demonstrated a visualization of a Hadoop cluster in action that needs to be seen – I’ll see if I can find a link...


There is a rich set of tweets from the conference on Twitter. These give a detailed, minute-by-minute picture of what went on there. The one oddity about the conference – the yellow elephant in the room, if you will – was the absence of Google. There’s an interesting angle to that, I’m sure, but I have no idea what it might be.


This was a great conference and it looks like we are just at the start of a revolution in the way we deal with large volumes of data.

Saturday, March 28, 2009

"The Social Semantic Web – Where Web 2.0 Meets Web 3.0"

I attended the Association for the Advancement of Artificial Intelligence Spring Symposia, 2009, at Stanford University, from March 23 through March 25, 2009, and participated in the symposium “The Social Semantic Web – Where Web 2.0 Meets Web 3.0.” “Web 2.0” refers to applications and technologies that have emerged in the last few years on the Web that enable social networking, collaboration and user-provided content. This includes sites such as Facebook and Twitter, as well as Web logs and wikis. “Web 3.0” is more or less synonymous with the notion of the Semantic Web, where structured metadata associated with Web content can be used for reasoning and inference. The idea of the Semantic Web goes back to a 2001 paper in Scientific American by Tim Berners-Lee, Jim Hendler and Ora Lassila. They described a world where agent-based applications can use semantics-based metadata on the Web to reason, infer and present choices for people as they go through their daily activity. Much of the technology for enabling this vision is based on the principles of logic programming paired with Web-centric technology such as XML-based metadata.


The Symposium was organized by Li Ding and Jen Bao from Rensselaer Polytechnic Institute and Mark Greaves from Vulcan, Inc. Li Ding opened the discussion by describing how Semantic Web technologies may be poised to increase the range and effectiveness of Web 2.0 tools for information retrieval, social networking and collaboration. We spent the next two and a half days discussing examples of this technology and the issues its use introduces into how people interact with the Web.


A number of applications were described that bridge the gap between collaborative technology and semantics. Twine is a site that allows users to group links into what are called twines. A twine is a group of sites that are topically related. Tags are generated when a site is added to a twine, and domain ontologies are used to link different twines together and recommend to a user other twines that may interest them. Radar Networks Inc. developed Twine, and their CEO Nova Spivack gave the first presentation. Twine looks like a very useful application. It is somewhat similar to del.icio.us in concept, but with explicit semantics.


Denny Vrandecic from Institut AIFB, Karlsruhe, Germany described ongoing work on Semantic MediaWiki. SMW is an extension of MediaWiki that allows for semantic annotation of wiki data. Vrandecic is one of the original developers of Semantic MediaWiki and spoke about adding automated quality checks to the application.

Semantic MediaWiki was the basis for a number of other applications discussed at the symposium. One was Metavid.org, an “open video archive of the US Congress.” Metavid captures video and closed captioning of Congressional proceedings, and its Semantic MediaWiki extensions allow for categorical searches of recorded speeches.


The Halo Project, funded by Paul Allen’s Vulcan Inc. and sponsored by Mark Greaves, has developed extensions to Semantic MediaWiki that go a long way toward showing the power of embedding semantics in applications. The work was done by Ontoprise and they have produced a video of its features that is worth viewing.


Some of the applications discussed provide collaborative, distributed development environments for authoring ontologies. Tania Tudorache of the Stanford Center for Biomedical Informatics Research described Collaborative Protégé, which extends the Protégé ontology development environment to support “collaborative ontology editing as well as annotation of both ontology components and ontology changes.” Natasha Noy, who is also one of the prime movers behind Protégé, presented BioPortal, a repository of biomedical ontologies that allows users to critique posted ontologies, collaborate on ontology development, and submit mappings between ontologies. The same codebase that is behind BioPortal also supports the Open Ontology Repository (OOR), a domain-independent repository of ontologies. Nova Spivack of Radar Networks also mentioned a new site they plan on standing up called Tripleforge, which, like SourceForge, will support open source development of ontologies.


In regard to architecting systems that use semantics to leverage Web 2.0 features, a number of approaches kept coming up. Ontologies for describing tagging behavior by users were mentioned by a few of the presenters. This is a way to capture the relationships between taggers (two users who tag the same site with the same or similar tags) and the temporal dimension of tagging (“who tagged what tag when?”). Another common thread was defining a semantic layer to describe the syntactic or functional layers of a system. Hans-Georg Fill of the University of Vienna described a model-based approach for developing “Semantic Information Systems” using model-based tools that define just such a layered architecture.


Some other applications described at the conference use existing collaborative technology, such as Wikipedia, to jumpstart Semantic Web applications. Tim Finin described an approach that he and his colleagues at the University of Maryland, Baltimore County developed that treats Wikipedia as an ontology. They call it Wikitology. They assert that Wikipedia represents a “consensus view” of topics arrived at via a “social process.” They use the existing categories defined in Wikipedia, along with links between articles to discover the concepts, and the relationships between concepts, that describe article topics. A similar approach was described by Maria Grineva, Maxim Grinev and Dimitry Lizorkin from the Russian Academy of Sciences where Wikipedia was used as a Knowledge Base to discover semantically related key terms for documents. In another paper, Jeremy Witmer and Jugal Kalita of the University of Colorado, Colorado Springs used a named entity recognizer to tag locations in Wikipedia articles and also used machine learning techniques to extract geospatial relations from the articles. They posit that disambiguated locations and extracted relations could then be used to add semantic, geospatial annotations to the articles to aid search or create map-based mashups of Wikipedia data.


Our team presented a paper that described how the location of bloggers can be inferred from location entity mentions in their blog posts. We described an experiment where we were able to correctly geolocate 61% of blogs in a test set of ~800 blogs with known locations. While our work was somewhat tangential to the Semantic Web, it is a demonstration of the “inference problem,” where information not stated directly can be inferred from other available information. This raises issues of privacy, given the explosion of the use of social networking sites such as Facebook and the proliferation of personal Web logs. Three other papers presented at the symposium addressed privacy and access control issues. Mary-Anne Williams of the University of Technology, Sydney, Australia, gave an excellent overview of privacy as it relates to Web-based business. Paul Groth of the University of Southern California discussed privacy obligation policies and described how the users of a social networking site might use them to control access to their personal data from outside of the site. Ching-man Au Yeung, Lalana Kagal, Nicholas Gibbins, and Nigel Shadbolt of the University of Southampton and MIT described a method for controlling access to photos on Flickr based on how photos are tagged, using a tagging ontology, FOAF, OpenID authentication and the AIR policy language.
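
To make the geolocation inference mentioned above concrete, here is a toy sketch of the simplest possible version of the idea: guess a blogger’s location as the location entity mentioned most often in their posts. This is only an illustration and is much cruder than the approach in our paper.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch only: pick the most frequently mentioned location entity as the
// guessed blog location. The method described in our paper is more involved.
public class NaiveGeolocator {

    public static String guessLocation(List<String> locationMentions) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String location : locationMentions) {
            Integer count = counts.get(location);
            counts.put(location, count == null ? 1 : count + 1);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Prints "Baltimore" for this made-up list of extracted location mentions.
        System.out.println(guessLocation(Arrays.asList(
                "Baltimore", "Washington", "Baltimore", "New York", "Baltimore")));
    }
}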


Panels presented during the symposium addressed some cross-cutting issues for Web 2.0 and Semantic Web applications: usability, scale and privacy. On the 25th, the panel included Steve White of Radar Networks, Denny Vrandecic, Natasha Noy, Jamie Taylor, Minister of Information for Metaweb, the home of Freebase (an excellent open collaborative database), and Jeff Pollock of Oracle, author of the recently published “The Semantic Web for Dummies.” This panel was dedicated to the topic of usability, but also addressed the issue of scale. All agreed that usability issues on the Semantic Web are the same as with Webs 1.0 and 2.0: simple is better, hide confusing bits like RDF and OWL tags, and so on. Noy made the point, however, that there are different classes of users for semantic applications on the Web, such as the users of BioPortal and those actually involved in ontology development. A lot of time was spent talking about users of applications such as Excel and how even a killer application like the Semantic Web can be overtaken by simple, inelegant solutions. The issue of scale came down to how Semantic Web applications will handle billions of triples, and the difficulty of doing anything more than simple reasoning over such large amounts of data. Taylor described the power law phenomenon where some entities are overloaded with properties while most have only a few. This suggests the need for smart partitioning of resources based on their semantics. As far as the scalability of reasoning is concerned, full RDFS or OWL reasoning is probably too expensive, at least for large amounts of data. Still, as one participant said, “a little bit of semantics” goes a long way, so basic relations such as subsumption and transitivity may be all that is required for most reasoning.


The next day’s panel included Paul Groth, Denny Vrandecic, Tim Finin and Rajesh Balakrishnan and touched on issues of privacy and trust. One conclusion of this discussion was that the structured metadata that comes with the Semantic Web, along with the ability to reason over the data – albeit, probably in small bites – will just multiply the inference problem. There was no real consensus on what can be done about that.


This symposium did a great job of framing how social computing and semantics are quickly coming together. There was quite a bit of excitement about Twine and the success of Semantic MediaWiki. There was no clear consensus whether this technology will revolutionize the user experience or just provide enabling technology to intelligently link applications and make current functionalities such as search more effective. For developers, however, there is a whole new universe of challenges here.

 
Creative Commons License
finegameofnil by Clay Fink is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.