Wednesday, October 14, 2009


Jure Leskovec, Lars Backstrom and Jon Kleinberg published the paper "Meme-tracking and the Dynamics of the News Cycle" earlier this year. Their work revolves around how memes flow through online news sources and social media. The "meme" is a notion suggested by biologist Richard Dawkins that describes a basic unit of thought - an idea - as it is transmitted through human culture and modified in ways similar to how genes change over time in biological systems. Leskovec et al. base their work on 90 million news stories and blog posts collected during the three months prior to the 2008 US presidential election. They extract memes by looking at short text phrases, and variations of those phrases, that appear with significant volume across news stories and blog posts. The most significant phrases from August through October, 2008 are shown in this visualization using the "ThemeRiver" technique:

Most mentioned phrases during the 2008 U.S. presidential campaign
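Their extraction method builds a graph over phrases and applies a partitioning heuristic to group variants of the same quote together. As a rough illustration of the idea - not the paper's actual algorithm - here is a toy grouping that attaches each phrase to the shortest phrase it contains as a substring (the phrases and the grouping rule here are mine):

```python
# Toy illustration of grouping phrase variants (NOT the paper's actual
# graph-partitioning heuristic): attach each phrase to the shortest
# earlier phrase it contains as a substring.
def cluster_phrases(phrases):
    clusters = {}  # root phrase -> list of variants
    for p in sorted(phrases, key=len):
        for root in clusters:
            if root in p:          # p is a longer variant of root
                clusters[root].append(p)
                break
        else:
            clusters[p] = [p]      # p starts a new cluster
    return clusters

variants = [
    "lipstick on a pig",
    "you can put lipstick on a pig",
    "put lipstick on a pig, it's still a pig",
]
print(cluster_phrases(variants))
# {'lipstick on a pig': [all three variants]}
```

The real system has to cope with near-matches and overlaps rather than exact substrings, which is where the graph partitioning comes in.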

The most significant meme found during this period was associated with then-candidate Barack Obama, when he compared John McCain's policies to those of President George W. Bush by saying, "But you know, you can -- you know, you can put lipstick on a pig; it's still a pig." They found some characteristic behavior for this and other memes. In the eight hours around the peak volume of a meme, the volume increases and decreases exponentially with time, with the decrease being somewhat slower. They also found that the peak volume in the blogosphere lags behind that of online media by about 2.5 hours. Another interesting phenomenon was how the media volume shows an additional peak after the blogs get hold of a story, and then another peak in blog interest as the story bounces back into the blogosphere. They were also able to detect a small number of memes - 3.5% - that originated in the blogosphere and then spread to the online media.
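The rise-and-fall shape they describe is easy to picture as a stylized curve: exponential growth before the peak and a somewhat slower exponential decay after it. The parameters below are illustrative, not fitted to the paper's data:

```python
import math

# Stylized meme-volume curve around its peak (illustrative parameters,
# not fitted to the paper's data): exponential rise before the peak,
# somewhat slower exponential decay after it.
def volume(t_hours, peak=1000.0, rise=1.2, decay=0.8):
    if t_hours < 0:
        return peak * math.exp(rise * t_hours)   # ramp-up toward the peak
    return peak * math.exp(-decay * t_hours)     # slower fall-off

for t in (-4, -2, 0, 2, 4):
    print(t, round(volume(t), 1))
```

Because the decay constant is smaller than the rise constant, the curve two hours after the peak sits higher than it did two hours before - the asymmetry the authors observed.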

There's much that's important about this paper. Their methodology is impressive: they show how you can work with a large volume of messy data and use scalable approaches - in this case, a relatively simple graph partitioning heuristic - to get interesting results. They also know the right questions to ask of the data. Based on what they saw in the ThemeRiver visualization, they developed a simple model of the news cycle, based on the volume and recency of news stories, that essentially duplicates the phenomena captured in the visualization.
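To get a feel for what a volume-and-recency model can look like, here is a toy simulation of my own devising (the update rule and parameters are illustrative, not the paper's): at each step a source picks a story thread with probability proportional to its current volume discounted by how long it has gone unmentioned, so hot-but-fresh threads crowd out older ones.

```python
import random

# Toy volume-plus-recency news-cycle simulation (illustrative only,
# not the model from the paper): each step, pick a thread with
# probability proportional to volume * decay**age, then bump its
# volume and reset its recency clock.
def simulate(n_threads=5, steps=200, decay=0.9, seed=42):
    random.seed(seed)
    volume = [1.0] * n_threads
    age = [0] * n_threads
    history = []
    for _ in range(steps):
        weights = [v * (decay ** a) for v, a in zip(volume, age)]
        pick = random.choices(range(n_threads), weights=weights)[0]
        volume[pick] += 1.0            # another mention of this thread
        for i in range(n_threads):
            age[i] += 1                # everything else gets staler
        age[pick] = 0                  # fresh mention resets recency
        history.append(pick)
    return volume, history

vol, hist = simulate()
print(vol)
```

Even this crude version produces bursts of attention on one thread followed by handoffs to another - the qualitative rise-and-fall of a news cycle.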

Beyond these sorts of geeky accomplishments, the fact is that they demonstrated how you can extract openly available online data and do quantitative analysis of cultural phenomena such as the news cycle. Analysis at this scale was not possible before the advent of the Web. The tools we have today, such as the map/reduce methodology and cloud computing, along with being able to build on years of work in NLP, graph theory and machine learning, make working with enormous amounts of text possible. What was previously the province of more qualitative disciplines such as Political Science, Sociology - not dissing you quantitative Soc guys - and Journalism can now be integrated with more quantitative disciplines. They say it well in the paper:

Moving beyond qualitative analysis has proven difficult here, and
the intriguing assertions in the social-science work on this topic
form a significant part of the motivation for our current approach.
Specifically, the discussions in this area had largely left open the
question of whether the “news cycle” is primarily a metaphorical
construct that describes our perceptions of the news, or whether
it is something that one could actually observe and measure. We
show that by tracking essentially all news stories at the right level
of granularity, it is indeed possible to build structures that closely
match our intuitive picture of the news cycle, making it possible to
begin a more formal and quantitative study of its basic properties.

Check out the Meme-tracker site when you get a chance.

Tuesday, October 6, 2009

Hadoop World 2009

Hadoop is a Java framework implementing the Map/Reduce programming model described in the 2004 paper by Jeffrey Dean and Sanjay Ghemawat of Google. Map/Reduce allows you to parallelize a large task by breaking it into small map functions that each transform a key/value pair, then combining the mapped results using a reduce function. Using a cluster of nodes - either your own or those in a computing cloud like Amazon’s EC2 - you can do large scale parallel processing.
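The model itself can be sketched without Hadoop at all. Here is a minimal, single-machine word count in the Map/Reduce style (purely illustrative - the function names and data are mine): the map step emits (word, 1) pairs, the shuffle groups them by key, and the reduce step sums each group.

```python
from collections import defaultdict

# Minimal in-process sketch of the Map/Reduce word-count pattern.
# No Hadoop involved; this just shows the shape of the model.
def map_fn(_key, line):
    for word in line.lower().split():
        yield word, 1            # emit (word, 1) for each occurrence

def reduce_fn(word, counts):
    return word, sum(counts)     # combine all counts for one word

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):   # map phase
            groups[k].append(v)           # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce

lines = [(0, "the news cycle"), (1, "the blogosphere and the news")]
print(map_reduce(lines, map_fn, reduce_fn))
# {'the': 3, 'news': 2, 'cycle': 1, 'blogosphere': 1, 'and': 1}
```

Hadoop's contribution is running exactly this pattern across thousands of machines, with the framework handling the shuffle, fault tolerance and data distribution.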

I attended the Hadoop World conference in NYC on October 1, 2009 and came back very excited about this technology. We have had an interest in exploring Map/Reduce to support our work analyzing blog text, but until now have not had the resources to devote to it. After getting a feel for what this technology makes possible, we are ready to dive in. A copy of Tom White’s “Hadoop: The Definitive Guide” was provided, so I am on my way.

The message I took away from this conference is that with the vast amount of data available today, from genomic and other biological data to data in online social networks, we can use new computational tools to help us better understand people at the micro and macro levels. We need processing capabilities that scale to the petabyte level to do this, however, and technologies like Hadoop, coupled with cloud computing, are one way to approach this problem. The notion that “more data beats better algorithms” may or may not hold in all cases, but “big data” is here and we now have usable and relatively cheap ways to process it.

Hadoop grew out of the Lucene/Nutch community and became an official Apache project in 2006. Yahoo! soon afterward adopted it for their Web crawls. Since then a number of subprojects have started up under Hadoop, allowing for querying, analysis and data management.

Cloudera (“Cloud-era”, get it?) was the major organizer of Hadoop World, and Christophe Bisciglia, one of the principals of Cloudera, started off by giving us a review of the history of Hadoop and describing a number of the subprojects it has spawned. Cloudera’s business is based on providing Linux packages for deploying Hadoop on private servers and maintaining Amazon EC2 instances for use in the cloud. Cloudera has also introduced a browser-based GUI for managing Hadoop clusters: the Cloudera Desktop.

Peter Sirota, the manager of Amazon’s Elastic MapReduce, was up next and discussed Pig, an Apache/Hadoop subproject that provides a high-level data analysis layer for Map/Reduce, and Karmasphere, a NetBeans-based GUI for managing EC2 instances and Map/Reduce jobs.

Eric Baldeschwieler of Yahoo! described their use of Hadoop. Yahoo!, he said, is the “engine room” of Hadoop development and is the largest contributor to the open source project. They use Hadoop to process the data for the Yahoo! Web search, using a 10,000 core Linux cluster with 5 petabytes of raw disk space. They also used Hadoop to win the Jim Gray Sort Benchmark competition, sorting one terabyte in 62 seconds and one petabyte in 16 hours.

Other presenters included Ashish Thusoo of Facebook. The amount of new data added to Facebook on a daily basis is astounding. In March of 2008 it was a modest 200 gigabytes a day; in April of 2009 it was over two terabytes a day; and by October of ’09 it was over four terabytes per day. They have made Hadoop an integral part of their processing pipeline in order to deal with this rate of growth. One new Apache/Hadoop subproject they have been using is Hive, a data warehouse infrastructure that allows for analysis and querying of data stored in Hadoop files.

There were a number of talks in the afternoon portion of the conference. Two that particularly interested me were Jake Hofman’s talk on Yahoo! Research’s use of Hadoop for social network analysis, and Charles Ward’s talk on a joint effort by Stony Brook University and General Sentiment (a Stony Brook commercial spinoff) to analyze entity references and sentiment in blogs and online media. Another, from Paul Brown of Booz Allen, discussed how they used Hadoop for calculating protein alignments. He also demonstrated a visualization of a Hadoop cluster in action that needs to be seen – I’ll see if I can find a link...

There is a rich set of tweets from the conference on Twitter. These give a detailed, minute-by-minute picture of what went on there. The one oddity about the conference – the yellow elephant in the room, if you will – was the absence of Google. There’s an interesting angle to that, I’m sure, but I have no idea what it might be.

This was a great conference and it looks like we are just at the start of a revolution in the way we deal with large volumes of data.

Creative Commons License
finegameofnil by Clay Fink is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.