Wednesday, October 14, 2009


Jure Leskovec, Lars Backstrom and Jon Kleinberg published the paper "Meme-tracking and the Dynamics of the News Cycle" earlier this year. Their work revolves around how memes flow through online news sources and social media. The "meme" is a notion suggested by biologist Richard Dawkins that describes a basic unit of thought - an idea - as it is transmitted through human culture and is modified in ways similar to how genes change over time in biological systems. Leskovec et al. base their work on 90 million news stories and blog posts collected during the three months prior to the 2008 US presidential election. They extract memes by looking at short text phrases, and variations of those phrases, that appear with significant volume across news stories and blog posts. The most significant phrases from August through October 2008 are shown in this visualization, which uses the "ThemeRiver" technique:

Most mentioned phrases during the 2008 U.S. presidential campaign

The most significant meme found during this period was associated with then-candidate Barack Obama, who, comparing John McCain's policies to those of President George W. Bush, said, "But you know, you can -- you know, you can put lipstick on a pig; it's still a pig." The authors found some characteristic behavior for this and other memes. In the eight hours around the peak volume of a meme, the volume increases and decreases exponentially with time, with the decrease being somewhat slower. They also found that the peak volume in the blogosphere lags behind that of online media by about 2.5 hours. Another interesting phenomenon was how the media volume shows an additional peak after the blogs get hold of a story, followed by another peak in blog interest as the story bounces back into the blogosphere. They were also able to detect a small number of memes - 3.5% - that originated in the blogosphere and then spread to the online media.
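To make the shape of that peak concrete, here is a minimal Python sketch of a volume curve that rises exponentially before the peak and falls off more slowly afterwards, with the blog curve shifted 2.5 hours later. The function name meme_volume, the rise and decay rates, and the half-hour sampling grid are all illustrative assumptions of mine, not values or code from the paper; only the 2.5-hour lag comes from the findings described above.

```python
import math


def meme_volume(t, peak_time=0.0, peak_volume=1.0, rise_rate=1.0, decay_rate=0.5):
    """Toy volume curve around a peak (hypothetical rates, not fitted to data).

    Volume grows exponentially up to peak_time and decays exponentially
    afterwards; decay_rate < rise_rate makes the decline the slower side.
    """
    dt = t - peak_time
    rate = rise_rate if dt < 0 else decay_rate
    return peak_volume * math.exp(-rate * abs(dt))


BLOG_LAG_HOURS = 2.5  # lag of the blog peak behind the media peak, as reported above

# Sample every half hour in a window around the media peak (window size is arbitrary).
hours = [h * 0.5 for h in range(-16, 17)]
media = [meme_volume(t) for t in hours]
blogs = [meme_volume(t, peak_time=BLOG_LAG_HOURS) for t in hours]

# Locate each peak and report the lag between them.
media_peak = hours[media.index(max(media))]
blog_peak = hours[blogs.index(max(blogs))]
print("media peak at", media_peak, "h; blog peak at", blog_peak, "h")
```

With real data you would estimate the two rates and the lag from the observed counts rather than fixing them by hand as this sketch does.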

There's much that's important about this paper. Their methodology is impressive and they show how you can work with this large volume of messy data and use scalable approaches - in this case, a relatively simple graph partitioning heuristic - to get interesting results. They also know the right questions to ask of the data. Based on what they saw in the ThemeRiver visualization, they developed a simple model of the news cycle, based on the volume and recency of news stories, that basically reproduces the phenomenon captured in the visualization.
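As a rough illustration of what a volume-and-recency model can look like, here is a toy Python simulation. It is my own sketch, not the authors' model: the function name simulate_news_cycle, the linear dependence on story count, the geometric recency decay, and every parameter value are assumptions made for the example. At each time step one new story thread appears, and each source then picks a thread to cover with probability proportional to how many stories the thread already has times how recently it started.

```python
import random


def simulate_news_cycle(steps=200, sources=50, decay=0.8, seed=0):
    """Toy thread-competition simulation (all parameters are hypothetical).

    Returns the number of stories accumulated by each thread, indexed by
    the time step at which the thread was created.
    """
    rng = random.Random(seed)
    created = []  # creation time of each thread
    counts = []   # number of stories written about each thread so far
    for t in range(steps):
        # A new thread enters the cycle with a single story.
        created.append(t)
        counts.append(1)
        # Attractiveness = volume (story count) * recency (geometric decay with age).
        weights = [n * decay ** (t - t0) for n, t0 in zip(counts, created)]
        # Every source picks one thread to write about in this time step.
        for j in rng.choices(range(len(counts)), weights=weights, k=sources):
            counts[j] += 1
    return counts


counts = simulate_news_cycle()
print("threads:", len(counts), "stories:", sum(counts), "largest thread:", max(counts))
```

Plotting the per-step volume of the largest threads from a run like this is one way to compare such a model against a ThemeRiver-style picture, though the choice of decay and weighting functions matters a great deal.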

Beyond these sorts of geeky accomplishments, the fact is that they demonstrated how you can extract open source, online data and do quantitative analysis on cultural phenomena such as the news cycle. Analysis at this scale was not possible before the advent of the Web. The tools we have today, such as the map/reduce methodology and cloud computing, along with being able to build on years of work in NLP, graph theory and machine learning, make working with enormous amounts of text possible. What was previously the province of more qualitative disciplines such as Political Science, Sociology - not dissing you quantitative Soc guys - and Journalism can now be integrated with more quantitative disciplines. They say it well in the paper:

Moving beyond qualitative analysis has proven difficult here, and
the intriguing assertions in the social-science work on this topic
form a significant part of the motivation for our current approach.
Specifically, the discussions in this area had largely left open the
question of whether the “news cycle” is primarily a metaphorical
construct that describes our perceptions of the news, or whether
it is something that one could actually observe and measure. We
show that by tracking essentially all news stories at the right level
of granularity, it is indeed possible to build structures that closely
match our intuitive picture of the news cycle, making it possible to
begin a more formal and quantitative study of its basic properties.

Check out the Meme-tracker site when you get a chance.

Creative Commons License
finegameofnil by Clay Fink is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.