Saturday, March 28, 2009

"The Social Semantic Web – Where Web 2.0 Meets Web 3.0"

I attended the Association for the Advancement of Artificial Intelligence (AAAI) Spring Symposia at Stanford University from March 23 through March 25, 2009, and participated in the symposium “The Social Semantic Web – Where Web 2.0 Meets Web 3.0.” “Web 2.0” refers to applications and technologies that have emerged on the Web in the last few years and that enable social networking, collaboration and user-provided content. This includes sites such as Facebook and Twitter, as well as Web logs and wikis. “Web 3.0” is more or less synonymous with the notion of the Semantic Web, where structured metadata associated with Web content can be used for reasoning and inference. The idea of the Semantic Web goes back to a 2001 article in Scientific American by Tim Berners-Lee, Jim Hendler and Ora Lassila. They described a world where agent-based applications use semantics-based metadata on the Web to reason, infer and present choices to people as they go through their daily activities. Much of the technology for enabling this vision is based on the principles of logic programming paired with Web-centric technology such as XML-based metadata.


The symposium was organized by Li Ding and Jie Bao from Rensselaer Polytechnic Institute and Mark Greaves from Vulcan Inc. Li Ding opened the discussion and described a situation where Semantic Web technologies may be poised to increase the range and effectiveness of Web 2.0 tools for information retrieval, social networking and collaboration. We spent the next two and a half days discussing examples of this technology and the issues its use introduces into how people interact with the Web.


A number of applications were described that bridge the gap between collaborative technology and semantics. Twine is a site that allows users to group links into what are called twines. A twine is a group of sites that are topically related. Tags are generated when a site is added to a twine, and domain ontologies are used to link different twines together and to recommend other twines that may interest a user. Radar Networks Inc. developed Twine, and its CEO, Nova Spivack, gave the first presentation. Twine looks like a very useful application. It is somewhat similar to del.icio.us in concept, but with explicit semantics.


Denny Vrandecic from Institut AIFB, Karlsruhe, Germany, described ongoing work on Semantic MediaWiki (SMW). SMW is an extension of MediaWiki that allows for semantic annotation of wiki data. Vrandecic is one of the original developers of Semantic MediaWiki and spoke about adding automated quality checks to the application.

Semantic MediaWiki was the basis for a number of other applications discussed at the symposium. One was Metavid.org, an “open video archive of the US Congress.” Metavid captures video and closed captioning of Congressional proceedings. Semantic MediaWiki’s extensions allow for categorical searches of recorded speeches.


The Halo Project, funded by Paul Allen’s Vulcan Inc. and sponsored by Mark Greaves, has developed extensions to Semantic MediaWiki that go a long way toward showing the power of embedding semantics in applications. The work was done by Ontoprise and they have produced a video of its features that is worth viewing.


Some of the applications discussed provide collaborative, distributed development environments for authoring ontologies. Tania Tudorache of the Stanford Center for Biomedical Informatics Research described Collaborative Protégé. Collaborative Protégé extends the Protégé ontology development environment to support “collaborative ontology editing as well as annotation of both ontology components and ontology changes.” Natasha Noy, who is also one of the prime movers behind Protégé, presented BioPortal, a repository of biomedical ontologies that allows users to critique posted ontologies, collaborate on ontology development, and submit mappings between ontologies. The same codebase that is behind BioPortal also supports the Open Ontology Repository (OOR), which is a domain-independent repository of ontologies. Nova Spivack of Radar Networks also mentioned a new site that they plan on standing up called Tripleforge, which, like Sourceforge, will support open source development of ontologies.


In regard to architecting systems that use semantics to leverage Web 2.0 features, a number of approaches kept coming up. Ontologies for describing tagging behavior by users were mentioned by a few of the presenters. This is a way to capture the relationships between taggers (two users who tag the same site with the same or similar tags) and the temporal dimension of tagging (“who tagged what tag when?”). Another common thread was defining a semantic layer to describe the syntactic or functional layers of a system. Hans-Georg Fill of the University of Vienna described a model-based approach for developing “Semantic Information Systems” using model-based tools that defined just such a layered architecture.


Some other applications described at the conference use existing collaborative technology, such as Wikipedia, to jumpstart Semantic Web applications. Tim Finin described an approach that he and his colleagues at the University of Maryland, Baltimore County developed that treats Wikipedia as an ontology. They call it Wikitology. They assert that Wikipedia represents a “consensus view” of topics arrived at via a “social process.” They use the existing categories defined in Wikipedia, along with links between articles, to discover the concepts, and the relationships between concepts, that describe article topics. A similar approach was described by Maria Grineva, Maxim Grinev and Dmitry Lizorkin from the Russian Academy of Sciences, where Wikipedia was used as a knowledge base to discover semantically related key terms for documents. In another paper, Jeremy Witmer and Jugal Kalita of the University of Colorado, Colorado Springs used a named entity recognizer to tag locations in Wikipedia articles and also used machine learning techniques to extract geospatial relations from the articles. They posit that disambiguated locations and extracted relations could then be used to add semantic, geospatial annotations to the articles to aid search or create map-based mashups of Wikipedia data.


Our team presented a paper that described how the location of bloggers could be inferred from location entity mentions in their blog posts. We described an experiment where we were able to correctly geolocate 61% of blogs based on a test set of ~800 blogs with known locations. While our work was somewhat tangential to the Semantic Web, it is a demonstration of the “inference problem,” where information that is not stated directly can be inferred from other available information. This raises issues of privacy given the explosion in the use of social networking sites such as Facebook and the proliferation of personal Web logs. Three other papers presented at the symposium addressed privacy and access control issues. Mary-Anne Williams of the University of Technology, Sydney, Australia, gave an excellent overview of privacy as it relates to Web-based business. Paul Groth of the University of Southern California discussed privacy obligation policies and described how the users of a social networking site might use them to control access to their personal data from outside of the site. Ching-man Au Yeung, Lalana Kagal, Nicholas Gibbins, and Nigel Shadbolt of the University of Southampton and MIT described a method for controlling access to photos on Flickr based on how photos are tagged, using a tagging ontology, FOAF, OpenID authentication and the AIR policy language.


Panels presented during the symposium addressed some cross-cutting issues for Web 2.0 and Semantic Web applications: usability, scale and privacy. On the 25th, the panel included Steve White of Radar Networks; Denny Vrandecic; Natasha Noy; Jamie Taylor, Minister of Information for Metaweb, the home of Freebase (an excellent open collaborative database); and Jeff Pollock of Oracle, the author of the recently published “The Semantic Web for Dummies.” This panel was dedicated to the topic of usability, but also addressed the issue of scale. All agreed that usability issues on the Semantic Web are the same as with Webs 1.0 and 2.0: simple is better, hide confusing bits like RDF and OWL tags, etc. Noy made the point, however, that there are different classes of users for semantic applications on the Web, such as the users of BioPortal and those actually involved in ontology development. A lot of time was spent talking about users of applications such as Excel and how even a killer application like the Semantic Web can be overtaken by simple, inelegant solutions. The issue of scale came down to how Semantic Web applications will handle billions of triples, and the difficulty of doing anything more than simple reasoning over such large amounts of data. Taylor described the power law phenomenon where some entities are overloaded with properties while most have only a few. This suggests the need for smart partitioning of resources based on their semantics. As far as the scalability of reasoning is concerned, full RDFS or OWL reasoning is probably too expensive, at least for large amounts of data. As one participant said, though, “a little bit of semantics” goes a long way, so basic relations such as subsumption and transitivity may be all that is required for most reasoning.


The next day’s panel included Paul Groth, Denny Vrandecic, Tim Finin and Rajesh Balakrishnan and touched on issues of privacy and trust. One conclusion of this discussion was that the structured metadata that comes with the Semantic Web, along with the ability to reason over the data – albeit, probably in small bites – will just multiply the inference problem. There was no real consensus on what can be done about that.


This symposium did a great job of framing how social computing and semantics are quickly coming together. There was quite a bit of excitement about Twine and the success of Semantic MediaWiki. There was no clear consensus whether this technology will revolutionize the user experience or just provide enabling technology to intelligently link applications and make current functionalities such as search more effective. For developers, however, there is a whole new universe of challenges here.

Monday, December 29, 2008

MediaWiki Search Configuration Issues

MediaWiki is easy to set up, and the search capability out of the box is OK. Some of the tweaks described here, however, may help make searches more useful.

The default database back end for MediaWiki is MySQL, and the search capability is based on MySQL's full-text search. Specifically, the MediaWiki database schema contains the searchindex table, which defines FULLTEXT indexes on the si_title and si_text columns:

CREATE TABLE /*$wgDBprefix*/searchindex (
  -- Key to page_id
  si_page int unsigned NOT NULL,

  -- Munged version of title
  si_title varchar(255) NOT NULL default '',

  -- Munged version of body text
  si_text mediumtext NOT NULL,

  UNIQUE KEY (si_page),
  FULLTEXT si_title (si_title),
  FULLTEXT si_text (si_text)

) TYPE=MyISAM;


See the link above for more details about FULLTEXT indices. The important point is that the article text (in the text table) is not searched. The munged text in the searchindex table is searched instead. Munged, in this case, means that Wiki tags, URLs and some language-specific characters are removed to facilitate searches. See the includes/UpdateSearch.php script in your MediaWiki distribution to see exactly what's done.

If you are loading pages programmatically into your Wiki, make sure the searchindex table is updated appropriately. It's best to take advantage of the includes/Article.php script here since it takes care of all the necessary bookkeeping. I've not done this myself, so it's best to do some homework on your own before proceeding.

By default, MySQL only indexes words in a FULLTEXT index that are at least 4 characters long (the ft_min_word_len setting; there is also an upper limit, ft_max_word_len, which is much larger). The minimum length of 4 can be a problem if you have a lot of three-letter abbreviations. MySQL also uses a large list of stop words. Stop words are very common words that are ignored by indexing programs. MySQL's default stop word list may be too restrictive for you, so a shorter list might improve search results.
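
To make the effect of these two settings concrete, here is a minimal Java sketch of the filtering idea. It is only a simulation under stated assumptions: the tokenizer (split on anything that is not a letter or digit) and the five stop words are hypothetical stand-ins, not MySQL's actual parser or its built-in list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class FullTextFilterSketch {
    // Illustrative stop words only; NOT MySQL's built-in list.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "and", "for", "with", "this"));

    // Returns the words that would survive a minimum-length check
    // and a stop word check, in their original order.
    static List<String> indexedWords(String text, int minWordLen) {
        List<String> kept = new ArrayList<>();
        // Assumed tokenizer: lowercase, then split on non-alphanumerics.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() >= minWordLen && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "The XML and RDF parsers for this wiki";
        System.out.println(indexedWords(text, 4)); // prints [parsers, wiki]
        System.out.println(indexedWords(text, 3)); // prints [xml, rdf, parsers, wiki]
    }
}
```

With a minimum length of 4, a search for "XML" or "RDF" has nothing to match against; dropping the minimum to 3 puts both abbreviations back in play.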

The minimum indexed word length and the stop word list are configurable under MySQL. Changing these system settings requires a restart of the server, as well as a rebuild of the searchindex table's indexes. Rebuilding the indexes can take a long time if you have a lot of data in your Wiki, so I would consider making these changes before you load the data.

Making these changes is easy. I'm using MediaWiki on a Windows box, just so you know.

First, edit the my.ini file (my.cnf on a Unix box) in the MySQL installation directory. Add the following options to the [mysqld] section of the file and then save it:

ft_min_word_len=3
ft_stopword_file="mysqlhome/stop-words.txt"


In this case, ft_stopword_file points to a file named stop-words.txt in the MySQL installation directory. For stop words I used the default set of English stop words used by Lucene:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

This is a compact and reasonable set of stop words and should improve upon the default MySQL list. This will increase the time required to index the searchindex table, however.

Next, restart the MySQL server. Do this via the mysqladmin command line tool or just open services under Windows Control Panel/Administrative Tools and restart the MySQL service.

Finally, reindex the searchindex table. The easiest way to do this is from the MySQL command line:

mysql> REPAIR TABLE searchindex QUICK;


Additional information about tweaking MySQL for full-text searches can be found here, though changing the minimum indexed word size and the stop word list should improve your search capability well enough.

Thursday, December 4, 2008

Using Conditional Random Fields for Sentiment Extraction

I found this paper to be very helpful in understanding how to use Conditional Random Fields (CRFs). Have a look. The authors are trying to extract the source of sentiment from sentences. Their approach also uses extraction patterns in addition to CRFs, but I'm not entirely convinced that the extraction patterns help all that much in increasing precision and recall. Especially helpful here is a good, detailed description of the features they use for the CRF. They used the Mallet toolkit for the CRFs, too.

Wednesday, December 3, 2008

Fun with Reification

Converting from one graph representation to another can be problematic when properties are allowed on edges in one representation but not in the other. I had to implement a service that queried a graph store that allowed edge properties and serialized the result to OWL. The client, in turn, had to convert the returned OWL to another native graph representation that also allows edge properties. OWL does not allow edge properties, so I had to deal with the problem of preserving the edge properties somehow.

Enter reification. What's reification? Basically, it's making statements about a statement. RDF-wise, it means turning a triple into the subject of another triple. If you have a triple <a,knows,b>, you can reify it as S and then say <S,isAbout,a>. I use Jena, a Java API for processing RDF and OWL, and have used its reification support to implement named graphs. There were some performance issues with large numbers of reified statements, but when reifying a single statement that carries only a few properties, there will probably not be much of a performance hit. I haven't tested that assertion, though, so take it with a grain of salt.

To preserve edge properties in OWL, you need to reify the triple that represents the edge in the RDF graph and then add triples for the edge properties, each with the reified statement as its subject. When I came across an edge in the source graph, I created a triple, or Statement in Jena parlance, describing the edge, <s,p,o>, where s is the source node resource, p is a property, and o is the target node resource (I'm implicitly assuming a directed graph):

Statement stmt = model.createStatement(s, p, o);
// createStatement doesn't add the statement to the
// model, so add it explicitly.
model.add(stmt);


I then reified the statement and added statements that had the reified statement as the subject for each edge property:


// reify the statement
ReifiedStatement reifiedStmt =
  stmt.createReifiedStatement();
// Add "edge" properties
Statement edgePropStmt = model.createStatement(reifiedStmt,
  someEdgeProperty, "foo");
model.add(edgePropStmt);
...


On the client side, I checked any statement that had an object property as the predicate for reification. If it was a reified statement, I knew I was looking at an edge, so I extracted the property values and added them to the edge in the target representation:

// Check for reified statement
if (stmt.isReified()) {
  RSIterator j = stmt.listReifiedStatements();
  while (j.hasNext()) {
    ReifiedStatement reifiedStmt = j.nextRS();
    // Walk every statement that has the reified statement as its subject
    StmtIterator k = reifiedStmt.listProperties();
    while (k.hasNext()) {
      Statement statement2 = k.nextStatement();
      if (!statement2.getPredicate().equals(RDF.subject)
        && !statement2.getPredicate().equals(RDF.predicate)
        && !statement2.getPredicate().equals(RDF.object)
        && !statement2.getPredicate().equals(RDF.type)) {
        // Add edge property to native graph representation
      }
    }
  }
}


The one thing to note here is that when you reify a triple <s,p,o> as S, it implies the triples <S,rdf:type,rdf:Statement>, <S,rdf:subject,s>, <S,rdf:predicate,p>, and <S,rdf:object,o>. You need to filter these properties out when processing the reified statement.
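
For readers without Jena at hand, the bookkeeping can be sketched in plain Java. This is only an illustration of the idea: the Triple class, the reify and edgeProperties helpers, and the "since" property are all hypothetical names made up for the sketch, and none of it is Jena's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ReificationSketch {
    // A bare-bones triple of strings; this is NOT a Jena class.
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() { return "<" + s + "," + p + "," + o + ">"; }
    }

    // The four predicates implied by reifying a triple.
    static final Set<String> BOOKKEEPING = new HashSet<>(Arrays.asList(
        "rdf:type", "rdf:subject", "rdf:predicate", "rdf:object"));

    // Reify triple t under the identifier id: returns the four implied triples.
    static List<Triple> reify(String id, Triple t) {
        List<Triple> out = new ArrayList<>();
        out.add(new Triple(id, "rdf:type", "rdf:Statement"));
        out.add(new Triple(id, "rdf:subject", t.s));
        out.add(new Triple(id, "rdf:predicate", t.p));
        out.add(new Triple(id, "rdf:object", t.o));
        return out;
    }

    // Keep only the triples that carry real edge properties.
    static List<Triple> edgeProperties(List<Triple> aboutStatement) {
        List<Triple> props = new ArrayList<>();
        for (Triple t : aboutStatement) {
            if (!BOOKKEEPING.contains(t.p)) {
                props.add(t);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Triple edge = new Triple("a", "knows", "b");
        List<Triple> about = reify("S", edge);
        about.add(new Triple("S", "since", "2008")); // hypothetical edge property
        System.out.println(edgeProperties(about));   // prints [<S,since,2008>]
    }
}
```

The filter mirrors the check in the client code: anything whose predicate is one of the four reification predicates describes the edge itself, and whatever is left is an edge property.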

Monday, December 1, 2008

SAAJ Performance Issues

Java 1.6 comes with the SOAP with Attachments API for Java (SAAJ). It's really easy to set up a standalone web service endpoint using SAAJ, and this tutorial (free registration required) tells you how to get one up and running.

For one of my projects I wanted a quick and dirty demo I could run from the command line, using Ant, that started a service and demonstrated a client call. What I ran into, though, was that it was taking forever for the client to access the response message body after the call. The result value was about 80K but it was still taking about three minutes of wall time for the call to SOAPMessage::getSOAPBody to complete! It turns out that this is a bug in Java 1.6 (I'm running 1.6.0_06 but I believe I saw the same problem under release _10 as well). The fix posted here works. I now get my data back in milliseconds rather than minutes.

Tuesday, June 17, 2008

Unwanted "xmlns=" Attribute in Elements After Transformation

I was doing a simple XSLT transformation that involved renaming elements. Starting with a basic example like:


<?xml version="1.0" encoding="UTF-8"?>
<mydoc xmlns="http://finegameofnil.blogger.com/xml-examples/rename">
 <foo someattr1="a" someattr2="b">
 </foo>
</mydoc>


I want to change "foo" to "foobar".

I run the stylesheet:


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fgon="http://finegameofnil.blogger.com/xml-examples/rename" version="2.0">
<xsl:import href="copy.xsl"/>
<xsl:output method="xml" version="1.0" standalone="yes" indent="yes" encoding="UTF-8"/>
<xsl:template match="fgon:foo">
 <xsl:element name="foobar">
  <xsl:apply-templates select="@* | node()">
  </xsl:apply-templates>
 </xsl:element>
</xsl:template>
</xsl:stylesheet>


I get:


<?xml version="1.0" encoding="UTF-8"?>
<mydoc xmlns="http://finegameofnil.blogger.com/xml-examples/rename">
 <foobar xmlns="" someattr1="a" someattr2="b">
 </foobar>
</mydoc>


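So what's with the xmlns=""? An empty namespace declaration undeclares the default namespace, which means that the element and its children are not in any namespace. It shows up here because xsl:element was given an unprefixed name and no namespace, and the stylesheet itself declares no default namespace, so foobar is created in no namespace. If you are renaming elements so that a document conforms to a schema change, the result will fail schema validation.

To prevent this I tried the following stylesheet, which explicitly sets the namespace of the new element to the namespace of the element being matched (here, the document's default namespace):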


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fgon="http://finegameofnil.blogger.com/xml-examples/rename" version="2.0">
<xsl:import href="copy.xsl"/>
<xsl:output method="xml" version="1.0" standalone="yes" indent="yes" encoding="UTF-8"/>
<xsl:template match="fgon:foo">
 <xsl:element name="foobar" namespace="{namespace-uri()}">
  <xsl:apply-templates select="@* | node()">
  </xsl:apply-templates>
 </xsl:element>
</xsl:template>
</xsl:stylesheet>


Finally, this gives me what I wanted:

<?xml version="1.0" encoding="UTF-8"?>
<mydoc xmlns="http://finegameofnil.blogger.com/xml-examples/rename">
 <foobar someattr1="a" someattr2="b">
 </foobar>
</mydoc>


The copy.xsl imported in the examples above is from Sal Mangano's "XSLT Cookbook - 2nd Edition".


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
 <!-- General purpose copy translation stylesheet.
Taken from XSLT Cookbook, 2nd Edition, page 275. -->
 <xsl:template match="node() | @*">
  <xsl:copy>
   <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>
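
If you want to try both variants without setting up files, the following Java sketch runs them through the XSLT processor that ships with the JDK (javax.xml.transform). It makes a few assumptions worth calling out: the copy template is inlined rather than imported, the stylesheet is declared as version 1.0 because the JDK's built-in processor implements XSLT 1.0, and the class and method names (RenameDemo, stylesheet, transform) are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RenameDemo {
    static final String NS = "http://finegameofnil.blogger.com/xml-examples/rename";

    static final String INPUT =
        "<mydoc xmlns=\"" + NS + "\"><foo someattr1=\"a\" someattr2=\"b\"/></mydoc>";

    // Builds the stylesheet; elementAttrs is spliced into the xsl:element tag
    // so the same text can be used with and without the namespace attribute.
    static String stylesheet(String elementAttrs) {
        return "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
             + " xmlns:fgon=\"" + NS + "\" version=\"1.0\">"
             // Inlined identity (copy) template
             + "<xsl:template match=\"node() | @*\">"
             + "<xsl:copy><xsl:apply-templates select=\"@* | node()\"/></xsl:copy>"
             + "</xsl:template>"
             // Rename foo to foobar
             + "<xsl:template match=\"fgon:foo\">"
             + "<xsl:element name=\"foobar\"" + elementAttrs + ">"
             + "<xsl:apply-templates select=\"@* | node()\"/>"
             + "</xsl:element>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Applies the stylesheet text to the XML text and returns the serialized result.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Without the namespace attribute: foobar is created in no namespace.
        System.out.println(transform(stylesheet(""), INPUT));
        // With it: foobar stays in the namespace of the matched foo element.
        System.out.println(transform(stylesheet(" namespace=\"{namespace-uri()}\""), INPUT));
    }
}
```

Comparing the two printed documents side by side is a quick way to confirm that the namespace attribute is what makes the difference, independent of the rest of the stylesheet.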
 
Creative Commons License
finegameofnil by Clay Fink is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.