Monday, December 29, 2008

MediaWiki Search Configuration Issues

MediaWiki is easy to set up, and the search capability out of the box is OK. Some of the tweaks described here, however, may help make searches more useful.

The default database back end for MediaWiki is MySQL, and the search capability is based on MySQL's full-text search. Specifically, the MediaWiki database schema contains the searchindex table, which defines FULLTEXT indices on the si_title and si_text columns:

CREATE TABLE /*$wgDBprefix*/searchindex (
  -- Key to page_id
  si_page int unsigned NOT NULL,

  -- Munged version of title
  si_title varchar(255) NOT NULL default '',

  -- Munged version of body text
  si_text mediumtext NOT NULL,

  UNIQUE KEY (si_page),
  FULLTEXT si_title (si_title),
  FULLTEXT si_text (si_text)

) TYPE=MyISAM;


See the MySQL documentation for more details about FULLTEXT indices. The important point is that the article text (in the text table) is not searched; the munged text in the searchindex table is searched instead. Munged, in this case, means that wiki markup, URLs, and some language-specific characters are removed to facilitate searches. See the includes/SearchUpdate.php script in your MediaWiki distribution to see exactly what's done.

If you are loading pages programmatically into your Wiki, make sure the searchindex table is updated appropriately. It's best to take advantage of the includes/Article.php script here, since it takes care of all the necessary bookkeeping. I've not done this myself, so do some homework of your own before proceeding.

By default, MySQL will only index words in a FULLTEXT index that are at least four characters long. The minimum length of 4 can be a problem if you have a lot of three-letter abbreviations. MySQL also uses a large list of stop words: very common words that are ignored by indexing programs. MySQL's default stop word list may be too restrictive for you, so a shorter list might improve search results.

The minimum indexed word length and the stop word list are configurable under MySQL. Changing these system settings requires a restart of the server, as well as a rebuild of the searchindex table's indices. Rebuilding the index can take a long time if you have a lot of data in your Wiki, so consider making these changes before you load the data.

Making these changes is easy. Just so you know, I'm running MediaWiki on a Windows box.

First, edit the my.ini file (my.cnf on a Unix box) in the MySQL installation directory. Add the following options to the file and then save the file:

ft_min_word_len=3
ft_stopword_file="mysqlhome/stop-words.txt"


In this case, ft_stopword_file points to a file in the MySQL installation directory, stop-words.txt. For stop words I used the default set of English stop words used by Lucene:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

This is a compact and reasonable set of stop words and should improve on the default MySQL list. Since fewer words are excluded, however, it will increase the time required to index the searchindex table.

Next, restart the MySQL server. Do this via the mysqladmin command line tool, or just open Services under Windows Control Panel/Administrative Tools and restart the MySQL service.

Finally, reindex the searchindex table. The easiest way to do this is from the MySQL command line:

mysql> REPAIR TABLE searchindex QUICK;
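
If you find yourself repeating the check-and-rebuild cycle, it can be scripted. Below is a minimal sketch using JDBC and MySQL Connector/J; the connection URL, credentials, and database name (wikidb) are placeholders for your own setup, not anything MediaWiki defines:

import java.sql.*;

public class RebuildSearchIndex {
  public static void main(String[] args) throws Exception {
    // Older Connector/J drivers need the driver class loaded explicitly
    Class.forName("com.mysql.jdbc.Driver");

    // Placeholder URL and credentials - substitute your own
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/wikidb", "wikiuser", "secret");
    Statement stmt = conn.createStatement();

    // Confirm the full-text settings took effect after the restart
    ResultSet rs = stmt.executeQuery("SHOW VARIABLES LIKE 'ft\\_%'");
    while (rs.next()) {
      System.out.println(rs.getString(1) + " = " + rs.getString(2));
    }
    rs.close();

    // Rebuild the FULLTEXT indices on the searchindex table
    stmt.execute("REPAIR TABLE searchindex QUICK");

    stmt.close();
    conn.close();
  }
}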


Additional information about tweaking MySQL for full-text searches can be found in the MySQL manual's section on fine-tuning full-text search, though changing the minimum indexed word size and the stop word list should improve your search capability well enough.

Thursday, December 4, 2008

Using Conditional Random Fields for Sentiment Extraction

I found this paper to be very helpful in understanding how to use Conditional Random Fields. Have a look. The authors are trying to extract the source of sentiment from sentences. Their approach uses extraction patterns in addition to CRFs, but I'm not entirely convinced that the extraction patterns help all that much in improving precision and recall. Especially helpful here is a good, detailed description of the features they use for the CRF. They used the Mallet toolkit for the CRFs, too.

Wednesday, December 3, 2008

Fun with Reification

Converting from one graph representation to another can be problematic when properties are allowed on edges in one representation but not in the other. I had to implement a service that queried a graph store that allowed edge properties and serialized the result to OWL. The client, in turn, had to convert the returned OWL to another native graph representation that also allows edge properties. OWL does not allow edge properties, so I had to deal with the problem of preserving them somehow.

Enter reification. What's reification? Basically, it's making statements about a statement. RDF-wise, it means turning a triple into the subject of another triple: if you have a triple <a,knows,b>, you can reify it as S and say <S,isAbout,a>. I use Jena, a Java API for processing RDF and OWL, and have used its reification support to implement named graphs. There were some performance issues there with large numbers of reified statements, but for reifying a single statement, as long as there are not a large number of properties on the reified statement, there probably won't be much of a performance hit. I haven't tested that assertion, though, so take it with a grain of salt.
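
To make that concrete, here is a minimal sketch, using Jena, of reifying the <a,knows,b> triple above and attaching a statement to it. The URIs and property names are made up for illustration:

import com.hp.hpl.jena.rdf.model.*;

public class ReificationSketch {
  public static void main(String[] args) {
    Model model = ModelFactory.createDefaultModel();
    Resource a = model.createResource("http://example.org/a");
    Resource b = model.createResource("http://example.org/b");
    Property knows = model.createProperty("http://example.org/knows");
    Property isAbout = model.createProperty("http://example.org/isAbout");

    // The triple <a,knows,b>
    Statement stmt = model.createStatement(a, knows, b);
    model.add(stmt);

    // Reify the triple as S and say <S,isAbout,a>
    ReifiedStatement s = stmt.createReifiedStatement();
    s.addProperty(isAbout, a);

    // Dump the model to see what was actually asserted
    model.write(System.out, "N-TRIPLE");
  }
}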

To preserve edge properties in OWL, you reify the triple that represents the edge in the RDF graph and then add triples representing the edge properties, with the reified statement as the subject. When I came across an edge in the source graph, I created a triple, or Statement in Jena parlance, describing the edge, <s,p,o>, where s is the source node resource, p is a property, and o is the target node resource (I'm implicitly assuming a directed graph):

Statement stmt = model.createStatement(s, p, o);
// createStatement() doesn't add the statement to the
// model, so add it explicitly.
model.add(stmt);


I then reified the statement and added statements that had the reified statement as the subject for each edge property:


// Reify the statement
ReifiedStatement reifiedStmt =
  stmt.createReifiedStatement();
// Add the "edge" properties
Statement edgePropStmt = model.createStatement(reifiedStmt,
  someEdgeProperty, "foo");
model.add(edgePropStmt);
...


On the client side, I checked each statement that had an object property as its predicate to see whether it had been reified. If it had, I knew I was looking at an edge, so I extracted the property values and added them to the edge in the target representation:

// Check whether the statement has been reified
if (stmt.isReified()) {
  RSIterator rsIter = stmt.listReifiedStatements();
  while (rsIter.hasNext()) {
    ReifiedStatement reified = rsIter.nextRS();
    // Walk the properties attached to the reified statement
    StmtIterator propIter = reified.listProperties();
    while (propIter.hasNext()) {
      Statement statement2 = propIter.nextStatement();
      // Skip the triples implied by reification itself
      // (RDF constants come from com.hp.hpl.jena.vocabulary.RDF)
      if (!statement2.getPredicate().equals(RDF.subject)
          && !statement2.getPredicate().equals(RDF.predicate)
          && !statement2.getPredicate().equals(RDF.object)
          && !statement2.getPredicate().equals(RDF.type)) {
        // Add edge property to native graph representation
      }
    }
  }
}


The one thing to note here is that when you reify a triple <s,p,o> as S, it implies the triples <S,rdf:type,rdf:Statement>, <S,rdf:subject,s>, <S,rdf:predicate,p>, and <S,rdf:object,o>. You need to filter these properties out when processing the reified statement.

Monday, December 1, 2008

SAAJ Performance Issues

Java 1.6 comes with the SOAP with Attachments API for Java (SAAJ). It's really easy to set up a stand-alone web service endpoint using SAAJ, and this tutorial (free registration required) tells you how to get one up and running.

For one of my projects I wanted a quick and dirty demo I could run from the command line, using Ant, that started a service and demonstrated a client call. What I ran into, though, was that it was taking forever for the client to access the response message body after the call. The response was only about 80K, but it was still taking about three minutes of wall time for the call to SOAPMessage::getSOAPBody to complete! It turns out that this is a bug in Java 1.6 (I'm running 1.6.0_06, but I believe I saw the same problem under release _10 as well). The fix posted here works. I now get my data back in milliseconds rather than minutes.
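
For reference, the client side of the demo boils down to something like the sketch below. The endpoint URL, namespace, and operation name are made up for illustration; the interesting part is the timed call to getSOAPBody() on the response, which is where the slowdown showed up:

import java.net.URL;
import javax.xml.soap.*;

public class SaajClientDemo {
  public static void main(String[] args) throws Exception {
    SOAPConnectionFactory factory = SOAPConnectionFactory.newInstance();
    SOAPConnection connection = factory.createConnection();

    // Build a simple request message (hypothetical operation and namespace)
    MessageFactory messageFactory = MessageFactory.newInstance();
    SOAPMessage request = messageFactory.createMessage();
    request.getSOAPBody().addChildElement("getData", "demo",
        "http://example.com/demo");
    request.saveChanges();

    // Call the endpoint (hypothetical URL)
    SOAPMessage response = connection.call(request,
        new URL("http://localhost:8080/demo"));

    // Time the access to the response body
    long start = System.currentTimeMillis();
    SOAPBody responseBody = response.getSOAPBody();
    System.out.println("getSOAPBody took "
        + (System.currentTimeMillis() - start) + " ms");

    connection.close();
  }
}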
 
Creative Commons License
finegameofnil by Clay Fink is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.