The default database back end for MediaWiki is MySQL, and MediaWiki's search is built on MySQL's full-text search capability. Specifically, the MediaWiki database schema contains the searchindex table, which defines FULLTEXT indices on the si_title and si_text columns:
CREATE TABLE /*$wgDBprefix*/searchindex (
-- Key to page_id
si_page int unsigned NOT NULL,
-- Munged version of title
si_title varchar(255) NOT NULL default '',
-- Munged version of body text
si_text mediumtext NOT NULL,
UNIQUE KEY (si_page),
FULLTEXT si_title (si_title),
FULLTEXT si_text (si_text)
) TYPE=MyISAM;
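To make the role of these indices concrete, here is a rough sketch of a full-text query against the table. It is only an illustration, not necessarily the query MediaWiki itself issues. The search term is made up, and the table name assumes an empty $wgDBprefix.

```sql
-- Illustrative only: look up pages whose munged body text contains "database".
-- If your wiki uses a table prefix, put it in front of the table name.
SELECT si_page, si_title
FROM searchindex
WHERE MATCH (si_text) AGAINST ('database' IN BOOLEAN MODE)
LIMIT 10;
```

The MATCH ... AGAINST clause is what makes MySQL use the FULLTEXT index on si_text, so the result depends entirely on which words made it into that index.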
See the link above for more details about FULLTEXT indices. The important point is that the article text (in the text table) is not searched. The munged text in the searchindex table is searched instead. Munged, in this case, means that wiki tags, URLs and some language-specific characters are removed to facilitate searches. See the includes/UpdateSearch.php script in your MediaWiki distribution to see exactly what's done.
If you are loading pages programmatically into your wiki, make sure the searchindex table is updated appropriately. It's best to take advantage of the includes/Article.php script here, since it takes care of all the necessary bookkeeping. I've not done this myself, so do some homework of your own before proceeding.
By default, MySQL only indexes words of 4 or more characters in a FULLTEXT index (very long words are skipped too, but that limit rarely matters). The minimum length of 4 can be a problem if you have a lot of three-letter abbreviations. MySQL also uses a large list of stop words. Stop words are very common words that are ignored by indexing programs. MySQL's default stop word list may be too restrictive for you, so a shorter list might improve search results.
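Before changing anything, it can help to check what the server is currently using. A minimal sketch, assuming you have a mysql prompt open on the server that hosts the wiki:

```sql
-- List the server's full-text settings, including
-- ft_min_word_len (minimum indexed word length) and
-- ft_stopword_file (the stop word list in use).
SHOW VARIABLES LIKE 'ft%';
```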
The minimum indexed word length and the stop word list are configurable under MySQL. Changing these system settings requires a restart of the server, as well as a rebuild of the searchindex table's indices. Rebuilding the index can take a long time if you have a lot of data in your wiki, so I would consider making these changes before you load the data.
Making these changes is easy. I'm using MediaWiki on a Windows box, just so you know.
First, edit the my.ini file (my.cnf on a Unix box) in the MySQL installation directory. Add the following options to the [mysqld] section of the file and then save the file:
ft_min_word_len=3
ft_stopword_file="mysqlhome/stop-words.txt"
In this case, ft_stopword_file points to a file in the MySQL installation directory, stop-words.txt. For stop words I used the default set of English stop words used by Lucene:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
This is a compact and reasonable set of stop words and should improve upon the default MySQL list. Because fewer words are excluded, it will increase the time required to index the searchindex table, however.
Next, restart the MySQL server. Do this via the mysqladmin command line tool or just open Services under Windows Control Panel/Administrative Tools and restart the MySQL service.
Finally, reindex the searchindex table. The easiest way to do this is from the MySQL command line:
mysql> REPAIR TABLE searchindex QUICK;
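Once the repair finishes, a quick sanity check along these lines can confirm that the new settings took effect. This is a sketch under assumptions: 'faq' is a placeholder for any three-letter term you know appears in your wiki, and the table name assumes no $wgDBprefix.

```sql
-- The value shown should now be 3 if the server picked up the edited config file.
SHOW VARIABLES LIKE 'ft_min_word_len';

-- Placeholder term: replace 'faq' with a three-letter word from your own pages.
-- Rows coming back suggest that short words are now in the rebuilt index.
SELECT si_page, si_title
FROM searchindex
WHERE MATCH (si_text) AGAINST ('faq' IN BOOLEAN MODE)
LIMIT 10;
```

If the variable still shows the old value, the server probably did not read the file you edited, and rebuilding the index will not change anything until that is sorted out.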
Additional information about tweaking MySQL for full-text searches can be found here, though changing the minimum indexed word size and the stop word list should improve your search capability well enough.
2 comments:
Hey there,
Could you tell me how to solve the search problem with * and %?
For example, I have *house, and the basic MediaWiki search cannot handle that.
Any ideas?
Thanks.