We heart ferrets
We recently added full archive search to Lingr, and I thought I'd take a moment to talk about the technical details of that, for those who are interested.
At Lingr, everything said in the chatrooms is saved into our database. This enables you to browse the archives of a room to recall some recent conversation, or to find out what someone else said about something. So we've got around three million user utterances sitting in our database, and we thought, why not unlock those and let people search them?
Our first thought was to just use MySQL's fulltext indexing system. But, as it turns out, fulltext indexing only works on MyISAM tables, and our utterance table is InnoDB, so that was out the window.
So we started looking for a text indexing system that could work for us. What we found was Ferret, a Ruby port of Apache Lucene. Using it together with the excellent acts-as-ferret (AAF) plugin for ActiveRecord, we were able to integrate Ferret/AAF into Lingr in about two weeks.
The one issue that complicated matters the most is that Lingr hosts conversations in many different languages (English, Japanese, Farsi, etc.). This presents a unique challenge in terms of tokenizing the utterances before they are indexed. Ferret provides a very nice tokenizer for most languages based on the Latin alphabet, but other languages, such as Japanese, which do not delimit tokens by whitespace, are problematic. Also consider the fact that it is quite common to have a single utterance that mixes languages (I guess you can thank the global ubiquity of English for that). And to put icing on this cake, for a given utterance, we have no idea which language (or languages) it is in; all we have are Unicode codepoints.
So, much of the two weeks required for integration was spent writing and tuning our own tokenizer. Our tokenizer basically spots transitions between Latin text and non-Latin text (based on codepoint value), then applies Ferret's existing Latin tokenizer to the Latin parts, and a simple per-character tokenizer to the non-Latin parts. Because Ferret uses the same tokenizer when indexing an utterance as it uses when searching for an utterance, your search terms can contain a mix of Latin and non-Latin "words", and we should handle that just fine.
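To make the idea concrete, here is a minimal sketch of that kind of splitting in plain Ruby. It is not our actual tokenizer and it does not use Ferret's API: the codepoint cutoff (LATIN_LIMIT), the regex-based stand-in for a Latin tokenizer, and all of the method names are assumptions made up for illustration.

```ruby
# Hypothetical sketch only: not Lingr's tokenizer and not Ferret's API.

# Assumption: codepoints below U+0250 (Basic Latin through Latin Extended-B)
# count as "Latin"; everything else counts as "non-Latin".
LATIN_LIMIT = 0x0250

def latin?(char)
  char.ord < LATIN_LIMIT
end

# Stand-in for a real Latin tokenizer: lowercase, then keep runs of
# letters and digits.
def tokenize_latin(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Per-character tokenizer for non-Latin runs: every character becomes its
# own token, except whitespace and punctuation, which are dropped.
def tokenize_non_latin(text)
  text.each_char.reject { |c| c.match?(/[[:space:][:punct:]]/) }
end

# Split the utterance into runs wherever the Latin/non-Latin classification
# changes, then hand each run to the matching tokenizer.
def tokenize_mixed(utterance)
  runs = utterance.each_char.chunk_while { |a, b| latin?(a) == latin?(b) }
  runs.flat_map do |run|
    text = run.join
    latin?(run.first) ? tokenize_latin(text) : tokenize_non_latin(text)
  end
end

p tokenize_mixed("I love 東京 sushi")
```

Running the same function over both the stored utterance and the search string is what lets a mixed query line up with a mixed document, since both sides end up with the same kind of tokens.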
For the metrics-obsessed among you, our Ferret index currently consists of approximately 3 million "documents" (user utterances). The on-disk size of this index is currently 909 megabytes. The index is updated once per minute, via a cron job, so there is some short period after an utterance is spoken when it is not indexed (maximum one minute).
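One common way to structure a once-a-minute batch update like this is to remember the id of the last utterance indexed and pick up only newer rows on each run. The sketch below shows that pattern in plain Ruby, under loud assumptions: the Utterance struct, the IncrementalIndexer class, and the in-memory hash standing in for the index are all hypothetical, and this is not a description of how our cron job is actually written.

```ruby
# Hypothetical sketch: in-memory stand-ins for the utterance table and index.
Utterance = Struct.new(:id, :text)

class IncrementalIndexer
  attr_reader :last_indexed_id, :index

  def initialize
    @last_indexed_id = 0 # high-water mark: highest utterance id indexed so far
    @index = {}          # id => text, a stand-in for a real full-text index
  end

  # Index every utterance newer than the high-water mark and return how many
  # were added. A cron job would call this once per run.
  def run(utterances)
    pending = utterances.select { |u| u.id > @last_indexed_id }.sort_by(&:id)
    pending.each { |u| @index[u.id] = u.text }
    @last_indexed_id = pending.last.id unless pending.empty?
    pending.size
  end
end

table = [Utterance.new(1, "hello"), Utterance.new(2, "こんにちは")]
indexer = IncrementalIndexer.new
indexer.run(table)                  # first run picks up both utterances
table << Utterance.new(3, "iphone")
indexer.run(table)                  # next run picks up only the new one
```

With this shape, an utterance spoken just after a run waits until the next run before it becomes searchable, which is where the delay of up to a minute comes from.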
Finally, I'd like to thank Jens Kraemer and everyone over at the Ferret Forum for their help and advice. If you are thinking about using Ferret, you can get some great information there. It helped us tremendously.
If you have any other questions about the implementation, feel free to ask them here in comments, or through our Feedback form.
Cheers,
Now this seems very useful! Would you mind posting a how-to or even your modified tokenizer? That would be invaluable to a lot of internationalized projects. Great work!
Posted by: AkitaOnRails | May 14, 2007 at 02:36 PM
@Akita- we plan on open-sourcing the tokenizer. I'm cleaning it up right now, and we'll post it at http://svn.lingr.com when it's ready. I'll also post a new comment here announcing it.
Posted by: Daniel Burkes | May 14, 2007 at 02:43 PM
What is the performance of the search? Thx.
Posted by: Roman Mackovcak | July 02, 2007 at 02:38 PM
We've got around 3.5 million utterances indexed, and the average query takes just a second or two (test it yourself at, for example, http://www.lingr.com/search/archives/iphone ).
Posted by: Daniel Burkes | July 02, 2007 at 02:48 PM
Anything new about releasing this magic tokenizer?
Posted by: Sébastien | March 16, 2008 at 04:26 PM
Sebastien- it is available now at http://svn.lingr.com/plugins/multilingual_ferret_tools/
Posted by: Daniel Burkes | March 16, 2008 at 07:47 PM