We recently added full archive search to Lingr, and I thought I'd take a moment to talk about the technical details of that, for those who are interested.
At Lingr, everything said in the chatrooms is saved into our database. This enables you to browse the archives of a room to recall some recent conversation, or to find out what someone else said about something. So, we've got around three million user utterances sitting in our database, and, we thought, why not unlock those and let people search them?
Our first thought was to just use MySQL's fulltext indexing system. But, as it turns out, fulltext indexing only works on MyISAM tables, and our utterance table is InnoDB, so that was out the window.
So we started looking for a text indexing system that could work for us. What we found was Ferret, a Ruby port of Apache Lucene. Using Ferret together with the excellent acts_as_ferret (AAF) plugin for ActiveRecord, we were able to integrate full-text search into Lingr in about two weeks.
The one issue that complicated matters the most is that Lingr hosts conversations in many different languages (English, Japanese, Farsi, etc.). This makes it a challenge to tokenize the utterances before they are indexed. Ferret provides a very nice tokenizer for most languages based on the Latin alphabet, but languages such as Japanese, which do not delimit tokens by whitespace, are problematic. It is also quite common for a single utterance to mix languages (I guess you can thank the global ubiquity of English for that). And to put icing on this cake, we have no idea which language (or languages) a given utterance is in; all we have are Unicode codepoints.
As a result, much of the two weeks required for integration was spent writing and tuning our own tokenizer. Our tokenizer basically spots transitions between Latin text and non-Latin text (based on codepoint value), then applies Ferret's existing Latin tokenizer to the Latin parts, and a simple per-character tokenizer to the non-Latin parts. Because Ferret uses the same tokenizer when indexing an utterance as it uses when searching for one, your search terms can contain a mix of Latin and non-Latin "words", and we should handle that just fine.
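A minimal sketch of that idea in plain Ruby follows. It is an illustration only, not the production tokenizer and not Ferret's API: the codepoint cutoff (LATIN_LIMIT), the helper names, and the regex-based stand-in for Ferret's Latin tokenizer are all assumptions made for the example.

```ruby
# Hypothetical sketch; none of these names come from Lingr or Ferret.
# Assumption: treat every codepoint below U+0250 (Basic Latin through
# Latin Extended-B) as "Latin" and everything else as "non-Latin".
LATIN_LIMIT = 0x0250

def latin?(char)
  char.ord < LATIN_LIMIT
end

# Stand-in for a Latin tokenizer: lowercase the text and keep runs of
# letters and digits, dropping whitespace and punctuation.
def tokenize_latin(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Per-character tokenizer: every remaining character is its own token.
def tokenize_non_latin(text)
  text.chars.reject { |c| c.match?(/[[:space:][:punct:]]/) }
end

# Split the utterance at each Latin/non-Latin transition, then hand each
# run to the matching tokenizer and concatenate the results in order.
def tokenize(utterance)
  runs = utterance.each_char.chunk_while { |a, b| latin?(a) == latin?(b) }
  runs.flat_map do |run|
    text = run.join
    latin?(run.first) ? tokenize_latin(text) : tokenize_non_latin(text)
  end
end

p tokenize("Hello 日本語 world")
```

Running the same function over both the stored utterance and the search query is what lets a mixed query line up with a mixed utterance: both sides are cut into identical tokens.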
For the metrics-obsessed among you, our Ferret index currently consists of approximately 3 million "documents" (user utterances). The on-disk size of this index is currently 909 megabytes. The index is updated once per minute, via a cron job, so an utterance may go unindexed for a short period after it is spoken (maximum one minute).
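One common way to structure a periodic batch update like this is to remember the highest utterance id already indexed and pick up only newer rows on each run. The sketch below shows that pattern in plain Ruby. It is an assumption about how such a job could work, not a description of the actual cron job: the Utterance struct, the IncrementalIndexer class, and the in-memory hash standing in for the search index are all hypothetical.

```ruby
# Hypothetical sketch of one pass of a periodic indexing job.
# Assumption: utterances have monotonically increasing integer ids.
Utterance = Struct.new(:id, :text)

class IncrementalIndexer
  attr_reader :last_indexed_id, :index

  def initialize
    @last_indexed_id = 0 # high-water mark; a real job would persist this
    @index = {}          # stand-in for the real search index: id => text
  end

  # Index every utterance newer than the high-water mark, advance the
  # mark, and return how many utterances were added in this pass.
  def run(utterances)
    fresh = utterances.select { |u| u.id > @last_indexed_id }
    fresh.each { |u| @index[u.id] = u.text }
    @last_indexed_id = fresh.map(&:id).max || @last_indexed_id
    fresh.size
  end
end
```

With this shape, a run that finds nothing new is cheap, which matters when the job fires every minute.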
Finally, I'd like to thank Jens Kraemer and everyone over at the Ferret Forum for their help and advice. If you are thinking about using Ferret, you can get some great information there. It helped us tremendously.
If you have any other questions about the implementation, feel free to ask them here in the comments, or through our Feedback form.
Cheers,
Danny