A new plugin

As promised over at Ruby Forum, I am proud to say that we have released our multilingual Ferret analyzer as open source. It's available now at our public Subversion repository, packaged as a Rails plugin. Enjoy, and do let us know if you find it useful!

- Danny

Comments

Great stuff!

Thanks for the work but one question .. what is the license?

hk- this is released to the public domain.

This looks wonderful! Just what I need. But I can't get it to work :( I'm a bit of a beginner with Unicode in Rails, and definitely a beginner with Ferret. I think I've added all the bits I know of, e.g. "$KCODE = 'u'" in environment.rb. I set Ferret.locale = "en_US.UTF-8" under /config/initializers (Rails 2.0). I have a meta tag specifying UTF-8 in my layout. I set the analyzer in my model as described in the readme file. I'm using SQLite for development, if that makes a difference, although it's displaying my Chinese text just fine when I display all rows. Is there any other setup I need to do? Many thanks if you can provide a hint!

@hk - You need both "$KCODE = 'u'" and "require 'jcode'" in your environment.rb. Then you need to make sure the text you are sending to Ferret is actually in UTF-8. Check the ferret_server.log on the Ferret machine to see exactly what it is indexing.
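One rough way to check whether a string is actually UTF-8 before handing it to Ferret is sketched below. The helper name `utf8?` is made up for illustration and is not part of Ferret or the plugin. It relies only on `String#unpack('U*')`, which raises `ArgumentError` when the bytes are not well-formed UTF-8.

```ruby
# Hypothetical helper (not part of Ferret or the plugin):
# returns true when the bytes of +str+ are well-formed UTF-8.
def utf8?(str)
  # unpack('U*') decodes the bytes as UTF-8 and raises
  # ArgumentError as soon as it meets a malformed sequence.
  str.unpack('U*')
  true
rescue ArgumentError
  false
end

puts utf8?("\xE8\x8E\xAB")  # the three UTF-8 bytes of U+83AB
puts utf8?("caf\xE9")       # Latin-1 "e acute", not valid UTF-8
```

Calling something like this on the model attributes just before they are indexed would show whether the database adapter is returning UTF-8 or some other encoding. On Ruby 1.9 and later, `String#valid_encoding?` does a similar job.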

Thanks, I've added "require 'jcode'". But I don't know how to verify the Unicode. I don't have a separate Ferret machine or a "ferret_server.log". There's a "ferret_index.log" and it's showing my query correctly with Chinese characters - but I don't know what the encoding is.

@hk- so what is the symptom exactly? What isn't working?

When I perform a search for a Chinese character (it's a dictionary application), there are no results even though I know there must be results. I get even stranger results when I search for a multi-character string. In the index log it puts the string in quotes, inserts spaces between the characters, and adds "~3" at the end. It looks like an encoding problem but I don't see where.

Could you tell me the Unicode code points for the multi-character string you're searching for? I'll check MultilingualTools::Analyzer to make sure it's parsing and chunking correctly.

I'm not sure exactly how to do that, but I'll try. Is this it?

4 characters, it shows the following at the Unihan site:

Char  UTF-8     UTF-16
----  --------  ------
莫    E8 8E AB  U+83AB
衷    E8 A1 B7  U+8877
一    E4 B8 80  U+4E00
是    E6 98 AF  U+662F
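The same values can also be pulled out of a string in Ruby itself, without going to the Unihan site. This is only a sketch: `describe_utf8` is a hypothetical name, and it assumes the string really is UTF-8 encoded.

```ruby
# Hypothetical helper: for each character in a UTF-8 string, return
# its UTF-8 bytes in hex followed by its Unicode code point.
def describe_utf8(str)
  str.unpack('U*').map do |cp|
    # Re-encode the single code point to get its UTF-8 bytes.
    bytes = [cp].pack('U').unpack('C*').map { |b| '%02X' % b }.join(' ')
    "#{bytes} U+#{'%04X' % cp}"
  end
end

puts describe_utf8("莫衷一是")
# E8 8E AB U+83AB
# E8 A1 B7 U+8877
# E4 B8 80 U+4E00
# E6 98 AF U+662F
```

If the string comes back from the database in a different encoding, `unpack('U*')` will either raise an error or print code points that don't match the table above, which is a quick sign that the encoding is off.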

I tested, and the chunker/analyzer seems to be working fine on this string. My guess is that you are not actually indexing what you think you are indexing. That is, the encoding for the strings you are indexing is screwed up somehow.

The next step I would recommend to you is that you post at http://www.ruby-forum.com/forum/5, asking for help on how to see what exactly is in your Ferret index.

OK, thanks for your help!!

Well, I switched to MySQL and everything worked perfectly on the first try. Easy! Thanks for your plugin -- it's a great help.
