A new plugin
As promised over at Ruby Forum, I am proud to say that we have released our multilingual Ferret analyzer as open source. It's available now at our public Subversion repository, packaged as a Rails plugin. Enjoy, and do let us know if you find it useful!
- Danny
Great stuff!
Posted by: Jens Krämer | May 16, 2007 at 05:26 AM
Thanks for the work but one question .. what is the license?
Posted by: hk | January 25, 2008 at 05:08 AM
hk- this is released to the public domain.
Posted by: Daniel Burkes | January 25, 2008 at 02:38 PM
This looks wonderful! Just what I need. But I can't get it to work :( I'm a bit of a beginner with Unicode in Rails, and definitely a beginner with Ferret. I think I've added all the bits I know of, e.g. "$KCODE = 'u'" in environment.rb. I set Ferret.locale = "en_US.UTF-8" under /config/initializers (Rails 2.0). I have a meta tag specifying UTF-8 in my layout. I set the analyzer in my model as described in the readme file. I'm using SQLite for development, if that makes a difference, although it's displaying my Chinese text just fine when I display all rows. Is there any other setup I need to do? Many thanks if you can provide a hint!
Posted by: Rhywun | January 26, 2008 at 12:07 PM
@Rhywun - You need both "$KCODE = 'u'" and "require 'jcode'" in your environment.rb. Then you need to make sure the text you are sending to Ferret is actually in UTF-8. Check the ferret_server.log on the Ferret machine to see exactly what it is indexing.
Posted by: Daniel Burkes | January 27, 2008 at 07:23 PM
Thanks, I've added "require 'jcode'". But I don't know how to verify the Unicode. I don't have a separate Ferret machine or a "ferret_server.log". There's a "ferret_index.log" and it's showing my query correctly with Chinese characters - but I don't know what the encoding is.
Posted by: Rhywun | January 27, 2008 at 10:48 PM
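One way to check whether a string really is UTF-8, sketched here for readers following the thread. This is not part of the plugin: `valid_utf8?` is a hypothetical helper name, and it relies on the String encoding API of Ruby 1.9 or later, which is newer than the `$KCODE`/`jcode` setup discussed above.

```ruby
# encoding: utf-8
# Hypothetical helper (not part of the plugin): returns true when the raw
# bytes of +str+ form a valid UTF-8 sequence. Assumes Ruby 1.9 or later.
def valid_utf8?(str)
  # Work on a copy so the caller's string keeps its original encoding tag.
  str.dup.force_encoding("UTF-8").valid_encoding?
end

puts valid_utf8?("莫衷一是")   # well-formed UTF-8 text
puts valid_utf8?("\xE8\x8E")   # a three-byte character cut off after two bytes
```

Running a check like this on the text just before it is handed to the indexer would show whether the bytes are valid UTF-8. It cannot tell whether they are the characters you intended, only that the sequence is well-formed.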
@Rhywun - so what is the symptom exactly? What isn't working?
Posted by: Daniel Burkes | January 27, 2008 at 11:03 PM
When I perform a search for a Chinese character (it's a dictionary application), there are no results even though I know there must be results. I get even stranger results when I search for a multi-character string. In the index log it puts the string in quotes, inserts spaces between the characters, and adds "~3" at the end. It looks like an encoding problem but I don't see where.
Posted by: Rhywun | January 28, 2008 at 07:53 PM
Could you tell me the Unicode code points for the multi-character string you're searching for? I'll check MultilingualTools::Analyzer to make sure it's parsing and chunking correctly.
Posted by: Daniel Burkes | January 28, 2008 at 08:59 PM
I'm not sure exactly how to do that, but I'll try. Is this it?
4 characters, it shows the following at the Unihan site:
Char  UTF-8     UTF-16
----  --------  ------
莫    E8 8E AB  U+83AB
衷    E8 A1 B7  U+8877
一    E4 B8 80  U+4E00
是    E6 98 AF  U+662F
Posted by: Rhywun | January 29, 2008 at 05:37 PM
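For anyone else asked for code points, the values in the comment above can also be produced directly in Ruby rather than looked up by hand. A minimal sketch, assuming Ruby 1.9 or later; `describe_chars` is a hypothetical helper name, not something the plugin provides.

```ruby
# encoding: utf-8
# Hypothetical helper (not part of the plugin): for each character, return
# the character, its UTF-8 bytes in hex, and its Unicode code point.
# Assumes Ruby 1.9 or later and a UTF-8 encoded input string.
def describe_chars(text)
  text.each_char.map do |ch|
    utf8_hex  = ch.bytes.map { |b| format("%02X", b) }.join(" ")
    codepoint = format("U+%04X", ch.ord)
    [ch, utf8_hex, codepoint]
  end
end

describe_chars("莫衷一是").each { |row| puts row.join("  ") }
```

If the string is correctly encoded, each printed row should agree with the corresponding Unihan entry. A mismatch suggests the text was re-encoded somewhere between the database and the index.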
I tested, and the chunker/analyzer seems to be working fine on this string. My guess is that you are not actually indexing what you think you are indexing. That is, the encoding for the strings you are indexing is screwed up somehow.
The next step I would recommend to you is that you post at http://www.ruby-forum.com/forum/5, asking for help on how to see what exactly is in your Ferret index.
Posted by: Daniel Burkes | January 29, 2008 at 08:22 PM
OK, thanks for your help!!
Posted by: Rhywun | January 30, 2008 at 05:12 PM
Well, I switched to MySQL and everything worked perfectly on the first try. Easy! Thanks for your plugin -- it's a great help.
Posted by: Rhywun | January 30, 2008 at 06:10 PM