且构网


Lucene.Net: Underscores Causing Token Split

Updated: 2023-11-07 23:11:22

Yes, StandardAnalyzer splits on underscores; WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to assign a different analyzer to each field, so you can keep the standard analyzer's behavior for everything except the table/column name fields.
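As a rough sketch of the per-field setup described above, assuming the Lucene.Net 2.9-style API used elsewhere in this answer: the field names "tableName" and "columnName" and the AnalyzerFactory class are hypothetical placeholders, so substitute whatever your index actually uses.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

static class AnalyzerFactory
{
    // Builds an analyzer that applies StandardAnalyzer by default and
    // WhitespaceAnalyzer to the identifier fields, so underscores survive there.
    // "tableName" and "columnName" are hypothetical field names.
    public static Analyzer Create()
    {
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));

        wrapper.AddAnalyzer("tableName", new WhitespaceAnalyzer());
        wrapper.AddAnalyzer("columnName", new WhitespaceAnalyzer());
        return wrapper;
    }
}
```

Pass the same wrapper to both the IndexWriter and the QueryParser. Otherwise the query terms are tokenized differently from the indexed terms and may not match.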

WhitespaceAnalyzer only splits on whitespace, though. It won't lowercase your tokens, for example. So you might want to write your own analyzer that combines WhitespaceTokenizer and LowerCaseFilter. LowerCaseTokenizer is another way to get lowercasing, but it splits at every non-letter character, so it would break tokens at underscores too.
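A minimal sketch of that WhitespaceTokenizer plus LowerCaseFilter combination, again assuming the Lucene.Net 2.9-style Analyzer API. The class name WhitespaceLowerCaseAnalyzer is made up for illustration.

```csharp
using Lucene.Net.Analysis;

// Hypothetical analyzer: splits on whitespace only, then lowercases each token.
// Underscores are not treated as separators, so an identifier such as
// Order_Items should come through as the single token order_items.
class WhitespaceLowerCaseAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        TokenStream stream = new WhitespaceTokenizer(reader);
        return new LowerCaseFilter(stream);
    }
}
```

Because this analyzer does nothing beyond whitespace splitting and lowercasing, punctuation attached to a word (a trailing comma, for instance) stays part of the token. That is usually fine for table and column names but may not suit free text.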

EDIT: Simple custom analyzer (in C#, but you can translate it to Java pretty easily):

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Chains together the standard tokenizer, standard filter, and lowercase filter.
// Note: StandardTokenizer is the piece that splits on underscores; substitute
// new WhitespaceTokenizer(reader) to keep tokens such as table_name intact.
class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // Tokenize, apply the standard token cleanup, then lowercase each token.
        TokenStream stream = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        stream = new StandardFilter(stream);
        return new LowerCaseFilter(stream);
    }
}