Java - Lucene search with special characters


In a Lucene index I store names with special characters (e.g. "savić") in a field defined as described below.

    FieldType fieldType = new FieldType();
    fieldType.setStored(true);
    fieldType.setIndexed(true);
    fieldType.setTokenized(false);
    new Field("name", "savić".toLowerCase(), fieldType);

I use a StopwordAnalyzerBase analyzer and Lucene Version.LUCENE_45.

If I search the field for "savić", it doesn't find it. How do I deal with special characters?

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        PatternTokenizer src;
        // these characters are not used as separators
        src = new PatternTokenizer(reader, Pattern.compile("[\\w&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
                super.setReader(reader);
            }
        };
    }

You have a couple of choices:

  1. Try adding an ASCIIFoldingFilter:

    src = new PatternTokenizer(reader, Pattern.compile("[\\w&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
    TokenStream tok = new StandardFilter(matchVersion, src);
    tok = new LowerCaseFilter(matchVersion, tok);
    tok = new ASCIIFoldingFilter(tok);
    tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);

    This takes the simplistic approach of reducing non-ASCII characters, such as Ä, to the best match among ASCII characters (A, in this case), if a reasonable ASCII alternative exists. It won't do anything fancy like using language-specific intelligence to determine the best replacements, though.
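Outside of Lucene, the effect of this kind of accent folding on simple Latin diacritics can be approximated with the JDK's `java.text.Normalizer`. This is a minimal sketch of the idea, not ASCIIFoldingFilter itself, which also handles ligatures and many characters beyond combining marks:

```java
import java.text.Normalizer;

public class FoldDemo {
    // Decompose accented characters (NFD), then strip the combining marks.
    // This approximates what ASCIIFoldingFilter does for simple diacritics.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("savić")); // prints "savic"
    }
}
```

With this folding applied at both index and query time, "savić" and "savic" reduce to the same term, which is why adding the filter to the analyzer chain makes the search match.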

  2. For something more linguistically intelligent, there are tools that handle this sort of thing in many of the language-specific analysis packages. GermanNormalizationFilter is one example. It does similar things to ASCIIFoldingFilter, but applies rules in a way appropriate to the German language, such as 'ß' being replaced by 'ss'. You'd use it similarly in the above code:

    src = new PatternTokenizer(reader, Pattern.compile("[\\w&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
    TokenStream tok = new StandardFilter(matchVersion, src);
    tok = new LowerCaseFilter(matchVersion, tok);
    tok = new GermanNormalizationFilter(tok);
    tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
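As a JDK-only sketch of the kind of substitutions involved (the real GermanNormalizationFilter is a Lucene TokenFilter with more context-sensitive rules, e.g. around 'ae'/'oe'/'ue' digraphs; these plain replacements are a simplification):

```java
public class GermanFoldDemo {
    // Simplified stand-in for GermanNormalizationFilter's character rules:
    // umlauts lose their diaeresis and 'ß' becomes "ss". The actual Lucene
    // filter applies these rules with more context sensitivity.
    static String normalize(String s) {
        return s.replace("ä", "a").replace("ö", "o").replace("ü", "u")
                .replace("Ä", "A").replace("Ö", "O").replace("Ü", "U")
                .replace("ß", "ss");
    }

    public static void main(String[] args) {
        System.out.println(normalize("straße")); // prints "strasse"
    }
}
```

Whichever filter you choose, the key point is that it must run in the analyzer used for both indexing and querying, so that both sides reduce to the same terms.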
