java - Lucene search with special characters
In my Lucene index I store names containing special characters (e.g. "savić") in a field created as described below.
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexed(true);
fieldType.setTokenized(false);
new Field("name", "savić".toLowerCase(), fieldType);
I use a StopwordAnalyzerBase analyzer with Lucene Version.LUCENE_45.
If I search for "savić" in that field, it doesn't find it. How do I deal with special characters?
@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
    PatternTokenizer src;
    // these characters are not used as word separators
    src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
    TokenStream tok = new StandardFilter(matchVersion, src);
    tok = new LowerCaseFilter(matchVersion, tok);
    tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
    return new TokenStreamComponents(src, tok) {
        @Override
        protected void setReader(final Reader reader) throws IOException {
            super.setReader(reader);
        }
    };
}
You have a couple of choices:
Try adding an ASCIIFoldingFilter:
src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new ASCIIFoldingFilter(tok);
tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
This takes the simplistic approach of reducing non-ASCII characters, such as Ä, to their best match among ASCII characters (A, in this case), provided a reasonable ASCII alternative exists. It won't do anything fancy like using language-specific intelligence to determine the best replacements, though.
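To see what this style of folding does to a term like "savić", here is a rough stdlib sketch of the idea (this is *not* Lucene's ASCIIFoldingFilter; it only handles the combining-mark cases via Unicode decomposition, while the real filter has a much larger hand-built mapping, e.g. 'ß' to "ss"):

```java
import java.text.Normalizer;

public class FoldSketch {
    // Decompose accented characters (ć -> c + combining acute accent),
    // then strip all combining marks, leaving the base ASCII letter.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("savić")); // prints "savic"
    }
}
```

With the filter in the chain, both the indexed term and the query term end up as "savic", so the search matches.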
For something more linguistically intelligent, there are tools that handle this sort of thing in many of the language-specific analysis packages. GermanNormalizationFilter is one example: it does similar things to ASCIIFoldingFilter, but applies its rules in a way appropriate to the German language, such as 'ß' being replaced by "ss". You'd use it much like the code above:
src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new GermanNormalizationFilter(tok);
tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
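As a simplified illustration of the German-specific substitutions involved (again, a stdlib sketch under my own assumptions, not the actual GermanNormalizationFilter, which operates on token character buffers and applies more rules than shown here):

```java
public class GermanFoldSketch {
    // Hypothetical helper illustrating German-style folding:
    // 'ß' becomes "ss" and umlauts lose their diaeresis.
    static String normalize(String s) {
        return s.replace("ß", "ss")
                .replace("ä", "a").replace("ö", "o").replace("ü", "u")
                .replace("Ä", "A").replace("Ö", "O").replace("Ü", "U");
    }

    public static void main(String[] args) {
        System.out.println(normalize("straße")); // prints "strasse"
    }
}
```

The design point is that the replacements match German orthographic conventions ('ß' really is written "ss" when no sharp s is available), rather than a generic best-effort ASCII mapping.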