
Item14807: Solr language detection does not work

Priority: Normal
Current State: Closed
Released In: n/a
Target Release: n/a
Applies To: Extension
Component: SolrPlugin
Branches: master
Reported By: UlrichLeodolter
Waiting For:
Last Change By: MichaelDaum
-- UlrichLeodolter - 28 Dec 2018

Solr language detection does not work for me.

I tried setting CONTENT_LANGUAGE in SitePreferences, but after a full reindex almost all documents still have language en, except for documents that have CONTENT_LANGUAGE set explicitly in their meta preferences.

   * Set CONTENT_LANGUAGE = de

   * Set CONTENT_LANGUAGE = detect

What could be the reason?

The main reason I found is that the langid.whitelist value in solrconfig.xml must not contain whitespace between its entries.

The second reason is that only title is a solr.StrField; summary and text are solr.TextField.

This is the DEBUG output after changing typeClass to solr.TextField and removing title from langid.fl:

2018-12-28 09:51:24.830 DEBUG (qtp99747242-14) [   x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Language fallback to value en
2018-12-28 09:51:24.830 DEBUG (qtp99747242-14) [   x:foswiki] o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Appending field summary
2018-12-28 09:51:24.830 DEBUG (qtp99747242-14) [   x:foswiki] o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Appending field text
2018-12-28 09:51:24.841 DEBUG (qtp99747242-14) [   x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Detected a language not in whitelist (de), using fallback en
2018-12-28 09:51:24.842 DEBUG (qtp99747242-14) [   x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Detected main document language from fields [summary, text]: en
2018-12-28 09:51:24.842 DEBUG (qtp99747242-14) [   x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Mapping field summary using document global language en
2018-12-28 09:51:24.842 DEBUG (qtp99747242-14) [   x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Doing mapping from summary with language en to field summary_en

These are the final changes I made; after that, language detection works fine.

 diff -c solrconfig.xml.orig solrconfig.xml
*** solrconfig.xml.orig   2018-10-11 16:02:03.000000000 +0200
--- solrconfig.xml   2018-12-28 10:22:05.875754466 +0100
***************
*** 1741,1756 ****
  
    <updateRequestProcessorChain name="foswiki_chain">
      <processor class="solr.TruncateFieldUpdateProcessorFactory">
!       <str name="typeClass">solr.StrField</str>
        <int name="maxLength">32764</int>
      </processor>
      <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <!-- processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory"-->
        <lst name="defaults">
!         <str name="langid.fl">title,summary,text</str>
          <str name="langid.langField">language</str>
          <!-- languages we've got a tokenizer for - minus da as it brings down accuracies for the other languages (wtf) -->
!         <str name="langid.whitelist">ar, bg, ca, cjk, ckb, cz, de, el, en, es, eu, fa, fi, fr, ga, gl, hi, hu, hy, id, it, ja, lv, nl, no, pt, ro, ru, sv, th, tr</str>
          <str name="langid.overwrite">false</str>
          <str name="langid.threshold">0.7</str>
          <str name="langid.fallback">en</str>
--- 1741,1756 ----
  
    <updateRequestProcessorChain name="foswiki_chain">
      <processor class="solr.TruncateFieldUpdateProcessorFactory">
!       <str name="typeClass">solr.TextField</str>
        <int name="maxLength">32764</int>
      </processor>
      <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <!-- processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory"-->
        <lst name="defaults">
!         <str name="langid.fl">summary,text</str>
          <str name="langid.langField">language</str>
          <!-- languages we've got a tokenizer for - minus da as it brings down accuracies for the other languages (wtf) -->
!         <str name="langid.whitelist">ar,bg,ca,cjk,ckb,cz,de,el,en,es,eu,fa,fi,fr,ga,gl,hi,hu,hy,id,it,ja,lv,nl,no,pt,ro,ru,sv,th,tr</str>
          <str name="langid.overwrite">false</str>
          <str name="langid.threshold">0.7</str>
          <str name="langid.fallback">en</str>

-- UlrichLeodolter - 28 Dec 2018

I'd still like the title to be part of the language detection: how about adding the field title_std to langid.fl?

-- MichaelDaum - 29 Dec 2018

I did not look into this closely, but title_std seems to be a good choice. I have added it to our solrconfig.xml.
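
For reference, the change discussed here would presumably amount to a one-line edit in the defaults list of the language identifier processor. This is only a sketch: it assumes title_std is simply added to the summary,text value from the diff above, and the position of the field within the list is a guess.

```xml
<!-- sketch only: the field order is an assumption, not taken from the actual config -->
<lst name="defaults">
  <str name="langid.fl">title_std,summary,text</str>
  <str name="langid.langField">language</str>
</lst>
```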

-- UlrichLeodolter - 30 Dec 2018

Thanks! I'll add your improvements to the next release.

-- MichaelDaum - 30 Dec 2018

I have been experimenting with these settings. It definitely works now, but I get so many false positives that automatic language detection is almost useless.

Now testing with langid.fl=catchall as well as langid.threshold=0.8
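
As a rough sketch, these two test settings would correspond to the following entries in the processor's defaults list. It assumes that catchall is an existing schema field aggregating the topic content, and that the remaining langid.* entries stay as in the diff above.

```xml
<lst name="defaults">
  <!-- assumption: detect on the single aggregated field instead of summary,text -->
  <str name="langid.fl">catchall</str>
  <!-- raised from 0.7; a detection scoring below this is not accepted and langid.fallback applies -->
  <str name="langid.threshold">0.8</str>
</lst>
```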

-- MichaelDaum - 03 Jan 2019

For me, language detection works pretty well (using Solr 5.5.5). Our internal wiki has about 3950 documents (excluding the System web and Web topics), most of them in German. Below is the language facet (from a search in the Solr admin UI):

"language": [
        "en",
        7800,
        "de",
        3789,
        "it",
        10,
        "fr",
        9,
        "es",
        4,
        "id",
        4,
        "no",
        4,
        "fi",
        2,
        "hu",
        2,
        "nl",
        2,
        "pt",
        1,
        "sv",
        1
]

-- UlrichLeodolter - 03 Jan 2019
 
Topic revision: r8 - 31 Jan 2019, MichaelDaum