Item9900: fall back to HTML::TreeBuilder if html2text not available

Priority: Normal
Current State: No Action Required
Released In: n/a
Target Release: n/a
Applies To: Extension
Component: StringifierContrib
Branches:
Reported By: WillNorris
Waiting For: Main.MichaelDaum
Last Change By: MichaelDaum
-- WillNorris - 26 Oct 2010

Note, I deliberately switched from HTML::TreeBuilder to html2text because the former has significant encoding problems. Their output isn't equivalent either, which results in significant indexing differences. As the HTML converter is reused by other converters as well, the impact of changing this workhorse is quite significant.

-- MichaelDaum - 27 Oct 2010

Is there another pure-Perl module that could be used as a fallback?

-- WillNorris - 27 Oct 2010

What is the state of play with this?

I have a stringify problem: html2text cannot be found when it tries to index a PPT file I have uploaded.

This is the full output from the Cron log when Kinoupdate ran:

Uncaught exception from user code:
   exec of html2text -ascii %FILENAME|F% failed: No such file or directory at /home2/mydomain/public_html/foswiki/lib/Foswiki/Sandbox.pm line 542.
 at /usr/lib/perl5/site_perl/5.8.8/Error.pm line 184
   Error::throw('Error::Simple', 'exec of html2text -ascii %FILENAME|F% failed: No such file or...') called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Sandbox.pm line 542
   Foswiki::Sandbox::sysCommand('Foswiki::Sandbox', 'html2text -ascii %FILENAME|F%', 'FILENAME', '/tmp/eOi3Kxb968') called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Contrib/StringifierContrib/Plugins/HTML.pm line 31
   Foswiki::Contrib::StringifierContrib::Plugins::HTML::stringForFile('Foswiki::Contrib::StringifierContrib::Plugins::HTML=HASH(0x2b...', '/tmp/eOi3Kxb968') called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Contrib/StringifierContrib.pm line 43
   Foswiki::Contrib::StringifierContrib::stringFor('Foswiki::Contrib::StringifierContrib', '/tmp/eOi3Kxb968') called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Contrib/StringifierContrib/Plugins/PPT.pm line 46
   Foswiki::Contrib::StringifierContrib::Plugins::PPT::stringForFile('Foswiki::Contrib::StringifierContrib::Plugins::PPT=HASH(0x2b0...', '/home2/mydomain/public_html/foswiki/pub/MyCompany/ZincRedox/Zi...') called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Contrib/StringifierContrib.pm line 43
   Foswiki::Contrib::StringifierContrib::stringFor('Foswiki::Contrib::StringifierContrib', '/home2/mydomain/public_html/foswiki/pub/MyCompany/ZincRedox/Zi...') called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Contrib/KinoSearchContrib/Index.pm line 568
   Foswiki::Contrib::KinoSearchContrib::Index::indexAttachment('Foswiki::Contrib::KinoSearchContrib::Index=HASH(0x2a1c990)', 'KinoSearch::InvIndexer=HASH(0x2afdfa0)', 'MyCompany', 'ZincRedox', 'HASH(0x170f430)') called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Contrib/KinoSearchContrib/Index.pm line 535
   Foswiki::Contrib::KinoSearchContrib::Index::indexTopic('Foswiki::Contrib::KinoSearchContrib::Index=HASH(0x2a1c990)', 'KinoSearch::InvIndexer=HASH(0x2afdfa0)', 'MyCompany', 'ZincRedox', 'WIKIWEBMASTER', 1, 'TopicSummary', 1, 'INCLUDEWARNING', ...) called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Contrib/KinoSearchContrib/Index.pm line 336
   Foswiki::Contrib::KinoSearchContrib::Index::addTopics('Foswiki::Contrib::KinoSearchContrib::Index=HASH(0x2a1c990)', 'MyCompany', 'ZincRedox') called at /home2/mydomain/public_html/foswiki/lib/Foswiki/Contrib/KinoSearchContrib/Index.pm line 116
   Foswiki::Contrib::KinoSearchContrib::Index::updateIndex('Foswiki::Contrib::KinoSearchContrib::Index=HASH(0x2a1c990)', '') called at ./kinoupdate line 38
The error left a load of files in the KinoSearch directory, so I cleared them out and ran kinoupdate. That failed too, so now my index is missing.

-- BobCorless - 09 Nov 2010

I installed this Perl module: http://search.cpan.org/~kryde/HTML-FormatExternal-18/ It appears to include Html2text. However, I still get the same error when stringifying (Sandbox.pm line 542).

-- BobCorless - 10 Nov 2010

Since installing HTML::FormatExternal, the cron job has been running without errors. However, the PPT is not getting indexed. Running stringify against the PPT from a command line results in the error "exec of html2text -ascii %FILENAME|F% failed: No such file or directory at /home2/mydomain/public_html/foswiki/lib/Foswiki/Sandbox.pm line 542"

-- BobCorless - 11 Nov 2010

Bob, please install html2text. This is a new dependency. The old way of converting HTML to text was flawed with respect to UTF-8 encoding, which is why I changed the backend. It should be fine afterwards.

-- MichaelDaum - 11 Nov 2010

html2text is included in HTML::FormatExternal, which I have already installed. If I uninstall HTML::FormatExternal and specifically install just HTML::FormatText::Html2text (18), it installs HTML::FormatExternal again.

I still get the error described above when running stringify from a prompt. Kinoupdate runs without errors (command line or rest) but does not index the ppt.

-- BobCorless - 11 Nov 2010

Please install the correct html2text. The one needed here is NOT Perl-based; it is a standalone command-line program. See http://www.mbayer.de/html2text/

-- MichaelDaum - 12 Nov 2010
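
For anyone hitting the same "exec of html2text ... failed: No such file or directory" error, a quick way to see whether a standalone html2text binary is visible to the account that runs kinoupdate is a check along these lines. This is only a sketch: `check_cmd` is a hypothetical helper name and not part of Foswiki or StringifierContrib, and it only tells you whether the command resolves on the current PATH. Cron jobs often run with a shorter PATH than an interactive login, so it is worth running the check from the same environment as the cron job.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command resolves on the current PATH.
check_cmd() {
    if resolved=$(command -v "$1" 2>/dev/null); then
        echo "found: $resolved"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# A Perl module such as HTML::FormatText::Html2text does not provide this binary;
# the standalone html2text program has to be installed separately.
check_cmd html2text || echo "html2text is not on PATH: install it or add its directory to PATH"
```

If the check reports "missing" under cron but "found" in a login shell, the difference is most likely the PATH rather than the installation itself.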

Ah OK. Easy when you know how......

..... however.....

I think I installed it correctly following the instructions because when I run the stringify command from the tools directory and nominate the PPT file, I do get output with text lines from the PPT file. But these are not showing up in the Kinosearch after both a kinoupdate and kinoindex.

-- BobCorless - 12 Nov 2010

Something is wrong with my Stringifier install because I've uploaded a DOC for the first time on 1.1.1 and I get errors about antiword path. This worked in 1.0.9 so I need to investigate where I've gone wrong.

-- BobCorless - 12 Nov 2010

Please let me know what comes out of it so that I can update the documentation if necessary, in case it helps others upgrading stringifier or kino.

-- MichaelDaum - 12 Nov 2010

Micha, you forgot to add html2textCmd to configure so that users can define it (useful in hosted environments where it's not in the PATH). I also changed the name so that it's more aligned with the rest of the plugins.

-- OlivierRaginel - 12 Nov 2010

Thanks for fixing that :)

-- MichaelDaum - 12 Nov 2010
 


Summary: fall back to HTML::TreeBuilder if html2text not available
ReportedBy: WillNorris
Codebase:
SVN Range:
AppliesTo: Extension
Component: StringifierContrib
Priority: Normal
CurrentState: No Action Required
WaitingFor: MichaelDaum
Checkins: StringifierContrib:0e82b35ab6a6
TargetRelease: n/a
ReleasedIn: n/a
Topic revision: r17 - 12 Nov 2010, MichaelDaum