Item14319: PublishPlugin needs to encode output

Priority: Normal
Current State: Closed
Released In: n/a
Target Release:
Applies To: Extension
Component: PublishPlugin
Branches: master
Reported By: LynnwoodBrown
Waiting For:
Last Change By: CrawfordCurrie
Initially GeorgeClark and I looked only at HTML output. George identified a change to BackEnd/file.pm that corrected the output written to HTML files. The change was around line 87: replace print $fh $string; with print $fh Foswiki::encode_utf8($string);.

George also noted issues with encoding of file names but could not find associated code.

We didn't look into PDF output.

Here's our discussion on IRC.

-- LynnwoodBrown - 03 Feb 2017

The "index.html" file has major encoding issues.

Also, it appears to be somehow escaping file names. For example, I have a topic with a file named test+blah.txt attached. The code looks on disk for:

Testing read of (/var/www/foswiki/distro/core/pub/Litterbox/ANewTopicFromTmpl/test%2bblah.txt)

-- GeorgeClark - 03 Feb 2017

The attachment file name ends up encoded because the code renders the topic and then looks for the URLs in the result, and those URLs are URL-encoded in the rendered HTML.

-- GeorgeClark - 03 Feb 2017

Actually it's a bit worse. See the code in PubLinkFixupPlugin. Links inserted by "attach" in older Foswiki versions are first UTF-8 encoded and then URL-encoded; links from recent versions are plain Unicode. So the routine at _copyResource needs to determine whether the link scraped from the generated HTML is Unicode or UTF-8, and whether it needs to be URL-decoded. I'm not completely sure of the fix yet, but just adding a brute-force Foswiki::urlDecode() is insufficient: it fixes some links and breaks others. Doing the URL decode without the UTF-8 encoding fixes a different subset of the links.
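
As an illustration only, here is a minimal sketch of one heuristic for telling the two link forms apart. It is not the PublishPlugin or PubLinkFixupPlugin code: the helper name decode_scraped_path is hypothetical, and it uses the core Encode module rather than Foswiki::urlDecode(). The assumption is that a string already containing characters above 0xFF is Unicode with only a few ASCII escapes such as %20, while anything else holds URL-escaped bytes that may be UTF-8.

```perl
use strict;
use warnings;
use utf8;        # this sketch contains non-ASCII string literals
use Encode ();

# Hypothetical helper, not part of any Foswiki plugin: recover a Unicode
# file name from a path scraped out of rendered HTML.
sub decode_scraped_path {
    my ($path) = @_;

    # Characters above 0xFF can only be present if the link is already
    # a Unicode string (assumption: newer-style links).
    my $is_wide = $path =~ /[^\x00-\xFF]/;

    # Undo percent-escapes such as %20 or %c5%9b.
    $path =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    return $path if $is_wide;

    # Otherwise treat the result as bytes and attempt a strict UTF-8
    # decode; if the bytes are not valid UTF-8, return them unchanged.
    my $bytes   = $path;    # copy, because decode() may modify its argument
    my $decoded = eval {
        Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK() );
    };
    return defined $decoded ? $decoded : $path;
}

binmode( STDOUT, ':encoding(UTF-8)' );

# The fully escaped and the partly escaped forms of the same attachment
# name should both come out as the same Unicode string.
print decode_scraped_path(
    'A%c5%9b%c4%8d%c3%81%20%c5%a0%c5%a4%20%c5%9b%c4%9b%c5%be.dat'), "\n";
print decode_scraped_path('AśčÁ%20ŠŤ%20śěž.dat'), "\n";
```

The heuristic is ambiguous for names made up only of Latin-1 characters that happen to form valid UTF-8 byte sequences, so it should be read as a starting point rather than a fix.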

Other issues:
  • _PUBLISHERS_CONTROL_CENTRE does directory and filename processing, and needs to do utf-8 encoding/decoding.
  • Code is needed to detect and handle NFD vs NFC normalization, for portability on OSX systems.
  • There are unescaped braces in one regex, which breaks on recent Perl.
  • The "tied hash" used to access the query params dies due to multi_param checks in Request when FOSWIKI_ASSETS is enabled.
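
For the NFD vs NFC point, a minimal sketch of comparing names in a single normalization form, using the core Unicode::Normalize module. The helper name same_filename is hypothetical, and the sketch assumes both names have already been decoded to Perl character strings; where such a comparison would belong in the plugin is not settled in this task.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: treat two file names as equal if they differ only
# in Unicode normalization form (composed vs decomposed accents).
sub same_filename {
    my ( $left, $right ) = @_;
    return NFC($left) eq NFC($right);
}

# "é" as one precomposed code point (NFC) ...
my $composed = "caf\x{E9}.txt";

# ... and as "e" followed by a combining acute accent (NFD), the form a
# file system that stores decomposed names might hand back.
my $decomposed = "cafe\x{301}.txt";

print "raw strings differ\n" if $composed ne $decomposed;
print "names match after NFC\n" if same_filename( $composed, $decomposed );
```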

-- GeorgeClark - 04 Feb 2017

Note that links are inconsistently encoded, even on the same page. The Attachments table has two links: the attachment itself and the manage link. In my test cases, I see:
<a href="http://..../pub/Litterbox/%c3%9a%c5%88%c3%ad%c4%8d%c3%b4%c4%8f%c4%9b/AttachmentsWithSpaces/AśčÁ%20ŠŤ%20śěž.dat">AśčÁ ŠŤ śěž.dat</a>  in an inserted link,
<a href="/pub/Litterbox/%c3%9a%c5%88%c3%ad%c4%8d%c3%b4%c4%8f%c4%9b/AttachmentsWithSpaces/A%c5%9b%c4%8d%c3%81%20%c5%a0%c5%a4%20%c5%9b%c4%9b%c5%be.dat">AśčÁ ŠŤ śěž.dat</a>  in the filename column
<a href="/bin/attach/Litterbox/Úňíčôďě/AttachmentsWithSpaces?filename=A%c5%9b%c4%8d%c3%81%20%c5%a0%c5%a4%20%c5%9b%c4%9b%c5%be.dat;revInfo=1" title="change, update, previous revisions, move, delete..." rel="nofollow">manage</a> 

There might even be a task in here for inconsistent encoding of links in our generated html.
  • In the "inserted" link, the webname is encoded, the filename is not.
  • In the attachment table filename, the webname is encoded and the filename is encoded.
  • In the attachment table manage link, the webname is not encoded, only the filename.

But in the publish plugin, all of these cases need to be handled when recovering the filename.

-- GeorgeClark - 04 Feb 2017

Discovered this morning that the change made to BackEnd/file.pm corrupted some external images, which are saved as __extraXX files. The following code worked as a short-term fix. As George notes, probably all of this code needs revisiting.
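if (substr($file, -5) eq '.html') {
    # HTML pages hold character data, so encode them to UTF-8 bytes
    print $fh Foswiki::encode_utf8($string);
}
else {
    # Anything else (e.g. external images) is written unchanged
    print $fh $string;
}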

-- LynnwoodBrown - 04 Feb 2017

I just checked in on branch Item14319. There should be no need for encoding to/from utf8 beyond what is there.

-- Main.CrawfordCurrie - 01 Mar 2017 - 17:02
 
Topic revision: r7 - 18 Apr 2017, CrawfordCurrie
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere.