Item14319: PublishPlugin needs to encode output
Priority: Normal
Current State: Closed
Released In: n/a
Target Release:
Initially, GeorgeClark and I were looking only at HTML output. George identified a change to BackEnd/file.pm which corrected output to HTML files. The change was around line 87, changing
    print $fh $string;
to
    print $fh Foswiki::encode_utf8($string);
George also noted issues with encoding of file names but could not find associated code.
We didn't look into output to PDFs.
Here's our discussion on IRC.
--
LynnwoodBrown - 03 Feb 2017
The "index.html" file has major encoding issues.
Also, it appears to be URL-encoding (percent-encoding) filenames somehow. For example, I have a topic with a file named test+blah.txt attached. The code looks on disk for:
Testing read of (/var/www/foswiki/distro/core/pub/Litterbox/ANewTopicFromTmpl/test%2bblah.txt)
--
GeorgeClark - 03 Feb 2017
The encoded attachment filename happens because the code renders the topic and then looks for the URLs, which are URL-encoded (percent-encoded) in the generated HTML.
--
GeorgeClark - 03 Feb 2017
Actually it's a bit worse. See the code in PubLinkFixupPlugin ... Links inserted by "attach" in older Foswiki versions are first UTF-8 encoded and then URL-encoded; links from recent versions are plain Unicode. So the routine at _copyResource needs to determine whether the link scraped from the generated HTML is Unicode or UTF-8, and whether it needs to be URL-decoded. I'm not completely sure of the fix yet, but just adding a brute-force Foswiki::urlDecode() is insufficient: it fixes some links and breaks others. Doing the URL decode without the UTF-8 encoding fixes a different subset of the links.
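One way to handle both link styles in a single pass is sketched below. It is not the fix that was committed, only an illustration under stated assumptions: the scraped link arrives as a Perl character string (already decoded from the page encoding), and any percent escapes in it represent UTF-8 bytes. The helper name link_to_unicode is hypothetical. The idea is to turn literal non-ASCII characters into UTF-8 bytes first, expand the escapes into bytes, and then decode the whole thing strictly, so a mixed link and a fully escaped link end up as the same string.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of PublishPlugin.
# Assumes $link is a character string whose %XX escapes encode UTF-8 bytes.
sub link_to_unicode {
    my ($link) = @_;

    # Literal non-ASCII characters become UTF-8 bytes; ASCII (including
    # the %XX escapes) is left unchanged.
    my $bytes = Encode::encode( 'UTF-8', $link );

    # Expand each %XX escape into the single byte it stands for.
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr( hex($1) )/ge;

    # Strict decode: returns undef if the bytes are not valid UTF-8.
    my $chars = eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
    return $chars;
}

# A fully escaped name and a mixed Unicode/escaped name give the same result.
my $escaped = link_to_unicode('A%c5%9b%c4%8d%c3%81%20x.dat');
my $mixed   = link_to_unicode("A\x{15b}\x{10d}\x{c1}%20x.dat");
print "same\n" if $escaped eq $mixed;
```

A link whose escapes are not UTF-8 (for example a legacy Latin-1 escape) yields undef here, so the caller would still need a fallback for that case.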
Other issues:
- _PUBLISHERS_CONTROL_CENTRE does directory and filename processing, and needs to do utf-8 encoding/decoding.
- Code is needed to detect and handle NFD vs NFC normalization, for portability on OSX systems.
- There are unescaped braces in one regex, which breaks on recent Perl.
- The "tied hash" used to access the query params dies due to multi_param checks in Request when FOSWIKI_ASSERTS is enabled.
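For the NFD vs NFC point, a minimal sketch using the core Unicode::Normalize module is shown here. The helper name same_filename is hypothetical, and it assumes both names are already decoded character strings rather than raw bytes from the filesystem. OS X filesystems typically hand back decomposed (NFD) names, so comparing after normalising both sides to NFC avoids false mismatches.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: compare two decoded filenames regardless of
# whether they are in composed (NFC) or decomposed (NFD) form.
sub same_filename {
    my ( $left, $right ) = @_;
    return NFC($left) eq NFC($right);
}

# "e with acute" as one code point vs. "e" plus a combining acute accent.
my $composed   = "caf\x{e9}.txt";
my $decomposed = "cafe\x{301}.txt";
print "match\n" if same_filename( $composed, $decomposed );
```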
--
GeorgeClark - 04 Feb 2017
Note that links are inconsistently encoded, even on the same page. In the Attachments table there are two links: the attachment link and the manage link. In my test cases, I see:
<a href="http://..../pub/Litterbox/%c3%9a%c5%88%c3%ad%c4%8d%c3%b4%c4%8f%c4%9b/AttachmentsWithSpaces/AśčÁ%20ŠŤ%20śěž.dat">AśčÁ ŠŤ śěž.dat</a> in an inserted link,
<a href="/pub/Litterbox/%c3%9a%c5%88%c3%ad%c4%8d%c3%b4%c4%8f%c4%9b/AttachmentsWithSpaces/A%c5%9b%c4%8d%c3%81%20%c5%a0%c5%a4%20%c5%9b%c4%9b%c5%be.dat">AśčÁ ŠŤ śěž.dat</a> in the filename column
<a href="/bin/attach/Litterbox/Úňíčôďě/AttachmentsWithSpaces?filename=A%c5%9b%c4%8d%c3%81%20%c5%a0%c5%a4%20%c5%9b%c4%9b%c5%be.dat;revInfo=1" title="change, update, previous revisions, move, delete..." rel="nofollow">manage</a>
There might even be a task in here for inconsistent encoding of links in our generated html.
- In the "inserted" link, the webname is encoded, the filename is not.
- In the attachment table filename, the webname is encoded and the filename is encoded
- In the attachment table manage link, the webname is not encoded, only the filename.
But in the publish plugin, all of these cases need to be handled in recovering the filename.
--
GeorgeClark - 04 Feb 2017
Discovered this morning that the change made to BackEnd/file.pm resulted in corruption of some external images being saved as __extraXX files. The following code worked as a short-term fix. As George notes, probably all of this code needs revisiting.
    if ( substr( $file, -5 ) eq '.html' ) {
        print $fh Foswiki::encode_utf8($string);
    }
    else {
        print $fh $string;
    }
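The same text-versus-binary split can also be expressed with PerlIO layers instead of an explicit encode call. The sketch below is only an illustration, not what file.pm does: the helper name write_published_file is hypothetical, and it keeps the assumption from the snippet above that only names ending in .html hold character data.

```perl
use strict;
use warnings;

# Hypothetical helper, not the code in BackEnd/file.pm.
# Assumption: only *.html content is character data; everything else
# (for example downloaded images) is already raw bytes.
sub write_published_file {
    my ( $path, $data ) = @_;

    my $is_html = $path =~ /\.html\z/;

    open( my $fh, '>', $path ) or die "Cannot open $path: $!";

    # Encode characters to UTF-8 for HTML; pass bytes through untouched
    # for everything else.
    binmode( $fh, $is_html ? ':encoding(UTF-8)' : ':raw' );
    print {$fh} $data;
    close($fh) or die "Cannot close $path: $!";
    return;
}
```

Choosing the layer once at open time keeps the print call identical for both kinds of file, which makes it harder to encode binary data by accident.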
--
LynnwoodBrown - 04 Feb 2017
I just checked in on branch Item14319. There should be no need for encoding to/from UTF-8 beyond what is there.
--
Main.CrawfordCurrie - 01 Mar 2017 - 17:02