
Item13696: Some attachments can be unreachable with non-UTF8 store encoding.

Priority: Urgent
Current State: Closed
Released In: 2.0.2
Target Release: patch
Applies To: Engine
Component: FoswikiStore
Branches: master
Reported By: GeorgeClark
Waiting For: CrawfordCurrie
Last Change By: GeorgeClark
When an attachment is uploaded, the filename is saved into the Store using the configured {Store}{Encoding}. However, browser links are always generated in UTF-8. So on systems that do not use viewfile for attachments, attachments with non-ASCII characters in their filenames become unreachable.

This can be addressed in three ways:
  1. Convert the store to UTF-8. This is the recommended solution.
  2. Use viewfile for all attachment access.
  3. Use a plugin completePageHandler to rewrite any pub URLs from UTF-8 to the configured Store encoding.
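
To illustrate option 3, here is a minimal sketch in Python (not the Perl handler itself) of rewriting pub URLs in rendered HTML from UTF-8 to the store encoding. The function name `fix_pub_links`, the regular expression, and the restriction to `src`/`href` attributes starting with `/pub/` are assumptions made for illustration only; a real handler would need to use the configured pub URL path and cover every tag type that can reference it.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# Hypothetical pattern: src/href attributes whose value starts with /pub/
# and carries no query string or fragment.
PUB_ATTR = re.compile(r'''((?:src|href)=)(["'])(/pub/[^"'?#]*)\2''')


def fix_pub_links(html, store_encoding="iso-8859-1"):
    """Re-encode UTF-8 percent-escapes in pub URLs to the store encoding."""

    def recode(match):
        raw = unquote_to_bytes(match.group(3))
        try:
            text = raw.decode("utf-8")
            fixed = quote(text, safe="/", encoding=store_encoding)
        except UnicodeError:
            # Not valid UTF-8, or not representable in the store encoding:
            # leave the link exactly as it was.
            return match.group(0)
        return f"{match.group(1)}{match.group(2)}{fixed}{match.group(2)}"

    return PUB_ATTR.sub(recode, html)


page = '<img src="/pub/Sandbox/TestUtf8Attach/_Andr%c3%a9.jpg" alt="x" />'
print(fix_pub_links(page))
```

Note that `quote` emits uppercase hex digits (`%E9`); percent-escapes are case-insensitive, so this is equivalent to the `%e9` form.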
-- GeorgeClark - 10 Sep 2015

There is a bit more to that quote. It starts out with a caveat: that the URI scheme ... represents textual data consisting of characters from the UCS. If we start from the native URI that Apache understands, http://somesite/pub/Sandbox/iso-8859-1-location, it does not contain characters from the UCS. It is consistently encoded as, and needs to be requested as, an ISO-8859-1 URI.
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
So the initial assumption that the data consists of UCS characters is not met. The problem is that Foswiki is generating a URI as if the server location did contain UCS characters, so the generated URI is incorrect. Trying to modify Apache (and lighttpd and nginx and IIS ...) to convert from the UTF-8 location back to the real ISO-8859-1 location doesn't make sense. We need to generate the correct URLs in the first place.

Ignoring Foswiki:
  • The URI of the file is pub/Sandbox/TestUtf8Attach/_Andr%e9.jpg
  • It does not contain any UCS characters.
But Foswiki generates:
  • URI pub/Sandbox/TestUtf8Attach/_Andr%c3%a9.jpg
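
The mismatch can be reproduced outside Foswiki. The following Python sketch assumes the file was stored under an ISO-8859-1 store as `_André.jpg` and compares the bytes a web server would receive for each of the two URIs above; the variable names are illustrative only.

```python
from urllib.parse import unquote_to_bytes

# Bytes of the filename on disk under an ISO-8859-1 store (assumption).
stored_name = "_André.jpg".encode("iso-8859-1")

# Bytes the server sees after percent-decoding each request path.
requested_native = unquote_to_bytes("_Andr%e9.jpg")
requested_utf8 = unquote_to_bytes("_Andr%c3%a9.jpg")

print(requested_native == stored_name)      # True: the file is found
print(requested_utf8 == stored_name)        # False: no such file
print(requested_utf8.decode("iso-8859-1"))  # _AndrÃ©.jpg
```

The UTF-8 form decodes to a different, longer byte sequence, so a server mapping the request straight onto the filesystem looks for a file that does not exist.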

-- GeorgeClark - 10 Sep 2015

See also Item13697.

-- GeorgeClark - 11 Sep 2015

OK, I removed all my stupid comments. Sorry for bothering.

-- JozefMojzis - 11 Sep 2015

On a 2.0 system with iso-8859-1 encoding, an IMG link is generated:
<img src="/pub/Sandbox/TestUtf8Attach/_Andr%c3%a9.jpg" alt="_André.jpg" width='113' height='85' />

On a 1.1.9 system, same encoding, same file, same topic, the link in the html source is:
 <img src="/pub/Litterbox/TestUTF8Attach/_Andr%e9.jpg" alt="_André.jpg" width='113' height='85' />

-- GeorgeClark - 11 Sep 2015

Attached a plugin module which appears to fix the URLs up on a 2.0 system with iso-8859-1 store.

-- GeorgeClark - 11 Sep 2015

Crawford, I've checked this in as part of Foswiki.pm, rather than using a plugin completePageHandler. Please take a moment to review it if you could. I think I've covered the tag types that will refer to pub URL locations. Unit tests are all still passing.

-- GeorgeClark - 12 Sep 2015

I think that this still needs some work. It would be better to map the URL back to the store filename and determine whether it exists with either encoding, and only replace the path if one of them is actually reachable. Not sure how easy it is to reliably recover the filename from the URL.
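
A rough Python sketch of that check, under stated assumptions: the URL path is already relative to the pub directory, and the helper name `reachable_store_path` and its `exists` parameter are hypothetical (the parameter is there only so the lookup can be exercised without touching a real filesystem). This is not how PubLinkFixupPlugin is implemented.

```python
import os
from urllib.parse import unquote_to_bytes


def reachable_store_path(url_path, pub_root, store_encoding, exists=os.path.exists):
    """Return the on-disk path (bytes) that a pub URL resolves to, or None.

    Tries the path exactly as requested first, then the same name
    re-encoded from UTF-8 into the store encoding.
    """
    raw = unquote_to_bytes(url_path)
    candidates = [raw]
    try:
        recoded = raw.decode("utf-8").encode(store_encoding)
        if recoded != raw:
            candidates.append(recoded)
    except UnicodeError:
        # Not valid UTF-8, or not representable in the store encoding.
        pass

    root = os.fsencode(pub_root)
    for relative in candidates:
        full = os.path.join(root, relative.lstrip(b"/"))
        if exists(full):
            return full
    return None
```

A link would then be rewritten only when this returns a path that differs from the one originally requested; when it returns None, neither encoding is reachable and the link is left alone.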

-- GeorgeClark - 12 Sep 2015

Implemented as PubLinkFixupPlugin. Shipped as a default extension.

-- GeorgeClark - 19 Sep 2015

Still an issue. On a new site running with a non-utf-8 store, we should not need to be "fixing" up URLs. Attachment pub URLs should be inserted with the correct encoding.

In Foswiki::Attach::getAttachmentLink(), we should use the Store encoding, and not UTF-8, when generating a link. Since Foswiki::encodeUrl is hardcoded to UTF-8, we either need to extend it to optionally take the Store encoding, or consider an encodePubUrl which uses the Store encoding.
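
As a language-neutral illustration of the first alternative, the sketch below shows a URL encoder with an optional encoding argument, written in Python rather than Perl. The name `encode_pub_url` and its signature are hypothetical and do not correspond to any existing Foswiki function.

```python
from urllib.parse import quote


def encode_pub_url(path, encoding="utf-8"):
    """Percent-encode a pub path using the given character encoding.

    Raises UnicodeEncodeError if the path contains characters that the
    chosen encoding cannot represent.
    """
    return quote(path, safe="/", encoding=encoding, errors="strict")


path = "/pub/Sandbox/TestUtf8Attach/_André.jpg"
print(encode_pub_url(path))                # UTF-8 escapes (default)
print(encode_pub_url(path, "iso-8859-1"))  # store-encoding escapes
```

Keeping UTF-8 as the default leaves existing callers unchanged, while attachment link generation could pass the Store encoding explicitly.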

-- GeorgeClark - 20 Sep 2015

Releasing this as is. My rationale is that any Store encoding other than utf-8 is probably transitional. Writing links in utf-8 avoids future fixup once conversion happens.

-- GeorgeClark - 29 Sep 2015
 

Topic revision: r18 - 10 Oct 2015, GeorgeClark