Topic Case Sensitivity

Simple enhancement to make TWiki linking and jumping to topics more forgiving. Basically, if you specify a link to a topic and the topic exists with a different case, it should find the topic.

A less simple aspect is ensuring that SEARCH behaves as the user expects, e.g. in the ordering of results and in returning both topics as relevant results when two topics differ only by case.

Pseudo algorithm:

  • Check if topic exists
    • If yes: Link to it (existing spec)
    • Else:
      • Check if topic exists with different case (new feature)
        • If yes: Link to it
        • Else:
          • For linking: Render a "question mark link" (existing spec)
          • For jump box: Show list of similar topics (existing spec)

This should also work across webs, e.g. TWiki should be forgiving to case of webs too.
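The steps above can be sketched roughly as follows. This is an illustration in Python, not TWiki's Perl code: the `webs` dictionary and the names `match_name` and `resolve` are hypothetical stand-ins for whatever the store actually provides.

```python
def match_name(names, wanted):
    """Return the stored name matching `wanted`, or None."""
    names = list(names)
    if wanted in names:
        # Existing spec: an exact match always wins.
        return wanted
    folded = wanted.casefold()
    for name in names:
        # New feature: same spelling, different case (first one found).
        if name.casefold() == folded:
            return name
    return None


def resolve(webs, web, topic):
    """Resolve (web, topic) forgivingly; `webs` maps web name -> topic names."""
    web_name = match_name(webs.keys(), web)
    if web_name is None:
        return None
    topic_name = match_name(webs[web_name], topic)
    if topic_name is None:
        return None
    return (web_name, topic_name)
```

A `None` result corresponds to the final "Else" branch, where the caller renders a question mark link or shows the list of similar topics.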

Examples for linking and jumping to topic:

| Text entered | Existing Web.Topic | Action for linking | Action for jump to topic |
| WebStatistics | Thisweb.WebStatistics | Link to it | Jump to it |
| webstatistics | Thisweb.WebStatistics | Link to it | Jump to it |
| webSTATISTICS | Thisweb.WebStatistics | Link to it | Jump to it |
| DevConsequences | Thisweb.DevConsequences and Thisweb.DevconSequences | Link to exact case match | Jump to exact case match |
| DevConseQUEnces | Thisweb.DevConsequences and Thisweb.DevconSequences | Link to first one found (KISS solution) | Show "similar topics" screen if there is more than one choice and no exact match |
| Otherweb.WebStatistics | Otherweb.WebStatistics | Link to it | Jump to it |
| Otherweb.webstatistics | Otherweb.WebStatistics | Link to it | Jump to it |
| OTHERWEB.WebStatistics | Otherweb.WebStatistics | Link to it | Jump to it |
| WebStat | (no match) | Show question mark link to create Thisweb.WebStat | Show similar topics |
| webstat | (no match) | Show question mark link to create Thisweb.WebStat | Show similar topics |
| Otherweb.WebStat | (no match) | Show question mark link to create Otherweb.WebStat | Show similar topics |
| otherweb.webstat | (no match) | Show question mark link to create Otherweb.Webstat | Show similar topics |
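The rows above differ between linking and jumping only when several topics share a spelling. A minimal Python sketch of that distinction, with hypothetical function names and action labels (TWiki itself is written in Perl):

```python
def candidates(topics, wanted):
    """All topics whose spelling equals `wanted` when case is ignored."""
    folded = wanted.casefold()
    return [t for t in topics if t.casefold() == folded]


def link_action(topics, wanted):
    if wanted in topics:
        return ("link", wanted)            # exact case match
    found = candidates(topics, wanted)
    if found:
        return ("link", found[0])          # KISS: first one found
    return ("question_mark", wanted)       # offer to create the topic


def jump_action(topics, wanted):
    if wanted in topics:
        return ("jump", wanted)            # exact case match
    found = candidates(topics, wanted)
    if len(found) == 1:
        return ("jump", found[0])          # unambiguous case-insensitive match
    return ("similar_topics", found)       # none, or more than one choice
```

Linking has to produce something on every render, hence the first-found rule; the jump box can afford to stop and ask the user.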

As an added bonus, where possible, rewrite the URLs in links and in the URL bar (on redirects, or in topics) to match the actual topic name. That way a less accurate URL to a topic stays in circulation for as short a time as possible.
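One way to picture the URL clean-up, as a hedged Python sketch: once a request has been resolved to the real web and topic, compare the requested path with the canonical one and redirect if they differ. The function name and the `/cgi-bin/view` default are assumptions for illustration.

```python
def canonical_redirect(request_path, actual_web, actual_topic,
                       script_prefix="/cgi-bin/view"):
    """Return the path to redirect to, or None if the request is already canonical."""
    canonical = f"{script_prefix}/{actual_web}/{actual_topic}"
    if request_path == canonical:
        return None
    return canonical
```

For example, a request for `/cgi-bin/view/codev/topiccaseSensitivity` that resolved to Codev.TopicCaseSensitivity would be redirected to `/cgi-bin/view/Codev/TopicCaseSensitivity`.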

-- Contributors: SvenDowideit, PeterThoeny - 08 May 2007

Discussion

(Merged content from TWikiOnWindows and TWikiOnCygwin)

The main difference is that TWikiOnWindows (at least for the Dec 00 release) will collapse (albeit contrived) topics such as "GangsterConSequences" (How Gangsters Sequence their Cons) into the same topic as GangsterConsequences (The consequences of being a gangster), whereas TWikiOnUnix will not.

It can be handy though, because it means you do not need to preserve the case when typing the topic in the Go box.

It is inherent in TWikiStore's use of the underlying operating system's filesystem: Windows ignores but preserves case.

-- MartinCleaver - Probably June 2001

Some parts of TWiki will remain case insensitive of course, since they are not based on file system lookups - and if you create pages depending on case insensitive behaviour you may have difficulty re-hosting them on Unix/Linux at a later date.

-- RichardDonkin - 30 Oct 2001

I've been thinking about making TWiki case-insensitive even on Unix. I'm not sure whether it's a good idea (it's certainly alien to Unix users), but case sensitivity in general is a Bad Thing: I've been bitten by case mismatches numerous times, and case (in)sensitivity in topic names tends to crop up in several places here in Codev.

-- JoachimDurchholz - 31 Oct 2001

I agree, even if it means not being able to distinguish between terms that differ only by case, for instance GangsterConSequences and GangsterConsequences.

-- MartinCleaver - 31 Oct 2001

That's a perfect example why case sensitivity should not be used to distinguish between topics. You can't read the names over a phone and be sure that you're understood, so you have to explicitly say "gangsterconsequences with an upper-case S". Same in meetings or wherever you verbally communicate.

I see two problems though:
  1. How to preserve letter case. Wiki names are defined to have internal capitalization, so letter case is important and must be retained.
  2. How to make sure that TWiki finds a topic regardless of entered case. I.e. TWiki should know that GangsterConSequences and GangsterConsequences are different names for the same topic. I suspect that hunting up and finding all places where this requires a change in the TWiki code is not a small task, and maybe not worth the results.

-- JoachimDurchholz - 01 Nov 2001

Actually on Windows this is not an issue - the operating system folds them to be equivalent.

TWiki would still need to have the code to make this work under Unix.
Since such case folding code wouldn't hurt under Windows, there's no point in making this OS dependent. [Main.JoachimDurchholz - 02 Nov 2001]

-- MartinCleaver - 01 Nov 2001

It would be possible to store the names in lower case and have a meta data item that gives the correct rendering. I've alluded to this before, in a conversation about spaced out wiki words to allow the proper spaced out rendering of names such as Ronald MacDonald (which because of the blanket space-where-capitalisation-change rule becomes Ronald Mac Donald).

Implementing this would, however, cause the topic names to show in searches as lower case. To avoid opening every topic to find its correct capitalisation, an initial idea would be to have a cache automatically maintained by TWiki giving the correct capitalisations.

-- MartinCleaver - 01 Nov 2001
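A rough Python sketch of the cache idea from the comment above, assuming names are stored in lower case and the preferred capitalisation is kept in a separate lookup. The class and method names are hypothetical.

```python
class DisplayNameCache:
    """Maps lower-case stored topic names to their preferred capitalisation."""

    def __init__(self):
        self._display = {}

    def remember(self, display_name):
        # Called whenever a topic is saved or renamed.
        self._display[display_name.lower()] = display_name

    def display(self, stored_name):
        # Fall back to the stored (lower-case) name if nothing is cached.
        return self._display.get(stored_name, stored_name)
```

A search result list could then be rendered with one dictionary lookup per hit instead of opening each topic to read its meta data.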

Unless meta data is readily accessible, I wouldn't recommend doing that. Imagine a Search Results panel - TWiki would have to open and interpret every file in the result.

Yet another way would be to store the "true" capitalization in the file name and make TWiki smart enough to find the topic regardless of the capitalization of the topic.

-- JoachimDurchholz - 02 Nov 2001

I've started working on this problem from the URL point of view, since this is where it matters more often for my users. My current Apache-based solution is at ShorterCaseInsensitiveURLs (this is blended together with some URL-shortening tricks I'm doing at ShorterURLs). Just adding /i somewhere in search should pretty much give you total case insensitivity. Of course it requires crazy Apache tricks, but it's a start for now (I assume IIS has URL rewriting?)

-- RobNapier - 30 May 2002

I think the use of case sensitivity/insensitivity should be left to the discretion of the site administrator. Personally I feel articles ought to be saved with lower-case filenames, search terms converted to lower case before being searched for, etc. Here's why.

Let's take for example two users (John and Jane) on a project who, like all wiki users, are keen on adding content to their KMS. If a topic doesn't exist, they create it; if it exists, they build on it. John and Jane are working on a project called 'Wikitiquette', and John decides to enter the word 'caseSensitivity' (note the typo - the small c) in the Go/Jump box. Finding no topic, he creates a new one, spends an hour elegantly crafting his article and saves it when he is done. Jane in the meantime is unaware that John has made a new article (yes, she either didn't check WebChanges or was in a hurry) and decides to do a topic on 'CaseSensitivity' (correct spelling this time). She enters the word in the Go/Jump box and, finding no topic, creates a new one, spends an hour elegantly crafting her article and saves it when she is done.

We've just seen two people waste roughly one man-hour of labour between them, because chances are that a majority of the content in both users' topics is the same. Had Jane been careful enough to search for all possible case combinations, e.g. Casesensitivity, caseSensitivity or casesensitivity (why should she anyway?), she would have noticed John's post and worked on it. She probably would have taken steps to correct his mistake to avoid further hassle in the future (if she was a smart woman, but let's face it, most users simply aren't forward thinking or diligent enough to pick out these flaws). To make the problem worse, John and Jane might never find each other's topics. Or perhaps they do: John might one day discover Jane's post and say to himself, "hell no, I didn't do all this...", and vice versa.

Now if TWiki saved the filenames as lowercase, the impact of John's typo would be diminished: Jane would have found his topic and worked on it (now, that's what I call collaboration). If I, the site administrator, were in a hurry and searching for their topic, it wouldn't matter whether I made a case typo or not. We'd all find our stuff and we'd all have our mops of hair intact at the end of the day.

Why why why all this rigid complexity?

-- ShalomBhooshi - 28 Aug 2006

see DontUseCaseToFindWikiTopics, ShorterCaseInsensitiveURLs, WikiWordSyntax, and MapUserToWikiNameCaseInsensitive, LoginNamesShouldNotBeWikiNames...

FileSystemNameClash has some code I did in 2001 for it, which I will adapt for FreetownRelease.

In essence, there is a unique identifier for a topic (the spelling of that topic's name) and a preferred presentation of that topic's name (the chosen capitalisation).

If the end user happens to guess the presentation wrong, we can still successfully match to the topic, using the spelling.

The user name and user login cases are more difficult, but I think I can take care of them here and in AddTWikiAdminUser.

For those that have, and want, 2 separate topics with the same spelling but different cases, it should still be possible to create the second through some kind of intentional 'force' mechanism, and if you have 2 such topics, you'll need to get the case right.

Things for me to watch out for - that both topics are successfully found in a search, or that only the correct one is found.

Another reason this is important is that Mac OS X uses a not quite case sensitive file system, where having 2 topics with different case makes a hell of a mess.

-- SvenDowideit - 26 Mar 2007
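To make the spelling-versus-presentation idea and the 'force' mechanism concrete, here is a small Python sketch. It is only an illustration under assumed names (`TopicIndex`, `create`, `find`), not the code referred to in FileSystemNameClash.

```python
class TopicIndex:
    """Topics keyed by spelling (case-folded), each with its presentation(s)."""

    def __init__(self):
        self._by_spelling = {}

    def create(self, name, force=False):
        existing = self._by_spelling.setdefault(name.casefold(), [])
        if name in existing:
            return name
        if existing and not force:
            # Same spelling already exists with a different capitalisation.
            raise ValueError(f"{name} clashes with {existing[0]}; use force=True")
        existing.append(name)
        return name

    def find(self, name):
        existing = self._by_spelling.get(name.casefold(), [])
        if name in existing:
            return name                 # exact presentation
        if len(existing) == 1:
            return existing[0]          # forgiving match on spelling
        return None                     # unknown, or ambiguous: case must be right
```

With a single topic per spelling, any capitalisation finds it; once a second one has been forced in, only the exact case resolves.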

I think we should handle TopicDisplayName and topic case sensitivity as two separate features. On topic case insensitivity, I believe we should keep the current spec of a physical BumptyWord topic name, and when linking to it search for a case insensitive match in case there is no direct match.

-- PeterThoeny - 26 Mar 2007

It is not clear to me what you are proposing Sven. Can you explain it seen from the user side?

-- KennethLavrsen - 08 Apr 2007

I want to make it so that (assuming only one topic with the spelling Codev.TopicCaseSensitivity exists) all URLs that have that spelling, irrespective of case, will go to that topic.

so
http://twiki.org/cgi-bin/view/Codev/TopicCaseSensitivity
http://twiki.org/cgi-bin/view/codev/topiccaseSensitivity
http://twiki.org/cgi-bin/view/Codev/TOPICCaseSensitivity
http://twiki.org/cgi-bin/view/Codev/ToPiCcAsEsEnsitivity
http://twiki.org/cgi-bin/view/Codev/TopicCaseSensitivITY
all go to the same topic.

If the user decides that they need 2 separate topics with the same spelling, but different case, then TWiki will allow that (through some kind of force mechanism), and then the URL matching will revert back to how it works today.

-- SvenDowideit - 09 Apr 2007

OK. Simple and clear spec. Since there has been a lot of discussions previously I think it is best to take this to release meeting though I support the proposal personally.

-- KennethLavrsen - 23 Apr 2007

FreetownReleaseMeeting2007x04x23 decided to defer decision since scope will be 5.0 for resource / time reasons.

-- KennethLavrsen - 23 Apr 2007

Sven's latest example makes the feature clear and understandable.

I find this feature useful and small enough, e.g. if someone comes forward and implements it I would welcome it for 4.2. What is missing for this proposal (like many other proposals) is a clearly defined spec at the top of this topic (in the doc section.)

-- PeterThoeny - 24 Apr 2007

FreetownReleaseMeeting2007x05x07 decision was to ask Sven to clarify spec before we can make a decision

-- KennethLavrsen - 07 May 2007

I added the first cut of a functional spec. It would be nice to get this into the FreetownRelease.

-- PeterThoeny - 08 May 2007

Thanks Peter, that represents exactly what I implemented in 2000, and what the above example is :-)

-- SvenDowideit - 09 May 2007

We are looking for an implementer. This would be a nice usability enhancement for 4.2, but the train is leaving the station. Sven is occupied by other stuff. Anyone interested in helping out?

-- PeterThoeny - 21 May 2007

Spec accepted at FreetownReleaseMeeting2007x05x21. As Peter writes, we need an implementer who can do it before the feature code freeze on 28 May 2007, or it gets deferred to 5.0.

-- KennethLavrsen - 21 May 2007

One I18N comment - would be good to ensure this is tested with European accented characters as a minimum. Given suitable locale setup, which is easy on any Linux box, this is not hard to do. There should be no extra code at all, providing that InternationalisationGuidelines are followed - the Perl case-conversion features all work with locales transparently, and the TWiki I18N code (for page contents and WikiWords) already does some case conversion within regexes using \u etc.

-- RichardDonkin - 25 May 2007
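As a language-neutral illustration of the I18N concern (the comment above is about Perl locales; this sketch uses Python's Unicode support instead), a case-folding helper should treat accented characters the same way as ASCII ones. The helper name `fold` is hypothetical.

```python
import unicodedata


def fold(name):
    """Case-fold a topic name after normalising composed/decomposed accents."""
    return unicodedata.normalize("NFC", name).casefold()


def same_spelling(a, b):
    return fold(a) == fold(b)
```

This is the kind of check worth including in tests: for instance, that "ÄnderungsListe" and "änderungsliste" are considered the same spelling.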

I have a need to define wikiWords beginning with a mixed-case alphabetic. TWiki.pm (version 4.1.2) defines wikiWordRegex so this seems like the perfect place to make a change; however, a simple change like this:
# TWiki concept regexes
#TDV#    $regex{wikiWordRegex} = qr/[$regex{upperAlpha}]+[$regex{lowerAlphaNum}]+[$regex{upperAlpha}]+[$regex{mixedAlphaNum}]*/o;
#TDV# Support initial lower-case alphabetic
    $regex{wikiWordRegex} = qr/[$regex{mixedAlpha}]+[$regex{lowerAlphaNum}]+[$regex{upperAlpha}]+[$regex{mixedAlphaNum}]*/o;
produces rendering side-effects. For example, in tables twikiLast gets rendered as a non-existent page. In a FormattedSearch, twikiSRRev, twikiSRAuthor and twikiGrayText are rendered as non-existent pages. In the CommentPlugin, commentPluginPromptBox is rendered as a non-existent page. The common thread of these side-effects is that the last class of a <someHTMLtag class="..."> parameter inside an HTML tag is not ignored as it should be. My only examples have two classes; perhaps it's not the "last" class but any class after a space - it might not even be class-specific, just any parameter following a space; or, shudder, HTML tags do not disable wikiWord processing.

anyIdeas?

-- DewayneVanHoozer - 11 Jun 2007
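The side-effect can be reproduced outside TWiki with a small Python sketch. It approximates the two regexes above using plain ASCII character classes, which is an assumption: TWiki's real `upperAlpha`, `lowerAlphaNum` and related classes depend on the locale.

```python
import re

# ASCII stand-ins for TWiki's character classes (assumption).
UPPER, LOWER_NUM, MIXED, MIXED_NUM = "A-Z", "a-z0-9", "A-Za-z", "A-Za-z0-9"

# Original rule: must start with an upper-case letter.
ORIGINAL = re.compile(f"[{UPPER}]+[{LOWER_NUM}]+[{UPPER}]+[{MIXED_NUM}]*")
# Modified rule: may start with a letter of either case.
MODIFIED = re.compile(f"[{MIXED}]+[{LOWER_NUM}]+[{UPPER}]+[{MIXED_NUM}]*")


def is_wikiword(regex, token):
    return regex.fullmatch(token) is not None


for token in ("WebStatistics", "twikiLast", "commentPluginPromptBox"):
    print(token, is_wikiword(ORIGINAL, token), is_wikiword(MODIFIED, token))
```

Under the modified rule, camel-case CSS class names such as twikiLast satisfy the pattern, so any place where attribute values are not protected from WikiWord processing will turn them into links.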

That's pretty much why I wasn't willing to make this change without lots of time for testing - it's not a simple thing. The most problematic thing you can do is to allow WikiWords to start with a lowercase letter, as that will hit CSS and all sorts of things like you have found.

-- SvenDowideit - 11 Jun 2007

I've had a crack at this myself. It's not clever, but it seems to work (not tested very much either!) It's only designed to help with URL mis-spelling, not with people actually making topics with the same name and different caps. If you can ensure that won't happen, this might help you out.

You can paste the snippet in earlyInitPluginHandler.pl.txt into a plugin.

It first checks to see if the web/topic existed, and returns as quickly as possible. If it didn't, it'll go and open up the directories and search for the first possible matched web and topic. It doesn't attempt to do anything clever: if you've got multiple topics with different caps, it just picks the first one and hopes for the best.

If it didn't find anything that looked right, it lets TWiki carry on as usual and display the "Topic Not Found" page.

-- ChrisFLewis - 19 Feb 2008
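The approach described above can be sketched in Python as follows. This is not the attached Perl snippet; the function name is hypothetical, and the layout of one `.txt` file per topic inside a web directory is assumed. Unlike the description, it lists the directory up front instead of testing for existence first, so that the result does not depend on whether the filesystem itself is case-insensitive.

```python
import os


def find_topic_file(web_dir, topic, suffix=".txt"):
    """Return the on-disk file name for `topic` in `web_dir`, or None."""
    entries = os.listdir(web_dir)
    wanted = topic + suffix
    if wanted in entries:
        return wanted                      # exact match, nothing more to do
    folded = wanted.casefold()
    for entry in sorted(entries):
        if entry.casefold() == folded:
            return entry                   # first case-insensitive match
    return None                            # let the "not found" page handle it
```

As in the original, nothing clever is attempted when several files differ only by case: the first match (here in sorted order) is returned.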

Is there a supported way to do this? I looked at ShorterCaseInsensitiveURLs, but I just want this for the jumps. (IMHO, ambiguous links should give a ? rather than first match, especially as the match could change when a new topic is created.)

-- JayenAshar - 30 Oct 2011

I don't think ambiguous topic names are a good idea. In fact, this breaks with one very important and very basic assumption: that all topic names are unique, unique in the sense of a database id.

Also, my gut tells me this could slow down page rendering by trying to be too intelligent, while still not working as people would expect.

-- MichaelDaum - 30 Oct 2011

The "topic does not exist" page could be more intelligent by proposing topics with a similar spelling. It doesn't matter much if that page is a little bit slower.

-- ArthurClemens - 30 Oct 2011
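A minimal sketch of what "proposing topics with a similar spelling" could look like, using Python's standard `difflib` module as a stand-in for whatever similarity search the wiki would really use. The function name, the limit and the 0.6 cutoff are assumptions.

```python
import difflib


def suggest_similar(wanted, topics, limit=5):
    """Suggest existing topic names that resemble `wanted`, ignoring case."""
    by_folded = {}
    for topic in topics:
        # Keep the first presentation seen for each case-folded spelling.
        by_folded.setdefault(topic.casefold(), topic)
    close = difflib.get_close_matches(
        wanted.casefold(), list(by_folded), n=limit, cutoff=0.6
    )
    return [by_folded[key] for key in close]
```

Each call compares the request against every topic name in the web, so the cost grows with the size of the web.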

It's true, the cost of case-insensitive regex matching when utf-8/unicode is enabled is significant. But I reckon most of this will be less necessary once we get link editing into the editor(s) (like NatEdit).

And I agree with Arthur, we should certainly be doing a better job when suggesting similarly spelt pages; would be nice to make it pluggable to leverage SolrPlugin, too.

-- PaulHarvey - 30 Oct 2011

Actually a foswiki 404 "topic does not exist" page's performance does matter: it potentially can DoS your server using the standard way of searching for spell-alike topics. It is hit not only by users that mistyped the topic but also by spiders trying to crawl (re)moved topics.

Instead, there should be a button "search for similar" (with rel="nofollow") and only then the user might be expecting to wait a bit more as he asked for a more complex task, no matter if that search-for-similar is done with standard means or Solr.

-- MichaelDaum - 31 Oct 2011

The button could be there for non-js (like crawlers). Then load the similar results using AJAX.

-- ArthurClemens - 31 Oct 2011

Why not let users that do really want to search for more click on a button?

  • The principle of "don't make me think" -- ArthurClemens - 31 Oct 2011
  • Sorry I don't buy this. A friendly "sorry not found. Click here to search for similar stuff." is perfect in my eyes. Check this (http://queenofsubtle.com/404/) for more examples for that pattern. -- MichaelDaum - 01 Nov 2011
  • The user has entered a url or topic name that he thinks is correct. But he made a mistake in the case - that he is unaware of. Why is the page not there? Providing a link "search for similar stuff" does not match the user expectation at that point. Better to show "perhaps this is what you meant". -- ArthurClemens - 01 Nov 2011
  • The situation that the page is not there isn't what the user expected in the first place. He expected to get to exactly the topic he was asking for. But instead he gets something different: an error page. So the system doesn't meet his expectations in that sense anyway no matter why the page is not there (user error vs system error). Now the user finds himself in the next situation: having to find out what this error means and plan his next action. The key is to be most simple here to make his decision as easy as possible. There are three possible next actions: (1) going back in the browser (2) create the missing page (3) search for the right page that is potentially available under a different name now, given it was (3a) not removed but (3b) renamed or (3c) spelled somewhat differently. Any funky search-for-similar will only lead to a successful next action in cases (3b) and (3c). In any other case a search-for-similar is not applicable and only eating system resources. Even in case the system might in fact have some topics that are named similarly, this output will only distract any next action plan in cases (1), (2) and (3a) as it adds extra noise on the screen that the user does not require for his next action. This all only on usability, leaving aside the total waste of system resources to do more than required beforehand just in case some users really might need this extra information on the screen. That's why a simple set of links what the user can now do next helps more than actually filling the screen with a list of topic names. -- MichaelDaum - 01 Nov 2011
  • With "eating system resources" you can discard almost anything, including Perl. It comes down to: will we notice the difference, and if so, how much of it (system resources) can we spend for proactively helping the user, given that we don't want to distract the user that wants to create a topic. -- ArthurClemens - 01 Nov 2011

-- MichaelDaum - 31 Oct 2011

Unfortunately adding rel=nofollow doesn't help. The crawlers now treat it as meaning "don't index", not "don't follow", so they hit the pages and then ignore the results. Not sure there is any way to prevent this without adding a path to robots.txt, and that can't wildcard.

-- GeorgeClark - 31 Oct 2011

George is right; nofollow is a mess. I agree with Arthur that the suggestions should be displayed - I like the compromise of loading via AJAX.

-- PaulHarvey - 01 Nov 2011

I simply am against wasting server resources where you could easily save them. That's why I am against doing a SEARCH on any "topic not found" by default. This is not only a waste of server resources but also a waste of user patience, waiting for the page to finish rendering information that they don't actually use. The trickier that SEARCH becomes, the more this becomes a PITA. Clicking a button to load more content into the page is most probably done via AJAX, so the rel=nofollow situation does not apply anyway.

-- MichaelDaum - 01 Nov 2011
 
Topic revision: r17 - 17 Jul 2015, MichaelDaum