HierarchicalOrTreeSearches

Headline

It's a common feature of databases to have tables with a parent child relationship. We can create these with nested searches but they are slow, hard to write and limited to 16 levels.

therefore let's extend search to support this case:
  1. Special sort of structure
  2. Some extra formatting sugar

Motivation and Suggestion

I was reviewing a number of feature proposals/brainstorming topics and ended up back at EasierNestedSearch which in turn was arguably my attempt at some sort of UseSyntaxToChangeEvaluationOrder.

However, what struck me was that I was really trying to generate a hierarchical output and nested search was the only (bad) way to do it.

So, I fired up crystal reports to see how I could do that. It was esay and would equate roughly to:

%QUERY{"form='TeamForm'" 
    groupby="hierarchy(ParentTeam)"
    format="$level(   )* $topic"
}%

The process is as follows:
  1. Read all the team forms (any standard query will do)
  2. Sort the teams into hierarchy order (using ParentTeam field from parameter)
  3. Format: I've used $level(xxx) to simply mean repeat xxx as many times as required at the relevant level (in this case three spaces)

Compared to the monstrosity in EasierNestedSearch the above is so simple. Indeed, I think I would need a day to work out how the nested search version hangs together before I dare any maintenance.

I almost created a feature proposal, but this needs quite a bit of thought from anybody mucking about with store and search. Indeed, I suspect that groupby is the right place for the hierarchy() option, and as groupby is itself in it's early days, this probably needs to wait before an actual proposal would make sense.

Furthermore, there will be some extra work on the formatting sugar required, $level() will not be enough by a long way. Some conditional formatting may well be required, possible $level and $max an integer height in the hierachy is enough in conjunction with SSP. groupby will probably generate other formatting requirements as well.

A nested search may still be required. In the above scenario, when $level == $max then I would want another search to show all team members; however that would then be standard fare (and only one level of nesting)

The Sort Code

Note that constructing the hierarchy (a special sort) should be left to Foswiki. Most SQL stores do not support recursive joins to do this natively (Oracle is I believe an exception). However, it's pretty trivial to do this.

Wrong code example removed — see earlier versions if you must

Discussion

Maybe a much easier way to approach this is to remember that this all is boiling down towards traversing a graph with nodes being topics and edges any relation between two topics, like a formfield pointing to other topics. This of course is a recursive process that you don't know how it unfolds when it starts.

That's why our current SEARCH most probably does not fit the problem as soon as you treat wiki topics as graph. It needs a different way to approach the data and their relations.

All previous attempts to use SEARCH for it tried to nest two searches into each other to some degree. This of course isn't able to traverse a recursive data structure in full using the right ordering as defined by the hierarchy.

Basically, it goes like this: at each node in the graph the algorithm is making a specific choice where to go next. It collects all child nodes for instance by fetching a formfield holding a list of related topics. These are used to seed an initial agenda of nodes still to be visited. Then the loop starts to fetch the first node off the agenda, process it, collect related child nodes not yet visited, push them onto the agenda to be processed in the next round of the loop. Processing each node within the loop comprises of formating the found node using the normal foswiki formating means and collecting all results in a list of html fragments, which are finally combined to format one final result.

There are two related plugins that can deal with recursive data structure: (1) ClassificationPlugin: it has got a HIERARCHY macro to traverse a taxonomy and (2) DBCachePlugin: it has got a DBRECURSE that somewhat resembles the thing outlined above.

ClassificationPlugin traverses a special type of topics - categories - making up the hierarchy. The relation between them is defined using a Category formfield holding one or more parent categories.

DBCachePlugin's DBRECURSE is most probably coming closest to be able to deal with arbitrary relations between topics. It might be worth looking at it for your needs. It does not use an agenda based graph traversal though, but something similar.

Some other properties of a graph might be of interest such as %DISTANCE{from="topic1" to="topic2}%, the length of shortest path in graph between two nodes.

These kind of information do need additional persistent data structures to cache the topology between topics, foremost a distance matrix. There are algorithms to maintain a distance matrix of sparse graphs over time. Yet still this needs some maintenance step to keep in sync with changes over time happening in the wiki. The one used in ClassificationPlugin is based on the

@article{477,
  author = {M. Wallace and S. Kollias},
  title = {Two Algorithms For Fast Incremental Transitive Closure Of Sparse Fuzzy Binary Relations},
  year = {2007},
  publisher = {International Journal of Computational Methods, accepted for publication },
  url = {http://www.image.ece.ntua.gr/publications.php}
}

Once you have a distance matrix, traversing the underlying graph becomes a lot more efficient while collecting child topics in each step of the hierarchical search.

-- MichaelDaum - 14 Dec 2011

oh boy, yes - where was that sort and groupby feature request.

-- SvenDowideit - 14 Dec 2011

This is something we've been agonizing over too - traversing, for example, multiple (competing) taxonomies of biological entities. To that end we were using DBCachePlugin, and then later I cooked up TopicRecursePlugin as a SEARCH-based replacement capable of recursing up (or down) the tree. Performance still wasn't adequate, so for most things it's been sufficient to cache ancestor links into each topic; this gives a very fast way to query for child/descendents with mongo (even without indexes).

I did want to do some kind of nicer graph query thing (we use SPARQL in other work), but that would mean a lot of software engineering effort and decision making. Especially SPARQL. Probably cypher is a more Foswiki-esque language to implement.

We certainly should not be attempting yet-another-failed-attempt-at-a-general-purpose-graph-query-language - so either we keep it VERY simple (and limited) to the tree-type structures we see in wiki apps (which I think DBCachePlugin and TopicRecursePlugin try to do), and perhaps work on making those better at what they do, or pick an existing technology and try to integrate that with Foswiki.

-- PaulHarvey - 14 Dec 2011

My whole point is that their is a common simple case which can easily be supported. That case is:
  • A database table where a child has only one parent (i.e not a number of competing parents or other complex networks)

Therefore you can simplify to:
  1. A single, and ordinary, everyday, query which returns all the topics of interest (no traversing/recursion required)
  2. This will of course return a flat list
  3. Run special sort algorithm as above (iteration, not recursion)

Paul in in TopicRecursePlugin you state
  • It would be highly desirable to enhance QUERY and FORMAT, however neither are able to work with hierarchical (only flat) datasets.

The above technique uses a flat dataset (the simple base query)
  • Yes the sort algorithm is working hierarchically
  • In a sense this flattened list is a cache (in perl memory), but I suspect even thousands would sort quickly
  • The query will not support where one of my (great)*grand-parents ancestors is X
    • This can be rectified by building a full ancestry during the sort algorithm, but that's after the original query, so ...
    • That probably needs elevating into the original query

Hmm ...

My original case was an OrgChart, and I was displaying the whole chart. Imagine a company with 10,000 employees. The initial query would be:

  • SELECT employee, boss from staff; i.e. select all employees

Followed by:

  • Run thru sort algorithm so you get:
    • Company Boss
    • Boss Dept A
    • Boss Div A-1
    • Emp# 123
    • Emp# 213
    • Boss Div A-2
    • Emp # 456
    • Boss Dept B
    • Boss Div B1
    • Emp# 789
    • Emp# 980
    • ...

If and only IF the query includes a clause like: where 'Boss Dept A' is an ancestor (i.e. just view the OrgChart of Dept A), then during the sort you would need to build and ancestor field for each employee and use that to refine the search further. (Actually, the sorted list almost contains that info already (start at Boss Dept A), but how do you know you moved past the end of Dept A?)

(I must admit that I had not considered the need to only show Dept-A or Dept-B etc in my original musings. Our Company is quite small and I would rather show the full OrgChart quickly than some department slowly.)

Nonetheless part of my original thinking remains, and this boils down to the following question:
  • If I read 10,000 rows from the database then sort and refine in memory, will this be faster than recursing through the database starting at Dept-A?

Of course the answer is: it depends

I would still be interested to experiment with this, there is probably a cross-over point, which will vary between the back-end stores

Paul's point about caching the ancestry in the store is certainly a good one where necessary; I've used that technique myself in a WikiApplication (without a supporting Plugin and that was painful)

An AncestorFormFieldPlugin which lets you select a parent but actually stores the full hierarchy as the value might be an option.

-- JulianLevens - 15 Dec 2011

About your question "If I read 10,000 rows from the database then sort and refine in memory,..." there's a short answer: don't do that. Databases can do better, ie. they are specifically tailored in optimizing the combined sort+limit steps while fetching a bunch of records. Second, your user won't probably be able to digest 10,000 rows on the screen, nor would your browser be able to efficiently render a tree representing that amount of information. That's why you will need a sensible way to let the user drill down the org chart step by step, filter and search to limit the amount of information to be presented. To do this you will need rest handlers that return a specific area of the tree showing the ancestors of a node. This will relieve you from the initial need that you felt having to get all of the hierarchy in one step, and then see later what to do with that amount of information. Instead, just fetch one level at a time while the user climbs down the tree. And bam, no more recursion....almost, cus the next question your users will ask is "can we have an unfold-all button"...

-- MichaelDaum - 15 Dec 2011

SQL stores actually do not handle this use-case very well - no recursive join.

I realise that the rcs store has no caching and performance is therefore average at best and with recursion shocking.

An Sql store will eliminate much of this even if the recursion has to be written In perl.

Anyway I'm much clearer about the situation, and will leave well alone for now.

Btw my rexx code above is a hierarchical sort, but nort the sort of sort I sort it was

-- JulianLevens - 15 Dec 2011

Databases do quite different. Here's a comparison of different databases with a hands-on tutorial how to do it.

Baseline: all databases except mysql do best using adjacency lists, while mysql performs better on a nested sets. Drawback is nested sets are more complicated to implement. The other considered databases have features that lend towards hierarchies nowadays.

-- MichaelDaum - 15 Dec 2011

Julian, I didn't want to discourage you smile

Michael has very good points about avoiding perl-side sort/skip/limit/group - it certainly is a huge drain on performance which can't be understated. However, let's ignore such implementation details and focus on what an actual wiki-app'er such as yourself would write - and then try to see if that usage is amenable to DB optimization.

Two points about VarQUERY:
  • Presently, only works on single topics-at-a-time. Comparable to DBCachePlugin's $expand().
  • Isn't supposed to have a format param - its philosophy is more as a selector only, kinda like jQuery if you're familiar with that. Its output is supposed to be fed into dedicated decorator macros like VarFORMAT or FilterPlugin's FORMATLIST.

Now, let's pretend SEARCH has got Tasks.Item11022 and/or SupportMultiKeySorting - i.e. a more expressive groupby and order parameters

Then it seems to me that as you say, we "just" need a special DataForms field type to indicate that a given formfield is a topic name which participates a link in hierarchy.

This ties in with AllowTypedData, which is aimed at DB back-ends like DBIStoreContrib & MongoDBPlugin to make better use of indexes

In my own experiments, I've been using a field type called fwaddress which normalises its field contents according to Foswiki::Address so that I can index such fields reliably in MongoDB

What if we had a fwaddress+parent field which registered a new META:TYPE, resulting in something like this ending up in the topic either physically (pollutes the topic.txt as my current experiments do) or virtually (automatically assembled & stored transparently by the back-end store implementation, without contaminating the topic.txt):
 %META:FIELD{name="Manager" value="Main.Alice"}%
 %META:HIERARCHY{name="1" key="Manager" value="Main.Alice" level="3"}%
 %META:HIERARCHY{name="2" key="Manager" value="Main.Bob" level="2"}%
 %META:HIERARCHY{name="3" key="Manager" value="Main.Charlie" level="1"}%

From a dataform that looks something like:
| *Name* | *Type* | *Size* | *Values* | *Tooltip* | *Attributes* |
| Manager | fwaddress+parent | | | | |

Obviously, a naive implementation would be tedious to maintain - updating one of the "higher up" links would require propagation to potentially thousands of topics - but it does simplify querying enormously.

Our own hierarchies are only ~7-10 deep, so this doesn't add a terribly huge data storage overhead for us, although it does mean I need a cron job to run a rest handler to update many topics.

BUT, if we promote field types to be something more than just a schema, and actually enhance them with enough hooks/API to implement MODELs, then perhaps this can be optimized on a per-backend basis.

Anyway, queries might look like (sorry for conflating teams & managers, I hope the example can still be followed):
%SEARCH{
    "form.name='TeamForm'" 
    groupby="Manager"
    format="$percntQUERY{'$topic'/hierarchy[key='Manager'][0].level x '   '}$percnt* $topic"
}%

This of course assumes VarQUERY understands the 'x' (string repeat) operator, perhaps we can come up with a more succinct $token(), but that's the general idea

In an ideal world, where META:TYPEs can have rich models that may be optimized on a per-DB-implementation basis:
  • In NoSQL (non-relational) fashion, MongoDB would implement this by secretly/transparently maintaining the META:HIERARCHY datums.
  • A DBIStoreContrib implementation OTOH, depending on its SQL flavour I guess, would implement this by maintaining adjacency tables as Michael said.

Thus avoiding pollution of topic.txt with any actual META:HIERARCHY datums (which also pollutes topic history when you're using a cron job to propagate changes to many thousands)

-- PaulHarvey - 15 Dec 2011

Incidentally, here's the system topic for our TopicRecursePlugin

-- PaulHarvey - 15 Dec 2011

Paul: I'm not discouraged exactly. I thought there was some low hanging fruit. In any case I'm looking at my own DBIStoreContrib (mostly reading and thinking at the moment), which will also include versioning support and a regex cache. This will necessarily be quite different to the existing contrib.

I will not build in any hierarchic support (or groupby/having) at least initially. It's quite a large chunk of work as it is, and this is an extra feature that can come later.

Thanks for all the info, the TopicRecursePlugin will be food for thought. Your Trin wiki looks good BTW — I've a lot of work to do on ours.

Michael: your info about the different flavours of SQL and their hierarchic support were very interesting and I'm now convinced to let SQL do it's job; at least when it's ready for prime time of Foswiki.

It also goes to show that I need to be more careful when searching for info — even one or two years old can mean your quite out of date.

-- JulianLevens - 16 Dec 2011

 
Topic revision: r13 - 16 Dec 2011, JulianLevens
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy