QueryUniqueValues

Query for unique values

It'd be brilliant to query for unique values. The new stores, DBIStoreContrib (via the SQL DISTINCT clause) and MongoDBPlugin (via distinct() or a map/reduce collection), could provide this functionality quite cheaply.
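To make the "cheap in SQL" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The topics table, its name and author columns, and the sample rows are assumptions made up for illustration; they are not the schema DBIStoreContrib actually uses.

```python
import sqlite3

def distinct_authors(conn, name_prefix):
    """Return the distinct authors of topics whose name starts with name_prefix."""
    # Note: SQLite's LIKE is case-insensitive for ASCII by default.
    rows = conn.execute(
        "SELECT DISTINCT author FROM topics WHERE name LIKE ? ORDER BY author",
        (name_prefix + "%",),
    )
    return [author for (author,) in rows]

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (name TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO topics VALUES (?, ?)",
    [
        ("FooOne", "AuthorA"),
        ("FooTwo", "AuthorB"),
        ("FooThree", "AuthorA"),
        ("BarOne", "AuthorC"),
    ],
)

print(distinct_authors(conn, "Foo"))  # ['AuthorA', 'AuthorB']
```

The database does the de-duplication itself, so no per-topic work is needed in the caller.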

Introduce a 'distinct' function. The example below builds on the idea of a 'topics' collection (see QueryCustomCollections), so that VarQUERY isn't tied to one topic at a time.

Get a distinct list of authors for all topics whose names begin with Foo:

Sven's syntax suggestion:
%QUERY{"distinct(topics[name~'Foo*'].author)"}%
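As a rough illustration of what this expression would be expected to return, here is a minimal Python sketch over an in-memory list. The topics data, the distinct helper and the use of fnmatch to approximate the ~ wildcard match are all assumptions for illustration, not Foswiki code.

```python
from fnmatch import fnmatchcase

def distinct(values):
    """Return values with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(values))

# Hypothetical 'topics' collection.
topics = [
    {"name": "FooOne", "author": "AuthorA"},
    {"name": "FooTwo", "author": "AuthorB"},
    {"name": "FooThree", "author": "AuthorA"},
    {"name": "BarOne", "author": "AuthorC"},
]

# Roughly: distinct(topics[name~'Foo*'].author)
authors = distinct(t["author"] for t in topics if fnmatchcase(t["name"], "Foo*"))
print(authors)  # ['AuthorA', 'AuthorB']
```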

This doesn't allow for reporting of distinct combinations of more than one key; OTOH, VarQUERY can only emit a flat array of values for one field at a time... For example we can't do this:
%QUERY{"topics[name~'Foo*'].distinct(author, Project)"}%
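For comparison, a sketch of what distinct combinations of several keys could look like if the result were allowed to be a list of tuples rather than a flat array. The function name, the field names and the sample data are hypothetical.

```python
from fnmatch import fnmatchcase

def distinct_combinations(topics, pattern, fields):
    """Return the distinct tuples of the given fields, in first-seen order,
    for topics whose name matches the wildcard pattern."""
    seen = {}
    for topic in topics:
        if fnmatchcase(topic["name"], pattern):
            # A missing field contributes None to the combination.
            combo = tuple(topic.get(field) for field in fields)
            seen.setdefault(combo, None)
    return list(seen)

# Hypothetical data.
topics = [
    {"name": "FooOne", "author": "AuthorA", "Project": "ProjectX"},
    {"name": "FooTwo", "author": "AuthorA", "Project": "ProjectY"},
    {"name": "FooThree", "author": "AuthorA", "Project": "ProjectX"},
    {"name": "BarOne", "author": "AuthorB", "Project": "ProjectX"},
]

print(distinct_combinations(topics, "Foo*", ["author", "Project"]))
# [('AuthorA', 'ProjectX'), ('AuthorA', 'ProjectY')]
```

Each result is a tuple, which is exactly the non-flat shape that VarQUERY cannot currently emit.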

When would you use this in a SEARCH? Maybe with IN. This might search for extensions that have never had a bug filed against them (flaw: it assumes Component holds only one value):
%SEARCH{
  "NOT (name IN (distinct(topics[form.name='ItemTemplate'].Component)))"
  type="query"
}%

.. probably a bad example.
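The intent of that search can be sketched in Python as "keep the names that do not appear among the distinct Component values". The function name and the data layout are hypothetical, and the sketch shares the flaw noted above: it treats Component as a single value.

```python
def extensions_without_items(extension_names, topics):
    """Return the extension names that no ItemTemplate topic lists as its Component."""
    # Distinct Component values across all topics using the ItemTemplate form.
    components = {
        t["Component"]
        for t in topics
        if t.get("form") == "ItemTemplate" and t.get("Component")
    }
    return [name for name in extension_names if name not in components]

# Hypothetical data.
extensions = ["AlphaPlugin", "BetaPlugin", "GammaContrib"]
topics = [
    {"name": "Item1", "form": "ItemTemplate", "Component": "AlphaPlugin"},
    {"name": "Item2", "form": "ItemTemplate", "Component": "AlphaPlugin"},
    {"name": "Item3", "form": "ItemTemplate", "Component": "GammaContrib"},
    {"name": "Note1", "form": "OtherForm", "Component": "BetaPlugin"},
]

print(extensions_without_items(extensions, topics))  # ['BetaPlugin']
```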

I want to apply the distinct() operator on some meta key across a set of topics... but I'm not sure how to construct that in the querysearch language.

-- PaulHarvey - 26 Oct 2010

Discussion

I prefer the functional version of this:

%QUERY{"distinct(topics[name~'Foo*'].author)"}%

-- SvenDowideit - 26 Oct 2010

Cool. I updated the brainstorm (?)

-- PaulHarvey - 29 Oct 2010

The functional version is more in line with the rest of the query language (d2n etc). My main question (and the essential starting point for any more work) is, how would you implement this in OP_distinct?

-- CrawfordCurrie - 29 Oct 2010

you mean the brute force OP_distinct::evaluate()? It would require a state - a list of values previously found. And that would be depressingly non-parallelisable.

mind you, I'm poking out of order execution, so it could be worse.

-- SvenDowideit - 29 Oct 2010

Yes, I mean exactly that. However it doesn't require a state; it requires an array, which is the result of the inner evaluate. It should then uniq that set. I think. Forcing ourselves to always implement an OP_ version helps stop us from getting carried away.

-- CrawfordCurrie - 29 Oct 2010
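A minimal Python sketch of that idea: the operator evaluates its inner node to get the full list of values, then removes duplicates from it. The class names and the evaluate signature are hypothetical stand-ins, not the actual Foswiki query node API.

```python
class Field:
    """Hypothetical inner node: yields one field's value for each topic."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, topics):
        return [topic[self.name] for topic in topics]


class OpDistinct:
    """Hypothetical 'distinct' operator wrapping a single inner node."""

    def __init__(self, inner):
        self.inner = inner

    def evaluate(self, topics):
        # The only input is the result of the inner evaluate.
        values = self.inner.evaluate(topics)
        # A dict (hash) keyed on the value drops duplicates and keeps
        # first-seen order; it is local to this call, not persistent state.
        seen = {}
        for value in values:
            seen.setdefault(value, True)
        return list(seen)


# Hypothetical data.
topics = [
    {"name": "FooOne", "author": "AuthorA"},
    {"name": "FooTwo", "author": "AuthorB"},
    {"name": "FooThree", "author": "AuthorA"},
]

print(OpDistinct(Field("author")).evaluate(topics))  # ['AuthorA', 'AuthorB']
```

Because the de-duplication happens after the inner result is complete, the inner evaluation itself could still run in any order; only the final pass is sequential.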

dammit. would you please stop reaching for an array when a hash is more appropriate? Same issue in the Logger API - I'm going to have to amend that to use hashes too :-(

:-P

-- SvenDowideit - 30 Oct 2010

Sorry: not array, set.

-- CrawfordCurrie - 30 Oct 2010

The main use-case is building faceted search UIs; allowing users to "drill down" just by clicking on (automatically discovered) refinement options.

-- PaulHarvey - 10 Mar 2011
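A small Python sketch of the drill-down idea: compute the distinct values of a field together with their counts, offer them as refinement options, then narrow the result set when one is chosen. The function names and sample data are hypothetical.

```python
from collections import Counter

def facet_counts(topics, field):
    """Return (value, count) pairs for a field, most frequent first."""
    return Counter(t[field] for t in topics if field in t).most_common()

def refine(topics, field, value):
    """Narrow the result set to topics where field equals the chosen value."""
    return [t for t in topics if t.get(field) == value]

# Hypothetical search result.
topics = [
    {"name": "FooOne", "author": "AuthorA", "Project": "ProjectX"},
    {"name": "FooTwo", "author": "AuthorB", "Project": "ProjectX"},
    {"name": "FooThree", "author": "AuthorA", "Project": "ProjectY"},
]

print(facet_counts(topics, "author"))
# [('AuthorA', 2), ('AuthorB', 1)]

# The user clicks on "AuthorA"; the next facet is computed on the narrower set.
narrowed = refine(topics, "author", "AuthorA")
print(facet_counts(narrowed, "Project"))
# [('ProjectX', 1), ('ProjectY', 1)]
```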

It might be worth looking at how SolrPlugin and DBCachePlugin each implement this feature. For Solr it is a natural thing to compute facets along with a result, as that is mainly what this search engine was invented for. In DBCachePlugin there's the DBSTATS macro, which analyzes certain properties of a given query, such as the frequencies of individual values and their min, max, and mean values. Solr, of course, allows for a lot more than that, using function facets that are computed dynamically. The values returned are then available for further refinement in a subsequent query.

-- MichaelDaum - 01 May 2011
Topic revision: r12 - 01 May 2011, MichaelDaum