Item10352: Query optimizer regexes beyond simple plain text fail
Priority: Urgent
Current State: Closed
Released In: 1.1.3
Target Release: patch
Applies To: Engine
Component:
Branches:
Given a web whose topics have DataForms with a TopicType field, and at least one topic
that has the form field
TopicType="BlogEntry, CategorizedTopic"
Let's search these.
You type:
%SEARCH{"TopicType=~'BlogEntry'" type="query" format=" 1 [[$web.$topic][$topic]"}%
You get: zero results
ERROR
Now let's try something different:
You type:
%SEARCH{"TopicType=~'.*BlogEntry.*'" type="query" format=" 1 [[$web.$topic][$topic]"}%
You get the expected search hits.
Now, that's wrong according to the specs. Tracing the evaluation order, it turns out that in the first try OP_match is never executed.
Digging deeper, there is a Query::Node optimizer that tries to find out which parts of the query are "constant", and it considers 'BlogEntry' a static string not worth checking with OP_match anymore. Serious bug.
I'm not sure this analysis is correct, but something is seriously wrong in that area. I lost the enthusiasm to dig any deeper.
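To make the suspected failure mode concrete, here is a minimal sketch of constant folding on a toy query tree. It is not the Foswiki Query::Node code; the class names, `simplify` and `evaluate` are hypothetical, and unanchored regex matching is an assumption. The point it illustrates: a match node may only be folded when both operands are constant. A literal pattern on the right-hand side alone does not make the whole node constant, because the field on the left still varies per topic.

```python
import re
from dataclasses import dataclass


@dataclass
class Const:
    value: object  # a literal such as 'BlogEntry'


@dataclass
class Field:
    name: str  # a form field looked up per topic


@dataclass
class Match:
    lhs: object  # value to test
    rhs: object  # regex pattern


def simplify(node):
    """Fold a Match into a Const only when BOTH operands are constant."""
    if isinstance(node, Match):
        lhs, rhs = simplify(node.lhs), simplify(node.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(re.search(rhs.value, lhs.value) is not None)
        # A constant pattern alone is not enough: keep the operator.
        return Match(lhs, rhs)
    return node


def evaluate(node, fields):
    """Evaluate a (possibly simplified) tree against one topic's fields."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Field):
        return fields.get(node.name, "")
    pattern = evaluate(node.rhs, fields)
    return re.search(pattern, evaluate(node.lhs, fields)) is not None


query = simplify(Match(Field("TopicType"), Const("BlogEntry")))
print(evaluate(query, {"TopicType": "BlogEntry, CategorizedTopic"}))
```

If an optimizer folded the node on the strength of the constant right-hand side alone, the match operator would never run, which is the symptom described above.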
--
MichaelDaum - 14 Feb 2011
That's a pretty big bug, and it clearly needs a failing unit test.
--
SvenDowideit - 15 Feb 2011
Shouldn't the query optimizer be moved from Query::Node to the search algorithm? An SQL or XQuery backend already does a pretty nice job of optimizing the query before executing it. Trying to do something similar in Perl is specific to the current Perl/grep search algorithm and not needed on real search engines.
Another observation: $query->simplify is called multiple times on the same query. That might be inevitable looking at the current code paths, so an isOptimized flag would speed this up. I'm not sure how expensive this is judging from the current code, but $query->simplify at least sounds frightening.
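A rough sketch of the isOptimized idea, using a hypothetical `QueryNode` class rather than the real Query::Node API: the first call does the work and sets a flag, and later calls on the same tree return immediately. The `simplify_runs` counter exists only to make the effect visible in the example.

```python
class QueryNode:
    """Toy query node; names and structure are illustrative assumptions."""

    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)
        self.is_optimized = False
        self.simplify_runs = 0  # instrumentation for this example only

    def simplify(self):
        # Skip the whole subtree if it was already simplified.
        if self.is_optimized:
            return self
        self.simplify_runs += 1
        for child in self.children:
            child.simplify()
        # A real optimizer would rewrite the node here.
        self.is_optimized = True
        return self


root = QueryNode("and", [QueryNode("match"), QueryNode("match")])
root.simplify()
root.simplify()  # second call is a no-op
print(root.simplify_runs)
```

One caveat with this design: the flag has to be cleared whenever the tree is modified after simplification, otherwise a stale result would be reused.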
--
MichaelDaum - 15 Feb 2011
I recently tried using the new =~ feature with a simple regex containing parentheses. That did not work either. In my experience, the new regex query search feature does not work at all in practical use. I was in the middle of a peak workload and forgot to raise a bug report.
--
KennethLavrsen - 22 Feb 2011
FWIW, =~ works fine for me in practical use, but I use trunk, not 1.1.2. I think Sven did an OP_match fix, and I can't remember whether it made it into 1.1.2 or not.
Anyway, on trunk you can query for literal parens if you escape them with \(. Added Item10399.
--
PaulHarvey - 23 Feb 2011
There is something very broken, even on trunk.
This doesn't work either:
%SEARCH{"TopicType=~'.*\bBlogEntry\b.*'" type="query" format=" 1 [[$web.$topic][$topic]"}%
So basic regexing using the OP_match operator is not there anymore, even though the code in OP_match.pm is just fine. That's why the best guess is that something is going on before the operators are called.
--
MichaelDaum - 23 Feb 2011
The problem I had with (...) was not searching for a literal ( or ). I used them the regex way. Specifically, I tried to OR two words in a regex like field =~ "Something(this|that)".
I could not get any regex to work with the =~ operator unless it was dead simple.
Example: %SEARCH{"TopicTitle =~ '.*Topic (1|2).*'" type="query"}% will not find the two topics that have the values 'Topic 1' and 'Topic 2'.
I have tried to analyse this. What is it that it does not understand?
It does not understand the (). TopicTitle =~ '.*Topic (1).*' fails to find anything
It does not understand the |. TopicTitle =~ '.*Topic 1|2.*' fails to find anything
It does not understand a character class. TopicTitle =~ '.*Topic [12].*'
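For reference, this is what ordinary regex semantics would give for the patterns above. It is a quick check in Python's `re` module, not in Foswiki. It assumes an unanchored match, and it adds a third value, 'Topic 3', as a hypothetical negative case. Grouping, alternation and character classes behave the same way in Perl for these simple patterns.

```python
import re

# 'Topic 1' and 'Topic 2' come from the report; 'Topic 3' is an added control.
values = ["Topic 1", "Topic 2", "Topic 3"]

patterns = [
    r".*Topic (1|2).*",  # group with alternation
    r".*Topic (1).*",    # plain group
    r".*Topic 1|2.*",    # top-level alternation: '.*Topic 1' OR '2.*'
    r".*Topic [12].*",   # character class
]


def hits(pattern):
    """Return the values an unanchored search for pattern finds."""
    return [v for v in values if re.search(pattern, v)]


for pattern in patterns:
    print(pattern, hits(pattern))
```

Under these assumptions every pattern finds at least 'Topic 1', and all but the plain group also find 'Topic 2', so zero hits from SEARCH is not explained by the regexes themselves.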
In my view the query regex feature is totally broken. I cannot make any use of it. And several of my users have tried for hours to get it to work. A lot of users are wasting a lot of time with this.
I will not even build a 1.1.3 release candidate until we have found what breaks this new important 1.1 feature.
--
KennethLavrsen - 01 Mar 2011
It's worth noting that fields[value=~'<regex>'] works.
--
PaulHarvey - 06 Mar 2011
As Sven pointed out - this needs a unit test. It's a lot easier for others to fix when a unit test exists. Otherwise we need to create both a failing test and figure out the fix.
--
GeorgeClark - 06 Mar 2011
done
--
PaulHarvey - 06 Mar 2011
I don't see the same diagnosis as Michael. OP_match is called, and returns true. Something else is going on.
--
PaulHarvey - 06 Mar 2011
Right, it gets called for QUERY... something bad happens with SEARCH.
--
PaulHarvey - 06 Mar 2011
My adventure led me to the brute-force search algorithm before I had to stop. I'm hoping Sven or Crawford might have time to offer suggestions. The interesting thing is that fields[name='Blah' value=~'<regex>'] works as both QUERY and SEARCH, whereas Blah=~'<regex>' works in a QUERY but not in a SEARCH.
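To spell out why the two spellings should agree, here is a toy model. It assumes that a bare field name is shorthand for looking up the form field of that name and matching its value; the dictionary layout, the field values and both helper functions are hypothetical and not Foswiki's data model.

```python
import re

# Hypothetical topic metadata with two form fields.
topic = {
    "fields": [
        {"name": "Blah", "value": "BlogEntry, CategorizedTopic"},
        {"name": "Other", "value": "unrelated"},
    ]
}


def match_long_form(topic, name, pattern):
    """Models fields[name='...' value=~'...']: filter the field list."""
    return any(
        f["name"] == name and re.search(pattern, f["value"]) is not None
        for f in topic["fields"]
    )


def match_short_form(topic, name, pattern):
    """Models Name=~'...': resolve the name to a value, then match."""
    value = next(
        (f["value"] for f in topic["fields"] if f["name"] == name), None
    )
    return value is not None and re.search(pattern, value) is not None


print(match_long_form(topic, "Blah", r"BlogEntry"))
print(match_short_form(topic, "Blah", r"BlogEntry"))
```

If the shorthand really is equivalent, a difference between QUERY and SEARCH for only the short form points at how SEARCH prepares the query, not at the match operator.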
--
PaulHarvey - 07 Mar 2011
Sven is planning on working on this tomorrow (9/3/2011 my time).
--
SvenDowideit - 08 Mar 2011
Fixed, I think. But given the lack of docs and unit tests, there is essentially no spec (see QuerySearch and RegexExpression), so it's pretty hard to be sure.
--
SvenDowideit - 09 Mar 2011