Sunday, July 6, 2008

Re: [HACKERS] gsoc, text search selectivity and dllist enhancments

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>> Tom Lane wrote:
>> Hmm, I had just assumed without looking too closely that it was stats
>> target times a fudge factor. What is the rationale for doing it as
>> above? I don't think I like the idea of the limit varying over the
>> course of the scan --- that means that lexemes in different places
>> in the input will have significantly different probabilities of
>> surviving to the final result.

> will never have a chance to climb up the list. I'm not sure where the
> "times two" figure comes from, maybe it's just a fudge factor, but the
> bottom line is that the minimum size needed depends on the size of the
> longest tsvector.

Thank you for the review.

The whole algorithm was based on the original analyze routine from
analyze.c (compute_minimal_stats()). The "times two" factor is there; I
think its purpose is to reduce the chances of an element dropping off
the list before its occurrence count gets boosted. It doesn't look
like a factor of 1.5 or 3 would be much better or worse, I just kept it
as I saw it.
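
For illustration, a minimal sketch of that sizing heuristic (the names
here are made up for the example, not taken from analyze.c or from my
patch):

#include <stdlib.h>

typedef struct TrackedLexeme
{
    const char *lexeme;       /* the value being tracked */
    int         occurrences;  /* how many times it has been seen so far */
} TrackedLexeme;

/*
 * Allocate room for twice as many entries as we ultimately want to keep,
 * so a freshly added entry has some slack to accumulate occurrences
 * before it risks being pushed off the list.
 */
static TrackedLexeme *
alloc_track_list(int stats_target, int *track_max)
{
    *track_max = 2 * stats_target;    /* the "times two" fudge factor */
    return calloc(*track_max, sizeof(TrackedLexeme));
}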

It's true that making the limit change in the course of the scan makes
the set of surviving lexemes dependent on the order in which entries are
fetched. I'm inclined to believe that most of the time it won't change
the overall result - some entries will drop off earlier, some later, and
some will be discarded in the final pass at the end of the algorithm.
Maybe I should do a test: run ANALYZE several times over a large set of
documents and see if the result stays the same?

Since the number of tracked lexemes needs to be correlated with the
length of the longest tsvector, the only possibility I see is doing
two passes through the statistics sample (determine the length first,
then do the real work). I wasn't sure whether rows fetched by the
sampling algorithm are guaranteed to be in memory. If so, then another
pass is just a linear cost (and asymptotically insignificant).
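
A rough sketch of how the first pass could pick the limit, assuming the
sampled rows are already in memory as an array of tsvector lengths (all
names invented for the example):

/*
 * Pass 1: just measure, so the extra cost is linear in the sample size.
 * Pass 2 (not shown) would then do the counting with this one fixed
 * limit, so lexemes seen early and late in the scan compete under the
 * same rules.
 */
static int
choose_track_size(const int *tsvector_lengths, int numrows, int stats_target)
{
    int longest = 0;

    for (int i = 0; i < numrows; i++)
        if (tsvector_lengths[i] > longest)
            longest = tsvector_lengths[i];

    /* the exact formula is open; take whichever bound is larger */
    return (longest > 2 * stats_target) ? longest : 2 * stats_target;
}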

About the array vs. dllist issue: if the "two passes" idea is better
than a dynamically expanding list, then an array is of course better. I
thought about using a one-way list, but having it two-way helps with
inserting an element at the end of the 1-occurrence group (and I think
that's important, read on) and with the final pass, in which you go
through the list backwards and stop when you get enough entries to fill
the pg_statistic tuple.
Maybe these could be dealt with efficiently using a one-way list, but it
was easier to do with a doubly-linked one.
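
To make the backward final pass concrete, here is a toy node layout and
the tail-to-head walk I have in mind (these types are illustrative only,
not the actual Dllist API):

#include <stddef.h>

typedef struct LexemeNode
{
    struct LexemeNode *prev;
    struct LexemeNode *next;
    const char        *lexeme;
    int                occurrences;
} LexemeNode;

/*
 * The list is kept sorted by occurrence count, so the most frequent
 * lexemes sit at the tail.  Walk backwards from the tail and stop as
 * soon as we have enough values to fill the statistics tuple.
 */
static int
collect_most_common(const LexemeNode *tail, const LexemeNode **out, int wanted)
{
    int n = 0;

    for (const LexemeNode *node = tail; node != NULL && n < wanted;
         node = node->prev)
        out[n++] = node;

    return n;
}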

Keeping the list sorted at all times has an advantage: if you insert new
entries at the *end* of the 1-occurrence block and have the ones at the
beginning fall off, you are discarding elements that haven't been seen
for the longest time.
Consider a list:
cat:1 dog:1 duck:2 rhino:3
If the next lexeme is "goat" and you put it at the beginning of the
list (or put it in anywhere and qsort() it), you could get
goat:1 cat:1 dog:1 duck:2 rhino:3
And if this had triggered an overflow, you could lose the goat, when
you'd want to lose the cat.
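
Here's a sketch of the insertion/eviction rule from that example, again
with a made-up node layout rather than the real dllist code:

#include <stddef.h>

typedef struct Entry
{
    struct Entry *prev;
    struct Entry *next;
    const char   *lexeme;
    int           occurrences;
} Entry;

typedef struct
{
    Entry *head;   /* oldest 1-occurrence entry */
    Entry *tail;   /* most frequent entry */
    int    count;
    int    max;
} TrackList;

/*
 * Insert a brand-new lexeme at the *end* of the 1-occurrence block,
 * i.e. just before the first entry with more than one occurrence (or
 * at the tail if there is none), so the 1-occurrence block stays
 * ordered oldest-first.
 */
static void
insert_new_lexeme(TrackList *list, Entry *e)
{
    Entry *pos = list->head;

    e->occurrences = 1;
    while (pos != NULL && pos->occurrences == 1)
        pos = pos->next;

    /* link e in just before pos (or at the tail when pos is NULL) */
    e->next = pos;
    e->prev = (pos != NULL) ? pos->prev : list->tail;
    if (e->prev != NULL)
        e->prev->next = e;
    else
        list->head = e;
    if (pos != NULL)
        pos->prev = e;
    else
        list->tail = e;
    list->count++;

    /*
     * On overflow, drop the head: in the example above that is the
     * 1-occurrence entry that has gone unseen the longest ("cat"),
     * rather than the just-added "goat".
     */
    if (list->count > list->max)
    {
        Entry *victim = list->head;

        list->head = victim->next;
        if (list->head != NULL)
            list->head->prev = NULL;
        list->count--;
        /* the caller would also drop "victim" from the lookup hashtable */
    }
}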

Once again: the whole algorithm is a ripoff from analyze.c, with a
dllist instead of an array because you don't know in advance how big a
tracking list you will need, and with a hashtable, because the tracking
list will probably be large and looking things up linearly wouldn't be fun.
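
And to show how the hashtable fits in: each hash entry is keyed by the
lexeme and just points at its node in the tracking list, so lookups are
O(1). The toy chained hash below is only a stand-in for what the backend
actually provides, not a proposal to use it:

#include <stddef.h>
#include <string.h>

#define NBUCKETS 1024

typedef struct HashEntry
{
    struct HashEntry *chain;      /* next entry in the same bucket */
    const char       *lexeme;     /* key */
    void             *list_node;  /* the corresponding tracking-list node */
} HashEntry;

static unsigned int
toy_hash(const char *s)
{
    unsigned int h = 5381;

    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
}

/*
 * Return the tracked node for a lexeme, or NULL if it isn't tracked yet
 * (in which case the caller inserts a new 1-occurrence node).
 */
static void *
lookup_lexeme(HashEntry *buckets[NBUCKETS], const char *lexeme)
{
    for (HashEntry *e = buckets[toy_hash(lexeme)]; e != NULL; e = e->chain)
        if (strcmp(e->lexeme, lexeme) == 0)
            return e->list_node;
    return NULL;
}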

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
