Monday, July 7, 2008

Re: [HACKERS] gsoc, text search selectivity and dllist enhancments

Tom Lane wrote:
> target would typically be around 10 or so. It really wasn't designed to
> be industrial-strength code. (It was still a big improvement over what
> we'd had before :-(.) So I'm not very comfortable with taking it as the
> design base for something that does need to be industrial-strength.

I did some googling and found a paper describing a method of determining
most frequent elements from a stream called Lossy Counting:
http://www.vldb.org/conf/2002/S10P03.pdf
The following is an adaptation of an algorithm described in section 4.2
of that paper. I'll summarize it and propose an implementation using
existing PG data structures.

The data structure (called D) is a set of entries in the form (e, f, d),
where e is the lexeme, f is an integer representing the estimated
frequency and d is the maximum possible error in f. We process tsvectors
in batches. After each w tsvectors we will be pruning our data
structure, deleting entries that are not frequent enough. Let b be the
number of the current batch (tsvectors 1 to w are processed with b = 1,
tsvectors w+1 to w*2 are processed with b = 2 and so on). Observe that w
is an arbitrary number.
Initially, D is empty. For each lexeme e we look up its entry in D. If
it's found, we increment f in that entry. If not, we add a new entry in
the form of (e, 1, b - 1).
After each w tsvectors we delete all entries from D, that have f + d <= b.

For example, consider such a list of tsvectors (numbers stand for
lexemes, each line is a tsvector):
1 2 3
2 3 4 5
1 2
1 3 4
2 3 4 5
2 4 5 6

Let w = 3.
After processing the first tsvector D would look like so:
(1, 1, 0), (2, 1, 0), (3, 1, 0)
After processing the second tsvector D would be:
(1, 1, 0), (2, 2, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)
After the third tsvector we'd have:
(1, 2, 0), (2, 3, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)

And now we'd do the pruning. b = 1, so all entries with f + d <= 1 are
discarded. D is then:
(1, 2, 0), (2, 3, 0), (3, 2, 0)

After processing the fourth tsvector we get:
(1, 3, 0), (2, 3, 0), (3, 3, 0), (4, 1, 1)
And then:
(1, 3, 0), (2, 4, 0), (3, 4, 0), (4, 2, 1), (5, 1, 1)
And finally:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1), (6, 1, 1)

Now comes the time for the second pruning. b = 2, to D is left with:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1)

Finally we return entries with the largest values of f, their number
depending on statistics_target. That ends the algorithm.

The original procedure from the paper prunes the entries after each
"element batch" is processed, but the batches are of equal size. With
tsvectors it seems sensible to divide lexemes into batches of different
sizes, so the pruning always takes place after all the lexemes from a
tsvector have been processed. I'm not totally sure if that change does
not destroy some properties of that algorithm, but it looks reasonably sane.

As to the implementation: like Tom suggested, we could use a hashtable
as D and an array of pointers to the hashtable entries, that would get
updated each time a new hashtable entry would be added. Pruning would
then require a qsort() of that table taking (f + d) as the sort key, but
other than that we could just add new entries at the end of the array.
An open question is: what size should that array be? I think that by
knowing how long is the longest tsvector we could come up with a good
estimate, but this needs to be thought over. Anyway, we can always
repalloc().

Getting the max tsvector length by looking at the datum widths would be
a neat trick. I just wonder: if we have to detoast all these tsvectors
anyway, is it not possible to do that and do two passes over already
detoasted datums? I've got to admit I don't really understand how
toasting works, so I'm probably completely wrong on that.

Does this algorithm sound sensible? I might have not understood some
things from that publication, they give some proofs of why their
algorithm yields good results and takes up little space, but I'm not
all that certain it can be so easily adapted.

If you think the Lossy Counting method has potential, I could test it
somehow. Using my current work I could extract a stream of lexemes as
ANALYZE sees it and run it through a python implementation of the
algorithm to see if the result makes sense.

This is getting more complicated, than I thought it would :^)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

No comments: