Tuesday, July 8, 2008

Re: [HACKERS] gsoc, text search selectivity and dllist enhancements

Jan Urbański wrote:
> If you think the Lossy Counting method has potential, I could test it
> somehow. Using my current work I could extract a stream of lexemes as
> ANALYZE sees it and run it through a python implementation of the
> algorithm to see if the result makes sense.

I hacked together a simplistic Python implementation and ran it on a
table with 244901 tsvectors (45624891 lexemes in total), comparing the
results of my current approach with those I'd get from the Lossy
Counting algorithm.
I experimented with statistics_target set to 10 and 100, and ran pruning
in the LC algorithm every 3, 10 or 100 tsvectors.
The sample size with statistics_target set to 100 was 30000 rows and
that's what the input to the script was - lexemes from these 30000
tsvectors.
I found that with pruning happening every 10 tsvectors I got precisely
the same results as with the original algorithm (same most common
lexemes, same frequencies). When I tried pruning after every 100
tsvectors the results changed very slightly: they were a tiny bit
further from the original algorithm's, and I think a tiny bit more
precise, but I didn't give it much attention.
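[Editor's note: the python script from the experiment above isn't attached here. For readers who want to try it themselves, below is a minimal, self-contained sketch of the Lossy Counting algorithm (after Manku & Motwani). Note one difference from the experiment described above: this version prunes at fixed bucket boundaries of ceil(1/epsilon) items, the textbook schedule, rather than after every fixed number of tsvectors.]

```python
# Minimal sketch of Lossy Counting (Manku & Motwani).
# Prunes at bucket boundaries of width ceil(1/epsilon); guarantees that
# any item occurring more than epsilon * N times in the stream survives,
# and that its count is underestimated by at most epsilon * N.
from math import ceil

def lossy_count(stream, epsilon):
    """Return (counts, n) where counts maps item -> (frequency, max_error)
    and n is the number of items consumed from the stream."""
    w = ceil(1.0 / epsilon)   # bucket width
    counts = {}               # item -> (observed frequency, max error delta)
    n = 0                     # items seen so far
    bucket = 1                # current bucket id
    for item in stream:
        n += 1
        if item in counts:
            f, d = counts[item]
            counts[item] = (f + 1, d)
        else:
            # A new entry may have been pruned before; its true count can
            # exceed the stored one by at most bucket - 1.
            counts[item] = (1, bucket - 1)
        if n % w == 0:
            # Prune entries whose frequency plus error bound is too small
            # to matter: they cannot exceed epsilon * N overall.
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket}
            bucket += 1
    return counts, n
```

To report the "most common" items with support threshold s, you would then emit every entry with frequency at least (s - epsilon) * n; frequent lexemes are guaranteed to appear, at the cost of possibly including some borderline ones.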

Bottom line seems to be: the Lossy Counting algorithm gives roughly the
same results as the algorithm used currently and is also possibly faster
(and more scalable wrt. statistics_target).

This should probably get more testing than just running some script 5
times over a fixed set of data, but extracting ~300 MB of tsvectors
from one of my production sites and hauling it onto my laptop was
already troublesome.
Do you think it's worthwhile to implement the LC algorithm in C and send
it out, so others could try it out? Heck, maybe it's worthwhile to
replace the current compute_minimal_stats() algorithm with LC and see
how that compares?

Anyway, I can share the python script if someone would like to do some
more tests (I suppose no-one would, 'cause you first need to apply my
ts_typanalyze patch and then change it some more to extract lexemes from
the sample).

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers