Thursday, July 10, 2008

Re: [HACKERS] gsoc, text search selectivity and dllist enhancments

Oleg Bartunov wrote:
> On Wed, 9 Jul 2008, Jan Urbaski wrote:
>
>> Jan Urbaski wrote:
>> Do you think it's worthwhile to implement the LC algorithm in C and
>> send it out, so others could try it out? Heck, maybe it's worthwhile
>> to replace the current compute_minimal_stats() algorithm with LC and
>> see how that compares?
>
> I and Teodor are using LC for phrase estimation in one application and
> from our understanding of the original paper this algorithm might be
> not good for sampling, since all theory behind was about streaming of
> FULL data. As for technique we use suffix tree, which should be fine for
> typical sample size.

Hm, that's a good point. I'm only reasurred by the fact, that it yields
roughly the same results as the original algorithm, which means that if
LC is bad for sampling, then the current implementation is just as bad.
Come to think of it, the current code is in a way a variant of Lossy
Counting, it's just doing the pruning after each and every new element,
isn't it?

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

No comments: