Thursday, August 14, 2008

Re: [HACKERS] gsoc, oprrest function for text search take 2

Heikki Linnakangas wrote:
> Jan Urbański wrote:
>> So right now the idea is to:
>> (1) pre-sort STATISTIC_KIND_MCELEM values
>> (2) build an array of pointers to detoasted values in tssel()
>> (3) use binary search when looking for MCELEMs during tsquery analysis
>
> Sounds like a plan. In (2), it's even better to detoast the values
> lazily. For a typical one-word tsquery, the binary search will only look
> at a small portion of the elements.

Hm, how can I do that? TOAST is still a bit of black magic to me... Do you
mean I should stick to having Datums in TextFreq? And use DatumGetTextP
in bsearch() (assuming I'll get rid of qsort())? I wanted to avoid that,
so I won't detoast the same value multiple times, but it's true: a
binary search won't touch most elements.
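
To make the lazy idea concrete, here is a rough standalone sketch, not backend code. LazyText, lazy_get() and lazy_bsearch() are names made up for illustration, and the "detoast" step is faked with a plain string copy; in tssel() that step would be the real detoasting call (e.g. via DatumGetTextP). It assumes the array is already sorted in strcmp() order. Each element is expanded at most once, and only if the search actually probes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a possibly-toasted value. "stored" plays the
 * role of the toasted form; "plain" caches the expanded copy and stays
 * NULL until the element is first probed.
 */
typedef struct LazyText
{
    const char *stored;
    char       *plain;
} LazyText;

/* Counts how many elements were actually expanded (illustration only). */
int detoast_calls = 0;

/* Expand an element on first access, then reuse the cached copy. */
const char *
lazy_get(LazyText *t)
{
    if (t->plain == NULL)
    {
        size_t len = strlen(t->stored);

        t->plain = malloc(len + 1);
        if (t->plain == NULL)
            abort();
        memcpy(t->plain, t->stored, len + 1);   /* fake "detoast" */
        detoast_calls++;
    }
    return t->plain;
}

/*
 * Binary search over an array sorted in strcmp() order.
 * Returns the index of the match, or -1 if the key is absent.
 * Only the probed elements are ever expanded.
 */
int
lazy_bsearch(LazyText *elems, int n, const char *key)
{
    int lo = 0;
    int hi = n - 1;

    assert(elems != NULL && key != NULL);

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, lazy_get(&elems[mid]));

        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

With n elements a lookup probes roughly log2(n) + 1 of them, so most of the array never gets expanded, and the cache keeps a second lookup from expanding the same value again.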

> Another thing is, how significant is the time spent in tssel() anyway,
> compared to actually running the query? You ran pgbench on EXPLAIN,
> which is good to see where in tssel() the time is spent, but if the time
> spent in tssel() is say 1% of the total execution time, there's no point
> optimizing it further.

Changed the pgbench script to
    select * from manual where tsvector @@ to_tsquery('foo');
and the parameters to
    pgbench -n -f tssel-bench.sql -t 1000 postgres

and got

number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 12.238282 (including connections establishing)
tps = 12.238606 (excluding connections establishing)

samples  %        symbol name
174731   31.6200  pglz_decompress
 88105   15.9438  tsvectorout
 17280    3.1271  pg_mblen
 13623    2.4653  AllocSetAlloc
 13059    2.3632  hash_search_with_hash_value
 10845    1.9626  pg_utf_mblen
 10335    1.8703  internal_text_pattern_compare
  9196    1.6641  index_getnext
  9102    1.6471  bttext_pattern_cmp
  8075    1.4613  pg_detoast_datum_packed
  7437    1.3458  LWLockAcquire
  7066    1.2787  hash_any
  6811    1.2325  AllocSetFree
  6623    1.1985  pg_qsort
  6439    1.1652  LWLockRelease
  5793    1.0483  DirectFunctionCall2
  5322    0.9631  _bt_compare
  4664    0.8440  tsCompareString
  4636    0.8389  .plt
  4539    0.8214  compare_two_textfreqs

But I think I'll go with pre-sorting anyway, it feels cleaner and neater.
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


