Sunday, September 14, 2008

Re: [PATCHES] hash index improving v3

I did some testing of my own on the hash index patch, and my results
mostly confirm Alex's. I used a table like this:

create table tab (f1 serial primary key, f2 some-datatype);
create index ind on tab using hash (f2);

and populated it with 2 million rows of data; then timed queries
like this:

select * from tab a join tab b using(f2)
  where f2 = (select f2 from tab c
              where c.f1 = (select int4(random() * 2e6)));

using pgbench like this:

pgbench -n -c 1 -T 60 -M prepared -f query.sql hashdb

To test "wide" indexed columns I used a text column with entries of 100
random characters (identical to Alex's test). I saw a performance gain
of about 50% with an empty kernel cache (26.9 vs 41.9 tps), dropping to
about 14% once the table and index were fully swapped in (4185 vs 4764
tps). This was in a C-locale database. Presumably the win would have
been significantly greater in a non-C-locale test, but I didn't try it.
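
For anyone wanting to reproduce the setup, something along these lines
will generate that sort of data (a sketch only; Alex's actual script
may have differed):

-- f2 declared as text; each md5() call yields 32 hex characters, so
-- four concatenated calls give enough material to trim to 100 per row.
insert into tab (f2)
  select substr(md5(random()::text) || md5(random()::text) ||
                md5(random()::text) || md5(random()::text), 1, 100)
  from generate_series(1, 2000000);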

To test "narrow" indexed columns I made f2 a bigint containing 2000000
consecutive integers. Here I saw about a 5% improvement with either
empty cache (48.1 vs 50.5 tps) or full cache (4590 vs 4800 tps).
This is not too surprising since about the same amount of I/O is needed
either way, and bigint comparison is very cheap. (This is a 64-bit
machine, and it's defaulting to pass-by-value for bigints, so value
comparisons are hardly more expensive than hashcode comparisons.)
But the patch still wins a bit by being able to do binary search within
index pages.
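
For the record, the narrow case is trivial to set up, along these
lines:

-- f2 declared as bigint; values 1 through 2000000.
insert into tab (f2)
  select g::bigint from generate_series(1, 2000000) g;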

In both of the above cases there were only a negligible number of hash
collisions. I made up another test case, still 2 million bigints, but
with values chosen so that almost every entry had a hash collision with
another entry (a small fraction had two or even three collisions). This
showed about a 25% slowdown compared to CVS HEAD with empty cache
(49.9 tps down to 37.2), decreasing to 3% with full cache (4609 vs 4482
tps).
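
If anyone wants to replicate that, colliding values can be hunted for
by grouping candidate bigints on their hash code, along these lines (a
sketch, not the exact method I used):

-- hashint8() is the built-in hash function for int8; with a 32-bit
-- hash code, scanning a million candidates should turn up on the
-- order of a hundred colliding groups (birthday problem).
select hashint8(v) as hashcode, count(*) as n, min(v), max(v)
  from generate_series(1::bigint, 1000000::bigint) v
  group by hashint8(v)
  having count(*) > 1;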

I experimented with some variant queries that did more hash index
fetches per query, and saw penalties as high as 50%. However, this is
surely by far the worst case for the patch: no savings in index size,
and a ridiculously high collision rate.

Lastly, for comparison's sake I tried the "wide column" case with a
btree instead of hash index. This gave me 31.5 tps with empty cache,
4749 tps with full cache. Note that the btree is losing significantly
to the patched hash index in empty-cache conditions --- this presumably
reflects the much larger index size causing more I/O.
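
The size difference is easy enough to quantify (the btree index name
here is made up):

-- Compare on-disk sizes of the hash and btree indexes.
create index ind_btree on tab using btree (f2);
select relname, pg_size_pretty(pg_relation_size(oid)) as size
  from pg_class
  where relname in ('ind', 'ind_btree');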

I'm thinking that we should go ahead and apply the patch. AFAIR this is
the first example we've ever seen of hash beating btree for fetch
performance, and it's pretty obvious that removing the indexed value
from the index contents is the reason for the win. So I've lost
interest in the alternative that involved storing both hashcode and
indexed value. We now see a possible production use-case for hash,
namely indexing wide column values.

BTW, one thing I noticed was that the hash index build time for the
"wide column" case got a lot worse after applying the patch (from 56 to
237 sec). The reason for this turned out to be that with the smaller
predicted index size, the code decided not to use the pre-sorting method
that was recently added. Reducing effective_cache_size to less than the
index size brought the time back down, to about 54 sec. So it would
seem that effective_cache_size is too large a cutoff value. I'm
considering changing hashbuild to switch over at shared_buffers instead
of effective_cache_size --- any thoughts about that?
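
In the meantime the workaround is simple enough, if ugly: shrink the
setting just for the build, for example (the particular value is
arbitrary; anything below the predicted index size will do):

-- Force the pre-sorting build path by putting effective_cache_size
-- below the predicted index size, then restore the normal setting.
set effective_cache_size = '32MB';
create index ind on tab using hash (f2);
reset effective_cache_size;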

regards, tom lane

