Tuesday, June 3, 2008

[GENERAL] bloom filter indexes?

I've been working on partitioning a rather large dataset into multiple
tables. One limitation I've run into the lack of cross-partition-table
unique indexes. In my case I need to guarantee the uniqueness of a
two-column pair across all partitions -- and this value is not used to
partition the tables. The table is partitioned based on a insert date
timestamp.

To check the uniqueness of this value I've added an insert/update
trigger to search for matches in the other partitions. This trigger is
adding significant overhead to inserts and updates.

This sort of 'membership test' where I need only need to know if the
key exists in the table is a perfect match for bloom filter. (see:
http://en.wikipedia.org/wiki/Bloom_filter).

The Bloom filter can give false positives so using it alone won't
provide the uniqueness check I need, but it should greatly speed up
this process.

Searching around for "postgresql bloom filter" I found this message
from 2005 along the same lines:
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01475.php

This thread indicates bloom filters are used in the intarray contrib
module and the tsearch2 (and I assume the built-in 8.3 full-text
search features).

I also found this assignment for CS course at the University of
Toronto, when entails using bloom filters to speed up large joins:
http://queens.db.toronto.edu/~koudas/courses/cscd43/hw2.pdf

So, my question: are there any general-purpose bloom filter
implementations for postgresql? I'm particularly interested
implementations that would be useful for partitioned tables. Is anyone
working on something like this?

thanks,
- Mason

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

No comments: