Sunday, August 3, 2008

Re: [GENERAL] bytea encode performance issues

On Aug 3, 2008, at 12:12 PM, Sim Zacks wrote:

> This is the query that is used (I know it is not as efficient as it
> could be, but this is the query it comes with):
>
> SELECT m.message_idnr,k.messageblk FROM dbmail_messageblks k JOIN
> dbmail_physmessage p ON k.physmessage_id = p.id JOIN dbmail_messages m
> ON p.id = m.physmessage_id WHERE mailbox_idnr = 8 AND status IN (0,1 )
> AND k.is_header = '0' GROUP BY m.message_idnr,k.messageblk HAVING
> ENCODE(k.messageblk::bytea,'escape') LIKE '%John%'

That LIKE operator is probably your problem. An unbounded LIKE like
that (with a wildcard on both sides) means no index can be used,
hence you get a sequential scan.

There are apparently some possibilities with the new GIN indexes (or
maybe even using GIST), but I haven't had an opportunity to try those
yet. There were some messages about just that on this list recently.

If you create an index, make sure you create a _functional_ index
over ENCODE(messageblk, 'escape').

> The messageblk field is a bytea in postgresql and a longblob in
> mysql.
> The only difference in the query is the MySQL does not need the encode
> function.

Since when is e-mail binary data? I don't quite see why you'd use a
bytea field instead of text. If your problem is character encoding,
then just don't store that ("encode" the DB using SQLASCII).

> Can someone make a suggestion for tuning the database?

An explain analyze would have been a bit more useful, but the biggest
issue is probably the seqscan.

> The explain of the query is:
> "HashAggregate (cost=43648.11..43648.85 rows=74 width=753)"
> " -> Nested Loop (cost=42999.83..43647.74 rows=74 width=753)"
> " -> Merge Join (cost=42999.83..43043.35 rows=74 width=769)"
> " Merge Cond: (k.physmessage_id = m.physmessage_id)"
> " -> Sort (cost=39264.12..39267.59 rows=1388
> width=753)"
> " Sort Key: k.physmessage_id"
> " -> Seq Scan on dbmail_messageblks k
> (cost=0.00..39191.68 rows=1388 width=753)"
> " Filter: ((is_header = 0::smallint) AND
> (encode(messageblk, 'escape'::text) ~~ '%John%'::text))"

Here is your problem, a sequential scan over a presumably large
table. It's either caused by the LIKE expression or by the lack of a
functional index on messageblk, or both.

If you change the type of the messageblk field to text you won't need
a functional index anymore (although that only saves time on index
creation and inserts/updates).

> " -> Sort (cost=3735.71..3754.59 rows=7552 width=16)"
> " Sort Key: m.physmessage_id"
> " -> Bitmap Heap Scan on dbmail_messages m
> (cost=385.98..3249.26 rows=7552 width=16)"
> " Recheck Cond: ((mailbox_idnr = 8) AND
> (status
> = ANY ('{0,1}'::integer[])))"
> " -> Bitmap Index Scan on dbmail_messages_8
> (cost=0.00..384.10 rows=7552 width=0)"
> " Index Cond: ((mailbox_idnr = 8) AND
> (status = ANY ('{0,1}'::integer[])))"
> " -> Index Scan using dbmail_physmessage_pkey on
> dbmail_physmessage p (cost=0.00..8.15 rows=1 width=8)"
> " Index Cond: (k.physmessage_id = p.id)"


I notice some other oddities in that query/design. Why is is_header a
smallint instead of a boolean? I'm assuming this is for compatibility
with an other database?

Why use status IN (0, 1) instead of more descriptive keys? Is it even
constrained this way, or could arbitrary numbers end up as status
(say 99) and if so, what happens to those messages?

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,48958f34243483105918576!

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

No comments: