Sunday, August 3, 2008

Re: [GENERAL] bytea encode performance issues

On Aug 3, 2008, at 2:36 PM, Sim Zacks wrote:

> The LIKE operator is likely the problem, but it is a critical part
> of an
> email application. Searches are done by, "Show me all emails
> containing
> the following word."
>
> I've tried using TSearch2's full text index. It made the query 50%
> faster, taking 5 minutes. This is still not even close to the less
> then
> a minute in MySQL.
>
>>
>> That LIKE operator is probably your problem. An unbounded LIKE
>> like that
>> (with a wildcard on both sides) means no index can be used, hence you
>> get a sequential scan.
>>
>> There are apparently some possibilities with the new GIN indexes (or
>> maybe even using GIST), but I haven't had an opportunity to try those
>> yet. There were some messages about just that on this list recently.
>>
>
> I don't think a functional index (or anything other then a FTI) would
> accomplish anything, being that I am doing unbounded Likes.

That's why I suggested to use a text field instead of bytea.
IIRC, You can have an index on word triplets and use tsearch. I don't
have intimate knowledge on how that works though, hopefully other
people will chime in here.

Without the need to convert each row before comparing it, and with an
appropriate index, that should significantly speed up your queries.

>> If you create an index, make sure you create a _functional_ index
>> over
>> ENCODE(messageblk, 'escape').
>>
>
> Email is binary when it contains attachments. I actually planned on
> using an ASCII encoding, but the dbmail people specifically said
> not to.
> I don't know if they were speaking from experience, or because ASCII
> sounds bad.

It shouldn't be; those attachments are MIME or UU encoded, are they not?
Don't confuse ASCII and SQLASCII. The latter accepts characters from
any encoding, which is probably what you want.

>> Since when is e-mail binary data? I don't quite see why you'd use a
>> bytea field instead of text. If your problem is character
>> encoding, then
>> just don't store that ("encode" the DB using SQLASCII).
>>
>
> As I mentioned, this is the system that came with dbmail. It runs on
> both PostGresql and MySQL, so they may have done some compatibility
> things. There are 4 statuses possible, 0,1,2,3 if you use the database
> through the software then a 99 could never appear there.

The software isn't the only client that might connect to the database.
It is usually bad practice to put data constraint logic in the client
instead of in the database. Especially since in client code there are
usually multiple sections of code that have to deal with those
constraints, which tends to result in small differences in their
handling.

Next to that, if those statuses would have a proper foreign key
constraint, it would be very easy to add labels to each status in a
way they would make a bit more sense than 0, 1, 2, 3.
I expect the label would be a sufficient foreign key by itself
though, no need for those silly numbers.

Well, there's probably not much you can do about that, being just a
user of dbMail.

>> I notice some other oddities in that query/design. Why is is_header a
>> smallint instead of a boolean? I'm assuming this is for compatibility
>> with an other database?
>>
>> Why use status IN (0, 1) instead of more descriptive keys? Is it even
>> constrained this way, or could arbitrary numbers end up as status
>> (say
>> 99) and if so, what happens to those messages?

You have a very odd way of quoting. You don't top post as such, but
you top post in context... Haven't seen that one before. Usually
people reply _below_ a section, not above it ;) It made my reply a
bit harder to write.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4895b34b243488085013917!

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

No comments: