Monday, July 7, 2008

Re: [HACKERS] PATCH: CITEXT 2.0

"David E. Wheeler" <david@kineticode.com> writes:

> On Jul 7, 2008, at 12:21, David E. Wheeler wrote:
>
>> My question is: why? Shouldn't they all use the same function for
>> comparison? I'm happy to dupe this implementation for citext, but I don't
>> understand it. Should not all comparisons be executed consistently?
>
> Let me try to answer my own question by citing this comment:
>
> /*
> * Since we only care about equality or not-equality, we can avoid all
> the
> * expense of strcoll() here, and just do bitwise comparison.
> */
>
> So, the upshot is that the = and <> operators are not locale-aware, yes? They
> just do byte comparisons. Is that really the way it should be? I mean, could
> there not be strings that are equivalent but have different bytes?

There could be strings that strcoll returns 0 for even though they're not
identical. However that caused problems in Postgres so we decided that only
equal strings should actually compare equal. So if strcoll returns 0 then we
do a bytewise comparison to impose an arbitrary ordering.

Of course the obvious case of two equivalent strings with different bytes
would be two strings which differ only in case in a collation which doesn't
distinguish based on case. So you obviously can't take this route for citext.

I don't think you have to worry about the problem that cause Postgres to make
this change. IIRC it was someone comparing strings like paths and usernames
and getting false positives because they were in a Turkish locale which found
certain sequences of characters to be insignificant for ordering. Someone
who's using a citext data type has obviously decided that's precisely the kind
of behaviour they want.

--
Gregory Stark
EnterpriseDB

http://www.enterprisedb.com

Ask me about EnterpriseDB's RemoteDBA services!

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

No comments: