Saturday, August 2, 2008

Re: [HACKERS] [WIP] patch - Collation at database level

Hello,

The main reason I submitted the patch was to start a discussion and hear other people's opinions on this problem.

On Tue, Jul 29, 2008 at 10:41 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

Where are the collations going to come from?  

There will be two new catalogs, pg_collate and pg_charset. Each will be populated with the ANSI standard collations and charsets (ISO8BIT, LATIN1, UTF-8, ...) and, in addition, with the default collation chosen when the cluster is created. For instance, if you create a database cluster with initdb and specify en_US.utf8, template0 will contain the standard rows (ISO8BIT, LATIN1, UTF-8, ...) plus one row for en_US.utf8. You can then connect to template0, create other collations if your POSIX locales support them, and use one per database.

Have the various build and distributions issues been thought about?

Yes. Since POSIX locales don't guarantee any particular collation, there will be hard-coded collations implemented according to the ANSI collation standard. Others can be defined with the CREATE COLLATION command.

 How are they going to be configured (not the SQL syntax, but how will the configuration be applied)?

pg_type, pg_attribute and pg_namespace in each database will be extended with a collation OID column that specifies the collation.

 How are the collations going to be applied at run-time?
 
The collation will be set when connecting to the database, with setlocale(LC_COLLATE, XXX) and setlocale(LC_CTYPE, XXX).
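
To illustrate the mechanism, here is a minimal sketch of locale-driven comparison. It is written in Python rather than backend C, and it is not code from the patch; it only shows how the standard LC_COLLATE and LC_CTYPE categories change string ordering once they are set. The function name sort_with_locale is made up for this example, and the "C" locale is used because it is the only one guaranteed to exist on every system.

```python
import functools
import locale


def sort_with_locale(words, locale_name):
    """Sort words using the collation of locale_name, then restore the old locale.

    Hypothetical helper for illustration only; not part of the patch.
    """
    # Remember the current settings so the change stays local to this call.
    old_collate = locale.setlocale(locale.LC_COLLATE)
    old_ctype = locale.setlocale(locale.LC_CTYPE)
    try:
        # Raises locale.Error if the operating system does not provide the locale.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        locale.setlocale(locale.LC_CTYPE, locale_name)
        # locale.strcoll compares two strings according to LC_COLLATE.
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, old_collate)
        locale.setlocale(locale.LC_CTYPE, old_ctype)


words = ["banana", "Apple", "apple", "Banana"]
c_sorted = sort_with_locale(words, "C")
```

With the "C" locale the comparison is by character code, so upper-case words sort before lower-case ones. Passing a name such as en_US.utf8 would give dictionary-like ordering, but only if that POSIX locale is installed.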
 
 How are you going to handle locale and encoding conflicts?

Since I'm currently implementing collation support per database, I don't think this is an issue. (I know it will be in the future.)
 
 I also think that the clauses you have attached to your CREATE COLLATION statement (case-insensitive,
accent-insensitive) are an oversimplification of reality.  I suggest you look
up the Unicode collation algorithm to learn about how collations work in
practice.

I already did, at the very beginning of development. The reason I'm not implementing the whole Unicode collation algorithm is that this patch should serve as a sort of framework. You'll be able to plug in different collation functions, not only POSIX locales, so further development towards the full Unicode collation algorithm is possible.
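
As a rough picture of what "framework" means here, the sketch below keeps a table that maps collation names to comparison functions, so new collations can be added without touching the sorting code. It is a Python illustration only. All names in it (COLLATIONS, register_collation, sort_with_collation and the two sample collations) are hypothetical and do not come from the patch.

```python
import functools

# Hypothetical registry: collation name -> three-way comparison function.
COLLATIONS = {}


def register_collation(name, compare):
    """Make a comparison function available under a collation name."""
    COLLATIONS[name] = compare


def _three_way(a, b):
    # Returns -1, 0 or 1, like strcmp/strcoll.
    return (a > b) - (a < b)


def compare_binary(a, b):
    # Compare the UTF-8 encoded bytes.
    return _three_way(a.encode("utf-8"), b.encode("utf-8"))


def compare_case_insensitive(a, b):
    # Simplified case-insensitive comparison; a real collation needs far more.
    return _three_way(a.casefold(), b.casefold())


register_collation("binary", compare_binary)
register_collation("case_insensitive", compare_case_insensitive)


def sort_with_collation(words, name):
    """Sort words with whichever comparison function is registered under name."""
    return sorted(words, key=functools.cmp_to_key(COLLATIONS[name]))
```

A comparison function backed by POSIX strcoll, or later by the full Unicode collation algorithm, could be registered in the same way.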

At the end of next week I'll publish my bachelor's thesis on this topic, where everything will be explained in detail, so stay tuned.
 
Regards

Radek Strnad
