Wednesday, June 25, 2008

Re: [GENERAL] 0xc3 error Text Search Windows French

Sorry, one last detail.

All of my databases use UTF-8 encoding. My Windows XP locale is en_AU and
defaults to ISO-8859-1 character sets. My postgresql.conf leaves the
client_encoding setting at its default, so it should fall back to the
database encoding, UTF-8.
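
One way to rule out a wrongly encoded file is to check whether the stop
file and dictionary files quoted below really are valid UTF-8. The
following is only a rough sketch in Python, not part of PostgreSQL: the
names find_invalid_utf8 and check_file are made up for illustration, and
it says nothing about what the server itself does when it reads the
file. For background, 0xc3 is the lead byte of the two-byte UTF-8
sequences for most accented Latin letters (for example, "é" is 0xc3
0xa9), so a 0xc3 on its own is an incomplete character.

```python
from pathlib import Path


def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


def check_file(path):
    """Hypothetical helper: run the same check on a file on disk,
    e.g. a copy of french.stop."""
    return find_invalid_utf8(Path(path).read_bytes())


# "\u00e9t\u00e9" is the French word for "summer", with two accented letters.
word = "\u00e9t\u00e9"

# Correctly encoded UTF-8 passes the check.
print(find_invalid_utf8(word.encode("utf-8")))

# The same word in ISO-8859-1 is not valid UTF-8.
print(find_invalid_utf8(word.encode("iso-8859-1")))

# A 0xc3 lead byte cut off from its continuation byte is not valid either.
print(find_invalid_utf8(word.encode("utf-8")[:1]))
```

If check_file reports nothing for the files in the tsearch_data
directory, the files themselves are probably fine and the problem is
more likely in how they are read.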

Andrew wrote:
> One additional aspect. I just ran the create text search dictionary
> command without the stopfile declaration using the OO dictionaries,
> and it worked fine: select ts_lexize('public.fr_ispell',
> 'catalogue'); executes with no problems. However, after creating
> an associated text search configuration based on a copy of the
> pg_catalog.french configuration, calls to ts_debug against my custom
> French config result in the 0xc3 error. So it looks like the
> problem is restricted to the parsing of the stop file.
> I also ran through the other stemmers supplied out of the box, which I
> have not touched in any way, and the error occurs with the Portuguese
> configuration as well.
>
> Cheers
>
> Andy
>
> Andrew wrote:
>> I have a feeling that an issue I'm running into is related to this:
>> http://archives.postgresql.org/pgsql-bugs/2008-06/msg00113.php
>>
>> On Windows XP, running PgAdmin III 1.8.4 against either a PostgreSQL
>> 8.3.0 or 8.3.3 database, when I attempt to run:
>>
>> select * from ts_debug('french', 'catalogue');
>>
>> I get the following error:
>>
>> ERROR: invalid byte sequence for encoding "UTF8": 0xc3
>> HINT: This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> "client_encoding".
>> CONTEXT: SQL function "ts_debug" statement 1
>>
>> I have replaced the french.stop file with the one from the Snowball
>> web site
>> (http://snowball.tartarus.org/algorithms/french/stemmer.html) to see
>> if that would make any difference, but the same issue occurs. I have
>> also attempted to load the French Hunspell dictionary from the Open
>> Office web site (http://wiki.services.openoffice.org/wiki/Dictionaries),
>> using the following command:
>>
>> CREATE TEXT SEARCH DICTIONARY public.fr_ispell (
>>     TEMPLATE = pg_catalog.ispell,
>>     DictFile = fr_FR,
>>     AffFile = fr_FR,
>>     StopWords = french
>> );
>>
>> But I get the same error. I have successfully loaded the English
>> and Arabic dictionaries and an Arabic stop file I sourced from
>> elsewhere, and they work fine with the various text search function
>> calls, so it appears to be specifically related to a French character
>> occurring in the stop file and the dictionaries. To use the French
>> OO dictionaries, I had to convert them from ISO-8859-15 encoding to
>> UTF-8. As converting them on Windows gave the same result as the
>> packaged stop file, I downloaded them again and converted the
>> encoding on a Linux machine before copying them across to Windows to
>> see if that would help, but it didn't.
>>
>> However, if I run the ts_debug('french', 'catalogue'); against a
>> Linux version of PostgreSQL 8.3.1, it works fine. I have not tried
>> version 8.3.1 on Windows. While there are a lot more combinations to
>> exhaust before I can make a categorical statement, at this stage it
>> appears to be pointing towards an issue with the UTF-8 parser of
>> PostgreSQL on Windows.
>>
>> Is this an outstanding defect, or is there something that I'm doing
>> wrong in my environment? I have attempted to find anything related
>> on the Internet, but other than the introductory reference, I have
>> not found anything, which surprises me given what I would imagine
>> to be the size of the French user base. Hence, I'm thinking that
>> perhaps it may be something in my environment causing the issue. If
>> others could also reproduce the error on their XP machines, that
>> would indicate that the issue is not specific to me.
>>
>> At this stage, it is not that important to me, as I'm just playing
>> around with text search for my own curiosity and French was just a
>> language I have randomly picked, along with Arabic (for which I'm
>> lacking a snowball stemmer). I don't actually read, much less speak
>> those languages. However, it would still be nice to have them working.
>>
>> An additional related topic. For some languages, OO has thesaurus
>> files which are not in the format supported by Pg Full Text
>> Search. Are there any plans to support the OO thesaurus file
>> formats? OO also has hyphenation files. Are there any plans to
>> extend the current dictionary files to include hyphenation rules as
>> captured in the OO hyphenation files? I'm not sure how, if at all,
>> hyphenation rules would improve indexing and searches, but I
>> thought that as the files exist, I would pose the question.
>>
>> Thanks,
>>
>> Andy
>>
>
>


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
