Saturday, May 24, 2008

Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

Now I understand the code much better. A few more questions on headline
generation that I could not answer from the code:

1. Why is hlparsetext used to parse the document rather than the
parsetext function? Since the words to be included in the headline will
be marked afterwards, it seems more reasonable to just use the parsetext
function.

The main differences I see are the use of hlfinditem and the marking of
whether a word is repeated.

This matters because hlparsetext does not seem to store word positions,
which parsetext does. Word positions are important for generating a
headline with fragments.
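
To illustrate why positions matter, here is a minimal Python sketch (a
toy model, not the PostgreSQL C code). The names parse_with_positions
and fragment_windows and the radius parameter are hypothetical, made up
for this example. It only shows that once each word carries a position,
a fragment can be described as a position window around a query hit.

```python
import re

def parse_with_positions(text):
    """Return (position, word) pairs; positions start at 1 in this sketch."""
    words = re.findall(r"\w+", text)
    return [(i, w.lower()) for i, w in enumerate(words, start=1)]

def fragment_windows(text, query_words, radius=2):
    """Hypothetical helper: one (start, end) position window per query hit."""
    words = parse_with_positions(text)
    last = len(words)
    hits = [pos for pos, w in words if w in query_words]
    # Clamp each window to the bounds of the document.
    return [(max(1, p - radius), min(last, p + radius)) for p in hits]

doc = "The quick brown fox jumps over the lazy dog near the quiet river bank"
print(fragment_windows(doc, {"fox", "river"}))
```

Without the positions, the hits for "fox" and "river" could not be
mapped back to separate windows in the document.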

2.
> I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] )
> and function should accept 'NumFragments=>N' for default parser. Another
> parsers may use another options.

Does this mean we want a unified ts_headline function that triggers
fragment generation when NumFragments is specified? Introducing a new
function that can take a configuration OID or name seems complex, as
there are so many functions handling these cases in wparser.c.

If this is true, then we just need to add marking of headline words in
prsd_headline. Otherwise we will need another prsd_headline_with_covers
function.
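
As a rough sketch of the unified approach (Python, a toy model only):
one entry point parses the options string and switches to fragment mode
when NumFragments is present. The 'Name=>N' syntax is taken from the
quoted proposal; parse_options, choose_headline_mode and the use of
MaxWords with the same syntax are assumptions for illustration, not the
existing option parser.

```python
def parse_options(options):
    """Parse a string such as 'NumFragments=>2, MaxWords=>10' into a dict.

    Keys are lower-cased; values are assumed to be integers.
    """
    result = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=>")
        result[key.strip().lower()] = int(value.strip())
    return result

def choose_headline_mode(options=""):
    """Return ('fragments', n) if NumFragments is given, else ('single', None)."""
    opts = parse_options(options)
    if opts.get("numfragments", 0) > 0:
        return ("fragments", opts["numfragments"])
    return ("single", None)

print(choose_headline_mode("NumFragments=>2"))
print(choose_headline_mode(""))
```

With this shape, callers that never pass NumFragments keep the existing
single-headline behaviour.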

3. In many cases people may already have a TSVector for a given document
(for search operations). Would it be faster to pass the TSVector to the
headline function than to compute it each time? If so, should we have
an option to pass a TSVector to the headline function?

-Sushant.

On Sat, 2008-05-24 at 07:57 +0400, Teodor Sigaev wrote:
> [moved to -hackers, because talk is about implementation details]
>
> > I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> > (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)
> Thank you.
>
> 1 > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
> now contrib/tsearch2 is compatibility layer for old applications - they don't
> know about new features. So, this part isn't needed.
>
> 2 solution to compile function (ts_headline_with_fragments) into core, but
> using it only from contrib module looks very odd. So, new feature can be used
> only with compatibility layer for old release :)
>
> 3 headline_with_fragments() is hardcoded to use default parser, but what will be
> in case when configuration uses another parser? For example, for japanese language.
>
> 4 I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] )
> and function should accept 'NumFragments=>N' for default parser. Another parsers
> may use another options.
>
> 5 it just doesn't work correctly, because new code doesn't care of parser
> specific type of lexemes.
> contrib_regression=# select headline_with_fragments('english', 'wow asd-wow wow', 'asd', '');
> headline_with_fragments
> ----------------------------------
> ...wow asd-wow<b>asd</b>-wow wow
> (1 row)
>
>
> So, I incline to use existing framework/infrastructure although it may be a
> subject to change.
>
> Some description:
> 1 ts_headline defines a correct parser to use
> 2 it calls hlparsetext to split text into structure suitable for both goals:
> find the best fragment(s) and concatenate that fragment(s) back to the text
> representation
> 3 it calls parser specific method prsheadline which works with preparsed text
> (parse was done in hlparsetext). Method should mark a needed
> words/parts/lexemes etc.
> 4 ts_headline glues fragments into text and returns that.
>
> We need a parser's headline method because only parser knows all about its lexemes.
>
>
> --
> Teodor Sigaev E-mail: teodor@sigaev.ru
> WWW: http://www.sigaev.ru/
>
>
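
Steps 2 to 4 of the description quoted above can be sketched in Python
as a toy model. This is not the PostgreSQL implementation: hl_parse,
mark_headline, glue and toy_headline are hypothetical names, step 1
(choosing the parser) is left out, and the regular-expression tokenizer
emits each token exactly once, unlike a real parser, which may produce
several lexemes for a hyphenated word.

```python
import re

def hl_parse(text):
    """Step 2: split text into items that keep every token, separators
    included, so the original text can be rebuilt later."""
    return [
        {"token": t, "is_word": bool(re.match(r"\w", t)), "selected": False}
        for t in re.findall(r"\w+|\W+", text)
    ]

def mark_headline(items, query_words):
    """Step 3: the parser-specific method marks the items to highlight."""
    for item in items:
        if item["is_word"] and item["token"].lower() in query_words:
            item["selected"] = True

def glue(items, start="<b>", stop="</b>"):
    """Step 4: concatenate the items back into text, wrapping marked ones."""
    return "".join(
        start + i["token"] + stop if i["selected"] else i["token"]
        for i in items
    )

def toy_headline(text, query_words):
    items = hl_parse(text)
    mark_headline(items, query_words)
    return glue(items)

print(toy_headline("wow asd-wow wow", {"asd"}))
```

Because marking happens on the same items that are later glued back
together, only the marking step needs to know how the parser classifies
its tokens.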


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
