Thursday, July 3, 2008

Re: [HACKERS] gsoc, text search selectivity and dllist enhancments

=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= <j.urbanski@students.mimuw.edu.pl> writes:
> attached are two patches against HEAD. The smaller one is meant to be
> commited - it adds some functions that manipulate double-linked lists,
> namely inserting a new cell after or before another cell and swapping
> two adjacent cells.

> The gzipped one is WIP for my GSoC project. I've reworked the algorithm
> for determing most common lexemes.

I looked over this a bit. I'm not excited about adding functionality to
Dllist --- that data structure is barely used at all in the backend,
and I think a better case could be made for getting rid of it than
adding code to it. The analyze patch doesn't change my mind on the
point, because I don't think that Dllist is really helping you there
anyway. The data structure I'd suggest is a simple array of pointers
to the underlying hash table entries. Since you have a predetermined
maximum number of lexemes to track, you can just palloc the array once
--- you don't need the expansibility properties of a list. The only
operations you need are "add an entry at the end" (if you keep the
array sorted by descending count not ascending), "remove the end
entry", and "swap adjacent entries", all of which are actually cheaper
on an array than on a Dllist.

Another point is that you don't really need the array to be sorted all
the time. Instead of using what is basically an O(N^2) incremental
sort, you could consider applying qsort() when you do need it to be
sorted, which is at the end or when the table overflows and you need to
discard some entries. If you are willing to discard multiple entries
per overflow event, this could be quite cheap --- although I think in
the worst case where there are many overflows, it'd go back to being
O(N^2). Note BTW that adding a count=1 entry at the end cannot make the
array unsorted if it was sorted before. The only event that renders the
array unsorted is increasing an item's count to more than the count of
its predecessor --- so it might be worth keeping a predecessor pointer
in each hashtable entry that identifies its predecessor as of the time
of the last array sort, so you could check on-the-fly and avoid
unnecessary re-sorts. I'm not totally sure that this idea is a win,
but it seems worth investigating.

Other than that, the analyze patch looked generally sane to me,
and I think you're on the right track. Please rework and resubmit.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

No comments: