Thursday, September 25, 2008

Re: [HACKERS] Review Report: propose to include 3 new functions into intarray and intagg

No problem, I have time for clearing. But are these functions
guaranteed to be included in the contrib? If there is no guarantee,
seems the time of clearing will be wasted. (5 years ago I have already
cleaned one open-source library on demand and after that it was not
approved for PEAR repository, so now I am more circumspect, sorry :-).

So - is the approval procedure finished? If not, who could make the
final decision ("be or not to be")? Sorry, I don't yet know well the
patch proposal procedure...

P.S.
1. Unfortunately GROUP BY + ORDER BY is sometimes 1000 times (!)
slower than _int_group_count_sort. Because of that I have created this
function.
2. I have to assume that the input is sorted in functions, because
else the performance is lowered very much. Sort operation is quite
expensive; checking if an array is sorted or not is also quite
expensive... So I think it is not a good idea to add
sorting-independency to functions.
3. Seems the conversion of functions to anyarray/anyelement is much
more time-expensive than simply including it to specialized
intarray/intagg. Is it absolutely necessery?


On Mon, Sep 15, 2008 at 4:38 PM, Markus Wanner <markus@bluegap.ch> wrote:
> Hi,
>
> sorry for not having completed this review, yet. As you are obviously
> looking at the patch as well, I'll try to quickly write down my points so
> far.
>
> Trying to compile the intarray module, I now receive an error:
>
> error: 'INT4OID' undeclared (first use in this function)
>
> That can be solved by including "catalog/pg_type.h" from
> contrib/intarr/_int_op.c.
>
>
> The PG_FUNCTION_INFO_V1 and prototype definition certainly belong to the top
> of the file, where all others are.
>
> Some lines are longer than 80 columns and again comments are a bit sparse or
> even useless (no "additional things", please).
>
>
> Heikki Linnakangas wrote:
>>
>> I find it a bit unfriendly to have a function that depends on sorted
>> input, but doesn't check it. But that's probably not a good enough reason to
>> reject an otherwise simple and useful function. Also, we already have uniq,
>> which doesn't strictly speaking require sorted input, but it does if you
>> want to eliminate all duplicates from the array.
>
> I think it's a performance optimization which is absolutely required in some
> cases. Some time ago I've also had to rip out the sorting step from certain
> intarray module functions to save processing time.
>
> One option already mentioned somewhere would be saving a 'sorted' property
> for the array. Then again, I think such a thing would certainly have to be
> done globally, for all kinds of arrays.
>
>> _int_group_count_sort seems a bit special purpose. Why does it bother to
>> sort the output? That's wasted time if you don't need sorted output, or if
>> you want the array sorted by the integer value instead of frequency. If you
>> want sorted output, you can just sort it afterwards.
>
> Agreed. IMO the normal GROUP BY and ORDER BY stuff of the database itself
> should be used for such a thing. However, that means turning an array into a
> set of rows...
>
>> Also, it's requiring sorted input for a small performance gain, but
>> there's a lot more precedence in the existing intarray functions to not
>> require sorted input, but to sort the input instead (union, intersect, same,
>> overlap).
>
> ..and exactly these are the functions I had to wrap again to strip the
> sorting step, due to poor performance for known-sorted arrays.
>
>> I realize that the current implementation is faster for the use case where
>> the input is sorted, and output needs to be sorted, but if we go down that
>> path we'll soon have dozens of different variants of various functions, with
>> different ordering requirements of inputs and outputs.
>
> Agreed.
>
> However, given the OP is using that in production, there seems to be a use
> case for the optimization, where we have none for the same function without
> it.
>
>> So, I'd suggest changing _int_group_count_sort so that it doesn't require
>> sorted input, and doesn't sort the output. The binary search function looks
>> good to me (I think I'd prefer naming it bsearch(), though, though I can see
>> that it was named bidx in reference to the existing idx function). Also, as
>> Markus pointed out, the SGML docs need to be updated.
>
> As is, it should probably also carry the '_int' prefix, because it's not a
> general purpose array function. So propose to name it '_int_bsearch'.
>
> Overall I think these functions are overly specialized and should be
> replaced be more general counterparts in core. However, until we have that,
> it's hard to refuse such a thing for contrib.
>
> By that reasoning, however, the intarray would have to provide methods for
> sorted as well as asorted input arrays as well.
>
> I'm closing my part of reviewing of this patch now. Dmitry, how do you want
> to proceed with these patches? Do you have time for some cleaning up and
> writing documentation?
>
> Regards
>
> Markus Wanner
>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

No comments: