Saturday, July 5, 2008

Re: [HACKERS] introduction of WIP window function patch

On Sun, 2008-07-06 at 03:40 +0900, H.Harada wrote:
> Hi,
>
> 2008/7/6 Simon Riggs <simon@2ndquadrant.com>:
> >
> > On Sat, 2008-07-05 at 16:20 +0200, Martijn van Oosterhout wrote:
> >> On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote:
> >
> >> > http://umitanuki.net/pgsql/wfv01/design.html
> >> >
> >> > The problem is, as written in the "Things to discussed" section of the
> >> > document, how you define window functions (e.g. RANK()). My idea is to
> >> > treat them as specialized functions such as SET OF functions and mark
> >> > it in pg_proc. But this doesn't resolve RANK() boundary problem.
> >>
> >> Actually, I would make RANK() and ROW_NUMBER() act more like
> >> aggregates. ISTM you have two kinds of window functions:
> >>
> >> - aggregation: a result is calculated over a set and the result copied
> >> across all the rows.
> >> - order depenadant: same as above, but the result is different for each
> >> row.
> >>
> >> I think you could make the latter work using the current aggregation
> >> setup, just by calling the final_func for each row rather than just
> >> once at the end.
> >
> > AFAICS there's no overlap between windowed aggregates and normal
> > aggregates, so we can different infrastructure for each. I like the
> > suggestion of doing it very similarly to current aggregates, but I would
> > introduce a new function hook for windowed aggregates, wfunc.
>
> I think there are two types of functions for windowed mode.
> - windowed aggregate
> this type of function is exactly same as normal aggregate. So we use
> functions that have been in pgsql already. Actually in my patch above,
> I didn't introduce any new function. This type of function includes
> simply sum(), avg(), etc. which returns same values on a partition or
> a window frame.
>
> - windowed function
> this is the NEW type of function. I guess we should add a new function
> type to pgsql. This type of function includes rank(), rank_dense(),
> row_number(), etc. Windowed functions returns different values per
> tuple.
>
> The difference between two types is if the function returns the same
> value during a partition or different values.
>
> So, windowed aggregate and normal aggregate overlap each other. How
> you know which one is that you see OVER clause in SQL just after the
> function call. When you see OVER after func(), and pg_proc says it's
> an aggregate, it's a windowed aggregate. Otherwise, it's a windowed
> function.
>
> If I misunderstood about those definitions please correct me.

Yes, I understand that and I think Martijn does also.

I've done some thinking and rooting around on this and I think I have a
different proposal for you, different to what we just discussed.

SQL2008 specifies window functions as

* rank functions
* distribution functions: percent_rank() and cume_dist()
* rownumber()
* ntile()
* lead() and lag()
* first, last and n-th value functions
* inverse distribution functions (similar to n-th value, based upon
distribution function results)

plus window aggregate functions (the normal aggregates COUNT, SUM etc)

Now looking through all of those, I don't see *any* window functions
that need access to different datatypes, or actually need to see the
values of the attributes.

The normal aggregates work with windows identically to the way they do
without windows, so no change needed there.

AFAICS we could define all of the non-aggregate window functions on the
above list *without* defining them as functions in pg_proc. That would
be a benefit because the window functions are very powerful and we'd
need to give them access to any/all tuples in the window.

So that would mean we don't provide a mechanism for user-defined
windowed aggregate functions at all. Which solves the discussion about
how to pass generic info through to them (at least long enough to get
the first implementation done).

We do already have such functions in code, e.g. greatest(). Sure they
need to be defined in code, but we don't need to come up with a generic
API for them.

If you disagree, think about how we'd implement lag() or ntile() and
what info we'd need to pass them.

--
Simon Riggs

www.2ndQuadrant.com

PostgreSQL Training, Services and Support


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

No comments: