Sunday, August 24, 2008

[HACKERS] Extending error-location reports deeper into the system

We currently have the ability to generate error location pointers, such
as

regression=# select nosuchcolumn from int8_tbl;
ERROR: column "nosuchcolumn" does not exist
LINE 1: select nosuchcolumn from int8_tbl;
^

for grammar-detected syntax errors and for errors during first-level
semantic analysis of certain constructs (variables, operators, functions
are about all IIRC). Beyond that, it's not possible because we don't
include location fields in any node types used in post-parse-analysis
query trees.

There was some discussion about this back in March in connection with
a patch proposed by Peter, and in
http://archives.postgresql.org/pgsql-hackers/2008-03/msg00631.php
I wrote

> the current parser location mechanism stores locations only for
> nodes that appear in raw grammar trees (gram.y output), *not* in
> analyzed expressions (transformExpr output). This was an intentional
> choice based on a couple of factors:
>
> * Once we no longer have the parser input string available, the location
> information would be just so much wasted space.
>
> * It would add a weird special case to the equalfuncs.c routines:
> should location fields be compared? (Probably not, but it seems a bit
> unprincipled to ignore them.) And other places might have comparable
> uncertainties what to do with 'em.

It occurred to me today that one of the foundational assumptions of that
old decision has changed, namely that the original query text isn't
available after parsing. We have recently fixed things so that it
almost always *is* available. So I think a reasonable case can be
made for extending most or all querytree node types to include a
location field, and then using these fields in the way sketched in
the above-referenced message to include location pointers in many more
error messages than we do now.

The point about equalfuncs behavior isn't bothering me a lot at the
moment. It seems clear that we *do* want equal() to ignore location
fields, because one of the main purposes it's used for is to note
whether, eg, "ORDER BY x" and "GROUP BY x" are referring to the same
variable, and of course those two occurrences aren't going to have the
same syntactic position.

Another interesting point is outfuncs/readfuncs behavior. I think that
we'd want outfuncs to print the location fields, because they're
possibly useful for debugging; but readfuncs should ignore the data and
set the fields to -1 (unknown) when reading nodes back in. The reason
for this is that the only application for reading nodes in is sucking in
stored rules, default expressions, etc. And these are exactly the cases
where we indeed no longer have the original source text handy. If we
didn't set the locations to unknown, then errors complaining about
problems arising within a rule would try to print pointers to locations
in the calling query's text having the same offsets as the problematic
item had in the original CREATE RULE or similar command. Not what we
want.

There is going to be some small distributed overhead from adding an
additional integer field to so many common Node types, but I find it
hard to believe that it'd be measurable.

So this all seems doable and probably not a very large task to get the
infrastructure in place, though of course actually extending many error
messages to include location pointers will probably happen piecemeal
over time.

Thoughts, objections? If there are none I'm tempted to work on this
in the week remaining before September commitfest.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

No comments: