Sunday, June 22, 2008

Re: [pdxpug] Meeting recap - Logic and Databases with Jeff Davis

On Sat, 2008-06-21 at 15:29 -0700, David E. Wheeler wrote:
> Congrats on the new job, Jeff!
>

Thanks!

> I think it's more useful, but if you're basing the schema on an
> existing terminology used by your purchasing department, you can't go
> too far wrong.
>

I disagree. Using status codes has several disadvantages:

(1) They tie you to a specific business process. A business process is
just an algorithm implemented by people, and therefore the status codes
they use are just implementation details of an imperative process.

(2) States can be mutually exclusive, overlapping, or one state can
imply another. This is just confusing.

(3) They inherently require branching, which is complicated from a
logical standpoint. In SQL terms, this means using the CASE statement
more often, and requiring exclusions and other special treatment in
queries.

(4) It also means that any almost-correct query will provide an answer
that *looks* correct, even if it has subtle bugs because some state
should have been excluded, but wasn't.

Compare all those problems with something simple like just having the
tables: approval, purchase, and shipment_arrival. In that case, it's
simple to draw arbitrary implications from the data without awkward
exclusions and CASE statements (or any of the problems above).

Someone querying the data only needs to understand the predicates of the
relations, they don't need to understand what the states mean, nor do
they need to work backward from the meaning of the states to some kind
of query that properly handles the various states.

The problems with states stem from the fact that relational expressions
(and SQL) are declarative, but states imply some kind of state machine.
The states pull us away from declarative language and force us into
imperative language.

Maybe I should write a blog entry about this.

> Excellent points. I've never been anywhere where we had a data
> dictionary. Still, "pending" isn't too bad -- especially if you've
> ever looked at schemas in commercial products, where there can
> sometimes seem to be a use of security through obscurity -- that is,
> some vendors seem to try to protect their IP by giving their database
> tables and columns completely meaningless names.

I would argue that your data dictionary should match your relations. If
you have useful definitions in your data dictionary, why not make the
relations match, so that you can easily make logical inferences using
the relational operators?

Even a CSV file can have a data dictionary. I don't think a data
dictionary is a justification for a weakness in a database design
(although it's certainly better than nothing).

> Boy, that sure seems like a bug. This should be legal, though:

It's not just a bug, it's a standardized bug.

SQL is confusing because it uses NULLs in at least two senses:
(1) Unknown. This is a value: the third truth value. Operators,
functions, and IN all think this is what NULL means.
(2) Nothingness. This is not a value. Aggregates and outer joins think
this is what NULL means.

COALESCE can obviously help you switch between those two senses of NULL,
but why should you have to? Why should you get a result that *looks*
correct from a query that *looks* correct when it's actually wrong?

My talk slides have a full example that illustrates such a query:
http://www.pgcon.org/2008/schedule/events/83.en.html

Any language can do anything, so I don't think of a workaround like
COALESCE as a justification for the bad standard behavior.

> Your explanation makes perfect sense. It's just that the first case
> seems wrong (to me, at least).

It is wrong -- or at least horribly inconsistent. If SQL really wanted
to have both unknown and nothingness, it should have called them two
separate things.

> Yeah. COALESCE() is your friend. This is also why I try to make
> columns NOT NULL as often as possible. NULLs are pretty evil.

I'll take this opportunity to point out that COALESCE has similar
problems to using states: it requires special cases and the result
always looks right even when it's wrong.

Compare to nil in Ruby. Every operator in SQL is defined for NULL input
in SQL, but virtually no operators are defined for nil input in Ruby.
This means that a wrong handling of the special value nil in Ruby will
almost always result in an error, but in SQL will produce a
correct-looking result.

Of course, using special cases (like nil) is still imperative
programming, but at least it's slightly less error prone.

When NULL is used in two very different senses like that in SQL, it
effectively makes it an untyped system like assembly.

Regards,
Jeff Davis


--
Sent via pdxpug mailing list (pdxpug@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pdxpug

No comments: