Tuesday, September 9, 2008

[GENERAL] Various intermittent bugs/instability - how to debug?

Dear PostgreSQL community,

I hope you can help me with a problem I'm having - I'm stuck and
don't know how to debug this further.

I have a rather large nightly process that imports a lot of data from
the OpenStreetMap project into a PostGIS database, then proceeds doing
all sorts of things - creating spatial indexes, computing bounding
boxes, doing simplification of geometries, that kind of stuff. The whole
job usually takes about five hours.

I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine
with 8 GB RAM.

Every other night, the process aborts with some strange error message,
and never at the same position:

ERROR: invalid page header in block 166406 of relation "node_tags"

ERROR: could not open segment 2 of relation 1663/24253056/24253895
(target block 1421295656): No such file or directory

ERROR: Unknown geometry type: 10

When I continue the process after the failure, it will usually work.

I know you all think "hardware problem" now. Of course this was my first
guess as well. I ran a memory test for a night, no results; I downgraded
do "failsafe defaults" for all BIOS timings, again no change. Ran
"cpuburn" and all sorts of other things to grill the hardware - nothing.

Then I bought an entirely new machine; similar setup, but using a
Gigabyte instead of Asus mainboard, different chipset, slightly faster
Quad-Core processor, and again 8 GB RAM and Ubuntu "Hardy" with
PostgresSQL 8.3 and matching PostGIS.

Believe it or not, this machine shows the *same* problems. It is not
100% reproducible, sometimes the job works fully, but every other day it
just breaks down with one of the funny messages like above. No memtest
errors here either.

Both machines are "consumer" quality, i.e. normal Intel processors and
not the "server" (Xeon) stock.

I am at a loss - how can I proceed? This looks like a hardware problem
alright, but so simliar problems on two so different machines? Is there
something wrong with Intel's Quad-Core CPUs?

What could I do to have a better chance of reproducing the error and
ultimately identifying the component responsible? Is there some kind of
"PostgresSQL load test", something like "cpuburn" for PostgreSQL?

Have there been other reports of intermittent problems like mine, and
does anybody have any blind guesses...?

Thanks
Frederik

--
Frederik Ramm www.geofabrik.de
Geofabrik GmbH Handelsregister: HRB Mannheim 703657
Rueppurrer Strasse 4 Geschaeftsfuehrung: Frederik Ramm
76137 Karlsruhe Tel: 0721-1803560-0
frederik.ramm@geofabrik.de Fax: 0721-1803560-9

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

No comments: