Tuesday, August 26, 2008

[BUGS] Aborted VACUUM FULL -> crash + corruption (xlog non-existent)

Hello,

[note: in pastes below, the only thing changes are database names and
hostnames, for privacy reasons; otherwise plain cut'n'paste including
typos...)

Very short version: I cancelled a VACUUM FULL; server crashed; won't
start again because it tries to access an obsolete WAL log file that
is likely to be from the point in time when VACUUM FULL started.

Long version:

The following refers to PostgreSQL 8.2.4 running on Linux 2.6.19-3 on
an old sarge system. I realize 8.2.4 is too old a release, but the bug
seems distinct enough that I felt I should definitely report it. I did
not find anything that looked relevant in the point release changelog
for 8.2.

Background is that I have a couple of databases on the system with >
1000 relations (if that is at all relevant). Because of autovacuum
issues (likely having to do with a too small max_fsm_relations) I
wanted to perform a VACUUM FULL.

I did, but realized I actually wanted VACUUM VERBOSE FULL. So after a
few seconds I hit ctrl-c in the psql client (run locally on the
machine as the postgres user; thus ident authed and connected as the
postgres PostgreSQL user). Output was:

contentsharing=# \timing
Timing is on.
contentsharing=# vacuum full
contentsharing-# ;
Cancel request sent
ERROR: canceling statement due to user request
PANIC: cannot abort transaction 830706491, it was already committed
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
!> \q

After this I tried re-starting the database which failed (i.e., it was
not up after the start attempt). Checking the log file I had:

Aug 26 17:14:52 host postgres[5094]: [3116-1] user=,db= LOG: archived transaction log file "0000000100000018000000ED"
Aug 26 17:16:06 host postgres[28467]: [2-1] user=postgres,db=databasename ERROR: canceling statement due to user request
Aug 26 17:16:06 host postgres[28467]: [2-2] user=postgres,db=databasename STATEMENT: vacuum full
Aug 26 17:16:06 host postgres[28467]: [2-3] ^I;
Aug 26 17:16:06 host postgres[28467]: [3-1] user=postgres,db=databasename PANIC: cannot abort transaction 830706491, it was already committed
Aug 26 17:16:06 host postgres[5091]: [2-1] user=,db= LOG: server process (PID 28467) was terminated by signal 6
Aug 26 17:16:06 host postgres[5091]: [3-1] user=,db= LOG: terminating any other active server processes
Aug 26 17:16:06 host postgres[16307]: [2792-1] user=databasename,db=databasename WARNING: terminating connection because of crash of another server process
Aug 26 17:16:06 host postgres[16307]: [2792-2] user=databasename,db=databasename DETAIL: The postmaster has commanded this server process to roll back the current
Aug 26 17:16:06 host postgres[16307]: [2792-3] transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Aug 26 17:16:06 host postgres[16307]: [2792-4] user=databasename,db=databasename HINT: In a moment you should be able to reconnect to the database and repeat your
Aug 26 17:16:06 host postgres[16307]: [2792-5] command.
Aug 26 17:16:06 host postgres[5091]: [4-1] user=,db= LOG: all server processes terminated; reinitializing
Aug 26 17:16:07 host postgres[28483]: [5-1] user=postgres,db=databasename FATAL: the database system is starting up
Aug 26 17:16:07 host postgres[28482]: [5-1] user=,db= LOG: database system was interrupted at 2008-08-26 17:15:52 CEST
Aug 26 17:16:07 host postgres[28482]: [6-1] user=,db= LOG: could not open file "pg_xlog/0000000100000018000000ED" (log file 24, segment 237): No such file or directory
Aug 26 17:16:07 host postgres[28482]: [7-1] user=,db= LOG: invalid primary checkpoint record
Aug 26 17:16:07 host postgres[28482]: [8-1] user=,db= LOG: could not open file "pg_xlog/0000000100000018000000ED" (log file 24, segment 237): No such file or directory
Aug 26 17:16:07 host postgres[28482]: [9-1] user=,db= LOG: invalid secondary checkpoint record
Aug 26 17:16:07 host postgres[28482]: [10-1] user=,db= PANIC: could not locate a valid checkpoint record
Aug 26 17:16:07 host postgres[5091]: [5-1] user=,db= LOG: startup process (PID 28482) was terminated by signal 6
Aug 26 17:16:07 host postgres[5091]: [6-1] user=,db= LOG: aborting startup due to startup process failure

And the following is presumably from my manual start-up attempt (based on timestamps):

Aug 26 17:17:11 host postgres[28511]: [1-1] user=,db= LOG: could not bind socket for statistics collector: Cannot assign requested address
Aug 26 17:17:11 host postgres[28511]: [2-1] user=,db= LOG: disabling statistics collector for lack of working socket
Aug 26 17:17:11 host postgres[28511]: [3-1] user=,db= WARNING: autovacuum not started because of misconfiguration
Aug 26 17:17:11 host postgres[28511]: [3-2] user=,db= HINT: Enable options "stats_start_collector" and "stats_row_level".
Aug 26 17:17:11 host postgres[28512]: [4-1] user=,db= LOG: database system was interrupted at 2008-08-26 17:15:52 CEST
Aug 26 17:17:11 host postgres[28512]: [5-1] user=,db= LOG: could not open file "pg_xlog/0000000100000018000000ED" (log file 24, segment 237): No such file or directory
Aug 26 17:17:11 host postgres[28512]: [6-1] user=,db= LOG: invalid primary checkpoint record
Aug 26 17:17:11 host postgres[28512]: [7-1] user=,db= LOG: could not open file "pg_xlog/0000000100000018000000ED" (log file 24, segment 237): No such file or directory
Aug 26 17:17:11 host postgres[28512]: [8-1] user=,db= LOG: invalid secondary checkpoint record
Aug 26 17:17:11 host postgres[28512]: [9-1] user=,db= PANIC: could not locate a valid checkpoint record
Aug 26 17:17:11 host postgres[28511]: [4-1] user=,db= LOG: startup process (PID 28512) was terminated by signal 6
Aug 26 17:17:11 host postgres[28511]: [5-1] user=,db= LOG: aborting startup due to startup process failure

The pg_xlog directory contains:

-rw------- 1 postgres postgres 16777216 Aug 26 17:16 0000000100000018000000EE
drwx------ 2 postgres postgres 305232 Aug 26 17:14 archive_status

Note that the archival of the ....ED xlog file started at 17:14:52,
and I cancelled the VACUUM FULL at 17:16:06.

It is entirely likely, though I cannot say for sure, that I started
the VACUUM FULL prior to the archival of ...ED.

I still have the entire pgsql database directory untouched except for
the manual start-up attempt. I am afraid I cannot dump the thing
somewhere for inspection (it contains confidential information), but I
can assist in inspecting the contents.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org

1 comment:

PKHunter said...

I did the same. Just FYI, I did not see the problems you mention.

I am on PG 8.3.

I was frustrated with the VACUUM FULL, which was taking hours for me. It is slow enough to sometimes even take days for some people.

Luckily for me, after canceling the VACUUM FULL process, I immediately reindexed the entire database using this command line utility:

> reindexdb DBNAME

This took about 10 minutes as my indexes are quite large, but this was a great boon as the DB is humming along just fine after this.