[wal-e] Recover to latest and promote

Discussion:

Stuart Bishop

2017-03-10 17:48:35 UTC

Hi.

Is there a way to recover to the latest available recovery_target_time
or recovery_target_xid, and automatically promote? The best I seem to
be able to do is watch the logs and manually promote once WAL fetching
starts failing and I've run out of files to replay.

I believe that to do automatic promotion after recovery I need to
specify a recovery_target_time or recovery_target_xid, and I can't see
any way to determine that. Unless perhaps I download the WAL files,
navigating timeline switches, and analyze them.

--
Stuart Bishop <***@stuartbishop.net>
http://www.stuartbishop.net/
--
You received this message because you are subscribed to the Google Groups "wal-e" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wal-e+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Daniel Farina

2017-03-10 19:49:54 UTC

Post by Stuart Bishop
Hi.
Is there a way to recover to the latest available recovery_target_time
or recovery_target_xid, and automatically promote? The best I seem to
be able to do is watch the logs and manually promote once WAL fetching
starts failing and I've run out of files to replay.
I believe that to do automatic promotion after recovery I need to
specify a recovery_target_time or recovery_target_xid, and I can't see
any way to determine that. Unless perhaps I download the WAL files,
navigating timeline switches, and analyze them.

You can turn standby_mode off, and then the first WAL-E download failure
will cause a promotion.

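I am moderately cautious on this: if WAL-E, or any wrapping program, exits
with an unexpected status code, the system will leave recovery and start up.

Postgres aborts recovery only when the command dies on a signal or exits
with a status greater than 125; any other nonzero status is taken to mean
that no more WAL is available: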

[....]
 * However, if the failure was due to any sort of signal, it's best to
 * punt and abort recovery. (If we "return false" here, upper levels will
 * assume that recovery is complete and start up the database!) It's
 * essential to abort on child SIGINT and SIGQUIT, because per spec
 * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
 * those it's a good bet we should have gotten it too.
 *
 * On SIGTERM, assume we have received a fast shutdown request, and exit
 * cleanly. It's pure chance whether we receive the SIGTERM first, or the
 * child process. If we receive it first, the signal handler will call
 * proc_exit, otherwise we do it here. If we or the child process received
 * SIGTERM for any other reason than a fast shutdown request, postmaster
 * will perform an immediate shutdown when it sees us exiting
 * unexpectedly.
 *
 * Per the Single Unix Spec, shells report exit status > 128 when a called
 * command died on a signal. Also, 126 and 127 are used to report
 * problems such as an unfindable command; treat those as fatal errors
 * too.
 */
if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)
    proc_exit(1);

signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
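
For readers who want to experiment with the rule in that excerpt, here is a minimal Python sketch of the same classification. It is only an illustration: the function name classify and the four labels it returns are made up for this example, and it assumes a POSIX system where os.system() returns a raw wait status.

```python
import os
import signal


def classify(rc):
    """Classify a wait status the way the quoted Postgres excerpt does.

    The labels are hypothetical names for this sketch, not Postgres terms.
    """
    if rc == 0:
        return "restored"
    if os.WIFSIGNALED(rc) and os.WTERMSIG(rc) == signal.SIGTERM:
        # Treated as a fast shutdown request: the process exits cleanly.
        return "shutdown"
    if os.WIFSIGNALED(rc) or os.WEXITSTATUS(rc) > 125:
        # Any other signal, or exit status 126 and above: abort recovery.
        return "abort"
    # Exit status 1..125: upper levels assume recovery is complete.
    return "end-of-archive"
```

The risky bucket is the last one: any exit status from 1 to 125 lands there, whatever the real reason the command failed.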

I have modestly tried to make WAL-E safe for this purpose, but it has never
quite sat right with me to trust this mechanism in Postgres to promote a
database. Many programs, e.g. wrappers like envdir or what have you, are not
guaranteed to emit status codes > 125 for every failure other than a
genuinely missing WAL segment.

That said, it will work nearly 100% of the time.
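
One way to narrow that gap is to put a small wrapper between Postgres and the fetch command, so that only a status you have positively identified as "segment not found" is passed through and everything else is forced above 125. The sketch below is an assumption-laden illustration, not part of WAL-E: NOT_FOUND_STATUS is a hypothetical value that you would have to confirm for the command you actually wrap, and map_status and main are names invented here.

```python
import subprocess

# Hypothetical: the status the wrapped command uses for "segment not in the
# archive". This value is an assumption, not something WAL-E documents here.
NOT_FOUND_STATUS = 1

# Any value above 125 makes Postgres abort recovery instead of starting up.
ABORT_STATUS = 255


def map_status(returncode, not_found_status=NOT_FOUND_STATUS):
    """Translate the child's return code into the wrapper's exit status."""
    if returncode == 0:
        return 0
    if returncode == not_found_status:
        return returncode
    # Unknown failures and signals (negative return codes) become fatal.
    return ABORT_STATUS


def main(argv):
    """Run argv[1:] as the fetch command and return the mapped status."""
    if len(argv) < 2:
        return ABORT_STATUS
    try:
        completed = subprocess.run(argv[1:])
    except OSError:
        # The command could not be started at all.
        return ABORT_STATUS
    return map_status(completed.returncode)
```

To use it as a script, end the file with sys.exit(main(sys.argv)) (after import sys) and put the script in front of the existing fetch command in restore_command. The design choice is to fail closed: an unrecognized failure stops recovery rather than promoting the database.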

Stuart Bishop

2017-03-13 06:08:44 UTC

Post by Daniel Farina

You can turn standby_mode off, and then the first WAL-E download failure
will cause a promotion.
I am moderately cautious on this: if WAL-E, or any wrapping program, exits
with an unexpected status code, the system will leave recovery and start up.
[....]
I have modestly tried to make WAL-E safe for this purpose, but it has never
quite sat right with me to trust this mechanism in Postgres to promote a
database. Many programs, e.g. wrappers like envdir or what have you, are not
guaranteed to emit status codes > 125 for every failure other than a
genuinely missing WAL segment.
That said, it will work nearly 100% of the time.

Ah, great. Thanks for all this. It's really helped clarify my design,
and I will avoid automatic promotion when I can.

I believe it would also work if I could download all the WAL files
after downloading the backup, ensuring I had them all in place before
restarting. I think I would need to teach WAL-E to navigate timeline
changes to do that reliably (unless I just waste bandwidth and disk
space and download all future timelines, which is probably good enough
for real-life use cases).
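
As a rough illustration of what "navigating timelines" involves at the file-name level: a WAL segment name is 24 hexadecimal digits whose first 8 digits are the timeline ID, and a timeline history file is named with those 8 digits plus ".history". The Python sketch below groups a listing of names by timeline. The function names are invented for this example, and it assumes any archive-specific suffix (such as a compression extension) has already been stripped from the names.

```python
import re
from collections import defaultdict

# 8 hex digits of timeline ID followed by 16 hex digits of segment position.
SEGMENT_RE = re.compile(r"^([0-9A-F]{8})[0-9A-F]{16}$")
# Timeline history files: 8 hex digits of timeline ID plus ".history".
HISTORY_RE = re.compile(r"^([0-9A-F]{8})\.history$")


def timeline_of(name):
    """Return the timeline ID encoded in a file name, or None if it has none."""
    match = SEGMENT_RE.match(name) or HISTORY_RE.match(name)
    return int(match.group(1), 16) if match else None


def group_by_timeline(names):
    """Group WAL segment and history file names by timeline ID."""
    groups = defaultdict(list)
    for name in names:
        timeline = timeline_of(name)
        if timeline is not None:
            groups[timeline].append(name)
    # Names are fixed-width hex, so a plain sort orders segments correctly.
    return {timeline: sorted(files) for timeline, files in groups.items()}
```

This only sorts out which files belong to which timeline. Deciding which timelines are ancestors of the one you want still requires reading the history files themselves.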

Daniel Farina

2017-03-13 18:27:24 UTC

Post by Stuart Bishop
I believe it would also work if I could download all the WAL files
after downloading the backup, ensuring I had them all in place before
restarting. I think I would need to teach WAL-E to navigate timeline
changes to do that reliably (unless I just waste bandwidth and disk
space and download all future timelines, which is probably good enough
for real-life use cases).

Another thing you might find useful: when a recovery target is set, the
default behavior on reaching it is merely to "pause" recovery. Thus, you
have an opportunity to check timestamps and the data for suitability. If
you want to continue archive retries, you can then call a function to
resume replay.

See https://www.postgresql.org/docs/9.6/static/functions-admin.html,
pg_xlog_replay_resume().
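
A minimal sketch of that check-then-resume step, written against a generic Python DB-API cursor. It uses the 9.6 function names from the page linked above (pg_is_xlog_replay_paused, pg_last_xact_replay_timestamp, pg_xlog_replay_resume). It assumes the server is still in recovery and accepting read-only connections, and that the connecting role is allowed to call the recovery control functions. The helper name resume_if_acceptable and the is_acceptable callback are hypothetical.

```python
STATUS_SQL = (
    "SELECT pg_is_xlog_replay_paused(), pg_last_xact_replay_timestamp()"
)
RESUME_SQL = "SELECT pg_xlog_replay_resume()"


def resume_if_acceptable(cursor, is_acceptable):
    """Resume replay only if it is paused and the replayed state looks right.

    cursor is any DB-API cursor connected to the recovering server.
    is_acceptable is a caller-supplied check on the last replayed
    transaction timestamp. Returns True if a resume was requested.
    """
    cursor.execute(STATUS_SQL)
    paused, last_replayed = cursor.fetchone()
    if not paused or not is_acceptable(last_replayed):
        return False
    cursor.execute(RESUME_SQL)
    return True
```

For example, is_acceptable could compare the timestamp against the last write you know must be present, leaving the server paused for manual inspection when the check fails.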