XS backup restore: Difference between revisions
Revision as of 02:49, 25 April 2008
Goals
- Simple, efficient (minimise processing, traffic), quick dev turnaround, debuggable
- Sane, fail-safe, atomic-ish
- Independent of the actual storage strategy (DS-agnostic)
- And yet, it must work well with the current DS (as of April 2008), and avoid restricting the evolution of the DS
- Safe for XO and XS
- The server can refuse to back up due to traffic/load
- Simple version negotiation
- Supports full homedir restore
- Supports per-document restore (via journal and/or web-based)
- There is some interest in leveraging a web-based 'document restore' facility as an 'async document share/publish' mechanism.
Notes
- All timestamps are integers representing seconds elapsed since the UNIX epoch.
- There is a REST meta-protocol versioning scheme. Outside of that initial check, this page describes version "1" of the backup/restore protocol.
XO-initiated backup
XO side
1. Issue an HTTP GET to the XS with path
   /backup/<protocol version>/available/<this_XO_serial_number>
   <protocol version> is the integer representing the latest backup protocol version supported by this XO.
   In protocol version 1, a successful reply consists of a single integer:
   timestamp -- timestamp of the latest backed-up item for this user, or 0 if there are no previous backups
   If the sent protocol version is not supported by the school server, it will return a 404 not found error, whose only body contents is a comma-separated list of integers representing the backup protocol versions supported by this school server.
   If this school server refuses to provide backup service for this XO, it will return a 403 forbidden error.
   If the school server is too busy to deal with the XO's backup request, it will return a 503 service unavailable error. The XO will sleep 5 minutes and retry.
2. If the request in step 1 succeeded, go to step 3. Otherwise, if none of the backup protocol versions available on the XO (multiple may be present) appear in the list of versions returned in the body of the 404 error, abort until the next scheduled backup time (we cannot back up to this XS). If a version was returned that also exists locally, go back to step 1 and use that protocol version.
3. Write out all the metadata for all the documents available for backup, in CanonicalJSON format. Save it as metadata.json, overwriting (atomically) any previously existing version.
4. Run rsync-over-ssh between the datastore and a remote directory called datastore-current/ in the user's home directory on the XS. The remote datastore-current directory will have a complete set of files, so use the rsync facilities available to optimise the transfer and delete stale files:
   --times
   --partial (to make retries faster)
   --delete
   Check the exit value from rsync. If non-zero, retry up to 3 times. If still non-zero, abort until the next backup.
5. Store the epoch of the end time of step 4.
Note: This backup scheme is not atomic. Users of the backed-up data must be prepared for slightly inconsistent state between metadata and files - a large window exists between steps 3 and 5. Solutions to this could come from the FS (a ZFS-like implementation) or from a higher-level layer (a git-based DS for example).
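The reply handling in steps 1 and 2 of the XO side can be sketched as a small pure function. This is a minimal illustration, not the actual client: the function name, the action labels, and the choice of the highest common version are assumptions, and treating 403 as "abort" is one reading of the text, which only says the server refuses service.

```python
def interpret_available_reply(status, body, local_versions):
    """Map a reply to /backup/<version>/available/<SN> onto the next action.

    Returns an (action, value) tuple. The action names are illustrative only.
    """
    if status == 200:
        # Body is a single integer: timestamp of the latest backed-up item, or 0.
        return ("backup", int(body.strip()))
    if status == 404:
        # Body is a comma-separated list of protocol versions the XS supports.
        server_versions = {int(v) for v in body.split(",") if v.strip()}
        common = server_versions & set(local_versions)
        if common:
            # Assumption: prefer the highest version both sides support.
            return ("retry_with_version", max(common))
        return ("abort", None)
    if status == 503:
        # Server too busy: sleep 5 minutes (300 seconds) and retry.
        return ("sleep_and_retry", 300)
    # 403 (service refused) and anything unexpected: give up until next schedule.
    return ("abort", None)
```

Keeping the decision separate from the HTTP call makes the negotiation easy to test without a school server.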
XS side
On the school server, when getting a request for /backup/<protocol version>/available/<SN>:
1. Check if we support the protocol version. If not, return 404 and a list of supported versions. Otherwise, proceed.
2. Check if we know this machine (can find it in our registration DB on the XS). If not, return 403. We will not offer it backup service. Check if we're too busy to process another concurrent backup (e.g. based on transfer rate or number of rsync processes); if so, return 503.
3. Check if backups for this machine exist. In protocol version 1, if backups don't exist, let timestamp be 0. Otherwise, find the timestamp of the last backed-up object for this machine and return it.
4. Check system and network load metrics - can we offer service to this client?
5. Return timestamp in the body of a 200 OK response.
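The server-side checks above amount to an ordered series of early returns. The sketch below shows that ordering only; the function name, its parameters, and the in-memory stand-ins for the registration DB and the backup timestamps are all hypothetical.

```python
SUPPORTED_VERSIONS = (1,)  # this page describes protocol version 1


def handle_available(version, serial, registered, too_busy, last_backup):
    """Return (status, body) for GET /backup/<version>/available/<serial>.

    registered: set of known serial numbers (stand-in for the registration DB)
    too_busy: result of the load check, computed by the caller
    last_backup: dict mapping serial number to last backed-up timestamp
    """
    if version not in SUPPORTED_VERSIONS:
        # 404 body: comma-separated list of supported protocol versions.
        return 404, ",".join(str(v) for v in SUPPORTED_VERSIONS)
    if serial not in registered:
        return 403, ""  # unknown machine: no backup service
    if too_busy:
        return 503, ""  # client is expected to retry later
    # No previous backups means timestamp 0 in protocol version 1.
    return 200, str(last_backup.get(serial, 0))
```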
When the rsync-over-ssh connection comes in, we need to have an rsync wrapper script that will
1. Establish a lock using flock to prevent overlaps
2. Cleanup/sanitise parameter list to rsync
3. Upon successful completion, set a success flag
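Step 1 of the wrapper could look roughly like this on a POSIX system. It is a sketch under assumptions: the lock file path is chosen by the caller, the helper name is made up, and a second backup for the same machine is rejected immediately instead of waiting.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def backup_lock(lock_path):
    """Hold an exclusive, non-blocking flock on lock_path while the body runs."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Raises BlockingIOError if another open descriptor already holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        # Closing the descriptor releases the lock if we held it.
        os.close(fd)
```

The wrapper would run the sanitised rsync command inside `with backup_lock(path):` and set the success flag only if rsync exits with 0.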
XS maintenance
- A regular cronjob checks for recent success flags. Home directories that are marked as successfully backed up will be 'shadowed' with a hardcopy script similar to pdumpfs.
- It might be a good idea to spot partial/failed backups and checkpoint/shadow them anyway. If our handling of inconsistent data is reasonably good, a partial backup might be a passable data source for per-document restores.
- A low-freq cronjob runs hardlink.py
- A cronjob removes old pdumpfs snapshots
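A pdumpfs-style shadow can be approximated by copying the directory tree while hard-linking the files, so a snapshot costs almost no extra disk space. The sketch below assumes the snapshot is taken from datastore-current and named datastore-YYYY-MM-DD (the format the restore side looks for); the function name is hypothetical and this is not the actual shadowing script.

```python
import os
import shutil
from datetime import date


def shadow_backup(home_dir, day):
    """Create a hard-linked snapshot of datastore-current for the given day."""
    src = os.path.join(home_dir, "datastore-current")
    dst = os.path.join(home_dir, "datastore-" + day.isoformat())
    # copytree recreates the directories; os.link hard-links each file
    # instead of copying its contents. dst must not exist yet.
    shutil.copytree(src, dst, copy_function=os.link)
    return dst
```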
XO-initiated full restore
XO side
1. Issue an HTTP GET to the XS with path
   /backup/<protocol version>/restore/<this_XO_serial_number>
   The response is 0 or a single absolute path on the XS, pointing to the location of this XO's backup files in the backup hierarchy.
   If the response is 0, abort and report to user; there are no backups to restore. Otherwise store the path variable for future use.
   If the request returns a 500, abort and report to user that they must pick out restore files individually from the web interface.
   If the request returns a 503, wait 1 minute, then retry step 1; otherwise proceed.
2. rsync the directory provided in step 1, restoring mode and times. Retry 3 times; if still failing, abort restore and report to user.
   (Do we need to remove the fetched files in case of a dropped rsync? rsync guarantees we won't get partial files in place, so it is reasonably safe, and makes retries "incremental". As long as the metadata is restored only once step 2 succeeds, the Journal should be ok...)
3. Rebuild the metadata in Xapian, based on the metadata.json file that should have been restored by rsync in step 2. Might make sense to apply some checks. Mainly: do the files named by the metadata file exist? The race conditions that exist during the backup generation mean that the document may have changed or vanished after the metadata was created.
4. We have succeeded with the restore. Inform user. Eat some ice cream. Dance salsa.
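Step 2 of the restore can be sketched as an rsync invocation wrapped in a retry loop. Several things here are assumptions: the host name, the example paths, the helper names, and the reading of "retry 3 times" as one initial attempt plus up to three retries. `--partial` is deliberately left out so that interrupted transfers do not leave partial files in place.

```python
import subprocess


def build_restore_command(xs_host, remote_path, local_dir):
    """Build an rsync-over-ssh command that restores mode and times."""
    source = f"{xs_host}:{remote_path.rstrip('/')}/"  # trailing slash: copy contents
    return ["rsync", "--recursive", "--perms", "--times", "-e", "ssh",
            source, local_dir]


def run_with_retries(command, retries=3, run=subprocess.call):
    """Run command once, then retry up to `retries` times on non-zero exit."""
    for _ in range(1 + retries):
        if run(command) == 0:
            return True
    return False  # caller aborts the restore and reports to the user
```

A caller would pass the path returned in step 1 as `remote_path` and rebuild the metadata only when `run_with_retries` returns True.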
XS side
On the school server, when getting a request for /backup/<protocol version>/restore/<SN>:
1. Check if we support the protocol version. If not, return 404 and a list of supported versions. Otherwise, proceed.
2. Check if backups for this machine exist. If not, return 200 OK whose only body contents is 0. Otherwise, proceed.
3. Check system and network traffic load metrics. If the server is too busy, return 503 ("not now").
4. Find the latest complete backup - it should be the most recent directory following this format in the home directory for the laptop:
~/datastore-YYYY-MM-DD
Note: 'Most recent' should be interpreted from the datestamp parsed out of the directory name, not the FS ctime/mtime.
Return the directory path in a 200 response.
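Selecting the latest backup by parsed datestamp, as step 4 requires, might look like the following. The function name is hypothetical, and the caller is assumed to supply the directory names found in the laptop's home directory (for example from `os.listdir`).

```python
import re
from datetime import datetime

_SNAPSHOT_NAME = re.compile(r"^datastore-(\d{4}-\d{2}-\d{2})$")


def latest_backup_dir(names):
    """Return the datastore-YYYY-MM-DD name with the most recent valid date."""
    dated = []
    for name in names:
        match = _SNAPSHOT_NAME.match(name)
        if not match:
            continue  # e.g. datastore-current or unrelated entries
        try:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d").date()
        except ValueError:
            continue  # well-formed name but impossible date
        dated.append((stamp, name))
    # Compare on the parsed date, never on filesystem ctime/mtime.
    return max(dated)[1] if dated else None
```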