XS backup restore

Revision as of 15:40, 17 October 2007
XO wants to initiate a backup.

In the following text, all timestamps are integers representing seconds
elapsed since the UNIX epoch.

XO side

1. Issue an HTTP GET to XS with path
   /backup/<protocol version>/last/<this_XO_serial_number>
   <protocol version> is the integer representing the latest
   backup protocol version supported by this XO. In protocol version 1,
   a successful reply consists of two comma-separated integers:
       timestamp -- timestamp of latest backed up item for this user                                    
                    or 0 if there are no previous backups

       nonce -- a random 64-bit integer

   If the sent protocol version is not supported by the school server,
   it will return a 404 Not Found error whose body contains only
   a comma-separated list of integers representing the backup protocol
   versions supported by this school server.

   If this school server refuses to provide backup service for this XO,
   it will return a 403 forbidden error.

2. If the request in step 1 succeeded, go to step 3. Otherwise, if none
   of the backup protocol versions supported by the XO (multiple may be
   present) appear in the list of versions returned in the 404 error body,
   abort until the next scheduled backup time (we cannot back up to this
   XS). If a version was returned that is also supported locally, go back
   to step 1 and use that protocol version.

3. Let to_backup_all be the collection of _all_ items currently in the
   XO's datastore. If the timestamp returned in step 1 is 0, let
   to_backup be the same collection.
   If it is not 0, let to_backup be the collection of all items whose
   timestamp is greater than _or equal_ to the returned timestamp.

4. Write out a plaintext index of all items in to_backup, where the
   index format is defined by the backup protocol version selected in
   step 2. For version 1, I propose a list of lines where the first
   line is a single integer stating the backup protocol version, and
   each following line is a JSON-encoded list describing a single
   entry in to_backup (metadata and filename). This list may include
   references to other files (e.g. thumbnails) as part of the metadata.

   Move this index in the datastore directory to a file called
   'backup.idx' overwriting an old such file if present.

5. Write out a plaintext index of all items in to_backup_all where
   the index format is defined by the backup protocol version selected
   in step 2. For version 1, I propose a list of lines where the first
   line is a single integer stating the backup protocol version, and
   each following line is just a UUID of each object in the list
   (meaning currently on the XO).

   Move this index in the datastore directory to a file called
   'backup-state.idx' overwriting an old such file if present.

6. For every item in to_backup, also write a line to a text file
   called 'backup-files.idx' in the datastore directory, overwriting
   the old file.  Each output line contains only the full path to the
   binary data for each object. For objects that have additional binary
   files associated (such as thumbnails), output an additional line
   per file.

7. Run rsync, telling it to read the list of input files
   from 'backup-files.idx' and write to a directory called 'backup-new/'
   in the user's home directory on the school server.

   Check the exit value from rsync. If non-zero, retry step 7 up to 3 times.
   If still non-zero, abort until next backup. Otherwise, proceed to step 8.

8. Issue a GET request to the XS, with path /backup/<protocol
   version>/new/<XO_serial_number>. For protocol version 1, include a
   'Backup-Auth' header whose value is the hex digest of
   SHA-1(<nonce>+<XO_UUID>), where '+' denotes concatenation, 'nonce'
   is the value received in step 1, and XO_UUID is this XO's UUID.
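
The XO-side steps that involve parsing and arithmetic (1, 2, 3, 7 and 8)
can be sketched roughly as below. This is only an illustration under
stated assumptions, not the actual client: the function names, the
(uuid, timestamp) item shape, picking the highest common version, the
decimal text encoding of the nonce before hashing, and the rsync
destination string are all hypothetical. Only the '--files-from' option
of rsync is relied on for step 7.

```python
import hashlib
import subprocess


def parse_last_reply(body):
    """Step 1: parse the version-1 reply body 'timestamp,nonce'."""
    timestamp_text, nonce_text = body.strip().split(",")
    return int(timestamp_text), int(nonce_text)


def pick_version(local_versions, error_body):
    """Step 2: pick a version both sides support, or None to abort.

    error_body is the 404 body: comma-separated integers. Choosing the
    highest common version is an assumption; the text only asks for one
    that exists on both sides.
    """
    server_versions = {int(v) for v in error_body.split(",") if v.strip()}
    common = server_versions & set(local_versions)
    return max(common) if common else None


def select_items(items, last_timestamp):
    """Step 3: items is a list of (uuid, timestamp) pairs (assumed shape)."""
    if last_timestamp == 0:
        return list(items)
    return [item for item in items if item[1] >= last_timestamp]


def rsync_command(files_from, destination):
    """Step 7: build an rsync call that reads its file list from a file."""
    return ["rsync", "-a", "--files-from=" + files_from, "/", destination]


def run_with_retries(command, retries=3, run=subprocess.call):
    """Step 7: run once, then retry up to `retries` times on non-zero exit."""
    for _ in range(1 + retries):
        if run(command) == 0:
            return True
    return False


def backup_auth(nonce, xo_uuid):
    """Step 8: hex digest of SHA-1 over the decimal nonce plus the UUID.

    Encoding the nonce as decimal ASCII text is an assumption.
    """
    return hashlib.sha1((str(nonce) + xo_uuid).encode("ascii")).hexdigest()
```

The 'run' parameter exists only so the retry logic can be exercised
without a real rsync binary.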


XS side

On the school server, when getting a request for 
/backup/<protocol version>/last/<SN>:

1. Check if we support the protocol version. If not, return 404 and a list
   of supported versions. Otherwise, proceed.

-- Everything below describes protocol version 1:

2. Check if we know this machine (can find it in our registration DB on
   the XS). If not, return 403. We will not offer it backup service.
   Otherwise, proceed.

3. Check if backups for this machine exist. In protocol version 1, if
   backups don't exist, let timestamp be 0. Otherwise, let timestamp be
   the timestamp of the last backed-up object for this machine; it is
   returned in step 5.

   (I deliberately don't specify where the school server stores the timestamp,
   as it might use mysql/sqlite/plain files for this, and the XO doesn't 
   and must not care.)

4. If backups for this machine don't exist yet, let nonce be
   0. Otherwise, find the file 'nonce' in the backup hierarchy for
   this XO, e.g.  /backups/<SN>/nonce and load its contents into the
   variable nonce.

5. Return comma-separated timestamp and nonce in the body of a 200 OK
   response.
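
A minimal sketch of the /last handler described above, assuming a
hypothetical function that returns an HTTP status and a body string.
Since the text deliberately leaves timestamp storage unspecified, the
registration check and the timestamp lookup are passed in as callables;
their names and signatures are illustrative only.

```python
import os

SUPPORTED_VERSIONS = (1,)


def handle_last(version, serial, is_registered, last_timestamp,
                backups_root="/backups"):
    """Return (status, body) for GET /backup/<version>/last/<serial>.

    is_registered:  callable(serial) -> bool (hypothetical DB lookup)
    last_timestamp: callable(serial) -> int, or None if no backups exist
    """
    # Step 1: unsupported version -> 404 listing what we do support.
    if version not in SUPPORTED_VERSIONS:
        return 404, ",".join(str(v) for v in SUPPORTED_VERSIONS)
    # Step 2: unknown machine -> 403.
    if not is_registered(serial):
        return 403, ""
    # Steps 3 and 4: no backups yet means timestamp 0 and nonce 0.
    timestamp = last_timestamp(serial)
    if timestamp is None:
        return 200, "0,0"
    with open(os.path.join(backups_root, serial, "nonce")) as nonce_file:
        nonce = nonce_file.read().strip()
    # Step 5: comma-separated timestamp and nonce.
    return 200, "%d,%s" % (timestamp, nonce)
```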

On the school server, when getting a request for 
/backup/<protocol version>/new/<SN>:

1. Check if we support the protocol version. If not, return 404 and a list
   of supported versions. Otherwise, proceed.

-- Everything below describes protocol version 1:

2. If no 'Backup-Auth' header is present, return 403. Otherwise, proceed.

3. Load the contents of the 'nonce' file from the backup hierarchy for
   this XO (e.g. /backups/<SN>/nonce) into the nonce variable. If there
   is no nonce file, use '0' for the nonce variable.

4. Find the XO's UUID in the local database and load it into the XO_UUID
   variable. Verify that the contents of the 'Backup-Auth' header
   match exactly the hex digest of SHA-1(<nonce>+<XO_UUID>). If not,
   return 403; otherwise return an empty (no body) 200 OK response to the
   client and proceed to the next step.

   (Note: the nonce circus is required to keep a malicious actor
   from inhibiting all backups on their network by watching for /last
   GETs, then issuing /new GETs 5 seconds later for the same XO. As
   the backup won't have completed, getting an updater running on the
   server would invalidate the backup, as will be seen in the following
   steps.)

5. Spawn an updater process in the background that does this:

   5.1. Issue a call to a setuid helper command that makes the
        'backup-new' folder in the XO's home directory (on the server)
        writable by the updater UID.
   5.2. Check if a file exists in the XO's home directory, within the
        dir 'backup-new', called 'backup.idx.processing'. If the file
        does not exist, go to step 5.3.
        If its timestamp is NOT older than 10 minutes, exit the
        updater.  (We don't allow users to force us to do index
        updates for backups more frequently than once in 10 minutes.)
        If the timestamp is older than 10 minutes:

        * Check if a file called 'backup.idx.processing.pid' exists
          AND is owned by us. If so, read its contents -- it contains
          a PID of the updater that tried to deal with the new backup
          -- and issue a SIGKILL to that PID.

        * Go to step 5.4.
   5.3. Move 'backup.idx' to 'backup.idx.processing' and write our own
        PID to 'backup.idx.processing.pid'. If the move
        fails (because 'backup.idx' doesn't exist), go to last step.
        Check if 'backup-state.idx' exists.  If not, go to last step.

   5.4. Read 'backup.idx.processing' by line. The first line is a
        single backup protocol integer. If this updater doesn't
        support this version, the client sent a backup even though we
        told it not to. Go to last step.

        For every following line, check that the object filename it
        references exists in the 'backup-new' folder. If it exists,
        move this file to the server's real backup hierarchy,
        e.g. /backups/<SN>/ and add a record to the server's backup DB
        backend (whatever it is) for this object.  If the file doesn't
        exist, move to next line.

   5.5. Move 'backup-state.idx' to the server backup hierarchy, e.g.
        /backups/<SN>/backup-state.idx. Generate a 64-bit nonce
        and write it out to /backups/<SN>/nonce.

   5.6. Delete everything in 'backup-new' and exit the updater.
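
The nonce check (steps 3 and 4) and the 10-minute test on
'backup.idx.processing' (step 5.2) might look roughly like this. It is a
sketch under assumptions: the helper names are hypothetical, the nonce
is hashed as the text stored in the nonce file, the file's modification
time stands in for "its timestamp", and the constant-time comparison
via hmac.compare_digest is an addition not required by the text.

```python
import hashlib
import hmac
import os
import time

STALE_AFTER_SECONDS = 10 * 60


def load_nonce(backup_dir):
    """Step 3: read the stored nonce, or '0' if no nonce file exists."""
    try:
        with open(os.path.join(backup_dir, "nonce")) as nonce_file:
            return nonce_file.read().strip()
    except FileNotFoundError:
        return "0"


def auth_matches(header_value, nonce, xo_uuid):
    """Step 4: exact match against the hex digest of SHA-1(nonce + UUID)."""
    expected = hashlib.sha1((nonce + xo_uuid).encode("ascii")).hexdigest()
    return hmac.compare_digest(header_value.encode("utf-8"),
                               expected.encode("ascii"))


def lock_state(processing_path, now=None):
    """Step 5.2: classify the processing file.

    'absent' -> continue with step 5.3
    'fresh'  -> exit the updater (less than 10 minutes old)
    'stale'  -> kill the recorded PID if we own it, then go to step 5.4
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(processing_path)
    except FileNotFoundError:
        return "absent"
    return "stale" if now - mtime > STALE_AFTER_SECONDS else "fresh"
```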


XO wants to do a restore.

On the XO side:

1. Issue an HTTP GET to the XS with path
   /backup/<protocol version>/restore/<this_XO_serial_number>

-- Everything below describes protocol version 1:

   The response is 0 or a single absolute path on the XS, pointing to
   the location of this XO's backup files in the backup hierarchy. If
   the response is 0, abort and report to user; there are no backups
   to restore. Otherwise store the 'path' variable for future use.

   If the request returns a 500, abort and report to user that they
   must pick out restore files individually from the web interface.

   If the request returns a 503, wait 1 minute, then retry step 1,
   otherwise proceed.

2. Let variable index_path be the concatenation of the path variable
   from step 1 and the string 'restore.idx'. Rsync the file whose path
   is index_path from the XS to the XO.  This file is a set of lines
   formatted like the contents of a 'backup.idx' file -- produced
   exactly like in step 4 of the XO-side backup.  In other words, the
   first line repeats the protocol version, and every next line
   describes a single DS object. If the rsync fails, retry 3 times;
   if still failing, abort restore and report to user.

3. For every item in this list, parse out any paths to files,
   and write each one (e.g. one for the binary object, one for the
   thumbnail) to a local file called 'restore-files.idx', one per line.

   Note that the paths contained in 'restore.idx' that we
   received in step 2 are absolute paths _on the schoolserver_,
   e.g. /backups/<SN>/<filename>, and those paths MUST be preserved
   when writing to 'restore-files.idx'.

4. Run rsync on the XO, going from the schoolserver to the XO, and
   pass it 'restore-files.idx' as the list of files to rsync.

5. Check the rsync exit value. If non-zero, retry 3 times. If still
   non-zero, abort and report failure to the user.

6. Go back through the list received in step 2 line by line.  For
   every file path in the current line (there might be several for
   e.g. binary object, thumbnail, etc), strip everything except the
   filename -- remove the directory components. Verify that the files
   exist locally on the XO.

   If they don't exist, rsync didn't get all the files back, but there
   should have been some (because we didn't get 0 as the response in
   step 1) AND rsync thinks it succeeded (because of step 5). Abort
   restore and report to user that something is wrong.

   If the files exist, issue a request to the DS to create the object
   based on the metadata in the line, and pass in stripped file paths
   for the contents/thumbnail.

   (Note: if the DS does not support setting creation timestamps or
   thumbnails through the present API, another function might have to
   be added specifically for the restore system to use, where such
   functions are allowed.)

7. If the last line in the list received in step 2 is processed and
   stored in the DS, we have succeeded with the restore. Inform
   user. Eat some ice cream. Do the Macarena.
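
Steps 3 and 6 of the XO-side restore could be sketched as follows. The
index entry layout is not fixed by the text, so this assumes each entry
line is a JSON dictionary with hypothetical keys 'file' and 'thumbnail'
holding absolute XS paths; the function names are illustrative as well.

```python
import json
import os

PATH_KEYS = ("file", "thumbnail")  # hypothetical metadata keys


def parse_restore_index(text):
    """Split 'restore.idx' into (protocol version, list of entries)."""
    lines = [line for line in text.splitlines() if line.strip()]
    return int(lines[0]), [json.loads(line) for line in lines[1:]]


def remote_paths(entries):
    """Step 3: absolute XS paths, unchanged, for 'restore-files.idx'."""
    return [entry[key] for entry in entries
            for key in PATH_KEYS if entry.get(key)]


def local_names(entry):
    """Step 6: strip the directory components from each path."""
    return [os.path.basename(entry[key])
            for key in PATH_KEYS if entry.get(key)]


def missing_files(entries, restore_dir):
    """Step 6: names rsync should have delivered but that are absent."""
    return [name for entry in entries for name in local_names(entry)
            if not os.path.exists(os.path.join(restore_dir, name))]
```

A non-empty result from missing_files corresponds to the "abort restore
and report to user" branch of step 6.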


On the school server, when getting a request for 
/backup/<protocol version>/restore/<SN>:

1. Check if we support the protocol version. If not, return 404 and a list
   of supported versions. Otherwise, proceed.

-- Everything below describes protocol version 1:

2. Check if backups for this machine exist. If not, return 200 OK whose
   only body contents is '0'. Otherwise, proceed.

3. Check if a file called 'restore.idx' exists in the backup hierarchy
   for the XO. If so, return absolute path to this XO's files in the
   server backup hierarchy (e.g. /backups/<SN>/) as sole body of a 200
   OK response. If it doesn't exist, proceed.

4. Check if a file called 'restore-state.idx' in the backup hierarchy
   for the XO exists. If not, return error 500. For some reason we don't
   have a state file for this machine; this shouldn't happen, but it means
   the user has to pick out objects to restore individually from the
   web interface.

5. Return 503 service unavailable, and in the background, spawn
   a restore process that does the following:

     5.1. Check if a file called 'restore-state.idx.processing' in the
          backup hierarchy for this XO exists. If not, proceed to next
          step. If it exists, and its timestamp is older than 10
          minutes, we tried to prepare a restore list for this machine
          already and somehow failed (e.g. database timeouts,
          etc). Check if 'restore-state.idx.processing.pid' exists and
          is owned by us; if so, load its contents and send SIGKILL
          to the PID, then move to step 5.3. If the timestamp is
          less than 10 minutes old, exit.

     5.2. Move 'restore-state.idx' in the XO's backup hierarchy to

     5.3. Write our own PID to 'restore-state.idx.processing.pid'.

     5.4. To a temporary file, write a line containing the backup
          protocol version.

     5.5. For each line in 'restore-state.idx.processing'
          (representing a UUID), query all the relevant metadata from
          the XS store and write it, one JSON dictionary-encoded line
          per object, to the temporary file. Paths of any referenced
          files (binary objects, thumbnails) must be absolute paths on
          the XS. If any queries fail, retry with a timeout, and if
          failure continues, exit the updater.

     5.6. When finished, move temporary file to 'restore.idx' in the
          backup hierarchy for this XO. Unlink