XS Blueprints:Datastore Simple Backup and Restore

From OLPC
Revision as of 19:19, 30 July 2008 by 18.85.47.114 (talk) (TODOs and future work)
Jump to: navigation, search

At a school that has an XS in place, the Datastore Simple Backup (aka ds-backup) provides an automatic and invisible backup of the documents present in the Journal of each XO. This can be used for recovery of old documents if they have been deleted or overwritten and when the laptop is replaced or reflashed.

The storage of documents in the XS is also useful for other uses, such as a simple publishing mechanism.

A strong DS-Backup facility and good availability of the XS allows users to delete large documents to have space on their XOs, knowing that the XS will hold their docs.

Scenarios

  • Jim has deleted or changed a TurteArt activity he did last month, and now he wants it back to use it as a starting point for a new activity.
  • Jocinta's XO broke, and has been repaired and reflashed, so her documents are gone. She wants to retrieve them from the backup on the XS.

Implementation Notes

XO side

A script checks every 30 minutes (on cron - see our TODOs) whether it is appropriate to attempt a backup run - the checks include whether we've completed a backup today, whether we can reach the XS, power status, etc. This script has a random sleep so that clients hit the XS staggered over time (unfortunate that crond doesn't offer this).

If it decides to go ahead, it will grab a lock and ask the XS for permission to start a backup. If the XS is too busy, the process on the XO will retry a few times with exponential back-off between the attempts.

The backup process itself uses rsync over SSH - the XS preserves the previous backup attempt, so we'll transfer incremental updates. An incomplete run can be completed by the next attempt. Once an rsync run completes successfully, the client runs a second, separate run to "touch" a flag-completed file on the XS, marking success.

XS side

We have 3 processes

Traffic control

A simple mod_python script that checks of the XO is registered, and provides basic "traffic control" to keep the load from the backup processes from swamping the XS (and potentially the network).

Backup-complete script

When the client transfers the "flag completed" file to the XS, incrond (an inotify monitor) fires off an execution of `postprocess.py`, which makes a hardlinked copy of the just-transferred directory. It also updates the "datastore-latest" symlink to point to the latest snapshot.

Daily cleanup

Executed on cron, it

  • enforces a per-user "soft" quota - once over the quota, the oldest snapshots for the user are deleted
  • for snapshots over a given age (1 month?) it only keeps one per month - removing intermediary snapshots for that user
  • it attempts to hardlink copies across users

Test plans and user walkthrough

Joe - feel free to edit, expand and/or move this section to a different page as you see fit :-)

Testing the backup run

  1. Start with an unregistered XO, register it with the XS - either over an Active Antenna mesh connection, or a regular AP wifi connection.
  2. After registration, you need to restart the XO (this is part of the XO side of the registration process as of build 708 / joyride 2121, might not be needed in later builds).
  3. Create some documents on the XO - or have them created before registration.
  4. Wait until the backup run happens - it will be triggered once a day. How to recognize that it has happened?
    • On the XO, run `stat /home/sugar/.default/ds_backup-done` and look for the 'modified' time, which shows the last time it ran successfully. Check that the XO clock is set to GMT, and might be off-track. Try `TZ=America/New_York stat /home/sugar/.default/ds_backup-done` to see it in local time.
    • On the XS, a successful registration will have created a directory `/library/users/<Serial Number>` - and each successful backup run creates a new directory under `/library/users/<Serial Number>/datastore`. The directories have a datestamp, and when the backup run completes successfully, a symlink is updated to point to the latest one (called "datastore-latest").

Restore a single document without Moodle

While we get the Moodle side of things sorted, we have a provisional restore UI to support this.

  1. On the XO, ensure you are hooked up to the School Server network, and open Browse
  2. Follow the link to the Schoolserver
  3. Follow the link to "restore"
  4. You should see a listing of backup dates - pick a date
  5. You will see a listing of the documents available for that date, pick a document
  6. Clicking on that document will download it, and it will appear in the Journal (and it may auto-open - need to check that ;-) )

Restore a single document with Moodle

Note: this describes a future feature.

  1. On the XO, ensure you are hooked up to the School Server network, and open Browse
  2. Follow the link to the Schoolserver
  3. On the top-right-corner you will see "You are logged in as 'Nickname'" - where follow the link in the nickname.
  4. The page shows several tabs - click on the "WebJournal" tab (see WebJournal_Project)
  5. You should see the latest documents, and a link to "older journal entries"
  6. Optional: If choosing from the older journal entries you will see a listing of backup dates - pick a date
  7. You will see a listing of the documents available for that date, pick a document
  8. Clicking on that document will download it, and it will appear in the Journal (and it may auto-open - need to check that ;-) )

TODOs and future work

  • UPDATE Ds-backup (or deprecate it) when you make releases!
  • Automagic authentication & Moodle integration
  • In order to be simple, the initial implementation does not cover a "complete restore" scenario, which requires more work
    • on the Sugar UI to trigger, display progress and manage (cancel/retry) a "complete restore"
    • on the user aliasing that needs to take place in the "replaced laptop" scenario
  • We need to test, time and tune the traffic control & backoff settings.
  • Must confine rsync-over-ssh with a chroot jail or tight SELinux policies
  • To support better the "backup as extra storage" model
    • Allow users to "pin" a resource to avoid it being deleted
    • Teach the Journal to browse & request the backups transparently (WebDAV-based browsing?)
  • Extend into the WebJournal_Project concept that Robson Mendonça is working on.
  • Some cron.d files could be swapped out and in depending on our power situation. This can probably save some juice...