XO updater


This is a software design document. The design has been mostly implemented as olpc-update. Users looking to upgrade their XO should start with Upgrading the XO.

Software updates on the One Laptop per Child's XO laptop

Problem statement and scope

This document aims to specify the mechanism for updating software on the XO-1 laptop. When we talk about updating software, we are referring both to system software, such as the OS and the core services controlled by OLPC that are required for the laptop's basic operation, and to any installed user-facing applications ("activities"), whether provided by OLPC or by third parties.

System updater

Core goals

The three core goals of a software update tool (hereafter "updater") for the XO are as follows:

Security 
Given the initial age group of our users, the only reasonable default is automatic detection and installation of updates, both so that security patches can be applied in a timely fashion and so that users benefit from rapid development and improvements in the software they're using. Automatic updates, however, are a security issue unto themselves: compromising the update system in any way can give an attacker the ability to wreak havoc across entire installed bases of laptops while bypassing -- by design -- all the security measures on the machine.
Therefore, the security of the updater is paramount and must be its first design goal.
Uncompromising emphasis on fault-tolerance 
Given the scale of our deployment, the relatively high complexity of our network stack when compared to currently-common deployments, the unreliability of Internet connectivity even when available, and perhaps most importantly our desire for participating countries to soon begin customizing the official OLPC OS images to best suit them, it is clear that our updater must be fault-tolerant. This is both in the simple sense -- cryptographic checksums need to be used to ensure updates were received correctly -- and in the more complex sense that the likelihood of a human error with regard to update preparation goes up proportionally to the number of different base OS images at play. A fault-tolerant updater will therefore allow _unconditional_ rollback of the most recently applied update. "Unconditional" here means that, barring the failure of other parts of the system which are dependencies of the updater (e.g. the filesystem), the updater must always know how to correctly unapply an applied update, even if the update was malformed.
Low bandwidth 
For much the same reasons (project scale, Internet access scarcity and unreliability) that require fault-tolerance from the updater, the tool must take maximum care to minimize data transfer requirements. This means, concretely, that a delta-based approach must be utilized by the updater, with a "keyframe" or "heavy" update being strictly a fallback in the unlikely case an update path cannot be constructed from the available or reachable delta sets.

Design

It is given, due to requirements imposed by the Bitfrost security platform, that a laptop will attempt to make daily contact with the OLPC anti-theft servers. During that interaction, the laptop will post its system software version, and the response provided by the anti-theft service will optionally contain a relative URL of a more recent OS image.

If such a pointer has been received and the laptop is behind a known school server, it will probe the school server via rsync at the provided relative URL to determine whether the server has cached the update locally. If the update is not available locally, the laptop will wait up to 24 hours, checking approximately hourly whether the school server has obtained the update. If at the end of this wait period the school server still does not have a local copy of the update, it is assumed to be malfunctioning, and the laptop will contact an upstream master server directly by using the URL provided originally by the anti-theft service.

In any of these three cases (school server has update immediately, school server has update after delay, upstream master has update), we say the laptop has 'found an update source'.
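The three cases above can be sketched as a small decision routine. This is only an illustration, assuming the laptop is already behind a known school server; the function name, the returned labels, and the injected `school_has_update` and `wait` callables are hypothetical and do not come from olpc-update.

```python
from typing import Callable


def find_update_source(
    school_has_update: Callable[[], bool],
    wait: Callable[[float], None],
    max_wait_hours: int = 24,
    probe_interval_hours: float = 1.0,
) -> str:
    """Return a label for the source that ends up serving the update.

    school_has_update: hypothetical probe of the school server (e.g. via rsync).
    wait: hypothetical sleep function, injected so the logic can be tested.
    """
    # Case 1: the school server already has the update cached.
    if school_has_update():
        return "school-immediate"

    # Case 2: re-probe roughly hourly for up to max_wait_hours.
    waited = 0.0
    while waited < max_wait_hours:
        wait(probe_interval_hours * 3600)
        waited += probe_interval_hours
        if school_has_update():
            return "school-delayed"

    # Case 3: the school server is assumed to be malfunctioning,
    # so fall back to the upstream master server.
    return "upstream"
```

Passing the probe and the sleep function in as parameters keeps the policy (24 hours, hourly probes) separate from the network code, which is a design choice of this sketch rather than something the document prescribes.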

Once an update source has been found, the laptop will invoke the standard rsync tool over a plaintext (unsecured) connection via the rsync protocol -- not piped through a shell of any kind -- to bring its own files up to date with the more recent version of the system. rsync uses a network-efficient binary diff algorithm which satisfies goal 3.
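As a rough illustration of such an invocation, the sketch below builds an rsync command line that uses an `rsync://` URL, which makes rsync speak its own daemon protocol instead of running over a remote shell. The host name, paths, and the choice of `--archive` and `--delete` are assumptions made for the example; the document does not state which options the updater passes.

```python
def build_rsync_command(host, relative_url, dest_dir):
    """Build an argument list for syncing an OS image over the rsync protocol.

    host, relative_url and dest_dir are hypothetical inputs; relative_url
    stands in for the relative URL handed out by the anti-theft service.
    """
    # An rsync:// source selects the rsync daemon protocol (no shell transport).
    source = "rsync://{}/{}/".format(host, relative_url.strip("/"))
    # --archive preserves permissions, times and links; --delete removes
    # local files that no longer exist in the newer image.
    return ["rsync", "--archive", "--delete", source, dest_dir]
```

The result is an argument list rather than a shell string, so it could be handed to `subprocess.run` without any shell being involved on the client side either.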

Design note: peer-to-peer updates

It is desirable to provide "viral update" functionality at a later date, such that two laptops with different software versions (and without any notion of trust) can engage in an update to bring the laptop with the older software fully up to date.

However, determining how to provide this functionality securely, efficiently and elegantly is not feasible on the Gen1 FRS timeline. Therefore, laptop-to-laptop updates will NOT be a part of the updater that ships with the FRS image, and are a candidate for release 2-3 months after FRS.

Design note: rsync scalability

rsync is a known CPU hog on the server side. It would be absolutely infeasible to support a very large number of users from a single rsync server. This is far less of a problem in our scenario for three reasons:

High branching factor 
In all normal circumstances, the vast majority of the rsync traffic to our upstream servers will come from school servers, not individual laptops. If school servers are unavailable or malfunctioning, there will not be a flood of requests from individual laptops, because the school servers are likely to be those laptops' only gateway to the Internet.
Element of randomness in anti-theft requests 
Instead of hitting the update servers every hour on the hour, the laptops are already including an element of randomness in choosing when to contact the anti-theft service. This random delay propagates to the rsync requests, as well.
In-depth stagger abilities on the server side 
Because notification of new updates is performed by the anti-theft service which is aware of a laptop's locale, updates can be staggered over several days by country, region, or any other metric such as server load.

Additionally, some optimizations can be added to rsync proper to aid with our use case, but such engineering will need to wait until after FRS.
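The effect of the random element in contact times can be illustrated with a jittered delay. This is a minimal sketch: the one-day base interval matches the daily anti-theft contact described earlier, but the jitter fraction and the function name are assumptions chosen for illustration.

```python
import random


def next_contact_delay(base_seconds=24 * 3600, jitter_fraction=0.25, rng=random):
    """Return the delay in seconds until the next anti-theft contact.

    The delay is the base interval plus or minus a random share of it,
    so that laptops do not all contact the servers at the same moment.
    jitter_fraction is a hypothetical tuning parameter.
    """
    jitter = rng.uniform(-jitter_fraction, jitter_fraction) * base_seconds
    return base_seconds + jitter
```

With the defaults, each laptop picks a delay somewhere between 18 and 30 hours, which spreads both the anti-theft requests and the rsync requests that follow them.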

Implementation

In order to implement runtime file protection, Bitfrost relies on the COW functionality of the Linux-VServer patchset. The functionality imbues immutable hardlinks within a designated context with special meaning: when broken by some destructive file operation, VServer will replace these hardlinks with the content of the file they were pointing to and apply the desired operation on the resulting copy.

The XO updater will run in a special context to which the security service has exposed the entire underlying filesystem as a COW copy. The updater will update this COW copy in-place with rsync. This COW mechanism simply ensures no excess authority lies with the updater; any failures or vulnerabilities in it do not propagate to the rest of the system.

One file contained within each OS image will be its cryptographically signed manifest; at the end of the rsync operation, the laptop will have obtained that file. At this point, the updater will request that the security service apply the update. Note that due to the nature of rsync, we can stop and restart the network phase of a single update several times as connectivity becomes available, until we've received the complete update.

The security service will terminate the updater and then analyze the manifest and confirm the modified files in the updater's context exactly match the expected OS image end-state. If any discrepancy is discovered, the updater context will be discarded and the update operation aborted.
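The exact-match check described above might look roughly like the following. The manifest layout (a mapping from relative path to SHA-256 hex digest) is an assumption, since the document does not specify the format or the hash function, and verification of the manifest's own signature is deliberately left out of this sketch.

```python
import hashlib
import os


def file_sha256(path):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_matches_manifest(root, manifest):
    """Return True only if the tree under root matches the manifest exactly.

    manifest: hypothetical dict of {relative path: sha256 hex digest}.
    Extra files, missing files and altered files all count as discrepancies.
    """
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            # Unknown file, or known file with unexpected content.
            if manifest.get(rel) != file_sha256(full):
                return False
            seen.add(rel)
    # Every file listed in the manifest must actually be present.
    return seen == set(manifest)
```

A `False` result corresponds to the case where the updater context is discarded and the update aborted.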

If the update is verified to be complete and correct, the security service will mark it as such, and designate the files within it to be the files exported into all newly-created containers. System service containers will be restarted gracefully. If the image manifest did not contain a header identifying that image as a high-priority update, the update process ends here. Restartable services have been restarted, and the rest of the system will be initialized from the update on reboot.

If the update has been marked as high-priority, the user will be asked to close applications and reboot the machine immediately. A timer will run that will reboot the machine in 60 minutes if the user does not do so. The high-priority timer can be disabled in the security center; its purpose is merely to provide some extra protection to the youngest users who cannot necessarily be expected to understand or comply with the reboot request.

On boot, the first initialization script to run will perform a pivot_root operation to the directory that currently holds the OS image marked bootable by the security service. With the example above, it would be the directory that belonged to the updater's context. If a key is held down during boot, however, the pivot_root is performed to the _old_ bootable context, and the user is presented with a dialog asking whether to make the rollback permanent.

The kernel is the only exception to this handling: in the event that a verified update contains an updated kernel, that kernel will be placed into a predetermined place in the underlying filesystem by the security service. Open Firmware will preferentially boot this newer kernel unless the rollback key combination is held down during boot.

Notice that the update operation has been reduced to a simple state toggle between (any) two OS images. In so doing, we have satisfied goals 1 and 2.
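The "state toggle" idea can be modeled in a few lines. This is a conceptual sketch only: the class, its method names, and the image labels used with it are hypothetical, and the real mechanism operates on directories and pivot_root rather than Python objects.

```python
class ImageSelector:
    """Tracks the bootable OS image and the one it replaced."""

    def __init__(self, current):
        self.current = current   # image marked bootable
        self.previous = None     # image kept for rollback, if any

    def apply_verified_update(self, new_image):
        # A verified update becomes bootable; the old image is retained.
        self.previous, self.current = self.current, new_image

    def boot_target(self, rollback_key_held=False):
        # Holding the rollback key selects the old image for this boot only.
        if rollback_key_held and self.previous is not None:
            return self.previous
        return self.current

    def make_rollback_permanent(self):
        # Assumption: a permanent rollback simply swaps the two images.
        if self.previous is not None:
            self.current, self.previous = self.previous, self.current
```

Because applying and undoing an update are both just a change of which image is selected, rollback never depends on the update's contents being well-formed.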

Application updater

Design

The XO eschews traditional dependency-based approaches to package management, making application upgrades somewhat difficult. The problem is compounded by the fact that Bitfrost does not permit applications to update themselves in-place, which is a common update method on platforms such as Mac OS X and Windows.

When it comes to application updates, we wish to stay true to our goals of security and low-bandwidth updates, but are willing to settle for less fault tolerance as necessitated by the fact that most activities won't be OLPC-written or maintained.

The design should make it possible to have a single tool that can ascertain the existence of updated versions of any currently installed activities, and then fetch and install those updates. It should do so bandwidth-efficiently, such that files that are unchanged between activity versions aren't downloaded as part of the update, and also such that identical resource files packaged by multiple activities are never downloaded more than once, or not at all if they already exist on the system.

Implementation

A manifest file is added to the bundle format specification. The manifest consists of the filename and strong cryptographic hash of every file in the bundle. Another file is added, called 'origin', that specifies a URL where updated activity bundles may be found, and a public key which will be used to sign such updated bundles.
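To make the manifest concrete, the sketch below derives one from a bundle held in memory. SHA-256 and the dictionary layout are assumptions; the document only requires a filename and a strong cryptographic hash per file, and the bundle specification may define a different on-disk format.

```python
import hashlib
import io
import zipfile


def build_manifest(bundle_bytes):
    """Return {filename: sha256 hex digest} for every file in a ZIP bundle."""
    manifest = {}
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
        for info in bundle.infolist():
            if info.is_dir():
                continue  # directories carry no content to hash
            data = bundle.read(info.filename)
            manifest[info.filename] = hashlib.sha256(data).hexdigest()
    return manifest
```

Hashing the uncompressed contents means two bundles that package the same file produce the same manifest entry for it, regardless of how each archive was compressed.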

When a global activity update is initiated, the updater enumerates the origins for all installed activities, then probes each one in turn to determine which activities have available updates. The resulting activity list is the 'available update set'.

The most up-to-date bundle for each activity in the set is accessed, and only the several kilobytes holding the ZIP file index are downloaded. Since bundles are simple ZIP files, this index (the central directory, which the ZIP format stores at the end of the archive rather than the start) records byte offsets for the constituent compressed files. The updater then locates the bundle manifest in each index and makes an HTTP request with the respective byte range to each bundle origin. At the end of this process, the updater has cheaply obtained a set of manifests of the files in all available activity updates.

A local database of manifests of all installed activities is kept, pruned only to records for files larger than a set size, e.g. 50 KB. The updater cross-references each manifest from the available update set with the installed database, and then with other manifests in the set. Files which exist locally and are also present in the available update set aren't downloaded; the updater simply "plants" the files in the right places. The same happens for identical files present in multiple bundles in the available update set; they are only downloaded once.
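The cross-referencing step can be sketched as a planning function that decides, per bundle, which files to download and which to "plant" from a copy that is already available. The data shapes and names here are assumptions for illustration, and the size-based pruning of the local database is not modeled.

```python
def plan_downloads(available, installed_hashes):
    """Split each bundle's files into those to download and those to plant.

    available: hypothetical {bundle name: {filename: content hash}}.
    installed_hashes: set of content hashes already present on the system.
    """
    fetched = set()  # hashes already scheduled for download in this run
    plan = {}
    for bundle in sorted(available):
        download, plant = [], []
        for name, digest in sorted(available[bundle].items()):
            if digest in installed_hashes or digest in fetched:
                # Identical content exists locally or is already being
                # fetched for another bundle, so reuse it.
                plant.append(name)
            else:
                download.append(name)
                fetched.add(digest)
        plan[bundle] = {"download": download, "plant": plant}
    return plan
```

For example, if bundles `a` and `b` both ship the same image and `b` also needs a library that is already installed, only `a` downloads the image and `b` downloads neither file.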

After a bundle (minus any redundant files) has been downloaded, it is unpacked and reassembled (if it needs any of the files that haven't been downloaded because they already exist). Cryptographic signature verification is performed. If remaining disk space is larger than a particular margin, e.g. 20%, then the context containing the older version of the activity bundle is kept around, and the user given the ability to perform rollback on the activity update. Otherwise, the old version bundle is destroyed.
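The retention rule at the end of this step reduces to a simple comparison. The sketch assumes the margin is measured as free space relative to total capacity, which the document does not spell out; the function names are hypothetical.

```python
import shutil


def keep_old_version(free_bytes, total_bytes, margin=0.20):
    """Return True if enough space remains to keep the old bundle for rollback."""
    return total_bytes > 0 and free_bytes / total_bytes > margin


def keep_old_version_on(path, margin=0.20):
    """Apply the same rule to the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return keep_old_version(usage.free, usage.total, margin)
```

Note that free space exactly equal to the margin does not qualify, since the text asks for remaining space "larger than" the margin.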

Notes on sugar-update-control

This is currently implemented by sugar-update-control. Documentation can be found at software update.

In a deployment scenario, you would create an appropriate group file (e.g., http://wiki.laptop.org/go/Activities/Peru, or a group file living on your school server at a URL like [[1]]) and then create a .xo bundle that installs /home/olpc/Activities/.groups pointing at the appropriate location. You would then install that .xo bundle from your customization key. Preseeded web content on the school server can serve the group file URL as well as the updated activity bundles.



Author 
    Ivan Krstić
    ivan AT laptop.org
    One Laptop per Child
    http://laptop.org
Metadata 
    Revision: Draft-14
    Timestamp: Tue Jun  26 17:51:45 UTC 2007

