Nandblaster for XO-1.5 Design Discussion


This page records the design thoughts that led to the fully working XO-1.5 version of NANDblaster. Some of the information is now obsolete - or at least some of the ideas didn't make it into the final design.

Problems

  • XO-1.5 uses managed NAND (SD card that emulates a hard disk) with ext2/3/4 filesystems instead of raw NAND with JFFS2.
  • Old NANDblaster has several assumptions about raw NAND.
    • Assumes that the device can have bad blocks anywhere so it has to map logical block numbers to physical block numbers to work around that.
    • Assumes that the actual physical block number doesn't matter, except that it must fall within the correct partition. The transmitted block numbers can therefore be a sequential, unbroken run, with the receiver distributing them to good physical blocks and skipping to the next partition start as necessary. For managed NAND with a conventional filesystem, an image block must be placed at exactly the specified location, so the target block number must be transmitted explicitly and can't be assumed sequential; otherwise we would lose the ability to skip blocks (see the header sketch after this list).
    • Implicitly assumes that the image is already compact (internal JFFS2 compression) so there is no need to reduce network bandwidth with additional compression. For non-internally-compressed filesystems, we probably want to add some explicit compression.
    • Assumes that it is possible and necessary to erase a block before writing it. Managed NAND has no concept of erase - you can write zeros or ffs, but that is not "special" at the high-level interface.
    • Assumes that only the non-empty blocks will be transmitted and that the other ones should be left in the erased state. For a conventional filesystem, we need to be explicit about everything.
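
To make the explicit-placement point concrete, here is a minimal sketch of what a per-packet header might carry in the managed-NAND case. The struct name, field names, and field widths are illustrative assumptions, not the existing NANDblaster wire format.

  #include <stdint.h>

  /*
   * Hypothetical per-packet header for the managed-NAND protocol.
   * Unlike the raw-NAND case, the target block number is carried
   * explicitly, so the receiver writes each reconstructed 128 KiB
   * block at exactly the specified offset on the SD card; blocks
   * may therefore be skipped or arrive out of order.
   */
  struct nb15_packet_header {
      uint32_t target_block;    /* absolute 128 KiB block number on the device */
      uint16_t subblock_index;  /* which of the N FEC sub-blocks this packet carries */
      uint16_t subblock_count;  /* N: sub-blocks needed to reconstruct the block */
      uint16_t payload_len;     /* payload length in bytes (possibly compressed) */
      uint16_t flags;           /* e.g. payload-compressed, constant-fill, ... */
  };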

Ideas

  • Probably should compress at the packet level - just prior to sending a network packet - and decompress upon reception (a sketch appears after this list). That minimizes the impact on the existing code, which is fairly rigid in its assumptions about how to reconstruct a full erase block from any N subblock packets. As an optimization, use a chunk size that is likely to compress to less than half an Ethernet packet and send two chunks in one packet?
  • Probably should add something to the protocol to represent all-0 or all-ff blocks. The former is particularly common and deserves a fast-path that conserves both network bandwidth and FEC-decoding time.
  • Even better, specify that a range of block numbers contains all zeros or maybe all <some 32-bit number>.
  • Packets specifying a range of constant-data blocks need to be transmitted redundantly to guard against packet loss (see the constant-range packet sketch after this list).
  • Since the partition map is just a block, we could *nearly* ignore partition specifications in the protocol - but it might be nice to have a mode whereby you update just the contents of a single partition, after checking that the existing partition map has a corresponding partition that is large enough.
  • Transmit sources:
    • Clone existing image
    • .zd format (necessary for secure updates because you have to sign the .zsp file)
    • Preformatted packet sequence for ultimate speed.
  • To reduce wear on the device, do we want to capture sections (e.g. 400 MB worth) in memory and write them to disk once? That possibly increases transmit time unless we can overlap reception of additional packets with write processing. To do the overlap, we would divide memory in two (or maybe asymmetrically, based on the relative speeds of packet transmission versus FEC processing and write times). If there is enough memory for the whole thing, just use it...
  • On average, I think the FEC + write is likely to be faster than the network transmission, which maxes out at about 2.4 Mbytes/sec. So we could just have a pool of packet frames in memory: whenever we complete a set for a block, write the block and recover its packet frames (see the frame-pool sketch after this list). Assuming 128 KiB blocks and 1310-byte packet frames (100 packets per block), we need about 820K index locations - 3.3 MB assuming 32-bit pointers - per gigabyte of transmitted filesystem data. So a 16 GB filesystem takes about 50 MB of index, leaving 450 to 950 MB for packet storage. Assuming that a block's packets are mostly transmitted close together, so there is a high probability of getting a complete set quickly, that should work well. A typical filesystem image might fit entirely in memory on 1 GiB machines. Keep the free packet frames on a linked list so finding one is a constant-time operation; there is no need to sort them.
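
A sketch of the packet-level compression idea from the first bullet above. The use of zlib, the function names, and the raw-fallback flag are assumptions for illustration; lzop or another compressor could be swapped in per the benchmarks below.

  #include <string.h>
  #include <zlib.h>

  /*
   * Compress one sub-block payload just before it goes on the wire.
   * If compression does not shrink the chunk (e.g. already-compressed
   * data), fall back to sending it raw and record that in the flag.
   * Returns the number of bytes to transmit, or 0 on error.
   */
  static unsigned long pack_payload(const unsigned char *chunk, unsigned long chunk_len,
                                    unsigned char *out, unsigned long out_max,
                                    int *compressed_flag)
  {
      uLongf clen = out_max;

      if (compress2(out, &clen, chunk, chunk_len, Z_BEST_SPEED) == Z_OK &&
          clen < chunk_len) {
          *compressed_flag = 1;
          return clen;
      }
      if (chunk_len > out_max)
          return 0;
      memcpy(out, chunk, chunk_len);       /* send as-is */
      *compressed_flag = 0;
      return chunk_len;
  }

  /* Receiver side: undo pack_payload() before handing the data to FEC. */
  static unsigned long unpack_payload(const unsigned char *wire, unsigned long wire_len,
                                      unsigned char *out, unsigned long out_max,
                                      int compressed_flag)
  {
      uLongf dlen = out_max;

      if (!compressed_flag) {
          if (wire_len > out_max)
              return 0;
          memcpy(out, wire, wire_len);
          return wire_len;
      }
      return uncompress(out, &dlen, wire, wire_len) == Z_OK ? dlen : 0;
  }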
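
One possible encoding for the constant-data range packets mentioned above; the struct name and fields are assumptions, but they capture the intent that a run of identical blocks costs one small, periodically repeated packet instead of data plus FEC.

  #include <stdint.h>

  /*
   * Hypothetical packet announcing that a run of blocks consists of a
   * repeated 32-bit pattern (typically 0x00000000 or 0xFFFFFFFF), so
   * no data packets and no FEC decoding are needed for those blocks.
   * Because a single lost packet would otherwise lose the whole range,
   * these packets are retransmitted periodically during the session.
   */
  struct nb15_constant_range {
      uint32_t first_block;    /* first 128 KiB block in the run */
      uint32_t block_count;    /* number of consecutive blocks */
      uint32_t fill_pattern;   /* value repeated throughout the range */
  };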
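
And a sketch of the packet-frame pool with a linked free list, following the sizing arithmetic above (roughly 8192 blocks and 819,200 ≈ 820K frame slots per GiB at 128 KiB blocks and ~1310-byte payloads). The names and layout are assumptions, not existing code.

  #include <stdint.h>

  #define SUBBLOCKS_PER_BLOCK 100      /* ~1310-byte payloads per 128 KiB block */

  /* One received sub-block payload, plus the free-list link. */
  struct frame {
      struct frame *next_free;         /* valid only while the frame is on the free list */
      uint8_t       payload[1310];
  };

  struct pool {
      struct frame  *free_list;        /* LIFO of unused frames */
      struct frame **index;            /* index[block * SUBBLOCKS_PER_BLOCK + sub] -> frame */
      uint16_t      *received;         /* per-block count of sub-blocks received so far */
  };

  /* Constant-time allocation: pop a frame off the free list (NULL if exhausted). */
  static struct frame *frame_get(struct pool *p)
  {
      struct frame *f = p->free_list;
      if (f)
          p->free_list = f->next_free;
      return f;
  }

  /* Constant-time release: push a frame back onto the free list. */
  static void frame_put(struct pool *p, struct frame *f)
  {
      f->next_free = p->free_list;
      p->free_list = f;
  }

  /*
   * Record an arriving sub-block.  Returns nonzero when the block has a
   * complete set; the caller then runs FEC, writes the 128 KiB block to
   * the SD card, and returns that block's frames with frame_put().
   */
  static int subblock_arrived(struct pool *p, uint32_t block, uint32_t sub, struct frame *f)
  {
      p->index[(uint64_t)block * SUBBLOCKS_PER_BLOCK + sub] = f;
      return ++p->received[block] >= SUBBLOCKS_PER_BLOCK;
  }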

Wish List

  • Some sort of flexible scheme where partitions could be resized on the fly.
  • It would be nice if there were a way to mark a block as don't-care, both at the device level and the filesystem level.
  • I am considering doing the high-level parts of this in Forth, relegating only the FEC processing and compression/decompression to C. Makes it easier to debug...


Compressor Selection

The speeds and ratios below were measured on an XO-1.5 using the indicated programs to compress and decompress the file /lib/libc-2.11.so, repeating the operation several times using /dev/null as the data sink.

  Program   Compress   Decompress   Ratio       Comp time for     Decomp time for   Network time at      Network time at
            MB/sec     MB/sec       comp/orig   1 GiB, sec (min)  1 GiB, sec (min)  2 MB/sec, sec (min)  1 MB/sec, sec (min)
  lzma       0.25        6           0.28        4096 (68.3)       170 (2.8)         143 (2.4)            286 (4.8)
  gzip       2.3        20           0.36         445 (7.4)         51 (0.85)        184 (3.1)            368 (6.1)
  lzop      17.0        48           0.50          60 (1.0)         21 (0.35)        256 (4.3)            512 (8.5)
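
The derived columns follow directly from the measured rates and the compression ratio. The small sketch below reproduces them (assuming 1 GiB = 1024 MB of cleartext; the printed values agree with the table to within a second of rounding).

  #include <stdio.h>

  /* Recompute the derived table columns from the measured numbers. */
  struct result { const char *name; double comp_mb_s, decomp_mb_s, ratio; };

  int main(void)
  {
      const struct result r[] = {
          { "lzma",  0.25,  6, 0.28 },
          { "gzip",  2.3,  20, 0.36 },
          { "lzop", 17.0,  48, 0.50 },
      };
      const double clear_mb = 1024.0;           /* 1 GiB of uncompressed data */

      for (unsigned i = 0; i < sizeof r / sizeof r[0]; i++) {
          double comp_mb = r[i].ratio * clear_mb;
          printf("%-5s comp %4.0f s  decomp %3.0f s  net@2MB/s %3.0f s  net@1MB/s %3.0f s\n",
                 r[i].name,
                 clear_mb / r[i].comp_mb_s,     /* compression time */
                 clear_mb / r[i].decomp_mb_s,   /* decompression time */
                 comp_mb / 2.0,                 /* transmit time at 2 MB/sec */
                 comp_mb / 1.0);                /* transmit time at 1 MB/sec */
      }
      return 0;
  }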

Analysis:

The selected data size of 1 GiB assumes that all-0 blocks will be detected and transmitted very efficiently at the protocol level. The non-0 portion is about 1 GiB in the clear, compressing down to between about 280 MB and 512 MB.

  • lzma is too slow for on-the-fly compression and would only be feasible for off-line preparation of a pre-compressed image. If network conditions are bad (low throughput) it might still be worthwhile.
  • gzip is marginal for on-the-fly compression, especially in conjunction with the FEC generation step (need to benchmark that on XO-1.5). It will be the rate-limiting step unless the network speed is rather slow. But it still might be the winner under typical conditions. The compression time might overlap almost perfectly with the network time on the sending side. Decompression is probably fast enough not to cause problems for the receiver.
  • lzop is the speed demon, but if the network is slow, the compression ratio penalty becomes significant. However, if gzip rate-limits transmission to say 1 MB/sec, while lzop would let it run at 2 MB/sec, then lzop would win handily (256 < 368).

It would be nice if we could take advantage of the fact that the data in the .zd file is already compressed, but I'm not sure how to fit that into the FEC scheme without a lot of surgery. I suppose we could coalesce adjacent compressed blocks of data until we get 128K, then unblock later. That could get fairly complicated...

Comments

[dsd] I don't think the "cloning the sender's disk" option is very useful, though, and I was wondering if it would make your life easier if we forgot about that option.

[wmb] Cloning was quite convenient in early testing of NB-1.0, as I didn't have to prep a stick when I made a protocol change - just download the changed firmware into a machine, turn it on, and voila it was a sender. In the 1.5 case, since internal and external SD cards are pretty much identical, cloning might be somewhat more useful in the deployment case than it was with XO-1. You could use it as a scalable mass-duplicator, for example, requiring no special setup. This brings up an interesting security issue: obviously cloning to the internal disk should be prevented when the client is secure, but what about cloning to an external SD card?

[dsd] Also, we still have some flexibility in the .zd format; maybe we could make it more suitable for nandblasting.

[wmb] Excellent point. I suppose we could have a data format where the zdata is FEC-encoded and each sub-block separately compressed. That would make the .zd file somewhat larger, since the FEC redundancy overhead would be included (about 20% overhead) and the compression efficiency would be reduced (smaller blocks). So maybe 30% overhead overall as a guess. It wouldn't affect the non-NB fs-update time too badly, as the first N subblocks are actually the unencoded data, so no need to do FEC if you can read them reliably. It would allow you to recover from some read errors, if fs-update were able to do FEC decoding as necessary.