How to Damage a FLASH Storage Device

This page tells you how to degrade the performance and reliability of FLASH-based storage devices like SD cards and USB drives (pen drives, etc.), and, by implication, how not to damage them.

Damage

To damage such a device, all you have to do is reformat it with any of the usual Linux-based tools like fdisk, mkfs, and dd. Chances are excellent that you will manage to choose a layout that makes the device work extra hard, thus slowing it down and wearing it out faster.
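
For example, the kind of naive reformat that causes the trouble described here might look like the following sketch (/dev/sdX stands in for your device; the point is that every default is chosen with no regard for the NAND geometry):

  # A typical "quick reformat" - each default is likely wrong for NAND:
  fdisk /dev/sdX        # accept the default, cylinder-aligned first partition
  mkfs.ext2 /dev/sdX1   # default block size may be 1K, smaller than a NAND page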

You can also accomplish the same feat using various Windows tools that are part of the Microsoft OEM Preinstallation Kit, and probably with other Windows-based tools (e.g. dd for Windows, and perhaps even with the GUI format capability). You have to try a bit harder to mess up with Windows, because the powerful low-level tools that let you wantonly inflict damage aren't part of the normal Windows "kit".


Why this happens

Okay, here's where it gets complicated...

How NAND FLASH works

FLASH-based mass storage devices pretend to behave like hard drives, but it's really just an elaborate simulation.

Real hard drives can be read and written in units of 512 bytes, called "sectors". Every hard drive sector is independent of its neighbors, and there is no particular limit to the number of times you can change the data in a given sector.

FLASH-based storage, at its core, uses a technology called NAND FLASH. NAND FLASH is readable and writable, but with several wrinkles.

  1. The fundamental read/write unit is a "page", not a sector. FLASH devices of the 2007-2008 generation have a 2K page size, migrating to a 4K page size in the 2009 generation.
  2. You can't write a page anytime you want - before you write to it, you must first erase it. But you can't erase a single page at a time - you must erase an entire "erase block" of (typically) 64 consecutive pages (128Kbytes or 256Kbytes depending on the generation). And after you have erased the block, you can't write to the pages in an arbitrary order, you must write them sequentially starting at the first one.
  3. Blocks tend to wear out over time. After a certain number of erase cycles, a block will "go bad" permanently, so that it will no longer reliably hold data. Pages can also develop data errors as a result of write activity to other pages, and even as a result of reads!
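
To make those numbers concrete, here is a small shell sketch using the geometry described above for the 2007-2008 generation (2K pages, i.e. 4 sectors of 512 bytes; 64-page erase blocks, i.e. 256 sectors), mapping a sector number to its page and erase block:

  # Assumed geometry: 4 sectors per page, 256 sectors per erase block
  SECTOR=770
  echo "page:        $(( SECTOR / 4 ))"    # NAND page holding this sector
  echo "erase block: $(( SECTOR / 256 ))"  # erase block holding that page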

So NAND FLASH has deep fundamental differences from hard disks. But the computer industry has been using hard disks for decades, so the pressure to emulate hard disk behavior is irresistible. FLASH-based storage devices "hide" the NAND FLASH behavior, using software (running on an embedded microcontroller inside the device) to make the device appear as a hard disk to the outside world. That software is often called a "Flash Translation Layer" (FTL) or a "Flash Abstraction Layer". The hiding job is difficult. The common hard disk operation of writing to a single sector can result in multiple I/O operations involving much more than 512 bytes of the NAND FLASH. To alleviate the "wear out" problems, the FTL must move data around so that repeated writes to a given sector don't cause too many writes to the same NAND page.

The FTL software can do a decent job, but there are certain access patterns that require much more work "on the inside". If the data arrangement on disk happens to trigger the "hard" access patterns frequently, the device's performance and reliability will suffer, because the underlying NAND FLASH device will have to be accessed much more frequently. Conversely, if the data arrangement is "nice" for the NAND FLASH chip, less work has to be done, with better overall results. An example of a "bad" access pattern would be a write of 4 consecutive sectors, two of which are at the end of one erase block and the other two at the beginning of the next erase block. That requires a lot more work than writing 4 consecutive sectors that happen to all be in the same NAND page.
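
Here is a sketch of that "bad pattern" test under the same assumed geometry: a 4-sector write starting at sector 254 spans two erase blocks, while the same write starting at sector 256 stays within one:

  # Does a COUNT-sector write at sector START cross an erase-block
  # boundary? (assumes 256 sectors per erase block, as above)
  START=254; COUNT=4
  if [ $(( START / 256 )) -ne $(( (START + COUNT - 1) / 256 )) ]; then
      echo "crosses an erase-block boundary (hard for the FTL)"
  else
      echo "contained in one erase block (easy)"
  fi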

Disk Partition and Filesystem Layouts

Disks are typically divided into one or more "partitions" - subsets of the entire disk. The first sector contains a "partition map" that tells the number of active partitions and the starting sector and number of sectors for each partition.
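
You can read a device's partition map, with boundaries reported in plain sector units rather than cylinders, using fdisk (again, /dev/sdX is a placeholder):

  fdisk -l -u /dev/sdX   # -u reports partition start/end in 512-byte sectors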

Inside a partition there is a "filesystem" - a collection of data structures that lets you create, delete, and modify named files and directories. There are various filesystem types - FAT, ext2, NTFS and many others - each with its own particular data structures to implement the file and directory storage goals. Despite the detailed data structure differences, there are some common features that are present in an abstract sense in most common filesystems:

  1. Signature - most filesystems have some identifying data at an easy to find location, telling what kind of filesystem it is and declaring some top-level parameters such as the sizes and locations of particular subordinate data structures. The signature usually begins at a well-known location near the beginning of the partition. For ext2 and other Unix-tradition filesystems, the signature is called the "super block", located in sector 2 within the partition. For FAT and NTFS, the signature is called the "BIOS Parameter Block" (BPB), in sector 0 within the partition.
  2. Allocation data - most filesystems maintain tables that keep track of which sectors are already used and which are available. For FAT, the table is called the "File Allocation Table", hence the name FAT. It follows the BPB, perhaps with a few intervening "wasted" sectors. For ext2 and NTFS, the allocation maps are distributed across the disk, which is helpful for hard disks because of seek latency and also helpful for FLASH storage because it reduces the "hot spot" effect of frequent writes to the same location.
  3. Storage clusters (aka "filesystem blocks") - For reasons of performance and also to make the allocation data smaller, filesystems typically define a basic storage unit that is several sectors. It's common for data writes to be performed in multiples of that unit size. For FAT and NTFS, that unit is called a "cluster". FAT clusters are often 16 Kbytes (32 sectors) or 32 Kbytes (64 sectors) for modern devices, while 4K is common for NTFS. For ext2, the unit is called a "filesystem block" - typical sizes are 1K, 2K, and 4K.
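
For an ext2 filesystem you can inspect these parameters (block size, block group layout, and so on) with dumpe2fs from e2fsprogs; for FAT, the BPB fields sit in the first sector of the partition and can be eyeballed with a hex dump. A sketch:

  dumpe2fs -h /dev/sdX1                                # -h prints just the superblock summary
  dd if=/dev/sdX1 bs=512 count=1 | hexdump -C | head   # raw BPB of a FAT partition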

Putting it Together

How is the filesystem layout relevant to our problem? Allocation data accesses are very common, as are cluster-sized data writes. So we want those operations to be efficient. If the allocation data starts on a NAND FLASH page boundary, a given allocation map write is less likely to span two pages, so the FTL gets to do things the "easy" way, which is faster and causes less NAND wear. If the cluster size is a power-of-two multiple of the NAND FLASH page size and the first cluster is aligned on an erase block boundary, cluster writes are similarly "easy".

Conversely, if the layout is bad, every cluster write might "split" two pages, forcing the FTL to perform four internal I/O operations instead of one.
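
A small sketch of the alignment checks just described, again assuming 4-sector pages and 256-sector erase blocks (the sample numbers anticipate the 2G SD card example below):

  CLUSTER_START=768   # absolute sector of the first cluster
  CLUSTER_SIZE=64     # sectors per cluster (32 Kbytes)
  [ $(( CLUSTER_START % 256 )) -eq 0 ] && echo "cluster array is erase-block aligned"
  [ $(( CLUSTER_SIZE % 4 )) -eq 0 ]    && echo "cluster size is a whole number of pages"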

Factory formatting

The manufacturers of FLASH storage devices understand this. When they format the device at the factory, they know which filesystem they are putting on (typically either FAT16 or FAT32), the page and erase sizes for the NAND FLASH chips inside, and the characteristics of the FTL software in the internal microcontroller. (Actually, there is yet another factor - multiple NAND chips or multi-plane chips can further influence the locations of "efficient" boundaries.) Knowing this, they can choose a layout that encourages "easy case" internal operations.

Here is an example. On one (2G) SD card that I own, with a FAT16 filesystem, the partition map is sector 0, sectors 1-254 are "wasted", and the first partition begins at sector 255. The BPB (signature) is thus sector 255, and the allocation table (FAT) begins right after, sector 256. The FAT + root directory area contains 512 sectors, extending to sector 767. The cluster space begins at sector 768 and extends to the end of the device. The cluster size is 64 sectors (32Kbytes).

Notice that 256 sectors of 512 bytes each is 128Kbytes - the erase size for this NAND generation. So all of the efficiency criteria listed above are satisfied - the allocation maps and cluster array are aligned and the cluster size (32K) is a power-of-two multiple of the NAND page size (2K). Also notice that the partition map (sector 0) and BPB (sector 255) are at the beginning and end, respectively, of the same erase block, with the intervening sectors unused. Since the partition map and BPB are read-only in normal use, that erase block never needs to be erased or written, thus decreasing the likelihood of complete data loss due to corruption of those critical sectors. The FAT area, which is a "hot spot" for FAT filesystems, does not share an erase block with cluster data, which makes it easier for the FTL to transparently move the FAT for wear leveling.

A 4G SD card from a different manufacturer takes a different approach to following the efficiency rules. The partition map in sector 0 is followed by 8191 unused sectors. The partition starts at sector 8192 (0x2000) (4 MBytes). The cluster array begins at sector 16384 (0x4000) (8 MBytes). So everything is strongly aligned and the various sections are well-separated.

Screwed-up formatting

If you use fdisk to repartition your FLASH-based device, unless you know the rules described herein, you are unlikely to end up with "good" partition boundaries. For historical reasons involving physical geometry of ancient disk drives coupled with the programming model of ancient IDE controllers juxtaposed with legacy BIOS APIs, fdisk likes to do things in units of "cylinders", where a cylinder is H "heads" times S "sectors per track". For even more tedious reasons, the maximum number for H is 255 (0xff) and for S is 63 (0x3f). Furthermore, it is common practice these days to "max out" H and S, so a cylinder is 255 * 63 = 16065 sectors. That number is poorly aligned for any purpose, so if you let fdisk start the first partition on a cylinder boundary, you lose. (Modern devices don't really care about heads and sectors at the interface level, but the legacy software data structures force you to pretend that they do and to make up H and S numbers.)
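
The escape from the cylinder trap is to make the tools work in sectors. With fdisk, the -u flag lets you type an explicit, well-aligned start sector when creating the partition. With the sfdisk of this era, a one-line recipe might look like the following sketch (the 8192-sector / 4 MByte start copies the second factory example above; the exact input syntax varies between sfdisk versions):

  # -uS = accept input in sector units; "8192,,c" = start at sector 8192,
  # use the rest of the device, partition type 0x0c (FAT32 LBA)
  echo '8192,,c' | sfdisk -uS /dev/sdX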

Another common way to lose is to start the first partition in sector 1, thus minimizing "wasted" sectors but making it very likely that multi-sector writes will "split pages". It also puts your partition map at risk, because it has to be erased/rewritten frequently while writing data structures near the beginning of the partition. (This happens to be the layout that the Windows OEM Preinstallation Kit tools for "Ultra Low Cost PCs" use for formatting SD cards and USB keys.)

Other common partition starting points are sectors 16, 32, and 63. None of these are good for NAND FLASH-based devices, because you really want to separate - into different erase blocks - the partition map from frequently-written data which is often near the beginning of a partition.

Your second set of opportunities to mess up comes when you use "mkfs" to create the filesystem. You really want to ensure that important data structures are nicely aligned and of good sizes. The first problem is that, unless you're really paying attention, you are probably taking the partitioning for granted so you have no idea about the alignment of your partition start. Unless you know that, you can't know how to set intra-partition offsets. The second problem is that, by default, "mkfs.ext2" often chooses a 1K block size, which is not a multiple of 2K, so a filesystem block write is almost guaranteed to "split a page" - and if the net alignment of the block space is odd, it will often split 2 pages!
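
So take control of both numbers: choose the partition start yourself (as above) and force a filesystem block size that is a multiple of the page size. A minimal sketch for a 2K-page device (newer versions of mke2fs also accept extended options describing the underlying geometry, such as -E stride, but the block size is the essential knob):

  # 4K blocks: every filesystem block write covers whole NAND pages
  mkfs.ext2 -b 4096 /dev/sdX1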

Making FAT filesystems is subject to all the same kinds of problems, but the layout details are different. You must worry about FAT size and root directory size and cluster size and reserved sectors.
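
With mkdosfs (from dosfstools) the corresponding knobs are command-line flags. A hedged sketch, with values imitating the factory layouts described above (the right reserved-sector count depends on the resulting FAT size, so check the final layout before trusting it):

  # -F 32: FAT32;  -s 64: 64 sectors (32 Kbytes) per cluster;
  # -R 32: reserved sectors, used here to nudge the cluster array
  #        toward an erase-block boundary
  mkdosfs -F 32 -s 64 -R 32 /dev/sdX1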

How to win

It boils down to the fact that you need to micro-manage a lot of details to ensure that things fall on suitably-aligned boundaries. You need to consider both the partition map and the filesystem layout in concert. One way to separate the problems is to make each partition begin on an erase block boundary, then lay out the filesystems so their subordinate data structures (particularly the cluster or "fs block" array) fall on erase block boundaries, assuming that the partition itself begins erase-block-aligned. What is a good alignment boundary? Well, 256 KiB is good for most new chips, but to give some breathing room for the future, maybe 1 MiB would be better - or perhaps even 4 MiB.
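
A quick check of the "begin on an erase block boundary" rule for an existing partition, using the generous 1 MiB (2048-sector) boundary suggested above:

  START=8192   # partition start sector, as reported by fdisk -l -u
  if [ $(( START % 2048 )) -eq 0 ]; then
      echo "partition start is 1 MiB aligned"
  else
      echo "partition start is NOT 1 MiB aligned"
  fi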

Better yet, try to avoid reformatting FLASH-based devices when you have the choice.

It is sometimes difficult to avoid reformatting, because to boot from a conventional BIOS (or OFW emulating a conventional BIOS for booting Windows), you need a Master Boot Record in the first sector. (That's not necessary with Open Firmware booting Linux; OFW reads the filesystem and ignores the real-mode code in the Master Boot Record.) Since the first sector also contains the partition map, what you really want is software that merges an MBR into the first sector without overwriting the partition map. But of course that's not what the common "dd to raw device" technique does. It just overwrites everything, from an image that does not necessarily have any relation to what was in that device's preexisting partition map.

"syslinux" can be useful in that respect - it can merge an MBR while preserving the partition table. But using syslinux plus some other "copy on the filesystem bits" step is procedurally more complex than a single "dd" command.

Actually, it's not always just a matter of injecting some MBR code into the first sector. Since it's quite difficult to shoehorn a complete bootstrap program into one sector, most booting schemes also depend on additional code tucked away elsewhere. A common location is the beginning of the partition. So you have to worry about leaving some room in the partition for that too.

Bottom-line recommendations:

  • If you can, stick with the factory map
  • If you must make a "blast it on with dd" image, be very careful and conservative with the partition and filesystem layout, according to the techniques above.

See also

  • blog post on aligning filesystems to an SSD's erase block size: http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/