LBA NAND Testing

From OLPC
Revision as of 03:16, 4 April 2009 by Wad (talk | contribs)

As part of testing storage devices for suitability for future generations of XO Laptops, Toshiba LBA NAND parts were placed onto production XO motherboards and tested. The devices failed catastrophically quite early in the testing process.

History

The tests started with eight XOs at 1CC, each modified with a 4GB LBA-NAND part. Mitch Bradley prepared a kernel with drivers for the LBA-NAND connected through the CaFE chip, along with a BusyBox initrd that supports partitioning, ext2 formatting, and testing of the parts. Testing started 9/22/08.

By 12/01/08, five of the 4GB devices had failed. Three had failed catastrophically and could no longer be accessed or reformatted. All devices lost their MBR (located in the first logical block on the device) upon failure. This failure might have been exacerbated by a flaw in the original initialization process, which placed the MBR in the same erase block as the beginning of the ext2 file system. These five devices were returned to Toshiba for failure analysis and replaced with new 2GB LBA-NAND devices. Testing was resumed on the new devices.

By 3/10/09, all the 2GB parts and all but one of the 4GB parts had failed.

Test Estimate

In the case of the LBA-NAND parts, we fill all but 32 MB of the media (leaving 16K blocks free). Assume the device withholds 6% of the blocks for wear leveling and bad block replacement (120K blocks), for a total of 136K blocks available to absorb writes. Assuming naive wear leveling, this should result in a write failure after approx. 136K x 5K or 680M block writes (1.4 TB written). We can continue to write at approx. 350 blocks (0.7 MB) per second, giving a time to failure of 22 days.

The current test program only writes in step 4.2, giving 20 MB/45 sec., or 220 blocks per second. This gives a time to failure of 35 days. At the same time, it is performing storage error rate testing at 42K blocks/sec, checking the entire 4 GiB device 40 times a day.
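The arithmetic behind both estimates can be reproduced with shell integer arithmetic. This is only a sketch: the function name days_to_failure is made up for illustration, K and M are treated as decimal multiples, and the 5K erase cycles per block is the endurance figure implied by the "136K x 5K" product above.

```sh
# Illustrative sketch only; names are hypothetical, figures come from the text.
FREE_BLOCKS=16000      # 32 MB left unfilled, in 2 KB blocks
SPARE_BLOCKS=120000    # assumed 6% withheld by the device
ERASE_CYCLES=5000      # assumed write/erase endurance per block

# Block writes before naive wear leveling exhausts the available blocks.
TOTAL_WRITES=$(( (FREE_BLOCKS + SPARE_BLOCKS) * ERASE_CYCLES ))

# days_to_failure RATE: whole days to reach TOTAL_WRITES at RATE blocks/second.
days_to_failure() {
    echo $(( TOTAL_WRITES / $1 / 86400 ))
}

echo "block writes to failure: $TOTAL_WRITES"
echo "at 350 blocks/s: $(days_to_failure 350) days"
echo "at 220 blocks/s: $(days_to_failure 220) days"
```

Integer division truncates, so the results are whole days rounded down.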

The test rates were 14-16 sec/test step 4.1 and 13-16 sec/test step 4.2. This translates into roughly a 4 MByte/s read rate, and 0.7 MByte/s write rate (this test version wrote 10MB, and did no read testing). This is verified by later tests with a 34 sec. mean time for step 4.2, when both reading back 20 MiB of data and writing 20 MiB of data.


Endurance Results

Laptop  Serial #     Test          Total Written (GiB)  Comments
LBA1    CSN74700D03  Wear & Error  4906
LBA2    CSN74702D30  Wear & Error  4882                 Device failed
LBA3    SHF808021E4  Wear & Error  985                  Device failed (sample #1)
LBA4    CSN749013AF  Wear & Error  2467                 Device failed (sample #3)
LBA5    CSN75001985  Wear & Error  2111                 Device failed (sample #4)
LBA6    CSN74702A8E  Wear & Error  2491                 Device failed (sample #5)
LBA7    CSN748040B6  Wear & Error  3022                 Device failed
LBA8    CSN74900B3C  Wear & Error  2329                 Device failed (sample #2)
LBA13   SHF808021E4  Wear & Error  1645                 2 GiB part - device failed
LBA14   CSN749013AF  Wear & Error  120 ?                2 GiB part - device failed
LBA15   CSN75001985  Wear & Error  120 ?                2 GiB part - device failed
LBA16   CSN74702A8E  Wear & Error  2269                 2 GiB part - device failed
LBA18   CSN74900B3C  Wear & Error  120 ?                2 GiB part - device failed

Total Written is the total amount of data, in GiB, written to the storage device to date in an attempt to test wear leveling and write/erase (W/E) lifetime. For the current tests, each pass is 0.02 GiB.
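At 0.02 GiB per pass, a Total Written figure converts to an approximate pass count by multiplying by 50. The helper below is a hypothetical illustration, and the result is only approximate for devices whose totals include passes from an earlier test version that wrote a different amount per pass.

```sh
# passes_for GIB: number of 0.02 GiB passes needed to write GIB gibibytes.
# Hypothetical helper, not part of the test scripts.
passes_for() {
    echo $(( $1 * 50 ))
}

passes_for 4906    # Total Written for LBA1 in the table
```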

LBA-NAND Setup Notes

Boot with a USB stick containing the following two items:

  • boot
  • test.sh - a script for running the wear leveling and error checking test

After the laptop boots, type the following to mount the USB disk for the first time:

mount /usb

At this point, some dangerous-sounding error messages will appear. Ignore them. If this is the first time the device is being tested, see "LBA-NAND Initialization" below. If restarting a test, simply type:

/usb/test.sh

A new logfile will automatically be created on the USB key (in /usb).

fsck

Occasionally, the ext2 filesystem on the NAND device becomes corrupted. You can repair it using:

umount /nand
/sbin/fsck.ext2 /dev/lba1
mount /dev/lba1 /nand
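The three steps above can be wrapped in a small function that remounts only when fsck reports the filesystem as usable. This is a sketch, not part of the test image: the name repair_nand and the FSCK override variable are hypothetical, and it assumes the BusyBox fsck.ext2 follows the usual e2fsck exit-status convention (0 for clean, 1 to 3 for errors corrected, 4 and above for errors left or an operational failure).

```sh
# Hypothetical helper; FSCK defaults to the path used above.
FSCK=${FSCK:-/sbin/fsck.ext2}

# repair_nand [DEVICE] [MOUNTPOINT]
repair_nand() {
    dev=${1:-/dev/lba1}
    mnt=${2:-/nand}

    umount "$mnt" || return 1

    "$FSCK" "$dev"
    status=$?

    # Assumed e2fsck-style status: 4 or more means errors remain
    # or fsck itself failed, so do not remount.
    if [ "$status" -ge 4 ]; then
        echo "fsck exit status $status; leaving $dev unmounted" >&2
        return "$status"
    fi

    mount "$dev" "$mnt"
}
```

Running repair_nand with no arguments uses the same device and mount point as the commands above.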

LBA-NAND Initialization

Note: For these tests to have a valid effect, the storage device should not be re-formatted or re-initialized for the duration of the wear leveling test! Do not run fill_cp.sh unless you are starting the tests for the first time!

The /etc/init.d/rc.usbnandtest script attempts to mount the storage device at boot time. Unmount it with:

umount /nand

To partition the disk, use fdisk:

fdisk /dev/lba

Delete any existing partitions, and create a single partition using all available space. Tell fdisk to start the first partition at sector number 512 - that aligns on a 256K boundary which is a multiple of the erase block size for this generation and probably the next.

With the version of fdisk in busybox, use the 'u' command to switch to sector units - it should respond with "Changing display/entry units to sectors". Then when you add the first partition, tell it 512 for the start. It defaults to cylinder units, which is a problem because the goofy old DOS conventions for the maximum values for SPT, tracks, and heads combine in strange ways to give non-power-of-two cylinder sizes.
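A quick calculation shows why sector units matter. The sketch below assumes 512-byte sectors and, purely for illustration, the common legacy geometry of 255 heads and 63 sectors per track; the geometry fdisk actually reports for /dev/lba may differ.

```sh
SECTOR=512                    # bytes per sector
BOUNDARY=$(( 256 * 1024 ))    # the 256K alignment target

# Start sector 512 lands exactly on a 256K boundary (remainder 0).
echo "sector 512: byte offset $(( 512 * SECTOR )), remainder $(( 512 * SECTOR % BOUNDARY ))"

# One cylinder under the assumed 255-head, 63-sector geometry.
CYL_SECTORS=$(( 255 * 63 ))
echo "cylinder: $CYL_SECTORS sectors, remainder $(( CYL_SECTORS * SECTOR % BOUNDARY ))"
```

A non-zero remainder on the second line means a partition starting on that cylinder boundary would not be aligned to 256K.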

Then proceed with creating a partition. Hit this series of keys: n <CR> p <CR> 1 <CR> 512 <CR> <CR> w <CR>.
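The same keystrokes, including the 'u' unit switch, could in principle be fed to fdisk from a pipe instead of being typed. This is an untested sketch: fdisk_keys is a hypothetical name, it assumes this fdisk reads its commands from standard input, and it does not delete existing partitions first.

```sh
# fdisk_keys: emit the keystrokes described above, one per line.
#   u        switch to sector units
#   n, p, 1  new primary partition number 1
#   512      first sector
#   (empty)  accept the default last sector
#   w        write the table and exit
fdisk_keys() {
    printf '%s\n' u n p 1 512 '' w
}

# To apply (destructive):
#   fdisk_keys | fdisk /dev/lba
```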

After partitioning, dump the MBR and verify that the partition table is right. You should see something like this:

dd if=/dev/lba bs=32K count=1 | od -b
00000700: xxx xxx 203 xxx xxx xxx 000 002 000 000 yy yy yy yy 00 00

Look for the "000 002 000 000" - that's 0x200 in hex little endian, i.e. starting block 512. (The 203 (hex 83) is the ext2 type code.)
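Instead of reading the octal dump by eye, the starting sector can be decoded directly. In a standard MBR, the first partition entry begins at byte 446 and its starting sector occupies bytes 454 to 457, least significant byte first. The function below is a sketch with a hypothetical name; it assumes the available od supports the -An and -tu1 options, which a minimal BusyBox build may not.

```sh
# mbr_start_sector DEVICE: print the starting sector of partition 1.
mbr_start_sector() {
    # Read the four bytes at offsets 454..457 as unsigned decimal values.
    set -- $(dd if="$1" bs=1 skip=454 count=4 2>/dev/null | od -An -tu1)
    # Combine them in little-endian order.
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# Usage:
#   mbr_start_sector /dev/lba
```

For a correctly partitioned device, the bytes 000 002 000 000 decode to 512.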

Now, use mke2fs to place a filesystem on the partition, forcing a 2K block size and no space reserved for root:

mke2fs -m 0 -b 2048 /dev/lba1

Now you can mount it and start filling it:

mount /dev/lba1 /nand
/usb/fill_cp.sh

Because the provided kernel doesn't include support for /dev/urandom, the random data was supplied on a USB key; fill_cp.sh just copies it from /usb/random. The USB key was previously initialized with sufficient random data using the fill_random.sh command, which takes as its argument the number of 32 MB random data files to generate (65 files is sufficient for 4 GiB devices):

mkdir /Volumes/USBKEY/random
cd /Volumes/USBKEY/random
~/NANDtest/fill_random.sh 65
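The contents of fill_random.sh are not reproduced here. As a rough idea of what such a script might do, the sketch below writes a given number of fixed-size files from /dev/urandom on the host machine. The function name, the random0, random1, ... file names, and the optional size and directory arguments are all assumptions for illustration, not the actual script.

```sh
# make_random_files COUNT [SIZE_MB] [DIR]
# Hypothetical stand-in for fill_random.sh: writes COUNT files of SIZE_MB
# megabytes (default 32) of random data into DIR (default: current directory).
make_random_files() {
    count=$1
    size_mb=${2:-32}
    dir=${3:-.}
    i=0
    while [ "$i" -lt "$count" ]; do
        # bs is given in bytes so the same line works with BSD and GNU dd.
        dd if=/dev/urandom of="$dir/random$i" bs=1048576 count="$size_mb" \
            2>/dev/null || return 1
        i=$(( i + 1 ))
    done
}

# Usage, from inside the random directory on the USB key:
#   make_random_files 65
```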

Original LBA-NAND Initialization

The first time this test was conducted, the devices were initialized using slightly simpler instructions, which resulted in the MBR being in the same erase block as the start of the ext2 filesystem. This resulted in a failure mode where the device formatting was lost.

The first difference from the initialization instructions above was the command sequence used to repartition the storage device:

fdisk /dev/lba

Delete any existing partitions, and create a single partition using all available space. Hit this series of keys: d <CR> n <CR> p <CR> 1 <CR> <CR> <CR> w <CR>.

And the command used to format the device didn't force a small block size:

mke2fs -m 0 /dev/lba1