NAND Testing: Difference between revisions
<noinclude>{{Google Translations}}{{TOCright}}</noinclude>

This page describes basic testing of NAND Flash devices, controllers, and [http://en.wikipedia.org/wiki/Wear_levelling wear levelling filesystems] performed in advance of future [[Hardware#XO_Laptop|XO]] hardware designs. Raw results from the tests are available [http://dev.laptop.org/~wad/nand/ here].

=Intro=

Managed NAND devices (such as solid-state drives, SD cards, and newer single-chip NAND devices) typically set aside between 4 and 8% of the media for wear leveling and bad block replacement. This complicates the test somewhat, but is ameliorated by the reduced W/E cycle lifetime expected with newer NAND Flash devices.

Assume we fill all but 1 MiB of the media (6% of 4 GiB is roughly 250 MiB, leaving up to 251 MiB/125 KBlocks actually free). Assume a maximum write rate of 500 blocks per second. Assuming naively simple wear leveling, a failure should occur after 5K write cycles of all free blocks, or roughly 750M block writes. This will require around 18 days of continuous writing to trigger.
||
===LBA Test as Implemented===

In the case of the LBA-NAND parts, we fill all but 32 MB of the media (leaving 16K blocks). Assume the device withholds 6% of the blocks for wear leveling and bad block replacement (120K blocks). Assuming naive wear leveling, this should result in a write failure in approx. 136K x 5K or 680M block writes (1.4 TB written). We can continue to write at approx. 350 blocks (0.7 MB) per second, giving a time to failure of 22 days.

The current test program only writes in step 4.2, giving 20 MB/45 sec., or 220 blocks per second. This gives a time to failure of 35 days. But at the same time it is performing storage error rate testing at 42K blocks/sec --- checking the entire 4 GiB device 40 times a day.
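
The time-to-failure arithmetic above can be reproduced with a short calculation. This is only a sketch under the naive wear-leveling assumption stated in the text; the function names are hypothetical and are not part of the NANDtest scripts.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def block_writes_to_failure(free_blocks, reserved_blocks, cycles_per_block):
    """Block writes needed to wear out every spare block once.

    Naive model: writes are spread evenly over the free blocks plus the
    blocks the device withholds, and each block survives a fixed number
    of write/erase cycles.
    """
    return (free_blocks + reserved_blocks) * cycles_per_block


def days_to_failure(total_block_writes, blocks_per_second):
    """Days of continuous writing needed to reach the given write count."""
    return total_block_writes / blocks_per_second / SECONDS_PER_DAY


# Figures quoted for the LBA-NAND test: 16K free blocks, 120K withheld
# blocks, 5K write/erase cycles per block.
writes = block_writes_to_failure(16_000, 120_000, 5_000)
print(writes)                        # 680000000 block writes
print(days_to_failure(writes, 350))  # roughly 22 days at 350 blocks/s
print(days_to_failure(writes, 220))  # roughly 35 days at 220 blocks/s
```

Changing the cycle count or the write rate shows how sensitive the estimate is to the assumed W/E lifetime.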
|||
===JFFS2 Test as Implemented===

==Samples under Test==

The current set of tests is devoted to testing four candidate SD cards for XO-.5 production. The current status of these tests is recorded [http://spreadsheets.google.com/ccc?key=0AttqY9e8uyizdHZrNlFqXy1ONFdhX0tjWUZZZGV1MFE&hl=en here].

These are the storage media and access methods that have been tested:

* JFFS2: Five laptops using existing raw NAND plus JFFS2 software Flash translation
* [[UBIFS_on_XO|UbiFS]]: The upgrade to JFFS2. Three laptops started testing on Nov. 25.
* SD cards: Four laptops are testing SanDisk Extreme III (Class 6) SD cards. Another three laptops are testing Transcend (Class 6) 4GB cards.
* IDE/NAND (SSD) controllers: We are currently testing three samples from SMI and four samples from Phison.
* LBA-NAND: We had eight laptops with a 4-GB Toshiba LBA-NAND installed. Five laptops with 2 GB of LBA-NAND were also tested. Almost all devices failed catastrophically.

We are actively working to get additional devices into the mix, such as:

* eMMC NAND: Basically an MMC card without the wrapper, available from multiple vendors.
* IDE/NAND (SSD) controllers: Available cheaply from at least two companies. Phison makes the SSD controller used in both the Acer Aspire and the Asus EEE.
* PCIe/NAND (SSD) controllers: two evaluation units from Marvell are on their way

# While executing from a separate storage device
# Format as much of the media as possible as a single ext2 partition. The JFFS2 test case will use a JFFS2 partition, and the UBIFS test case will use a UBIFS partition.
# Create test data filling up all but 32 MB of the partition. This test data will be pseudo-random in nature (white noise), and will be duplicated on the storage device. ''It has been suggested to instead record signatures of the test data. Since the data files are large (multiple media blocks in size), there is little danger of dual-failure (in both files) causing a comparison to give a false negative.''
# Start a test script which continuously alternates between:
## Reading a file and its duplicate from the stored data, reporting any differences.

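
The read step, comparing a file with its duplicate, can be sketched as follows. The actual test is a Bourne shell script that relies on the cmp utility; this Python version is only an illustration, and the function name is hypothetical.

```python
def first_difference(path_a, path_b, chunk_size=64 * 1024):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                # Locate the first differing byte inside this chunk.
                for index, (byte_a, byte_b) in enumerate(zip(chunk_a, chunk_b)):
                    if byte_a != byte_b:
                        return offset + index
                # The common prefix matches, so one file is simply shorter.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None  # both files ended without a mismatch
            offset += len(chunk_a)
```

Reporting the offset rather than a plain pass/fail result makes it possible to map an error back to a region of the media.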
In order to minimize the runtime support needed for the testing, both the test and initialization scripts are written in Bourne shell. [http://dev.laptop.org/git/users/wad/NANDtest/ Sources] are available from the [http://dev.laptop.org/git/ OLPC git repository].

====Disk Initialization====

After formatting and placing a filesystem on a test device, it should be initialized with test data (for the [[NAND_Testing#Wear_.26_Error_Test|Wear and Error Test]]). This is a duplicate set of random data, almost filling the device.

After mounting the test device, you can either use a script which automatically fills a directory/partition with matched sets of data:

* [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill.sh fill.sh] - a script for filling a partition with matched sets of random data

Or do it manually. Create a directory on it for the first set of data (the name is important), and change to it:

 mkdir /nand/setA
 cd /nand/setA

Now run the [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_random.sh fill_random.sh] script to fill half of the test device with a number of 32-MB random data files (the number of files to create is the sole argument).

* [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_random.sh fill_random.sh] - a script for generating the random data

For a 4-GB device, this number is usually 58 to 60. For a 2-GB device, this number is usually 27 to 29.

You can always delete files as needed to drop back below 50% device utilization. Now copy the directory of random data you have just created:

 cd ..
 cp -r setA setB

You need to ensure that there is at least 30 MB of free space when done. If there is less, delete a single data file from the setA directory. There shouldn't be more than 50 MB of free space available; if there is, copy data files in setB to fill the extra space.

In most cases, it is quicker to generate the random data once, placing it on a USB storage device, and then use one of the following scripts to transfer it onto the test device:

* [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_jffs.sh fill_jffs.sh] - the script actually used to fill the JFFS2 devices
* [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_cp.sh fill_cp.sh] - the script actually used to fill the LBA-NAND devices

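
The manual fill procedure in the Disk Initialization section can be sketched in Python as follows. This is not the code of fill.sh or fill_random.sh; the file naming scheme and function names are assumptions made for illustration, and the sketch presumes a host where the operating system provides a random source.

```python
import os
import shutil


def fill_random(directory, file_count, file_size=32 * 1024 * 1024):
    """Write file_count files of random bytes into directory."""
    os.makedirs(directory, exist_ok=True)
    for index in range(file_count):
        # Hypothetical naming scheme; the real scripts may name files differently.
        path = os.path.join(directory, "random%03d" % index)
        with open(path, "wb") as out:
            remaining = file_size
            while remaining > 0:
                chunk = os.urandom(min(remaining, 1024 * 1024))
                out.write(chunk)
                remaining -= len(chunk)


def make_matched_sets(root, file_count, file_size=32 * 1024 * 1024):
    """Create setA with random data, then duplicate it as setB."""
    set_a = os.path.join(root, "setA")
    set_b = os.path.join(root, "setB")
    fill_random(set_a, file_count, file_size)
    shutil.copytree(set_a, set_b)  # setB must not exist yet
    return set_a, set_b
```

For a 4-GB device this would be called with a file count of 58 to 60, as noted above, after which the free space should be checked against the 30 MB to 50 MB window.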
====Test Script====

* [http://dev.laptop.org/git/users/wad/NANDtest/plain/test.sh test.sh] - the script which actually performs the test
* [http://dev.laptop.org/git/users/wad/NANDtest/plain/parselogs.py parselogs.py] - the script which takes one or more logs and produces statistics

====LBA-NAND Initialization====

The following are necessary only on LBA-NAND test laptops:

* [http://dev.laptop.org/git/users/wad/NANDtest/plain/setup.sh setup.sh] - a script for setting up LBA-NAND laptops (deprecated, as it is now /etc/init.d/rc.usbnandtest in the boot ramdisk)

====Ubifs Initialization====

The following are necessary only on UBIFS test laptops:

* [http://dev.laptop.org/~dsaxena/ubi_test/data.img data.img] - the data partition of a UBIFS laptop

In most cases, logging is done to an external USB device. In some systems under test (JFFS2 and UbiFS laptops), this is the only storage medium other than the device under test. It was used instead of logging the serial console of a laptop due to previous experience trying to collect and maintain serial logs from tens of machines --- the USB bus or serial/USB adapters would occasionally hiccup for unknown reasons and cause the logging to halt.

Logs may be processed using the [http://dev.laptop.org/git/users/wad/NANDtest/plain/parselogs.py parselogs.py] script. It either takes a list of log files as arguments or processes all log files in the current directory if none are specified. It outputs statistical and error information aggregated from all log files processed.

Logs are being aggregated at [http://dev.laptop.org/~wad/nand/ http://dev.laptop.org/~wad/nand/]. A summary of each machine's status is shown, with a link to individual log files.

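
The file-selection behaviour described for parselogs.py (explicit arguments, otherwise every log in the current directory) might look like the sketch below. This is not the script's actual code; the <tt>logfile-*</tt> pattern is an assumption based on the log names mentioned elsewhere on this page, and the function name is hypothetical.

```python
import glob
import os


def select_logs(arguments, directory=".", pattern="logfile-*"):
    """Return the log files to process.

    Explicitly named files win; with no arguments, fall back to every
    file in the directory that matches the (assumed) log name pattern.
    """
    if arguments:
        return list(arguments)
    return sorted(glob.glob(os.path.join(directory, pattern)))
```

A caller would typically pass <tt>sys.argv[1:]</tt> as the argument list.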
==Control==

<table>
<tr><th>Laptop</th><th>Serial #</th><th>Test</th><th>Total Written</th></tr>
<tr><td>JFFS1</td><td>CSN748003DB</td><td>Wear & Error</td><td align=center>3780</td></tr>
<tr><td>JFFS2</td><td>CSN74805706</td><td>Wear & Error</td><td align=center>3788</td></tr>
<tr><td>JFFS3</td><td>SHF80702F53</td><td>Wear & Error</td><td align=center>3844</td></tr>
<tr><td>JFFS4</td><td>SHF7250022F</td><td>Wear & Error</td><td align=center>3758</td></tr>
<tr><td>JFFS5</td><td>SHF725004D4</td><td>Wear & Error</td><td align=center>6404</td></tr>
</table>

===JFFS2 Setup Notes===

If this is the first time, see the [[#JFFS2_Initialization|next section]]. If restarting a test, boot the laptop, and insert a USB device containing the [http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=test.sh;hb=HEAD test.sh] script. Then simply type:

 /usb/test.sh

A new logfile will automatically be created on the USB device (in /usb/logfile-xxxxx).

===JFFS2 Initialization===

'''Note:''' For these tests to have a valid effect, the storage device should not be re-formatted or re-initialized for the duration of the wear leveling test!

Install a fresh copy of [http://pilgrim.laptop.org/~pilgrim/xo-1/streams/8.2/build760/devel_jffs2/ release 8.2-760] from a USB device using [http://wiki.laptop.org/go/Open_Firmware Open Firmware]:

 copy-nand u:\os760.img

Boot, and insert a USB device containing the [http://dev.laptop.org/~dsaxena/kernels/test/kernel-2.6.25-20081025.1.olpc.fix_gc_race.i586.rpm patched kernel RPM]. Install it using:

 rpm -ivh kernel-2.6.25-20081025.1.olpc.fix_gc_race.i586.rpm
 cp -a /boot/* /versions/boot/current/boot/

Reboot, and insert a USB device containing several scripts:

* [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_jffs.sh fill_jffs.sh] - a script for filling the NAND with random data
* [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_random.sh fill_random.sh] - an alternative script for filling the disk
* random - a directory containing over 400 MB of random data, in 32-MiB files (optional)
* [http://dev.laptop.org/git/users/wad/NANDtest/plain/test.sh test.sh] - a script for running the wear leveling and error checking test

If using an earlier OLPC build (say 656), you will have to install the cmp utility:

 yum install diffutils

Create a link from the mount point for the USB device to /usb:

 ln -s /media/<USB_DEVICE_NAME> /usb

Now you need to fill the NAND Flash partition ("/" on the stock XO build). This can be done using the same [[#LBA-NAND_Initialization|method used for LBA-NAND devices]]. If the <tt>random</tt> directory is provided on the USB device, type:

 [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_jffs.sh /usb/fill_jffs.sh]

An alternative, slower approach to filling the NAND with data, which doesn't require pre-computed random data on the USB device, is to manually run:

 mkdir /setA
 cd /setA
 [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_random.sh /usb/fill_random.sh] 11
 cp -r /setA /setB

<table>
<tr><th>Laptop</th><th>Serial #</th><th>Test</th><th>Total Written</th></tr>
<tr><td>UBI1</td><td>CSN749030BD</td><td>Wear & Error</td><td align=center>1900</td></tr>
<tr><td>UBI2</td><td>CSN7440003E</td><td>Wear & Error</td><td align=center>2950</td></tr>
<tr><td>UBI3</td><td>SHF73300081</td><td>Wear & Error</td><td align=center>3870</td></tr>
</table>

===UBIFS Setup Notes===

If this is the first time, see the [[#UBIFS_Initialization|next section]]. If restarting a test, boot the laptop, and insert a USB device containing the [http://dev.laptop.org/git/users/wad/NANDtest/plain/test.sh test.sh] script. Then simply type:

 /usb/test.sh

'''Note:''' For these tests to have a valid effect, the storage device should not be re-formatted or re-initialized for the duration of the wear leveling test!

Install a fresh copy of [http://dev.laptop.org/pub/firmware/q2e22/ firmware q2e22] from a USB device using [[Open_Firmware|Open Firmware (OFW)]]:

 flash u:\q2e22.rom

Boot the laptop, escaping into OFW, and insert a USB device containing the following files:

* [http://dev.laptop.org/~dsaxena/ubi_test/data.img data.img]
* [http://dev.laptop.org/~dsaxena/ubi_test/nand.img nand.img]

At this point OFW will erase the flash and copy the contents of the nand.img file to flash. When complete, reboot the system.

Now insert a USB device containing:

* [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_jffs.sh fill_jffs.sh] - a script for filling the NAND with random data
* random - a directory containing over 1 GiB of random data, in 32-MiB files (optional)
* [http://dev.laptop.org/git/users/wad/NANDtest/plain/test.sh test.sh] - a script for running the wear leveling and error checking test

Create a link from the mount point for the USB device to /usb:

 ln -s /media/<USB_DEVICE_NAME> /usb

Now you need to fill the NAND Flash partition ("/" on the stock XO build). This can be done using the same [[#LBA-NAND_Initialization|method used for LBA-NAND devices]]. If the <tt>random</tt> directory is provided on the USB device, type:

 [http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=fill_jffs.sh;hb=HEAD /usb/fill_jffs.sh]

<table>
<tr><th>Test Unit</th><th>Host</th><th>Device</th><th>Test</th><th>Total Written</th></tr>
<tr><td>SMI1</td><td></td><td>SM223 + 1 Hynix MLC</td><td>Wear & Error</td><td align=center>failed after 2036 GiB</td></tr>
<tr><td>SMI2</td><td></td><td>SM223 + 1 Hynix MLC</td><td>Wear & Error</td><td align=center>failed at 1761 GiB</td></tr>
<tr><td>SMI3</td><td>marvell sd0</td><td>SM2231 + 1 Hynix MLC</td><td>Wear & Error</td><td align=center>3436</td></tr>
<tr><td>SMI4</td><td>marvell sd1</td><td>SM2231 + 1 Hynix MLC</td><td>Wear & Error</td><td align=center>150</td></tr>
<tr><td>PHI1</td><td>smi sd0</td><td>Phison 3006 + 1 Hynix MLC</td><td>Wear & Error</td><td align=center>2896</td></tr>
<tr><td>PHI2</td><td>phison sd0</td><td>Phison 3006 + 1 Hynix MLC</td><td>Wear & Error</td><td align=center>2365</td></tr>
<tr><td>PHI3</td><td>amd sd0</td><td>Phison 3006 + 1 Hynix MLC</td><td>Wear & Error</td><td align=center>1896</td></tr>
<tr><td>PHI4</td><td>amd sd1</td><td>Phison 3006 + 1 Hynix MLC</td><td>Wear & Error</td><td align=center>1896</td></tr>
<tr><td>PHI5</td><td>sawzall sd0</td><td>Phison 3007 + 1 Samsung MLC</td><td>Wear & Error</td><td align=center></td></tr>
<tr><td>PHI6</td><td>smi sd0</td><td>Phison 3007 + 1 Samsung MLC</td><td>Wear & Error</td><td align=center></td></tr>
</table>

Touch /ssd when setting up a new test host --- the test script uses this to tell it where to place the logs.

Partition a new drive using fdisk, placing the first partition at sector 512 (byte offset of 256K). This usually involves the following sequence of commands to fdisk:

 u # to change the units to sectors
 p # show the existing partition table
 d # (optional) delete any existing partition
 n # create new partition (primary, number 1, start at sector 512, continue to end of device)
 w # write out new partition table and quit

Install an ext2 filesystem on the drive, using 2048-byte blocks:

 mke2fs -b 2048 -m 0 /dev/sdb1
 mount /dev/sdb1 /sd0

==LBA==

The tests started with eight XOs at 1CC modified with a 4GB LBA-NAND part. Mitch Bradley prepared a [http://dev.laptop.org/git?p=users/wad/NANDtest;a=tree;f=boot;hb=HEAD kernel] that has the drivers for the LBA-NAND connected through the CaFE chip. He also has a [http://www.busybox.net/ BusyBox] [http://dev.laptop.org/git?p=users/wad/NANDtest;a=tree;f=boot;hb=HEAD initrd] which supports partitioning, ext2 formatting, and testing of the parts. Testing started 9/22/08.

The LBA-NAND parts being tested have almost all failed. More information about testing this part is available at [[LBA NAND Testing]].

By 12/01/08, five of the 4GB devices had failed. Three had failed catastrophically, and could not be accessed (or reformatted) anymore. All devices lost their MBR (located in the first logical block on the device) upon failure. This failure might have been exacerbated by a flaw in the [[#Original_LBA-NAND_Initialization|original initialization process]] which placed the MBR in the same erase block as the beginning of the ext2 file system. These five devices were returned to Toshiba for failure analysis, and replaced with new 2GB LBA-NAND devices. Testing was resumed on the new devices.

The current test rates are 14-16 sec/test step 4.1 and 13-16 sec/test step 4.2. This translates into roughly a 4 MByte/s read rate, and 0.7 MByte/s write rate (this test version wrote 10MB, and did no read testing). This is verified by later tests with a 34 sec. mean time for step 4.2, when both reading back 20 MiB of data and writing 20 MiB of data.
|||
<table>
<tr><th>Laptop</th><th>Serial #</th><th>Test</th><th>Total Written</th><th>Comments</th></tr>
<tr><td>LBA1</td><td>CSN74700D03</td><td>Wear & Error</td><td align=center>4906</td></tr>
<tr><td>LBA2</td><td>CSN74702D30</td><td>Wear & Error</td><td align=center>4882</td><td>device failed</td></tr>
<tr><td>LBA3</td><td>SHF808021E4</td><td>Wear & Error</td><td align=center>985</td><td>Device failed (sample #1)</td></tr>
<tr><td>LBA4</td><td>CSN749013AF</td><td>Wear & Error</td><td align=center>2467</td><td>Device Failed (sample #3)</td></tr>
<tr><td>LBA5</td><td>CSN75001985</td><td>Wear & Error</td><td align=center>2111</td><td>Device failed (sample #4)</td></tr>
<tr><td>LBA6</td><td>CSN74702A8E</td><td>Wear & Error</td><td align=center>2491</td><td>Device failed (sample #5)</td></tr>
<tr><td>LBA7</td><td>CSN748040B6</td><td>Wear & Error</td><td align=center>3022</td><td>device failed</td></tr>
<tr><td>LBA8</td><td>CSN74900B3C</td><td>Wear & Error</td><td align=center>2329</td><td>Device failed (sample #2)</td></tr>
<tr><td>LBA13</td><td>SHF808021E4</td><td>Wear & Error</td><td align=center>1645</td><td>2 GiB part - device failed</td></tr>
<tr><td>LBA14</td><td>CSN749013AF</td><td>Wear & Error</td><td align=center>120 ?</td><td>2 GiB part - device failed</td></tr>
<tr><td>LBA15</td><td>CSN75001985</td><td>Wear & Error</td><td align=center>120 ?</td><td>2 GiB part - device failed</td></tr>
<tr><td>LBA16</td><td>CSN74702A8E</td><td>Wear & Error</td><td align=center>2269</td><td>2 GiB part - device failed</td></tr>
<tr><td>LBA18</td><td>CSN74900B3C</td><td>Wear & Error</td><td align=center>120 ?</td><td>2 GiB part - device failed</td></tr>
</table>

==SD Cards==

Total Written refers to the total amount of data written to date to the storage device in an attempt to test wear levelling and W/E lifetime, in GiB. For the current tests, each pass is 0.02 GiB.

All SD cards follow the same [[#SD_Card_Initialization|initialization]] and [[#SD_Card_Setup_Notes|setup]] procedures.

Additional tests for SD cards are described at [[SDCard_Testing]].

Two types are currently being tested, with more possibly added in the future.

===LBA-NAND Setup Notes===

Boot with a USB stick containing two directories:

* [http://dev.laptop.org/git?p=users/wad/NANDtest;a=tree;f=boot;hb=HEAD boot]
** vmlinuz
** initrd.gz
** [http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=boot/olpc.fth;hb=HEAD olpc.fth]
* [http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=test.sh;hb=HEAD test.sh] - a script for running the wear leveling and error checking test

After the laptop boots, type the following to mount the USB disk for the first time:

 [http://www.busybox.net/downloads/BusyBox.html#item_mount mount] /usb

At this point, some dangerous-sounding error messages will appear. Ignore them. If this is the first time, see the next section. If restarting a test, now simply type:

 [http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=setup.sh;hb=HEAD /usb/test.sh]

A new logfile will automatically be created on the USB key (in /usb).
|||
====fsck====

Occasionally, the ext2 filesystem on the NAND device becomes corrupted. You can repair it using:

 umount /nand
 /sbin/fsck.ext2 /dev/lba1
 mount /dev/lba1 /nand
|||
===LBA-NAND Initialization===

'''Note:''' For these tests to have a valid effect, the storage device should not be re-formatted or re-initialized for the duration of the wear leveling test! '''Do not run fill_cp.sh unless you are starting the tests for the first time!'''

The <tt>/etc/init.d/rc.usbnandtest</tt> script attempts to mount the storage device at boot time. Unmount it with:

 umount /nand

To partition the disk, use fdisk:

 [http://www.busybox.net/downloads/BusyBox.html#item_fdisk fdisk] /dev/lba

Delete any existing partitions, and create a single partition using all available space. Tell fdisk to start the first partition at sector number 512; that aligns it on a 256K boundary, which is a multiple of the erase block size for this generation and probably the next.

With the version of fdisk in busybox, '''use the 'u' command''' to switch to sector units; it should respond with "Changing display/entry units to sectors". Then, when you add the first partition, give 512 as the start. fdisk defaults to cylinder units, which is a problem because the goofy old DOS conventions for the maximum values for SPT, tracks, and heads combine in strange ways to give non-power-of-two cylinder sizes.

Then proceed with creating a partition. ''Hit this series of keys: n <CR> p <CR> 1 <CR> 512 <CR> <CR> w <CR>.''
|||
After partitioning, dump the MBR and verify that the partition table is right. You should see something like this: |
|||
dd if=/dev/lba bs=32K count=1 | od -b |
|||
00000700: xxx xxx 203 xxx xxx xxx 000 002 000 000 yy yy yy yy 00 00 |
|||
Look for the "000 002 000 000" - that's 0x200 in hex little endian, i.e. starting block 512. |
|||
(The 203 (hex 83) is the Linux native partition type code, used here for ext2.) |
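As a rough illustration of the check above, the four octal bytes printed by od can be converted into a decimal sector number. This is only a sketch: <tt>le32_from_octal</tt> is a hypothetical helper, not part of the NANDtest scripts, and it assumes the bytes are passed in the order od prints them (least significant first).

```shell
# Hypothetical helper (not part of the NANDtest scripts): convert four octal
# byte values, least significant byte first, into a decimal number.
# A leading 0 marks an octal constant in POSIX shell arithmetic.
le32_from_octal() {
    echo $(( 0$1 + 0$2 * 256 + 0$3 * 65536 + 0$4 * 16777216 ))
}

# The "000 002 000 000" field from the dump above
start=$(le32_from_octal 000 002 000 000)
echo "start sector: $start, byte offset: $(( start * 512 ))"
```

With 512-byte sectors, a start sector of 512 corresponds to a byte offset of 262144, which is the 256K boundary mentioned earlier.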
|||
Now, use mke2fs to place a filesystem on the partition, forcing a 2K block size and no space reserved for root: |
|||
[http://www.busybox.net/downloads/BusyBox.html#item_mke2fs mke2fs] -m 0 -b 2048 /dev/lba1 |
|||
Now you can mount it and start filling it: |
|||
mount /dev/lba1 /nand |
|||
[http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=fill_cp.sh;hb=HEAD /usb/fill_cp.sh] |
|||
As the kernel provided doesn't include support for /dev/urandom, the method used was to provide the random data on a USB key. [http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=fill_cp.sh;hb=HEAD <tt>fill_cp.sh</tt>] just copies it from <tt>/usb/random</tt>. The USB key was previously initialized with sufficient random data using the [http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=fill_random.sh;hb=HEAD <tt>fill_random.sh</tt>] command. This command takes as its argument the number of 32 MB random data files to generate (65 files is sufficient for 4 GiB devices): |
|||
mkdir /Volumes/USBKEY/random |
|||
cd /Volumes/USBKEY/random |
|||
~/NANDtest/fill_random.sh 65 |
|||
===Original LBA-NAND Initialization=== |
|||
The first time this test was conducted, the devices were initialized using slightly simpler instructions, which resulted in the MBR being in the same erase block as the start of the ext2 filesystem. This resulted in a failure mode where the device formatting was lost. |
|||
The differences from the above initialization instructions were the command used to repartition the storage device: |
|||
[http://www.busybox.net/downloads/BusyBox.html#item_fdisk fdisk] /dev/lba |
|||
Delete any existing partitions, and create a single partition using all available space. ''Hit this series of keys: d <CR> n <CR> p <CR> 1 <CR> <CR> <CR> w <CR>.'' |
|||
And the command used to format the device didn't force a small block size: |
|||
[http://www.busybox.net/downloads/BusyBox.html#item_mke2fs mke2fs] -m 0 /dev/lba1 |
|||
==SD Cards== |
|||
All SD cards follow the same [[#SD_Card_Initialization|initialization]] and [[#SD_Card_Setup_Notes|setup]] procedures. Two types are currently being tested, with more possibly added in the future. |
|||
===Sandisk Extreme III=== |
||
There are four XOs at 1CC running the tests on a SanDisk Extreme III 4-GB SD card. Build 8.2-760 was freshly installed on the laptops. |
||
The current test rates are roughly 3.8 sec/test step 4.1, and 4.1 sec/test step 4.2. This translates roughly into a 17 MByte/s read rate, and a 5.7 MByte/s write rate. |
||
Line 390: | Line 355: | ||
<table> |
||
<tr><th>Laptop</th><th>Serial #</th><th>Test</th><th>Total Written</th></tr> |
||
<tr><td>SAN1</td><td>SHF725004D1</td><td>Wear & Error</td><td align=center>30,550</td></tr> |
||
<tr><td>SAN2</td><td>SHF7250048F</td><td>Wear & Error</td><td align=center>28,335</td></tr> |
||
<tr><td>SAN3</td><td>SHF80600A54</td><td>Wear & Error</td><td align=center>28,730</td></tr> |
||
<tr><td>SAN4</td><td>CSN74902B22</td><td>Wear & Error</td><td align=center>29,250</td></tr> |
||
</table> |
||
Line 399: | Line 364: | ||
===Transcend Class 6=== |
||
There are three XOs at 1CC running the tests on a Transcend class 6 4-GB SD card. Build 8.2-767 was freshly installed on the laptops. |
||
The current test rates are roughly 3.9 sec/test step 4.1, and 5 sec/test step 4.2. This translates roughly into a 17 MByte/s read rate, and a 5.7 MByte/s write rate. |
||
Line 405: | Line 370: | ||
<table> |
||
<tr><th>Laptop</th><th>Serial #</th><th>Test</th><th>Total Written</th></tr> |
||
<tr><td>TR1</td><td>?</td><td>Wear & Error</td><td align=center>14505</td></tr> |
||
<tr><td>TR2</td><td>?</td><td>Wear & Error</td><td align=center>Failed after 15710 GB</td></tr> |
||
<tr><td>TR3</td><td>?</td><td>Wear & Error</td><td align=center>5026</td></tr> |
||
</table> |
||
Line 414: | Line 379: | ||
===SD Card Setup Notes=== |
||
If this is the first time, see the [[#SD_Card_Initialization|next section]]. If restarting a test, boot the laptop, with a USB stick containing the [http://dev.laptop.org/git/users/wad/NANDtest/plain/test.sh test.sh] script, and type: |
||
/usb/test.sh |
||
Line 425: | Line 390: | ||
copy-nand u:\os760.img |
||
Boot, and insert a USB device containing several scripts: |
||
* [http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_random.sh fill_random.sh] - an alternative script for filling the disk |
||
* [http://dev.laptop.org/git/users/wad/NANDtest/plain/test.sh test.sh] - a script for running the wear leveling and error checking test |
||
* [http://dev.laptop.org/git?p=users/wad/NANDtest;a=blob;f=test.sh;hb=HEAD test.sh] - a script for running the wear leveling and error checking test |
|||
* random - a directory containing over 400 MB of random data, in 32 MiB files (only needed for initialization, and optional even then) |
|||
Go to the Journal and unmount the SD card. |
||
You will need to create a link from the mount point for the USB device to /usb: |
||
ln -s /media/<USB_KEY_NAME> /usb |
||
Repartition the storage device using: |
||
[http://www.busybox.net/downloads/BusyBox.html#fdisk fdisk] /dev/mmcblk0 |
||
Type 'u' to see the information in sectors. Then list the partitions already on the card and remember the starting sector and ending sector for the factory installed partition. |
|||
Delete any existing partitions, and create a single partition using all available space. ''Hit this series of keys: d <CR> n <CR> p <CR> 1 <CR> <CR> <CR> w <CR>.'' If you get an error while re-reading the device partition table, reboot at this point. |
|||
Delete any existing partitions, and create a single partition either using the same partition boundaries as the factory installed partition or start the first partition on a 4 MByte boundary (used by production images). |
|||
Then format the device using: |
|||
[http://www.busybox.net/downloads/BusyBox.html#item_mke2fs mke2fs] -m 0 /dev/mmcblk0p1 |
|||
Then format the device using either (ext3): |
|||
[http://www.busybox.net/downloads/BusyBox.html#item_mke2fs mke2fs] -b 2048 -j /dev/mmcblk0p1 |
|||
or, if wanting a simple test (ext2): |
|||
[http://www.busybox.net/downloads/BusyBox.html#item_mke2fs mke2fs] -b 2048 -m 0 /dev/mmcblk0p1 |
|||
Now, mount the device as /nand, and start filling it with random data: |
||
mkdir /nand |
||
mount /dev/mmcblk0p1 /nand |
||
[http://dev.laptop.org/git/users/wad/NANDtest/plain/fill_cp.sh /usb/fill_cp.sh] |
||
umount /nand |
||
rmdir /nand |
||
Line 456: | Line 424: | ||
You are ready to start the testing, with: |
||
/usb/test.sh |
||
[[Category: Hardware]] |
|||
[[Category: XO-1.5]] |
|||
[[Category: XO-1.75]] |
|||
[[Category: XO-3]] |
Latest revision as of 00:55, 18 February 2012
This page describes basic testing of NAND Flash devices, controllers, and wear levelling filesystems performed in advance of future XO hardware designs. Raw results from the tests are available here.
Intro
The non-volatile storage subsystem of the XO has limited design lifetime. It uses an ASIC (the CaFE) to provide an interface to a NAND Flash device. The CaFE is limited in Flash page size, making it unsuitable for future generations of NAND Flash devices. As part of the search for a replacement, OLPC is testing a variety of solutions to gauge their performance.
The goals of the storage subsystem testing are as follows:
- Evaluate the Flash wear leveling algorithms
- Evaluate the storage error rate of the devices
- Evaluate the relative access latency of the devices
Wear Leveling Algorithms Testing
A common flaw in early Flash wear leveling algorithms was only leveling across the remaining unused blocks. The test for this is to fill up most of the disk, then continue to write/erase repeatedly, forcing the write/erase cycles to use the small number of remaining free blocks.
Assume we fill all but 5 MB of the media (leaving 2.5K blocks). We can continue to write at approx. 250 blocks (0.5 MB) per second. Assuming no wear leveling, this should result in a write failure in approx. 200 thousand seconds (100K cycle lifetime). Assuming naively simple wear leveling, a failure should occur in around one million seconds (100K cycle lifetime), or slightly over a week.
Assuming some percent withheld
Managed NAND devices (such as solid-state drives, SD cards, and newer single chip NAND devices) typically set aside between 4 and 8% of the media for wear leveling and bad block replacement. This complicates the test somewhat, but is ameliorated by the reduced W/E cycle lifetime expected with newer NAND Flash devices.
Assume we fill all but 1 MiB of the media (6% of 4 GiB is roughly 250 MiB, leaving up to 251 MiB/125 KBlocks actually free). Assume a maximum write rate of 500 blocks per second. Assuming naively simple wear leveling, a failure should occur after 5K write cycles of all free blocks, or roughly 750M block writes. This will require around 18 days of continuous writing to trigger.
JFFS2 Test as Implemented
In the case of the XOs with raw NAND and a JFFS2 management layer, we fill all but around 32 MB of the media (leaving 16K blocks). While the device doesn't withhold any blocks for wear leveling, we expect better than naive wear leveling from JFFS2. Given a 1 GiB device (512K blocks) and W/E lifetime of 100K cycles, we might expect that it will take 50 billion block write cycles (100 TB of data written) before we start seeing significant errors.
The current test writes 10K blocks/35 seconds, giving us an expected time of 5.5 years of testing before we see failure due to write fatigue.
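The arithmetic behind these estimates can be sketched in shell. This is only an illustration, assuming perfectly even wear across the blocks being cycled and a shell with 64-bit arithmetic; <tt>seconds_to_wearout</tt> is a hypothetical helper, not part of the test scripts.

```shell
# Sketch: seconds of continuous writing before every block being cycled
# reaches its rated write/erase count, assuming perfectly even wear.
# Arguments: blocks cycles blocks_written_per_interval interval_seconds
# Assumes the shell does 64-bit integer arithmetic.
seconds_to_wearout() {
    echo $(( $1 * $2 * $4 / $3 ))
}

seconds_to_wearout 2500 100000 250 1       # 5 MB free, 250 blocks/s
seconds_to_wearout 524288 100000 10000 35  # 1 GiB JFFS2, 10K blocks per 35 s
```

The first call reproduces the one million seconds quoted for naive wear leveling. The second gives about 183.5 million seconds, or roughly 5.8 years, which is the same order as the 5.5-year figure above (that figure rounds the total down to 50 billion block writes).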
Storage Error Rate Testing
There is concern that the error rate of MLC devices is not acceptable for use as the primary storage for Linux. As all of the devices being tested are MLC parts, we have an opportunity to evaluate the error rate of the devices.
If we assume that read errors dominate, we can test about 780 passes of an entire device per machine per week. This can be done in conjunction with other tests (c.f. wear leveling), reducing the coverage/speed but not affecting the results.
Unfortunately, NAND manufacturers indicate that write disturbances are a larger problem than read errors, so error testing can't be this simple. I am proposing to verify the consistency of data stored on the vast majority of the media, while writing to the remainder of the media. Note that since there is at least one level of indirection between the test program and the media, it is difficult to simplify the consistency check to blocks possibly affected by a write disturb error.
Access Latency Testing
If the wear leveling algorithm is actually functioning, the latency required to terminate a write may vary widely. Information about this timing may be gathered as part of other tests.
Unfortunately, obtaining realistic timing requires that the disk be realistically fragmented...
Error Rate Assumptions
The stated write/erase cycle lifetime for the devices we are currently using in the XO is 100K cycles -- OLPC has not verified these claims.
The error rate for newer storage devices varies. Toshiba claims that its SLC parts have a 10K cycle lifetime, and its MLC parts have a 5K cycle lifetime.
Timing assumptions
Time estimates in this document are made using the following information, obtained by Mitch Bradley:
JFFS2 reads at between 5.6 and 12 MB/sec (data-dependent, note c), using 100% of the CPU (real time == system time).
- Current tests show similar bandwidth (with similar large variance!) --wad
LBA-NAND reads at 5.2 MB/sec, using <1% CPU (real time >> system time).
- Current tests show closer to 4 MB/s --wad
JFFS2 writes at 760 kB/sec, using 100% CPU.
- Current tests show closer to 0.9 MB/s, but again with large variance. --wad
LBA-NAND writes at 1.25 MB/sec, using <2% CPU.
- Actual measurements seem closer to 0.7 MB/sec... --wad
Test Plan
The best laid schemes o' mice an' men... --Robert Burns
Samples under Test
The current set of tests is devoted to testing four candidate SD cards for XO-1.5 production. The current status of these tests is recorded here.
These are the storage media and access methods that have been tested:
- JFFS2: Five laptops using existing raw NAND plus JFFS2 software Flash translation
- UbiFS: The upgrade to JFFS2. Three laptops started testing on Nov. 25.
- SD cards: Four laptops are testing SanDisk Extreme III (Class 6) SD cards. Another three laptops are testing Transcend (Class 6) 4GB cards.
- IDE/NAND (SSD) controllers: We are currently testing three samples from SMI and four samples from Phison.
- LBA-NAND: We had eight laptops with a 4-GB Toshiba LBA-NAND installed. Five laptops with 2 GB of LBA-NAND were also tested. Almost all devices failed catastrophically.
We are actively working to get additional devices into the mix, such as:
- eMMC NAND: Basically an MMC card without the wrapper, available from multiple vendors.
- PCIe/NAND (SSD) controllers: two evaluation units from Marvell are on their way
Wear & Error Test
This will be a combined test which will try to test the wear leveling mechanism of the storage device, while also regularly checking for errors in accessing stored data.
The plan is:
- While executing from a separate storage device
- Format as much of the media as possible as a single ext2 partition. The JFFS2 test case will use a JFFS2 partition, and the UBIFS test case will use a UBIFS partition.
- Create test data filling up all but 32 MB of the partition. This test data will be pseudo-random in nature (white noise), and will be duplicated on the storage device. It has been suggested to instead record signatures of the test data. Since the data files are large (multiple media blocks in size), there is little danger of dual-failure (in both files) causing a comparison to give a false negative.
- Start a test script which continuously alternates between:
- Reading a file and its duplicate from the stored data, reporting any differences.
- Reading a file and its duplicate from the "hot" data, reporting any differences, then overwriting both files with new data.
The test software should log errors onto a storage device other than the device under test.
Step 4.1 is walking through a data set too large to fit into the kernel page cache. Naively done, however, Step 4.2 isn't effective if the kernel page cache is working, as the files being read were recently written to the storage media. The fix (available in newer kernels) is to flush the disk cache before comparing the files (see http://linux-mm.org/Drop_Caches):
echo 1 > /proc/sys/vm/drop_caches
This was properly added to version 1.2 of the test program.
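A minimal sketch of how a single comparison in step 4.2 might flush the cache before reading is shown below. It is not taken from test.sh: the function name and output format are hypothetical, and the flush is skipped when the process cannot write to drop_caches (for example, when not running as root).

```shell
# Sketch of one cache-bypassing comparison; not the actual test.sh code.
compare_pair() {
    # Write dirty pages out first, since drop_caches only frees clean pages.
    sync
    # Needs root and a kernel that provides drop_caches; skipped otherwise.
    if [ -w /proc/sys/vm/drop_caches ]; then
        { echo 1 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
    fi
    # cmp -s is silent and reports the result through its exit status.
    if cmp -s "$1" "$2"; then
        echo "OK $1 $2"
    else
        echo "MISMATCH $1 $2"
        return 1
    fi
}
```

It would be called with a file and its duplicate, for example <tt>compare_pair /nand/setA/file1 /nand/setB/file1</tt> (placeholder file names).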
Testing
These are notes detailing the implementation of the testing on the different platforms.
Common
Some elements of the testing, such as the test scripts and the log post-processing, are common to all test platforms.
Test Scripts
In order to minimize the runtime support needed for the testing, both the test and initialization scripts are written in Bourne shell. Sources are available from the OLPC git repository.
Disk Initialization
After formatting and placing a filesystem on a test device, it should be initialized with test data (for the Wear and Error Test). This is a duplicate set of random data, almost filling the device.
After mounting the test device, you can either use a script which automatically fills a directory/partition with matched sets of data:
- fill.sh - a script for filling a partition with matched sets of random data
Or do it manually. Create a directory on it for the first set of data (the name is important), and change to it:
mkdir /nand/setA
cd /nand/setA
Now run the fill_random.sh script to fill half of the test device with a number of 32-MB random data files (the number of files to create is the sole argument.)
- fill_random.sh - a script for generating the random data
For a 4-GB device, this number is usually 58 to 60. For a 2-GB device, this number is usually 27 to 29.
You can always delete files as needed to drop back below 50% device utilization. Now copy the directory of random data you have just created:
cd ..
cp -r setA setB
You need to ensure that there is at least 30 MB of free space when done. If less, delete a single data file from the setA directory. There shouldn't be more than 50 MB of free space available. Copy data files in setB to fill extra space.
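The free-space window described above could be checked with something like the following. This is a sketch only: both helpers are hypothetical, and <tt>free_mb</tt> assumes a df that accepts the POSIX <tt>-P</tt> and <tt>-k</tt> options and an awk binary on the test system.

```shell
# Hypothetical helper: succeed when the free space (in MB) is inside the
# 30 to 50 MB window.
free_space_in_window() {
    [ "$1" -ge 30 ] && [ "$1" -le 50 ]
}

# Hypothetical helper: free megabytes on a mount point, taken from the
# "Available" column of POSIX-format df output (1024-byte units).
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Example use on a mounted test device:
#   free_space_in_window "$(free_mb /nand)" || echo "adjust setA/setB"
```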
In most cases, it is quicker to generate the random data once, placing it on a USB storage device, then using one of the following scripts to transfer it onto the test device:
- fill_jffs.sh - the script actually used to fill the JFFS2 devices
- fill_cp.sh - the script actually used to fill the LBA-NAND devices
Test Script
- test.sh - the script which actually performs the test
- parselogs.py - the script which takes one or more logs and produces statistics
LBA-NAND Initialization
The following are necessary only on LBA-NAND test laptops:
- boot - a directory containing the OS used for the LBA-NAND tests
- setup.sh - a script for setting up LBA-NAND laptops (deprecated, as it is now /etc/init.d/rc.usbnandtest in the boot ramdisk)
Ubifs Initialization
The following are necessary only on UBIFS test laptops:
Logging
In most cases, logging is done to an external USB device. In some systems under test (JFFS2 and UbiFS laptops), this is the only storage media other than the device under test. It was used instead of logging the serial console of a laptop due to previous experience trying to collect and maintain serial logs from tens of machines --- the USB bus or serial/USB adapters would occasionally hiccup for unknown reasons and cause the logging to halt.
Logs may be processed using the parselogs.py script. It either takes a list of log files as arguments or processes all log files in the current directory if none are specified. It outputs statistical and error information aggregated from all log files processed.
Logs are being aggregated at http://dev.laptop.org/~wad/nand/. A summary of each machine's status is shown, with a link to individual log files.
Control
Coming soon, the destruction of a SATA drive through continuous writing...
JFFS2
There are five XOs at 1CC running the tests on top of JFFS2. Build 8.2-760 was freshly installed on the laptops using Open Firmware's copy-nand command.
A problem with the existing driver appeared immediately as all five crashed overnight on 9/22 (about ten hours into the testing, according to the logs), three definitely with the same kernel error (#8615), one with a dark screen, and one with a white screen (not hardware). Three (JFFS1, JFFS2, and JFFS4) were restarted with a console serial port attached to log error messages. Further information is on Trac ticket #8615. These crashes have continued with almost daily frequency.
A related problem was that JFFS2 would start a test but, after a couple of hundred write/erase cycles, run out of disk space for further writes. On these machines, I have gradually been deleting read data as disk space decreases (consumed by fragmentation?).
A kernel patch for #8615 was kindly provided by David Woodhouse, and a kernel RPM by Deepak Saxena. This was applied to all machines, after which these problems have not been seen.
The current test rates are roughly 10 sec/test step 4.1, and 25 sec/test step 4.2. This translates into a 6.5 MByte/s read rate, and a 0.9 MByte/s write rate.
Laptop | Serial # | Test | Total Written |
---|---|---|---|
JFFS1 | CSN748003DB | Wear & Error | 3780 |
JFFS2 | CSN74805706 | Wear & Error | 3788 |
JFFS3 | SHF80702F53 | Wear & Error | 3844 |
JFFS4 | SHF7250022F | Wear & Error | 3758 |
JFFS5 | SHF725004D4 | Wear & Error | 6404 |
Total Written refers to the total amount of data written to date to the storage device in an attempt to test wear levelling and W/E lifetime, in GiB. For the current tests, each pass is 0.02 GiB.
A sixth laptop (JFFS6) was briefly used to verify that the kernel bug (#8615) was also present in earlier Sugar releases (such as build 656). Once verified, this laptop was withdrawn from testing.
JFFS2 Setup Notes
If this is the first time, see the next section. If restarting a test, boot the laptop, and insert a USB device containing the test.sh script. Then simply type:
/usb/test.sh
A new logfile will automatically be created on the USB device (in /usb/logfile-xxxxx).
JFFS2 Initialization
Note: For these tests to have a valid effect, the storage device should not be re-formatted or re-initialized for the duration of the wear leveling test!
Install a fresh copy of release 8.2-760 from a USB device using Open Firmware:
copy-nand u:\os760.img
Boot, and insert a USB device containing the patched kernel RPM. Install it using:
rpm -ivh kernel-2.6.25-20081025.1.olpc.fix_gc_race.i586.rpm
cp -a /boot/* /versions/boot/current/boot/
Reboot, and insert a USB device containing several scripts:
- fill_jffs.sh - a script for filling the NAND with random data
- fill_random.sh - an alternative script for filling the disk
- random - a directory containing over 400 MB of random data, in 32-MiB files (optional)
- test.sh - a script for running the wear leveling and error checking test
If using an earlier OLPC build (say 656), you will have to install the cmp utility:
yum install diffutils
Create a link from the mount point for the USB device to /usb:
ln -s /media/<USB_DEVICE_NAME> /usb
Now you need to fill the NAND Flash partition ("/" on the stock XO build). This can be done using the same method used for LBA-NAND devices. If the random directory is provided on the USB device, type:
/usb/fill_jffs.sh
An alternative, slower approach to filling the NAND with data, which doesn't require pre-computed random data on the USB device, is to manually run:
mkdir /setA
cd /setA
/usb/fill_random.sh 11
cp -r /setA /setB
UBIFS
There are three laptops running these tests on a UbiFS filesystem. Additional information on how the UBIFS image was created and some of Deepak's notes are available
Laptop | Serial # | Test | Total Written |
---|---|---|---|
UBI1 | CSN749030BD | Wear & Error | 1900 |
UBI2 | CSN7440003E | Wear & Error | 2950 |
UBI3 | SHF73300081 | Wear & Error | 3870 |
Total Written refers to the total amount of data written to date to the storage device in an attempt to test wear levelling and W/E lifetime, in GiB. For the current tests, each pass is 0.02 GiB.
UBIFS Setup Notes
If this is the first time, see the next section. If restarting a test, boot the laptop, and insert a USB device containing the test.sh script. Then simply type:
/usb/test.sh
UBIFS Initialization
Note: For these tests to have a valid effect, the storage device should not be re-formatted or re-initialized for the duration of the wear leveling test!
Install a fresh copy of firmware q2e22 from a USB device using Open Firmware (OFW):
flash u:\q2e22.rom
Boot the laptop, escaping into OFW, and insert a USB device containing the following files:
At the OFW prompt, type:
dev nand : write-blocks write-pages ; dend
You will likely get a message like write-blocks isn't unique. You can ignore this message.
update-nand u:\data.img
At this point OFW will erase the flash and copy the contents of the nand.img file to flash. When complete, reboot the system.
Now insert a USB device containing:
- fill_jffs.sh - a script for filling the NAND with random data
- random - a directory containing over 1 GiB of random data, in 32-MiB files (optional)
- test.sh - a script for running the wear leveling and error checking test
Create a link from the mount point for the USB device to /usb:
ln -s /media/<USB_DEVICE_NAME> /usb
Now you need to fill the NAND Flash partition ("/" on the stock XO build). This can be done using the same method used for LBA-NAND devices. If the random directory is provided on the USB device, type:
/usb/fill_jffs.sh
PATA Flash Controllers
There are currently six PATA Flash controllers undergoing tests, with more samples requested. Each drive is carefully partitioned and a Linux ext2 filesystem was placed on it. The system interface to the Flash controller is a PATA driver.
The tests are run from standard Linux desktop machines, each testing one or two systems.
Test Unit | Host | Device | Test | Total Written |
---|---|---|---|---|
SMI1 | | SM223 + 1 Hynix MLC | Wear & Error | failed after 2036 GiB |
SMI2 | | SM223 + 1 Hynix MLC | Wear & Error | failed at 1761 GiB |
SMI3 | marvell sd0 | SM2231 + 1 Hynix MLC | Wear & Error | 3436 |
SMI4 | marvell sd1 | SM2231 + 1 Hynix MLC | Wear & Error | 150 |
PHI1 | smi sd0 | Phison 3006 + 1 Hynix MLC | Wear & Error | 2896 |
PHI2 | phison sd0 | Phison 3006 + 1 Hynix MLC | Wear & Error | 2365 |
PHI3 | amd sd0 | Phison 3006 + 1 Hynix MLC | Wear & Error | 1896 |
PHI4 | amd sd1 | Phison 3006 + 1 Hynix MLC | Wear & Error | 1896 |
PHI5 | sawzall sd0 | Phison 3007 + 1 Samsung MLC | Wear & Error | |
PHI6 | smi sd0 | Phison 3007 + 1 Samsung MLC | Wear & Error | |
Total Written refers to the total amount of data written to date to the storage device in an attempt to test wear levelling and W/E lifetime, in GiB. For the current tests, each pass is 0.02 GiB.
PATA Flash Setup Notes
If this is the first time, see the next section. If restarting a test, mount the test device, change directories to the device under test, then run the test.sh script:
sudo fsck /dev/sda1 # optional, if not shutdown cleanly
sudo mount /dev/sda1 /sd0 # use appropriate device
cd /sd0
ls # check state of directory
sudo ~/test.sh
On most modern Linux systems, the PATA drives are assigned /dev/sd[abcdef] device names and enumerated before any SATA drives. The test partition is always the first (e.g. /dev/sda1, /dev/sdb1, etc...)
PATA Flash Initialization
Note: For these tests to have a valid effect, the storage device should not be re-formatted or re-initialized for the duration of the wear leveling test!
Touch /ssd when setting up a new test host --- the test script uses this to tell it where to place the logs.
Partition a new drive using fdisk, placing the first partition at sector 512 (byte offset of 256K). This usually involves the following sequence of commands to fdisk:
u # to change the units to sectors
p # show the existing partition table
d # (optional) delete any existing partition
n # create new partition (primary, number 1, start at sector 512, continue to end of device)
w # write out new partition table and quit
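As a quick sanity check on the chosen start sector, the alignment can be verified with shell arithmetic. This is a sketch assuming 512-byte sectors, as the instructions above do; <tt>is_aligned_256k</tt> is a hypothetical helper.

```shell
# Hypothetical check: does a partition's starting sector fall on a
# 256 KiB (262144-byte) boundary? Assumes 512-byte sectors.
is_aligned_256k() {
    [ $(( $1 * 512 % 262144 )) -eq 0 ]
}

is_aligned_256k 512 && echo "sector 512 is aligned"
is_aligned_256k 63 || echo "sector 63 is not aligned"
```

Sector 63 is included only as a contrast: it is a common start sector under the old cylinder-based DOS conventions and does not fall on a 256 KiB boundary.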
Install an ext2 filesystem on the drive, using 2048-byte blocks:
mke2fs -b 2048 -m 0 /dev/sdb1
mount /dev/sdb1 /sd0
Now you need to fill the NAND Flash partition ("/sd0" or "/sd1"). This can be done using the fill_random.sh script.
LBA
The LBA-NAND parts being tested have almost all failed. More information about testing this part is available at LBA NAND Testing.
Laptop | Serial # | Test | Total Written | Comments |
---|---|---|---|---|
LBA1 | CSN74700D03 | Wear & Error | 4906 | |
LBA2 | CSN74702D30 | Wear & Error | 4882 | device failed |
LBA3 | SHF808021E4 | Wear & Error | 985 | Device failed (sample #1) |
LBA4 | CSN749013AF | Wear & Error | 2467 | Device Failed (sample #3) |
LBA5 | CSN75001985 | Wear & Error | 2111 | Device failed (sample #4) |
LBA6 | CSN74702A8E | Wear & Error | 2491 | Device failed (sample #5) |
LBA7 | CSN748040B6 | Wear & Error | 3022 | device failed |
LBA8 | CSN74900B3C | Wear & Error | 2329 | Device failed (sample #2) |
LBA13 | SHF808021E4 | Wear & Error | 1645 | 2 GiB part - device failed |
LBA14 | CSN749013AF | Wear & Error | 120 ? | 2 GiB part - device failed |
LBA15 | CSN75001985 | Wear & Error | 120 ? | 2 GiB part - device failed |
LBA16 | CSN74702A8E | Wear & Error | 2269 | 2 GiB part - device failed |
LBA18 | CSN74900B3C | Wear & Error | 120 ? | 2 GiB part - device failed |
SD Cards
All SD cards follow the same initialization and setup procedures.
Additional tests for SD cards are described at SDCard_Testing.
Two types are currently being tested, with more possibly added in the future.
Sandisk Extreme III
There are four XOs at 1CC running the tests on a SanDisk Extreme III 4-GB SD card. Build 8.2-760 was freshly installed on the laptops.
The current test rates are roughly 3.8 sec/test step 4.1, and 4.1 sec/test step 4.2. This translates roughly into a 17 MByte/s read rate, and a 5.7 MByte/s write rate.
Laptop | Serial # | Test | Total Written |
---|---|---|---|
SAN1 | SHF725004D1 | Wear & Error | 30,550 |
SAN2 | SHF7250048F | Wear & Error | 28,335 |
SAN3 | SHF80600A54 | Wear & Error | 28,730 |
SAN4 | CSN74902B22 | Wear & Error | 29,250 |
Total Written refers to the total amount of data written to date to the storage device in an attempt to test wear levelling and W/E lifetime, in GiB. For the current tests, each pass is 0.02 GiB.
Transcend Class 6
There are three XOs at 1CC running the tests on a Transcend class 6 4-GB SD card. Build 8.2-767 was freshly installed on the laptops.
The current test rates are roughly 3.9 sec/test step 4.1, and 5 sec/test step 4.2. This translates roughly into a 17 MByte/s read rate, and a 5.7 MByte/s write rate.
Laptop | Serial # | Test | Total Written |
---|---|---|---|
TR1 | ? | Wear & Error | 14505 |
TR2 | ? | Wear & Error | Failed after 15710 GB |
TR3 | ? | Wear & Error | 5026 |
Total Written refers to the total amount of data written to date to the storage device in an attempt to test wear levelling and W/E lifetime, in GiB. For the current tests, each pass is 0.02 GiB.
SD Card Setup Notes
If this is the first time, see the next section. If restarting a test, boot the laptop, with a USB stick containing the test.sh script, and type:
/usb/test.sh
A new logfile will automatically be created on the USB key (in /usb/logfile-xxxxx).
SD Card Initialization
Note: For these tests to have a valid effect, the storage device should not be re-formatted or re-initialized for the duration of the wear leveling test!
Install a fresh copy of release 8.2-760 from a USB key using Open Firmware:
copy-nand u:\os760.img
Boot, and insert a USB device containing several scripts:
- fill_random.sh - an alternative script for filling the disk
- test.sh - a script for running the wear leveling and error checking test
Go to the Journal and unmount the SD card.
You will need to create a link from the mount point for the USB device to /usb:
ln -s /media/<USB_KEY_NAME> /usb
Repartition the storage device using:
fdisk /dev/mmcblk0
Type 'u' to see the information in sectors. Then list the partitions already on the card and remember the starting sector and ending sector for the factory installed partition.
Delete any existing partitions, and create a single partition either using the same partition boundaries as the factory installed partition or start the first partition on a 4 MByte boundary (used by production images).
Then format the device using either (ext3):
mke2fs -b 2048 -j /dev/mmcblk0p1
or, if wanting a simple test (ext2):
mke2fs -b 2048 -m 0 /dev/mmcblk0p1
Now, mount the device as /nand, and start filling it with random data:
mkdir /nand
mount /dev/mmcblk0p1 /nand
/usb/fill_cp.sh
umount /nand
rmdir /nand
Reboot, and create a link from the mount point for the SD card to /nand:
ln -s /media/<SD_CARD_NAME> /nand
You are ready to start the testing, with:
/usb/test.sh