XO-1.75/Kernel/Testing: Difference between revisions

Kernel developers, put your new kernel in a new section here. Latest at top. Testers place links to reports indented under the kernel.

See also:
*[[XO-1.75/Kernel/Runin|using runin for kernel testing]]
*[[XO-1.75/Kernel/FIQ|hints for using FIQ debugger kernels]]
*[[XO-1.75/Kernel/Issues|structured problem tracking]]


<!--
== hash - branch ==
*link to git hash via http
*test dependencies, e.g EC logged in S7, host logged, specific build or firmware version,
*link to binary
*dependencies, e.g firmware version that must be used,
*purpose, e.g test loss of EC comms in runin,
*testing by
and add after this html comment
-->

== 4ae137c7 - arm-3.0-ramp ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-ramp&id=4ae137c791df043b323444bb3c0f2ae8b6adf067 4ae137c7]
*libertas fix
*testing by jnettlet
*testing by James, on six units, with serial, with aggressive runin, passed 96 hours,
*testing by James, as above, with also ''debug no_console_suspend'', passed 96 hours,

== d6a2a5eb - arm-3.0-ramp ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-ramp&id=d6a2a5eb1932d9c742fd355c161533dd8bffbdba d6a2a5eb] aka os27,
*James Cameron, started 2012-02-10 05:50 UTC, stopped some days later, q4d03ja, defconfig, plus [http://dev.laptop.org/~quozl/y/1RvjQc.txt patch] (enable excessive serial port output, leave apb uart clocks active over suspend), ([http://dev.laptop.org/~quozl/kernel-testing/d6a2a5eb.tar.gz .tar.gz]), olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os27, on eight units, looking for hangs, ... 50% fail within 12 hours, two instances of "mmp2_pm_finish: Done" as the last message, also [http://dev.laptop.org/~quozl/z/1Rvz1M.txt hang #0], and a unit that hung without a serial console attached.
*as above, started 2012-02-10 22:52 UTC, with [http://dev.laptop.org/~dilinger/pxa3.patch kgdb disabled],
**[http://dev.laptop.org/~quozl/z/1Rw3ec.txt hang #0], shortly after "mmp2_pm_finish: Done", with evidence of log buffer writes beyond log_end,
**hang #1, an SKU203 without serial console attached,
**hang #2, an SKU204 without serial console attached,
**[http://dev.laptop.org/~quozl/z/1RwmtV.txt hang #3], shortly after "mmp2_pm_finish: Done", an SKU201 with serial console, but watchdog failed to restart,
**hang #4, an SKU204 without serial console attached,
**[http://dev.laptop.org/~quozl/z/1Rwmz9.txt hang #5], an SKU200 with serial console, but watchdog failed to restart,

== 61e93c20 - arm-3.0-ramp ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-ramp&id=61e93c20684696b7464121a6a087fa6ca9895f6f 61e93c20]
*James Cameron, started 2012-02-07 09:53 UTC, stopped after 67.7 hours, q4d03, defconfig, (i.e. with serial driver disabled, with epitaph)([http://dev.laptop.org/~quozl/kernel-testing/3b3820f2-epitaph-nottys2.tar.gz .tar.gz]), olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os26, on eight units, looking for hangs, looking for 3.0.19 regressions, ... one hang on resume, [http://dev.laptop.org/~quozl/z/1Rvhmu.txt hang #1],

== 61e0e9ee - arm-3.0-ramp ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-ramp&id=61e0e9eec5c2315cc6b4bd855e5bfe9b53178361 61e0e9ee]
*os26
*testing by
**James Cameron, completed, q4d02jc, defconfig, with serial console disabled, without epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os26, on eight units, looking for hangs, ... over 96 hours, one hang on SKU204,

== be5c532a - arm-3.0-wip ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip&id=be5c532a4119a9388acf0c4dab21d590c06c27be be5c532a]
*testing by
**James Cameron, completed, q4d02jc, defconfig, with serial driver disabled, with epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive, on six units, looking for eMMC or other hangs, 113 hours fine, one instance of camera missing <trac>11595</trac> (which can only occur on boot), see [http://dev.laptop.org/~quozl/z/1Rpupk.txt dmesg]
**James Cameron, completed, as above, but with serial driver enabled, SKU200 [http://dev.laptop.org/~quozl/z/1RrnFf.txt hang #1] (during resume), [http://dev.laptop.org/~quozl/z/1RrzSa.txt hang #2] (during resume), SKU202 [http://dev.laptop.org/~quozl/z/1RrzR3.txt hang #3] (during suspend), and SKU201 hang during suspend or resume, ... all of which suggests that enabling the serial driver leads to hangs,
**James Cameron, completed, as above, but with serial driver with DMA support, <!-- local branch, version 3.0.17-01203-g60273d9 --> SKU200 [http://dev.laptop.org/~quozl/z/1RsAaG.txt hang #1] (during resume), B1 [http://dev.laptop.org/~quozl/z/1RsMMI.txt hang #2] (during resume), B4 [http://dev.laptop.org/~quozl/z/1RsMP9.txt hang #3] (during resume).

== 8f74291b - arm-3.0-wip ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip&id=8f74291be31f0e29051e9a4acaf6155c537f5af8 config: update for new graphics stuff + 3.0.17]
*EC logged in S7/S47 if available
*os25 or another build with the new graphics ABI
*purpose: Copy changes from 3.0.17, CMA, new graphics ABI to -wip branch
*testing by
**Samuel Greenfeld
***5xC1, 10s aggressive suspend w/ runin-camera test active
***1xC1 SKU201 TX FIFO not empty [http://dev.laptop.org/ticket/11594 Ticket]

== 073c7fbf - arm-3.0.17-wip ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0.17-wip&id=073c7fbfc38525105f4882cf8fb7cafddd787670 TTY: serial_core: Fix crash if DCD drop during suspend]
*EC logged S47 (if available)
*os24 base image or one which does *not* have the new graphics ABI
*purpose, test if fixes found in the 3.0.17 Linux kernel solve our issues
*testing by
**Samuel Greenfeld
***5xC1, all units passed 72 hours (3 cycles) with 10s suspend cycles (with clock skews ranging from 10 minutes backwards to 20 minutes forward after being set via ntp-set-clock prior to the test). The order in which the systems rebooted and their overall run durations differed from the order and times at which I recall starting them. Testing was done without runin-camera, although looking at the kernel configuration it may be possible to turn it on again.

== b8114c95 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=b8114c95e6d369385bc28c6dacc8d9e1cb427262 b8114c95]
*os24 base image or one which does *not* have the new graphics ABI
*disable runin-camera test
*purpose, see if eMMC stack backport from 3.2 fixes eMMC hangs, for Andres.
*testing by
**Samuel Greenfeld
***5xC1, still hangs (as before) or reboots reporting filesystem issues, [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-b8114c9/ sampled logs here]

== 5fef1609 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=5fef160929d2ae18b8212a285e193b4510128469 mmc: try reducing bus bit width in case of error]
*EC logged at S47
*os24 base image or one which does *not* have the new graphics ABI
*disable runin-camera test
*purpose, look for EC, other lockups
*aggressive suspend cycles
*testing by
**Samuel Greenfeld
***2xSKU 202, 3xSKU 201, all with serial consoles, (one of each with EC serial)
***C1 SKU201 (#4) failure due to journal replay error, filesystem remounted readonly & runin reboot
***Other systems all hung within a few hours, some multiple times after being restarted
***Logs online [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-5fef1602/ here]
**James Cameron, with a [http://dev.laptop.org/~quozl/z/1Ro5xF.txt local patch to disable serial and capture dmesg via CForth], one of the five systems has hung over 52 hours, a remarkably stable test result, the one hang was preceded by an EC driver message ''olpc-ec-1.75: SSP reports TX underrun'' then a WARNING, [http://dev.laptop.org/~quozl/z/1RosBj.txt log]
*** C1 SKU201, start 2012-01-20 04:12:26, end 2012-01-24 04:18:42, four days, no errors, post-run ntpdate correction -3333.884336 sec,
*** C1 SKU202, start 2012-01-20 04:12:28, end 2012-01-24 05:17:09, four days, one EC driver error, post-run ntpdate correction -534.727839 sec,
*** B1, start 2012-01-20 04:12:29, end 2012-01-24 04:19:15, four days, no errors, post-run ntpdate correction 926.420114 sec,
*** B1, start 2012-01-20 04:12:31, end 2012-01-24 04:18:43, four days, no errors, post-run ntpdate correction -2817.877176 sec,
*** B4, start 2012-01-20 04:12:32, end 2012-01-24 04:18:38, four days, no errors, post-run ntpdate correction 1661.545946 sec.

== b2845e62 - arm-3.0-wip-graphics (os25) ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-graphics&id=b2845e62ac6803e0e1ef760ee850e0f7c747349a (pxa168fb) Fix HWCursor Resume logic]
*Use os25 to make this easiest to install
*Q4D01 recommended
*purpose: test possible ramp build
*testing by
**Samuel Greenfeld
***3xB1, 5xC2 running normal runin, completed 10 24-hour runin cycles with all units; should be attempting additional 24-hour cycles on units still available.

== 9a3a8436 - arm-3.0-wip-graphics ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-graphics&id=9a3a84361e24886b34a0134fc18632bf86bb62cd (galcore:marvell) Pass the device to DMA for reservations]
*Jon Nettleton's graphics RPM repo is required (ABI breakage)
*purpose, test new graphics driver in runin with a somewhat ideally stable kernel
*testing by
**Samuel Greenfeld
***2xSKU 202, 2xSKU 201 passed 21 hours of testing with 10s suspend, ~3600 suspend cycles each
***1xSKU 201 hung after ~20 hours of testing with 10s suspend
***5xSKU 200 ran normal runin settings for 21 hours, no issues, ~58 suspend cycles each
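The cycle counts reported above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that "10s suspend" means 10 seconds suspended plus 10 seconds awake (20 seconds per cycle), as the "10s on/10s off" wording elsewhere on this page suggests.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and not part of olpc-runin-tests.

# Usage: expected_cycles HOURS SECONDS_PER_CYCLE
# Ideal number of cycles if every cycle takes exactly SECONDS_PER_CYCLE.
expected_cycles() {
    echo $(( $1 * 3600 / $2 ))
}

# Usage: seconds_per_cycle HOURS CYCLES
# Average cycle length (whole seconds) implied by an observed cycle count.
seconds_per_cycle() {
    echo $(( $1 * 3600 / $2 ))
}

expected_cycles 21 20      # assumed 10 s on + 10 s off: prints 3780
seconds_per_cycle 21 3600  # reported ~3600 cycles: prints 21
seconds_per_cycle 21 58    # reported ~58 cycles (normal runin): prints 1303
```

Under that assumption the reported ~3600 cycles is slightly below the ideal 3780, presumably because each suspend and resume transition takes some time of its own.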

== daf43181 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=daf43181 TTY: serial_core: Fix crash if DCD drop during suspend]
*make sure no_console_suspend is enabled
*see if this helps or fixes serial hangs
*testing by
**James Cameron, os24 remastered with testconfig kernel, without no_console_suspend, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera),
*** B1 [http://dev.laptop.org/~quozl/z/1RnKRg.txt hang], [http://dev.laptop.org/~quozl/z/1RnNVP.txt hang],
*** B4 [http://dev.laptop.org/~quozl/z/1RnL3p.txt runin restart due to filesystem remount read-only].
**Samuel Greenfeld
*** C1 SKU 202 [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-daf4318/screenlog.0-mmcfail1.bz2 mmc communication problem (no reboot by runin)]
*** C1 SKU 202 [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-daf4318/screenlog.1.bz2 hang]
*** C1 SKU 201 [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-daf4318/screenlog.3.bz2 hang]
*** C1 SKU 201 [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-daf4318/screenlog.4.bz2 hang]
*** C2 hang followed by power off; no serial connection
** James Cameron, local CForth build, os24 remastered with testconfig kernel, without no_console_suspend, with dmesg buffer locating patch, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera), [http://dev.laptop.org/~quozl/z/1Rnh17.txt hang] (irq 20), [http://dev.laptop.org/~quozl/z/1Rnh6v.txt hang] (eMMC), [http://dev.laptop.org/~quozl/z/1RniFg.txt hang] (irq 20), [http://dev.laptop.org/~quozl/z/1Ro1mn.txt hang] (during suspend), summary, over 24 hours, over 5 units, had 5 hangs relating to eMMC, and 10 hangs relating to irq 20.
** James Cameron, os24 remastered with defconfig kernel, without console=ttyS2,115200, as above, eight units passed 26 hours.

== fc631432 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=fc63143216c68cdbb6ca44201dc496a8842f2dab testconfig: reenable no_console_suspend]
*EC logged in S7 or S47.
*Disable runin camera test (disabled mmp_camera driver)
*purpose, gather more information about why we hang, since removing no_console_suspend led to hanging significantly less often.
*testing by
**Samuel Greenfeld
***Seems to hang on all test systems within 2 hours with 10s suspend cycles (multiple failure modes, including #11528), possibly aggravated by sending serial data to the XO during the suspend/resume process. But failures have also been seen after rebooting systems and ignoring them.

== cfb3aa76 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=cfb3aa76 cfg80211: amend regulatory NULL dereference fix]
*EC logged in S7 or S47. Do not use no_console_suspend.
*purpose: fixes a wireless oops; still need to verify TxFIFO sanity check patch
*testing by
**Samuel Greenfeld, 5 C1s, 1 B1, 10s aggressive suspend cycles
***SKU202 #0 failed 0:8:26 (hours:minutes:seconds) remaining after 3139 cycles power LED on full charge seen
***SKU199 #6 failed 0:53:10 (hours:minutes:seconds) remaining 2990 cycles power LED on battery LED off (discharging)
***SKU201 #4 failed second run 22:51 (hours:mins) remaining after 441 suspend cycles
***SKU202 #0 failed second run 21:31 (hours:mins) remaining 322 cycles
***SKU202 #1 failed second run 18:36 (hours:mins) remaining 689 cycles
***SKU199 #6 failed second run 13:29 (hours:mins) remaining 1354 cycles
***SKU201 #4 failed third run 20:08 (hours:mins) remaining 510 cycles
***SKU202 #0 failed third run 18:42 (hours:mins) remaining 1694 cycles
***Logs available [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-cfb3aa7/ here]

== be581eb4 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=be581eb4c2f2e1168d34df907f318d512095db96 testconfig: turn off no_console_suspend]
*EC logged in S7 or S47, aggressive suspending in runin
*do not use no_console_suspend, use same configuration as caused hangs previously,
*purpose: Verify the 'olpc-ec: do a full reset if the TxFIFO sanity check fails' patch.
*testing by
** James Cameron, os24 remastered with kernel, C1 SKU201 passed, C1 SKU202 hung ([http://dev.laptop.org/~quozl/z/1RlvVN.txt log]), B1 hung, ([http://dev.laptop.org/~quozl/z/1RlvVz.txt log]).

== f1f8f7fe - arm-3.0-wip, EC 4.01 ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip&id=f1f8f7fe60e77c43b0486bdd6a058085833664e5 git hash],
*[http://dev.laptop.org/~kernels/public_rpms/f14-xo1.75/kernel-3.0.0_xo1.75-20120109.1939.olpc.f1f8f7f.armv7l.rpm rpm],
*[http://build.laptop.org/11.3.1/os24/xo1.75/ build],
*do not use no_console_suspend, include light sensor test
*make sure to manually upgrade the EC firmware to cl2-4_0_4_01
*purpose, test if actually getting the EC to be quiet during suspend improves things
*testing by
** John Watlington:
*** SKU201 (3) 24hr cycles, 600+ cycles ongoing;
*** SKU202 (3) 24hr cycles, 600+ cycles ongoing
*** (12x) SKU198/199/just plain weird, ongoing
*** (10x) SKU200, ongoing;
*** (4x) SKU201, ongoing;
*** (2x) SKU202, ongoing;
*** (2x) SKU203, ongoing;
*** (2x) SKU204, ongoing;

== f1f8f7fe - arm-3.0-wip ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip&id=f1f8f7fe60e77c43b0486bdd6a058085833664e5 git hash],
*[http://dev.laptop.org/~kernels/public_rpms/f14-xo1.75/kernel-3.0.0_xo1.75-20120109.1939.olpc.f1f8f7f.armv7l.rpm rpm],
*[http://build.laptop.org/11.3.1/os24/xo1.75/ build],
*do not use no_console_suspend, use same configuration as caused hangs previously,
*purpose, test if debug console messages may cause hangs,
*testing by
** Jon Nettleton, C1, 13:20 hours / 1966 cycles ongoing, os20,
** James Cameron,
*** C1 SKU201, C1 SKU202, B1, all passed,
*** C1 SKU201, C1 SKU202, B1, B1, B4, with os24, all passed.
** John Watlington, with light-sensor test:
***C1 SKU201 (2) 24hr loop + 2000 cycles stopped;
***C1 SKU202 (2) 24hr loop + 2100 cycles stopped
***(10x) SKU198/199/just plain weird, (2) 24hr loop, stopped
***(8x) SKU200, (2) 24 hr loop, stopped;
***(3x) SKU201, (2) 24 hr loop, stopped;
*** SKU202, (2) 24hr loop, stopped;
*** SKU198, hang in first 24 hrs, all hangs either screen blank w. power light on or turned off.
*** SKU199, hang in first 24 hrs
*** (2x) SKU200, hang in first 24 hrs
*** SKU201, hang in first 24 hrs
*** SKU202, hang in first 24 hrs
** Samuel Greenfeld, Q4C12, patched 0.17.0 runin to support kernel RPM version
*** C1 SKU202 failure after 1805 cycles (cause unclear but EC locked up) [http://dev.laptop.org/ticket/11571 #11571]
*** C1 SKU202 failure after 3455 cycles (EC transmission queue full followed by null pointer dereference) [http://dev.laptop.org/ticket/11573 #11573]
*** B1 SKU199 failure after 2062 cycles (cause unknown; no serial logs available) [http://dev.laptop.org/ticket/11572 #11572]
*** 3xC1 SKU201, 2xB1 SKU 198, 2xB1 SKU 199 passed 24 hours of testing.
*** All 10 systems mentioned above re-imaged to os24 with normal 24-hour runin settings; passed.
** Chia-Hsiu, 6 x C2, 2 x C1, using os24, ongoing
*** one 24-hour run completed successfully
** C2 SKU200 x7 @ Miami office using os24 - ongoing
*** 20 hrs with >50 s/r cycles

== 1220526f - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=1220526f debug: add debugging to workqueues, irqs, etc]
*test dependencies: save your System.map from your build
*purpose: attempt to track down hangs to specific driver callbacks
* testing by
** James Cameron, three units, testconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog for first run only, 0 to pm_async,
*** C1 SKU201, hang 1, [http://dev.laptop.org/~quozl/z/1RkTYx.txt log], hang 2, [http://dev.laptop.org/~quozl/z/1RkTrN.txt log], hang 3, a repeating invalid page fault with storage LED blinking, FIQ entry but insane, [http://dev.laptop.org/~quozl/z/1RkUKO.txt log], [http://dev.laptop.org/~quozl/z/1RkUMV.txt fiq], hang 4, [http://dev.laptop.org/~quozl/z/1RkVAN.txt log], restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 5, in resume, [http://dev.laptop.org/~quozl/z/1RkVjV.txt log], stopped.
*** C2 SKU202, hang 1, with spontaneous kdb entry, and a backtrace, [http://dev.laptop.org/~quozl/z/1RkTgv.txt log], hang 2, [http://dev.laptop.org/~quozl/z/1RkUJ4.txt log], hang 3, not with the workqueue debugging active at the time, [http://dev.laptop.org/~quozl/z/1RkVBB.txt log], restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 4, now in EARLY resume, [http://dev.laptop.org/~quozl/z/1RkVf2.txt log], stopped.
*** B1, hang 1, [http://dev.laptop.org/~quozl/z/1RkU4x.txt log], hang 2, [http://dev.laptop.org/~quozl/z/1RkV9G.txt log], hang 3, [http://dev.laptop.org/~quozl/z/1RkVhM.txt log], stopped.
*** B1 (pass), B4 (pass), B1 (pass), with -ENODEV return from serial_pxa_init(), and no calls to mmp2_add_uart(), ongoing, 3483 cycles, 23 hours,

== 0a896953 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?id=0a89695328b4abd3ea1ec27157e1b2bd0687f497 olpc-ec: consolidate code and ensure TxFIFO is empty before writing to it]
*EC logged in S7, runin 0.17.0 or 0.16.7 (not C2 builds) if possible
*Brief tests of earlier kernels over the past ~12 hours suggest that their fixes may resolve all known easily reproducible hangs. This needs to be proved or disproved.
*If you have enough systems, try using stock runin 0.17.0-style settings as well as anything known to cause particular types of failures. (EC failures seem to be common within a few hours with 10s on/10s off suspend, only using runin-main/gtk/common/tests/sus/battery.)
*testing by
** James Cameron, six units, defconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog, 0 to pm_async, started around ''Sat Jan 7 08:24:00 UTC 2012'',
*** C1 SKU201, 534 tests, 530 pass, 4 fail,
**** 3 hang during resume (see [http://dev.laptop.org/~quozl/z/1RjRnj.txt host log], and [http://dev.laptop.org/~quozl/z/1RkAO8.txt host log], and [http://dev.laptop.org/~quozl/z/1RkAQI.txt host log]),
**** 1 fail with repeating eMMC SET_BLOCK_COUNT, see [http://dev.laptop.org/~quozl/z/1RkASH.txt host log],
*** C1 SKU202, 2777 tests, 2711 pass, 1 WARNING at olpc_ec_1_75_cmd ([http://dev.laptop.org/~quozl/z/1RkB6R.txt log]), 66 fail,
**** 50 hang during resume,
**** 16 fail with repeating eMMC SET_BLOCK_COUNT,
**** see [http://dev.laptop.org/~quozl/z/1RkAy8.txt tail of each fail in one place]
*** one B1 hang with repeating eMMC SET_BLOCK_COUNT, see [http://dev.laptop.org/~quozl/z/1RjS0m.txt host log],
*** one B1 hang with repeating eMMC SET_BLOCK_COUNT, see [http://dev.laptop.org/~quozl/z/1RjS3a.txt host log],
** Samuel Greenfeld, five C1 units, runin 0.16.7 setup like 0.17 (light sensor test disabled), 0 to pm_async
*** one C1 SKU201 hang during resume, normal runin settings, see [http://fpaste.org/qOc6/ host log]
*** one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing, see [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.3-resumehang.bz2 host log] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.8-ecfor3-resumehang.bz2 ec]. The unit was discovered after it had powered off due to critical battery failure (runin discharge test).
*** one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing + cl2-4_0_3_07rsmith-586 EC code (S47 debug), see [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.1-resumehang-ec586.bz2 host log] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.7-resumehang-ec586-ecfor1.bz2 ec log].
*** one B1 SKU199 hang during resume, normal runin settings with 10s suspend timing, see [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.6-b1resumehang.bz2 host log].
** Jon Nettleton
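To compare units, the pass and fail counts James reported above can be turned into failure percentages. This is a minimal sketch; `fail_percent` is a hypothetical helper, and the inputs are the test totals and fail counts quoted for the C1 SKU201 and C1 SKU202 units.

```shell
#!/bin/sh
# Sketch only: fail_percent is a hypothetical helper, not part of runin.
# Usage: fail_percent FAILS TESTS
fail_percent() {
    # awk does the floating point division that plain sh arithmetic lacks.
    awk -v f="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * f / t }'
}

fail_percent 4 534      # C1 SKU201: prints 0.75
fail_percent 66 2777    # C1 SKU202: prints 2.38
```

The two units ran very different numbers of tests, so a percentage is a fairer comparison than the raw fail counts.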

== ebf24ea6 - arm-3.0-wip-wfi ==
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=ebf24ea6
*test dependencies: EC logged in S7, use whatever magic possible to get the EC errors
*Another attempt to fix the problem of EC communication failures
*testing by

== 196c2f806 - arm-3.0-wip-wfi ==
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=196c2f80
*testing by
**James Cameron, with a local debugging patch, saw two instances of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201 and C1 SKU202, two test units with full runin 0sec/3sec (extreme) suspend and resume failed to show any hangs, currently going are five test units with full runin 10sec/10sec (aggressive) suspend and resume.
***C1 SKU202 reported EC related kernel problems, see [http://dev.laptop.org/~quozl/z/1RjNDy.txt log], should be fixed in later kernels.

== 46e079fe - arm-3.0-wip-wfi ==
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=46e079fe
*EC logged in S7
*Should fix the hang caused by suspend being aborted due to a wakeup event. May also help with other IRQ hangs as I have fixed a previous mistake I made when moving the audio island code.
*Testing aborted suspends can be done by running this command and then hitting the keyboard right away.
<blockquote>
echo $(cat /sys/power/wakeup_count) > /sys/power/wakeup_count && echo mem > /sys/power/state
</blockquote>
*testing by
**James Cameron, with a local debugging patch, saw one instance of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201, see [http://dev.laptop.org/~quozl/z/1RikOi.txt log tail], and several instances of the [[XO-1.75/Kernel/Issues#hang.2C_dpm_resume|hang in dpm_resume()]].
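The aborted-suspend command above can be wrapped in a small function so that it is easy to repeat. This is only a sketch: the name `try_suspend` and its optional directory argument are illustrative assumptions, while the two sysfs files are the ones used in the command above.

```shell
#!/bin/sh
# Sketch only: try_suspend and its directory argument are hypothetical.
# Usage: try_suspend [POWER_DIR]   (POWER_DIR defaults to /sys/power)
try_suspend() {
    power_dir="${1:-/sys/power}"
    # Read the current count of wakeup events.
    count=$(cat "$power_dir/wakeup_count") || return 1
    # Writing the same count back fails if a wakeup event arrived in
    # between, in which case no suspend is attempted.
    if echo "$count" > "$power_dir/wakeup_count"; then
        echo mem > "$power_dir/state"
    else
        echo "suspend not attempted: wakeup event pending" >&2
        return 1
    fi
}
```

Run it as root and hit the keyboard right away, as described above; with this kernel the suspend should abort cleanly instead of hanging.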

== bfc1b92b - arm-3.0-wip-wfi ==
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=bfc1b92b
*EC logged in S7
*Should fix the (known) cause of failing EC commands on s/r. Please test both this and the prior kernel.
*testing by

== b7f22e1d - arm-3.0-wip-wfi ==
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=b7f22e1d
*EC logged in S7
*Should fix the problem where, upon a single EC error, all further EC commands fail. Please test!
*testing by

== 10ebd28f - arm-3.0-wip-wfi ==
* [http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=10ebd28ff6bc0ef40c881a818fe67b6bb544954f olpc-ec: log a warning if we manage to process an EC command deep inside of suspend],
* a [http://dev.laptop.org/~quozl/kernel-10ebd28f-fiq.tar.gz tar.gz] by James,
*testing by
** Richard?

== ae48be89 - arm-3.0-wip-wfi ==
* a [http://dev.laptop.org/~quozl/kernel-ae48be89-fiq.tar.gz tar.gz] by James,
*testing by
** Richard?
** James Cameron, 10sec/10sec, with audio disabled, [http://dev.laptop.org/~quozl/z/1RiInC.txt patch],
*** B1 hung, [http://dev.laptop.org/~quozl/z/1RiK9g.txt host log],
*** B1 hung, [http://dev.laptop.org/~quozl/z/1RiMYl.txt host log],
*** B1 going.

== 5ba0b446 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=5ba0b446 olpc-ec: allow unknown commands to be executed]
*test dependencies: EC logging (S7), kernel logging with EC debug enabled ("echo 1 > /sys/module/olpc_ec_1_75/parameters/ec_debug"), pm_async disabled.
*purpose: improve EC driver; fewer (or no?) races/crashes..
*testing by
** Samuel Greenfeld
*** C1 SKU 201 EC communications failure (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin. [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-5ba0b446/screenlog.3-ecfail.bz2 host] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-5ba0b446/screenlog.8-ecfor3-ecfail.bz2 ec]
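The kernel-side test dependencies for this build can be set in one step before starting runin. The sketch below is an assumption-laden convenience: `prepare_ec_debug` and its root argument are hypothetical, the `ec_debug` path is the one quoted above, and `/sys/power/pm_async` is the standard kernel location for the pm_async switch, assumed to apply to these kernels.

```shell
#!/bin/sh
# Sketch only: prepare_ec_debug and its root argument are hypothetical.
# Usage: prepare_ec_debug [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys)
prepare_ec_debug() {
    sysfs="${1:-/sys}"
    # Enable EC driver debug messages in the kernel log.
    echo 1 > "$sysfs/module/olpc_ec_1_75/parameters/ec_debug" || return 1
    # Disable asynchronous device suspend/resume ("0 to pm_async").
    echo 0 > "$sysfs/power/pm_async" || return 1
}
```

Both settings are lost on reboot, so the function would need to be run again after each restart of the unit.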

== ff199462 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=ff199462da2eb97ff2a8cbcd98f7d74822a40ae5 olpc-ec: ensure gpio cmd is left low if something screws up]
*EC logged in S7
*Still has FIQ debugger
*Should help reset EC bus to avoid subsequent failures after the first command fails.
*testing by
**Samuel Greenfeld
*** C1 hang on resume (#3), OLS runin disabled, Q4C11 normal EC code [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-ff199462/screenlog.3-resumehang1.bz2 host] (''mmp2_pm_finish: Enable audio island'' ... ''mmp2_pm_finish: Done'' ... 72.616ms ... ''mmp-camera mmp-camera.0: resume'') [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-ff199462/screenlog.8-resumehang1.bz2]
*** C1 hang on resume (#6), OLS runin disabled, Q4C11 normal EC code
*** C1 hang on resume (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-ff199462/screenlog.3-resumehang2ecmod.bz2 host] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-ff199462/screenlog.8-ecfor3-resumehang2ecmod.bz2 ec]

== 3d4cf36c - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=3d4cf36c56ccce859fc3b508ee8a38cde098d346 mmp2_fiq_debugger.c add file missing from other merge]
*EC logged in S7
*This enables the [[XO-1.75/Kernel/FIQ|FIQ debugger]]. Reproduce kernel hangs, then from a serial console send a Break which should drop you to a debug prompt. run bt to see what is going on.
*testing by
**James Cameron, 10sec/10sec,
*** C1 SKU201 hang at ''mmp-camera mmp-camera.0: resume'', [http://dev.laptop.org/~quozl/z/1RiKZq.txt host],
*** C1 SKU202 hang at ''mmp-camera mmp-camera.0: resume'', [http://dev.laptop.org/~quozl/z/1RiHT6.txt host],
*** C1 SKU202 hang at ''mmp-camera mmp-camera.0: resume'', [http://dev.laptop.org/~quozl/z/1RiKbq.txt host],
*** no response to BREAK,
**James Cameron, 0sec/3sec, FIQ was verified as working before starting,
*** C1 SKU201 [http://dev.laptop.org/~quozl/z/1RiDms.txt host] (known audio problem that doesn't need to be reported again),
*** C1 SKU202 [http://dev.laptop.org/~quozl/z/1RiEIr.txt host] (known audio problem),
*** B4, and two B1 hung, no response to BREAK.

== 26f404e3 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=26f404e322ba63f0abe94a15844048b08e6d6ab5 Revert "olpc-ec: don't process/ack packets when there's an underrun error"]
*EC logged in S7
*further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf. Test that audio pop during suspend/resume is gone from jnettlet.
*testing by
** Samuel Greenfeld
*** 3xC1 running os23 with olpc-runin-tests-0.16.7-1 installed instead of bringup build, Q4C11, all tests (2 with EC serial), 3xC1 running with battery test disabled. Test in progress.
**** 1xC1 SKU201 (#6) running aggressive (10s on/10s off) suspend failed after 15 minutes during resume, power & full battery LEDs on, all other LEDs off. All tests except the battery test were running on this unit at the time of failure. [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-26f404e/screenlog.6-resumehang.bz2 host] (''mmp2_pm_finish: Enable audio island'' ... ''mmp2_pm_finish: Done'')
**** 1xC1 SKU201 (#3) running aggressive (10s on/10s off) suspend failed at a point TBD with EC communications failure and the eMMC root filesystem remounted read-only. It continued suspend & resume testing after the failure point(s). [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-26f404e/screenlog.3-ecfail1.bz2 host] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-26f404e/screenlog.8-ecfor3-ecfail1.bz2 ec]
** James Cameron
*** C1 C1 B4 B1 B1 B1, os20, runin 0.16.7, 10sec/10sec, hangs still occur [http://dev.laptop.org/~quozl/z/1Rhvmt.txt one], [http://dev.laptop.org/~quozl/z/1RhvpV.txt two], audio pops during suspend and resume are still present.

== 7dad6c10 - arm-3.0-wip-wfi (BROKEN) ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=7dad6c10 XO-1.75: always send the suspend hint command to the EC, even on C-series]
*EC logged in S7
*further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf.
*testing by
** Samuel Greenfeld
*** 1xC1 SKU202 tested; logs "olpc-ec-1.75: SSP reports TX underrun" and "SSP reports RX overrun" every three seconds shortly after the kernel starts initializing. This causes the XO to never fully boot, making this kernel unusable. [http://dev.laptop.org/~greenfeld/temp/175bringup/results-7dad6c1.tgz log with EC at S7] [http://dev.laptop.org/~greenfeld/temp/175bringup/results2-7dad6c1.tgz log with EC at S47]

== afa391a5 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=afa391a5 Revert "olpc-1.75: back off the hardware clock gating for MMC devices"]
*EC logged in S7
*test code to work around an EC communications race; needs further work, but this should hopefully keep things from hanging.
*testing by

== 9177e6a8 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=9177e6a8 olpc-1.75: back off the hardware clock gating for MMC devices]
*purpose: kill off the SET_BLOCK_COUNT errors seen by Quozl in the prior tests
*result of test: no change (the patch did not affect the outcome).
*testing by
**James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
*** C1 SKU201 passed, [http://dev.laptop.org/~quozl/kernel-testing/9177e6a8/s1/log.txt.gz ec] [http://dev.laptop.org/~quozl/kernel-testing/9177e6a8/s0/log.txt.gz host],
*** C1 SKU202 hung, after 11 minutes, SET_BLOCK_COUNT eMMC failure at kernel timestamp 1159.762854, [http://dev.laptop.org/~quozl/kernel-testing/9177e6a8/s2/log.txt.gz host]
*** B1 SKU199 hung, after 2.5 hours, SET_BLOCK_COUNT eMMC failure at kernel timestamp 7409.481875, [http://dev.laptop.org/~quozl/kernel-testing/9177e6a8/s3/log.txt.gz host]
*** B1 SKU199 hung,
*** B4 SKU199 hung, (connecting serial port afterwards did not show streaming eMMC messages),
*** B1 SKU198 hung.
**James Cameron, Q4C11, os20, single, dortc,
*** C1 SKU202 manually stopped early, had been needing keyboard wakeup,
*** B1 SKU199 manually stopped early,
*** B1 SKU199 manually stopped early, had been needing keyboard wakeup,
*** B4 SKU199 manually stopped early.
*purpose: test theory that battery state of charge changes are associated with hangs
**Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build, ''runin-battery test disabled''
*** 2xC1 SKU201 hung on resume after 10.5 hours [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/ host#3/ec#8 & host #6]
*** 1xC1 SKU202 hung after 11.75 hours due to disabling pm_async or keyboard events during runin [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/ host#4], possibly while resetting system #3 in front of it.
*** All systems then reset with pm_async disabled, 4 C1 SKU201 & 3 C1 SKU 202 total.
*** 1xC1 SKU202 hung citing phantom keyboard events followed by an illegal instruction, pm_async disabled [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/screenlog.0-keybThenIllegalOp.bz2]
*** 1xC1 SKU201 hung with MMC problems, pm_async disabled [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/screenlog.6-mmcfailure.bz2]
**Samuel Greenfeld, os23, 3 B1 SKU198 4 B1 SKU199, olpc-runin-tests-0.16.7-1 installed instead of bringup build, ''runin-battery test enabled, runin-camera, runin-wlan disabled, pm_async enabled''
*** 1xB1 SKU 199 failure to properly handle EC interrupt on resume [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/logs-SHC12900F77-111230_024931.tar.gz]
**James Cameron, Q4C11, os20, 10sec/10sec S/R runin, without ''runin-battery'', without battery inserted.
*** C1 SKU201 hung, after 67 minutes, [http://dev.laptop.org/~quozl/z/1RgTUF.txt host], C1 SKU202 hung, after 15 minutes, [http://dev.laptop.org/~quozl/z/1RgSMU.txt host], B1 SKU199 x 2, B4 SKU199 hung, after 90 minutes, [http://dev.laptop.org/~quozl/z/1RgThv.txt host tail], B1 SKU198.
*purpose: disable asynchronous S/R
*purpose: no-suspend-contention runin branch
** five units run to 100 suspend cycles, 2.5 hours each, no issues.


== 58360582 - arm-3.0-wip-wfi ==
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=58360582 Revert "olpc-ec-1-75: clean up cmd state locking and other things"]
*purpose: test the old EC driver across runin and s/r.
*result of test: no change (the patch did not affect the outcome).
*additional tests: against normal runin, and 10sec/10sec S/R runin.
*testing by
**Richard Smith, os21, C1 SKU201, three runs, pass.
**Richard Smith, os21, C1 SKU201, three runs, against two-stage runin, pass.
**James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
*** C1 SKU201 stopped, SOC display loss at EC timestamp 5380673 kernel timestamp 745.92 (same symptom as seen in [[#4239902]]), [http://dev.laptop.org/~quozl/kernel-testing/58360582/s1/log.1 ec] [http://dev.laptop.org/~quozl/kernel-testing/58360582/s0/log.1 host],
*** C1 SKU202 stopped, eMMC failure, [http://dev.laptop.org/~quozl/kernel-testing/58360582/s2/log.1.txt.gz host]
**James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
*** C1 SKU201 hung, [http://dev.laptop.org/~quozl/kernel-testing/58360582/s1/log.txt.gz ec] [http://dev.laptop.org/~quozl/kernel-testing/58360582/s0/log.txt.gz host],
*** C1 SKU202 hung, within 30 minutes, [http://dev.laptop.org/~quozl/kernel-testing/58360582/s2/log.txt.gz host]
*** B4 SKU199 hung, after two hours,
*** B1 SKU199 was going fine for 2:15, but manually stopped, [http://dev.laptop.org/~quozl/kernel-testing/58360582/s3/log.txt.gz host]
*** B1 SKU199 hung, after one hour,
*** B1 SKU198 hung.
**Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build
*** C1 SKU201 hung on resume after 21 cycles (10s on/10s suspend), [http://dev.laptop.org/~greenfeld/temp/os23-5836058/ host#3 ec#8]
*** C1 SKU201 hung on resume overnight (10s on/10s suspend), [http://dev.laptop.org/~greenfeld/temp/os23-5836058/ host#6]
*** C1 SKU202 EC communications failure overnight [http://dev.laptop.org/~greenfeld/temp/os23-5836058/ host#1 ec#7]
*** Three remaining C1 systems running default runin suspend cycle timings did not hang after 18 hours.


== 2d8e7cc - arm-3.0-wip-wfi ==
Line 26: Line 427:
*os: os20, ofw: q4c08 or q4c09, runin in build, runin-fscheck disabled, runin-battery disabled, runin-sus set to 10sec/10sec
*purpose, test looking for non-ec related hangs.
*result of test: no change (the patch did not affect the outcome).
**24hrs?
*testing by
**Samuel Greenfeld
***in progress, os21 not os20, B1 test bed, seven units, 15 hours so far as at 2011-12-24 11:21,
***Passed, os21 not os20, B1 test bed, seven units.
**James Cameron,
***C1 SKU201 [http://dev.laptop.org/~quozl/kernel-testing/2d8e7cc/s0/log host] [http://dev.laptop.org/~quozl/kernel-testing/2d8e7cc/s1/log ec], hung at 31 cycles 23:46:01 remaining,
Line 41: Line 442:
*os: os20, ofw: q4c08, runin in build, runin-sus disabled.
*purpose: verification of a build for manufacturing testing.
*result of test: success, the kernel is stable for runin testing in manufacturing if used without suspend and resume.
*testing by
**Samuel Greenfeld

Latest revision as of 00:35, 19 March 2012

Kernel developers, put your new kernel in a new section here. Latest at top. Testers place links to reports indented under the kernel.

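See also:

  • using runin for kernel testing
  • hints for using FIQ debugger kernels
  • structured problem tracking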


4ae137c7 - arm-3.0-ramp

  • 4ae137c7
  • libertas fix
  • testing by jnettlet
  • testing by James, on six units, with serial, with aggressive runin, passed 96 hours,
  • testing by James, as above, with also debug no_console_suspend, passed 96 hours,

d6a2a5eb - arm-3.0-ramp

  • d6a2a5eb aka os27,
  • James Cameron, started 2012-02-10 05:50 UTC, stopped some days later, q4d03ja, defconfig, plus patch (enable excessive serial port output, leave apb uart clocks active over suspend), (.tar.gz), olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os27, on eight units, looking for hangs, ... 50% fail within 12 hours, two instances of "mmp2_pm_finish: Done" as the last message, also hang #0, and a unit that hung without a serial console attached.
  • as above, started 2012-02-10 22:52 UTC, with kgdb disabled,
    • hang #0, shortly after "mmp2_pm_finish: Done", with evidence of log buffer writes beyond log_end,
    • hang #1, an SKU203 without serial console attached,
    • hang #2, an SKU204 without serial console attached,
    • hang #3, shortly after "mmp2_pm_finish: Done", an SKU201 with serial console, but watchdog failed to restart,
    • hang #4, an SKU204 without serial console attached,
    • hang #5, an SKU200 with serial console, but watchdog failed to restart,

61e93c20 - arm-3.0-ramp

  • 61e93c20
  • James Cameron, started 2012-02-07 09:53 UTC, stopped after 67.7 hours, q4d03, defconfig, (i.e. with serial driver disabled, with epitaph)(.tar.gz), olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os26, on eight units, looking for hangs, looking for 3.0.19 regressions, ... one hang on resume, hang #1,

61e0e9ee - arm-3.0-ramp

  • 61e0e9ee
  • os26
  • testing by
    • James Cameron, completed, q4d02jc, defconfig, with serial console disabled, without epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os26, on eight units, looking for hangs, ... over 96 hours, one hang on SKU204,

be5c532a - arm-3.0-wip

  • be5c532a
  • testing by
    • James Cameron, completed, q4d02jc, defconfig, with serial driver disabled, with epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive, on six units, looking for eMMC or other hangs, 113 hours fine, one instance of camera missing <trac>11595</trac> (which can only occur on boot), see dmesg
    • James Cameron, completed, as above, but with serial driver enabled, SKU200 hang #1 (during resume), hang #2 (during resume), SKU202 hang #3 (during suspend), and SKU201 hang during suspend or resume, ... all of which suggests that enabling the serial driver leads to hangs,
    • James Cameron, completed, as above, but with serial driver with DMA support, SKU200 hang #1 (during resume), B1 hang #2 (during resume), B4 hang #3 (during resume).

8f74291b - arm-3.0-wip

  • config: update for new graphics stuff + 3.0.17
  • EC logged in S7/S47 if available
  • os25 or another build with the new graphics ABI
  • purpose: Copy changes from 3.0.17, CMA, new graphics ABI to -wip branch
  • testing by
    • Samuel Greenfeld
      • 5xC1, 10s aggressive suspend w/ runin-camera test active
      • 1xC1 SKU201 TX FIFO not empty Ticket

073c7fbf - arm-3.0.17-wip

  • TTY: serial_core: Fix crash if DCD drop during suspend
  • EC logged S47 (if available)
  • os24 base image or one which does *not* have the new graphics ABI
  • purpose, test if fixes found in the 3.0.17 Linux kernel solve our issues
  • testing by
    • Samuel Greenfeld
      • 5xC1, all units passed 72 hours (3 cycles) with 10s suspend cycles (with clock skews ranging from 10 minutes backwards to 20 minutes forward after being set via ntp-set-clock prior to the test). System reboot order and overall duration varied from the order & time amount I recall starting the systems with. Testing was done without runin-camera, although looking at the kernel configuration it may be possible to turn it on again.

b8114c95 - arm-3.0-wip-wfi

  • b8114c95
  • os24 base image or one which does *not* have the new graphics ABI
  • disable runin-camera test
  • purpose, see if eMMC stack backport from 3.2 fixes eMMC hangs, for Andres.
  • testing by
    • Samuel Greenfeld
      • 5xC1, still hangs (as before) or reboots reporting filesystem issues, sampled logs here

5fef1609 - arm-3.0-wip-wfi

  • mmc: try reducing bus bit width in case of error
  • EC logged at S47
  • os24 base image or one which does *not* have the new graphics ABI
  • disable runin-camera test
  • purpose, look for EC, other lockups
  • aggressive suspend cycles
  • testing by
    • Samuel Greenfeld
      • 2xSKU 202, 3xSKU 201, all with serial consoles, (one of each with EC serial)
      • C1 SKU201 (#4) failure due to journal replay error, filesystem remounted readonly & runin reboot
      • Other systems all hung within a few hours, some multiple times after being restarted
      • Logs online here
    • James Cameron, with a local patch to disable serial and capture dmesg via CForth, one of the five systems has hung over 52 hours, a remarkably stable test result, the one hang was preceded by an EC driver message olpc-ec-1.75: SSP reports TX underrun then a WARNING, log
      • C1 SKU201, start 2012-01-20 04:12:26, end 2012-01-24 04:18:42, four days, no errors, post-run ntpdate correction -3333.884336 sec,
      • C1 SKU202, start 2012-01-20 04:12:28, end 2012-01-24 05:17:09, four days, one EC driver error, post-run ntpdate correction -534.727839 sec,
      • B1, start 2012-01-20 04:12:29, end 2012-01-24 04:19:15, four days, no errors, post-run ntpdate correction 926.420114 sec,
      • B1, start 2012-01-20 04:12:31, end 2012-01-24 04:18:43, four days, no errors, post-run ntpdate correction -2817.877176 sec,
      • B4, start 2012-01-20 04:12:32, end 2012-01-24 04:18:38, four days, no errors, post-run ntpdate correction 1661.545946 sec.
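
As a rough check on the post-run ntpdate corrections listed above, each correction can be divided by the elapsed run time to give an apparent clock error. This is only a sketch: it assumes the clock was accurate when the run started, which the log does not state, and `drift_ppm` is a hypothetical helper name, not part of any OLPC tool.

```python
from datetime import datetime

def drift_ppm(start, end, correction_sec):
    """Apparent clock error in parts per million over one run.

    start, end: timestamps in the form used in the log above.
    correction_sec: the post-run ntpdate correction in seconds.
    Assumes (not stated in the log) that the clock was correct at start.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    elapsed = (datetime.strptime(end, fmt)
               - datetime.strptime(start, fmt)).total_seconds()
    return correction_sec / elapsed * 1e6

# C1 SKU201 figures from the list above
c1_sku201 = drift_ppm("2012-01-20 04:12:26", "2012-01-24 04:18:42",
                      -3333.884336)
```

For the C1 SKU201 unit this comes to roughly -9600 ppm, i.e. a correction of about 1% of the elapsed time.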

b2845e62 - arm-3.0-wip-graphics (os25)

  • (pxa168fb) Fix HWCursor Resume logic
  • Use os25 to make this easiest to install
  • Q4D01 recommended
  • purpose: test possible ramp build
  • testing by
    • Samuel Greenfeld
      • 3xB1, 5xC2 running normal runin, completed 10 24-hour runin cycles with all units; should be attempting additional 24-hour cycles on units still available.

9a3a8436 - arm-3.0-wip-graphics

  • (galcore:marvell) Pass the device to DMA for reservations
  • Jon Nettleton's graphics RPM repo is required (ABI breakage)
  • purpose, test new graphics driver in runin with a somewhat ideally stable kernel
  • testing by
    • Samuel Greenfeld
      • 2xSKU 202, 2xSKU 201 passed 21 hours of testing with 10s suspend, ~3600 suspend cycles each
      • 1xSKU 201 hung after ~20 hours of testing with 10s suspend
      • 5xSKU 200 ran normal runin settings for 21 hours, no issues, ~58 suspend cycles each
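
The cycle counts above can be turned into an average time per suspend cycle, which makes the aggressive and normal runin settings easier to compare. A minimal sketch; `seconds_per_cycle` is a hypothetical helper, and the inputs are the approximate figures quoted in the list.

```python
def seconds_per_cycle(hours, cycles):
    """Average wall-clock seconds spent per suspend/resume cycle."""
    return hours * 3600 / cycles

aggressive = seconds_per_cycle(21, 3600)  # 10s suspend units, ~3600 cycles
normal = seconds_per_cycle(21, 58)        # normal runin units, ~58 cycles
```

With these figures the aggressive units average about 21 seconds per cycle, and the normal runin units about 1300 seconds (roughly 22 minutes) per cycle. If the 10s suspend setting means 10 seconds awake plus 10 seconds suspended, as in the 10sec/10sec runs elsewhere on this page (an assumption), 21 seconds is close to the nominal 20.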

daf43181 - arm-3.0-wip-wfi

  • TTY: serial_core: Fix crash if DCD drop during suspend
  • make sure no_console_suspend is enabled
  • see if this helps or fixes serial hangs
  • testing by
    • James Cameron, os24 remastered with testconfig kernel, without no_console_suspend, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera),
    • Samuel Greenfeld
    • James Cameron, local CForth build, os24 remastered with testconfig kernel, without no_console_suspend, with dmesg buffer locating patch, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera), hang (irq 20), hang (eMMC), hang (irq 20), hang (during suspend), summary, over 24 hours, over 5 units, had 5 hangs relating to eMMC, and 10 hangs relating to irq 20.
    • James Cameron, os24 remastered with defconfig kernel, without console=ttyS2,115200, as above, eight units passed 26 hours.

fc631432 - arm-3.0-wip-wfi

  • testconfig: reenable no_console_suspend
  • EC logged in S7 or S47.
  • Disable runin camera test (disabled mmp_camera driver)
  • purpose, gather more information about why we hang, since removing no_console_suspend led to hanging significantly less often.
  • testing by
    • Samuel Greenfeld
      • Seems to hang on all test systems within 2 hours with 10s suspend cycles (multiple failure modes, including #11528), possibly aggravated by sending serial data to the XO during the suspend/resume process. But failures have also been seen after rebooting systems and ignoring them.

cfb3aa76 - arm-3.0-wip-wfi

  • cfg80211: amend regulatory NULL dereference fix
  • EC logged in S7 or S47. Do not use no_console_suspend.
  • purpose: fixes a wireless oops; still need to verify TxFIFO sanity check patch
  • testing by
    • Samuel Greenfeld, 5 C1s, 1 B1, 10s aggressive suspend cycles
      • SKU202 #0 failed 0:8:26 (hours:minutes:seconds) remaining after 3139 cycles power LED on full charge seen
      • SKU199 #6 failed 0:53:10 (hours:minutes:seconds) remaining 2990 cycles power LED on battery LED off (discharging)
      • SKU201 #4 failed second run 22:51 (hours:mins) remaining after 441 suspend cycles
      • SKU202 #0 failed second run 21:31 (hours:mins) remaining 322 cycles
      • SKU202 #1 failed second run 18:36 (hours:mins) remaining 689 cycles
      • SKU199 #6 failed second run 13:29 (hours:mins) remaining 1354 cycles
      • SKU201 #4 failed third run 20:08 (hours:mins) remaining 510 cycles
      • SKU202 #0 failed third run 18:42 (hours:mins) remaining 1694 cycles
      • Logs available here

be581eb4 - arm-3.0-wip-wfi

  • testconfig: turn off no_console_suspend
  • EC logged in S7 or S47, aggressive suspending in runin
  • do not use no_console_suspend, use same configuration as caused hangs previously,
  • purpose: Verify the 'olpc-ec: do a full reset if the TxFIFO sanity check fails' patch.
  • testing by
    • James Cameron, os24 remastered with kernel, C1 SKU201 passed, C1 SKU202 hung (log), B1 hung, (log).

f1f8f7fe - arm-3.0-wip, EC 4.01

  • git hash,
  • rpm,
  • build,
  • do not use no_console_suspend, include light sensor test
  • make sure to manually upgrade the EC firmware to cl2-4_0_4_01
  • purpose, test if actually getting the EC to be quiet during suspend improves things
  • testing by
    • John Watlington:
      • SKU201 (3) 24hr cycles, 600+ cycles ongoing;
      • SKU202 (3) 24hr cycles, 600+ cycles ongoing
      • (12x) SKU198/199/just plain weird, ongoing
      • (10x) SKU200, ongoing;
      • (4x) SKU201, ongoing;
      • (2x) SKU202, ongoing;
      • (2x) SKU203, ongoing;
      • (2x) SKU204, ongoing;

f1f8f7fe - arm-3.0-wip

  • git hash,
  • rpm,
  • build,
  • do not use no_console_suspend, use same configuration as caused hangs previously,
  • purpose, test if debug console messages may cause hangs,
  • testing by
    • Jon Nettleton, C1, 13:20 hours / 1966 cycles ongoing, os20,
    • James Cameron,
      • C1 SKU201, C1 SKU202, B1, all passed,
      • C1 SKU201, C1 SKU202, B1, B1, B4, with os24, all passed.
    • John Watlington, with light-sensor test:
      • C1 SKU201 (2) 24hr loop + 2000 cycles stopped;
      • C1 SKU202 (2) 24hr loop + 2100 cycles stopped
      • (10x) SKU198/199/just plain weird, (2) 24hr loop, stopped
      • (8x) SKU200, (2) 24 hr loop, stopped;
      • (3x) SKU201, (2) 24 hr loop, stopped;
      • SKU202, (2) 24hr loop, stopped;
      • SKU198, hang in first 24 hrs, all hangs either screen blank w. power light on or turned off.
      • SKU199, hang in first 24 hrs
      • (2x) SKU200, hang in first 24 hrs
      • SKU201, hang in first 24 hrs
      • SKU202, hang in first 24 hrs
    • Samuel Greenfeld, Q4C12, patched 0.17.0 runin to support kernel RPM version
      • C1 SKU202 failure after 1805 cycles (cause unclear but EC locked up) #11571
      • C1 SKU202 failure after 3455 cycles (EC transmission queue full followed by null pointer dereference) #11573
      • B1 SKU199 failure after 2062 cycles (cause unknown; no serial logs available) #11572
      • 3xC1 SKU201, 2xB1 SKU 198, 2xB1 SKU 199 passed 24 hours of testing.
      • All 10 systems mentioned above re-imaged to os24 with normal 24-hour runin settings; passed.
    • Chia-Hsiu, 6 x C2, 2 x C1, using os24, ongoing
      • one 24 run completed successfully
    • C2 SKU200 x7 @ Miami office using os24 - ongoing
      • 20 hrs with >50 s/r cycles

1220526f - arm-3.0-wip-wfi

  • debug: add debugging to workqueues, irqs, etc
  • test dependencies: save your System.map from your build
  • purpose: attempt to track down hangs to specific driver callbacks
  • testing by
    • James Cameron, three units, testconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog for first run only, 0 to pm_async,
      • C1 SKU201, hang 1, log, hang 2, [http://dev.laptop.org/~quozl/z/1RkTrN.txt log], hang 3, a repeating invalid page fault with storage LED blinking, FIQ entry but insane, log, fiq, hang 4, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 5, in resume, log, stopped.
      • C2 SKU202, hang 1, with spontaneous kdb entry, and a backtrace, log, hang 2, log, hang 3, not with the workqueue debugging active at the time, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 4, now in EARLY resume, log, stopped.
      • B1, hang 1, log, hang 2, log, hang 3, log, stopped.
      • B1 (pass), B4 (pass), B1 (pass), with -ENODEV return from serial_pxa_init(), and no calls to mmp2_add_uart(), ongoing, 3483 cycles, 23 hours,

0a896953 - arm-3.0-wip-wfi

  • olpc-ec: consolidate code and ensure TxFIFO is empty before writing to it
  • EC logged in S7, runin 0.17.0 or 0.16.7 (not C2 builds) if possible
  • Past kernels tested briefly in the past ~12 hours suggest that the fixes in said kernels may fix all known easily reproducible hangs. Need to prove/disprove this.
  • If you have enough systems, try using stock runin 0.17.0-style settings as well as anything known to cause particular types of failures. (EC failures seem to be common within a few hours with 10s on/10s off suspend, only using runin-main/gtk/common/tests/sus/battery.)
  • testing by
    • James Cameron, six units, defconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog, 0 to pm_async, started around Sat Jan 7 08:24:00 UTC 2012,
      • C1 SKU201, 534 tests, 530 pass, 4 fail,
      • C1 SKU202, 2777 tests, 2711 pass, 1 WARNING at olpc_ec_1_75_cmd (log), 66 fail,
      • one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
      • one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
    • Samuel Greenfeld, five C1 units, runin 0.16.7 setup like 0.17 (light sensor test disabled), 0 to pm_async
      • one C1 SKU201 hang during resume, normal runin settings, see host log
      • one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing, see host log ec. The unit was discovered after it had powered off due to critical battery failure (runin discharge test).
      • one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing + cl2-4_0_3_07rsmith-586 EC code (S47 debug), see host log ec log.
      • one B1 SKU199 hang during resume, normal runin settings with 10s suspend timing, see host log.
    • Jon Nettleton

ebf24ea6 - arm-3.0-wip-wfi

196c2f806 - arm-3.0-wip-wfi

  • http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=196c2f80
  • testing by
    • James Cameron, with a local debugging patch, saw two instances of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201 and C1 SKU202, two test units with full runin 0sec/3sec (extreme) suspend and resume failed to show any hangs, currently going are five test units with full runin 10sec/10sec (aggressive) suspend and resume.
      • C1 SKU202 reported EC related kernel problems, see log, should be fixed in later kernels.

46e079fe - arm-3.0-wip-wfi

  • http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=46e079fe
  • EC logged in S7
  • Should fix the hang caused by suspend being aborted due to a wakeup event. May also help with other IRQ hangs as I have fixed a previous mistake I made when moving the audio island code.
  • Testing aborted suspends can be done by running this command and then hitting the keyboard right away.

echo $(cat /sys/power/wakeup_count) > /sys/power/wakeup_count && echo mem > /sys/power/state

  • testing by
    • James Cameron, with a local debugging patch, saw one instance of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201, see log tail, and several instances of the hang in dpm_resume().
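
The aborted-suspend test command in this section can also be written as a small script, which makes the two steps explicit. This is a sketch under assumptions: it relies on the standard Linux behaviour that writing a stale value to `wakeup_count` fails, and `try_suspend` and its `power_dir` parameter are hypothetical names introduced here so the logic can be tried against an ordinary directory.

```python
from pathlib import Path

def try_suspend(power_dir="/sys/power"):
    """Read wakeup_count, write it back, then request suspend to RAM.

    Returns False if any step fails, for example because a wakeup
    event (such as a key press) arrived and the kernel rejected the
    write, and True otherwise.
    """
    power = Path(power_dir)
    try:
        count = (power / "wakeup_count").read_text().strip()
        # Fails if wakeup events were registered since the read above.
        (power / "wakeup_count").write_text(count)
        # On a real system this write blocks until the unit resumes.
        (power / "state").write_text("mem")
    except OSError:
        return False
    return True
```

On a real unit this needs root, and a False return after hitting the keyboard right away is the aborted-suspend case the kernel change is meant to handle.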

bfc1b92b - arm-3.0-wip-wfi

b7f22e1d - arm-3.0-wip-wfi

10ebd28f - arm-3.0-wip-wfi

ae48be89 - arm-3.0-wip-wfi

  • a tar.gz by James,
  • testing by
    • Richard?
    • James Cameron, 10sec/10sec, with audio disabled, patch,

5ba0b446 - arm-3.0-wip-wfi

  • olpc-ec: allow unknown commands to be executed
  • test dependencies: EC logging (S7), kernel logging with EC debug enabled ("echo 1 > /sys/module/olpc_ec_1_75/parameters/ec_debug"), pm_async disabled.
  • purpose: improve EC driver; fewer (or no?) races/crashes..
  • testing by
    • Samuel Greenfeld
      • C1 SKU 201 EC communications failure (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin. host ec

ff199462 - arm-3.0-wip-wfi

  • olpc-ec: ensure gpio cmd is left low if something screws up
  • EC logged in S7
  • Still has FIQ debugger
  • Should help reset EC bus to avoid subsequent failures after the first command fails.
  • testing by
    • Samuel Greenfeld
      • C1 hang on resume (#3), OLS runin disabled, Q4C11 normal EC code host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done ... 72.616ms ... mmp-camera mmp-camera.0: resume) [1]
      • C1 hang on resume (#6), OLS runin disabled, Q4C11 normal EC code
      • C1 hang on resume (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin host ec

3d4cf36c - arm-3.0-wip-wfi

  • mmp2_fiq_debugger.c add file missing from other merge
  • EC logged in S7
  • This enables the FIQ debugger. Reproduce kernel hangs, then from a serial console send a Break, which should drop you to a debug prompt. Run bt to see what is going on.
  • testing by
    • James Cameron, 10sec/10sec,
      • C1 SKU201 hang at mmp-camera mmp-camera.0: resume, host,
      • C1 SKU202 hang at mmp-camera mmp-camera.0: resume, host,
      • C1 SKU202 hang at mmp-camera mmp-camera.0: resume, host,
      • no response to BREAK,
    • James Cameron, 0sec/3sec, FIQ was verified as working before starting,
      • C1 SKU201 host (known audio problem that doesn't need to be reported again),
      • C1 SKU202 host (known audio problem),
      • B4, and two B1 hung, no response to BREAK.

26f404e3 - arm-3.0-wip-wfi

  • Revert "olpc-ec: don't process/ack packets when there's an underrun error"
  • EC logged in S7
  • further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf. Test that audio pop during suspend/resume is gone from jnettlet.
  • testing by
    • Samuel Greenfeld
      • 3xC1 running os23 with olpc-runin-tests-0.16.7-1 installed instead of bringup build, Q4C11, all tests (2 with EC serial), 3xC1 running with battery test disabled. Test in progress.
        • 1xC1 SKU201 (#6) running aggressive (10s on/10s off) suspend failed after 15 minutes during resume, power & full battery LEDs on, all other LEDs off. All tests except the battery test were running on this unit at the time of failure. host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done)
        • 1xC1 SKU201 (#3) running aggressive (10s on/10s off) suspend failed at a point TBD with EC communications failure and the eMMC root filesystem remounted read-only. It continued suspend & resume testing after the failure point(s). host ec
    • James Cameron
      • C1 C1 B4 B1 B1 B1, os20, runin 0.16.7, 10sec/10sec, hangs still occur one, two, audio pops during suspend and resume are still present.

7dad6c10 - arm-3.0-wip-wfi (BROKEN)

afa391a5 - arm-3.0-wip-wfi

9177e6a8 - arm-3.0-wip-wfi

  • olpc-1.75: back off the hardware clock gating for MMC devices
  • purpose: kill off the SET_BLOCK_COUNT errors seen by Quozl in the prior tests
  • result of test: no change (the patch did not affect the outcome).
  • testing by
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 passed, ec host,
      • C1 SKU202 hung, after 11 minutes, SET_BLOCK_COUNT eMMC failure at kernel timestamp 1159.762854, host
      • B1 SKU199 hung, after 2.5 hours, SET_BLOCK_COUNT eMMC failure at kernel timestamp 7409.481875, host
      • B1 SKU199 hung,
      • B4 SKU199 hung, (connecting serial port afterwards did not show streaming eMMC messages),
      • B1 SKU198 hung.
    • James Cameron, Q4C11, os20, single, dortc,
      • C1 SKU202 manually stopped early, had been needing keyboard wakeup,
      • B1 SKU199 manually stopped early,
      • B1 SKU199 manually stopped early, had been needing keyboard wakeup,
      • B4 SKU199 manually stopped early.
  • purpose: test theory that battery state of charge changes are associated with hangs
    • Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test disabled
      • 2xC1 SKU201 hung on resume after 10.5 hours host#3/ec#8 & host #6
      • 1xC1 SKU202 hung after 11.75 hours due to disabling pm_async or keyboard events during runin host#4, possibly while resetting system #3 in front of it.
      • All systems then reset with pm_async disabled, 4 C1 SKU201 & 3 C1 SKU 202 total.
      • 1xC1 SKU202 hung citing phantom keyboard events followed by an illegal instruction, pm_async disabled [2]
      • 1xC1 SKU201 hung with MMC problems, pm_async disabled [3]
    • Samuel Greenfeld, os23, 3 B1 SKU198 4 B1 SKU199, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test enabled, runin-camera, runin-wlan disabled, pm_async enabled
      • 1xB1 SKU 199 failure to properly handle EC interrupt on resume [4]
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin, without runin-battery, without battery inserted.
      • C1 SKU201 hung, after 67 minutes, host, C1 SKU202 hung, after 15 minutes, host, B1 SKU199 x 2, B4 SKU199 hung, after 90 minutes, host tail, B1 SKU198.
  • purpose: disable asynchronous S/R
  • purpose: no-suspend-contention runin branch
    • five units run to 100 suspend cycles, 2.5 hours each, no issues.

58360582 - arm-3.0-wip-wfi

  • Revert "olpc-ec-1-75: clean up cmd state locking and other things"
  • purpose: test the old EC driver across runin and s/r.
  • result of test: no change (the patch did not affect the outcome).
  • additional tests: against normal runin, and 10sec/10sec S/R runin.
  • testing by
    • Richard Smith, os21, C1 SKU201, three runs, against two-stage runin, pass.
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 stopped, SOC display loss at EC timestamp 5380673 kernel timestamp 745.92 (same symptom as seen in #4239902), ec host,
      • C1 SKU202 stopped, eMMC failure, host
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 hung, ec host,
      • C1 SKU202 hung, within 30 minutes, host
      • B4 SKU199 hung, after two hours,
      • B1 SKU199 was going fine for 2:15, but manually stopped, host
      • B1 SKU199 hung, after one hour,
      • B1 SKU198 hung.
    • Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build
      • C1 SKU201 hung on resume after 21 cycles (10s on/10s suspend), host#3 ec#8
      • C1 SKU201 hung on resume overnight (10s on/10s suspend), host#6
      • C1 SKU202 EC communications failure overnight host#1 ec#7
      • Three remaining C1 systems running default runin suspend cycle timings did not hang after 18 hours.

2d8e7cc - arm-3.0-wip-wfi

  • sdhci: ignore interrupts received after suspend
  • zImage-2d8e7cc-wip-wfi
  • os: os20, ofw: q4c08 or q4c09, runin in build, runin-fscheck disabled, runin-battery disabled, runin-sus set to 10sec/10sec
  • purpose, test looking for non-ec related hangs.
  • result of test: no change (the patch did not affect the outcome).
  • testing by
    • Samuel Greenfeld
      • Passed, os21 not os20, B1 test bed, seven units.
    • James Cameron,
      • C1 SKU201 host ec, hung at 31 cycles 23:46:01 remaining,
      • C1 SKU202 host, hung at 36 cycles 23:43:41 remaining,
      • B1 SKU199 host, hung at 1737 cycles 11:25:02 remaining,
      • B4 SKU199, hung at 398 cycles 12:03:29 remaining,

6f125d7 - arm-3.0-wip

  • sdhci: ignore interrupts received after suspend (same commit comment but different git branch than above)
  • zImage-6f125d7-wip
  • os: os20, ofw: q4c08, runin in build, runin-sus disabled.
  • purpose: verification of a build for manufacturing testing.
  • result of test: success, the kernel is stable for runin testing in manufacturing if used without suspend and resume.
  • testing by
    • Samuel Greenfeld
      • Four C1 SKU 201, Three C1 SKU202 passed 24 hr testing
    • James Cameron,
      • C1 SKU201 host ec, passed,
      • C1 SKU202 host, passed.
  • was added to build os23.
  • os23 testing by
    • James Cameron, C1 SKU201, C1 SKU202, B4, B1, B1, B1, all passed 24 hr testing.

4239902

  • os20 q4c09 runin 0.16.7 10sec 10sec 24hrs
  • testing by
    • James Cameron, fail, one unit hung, all units lost EC communications (blank SOC display, hang after runin pass),