XO-1.75/Kernel/Testing: Difference between revisions
Kernel developers, put your new kernel in a new section here. Latest at top. Testers place links to reports indented under the kernel. |
See also: |
|||
*[[XO-1.75/Kernel/Runin|using runin for kernel testing]] |
|||
*[[XO-1.75/Kernel/FIQ|hints for using FIQ debugger kernels]] |
|||
*[[XO-1.75/Kernel/Issues|structured problem tracking]] |
|||
<!--
and add after this html comment
-->
== 4ae137c7 - arm-3.0-ramp == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-ramp&id=4ae137c791df043b323444bb3c0f2ae8b6adf067 4ae137c7] |
|||
*libertas fix |
|||
*testing by jnettlet |
|||
*testing by James, on six units, with serial, with aggressive runin, passed 96 hours, |
|||
*testing by James, as above, with also ''debug no_console_suspend'', passed 96 hours, |
|||
== d6a2a5eb - arm-3.0-ramp == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-ramp&id=d6a2a5eb1932d9c742fd355c161533dd8bffbdba d6a2a5eb] aka os27, |
|||
*James Cameron, started 2012-02-10 05:50 UTC, stopped some days later, q4d03ja, defconfig, plus [http://dev.laptop.org/~quozl/y/1RvjQc.txt patch] (enable excessive serial port output, leave apb uart clocks active over suspend), ([http://dev.laptop.org/~quozl/kernel-testing/d6a2a5eb.tar.gz .tar.gz]), olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os27, on eight units, looking for hangs, ... 50% fail within 12 hours, two instances of "mmp2_pm_finish: Done" as the last message, also [http://dev.laptop.org/~quozl/z/1Rvz1M.txt hang #0], and a unit that hung without a serial console attached. |
|||
*as above, started 2012-02-10 22:52 UTC, with [http://dev.laptop.org/~dilinger/pxa3.patch kgdb disabled], |
|||
**[http://dev.laptop.org/~quozl/z/1Rw3ec.txt hang #0], shortly after "mmp2_pm_finish: Done", with evidence of log buffer writes beyond log_end, |
|||
**hang #1, an SKU203 without serial console attached, |
|||
**hang #2, an SKU204 without serial console attached, |
|||
**[http://dev.laptop.org/~quozl/z/1RwmtV.txt hang #3], shortly after "mmp2_pm_finish: Done", an SKU201 with serial console, but watchdog failed to restart, |
|||
**hang #4, an SKU204 without serial console attached, |
|||
**[http://dev.laptop.org/~quozl/z/1Rwmz9.txt hang #5], an SKU200 with serial console, but watchdog failed to restart, |
|||
== 61e93c20 - arm-3.0-ramp == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-ramp&id=61e93c20684696b7464121a6a087fa6ca9895f6f 61e93c20] |
|||
*James Cameron, started 2012-02-07 09:53 UTC, stopped after 67.7 hours, q4d03, defconfig, (i.e. with serial driver disabled, with epitaph)([http://dev.laptop.org/~quozl/kernel-testing/3b3820f2-epitaph-nottys2.tar.gz .tar.gz]), olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os26, on eight units, looking for hangs, looking for 3.0.19 regressions, ... one hang on resume, [http://dev.laptop.org/~quozl/z/1Rvhmu.txt hang #1], |
|||
== 61e0e9ee - arm-3.0-ramp == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-ramp&id=61e0e9eec5c2315cc6b4bd855e5bfe9b53178361 61e0e9ee] |
|||
*os26 |
|||
*testing by |
|||
**James Cameron, completed, q4d02jc, defconfig, with serial console disabled, without epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os26, on eight units, looking for hangs, ... over 96 hours, one hang on SKU204, |
|||
== be5c532a - arm-3.0-wip == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip&id=be5c532a4119a9388acf0c4dab21d590c06c27be be5c532a] |
|||
*testing by |
|||
**James Cameron, completed, q4d02jc, defconfig, with serial driver disabled, with epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive, on six units, looking for eMMC or other hangs, 113 hours fine, one instance of camera missing <trac>11595</trac> (which can only occur on boot), see [http://dev.laptop.org/~quozl/z/1Rpupk.txt dmesg] |
|||
**James Cameron, completed, as above, but with serial driver enabled, SKU200 [http://dev.laptop.org/~quozl/z/1RrnFf.txt hang #1] (during resume), [http://dev.laptop.org/~quozl/z/1RrzSa.txt hang #2] (during resume), SKU202 [http://dev.laptop.org/~quozl/z/1RrzR3.txt hang #3] (during suspend), and SKU201 hang during suspend or resume, ... all of which suggests that enabling the serial driver leads to hangs, |
|||
**James Cameron, completed, as above, but with serial driver with DMA support, <!-- local branch, version 3.0.17-01203-g60273d9 --> SKU200 [http://dev.laptop.org/~quozl/z/1RsAaG.txt hang #1] (during resume), B1 [http://dev.laptop.org/~quozl/z/1RsMMI.txt hang #2] (during resume), B4 [http://dev.laptop.org/~quozl/z/1RsMP9.txt hang #3] (during resume). |
|||
== 8f74291b - arm-3.0-wip == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip&id=8f74291be31f0e29051e9a4acaf6155c537f5af8 config: update for new graphics stuff + 3.0.17] |
|||
*EC logged in S7/S47 if available |
|||
*os25 or another build with the new graphics ABI |
|||
*purpose: Copy changes from 3.0.17, CMA, new graphics ABI to -wip branch |
|||
*testing by |
|||
**Samuel Greenfeld |
|||
***5xC1, 10s aggressive suspend w/ runin-camera test active |
|||
***1xC1 SKU201 TX FIFO not empty [http://dev.laptop.org/ticket/11594 Ticket] |
|||
== 073c7fbf - arm-3.0.17-wip == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0.17-wip&id=073c7fbfc38525105f4882cf8fb7cafddd787670 TTY: serial_core: Fix crash if DCD drop during suspend] |
|||
*EC logged S47 (if available) |
|||
*os24 base image or one which does *not* have the new graphics ABI |
|||
*purpose, test if fixes found in the 3.0.17 Linux kernel solve our issues |
|||
*testing by |
|||
**Samuel Greenfeld |
|||
***5xC1, all units passed 72 hours (3 cycles) with 10s suspend cycles (with clock skews ranging from 10 minutes backwards to 20 minutes forward after being set via ntp-set-clock prior to the test). The order in which the systems rebooted, and the overall duration, differed from the order and times at which I recall starting them. Testing was done without runin-camera, although looking at the kernel configuration it may be possible to turn it on again.
|||
== b8114c95 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=b8114c95e6d369385bc28c6dacc8d9e1cb427262 b8114c95] |
|||
*os24 base image or one which does *not* have the new graphics ABI |
|||
*disable runin-camera test |
|||
*purpose, see if eMMC stack backport from 3.2 fixes eMMC hangs, for Andres. |
|||
*testing by |
|||
**Samuel Greenfeld |
|||
***5xC1, still hangs (as before) or reboots reporting filesystem issues, [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-b8114c9/ sampled logs here] |
|||
== 5fef1609 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=5fef160929d2ae18b8212a285e193b4510128469 mmc: try reducing bus bit width in case of error] |
|||
*EC logged at S47 |
|||
*os24 base image or one which does *not* have the new graphics ABI |
|||
*disable runin-camera test |
|||
*purpose, look for EC, other lockups |
|||
*aggressive suspend cycles |
|||
*testing by |
|||
**Samuel Greenfeld |
|||
***2xSKU 202, 3xSKU 201, all with serial consoles, (one of each with EC serial) |
|||
***C1 SKU201 (#4) failure due to journal replay error, filesystem remounted readonly & runin reboot |
|||
***Other systems all hung within a few hours, some multiple times after being restarted |
|||
***Logs online [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-5fef1602/ here] |
|||
**James Cameron, with a [http://dev.laptop.org/~quozl/z/1Ro5xF.txt local patch to disable serial and capture dmesg via CForth], one of the five systems has hung over 52 hours, a remarkably stable test result, the one hang was preceded by an EC driver message ''olpc-ec-1.75: SSP reports TX underrun'' then a WARNING, [http://dev.laptop.org/~quozl/z/1RosBj.txt log]
|||
*** C1 SKU201, start 2012-01-20 04:12:26, end 2012-01-24 04:18:42, four days, no errors, post-run ntpdate correction -3333.884336 sec, |
|||
*** C1 SKU202, start 2012-01-20 04:12:28, end 2012-01-24 05:17:09, four days, one EC driver error, post-run ntpdate correction -534.727839 sec, |
|||
*** B1, start 2012-01-20 04:12:29, end 2012-01-24 04:19:15, four days, no errors, post-run ntpdate correction 926.420114 sec, |
|||
*** B1, start 2012-01-20 04:12:31, end 2012-01-24 04:18:43, four days, no errors, post-run ntpdate correction -2817.877176 sec, |
|||
*** B4, start 2012-01-20 04:12:32, end 2012-01-24 04:18:38, four days, no errors, post-run ntpdate correction 1661.545946 sec. |
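The post-run ntpdate corrections above can be compared between units by expressing each as an apparent drift rate. A minimal sketch, assuming the correction is the total clock error accumulated between the listed start and end times; the function name <code>drift_ppm</code> is made up for this illustration, and the elapsed time is worked out by hand from the timestamps:

```shell
# drift_ppm CORRECTION_SECONDS ELAPSED_SECONDS
# Prints the correction as parts per million of the elapsed time.
# Hypothetical helper, not part of olpc-runin-tests.
drift_ppm() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.0f\n", c / t * 1000000 }'
}

# C1 SKU201: 2012-01-20 04:12:26 to 2012-01-24 04:18:42 is 345976 seconds.
drift_ppm -3333.884336 345976
```

For the C1 SKU201 this prints -9636, a little under one percent of the run time.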
|||
== b2845e62 - arm-3.0-wip-graphics (os25) == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-graphics&id=b2845e62ac6803e0e1ef760ee850e0f7c747349a (pxa168fb) Fix HWCursor Resume logic] |
|||
*Use os25 to make this easiest to install |
|||
*Q4D01 recommended |
|||
*purpose: test possible ramp build |
|||
*testing by |
|||
**Samuel Greenfeld |
|||
***3xB1, 5xC2 running normal runin, completed 10 24-hour runin cycles with all units; should be attempting additional 24-hour cycles on units still available. |
|||
== 9a3a8436 - arm-3.0-wip-graphics == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-graphics&id=9a3a84361e24886b34a0134fc18632bf86bb62cd (galcore:marvell) Pass the device to DMA for reservations] |
|||
*Jon Nettleton's graphics RPM repo is required (ABI breakage) |
|||
*purpose, test new graphics driver in runin with a somewhat ideally stable kernel |
|||
*testing by |
|||
**Samuel Greenfeld |
|||
***2xSKU 202, 2xSKU 201 passed 21 hours of testing with 10s suspend, ~3600 suspend cycles each |
|||
***1xSKU 201 hung after ~20 hours of testing with 10s suspend |
|||
***5xSKU 200 ran normal runin settings for 21 hours, no issues, ~58 suspend cycles each |
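As a sanity check on the cycle counts above, the average time per suspend cycle is the run time divided by the number of cycles. A small sketch; the helper name <code>seconds_per_cycle</code> is hypothetical and the inputs are the approximate figures from the list:

```shell
# seconds_per_cycle HOURS CYCLES
# Prints the average number of seconds spent per suspend/resume cycle.
seconds_per_cycle() {
    awk -v h="$1" -v n="$2" 'BEGIN { printf "%.1f\n", h * 3600 / n }'
}

seconds_per_cycle 21 3600   # units on 10s suspend
seconds_per_cycle 21 58     # units on normal runin settings
```

This gives about 21 seconds per cycle for the 10s suspend units and roughly 22 minutes per cycle for the units on normal runin settings.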
|||
== daf43181 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=daf43181 TTY: serial_core: Fix crash if DCD drop during suspend] |
|||
*make sure no_console_suspend is enabled |
|||
*see if this helps or fixes serial hangs |
|||
*testing by |
|||
**James Cameron, os24 remastered with testconfig kernel, without no_console_suspend, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera), |
|||
*** B1 [http://dev.laptop.org/~quozl/z/1RnKRg.txt hang], [http://dev.laptop.org/~quozl/z/1RnNVP.txt hang], |
|||
*** B4 [http://dev.laptop.org/~quozl/z/1RnL3p.txt runin restart due to filesystem remount read-only]. |
|||
**Samuel Greenfeld |
|||
*** C1 SKU 202 [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-daf4318/screenlog.0-mmcfail1.bz2 mmc communication problem (no reboot by runin)] |
|||
*** C1 SKU 202 [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-daf4318/screenlog.1.bz2 hang] |
|||
*** C1 SKU 201 [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-daf4318/screenlog.3.bz2 hang] |
|||
*** C1 SKU 201 [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-daf4318/screenlog.4.bz2 hang] |
|||
*** C2 hang followed by power off; no serial connection |
|||
** James Cameron, local CForth build, os24 remastered with testconfig kernel, without no_console_suspend, with dmesg buffer locating patch, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera), [http://dev.laptop.org/~quozl/z/1Rnh17.txt hang] (irq 20), [http://dev.laptop.org/~quozl/z/1Rnh6v.txt hang] (eMMC), [http://dev.laptop.org/~quozl/z/1RniFg.txt hang] (irq 20), [http://dev.laptop.org/~quozl/z/1Ro1mn.txt hang] (during suspend), summary, over 24 hours, over 5 units, had 5 hangs relating to eMMC, and 10 hangs relating to irq 20. |
|||
** James Cameron, os24 remastered with defconfig kernel, without console=ttyS2,115200, as above, eight units passed 26 hours. |
|||
== fc631432 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=fc63143216c68cdbb6ca44201dc496a8842f2dab testconfig: reenable no_console_suspend] |
|||
*EC logged in S7 or S47. |
|||
*Disable runin camera test (disabled mmp_camera driver) |
|||
*purpose, gather more information about why we hang, since removing no_console_suspend led to hanging significantly less often. |
|||
*testing by |
|||
**Samuel Greenfeld |
|||
***Seems to hang on all test systems within 2 hours with 10s suspend cycles (multiple failure modes, including #11528), possibly aggravated by sending serial data to the XO during the suspend/resume process. But failures have also been seen after rebooting systems and ignoring them. |
|||
== cfb3aa76 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=cfb3aa76 cfg80211: amend regulatory NULL dereference fix] |
|||
*EC logged in S7 or S47. Do not use no_console_suspend. |
|||
*purpose: fixes a wireless oops; still need to verify TxFIFO sanity check patch |
|||
*testing by |
|||
**Samuel Greenfeld, 5 C1s, 1 B1, 10s aggressive suspend cycles |
|||
***SKU202 #0 failed 0:8:26 (hours:minutes:seconds) remaining after 3139 cycles power LED on full charge seen |
|||
***SKU199 #6 failed 0:53:10 (hours:minutes:seconds) remaining 2990 cycles power LED on battery LED off (discharging) |
|||
***SKU201 #4 failed second run 22:51 (hours:mins) remaining after 441 suspend cycles |
|||
***SKU202 #0 failed second run 21:31 (hours:mins) remaining 322 cycles |
|||
***SKU202 #1 failed second run 18:36 (hours:mins) remaining 689 cycles |
|||
***SKU199 #6 failed second run 13:29 (hours:mins) remaining 1354 cycles |
|||
***SKU201 #4 failed third run 20:08 (hours:mins) remaining 510 cycles |
|||
***SKU202 #0 failed third run 18:42 (hours:mins) remaining 1694 cycles |
|||
***Logs available [http://dev.laptop.org/~greenfeld/temp/175bringup/os24-cfb3aa7/ here] |
|||
== be581eb4 - arm-3.0-wip-wfi ==
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=be581eb4c2f2e1168d34df907f318d512095db96 testconfig: turn off no_console_suspend] |
|||
*EC logged in S7 or S47, aggressive suspending in runin |
|||
*do not use no_console_suspend, use same configuration as caused hangs previously, |
|||
*purpose: Verify the 'olpc-ec: do a full reset if the TxFIFO sanity check fails' patch. |
|||
*testing by |
|||
** James Cameron, os24 remastered with kernel, C1 SKU201 passed, C1 SKU202 hung ([http://dev.laptop.org/~quozl/z/1RlvVN.txt log]), B1 hung, ([http://dev.laptop.org/~quozl/z/1RlvVz.txt log]). |
|||
== f1f8f7fe - arm-3.0-wip, EC 4.01 ==
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip&id=f1f8f7fe60e77c43b0486bdd6a058085833664e5 git hash], |
|||
*[http://dev.laptop.org/~kernels/public_rpms/f14-xo1.75/kernel-3.0.0_xo1.75-20120109.1939.olpc.f1f8f7f.armv7l.rpm rpm], |
|||
*[http://build.laptop.org/11.3.1/os24/xo1.75/ build], |
|||
*do not use no_console_suspend, include light sensor test |
|||
*make sure to manually upgrade the EC firmware to cl2-4_0_4_01 |
|||
*purpose, test if actually getting the EC to be quiet during suspend improves things |
|||
*testing by |
|||
** John Watlington: |
|||
*** SKU201 (3) 24hr cycles, 600+ cycles ongoing; |
|||
*** SKU202 (3) 24hr cycles, 600+ cycles ongoing |
|||
*** (12x) SKU198/199/just plain weird, ongoing |
|||
*** (10x) SKU200, ongoing; |
|||
*** (4x) SKU201, ongoing; |
|||
*** (2x) SKU202, ongoing; |
|||
*** (2x) SKU203, ongoing; |
|||
*** (2x) SKU204, ongoing; |
|||
== f1f8f7fe - arm-3.0-wip ==
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip&id=f1f8f7fe60e77c43b0486bdd6a058085833664e5 git hash], |
|||
*[http://dev.laptop.org/~kernels/public_rpms/f14-xo1.75/kernel-3.0.0_xo1.75-20120109.1939.olpc.f1f8f7f.armv7l.rpm rpm], |
|||
*[http://build.laptop.org/11.3.1/os24/xo1.75/ build], |
|||
*do not use no_console_suspend, use same configuration as caused hangs previously, |
|||
*purpose, test if debug console messages may cause hangs, |
|||
*testing by |
|||
** Jon Nettleton, C1, 13:20 hours / 1966 cycles ongoing, os20, |
|||
** James Cameron, |
|||
*** C1 SKU201, C1 SKU202, B1, all passed, |
|||
*** C1 SKU201, C1 SKU202, B1, B1, B4, with os24, all passed. |
|||
** John Watlington, with light-sensor test: |
|||
***C1 SKU201 (2) 24hr loop + 2000 cycles stopped; |
|||
***C1 SKU202 (2) 24hr loop + 2100 cycles stopped |
|||
***(10x) SKU198/199/just plain weird, (2) 24hr loop, stopped |
|||
***(8x) SKU200, (2) 24 hr loop, stopped; |
|||
***(3x) SKU201, (2) 24 hr loop, stopped; |
|||
*** SKU202, (2) 24hr loop, stopped; |
|||
*** SKU198, hang in first 24 hrs, all hangs either screen blank w. power light on or turned off. |
|||
*** SKU199, hang in first 24 hrs
|||
*** (2x) SKU200, hang in first 24 hrs |
|||
*** SKU201, hang in first 24 hrs |
|||
*** SKU202, hang in first 24 hrs |
|||
** Samuel Greenfeld, Q4C12, patched 0.17.0 runin to support kernel RPM version |
|||
*** C1 SKU202 failure after 1805 cycles (cause unclear but EC locked up) [http://dev.laptop.org/ticket/11571 #11571] |
|||
*** C1 SKU202 failure after 3455 cycles (EC transmission queue full followed by null pointer dereference) [http://dev.laptop.org/ticket/11573 #11573] |
|||
*** B1 SKU199 failure after 2062 cycles (cause unknown; no serial logs available) [http://dev.laptop.org/ticket/11572 #11572] |
|||
*** 3xC1 SKU201, 2xB1 SKU 198, 2xB1 SKU 199 passed 24 hours of testing. |
|||
*** All 10 systems mentioned above re-imaged to os24 with normal 24-hour runin settings; passed. |
|||
** Chia-Hsiu, 6 x C2, 2 x C1, using os24, ongoing |
|||
*** one 24 run completed successfully |
|||
** C2 SKU200 x7 @ Miami office using os24 - ongoing |
|||
*** 20 hrs with >50 s/r cycles |
|||
== 1220526f - arm-3.0-wip-wfi ==
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=1220526f debug: add debugging to workqueues, irqs, etc] |
|||
*test dependencies: save your System.map from your build |
|||
*purpose: attempt to track down hangs to specific driver callbacks |
|||
* testing by |
|||
** James Cameron, three units, testconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog for first run only, 0 to pm_async, |
|||
*** C1 SKU201, hang 1, [http://dev.laptop.org/~quozl/z/1RkTYx.txt log], hang 2, [http://dev.laptop.org/~quozl/z/1RkTrN.txt log], hang 3, a repeating invalid page fault with storage LED blinking, FIQ entry but insane, [http://dev.laptop.org/~quozl/z/1RkUKO.txt log], [http://dev.laptop.org/~quozl/z/1RkUMV.txt fiq], hang 4, [http://dev.laptop.org/~quozl/z/1RkVAN.txt log], restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 5, in resume, [http://dev.laptop.org/~quozl/z/1RkVjV.txt log], stopped.
|||
*** C2 SKU202, hang 1, with spontaneous kdb entry, and a backtrace, [http://dev.laptop.org/~quozl/z/1RkTgv.txt log], hang 2, [http://dev.laptop.org/~quozl/z/1RkUJ4.txt log], hang 3, not with the workqueue debugging active at the time, [http://dev.laptop.org/~quozl/z/1RkVBB.txt log], restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 4, now in EARLY resume, [http://dev.laptop.org/~quozl/z/1RkVf2.txt log], stopped. |
|||
*** B1, hang 1, [http://dev.laptop.org/~quozl/z/1RkU4x.txt log], hang 2, [http://dev.laptop.org/~quozl/z/1RkV9G.txt log], hang 3, [http://dev.laptop.org/~quozl/z/1RkVhM.txt log], stopped. |
|||
*** B1 (pass), B4 (pass), B1 (pass), with -ENODEV return from serial_pxa_init(), and no calls to mmp2_add_uart(), ongoing, 3483 cycles, 23 hours, |
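The System.map saved as a test dependency for this kernel can be used to turn an address from a hang log into a symbol name. A minimal sketch, assuming the usual three-column System.map layout (address, type, symbol) sorted by address; <code>lookup_symbol</code> is a name invented for this example, and the address is given in hex without a leading 0x:

```shell
# lookup_symbol HEX_ADDRESS SYSTEM_MAP
# Prints the last symbol whose address is at or below HEX_ADDRESS.
# Hypothetical helper; assumes System.map is sorted by address.
lookup_symbol() {
    awk -v addr="$1" '
        # Convert a hex string (no 0x prefix) to a number.
        function hex(s,    i, n) {
            s = tolower(s); n = 0
            for (i = 1; i <= length(s); i++)
                n = n * 16 + index("0123456789abcdef", substr(s, i, 1)) - 1
            return n
        }
        BEGIN { target = hex(addr) }
        hex($1) <= target { best = $3 }
        END { if (best != "") print best }
    ' "$2"
}

# Example: lookup_symbol c0010040 System.map
```

The function does not give the offset into the symbol, and an address above the last entry in the map still resolves to that last symbol, so treat the result as a hint rather than a definitive answer.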
|||
== 0a896953 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?id=0a89695328b4abd3ea1ec27157e1b2bd0687f497 olpc-ec: consolidate code and ensure TxFIFO is empty before writing to it] |
|||
*EC logged in S7, runin 0.17.0 or 0.16.7 (not C2 builds) if possible |
|||
*Past kernels tested briefly in the past ~12 hours suggest that the fixes in said kernels may fix all known easily reproducible hangs. Need to prove/disprove this. |
|||
*If you have enough systems, try using stock runin 0.17.0-style settings as well as anything known to cause particular types of failures. (EC failures seem to be common within a few hours with 10s on/10s off suspend, only using runin-main/gtk/common/tests/sus/battery.) |
|||
*testing by |
|||
** James Cameron, six units, defconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog, 0 to pm_async, started around ''Sat Jan 7 08:24:00 UTC 2012'', |
|||
*** C1 SKU201, 534 tests, 530 pass, 4 fail, |
|||
**** 3 hang during resume (see [http://dev.laptop.org/~quozl/z/1RjRnj.txt host log], and [http://dev.laptop.org/~quozl/z/1RkAO8.txt host log], and [http://dev.laptop.org/~quozl/z/1RkAQI.txt host log]), |
|||
**** 1 fail with repeating eMMC SET_BLOCK_COUNT, see [http://dev.laptop.org/~quozl/z/1RkASH.txt host log],
|||
*** C1 SKU202, 2777 tests, 2711 pass, 1 WARNING at olpc_ec_1_75_cmd ([http://dev.laptop.org/~quozl/z/1RkB6R.txt log]), 66 fail, |
|||
**** 50 hang during resume, |
|||
**** 16 fail with repeating eMMC SET_BLOCK_COUNT, |
|||
**** see [http://dev.laptop.org/~quozl/z/1RkAy8.txt tail of each fail in one place] |
|||
*** one B1 hang with repeating eMMC SET_BLOCK_COUNT, see [http://dev.laptop.org/~quozl/z/1RjS0m.txt host log],
|||
*** one B1 hang with repeating eMMC SET_BLOCK_COUNT, see [http://dev.laptop.org/~quozl/z/1RjS3a.txt host log],
|||
** Samuel Greenfeld, five C1 units, runin 0.16.7 setup like 0.17 (light sensor test disabled), 0 to pm_async |
|||
*** one C1 SKU201 hang during resume, normal runin settings, see [http://fpaste.org/qOc6/ host log] |
|||
*** one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing, see [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.3-resumehang.bz2 host log] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.8-ecfor3-resumehang.bz2 ec]. Unit discovered after powered off due to critical battery failure (runin discharge test). |
|||
*** one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing + cl2-4_0_3_07rsmith-586 EC code (S47 debug), see [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.1-resumehang-ec586.bz2 host log] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.7-resumehang-ec586-ecfor1.bz2 ec log]. |
|||
*** one B1 SKU199 hang during resume, normal runin settings with 10s suspend timing, see [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-0a896953/screenlog.6-b1resumehang.bz2 host log]. |
|||
** Jon Nettleton |
|||
== ebf24ea6 - arm-3.0-wip-wfi == |
|||
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=ebf24ea6 |
|||
*test dependencies: EC logged in S7, use whatever magic possible to get the EC errors |
|||
*Another attempt to fix the problem of EC communication failures |
|||
*testing by |
|||
== 196c2f806 - arm-3.0-wip-wfi == |
|||
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=196c2f80 |
|||
*testing by |
|||
**James Cameron, with a local debugging patch, saw two instances of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201 and C1 SKU202, two test units with full runin 0sec/3sec (extreme) suspend and resume failed to show any hangs, currently going are five test units with full runin 10sec/10sec (aggressive) suspend and resume. |
|||
***C1 SKU202 reported EC related kernel problems, see [http://dev.laptop.org/~quozl/z/1RjNDy.txt log], should be fixed in later kernels. |
|||
== 46e079fe - arm-3.0-wip-wfi == |
|||
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=46e079fe |
|||
*EC logged in S7 |
|||
*Should fix the hang caused by suspend being aborted due to a wakeup event. May also help with other IRQ hangs as I have fixed a previous mistake I made when moving the audio island code. |
|||
*Testing aborted suspends can be done by running this command and then hitting the keyboard right away. |
|||
<blockquote> |
|||
echo $(cat /sys/power/wakeup_count) > /sys/power/wakeup_count && echo mem > /sys/power/state |
|||
</blockquote> |
|||
*testing by |
|||
**James Cameron, with a local debugging patch, saw one instance of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201, see [http://dev.laptop.org/~quozl/z/1RikOi.txt log tail], and several instances of the [[XO-1.75/Kernel/Issues#hang.2C_dpm_resume|hang in dpm_resume()]]. |
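To repeat the aborted-suspend test for this kernel, the one-line command given above can be wrapped in a function. This is a sketch only: <code>try_suspend</code> and the <code>SYS_POWER</code> override are assumptions made for this example, the latter so that the function can be dry-run against a scratch directory instead of the real sysfs.

```shell
# SYS_POWER is a hypothetical override; it defaults to the real sysfs directory.
SYS_POWER="${SYS_POWER:-/sys/power}"

# try_suspend: same two steps as the one-line command, with error checks.
try_suspend() {
    # Read the current wakeup event count.
    count=$(cat "$SYS_POWER/wakeup_count") || return 1
    # Write it back; on a real kernel the write is rejected if a wakeup
    # event was counted since the read.
    echo "$count" > "$SYS_POWER/wakeup_count" || return 1
    # Request suspend to RAM; a wakeup event arriving now should abort it.
    echo mem > "$SYS_POWER/state"
}

# Example: run try_suspend as root, then hit the keyboard right away.
```

A non-zero return status means either the wakeup count changed before the suspend request or the suspend itself was refused or aborted.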
|||
== bfc1b92b - arm-3.0-wip-wfi == |
|||
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=bfc1b92b |
|||
*EC logged in S7 |
|||
*Should fix the (known) cause of failing EC commands on s/r. Please test both this and the prior kernel. |
|||
*testing by |
|||
== b7f22e1d - arm-3.0-wip-wfi == |
|||
*http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=b7f22e1d |
|||
*EC logged in S7 |
|||
*Should fix the problem where, upon a single EC error, all further EC commands fail. Please test! |
|||
*testing by |
|||
== 10ebd28f - arm-3.0-wip-wfi == |
|||
* [http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=10ebd28ff6bc0ef40c881a818fe67b6bb544954f olpc-ec: log a warning if we manage to process an EC command deep inside of suspend], |
|||
* a [http://dev.laptop.org/~quozl/kernel-10ebd28f-fiq.tar.gz tar.gz] by James, |
|||
*testing by |
|||
** Richard? |
|||
== ae48be89 - arm-3.0-wip-wfi == |
|||
* a [http://dev.laptop.org/~quozl/kernel-ae48be89-fiq.tar.gz tar.gz] by James, |
|||
*testing by |
|||
** Richard? |
|||
** James Cameron, 10sec/10sec, with audio disabled, [http://dev.laptop.org/~quozl/z/1RiInC.txt patch], |
|||
*** B1 hung, [http://dev.laptop.org/~quozl/z/1RiK9g.txt host log], |
|||
*** B1 hung, [http://dev.laptop.org/~quozl/z/1RiMYl.txt host log], |
|||
*** B1 going. |
|||
== 5ba0b446 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=5ba0b446 olpc-ec: allow unknown commands to be executed] |
|||
*test dependencies: EC logging (S7), kernel logging with EC debug enabled ("echo 1 > /sys/module/olpc_ec_1_75/parameters/ec_debug"), pm_async disabled. |
|||
*purpose: improve EC driver; fewer (or no?) races/crashes.
|||
*testing by |
|||
** Samuel Greenfeld |
|||
*** C1 SKU 201 EC communications failure (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin. [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-5ba0b446/screenlog.3-ecfail.bz2 host] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-5ba0b446/screenlog.8-ecfor3-ecfail.bz2 ec] |
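The kernel-side test dependencies listed for this kernel can be applied in one step before starting runin. A minimal sketch: the ec_debug path is taken from the item above, writing 0 to /sys/power/pm_async follows the "0 to pm_async" setting used elsewhere on this page, and both <code>prepare_ec_test</code> and the <code>SYSFS_ROOT</code> override are hypothetical names for this example.

```shell
# SYSFS_ROOT is a hypothetical override so the function can be dry-run
# against a scratch directory; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# prepare_ec_test: enable EC driver debug output and disable pm_async.
prepare_ec_test() {
    echo 1 > "$SYSFS_ROOT/module/olpc_ec_1_75/parameters/ec_debug" || return 1
    echo 0 > "$SYSFS_ROOT/power/pm_async"
}

# Example: run prepare_ec_test as root before starting runin.
```

EC logging on the serial port (S7) still has to be set up separately; this only covers the two sysfs settings.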
|||
== ff199462 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=ff199462da2eb97ff2a8cbcd98f7d74822a40ae5 olpc-ec: ensure gpio cmd is left low if something screws up] |
|||
*EC logged in S7 |
|||
*Still has FIQ debugger |
|||
*Should help reset EC bus to avoid subsequent failures after the first command fails. |
|||
*testing by |
|||
**Samuel Greenfeld |
|||
*** C1 hang on resume (#3), OLS runin disabled, Q4C11 normal EC code [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-ff199462/screenlog.3-resumehang1.bz2 host] (''mmp2_pm_finish: Enable audio island'' ... ''mmp2_pm_finish: Done'' ... 72.616ms ... ''mmp-camera mmp-camera.0: resume'') [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-ff199462/screenlog.8-resumehang1.bz2] |
|||
*** C1 hang on resume (#6), OLS runin disabled, Q4C11 normal EC code |
|||
*** C1 hang on resume (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-ff199462/screenlog.3-resumehang2ecmod.bz2 host] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-ff199462/screenlog.8-ecfor3-resumehang2ecmod.bz2 ec] |
|||
== 3d4cf36c - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=3d4cf36c56ccce859fc3b508ee8a38cde098d346 mmp2_fiq_debugger.c add file missing from other merge] |
|||
*EC logged in S7 |
|||
*This enables the [[XO-1.75/Kernel/FIQ|FIQ debugger]]. Reproduce kernel hangs, then from a serial console send a Break, which should drop you to a debug prompt. Run bt to see what is going on.
|||
*testing by |
|||
**James Cameron, 10sec/10sec, |
|||
*** C1 SKU201 hang at ''mmp-camera mmp-camera.0: resume'', [http://dev.laptop.org/~quozl/z/1RiKZq.txt host], |
|||
*** C1 SKU202 hang at ''mmp-camera mmp-camera.0: resume'', [http://dev.laptop.org/~quozl/z/1RiHT6.txt host], |
|||
*** C1 SKU202 hang at ''mmp-camera mmp-camera.0: resume'', [http://dev.laptop.org/~quozl/z/1RiKbq.txt host], |
|||
*** no response to BREAK, |
|||
**James Cameron, 0sec/3sec, FIQ was verified as working before starting, |
|||
*** C1 SKU201 [http://dev.laptop.org/~quozl/z/1RiDms.txt host] (known audio problem that doesn't need to be reported again), |
|||
*** C1 SKU202 [http://dev.laptop.org/~quozl/z/1RiEIr.txt host] (known audio problem), |
|||
*** B4, and two B1 hung, no response to BREAK. |
|||
== 26f404e3 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=26f404e322ba63f0abe94a15844048b08e6d6ab5 Revert "olpc-ec: don't process/ack packets when there's an underrun error"] |
|||
*EC logged in S7 |
|||
*further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf. Test that audio pop during suspend/resume is gone from jnettlet. |
|||
*testing by |
|||
** Samuel Greenfeld |
|||
*** 3xC1 running os23 with olpc-runin-tests-0.16.7-1 installed instead of bringup build, Q4C11, all tests (2 with EC serial), 3xC1 running with battery test disabled. Test in progress. |
|||
**** 1xC1 SKU201 (#6) running aggressive (10s on/10s off) suspend failed after 15 minutes during resume, power & full battery LEDs on, all other LEDs off. All tests except the battery test were running on this unit at the time of failure. [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-26f404e/screenlog.6-resumehang.bz2 host] (''mmp2_pm_finish: Enable audio island'' ... ''mmp2_pm_finish: Done'') |
|||
**** 1xC1 SKU201 (#3) running aggressive (10s on/10s off) suspend failed at a point TBD with EC communications failure and the eMMC root filesystem remounted read-only. It continued suspend & resume testing after the failure point(s). [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-26f404e/screenlog.3-ecfail1.bz2 host] [http://dev.laptop.org/~greenfeld/temp/175bringup/os23-26f404e/screenlog.8-ecfor3-ecfail1.bz2 ec] |
|||
** James Cameron |
|||
*** C1 C1 B4 B1 B1 B1, os20, runin 0.16.7, 10sec/10sec, hangs still occur [http://dev.laptop.org/~quozl/z/1Rhvmt.txt one], [http://dev.laptop.org/~quozl/z/1RhvpV.txt two], audio pops during suspend and resume are still present. |
|||
== 7dad6c10 - arm-3.0-wip-wfi (BROKEN) == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=7dad6c10 XO-1.75: always send the suspend hint command to the EC, even on C-series] |
|||
*EC logged in S7 |
|||
*further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf. |
|||
*testing by |
|||
** Samuel Greenfeld |
|||
*** 1xC1 SKU202 tested; logs "olpc-ec-1.75: SSP reports TX underrun" and "SSP reports RX overrun" every three seconds shortly after the kernel starts initializing. This causes the XO to never fully boot, making this kernel unusable. [http://dev.laptop.org/~greenfeld/temp/175bringup/results-7dad6c1.tgz log with EC at S7] [http://dev.laptop.org/~greenfeld/temp/175bringup/results2-7dad6c1.tgz log with EC at S47] |
|||
== afa391a5 - arm-3.0-wip-wfi == |
|||
*[http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=afa391a5 Revert "olpc-1.75: back off the hardware clock gating for MMC devices"] |
|||
*EC logged in S7 |
|||
*test code to work around an EC communications race; needs further work, but this should hopefully keep things from hanging. |
|||
*testing by |
|||
== 9177e6a8 - arm-3.0-wip-wfi == |
||
*purpose: test theory that battery state of charge changes are associated with hangs |
||
**Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build, ''runin-battery test disabled'' |
||
*** 2xC1 SKU201 hung on resume after 10.5 hours [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/ host#3/ec#8 & host #6] |
|||
*** 4 C1 SKU201 & 3 C1 SKU 202 going. |
|||
*** 1xC1 SKU202 hung after 11.75 due to disabling pm_async or keyboard events during runin [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/ host#4], possibly while resetting system #3 in front of it. |
|||
*** All systems then reset with pm_async disabled, 4 C1 SKU201 & 3 C1 SKU 202 total. |
|||
*** 1xC1 SKU202 hung citing phantom keyboard events followed by an illegal instruction, pm_async disabled [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/screenlog.0-keybThenIllegalOp.bz2] |
|||
*** 1xC1 SKU201 hung with MMC problems, pm_async disabled [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/screenlog.6-mmcfailure.bz2] |
|||
**Samuel Greenfeld, os23, 3 B1 SKU198 4 B1 SKU199, olpc-runin-tests-0.16.7-1 installed instead of bringup build, ''runin-battery test enabled, runin-camera, runin-wlan disabled, pm_async enabled'' |
|||
*** 1xB1 SKU 199 failure to properly handle EC interrupt on resume [http://dev.laptop.org/~greenfeld/temp/os23-9177e6a/logs-SHC12900F77-111230_024931.tar.gz] |
|||
**James Cameron, Q4C11, os20, 10sec/10sec S/R runin, without ''runin-battery'', without battery inserted. |
||
*** C1 SKU201 hung, after 67 minutes, [http://dev.laptop.org/~quozl/z/1RgTUF.txt host], C1 SKU202 hung, after 15 minutes, [http://dev.laptop.org/~quozl/z/1RgSMU.txt host], B1 SKU199 x 2 |
*** C1 SKU201 hung, after 67 minutes, [http://dev.laptop.org/~quozl/z/1RgTUF.txt host], C1 SKU202 hung, after 15 minutes, [http://dev.laptop.org/~quozl/z/1RgSMU.txt host], B1 SKU199 x 2, B4 SKU199 hung, after 90 minutes, [http://dev.laptop.org/~quozl/z/1RgThv.txt host tail], B1 SKU198. |
||
*purpose: disable asynchronous S/R |
||
*purpose: no-suspend-contention runin branch |
|||
** five units run to 100 suspend cycles, 2.5 hours each, no issues. |
|||
== 58360582 - arm-3.0-wip-wfi == |
||
*** C1 SKU201 hung on resume after 21 cycles (10s on/10s suspend), [http://dev.laptop.org/~greenfeld/temp/os23-5836058/ host#3 ec#8] |
||
*** C1 SKU201 hung on resume overnight (10s on/10s suspend), [http://dev.laptop.org/~greenfeld/temp/os23-5836058/ host#6] |
||
*** C1 SKU202 EC communications failure overnight [http://dev.laptop.org/~greenfeld/temp/os23-5836058/ host#1 ec#7] |
|||
*** Four remaining systems running default runin suspend cycle timings have not hung after 12 hours. |
|||
*** Three remaining C1 systems running default runin suspend cycle timings did not hang after 18 hours. |
|||
== 2d8e7cc - arm-3.0-wip-wfi == |
Latest revision as of 00:35, 19 March 2012
Kernel developers, put your new kernel in a new section here. Latest at top. Testers place links to reports indented under the kernel.
See also:
4ae137c7 - arm-3.0-ramp
- 4ae137c7
- libertas fix
- testing by jnettlet
- testing by James, on six units, with serial, with aggressive runin, passed 96 hours,
- testing by James, as above, with also debug no_console_suspend, passed 96 hours,
d6a2a5eb - arm-3.0-ramp
- d6a2a5eb aka os27,
- James Cameron, started 2012-02-10 05:50 UTC, stopped some days later, q4d03ja, defconfig, plus patch (enable excessive serial port output, leave apb uart clocks active over suspend), (.tar.gz), olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os27, on eight units, looking for hangs, ... 50% fail within 12 hours, two instances of "mmp2_pm_finish: Done" as the last message, also hang #0, and a unit that hung without a serial console attached.
- as above, started 2012-02-10 22:52 UTC, with kgdb disabled,
- hang #0, shortly after "mmp2_pm_finish: Done", with evidence of log buffer writes beyond log_end,
- hang #1, an SKU203 without serial console attached,
- hang #2, an SKU204 without serial console attached,
- hang #3, shortly after "mmp2_pm_finish: Done", an SKU201 with serial console, but watchdog failed to restart,
- hang #4, an SKU204 without serial console attached,
- hang #5, an SKU200 with serial console, but watchdog failed to restart,
61e93c20 - arm-3.0-ramp
- 61e93c20
- James Cameron, started 2012-02-07 09:53 UTC, stopped after 67.7 hours, q4d03, defconfig, (i.e. with serial driver disabled, with epitaph)(.tar.gz), olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os26, on eight units, looking for hangs, looking for 3.0.19 regressions, ... one hang on resume, hang #1,
61e0e9ee - arm-3.0-ramp
- 61e0e9ee
- os26
- testing by
- James Cameron, completed, q4d02jc, defconfig, with serial console disabled, without epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive, backlight off, based on os26, on eight units, looking for hangs, ... over 96 hours, one hang on SKU204,
be5c532a - arm-3.0-wip
- be5c532a
- testing by
- James Cameron, completed, q4d02jc, defconfig, with serial driver disabled, with epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive, on six units, looking for eMMC or other hangs, 113 hours fine, one instance of camera missing <trac>11595</trac> (which can only occur on boot), see dmesg
- James Cameron, completed, as above, but with serial driver enabled, SKU200 hang #1 (during resume), hang #2 (during resume), SKU202 hang #3 (during suspend), and SKU201 hang during suspend or resume, ... all of which suggests that enabling the serial driver leads to hangs,
- James Cameron, completed, as above, but with serial driver with DMA support, SKU200 hang #1 (during resume), B1 hang #2 (during resume), B4 hang #3 (during resume).
8f74291b - arm-3.0-wip
- config: update for new graphics stuff + 3.0.17
- EC logged in S7/S47 if available
- os25 or another build with the new graphics ABI
- purpose: Copy changes from 3.0.17, CMA, new graphics ABI to -wip branch
- testing by
- Samuel Greenfeld
- 5xC1, 10s aggressive suspend w/ runin-camera test active
- 1xC1 SKU201 TX FIFO not empty Ticket
073c7fbf - arm-3.0.17-wip
- TTY: serial_core: Fix crash if DCD drop during suspend
- EC logged S47 (if available)
- os24 base image or one which does *not* have the new graphics ABI
- purpose, test if fixes found in the 3.0.17 Linux kernel solve our issues
- testing by
- Samuel Greenfeld
- 5xC1, all units passed 72 hours (3 cycles) with 10s suspend cycles (with clock skews ranging from 10 minutes backwards to 20 minutes forward after being set via ntp-set-clock prior to the test). System reboot order and overall duration varied from the order & time amount I recall starting the systems with. Testing was done without runin-camera, although looking at the kernel configuration it may be possible to turn it on again.
b8114c95 - arm-3.0-wip-wfi
- b8114c95
- os24 base image or one which does *not* have the new graphics ABI
- disable runin-camera test
- purpose, see if eMMC stack backport from 3.2 fixes eMMC hangs, for Andres.
- testing by
- Samuel Greenfeld
- 5xC1, still hangs (as before) or reboots reporting filesystem issues, sampled logs here
5fef1609 - arm-3.0-wip-wfi
- mmc: try reducing bus bit width in case of error
- EC logged at S47
- os24 base image or one which does *not* have the new graphics ABI
- disable runin-camera test
- purpose, look for EC, other lockups
- aggressive suspend cycles
- testing by
- Samuel Greenfeld
- 2xSKU 202, 3xSKU 201, all with serial consoles, (one of each with EC serial)
- C1 SKU201 (#4) failure due to journal replay error, filesystem remounted readonly & runin reboot
- Other systems all hung within a few hours, some multiple times after being restarted
- Logs online here
- James Cameron, with a local patch to disable serial and capture dmesg via CForth: one of the five systems hung over 52 hours, a remarkably stable test result; the one hang was preceded by an EC driver message "olpc-ec-1.75: SSP reports TX underrun" and then a WARNING, log
- C1 SKU201, start 2012-01-20 04:12:26, end 2012-01-24 04:18:42, four days, no errors, post-run ntpdate correction -3333.884336 sec,
- C1 SKU202, start 2012-01-20 04:12:28, end 2012-01-24 05:17:09, four days, one EC driver error, post-run ntpdate correction -534.727839 sec,
- B1, start 2012-01-20 04:12:29, end 2012-01-24 04:19:15, four days, no errors, post-run ntpdate correction 926.420114 sec,
- B1, start 2012-01-20 04:12:31, end 2012-01-24 04:18:43, four days, no errors, post-run ntpdate correction -2817.877176 sec,
- B4, start 2012-01-20 04:12:32, end 2012-01-24 04:18:38, four days, no errors, post-run ntpdate correction 1661.545946 sec.
b2845e62 - arm-3.0-wip-graphics (os25)
- (pxa168fb) Fix HWCursor Resume logic
- Use os25 to make this easiest to install
- Q4D01 recommended
- purpose: test possible ramp build
- testing by
- Samuel Greenfeld
- 3xB1, 5xC2 running normal runin, completed 10 24-hour runin cycles with all units; should be attempting additional 24-hour cycles on units still available.
9a3a8436 - arm-3.0-wip-graphics
- (galcore:marvell) Pass the device to DMA for reservations
- Jon Nettleton's graphics RPM repo is required (ABI breakage)
- purpose, test new graphics driver in runin with a somewhat ideally stable kernel
- testing by
- Samuel Greenfeld
- 2xSKU 202, 2xSKU 201 passed 21 hours of testing with 10s suspend, ~3600 suspend cycles each
- 1xSKU 201 hung after ~20 hours of testing with 10s suspend
- 5xSKU 200 ran normal runin settings for 21 hours, no issues, ~58 suspend cycles each
daf43181 - arm-3.0-wip-wfi
- TTY: serial_core: Fix crash if DCD drop during suspend
- make sure no_console_suspend is enabled
- see if this helps or fixes serial hangs
- testing by
- James Cameron, os24 remastered with testconfig kernel, without no_console_suspend, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera),
- Samuel Greenfeld
- C1 SKU 202 mmc communication problem (no reboot by runin)
- C1 SKU 202 hang
- C1 SKU 201 hang
- C1 SKU 201 hang
- C2 hang followed by power off; no serial connection
- James Cameron, local CForth build, os24 remastered with testconfig kernel, without no_console_suspend, with dmesg buffer locating patch, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera), hang (irq 20), hang (eMMC), hang (irq 20), hang (during suspend), summary, over 24 hours, over 5 units, had 5 hangs relating to eMMC, and 10 hangs relating to irq 20.
- James Cameron, os24 remastered with defconfig kernel, without console=ttyS2,115200, as above, eight units passed 26 hours.
fc631432 - arm-3.0-wip-wfi
- testconfig: reenable no_console_suspend
- EC logged in S7 or S47.
- Disable runin camera test (disabled mmp_camera driver)
- purpose, gather more information about why we hang, since removing no_console_suspend led to hanging significantly less often.
- testing by
- Samuel Greenfeld
- Seems to hang on all test systems within 2 hours with 10s suspend cycles (multiple failure modes, including #11528), possibly aggravated by sending serial data to the XO during the suspend/resume process. But failures have also been seen after rebooting systems and ignoring them.
cfb3aa76 - arm-3.0-wip-wfi
- cfg80211: amend regulatory NULL dereference fix
- EC logged in S7 or S47. Do not use no_console_suspend.
- purpose: fixes a wireless oops; still need to verify TxFIFO sanity check patch
- testing by
- Samuel Greenfeld, 5 C1s, 1 B1, 10s aggressive suspend cycles
- SKU202 #0 failed with 0:8:26 (hours:minutes:seconds) remaining, after 3139 cycles, power LED on, full charge seen
- SKU199 #6 failed with 0:53:10 (hours:minutes:seconds) remaining, after 2990 cycles, power LED on, battery LED off (discharging)
- SKU201 #4 failed second run 22:51 (hours:mins) remaining after 441 suspend cycles
- SKU202 #0 failed second run 21:31 (hours:mins) remaining 322 cycles
- SKU202 #1 failed second run 18:36 (hours:mins) remaining 689 cycles
- SKU199 #6 failed second run 13:29 (hours:mins) remaining 1354 cycles
- SKU201 #4 failed third run 20:08 (hours:mins) remaining 510 cycles
- SKU202 #0 failed third run 18:42 (hours:mins) remaining 1694 cycles
- Logs available here
be581eb4 - arm-3.0-wip-wfi
- testconfig: turn off no_console_suspend
- EC logged in S7 or S47, aggressive suspending in runin
- do not use no_console_suspend, use same configuration as caused hangs previously,
- purpose: Verify the 'olpc-ec: do a full reset if the TxFIFO sanity check fails' patch.
- testing by
f1f8f7fe - arm-3.0-wip, EC 4.01
- git hash,
- rpm,
- build,
- do not use no_console_suspend, include light sensor test
- make sure to manually upgrade the EC firmware to cl2-4_0_4_01
- purpose, test if actually getting the EC to be quiet during suspend improves things
- testing by
- John Watlington:
- SKU201 (3) 24hr cycles, 600+ cycles ongoing;
- SKU202 (3) 24hr cycles, 600+ cycles ongoing
- (12x) SKU198/199/just plain weird, ongoing
- (10x) SKU200, ongoing;
- (4x) SKU201, ongoing;
- (2x) SKU202, ongoing;
- (2x) SKU203, ongoing;
- (2x) SKU204, ongoing;
f1f8f7fe - arm-3.0-wip
- git hash,
- rpm,
- build,
- do not use no_console_suspend, use same configuration as caused hangs previously,
- purpose, test if debug console messages may cause hangs,
- testing by
- Jon Nettleton, C1, 13:20 hours / 1966 cycles ongoing, os20,
- James Cameron,
- C1 SKU201, C1 SKU202, B1, all passed,
- C1 SKU201, C1 SKU202, B1, B1, B4, with os24, all passed.
- John Watlington, with light-sensor test:
- C1 SKU201 (2) 24hr loop + 2000 cycles stopped;
- C1 SKU202 (2) 24hr loop + 2100 cycles stopped
- (10x) SKU198/199/just plain weird, (2) 24hr loop, stopped
- (8x) SKU200, (2) 24 hr loop, stopped;
- (3x) SKU201, (2) 24 hr loop, stopped;
- SKU202, (2) 24hr loop, stopped;
- SKU198, hang in first 24 hrs, all hangs either screen blank w. power light on or turned off.
- SKU199, hang in first 24 hrs
- (2x) SKU200, hang in first 24 hrs
- SKU201, hang in first 24 hrs
- SKU202, hang in first 24 hrs
- Samuel Greenfeld, Q4C12, patched 0.17.0 runin to support kernel RPM version
- C1 SKU202 failure after 1805 cycles (cause unclear but EC locked up) #11571
- C1 SKU202 failure after 3455 cycles (EC transmission queue full followed by null pointer dereference) #11573
- B1 SKU199 failure after 2062 cycles (cause unknown; no serial logs available) #11572
- 3xC1 SKU201, 2xB1 SKU 198, 2xB1 SKU 199 passed 24 hours of testing.
- All 10 systems mentioned above re-imaged to os24 with normal 24-hour runin settings; passed.
- Chia-Hsiu, 6 x C2, 2 x C1, using os24, ongoing
- one 24-hour run completed successfully
- C2 SKU200 x7 @ Miami office using os24 - ongoing
- 20 hrs with >50 s/r cycles
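Reports such as "13:20 hours / 1966 cycles" can be turned into an average time per suspend cycle, which is easier to compare against the on/off timing a run was configured with. A minimal sketch; the function name `secs_per_cycle` is made up for illustration and is not part of olpc-runin-tests:

```shell
#!/bin/sh
# Average wall-clock seconds per suspend cycle.
# usage: secs_per_cycle HOURS MINUTES CYCLES
# (hypothetical helper, not part of any runin package)
secs_per_cycle() {
    awk -v h="$1" -v m="$2" -v c="$3" \
        'BEGIN { printf "%.1f\n", (h * 3600 + m * 60) / c }'
}

# Figures taken from the report above: 13:20 hours, 1966 cycles.
secs_per_cycle 13 20 1966   # prints 24.4
```

The result includes time spent in suspend and resume transitions as well as the configured on and off periods, so it is only a rough consistency check.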
1220526f - arm-3.0-wip-wfi
- debug: add debugging to workqueues, irqs, etc
- test dependencies: save your System.map from your build
- purpose: attempt to track down hangs to specific driver callbacks
- testing by
- James Cameron, three units, testconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog for first run only, 0 to pm_async,
- C1 SKU201, hang 1, log, hang 2, [http://dev.laptop.org/~quozl/z/1RkTrN.txt log], hang 3, a repeating invalid page fault with storage LED blinking, FIQ entry but insane, log, fiq, hang 4, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 5, in resume, log, stopped.
- C2 SKU202, hang 1, with spontaneous kdb entry, and a backtrace, log, hang 2, log, hang 3, not with the workqueue debugging active at the time, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 4, now in EARLY resume, log, stopped.
- B1, hang 1, log, hang 2, log, hang 3, log, stopped.
- B1 (pass), B4 (pass), B1 (pass), with -ENODEV return from serial_pxa_init(), and no calls to mmp2_add_uart(), ongoing, 3483 cycles, 23 hours,
0a896953 - arm-3.0-wip-wfi
- olpc-ec: consolidate code and ensure TxFIFO is empty before writing to it
- EC logged in S7, runin 0.17.0 or 0.16.7 (not C2 builds) if possible
- Brief testing of earlier kernels over the past ~12 hours suggests that their fixes may resolve all known easily reproducible hangs. This needs to be proved or disproved.
- If you have enough systems, try using stock runin 0.17.0-style settings as well as anything known to cause particular types of failures. (EC failures seem to be common within a few hours with 10s on/10s off suspend, only using runin-main/gtk/common/tests/sus/battery.)
- testing by
- James Cameron, six units, defconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog, 0 to pm_async, started around Sat Jan 7 08:24:00 UTC 2012,
- C1 SKU201, 534 tests, 530 pass, 4 fail,
- C1 SKU202, 2777 tests, 2711 pass, 1 WARNING at olpc_ec_1_75_cmd (log), 66 fail,
- 50 hang during resume,
- 16 fail with repeating eMMC SET_BLOCK_COUNT,
- see tail of each fail in one place
- one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
- one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
- Samuel Greenfeld, five C1 units, runin 0.16.7 setup like 0.17 (light sensor test disabled), 0 to pm_async
- one C1 SKU201 hang during resume, normal runin settings, see host log
- one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing, see host log ec. The unit was discovered after it had powered off due to critical battery failure (runin discharge test).
- one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing + cl2-4_0_3_07rsmith-586 EC code (S47 debug), see host log ec log.
- one B1 SKU199 hang during resume, normal runin settings with 10s suspend timing, see host log.
- Jon Nettleton
ebf24ea6 - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=ebf24ea6
- test dependencies: EC logged in S7, use whatever magic possible to get the EC errors
- Another attempt to fix the problem of EC communication failures
- testing by
196c2f806 - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=196c2f80
- testing by
- James Cameron, with a local debugging patch, saw two instances of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201 and C1 SKU202, two test units with full runin 0sec/3sec (extreme) suspend and resume failed to show any hangs, currently going are five test units with full runin 10sec/10sec (aggressive) suspend and resume.
- C1 SKU202 reported EC related kernel problems, see log, should be fixed in later kernels.
46e079fe - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=46e079fe
- EC logged in S7
- Should fix the hang caused by suspend being aborted due to a wakeup event. May also help with other IRQ hangs as I have fixed a previous mistake I made when moving the audio island code.
- Testing aborted suspends can be done by running this command and then hitting the keyboard right away.
echo $(cat /sys/power/wakeup_count) > /sys/power/wakeup_count && echo mem > /sys/power/state
- testing by
- James Cameron, with a local debugging patch, saw one instance of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201, see log tail, and several instances of the hang in dpm_resume().
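The one-line command above can be wrapped so that a failed wakeup_count write is reported instead of being silently chained. A minimal sketch, assuming the standard Linux sysfs behaviour in which writing a stale count to wakeup_count fails; the `SYSFS_POWER` variable and the `try_suspend` name are made up here so the function can be dry-run against ordinary files:

```shell
#!/bin/sh
# Sketch: arm the wakeup count, then request suspend-to-RAM.
# SYSFS_POWER defaults to the real sysfs directory; override it to
# dry-run against plain files (hypothetical variable for this sketch).
SYSFS_POWER=${SYSFS_POWER:-/sys/power}

try_suspend() {
    count=$(cat "$SYSFS_POWER/wakeup_count") || return 1
    # Writing the count back is expected to fail if a wakeup event
    # arrived after the read above.
    if ! echo "$count" > "$SYSFS_POWER/wakeup_count"; then
        echo "wakeup event arrived before the suspend request" >&2
        return 1
    fi
    # Request suspend; hit the keyboard right away to test an abort.
    echo mem > "$SYSFS_POWER/state"
}
```

Run `try_suspend` as root on the unit under test, then hit the keyboard immediately, as described above.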
bfc1b92b - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=bfc1b92b
- EC logged in S7
- Should fix the (known) cause of failing EC commands on s/r. Please test both this and the prior kernel.
- testing by
b7f22e1d - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=b7f22e1d
- EC logged in S7
- Should fix the problem where, upon a single EC error, all further EC commands fail. Please test!
- testing by
10ebd28f - arm-3.0-wip-wfi
- olpc-ec: log a warning if we manage to process an EC command deep inside of suspend,
- a tar.gz by James,
- testing by
- Richard?
ae48be89 - arm-3.0-wip-wfi
- a tar.gz by James,
- testing by
5ba0b446 - arm-3.0-wip-wfi
- olpc-ec: allow unknown commands to be executed
- test dependencies: EC logging (S7), kernel logging with EC debug enabled ("echo 1 > /sys/module/olpc_ec_1_75/parameters/ec_debug"), pm_async disabled.
- purpose: improve EC driver; fewer (or no?) races/crashes.
- testing by
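The kernel-side test dependencies above can be applied in one step before starting runin. A minimal sketch: the ec_debug path is the one quoted above, while /sys/power/pm_async is assumed to be the usual Linux location for the pm_async switch; `SYSFS_ROOT` and `prepare_ec_test` are made-up names so the function can be dry-run:

```shell
#!/bin/sh
# Sketch: apply the kernel-side test dependencies listed above.
# SYSFS_ROOT is a hypothetical override for dry runs.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

prepare_ec_test() {
    # Enable EC driver debug output in the kernel log.
    echo 1 > "$SYSFS_ROOT/module/olpc_ec_1_75/parameters/ec_debug" || return 1
    # Disable asynchronous suspend/resume (assumed standard sysfs path).
    echo 0 > "$SYSFS_ROOT/power/pm_async" || return 1
}
```

EC logging on S7 still has to be set up separately on the serial side.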
ff199462 - arm-3.0-wip-wfi
- olpc-ec: ensure gpio cmd is left low if something screws up
- EC logged in S7
- Still has FIQ debugger
- Should help reset EC bus to avoid subsequent failures after the first command fails.
- testing by
- Samuel Greenfeld
- C1 hang on resume (#3), OLS runin disabled, Q4C11 normal EC code host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done ... 72.616ms ... mmp-camera mmp-camera.0: resume) [1]
- C1 hang on resume (#6), OLS runin disabled, Q4C11 normal EC code
- C1 hang on resume (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin host ec
3d4cf36c - arm-3.0-wip-wfi
- mmp2_fiq_debugger.c add file missing from other merge
- EC logged in S7
- This enables the FIQ debugger. Reproduce kernel hangs, then from a serial console send a Break, which should drop you to a debug prompt. Run bt to see what is going on.
- testing by
- James Cameron, 10sec/10sec,
- James Cameron, 0sec/3sec, FIQ was verified as working before starting,
26f404e3 - arm-3.0-wip-wfi
- Revert "olpc-ec: don't process/ack packets when there's an underrun error"
- EC logged in S7
- further code to work around EC communications race; discussion with pgf brought up another issue, and this includes a patch from pgf. Test that audio pop during suspend/resume is gone from jnettlet.
- testing by
- Samuel Greenfeld
- 3xC1 running os23 with olpc-runin-tests-0.16.7-1 installed instead of bringup build, Q4C11, all tests (2 with EC serial), 3xC1 running with battery test disabled. Test in progress.
- 1xC1 SKU201 (#6) running aggressive (10s on/10s off) suspend failed after 15 minutes during resume, power & full battery LEDs on, all other LEDs off. All tests except the battery test were running on this unit at the time of failure. host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done)
- 1xC1 SKU201 (#3) running aggressive (10s on/10s off) suspend failed at a point TBD with EC communications failure and the eMMC root filesystem remounted read-only. It continued suspend & resume testing after the failure point(s). host ec
- James Cameron
7dad6c10 - arm-3.0-wip-wfi (BROKEN)
- XO-1.75: always send the suspend hint command to the EC, even on C-series
- EC logged in S7
- further code to work around EC communications race; discussion with pgf brought up another issue, and this includes a patch from pgf.
- testing by
- Samuel Greenfeld
- 1xC1 SKU202 tested; logs "olpc-ec-1.75: SSP reports TX underrun" and "SSP reports RX overrun" every three seconds shortly after the kernel starts initializing. This causes the XO to never fully boot, making this kernel unusable. log with EC at S7 log with EC at S47
afa391a5 - arm-3.0-wip-wfi
- Revert "olpc-1.75: back off the hardware clock gating for MMC devices"
- EC logged in S7
- test code to work around an EC communications race; needs further work, but this should hopefully keep things from hanging.
- testing by
9177e6a8 - arm-3.0-wip-wfi
- olpc-1.75: back off the hardware clock gating for MMC devices
- purpose: kill off the SET_BLOCK_COUNT errors seen by Quozl in the prior tests
- result of test: no change (the patch did not affect the outcome).
- testing by
- James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
- C1 SKU201 passed, ec host,
- C1 SKU202 hung, after 11 minutes, SET_BLOCK_COUNT eMMC failure at kernel timestamp 1159.762854, host
- B1 SKU199 hung, after 2.5 hours, SET_BLOCK_COUNT eMMC failure at kernel timestamp 7409.481875, host
- B1 SKU199 hung,
- B4 SKU199 hung, (connecting serial port afterwards did not show streaming eMMC messages),
- B1 SKU198 hung.
- James Cameron, Q4C11, os20, single, dortc,
- C1 SKU202 manually stopped early, had been needing keyboard wakeup,
- B1 SKU199 manually stopped early,
- B1 SKU199 manually stopped early, had been needing keyboard wakeup,
- B4 SKU199 manually stopped early.
- purpose: test theory that battery state of charge changes are associated with hangs
- Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test disabled
- 2xC1 SKU201 hung on resume after 10.5 hours host#3/ec#8 & host #6
- 1xC1 SKU202 hung after 11.75 due to disabling pm_async or keyboard events during runin host#4, possibly while resetting system #3 in front of it.
- All systems then reset with pm_async disabled, 4 C1 SKU201 & 3 C1 SKU 202 total.
- 1xC1 SKU202 hung citing phantom keyboard events followed by an illegal instruction, pm_async disabled [2]
- 1xC1 SKU201 hung with MMC problems, pm_async disabled [3]
- Samuel Greenfeld, os23, 3 B1 SKU198 4 B1 SKU199, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test enabled, runin-camera, runin-wlan disabled, pm_async enabled
- 1xB1 SKU 199 failure to properly handle EC interrupt on resume [4]
- James Cameron, Q4C11, os20, 10sec/10sec S/R runin, without runin-battery, without battery inserted.
- purpose: disable asynchronous S/R
- purpose: no-suspend-contention runin branch
- five units run to 100 suspend cycles, 2.5 hours each, no issues.
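When many serial logs are collected, as for the SET_BLOCK_COUNT failures reported for this kernel, a first-pass sort by known failure signature saves reading each log by hand. A minimal sketch: the signature strings are quoted from reports on this page, but the category names and the `classify_log` function are made up for illustration:

```shell
#!/bin/sh
# Classify a captured serial log by the first known signature it matches.
# Category names are hypothetical labels, not an established taxonomy.
classify_log() {
    if grep -q 'SET_BLOCK_COUNT' "$1"; then
        echo emmc
    elif grep -q 'SSP reports TX underrun' "$1"; then
        echo ec
    elif tail -n 1 "$1" | grep -q 'mmp2_pm_finish: Done'; then
        # Log ends right after the platform resume code finished.
        echo pm-finish
    else
        echo unknown
    fi
}
```

For example, `classify_log screenlog.6` prints one label for that log. A log can match several signatures; this sketch only reports the first, so treat the output as a triage hint.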
58360582 - arm-3.0-wip-wfi
- Revert "olpc-ec-1-75: clean up cmd state locking and other things"
- purpose: test the old EC driver across runin and s/r.
- result of test: no change (the patch did not affect the outcome).
- additional tests: against normal runin, and 10sec/10sec S/R runin.
- testing by
- Richard Smith, os21, C1 SKU201, three runs, against two-stage runin, pass.
- James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
- James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
- Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build
- C1 SKU201 hung on resume after 21 cycles (10s on/10s suspend), host#3 ec#8
- C1 SKU201 hung on resume overnight (10s on/10s suspend), host#6
- C1 SKU202 EC communications failure overnight host#1 ec#7
- Three remaining C1 systems running default runin suspend cycle timings did not hang after 18 hours.
2d8e7cc - arm-3.0-wip-wfi
- sdhci: ignore interrupts received after suspend
- zImage-2d8e7cc-wip-wfi
- os: os20, ofw: q4c08 or q4c09, runin in build, runin-fscheck disabled, runin-battery disabled, runin-sus set to 10sec/10sec
- purpose, test looking for non-ec related hangs.
- result of test: no change (the patch did not affect the outcome).
- testing by
6f125d7 - arm-3.0-wip
- sdhci: ignore interrupts received after suspend (same commit comment but different git branch than above)
- zImage-6f125d7-wip
- os: os20, ofw: q4c08, runin in build, runin-sus disabled.
- purpose: verification of a build for manufacturing testing.
- result of test: success, the kernel is stable for runin testing in manufacturing if used without suspend and resume.
- testing by
- was added to build os23.
- os23 testing by
- James Cameron, C1 SKU201, C1 SKU202, B4, B1, B1, B1, all passed 24 hr testing.
4239902
- os20 q4c09 runin 0.16.7 10sec 10sec 24hrs
- testing by
- James Cameron, fail, one unit hung, all units lost EC communications (blank SOC display, hang after runin pass),
- C1 SKU201 pass, host log at point of SOC loss, ec log at point of SOC loss
- C1 SKU202 hang at 1821 cycles, remaining 10:35:38, no serial cable was attached,
- B4 SKU199 pass,
- B1 SKU199 pass,
- B1 SKU199 pass,
- B1 SKU198 pass,