XO-1.75/Kernel/Testing

Revision as of 20:05, 25 January 2012 by Quozl (talk | contribs) (be5c532a - arm-3.0-wip)

Kernel developers: put your new kernel in a new section here, latest at top. Testers: place links to your reports indented under the kernel.

be5c532a - arm-3.0-wip

  • be5c532a
  • testing by
    • James Cameron, q4d02jc, defconfig, with serial driver disabled, with epitaph, olpc-runin-tests 0.17.3, with watchdog, with aggressive suspend, on six units, looking for eMMC or other hangs; 20 hours fine, one instance of camera missing (which can only occur on boot), see dmesg

8f74291b - arm-3.0-wip

  • config: update for new graphics stuff + 3.0.17
  • EC logged in S7/S47 if available
  • os25 or another build with the new graphics ABI
  • purpose: Copy changes from 3.0.17, CMA, new graphics ABI to -wip branch
  • testing by
    • Samuel Greenfeld
      • 5xC1, 7 hours elapsed; 10s aggressive suspend w/ runin-camera test active; in progress

073c7fbf - arm-3.0.17-wip

  • TTY: serial_core: Fix crash if DCD drop during suspend
  • EC logged S47 (if available)
  • os24 base image or one which does *not* have the new graphics ABI
  • purpose, test if fixes found in the 3.0.17 Linux kernel solve our issues
  • testing by
    • Samuel Greenfeld
      • 5xC1, all units passed 72 hours (3 cycles) with 10s suspend cycles (with clock skews ranging from 10 minutes backwards to 20 minutes forward after being set via ntp-set-clock prior to the test). The order in which the systems rebooted and the overall duration differed from the order and times at which I recall starting them. Testing was done without runin-camera, although looking at the kernel configuration it may be possible to turn it on again.

b8114c95 - arm-3.0-wip-wfi

  • b8114c95
  • os24 base image or one which does *not* have the new graphics ABI
  • disable runin-camera test
  • purpose, see if eMMC stack backport from 3.2 fixes eMMC hangs, for Andres.
  • testing by
    • Samuel Greenfeld
      • 5xC1, still hangs (as before) or reboots reporting filesystem issues, sampled logs here

5fef1609 - arm-3.0-wip-wfi

  • mmc: try reducing bus bit width in case of error
  • EC logged at S47
  • os24 base image or one which does *not* have the new graphics ABI
  • disable runin-camera test
  • purpose, look for EC, other lockups
  • aggressive suspend cycles
  • testing by
    • Samuel Greenfeld
      • 2xSKU 202, 3xSKU 201, all with serial consoles, (one of each with EC serial)
      • C1 SKU201 (#4) failure due to journal replay error, filesystem remounted readonly & runin reboot
      • Other systems all hung within a few hours, some multiple times after being restarted
      • Logs online here
    • James Cameron, with a local patch to disable serial and capture dmesg via CForth; one of the five systems has hung in over 52 hours, a remarkably stable test result. The one hang was preceded by an EC driver message "olpc-ec-1.75: SSP reports TX underrun", then a WARNING, log
      • C1 SKU201, start 2012-01-20 04:12:26, end 2012-01-24 04:18:42, four days, no errors, post-run ntpdate correction -3333.884336 sec,
      • C1 SKU202, start 2012-01-20 04:12:28, end 2012-01-24 05:17:09, four days, one EC driver error, post-run ntpdate correction -534.727839 sec,
      • B1, start 2012-01-20 04:12:29, end 2012-01-24 04:19:15, four days, no errors, post-run ntpdate correction 926.420114 sec,
      • B1, start 2012-01-20 04:12:31, end 2012-01-24 04:18:43, four days, no errors, post-run ntpdate correction -2817.877176 sec,
      • B4, start 2012-01-20 04:12:32, end 2012-01-24 04:18:38, four days, no errors, post-run ntpdate correction 1661.545946 sec.
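
The post-run ntpdate corrections above can be put on a common scale by dividing each by the elapsed run time. The following is a minimal sketch of that arithmetic, not part of the test tooling; the function name drift_ppm and the unit labels are made up for illustration, and it assumes the correction reflects clock error accumulated over the whole run.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def drift_ppm(start, end, correction_sec):
    """Return the ntpdate correction as parts per million of elapsed time.

    Assumption: the whole correction accumulated between start and end.
    The sign is passed through unchanged from the reported correction.
    """
    elapsed = (datetime.strptime(end, FMT)
               - datetime.strptime(start, FMT)).total_seconds()
    return correction_sec / elapsed * 1e6


# Values copied from the five units listed above.
runs = [
    ("C1 SKU201", "2012-01-20 04:12:26", "2012-01-24 04:18:42", -3333.884336),
    ("C1 SKU202", "2012-01-20 04:12:28", "2012-01-24 05:17:09", -534.727839),
    ("B1 (first)", "2012-01-20 04:12:29", "2012-01-24 04:19:15", 926.420114),
    ("B1 (second)", "2012-01-20 04:12:31", "2012-01-24 04:18:43", -2817.877176),
    ("B4", "2012-01-20 04:12:32", "2012-01-24 04:18:38", 1661.545946),
]

for unit, start, end, correction in runs:
    print(f"{unit}: {drift_ppm(start, end, correction):+.0f} ppm")
```

For the first unit this comes to roughly 1% of elapsed time, far more than a typical crystal error, which suggests the skew may come from the suspend and resume cycling rather than from the oscillator alone.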

b2845e62 - arm-3.0-wip-graphics (os25)

  • (pxa168fb) Fix HWCursor Resume logic
  • Use os25 to make this easiest to install
  • Q4D01 recommended
  • purpose: test possible ramp build
  • testing by
    • Samuel Greenfeld
      • 3xB1, 5xC2 running normal runin, completed all units Friday; should be attempting additional 24-hour cycles

9a3a8436 - arm-3.0-wip-graphics

  • (galcore:marvell) Pass the device to DMA for reservations
  • Jon Nettleton's graphics RPM repo is required (ABI breakage)
  • purpose, test new graphics driver in runin with a somewhat ideally stable kernel
  • testing by
    • Samuel Greenfeld
      • 2xSKU 202, 2xSKU 201 passed 21 hours of testing with 10s suspend, ~3600 suspend cycles each
      • 1xSKU 201 hung after ~20 hours of testing with 10s suspend
      • 5xSKU 200 ran normal runin settings for 21 hours, no issues, ~58 suspend cycles each
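
As a rough cross-check of the cycle counts above, the average time per suspend/resume cycle follows from the test duration and the cycle count. This is a minimal sketch using the approximate figures reported in this section; the helper name seconds_per_cycle is made up for illustration.

```python
def seconds_per_cycle(hours, cycles):
    """Average wall-clock seconds per suspend/resume cycle."""
    return hours * 3600 / cycles


# Approximate figures from the results above (both counts were given with "~").
aggressive = seconds_per_cycle(21, 3600)  # 10s suspend runs
normal = seconds_per_cycle(21, 58)        # normal runin settings

print(f"aggressive: {aggressive:.1f} s per cycle")
print(f"normal: {normal:.0f} s per cycle ({normal / 60:.1f} minutes)")
```

With 10s suspend the average works out to about 21 seconds per cycle, which would leave roughly 11 seconds per cycle for the awake period and the suspend and resume transitions.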

daf43181 - arm-3.0-wip-wfi

fc631432 - arm-3.0-wip-wfi

  • testconfig: reenable no_console_suspend
  • EC logged in S7 or S47.
  • Disable runin camera test (disabled mmp_camera driver)
  • purpose, gather more information about why we hang, since removing no_console_suspend led to hanging significantly less often.
  • testing by
    • Samuel Greenfeld
      • Seems to hang on all test systems within 2 hours with 10s suspend cycles (multiple failure modes, including #11528), possibly aggravated by sending serial data to the XO during the suspend/resume process. But failures have also been seen after rebooting systems and ignoring them.

cfb3aa76 - arm-3.0-wip-wfi

  • cfg80211: amend regulatory NULL dereference fix
  • EC logged in S7 or S47. Do not use no_console_suspend.
  • purpose: fixes a wireless oops; still need to verify TxFIFO sanity check patch
  • testing by
    • Samuel Greenfeld, 5 C1s, 1 B1, 10s aggressive suspend cycles
      • SKU202 #0 failed with 0:08:26 (hours:minutes:seconds) remaining, after 3139 cycles; power LED on, full charge seen
      • SKU199 #6 failed with 0:53:10 (hours:minutes:seconds) remaining, after 2990 cycles; power LED on, battery LED off (discharging)
      • SKU201 #4 failed second run with 22:51 (hours:mins) remaining, after 441 suspend cycles
      • SKU202 #0 failed second run with 21:31 (hours:mins) remaining, after 322 cycles
      • SKU202 #1 failed second run with 18:36 (hours:mins) remaining, after 689 cycles
      • SKU199 #6 failed second run with 13:29 (hours:mins) remaining, after 1354 cycles
      • SKU201 #4 failed third run with 20:08 (hours:mins) remaining, after 510 cycles
      • SKU202 #0 failed third run with 18:42 (hours:mins) remaining, after 1694 cycles
      • Logs available here

be581eb4 - arm-3.0-wip-wfi

  • testconfig: turn off no_console_suspend
  • EC logged in S7 or S47, aggressive suspending in runin
  • do not use no_console_suspend, use same configuration as caused hangs previously,
  • purpose: Verify the 'olpc-ec: do a full reset if the TxFIFO sanity check fails' patch.
  • testing by
    • James Cameron, os24 remastered with kernel, C1 SKU201 passed, C1 SKU202 hung (log), B1 hung, (log).

f1f8f7fe - arm-3.0-wip, EC 4.01

  • git hash,
  • rpm,
  • build,
  • do not use no_console_suspend, include light sensor test
  • make sure to manually upgrade the EC firmware to cl2-4_0_4_01
  • purpose, test if actually getting the EC to be quiet during suspend improves things
  • testing by
    • John Watlington:
      • SKU201 (3) 24hr cycles, 600+ cycles ongoing;
      • SKU202 (3) 24hr cycles, 600+ cycles ongoing
      • (12x) SKU198/199/just plain weird, ongoing
      • (10x) SKU200, ongoing;
      • (4x) SKU201, ongoing;
      • (2x) SKU202, ongoing;
      • (2x) SKU203, ongoing;
      • (2x) SKU204, ongoing;

f1f8f7fe - arm-3.0-wip

  • git hash,
  • rpm,
  • build,
  • do not use no_console_suspend, use same configuration as caused hangs previously,
  • purpose, test if debug console messages may cause hangs,
  • testing by
    • Jon Nettleton, C1, 13:20 hours / 1966 cycles ongoing, os20,
    • James Cameron,
      • C1 SKU201, C1 SKU202, B1, all passed,
      • C1 SKU201, C1 SKU202, B1, B1, B4, with os24, all passed.
    • John Watlington, with light-sensor test:
      • C1 SKU201 (2) 24hr loop + 2000 cycles stopped;
      • C1 SKU202 (2) 24hr loop + 2100 cycles stopped
      • (10x) SKU198/199/just plain weird, (2) 24hr loop, stopped
      • (8x) SKU200, (2) 24 hr loop, stopped;
      • (3x) SKU201, (2) 24 hr loop, stopped;
      • SKU202, (2) 24hr loop, stopped;
      • SKU198, hang in first 24 hrs, all hangs either screen blank w. power light on or turned off.
      • SKU199, hang in first 24 hrs
      • (2x) SKU200, hang in first 24 hrs
      • SKU201, hang in first 24 hrs
      • SKU202, hang in first 24 hrs
    • Samuel Greenfeld, Q4C12, patched 0.17.0 runin to support kernel RPM version
      • C1 SKU202 failure after 1805 cycles (cause unclear but EC locked up) #11571
      • C1 SKU202 failure after 3455 cycles (EC transmission queue full followed by null pointer dereference) #11573
      • B1 SKU199 failure after 2062 cycles (cause unknown; no serial logs available) #11572
      • 3xC1 SKU201, 2xB1 SKU 198, 2xB1 SKU 199 passed 24 hours of testing.
      • All 10 systems mentioned above re-imaged to os24 with normal 24-hour runin settings; passed.
    • Chia-Hsiu, 6 x C2, 2 x C1, using os24, ongoing
      • one 24-hour run completed successfully
    • C2 SKU200 x7 @ Miami office using os24 - ongoing
      • 20 hrs with >50 s/r cycles

1220526f - arm-3.0-wip-wfi

  • debug: add debugging to workqueues, irqs, etc
  • test dependencies: save your System.map from your build
  • purpose: attempt to track down hangs to specific driver callbacks
  • testing by
    • James Cameron, three units, testconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog for first run only, 0 to pm_async,
      • C1 SKU201, hang 1, log, hang 2, log (http://dev.laptop.org/~quozl/z/1RkTrN.txt), hang 3, a repeating invalid page fault with storage LED blinking, FIQ entry but insane, log, fiq, hang 4, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 5, in resume, log, stopped.
      • C2 SKU202, hang 1, with spontaneous kdb entry, and a backtrace, log, hang 2, log, hang 3, not with the workqueue debugging active at the time, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 4, now in EARLY resume, log, stopped.
      • B1, hang 1, log, hang 2, log, hang 3, log, stopped.
      • B1 (pass), B4 (pass), B1 (pass), with -ENODEV return from serial_pxa_init(), and no calls to mmp2_add_uart(), ongoing, 3483 cycles, 23 hours,

0a896953 - arm-3.0-wip-wfi

  • olpc-ec: consolidate code and ensure TxFIFO is empty before writing to it
  • EC logged in S7, runin 0.17.0 or 0.16.7 (not C2 builds) if possible
  • Brief tests of recent kernels over the past ~12 hours suggest that their fixes may resolve all known easily reproducible hangs. This needs to be proved or disproved.
  • If you have enough systems, try using stock runin 0.17.0-style settings as well as anything known to cause particular types of failures. (EC failures seem to be common within a few hours with 10s on/10s off suspend, only using runin-main/gtk/common/tests/sus/battery.)
  • testing by
    • James Cameron, six units, defconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog, 0 to pm_async, started around Sat Jan 7 08:24:00 UTC 2012,
      • C1 SKU201, 534 tests, 530 pass, 4 fail,
      • C1 SKU202, 2777 tests, 2711 pass, 1 WARNING at olpc_ec_1_75_cmd (log), 66 fail,
      • one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
      • one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
    • Samuel Greenfeld, five C1 units, runin 0.16.7 setup like 0.17 (light sensor test disabled), 0 to pm_async
      • one C1 SKU201 hang during resume, normal runin settings, see host log
      • one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing, see host log ec. The unit was discovered after it had powered off due to critical battery failure (runin discharge test).
      • one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing + cl2-4_0_3_07rsmith-586 EC code (S47 debug), see host log ec log.
      • one B1 SKU199 hang during resume, normal runin settings with 10s suspend timing, see host log.
    • Jon Nettleton

ebf24ea6 - arm-3.0-wip-wfi

196c2f806 - arm-3.0-wip-wfi

  • http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=196c2f80
  • testing by
    • James Cameron, with a local debugging patch, saw two instances of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201 and C1 SKU202, two test units with full runin 0sec/3sec (extreme) suspend and resume failed to show any hangs, currently going are five test units with full runin 10sec/10sec (aggressive) suspend and resume.
      • C1 SKU202 reported EC related kernel problems, see log, should be fixed in later kernels.

46e079fe - arm-3.0-wip-wfi

  • http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=46e079fe
  • EC logged in S7
  • Should fix the hang caused by suspend being aborted due to a wakeup event. May also help with other IRQ hangs as I have fixed a previous mistake I made when moving the audio island code.
  • Testing aborted suspends can be done by running this command and then hitting the keyboard right away.

echo $(cat /sys/power/wakeup_count) > /sys/power/wakeup_count && echo mem > /sys/power/state

  • testing by
    • James Cameron, with a local debugging patch, saw one instance of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201, see log tail, and several instances of the hang in dpm_resume().
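
The aborted-suspend test command above relies on the kernel's wakeup_count handshake: the current count is read, then written back, and the write is refused if a wakeup event arrived in between. The following is a minimal sketch of the same sequence as a function, so that the outcome can be checked by a script. The function name suspend_unless_wakeup is made up for illustration, and treating a failed write to either file as an aborted suspend is an assumption of this sketch.

```python
def suspend_unless_wakeup(count_path="/sys/power/wakeup_count",
                          state_path="/sys/power/state"):
    """Try to suspend to RAM; return False if the attempt was refused.

    Mirrors the shell one-liner: read the wakeup count, write it back,
    then write "mem" to the power state file. Assumption: an OSError
    from either write means a wakeup event aborted the suspend.
    """
    with open(count_path) as f:
        count = f.read().strip()
    try:
        # The kernel rejects this write if the count has changed since
        # it was read, i.e. a wakeup event arrived in the meantime.
        with open(count_path, "w") as f:
            f.write(count)
        # Request suspend-to-RAM; this returns after resume or abort.
        with open(state_path, "w") as f:
            f.write("mem")
    except OSError:
        return False
    return True
```

On a test unit, calling suspend_unless_wakeup() as root and hitting the keyboard right away should exercise the aborted path. Both paths are parameters so that the logic can be tried against ordinary files first.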

bfc1b92b - arm-3.0-wip-wfi

b7f22e1d - arm-3.0-wip-wfi

10ebd28f - arm-3.0-wip-wfi

ae48be89 - arm-3.0-wip-wfi

  • a tar.gz by James,
  • testing by
    • Richard?
    • James Cameron, 10sec/10sec, with audio disabled, patch,

5ba0b446 - arm-3.0-wip-wfi

  • olpc-ec: allow unknown commands to be executed
  • test dependencies: EC logging (S7), kernel logging with EC debug enabled ("echo 1 > /sys/module/olpc_ec_1_75/parameters/ec_debug"), pm_async disabled.
  • purpose: improve EC driver; fewer (or no?) races/crashes..
  • testing by
    • Samuel Greenfeld
      • C1 SKU 201 EC communications failure (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin. host ec

ff199462 - arm-3.0-wip-wfi

  • olpc-ec: ensure gpio cmd is left low if something screws up
  • EC logged in S7
  • Still has FIQ debugger
  • Should help reset EC bus to avoid subsequent failures after the first command fails.
  • testing by
    • Samuel Greenfeld
      • C1 hang on resume (#3), OLS runin disabled, Q4C11 normal EC code host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done ... 72.616ms ... mmp-camera mmp-camera.0: resume) [1]
      • C1 hang on resume (#6), OLS runin disabled, Q4C11 normal EC code
      • C1 hang on resume (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin host ec

3d4cf36c - arm-3.0-wip-wfi

  • mmp2_fiq_debugger.c add file missing from other merge
  • EC logged in S7
  • This enables the FIQ debugger. Reproduce kernel hangs, then from a serial console send a Break, which should drop you to a debug prompt. Run bt to see what is going on.
  • testing by
    • James Cameron, 10sec/10sec,
      • C1 SKU201 hang at mmp-camera mmp-camera.0: resume, host,
      • C1 SKU202 hang at mmp-camera mmp-camera.0: resume, host,
      • C1 SKU202 hang at mmp-camera mmp-camera.0: resume, host,
      • no response to BREAK,
    • James Cameron, 0sec/3sec, FIQ was verified as working before starting,
      • C1 SKU201 host (known audio problem that doesn't need to be reported again),
      • C1 SKU202 host (known audio problem),
      • B4, and two B1 hung, no response to BREAK.

26f404e3 - arm-3.0-wip-wfi

  • Revert "olpc-ec: don't process/ack packets when there's an underrun error"
  • EC logged in S7
  • further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf. Test that audio pop during suspend/resume is gone from jnettlet.
  • testing by
    • Samuel Greenfeld
      • 3xC1 running os23 with olpc-runin-tests-0.16.7-1 installed instead of bringup build, Q4C11, all tests (2 with EC serial), 3xC1 running with battery test disabled. Test in progress.
        • 1xC1 SKU201 (#6) running aggressive (10s on/10s off) suspend failed after 15 minutes during resume, power & full battery LEDs on, all other LEDs off. All tests except the battery test were running on this unit at the time of failure. host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done)
        • 1xC1 SKU201 (#3) running aggressive (10s on/10s off) suspend failed at a point TBD with EC communications failure and the eMMC root filesystem remounted read-only. It continued suspend & resume testing after the failure point(s). host ec
    • James Cameron
      • C1 C1 B4 B1 B1 B1, os20, runin 0.16.7, 10sec/10sec, hangs still occur one, two, audio pops during suspend and resume are still present.

7dad6c10 - arm-3.0-wip-wfi (BROKEN)

afa391a5 - arm-3.0-wip-wfi

9177e6a8 - arm-3.0-wip-wfi

  • olpc-1.75: back off the hardware clock gating for MMC devices
  • purpose: kill off the SET_BLOCK_COUNT errors seen by Quozl in the prior tests
  • result of test: no change (the patch did not affect the outcome).
  • testing by
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 passed, ec host,
      • C1 SKU202 hung, after 11 minutes, SET_BLOCK_COUNT eMMC failure at kernel timestamp 1159.762854, host
      • B1 SKU199 hung, after 2.5 hours, SET_BLOCK_COUNT eMMC failure at kernel timestamp 7409.481875, host
      • B1 SKU199 hung,
      • B4 SKU199 hung, (connecting serial port afterwards did not show streaming eMMC messages),
      • B1 SKU198 hung.
    • James Cameron, Q4C11, os20, single, dortc,
      • C1 SKU202 manually stopped early, had been needing keyboard wakeup,
      • B1 SKU199 manually stopped early,
      • B1 SKU199 manually stopped early, had been needing keyboard wakeup,
      • B4 SKU199 manually stopped early.
  • purpose: test theory that battery state of charge changes are associated with hangs
    • Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test disabled
      • 2xC1 SKU201 hung on resume after 10.5 hours host#3/ec#8 & host #6
      • 1xC1 SKU202 hung after 11.75 hours, due to disabling pm_async or to keyboard events during runin, host#4, possibly while resetting system #3 in front of it.
      • All systems then reset with pm_async disabled, 4 C1 SKU201 & 3 C1 SKU 202 total.
      • 1xC1 SKU202 hung citing phantom keyboard events followed by an illegal instruction, pm_async disabled [2]
      • 1xC1 SKU201 hung with MMC problems, pm_async disabled [3]
    • Samuel Greenfeld, os23, 3 B1 SKU198 4 B1 SKU199, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test enabled, runin-camera, runin-wlan disabled, pm_async enabled
      • 1xB1 SKU 199 failure to properly handle EC interrupt on resume [4]
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin, without runin-battery, without battery inserted.
      • C1 SKU201 hung, after 67 minutes, host, C1 SKU202 hung, after 15 minutes, host, B1 SKU199 x 2, B4 SKU199 hung, after 90 minutes, host tail, B1 SKU198.
  • purpose: disable asynchronous S/R
  • purpose: no-suspend-contention runin branch
    • five units run to 100 suspend cycles, 2.5 hours each, no issues.

58360582 - arm-3.0-wip-wfi

  • Revert "olpc-ec-1-75: clean up cmd state locking and other things"
  • purpose: test the old EC driver across runin and s/r.
  • result of test: no change (the patch did not affect the outcome).
  • additional tests: against normal runin, and 10sec/10sec S/R runin.
  • testing by
    • Richard Smith, os21, C1 SKU201, three runs, against two-stage runin, pass.
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 stopped, SOC display loss at EC timestamp 5380673 kernel timestamp 745.92 (same symptom as seen in #4239902), ec host,
      • C1 SKU202 stopped, eMMC failure, host
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 hung, ec host,
      • C1 SKU202 hung, within 30 minutes, host
      • B4 SKU199 hung, after two hours,
      • B1 SKU199 was going fine for 2:15, but manually stopped, host
      • B1 SKU199 hung, after one hour,
      • B1 SKU198 hung.
    • Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build
      • C1 SKU201 hung on resume after 21 cycles (10s on/10s suspend), host#3 ec#8
      • C1 SKU201 hung on resume overnight (10s on/10s suspend), host#6
      • C1 SKU202 EC communications failure overnight host#1 ec#7
      • Three remaining C1 systems running default runin suspend cycle timings did not hang after 18 hours.

2d8e7cc - arm-3.0-wip-wfi

  • sdhci: ignore interrupts received after suspend
  • zImage-2d8e7cc-wip-wfi
  • os: os20, ofw: q4c08 or q4c09, runin in build, runin-fscheck disabled, runin-battery disabled, runin-sus set to 10sec/10sec
  • purpose, test looking for non-ec related hangs.
  • result of test: no change (the patch did not affect the outcome).
  • testing by
    • Samuel Greenfeld
      • Passed, os21 not os20, B1 test bed, seven units.
    • James Cameron,
      • C1 SKU201 host ec, hung at 31 cycles 23:46:01 remaining,
      • C1 SKU202 host, hung at 36 cycles 23:43:41 remaining,
      • B1 SKU199 host, hung at 1737 cycles 11:25:02 remaining,
      • B4 SKU199, hung at 398 cycles 12:03:29 remaining,

6f125d7 - arm-3.0-wip

  • sdhci: ignore interrupts received after suspend (same commit comment but different git branch than above)
  • zImage-6f125d7-wip
  • os: os20, ofw: q4c08, runin in build, runin-sus disabled.
  • purpose: verification of a build for manufacturing testing.
  • result of test: success, the kernel is stable for runin testing in manufacturing if used without suspend and resume.
  • testing by
    • Samuel Greenfeld
      • Four C1 SKU 201, Three C1 SKU202 passed 24 hr testing
    • James Cameron,
      • C1 SKU201 host ec, passed,
      • C1 SKU202 host, passed.
  • was added to build os23.
  • os23 testing by
    • James Cameron, C1 SKU201, C1 SKU202, B4, B1, B1, B1, all passed 24 hr testing.

4239902

  • os20 q4c09 runin 0.16.7 10sec 10sec 24hrs
  • testing by
    • James Cameron, fail, one unit hung, all units lost EC communications (blank SOC display, hang after runin pass),