XO-1.75/Kernel/Testing: Difference between revisions

(add 9a3a8436 wip-graphics testing)

Revision as of 21:02, 19 January 2012

Kernel developers, put your new kernel in a new section here. Latest at top. Testers place links to reports indented under the kernel.

See also:


9a3a8436 - arm-3.0-wip-graphics

  • link to git hash via http
  • Jon Nettleton's graphics RPM repo is required (ABI breakage)
  • purpose, test the new graphics driver in runin on a kernel that is, ideally, reasonably stable
  • testing by
    • Samuel Greenfeld
      • 2xSKU 202, 2xSKU 201 passed 21 hours of testing with 10s suspend, ~3600 suspend cycles each
      • 1xSKU 201 hung after ~20 hours of testing with 10s suspend
      • 5xSKU 200 ran normal runin settings for 21 hours, no issues, ~58 suspend cycles each
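
As a rough sanity check on the cycle counts above, the average time per suspend cycle follows from the run length and the cycle count. This is a minimal sketch; `seconds_per_cycle` is a hypothetical helper (not part of runin), and it uses integer division for simplicity.

```shell
#!/bin/sh
# Hypothetical helper: average seconds per suspend cycle.
seconds_per_cycle() {
    # $1: run length in whole hours, $2: number of suspend cycles
    echo $(( $1 * 3600 / $2 ))
}
```

With the approximate figures reported above, `seconds_per_cycle 21 3600` gives 21, which is plausibly consistent with 10s suspend timing plus a similar time awake, and `seconds_per_cycle 21 58` gives 1303 (roughly 22 minutes per cycle) for the normal runin settings.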

daf43181 - arm-3.0-wip-wfi

fc631432 - arm-3.0-wip-wfi

  • testconfig: reenable no_console_suspend
  • EC logged in S7 or S47.
  • Disable runin camera test (disabled mmp_camera driver)
  • purpose, gather more information about why we hang, since removing no_console_suspend led to hanging significantly less often.
  • testing by
    • Samuel Greenfeld
      • Seems to hang on all test systems within 2 hours with 10s suspend cycles (multiple failure modes, including #11528), possibly aggravated by sending serial data to the XO during the suspend/resume process. But failures have also been seen after rebooting systems and ignoring them.

cfb3aa76 - arm-3.0-wip-wfi

  • cfg80211: amend regulatory NULL dereference fix
  • EC logged in S7 or S47. Do not use no_console_suspend.
  • purpose: fixes a wireless oops; still need to verify TxFIFO sanity check patch
  • testing by
    • Samuel Greenfeld, 5 C1s, 1 B1, 10s aggressive suspend cycles
      • SKU202 #0 failed with 0:08:26 (hours:minutes:seconds) remaining after 3139 cycles; power LED on, full charge seen
      • SKU199 #6 failed with 0:53:10 (hours:minutes:seconds) remaining after 2990 cycles; power LED on, battery LED off (discharging)
      • SKU201 #4 failed second run 22:51 (hours:mins) remaining after 441 suspend cycles
      • SKU202 #0 failed second run 21:31 (hours:mins) remaining 322 cycles
      • SKU202 #1 failed second run 18:36 (hours:mins) remaining 689 cycles
      • SKU199 #6 failed second run 13:29 (hours:mins) remaining 1354 cycles
      • SKU201 #4 failed third run 20:08 (hours:mins) remaining 510 cycles
      • SKU202 #0 failed third run 18:42 (hours:mins) remaining 1694 cycles
      • Logs available here

be581eb4 - arm-3.0-wip-wfi

  • testconfig: turn off no_console_suspend
  • EC logged in S7 or S47, aggressive suspending in runin
  • do not use no_console_suspend, use same configuration as caused hangs previously,
  • purpose: Verify the 'olpc-ec: do a full reset if the TxFIFO sanity check fails' patch.
  • testing by
    • James Cameron, os24 remastered with kernel, C1 SKU201 passed, C1 SKU202 hung (log), B1 hung (log).

f1f8f7fe - arm-3.0-wip, EC 4.01

  • git hash,
  • rpm,
  • build,
  • do not use no_console_suspend, include light sensor test
  • make sure to manually upgrade the EC firmware to cl2-4_0_4_01
  • purpose, test if actually getting the EC to be quiet during suspend improves things
  • testing by
    • John Watlington:
      • SKU201 (3) 24hr cycles, 600+ cycles ongoing;
      • SKU202 (3) 24hr cycles, 600+ cycles ongoing
      • (12x) SKU198/199/just plain weird, ongoing
      • (10x) SKU200, ongoing;
      • (4x) SKU201, ongoing;
      • (2x) SKU202, ongoing;
      • (2x) SKU203, ongoing;
      • (2x) SKU204, ongoing;

f1f8f7fe - arm-3.0-wip

  • git hash,
  • rpm,
  • build,
  • do not use no_console_suspend, use same configuration as caused hangs previously,
  • purpose, test if debug console messages may cause hangs,
  • testing by
    • Jon Nettleton, C1, 13:20 hours / 1966 cycles ongoing, os20,
    • James Cameron,
      • C1 SKU201, C1 SKU202, B1, all passed,
      • C1 SKU201, C1 SKU202, B1, B1, B4, with os24, all passed.
    • John Watlington, with light-sensor test:
      • C1 SKU201 (2) 24hr loop + 2000 cycles stopped;
      • C1 SKU202 (2) 24hr loop + 2100 cycles stopped
      • (10x) SKU198/199/just plain weird, (2) 24hr loop, stopped
      • (8x) SKU200, (2) 24 hr loop, stopped;
      • (3x) SKU201, (2) 24 hr loop, stopped;
      • SKU202, (2) 24hr loop, stopped;
      • SKU198, hang in first 24 hrs. In all hangs the unit either had a blank screen with the power light on, or had turned off.
      • SKU199, hang in first 24 hrs
      • (2x) SKU200, hang in first 24 hrs
      • SKU201, hang in first 24 hrs
      • SKU202, hang in first 24 hrs
    • Samuel Greenfeld, Q4C12, patched 0.17.0 runin to support kernel RPM version
      • C1 SKU202 failure after 1805 cycles (cause unclear but EC locked up) #11571
      • C1 SKU202 failure after 3455 cycles (EC transmission queue full followed by null pointer dereference) #11573
      • B1 SKU199 failure after 2062 cycles (cause unknown; no serial logs available) #11572
      • 3xC1 SKU201, 2xB1 SKU 198, 2xB1 SKU 199 passed 24 hours of testing.
      • All 10 systems mentioned above re-imaged to os24 with normal 24-hour runin settings; passed.
    • Chia-Hsiu, 6 x C2, 2 x C1, using os24, ongoing
      • one 24 hr run completed successfully
    • C2 SKU200 x7 @ Miami office using os24 - ongoing
      • 20 hrs with >50 s/r cycles

1220526f - arm-3.0-wip-wfi

  • debug: add debugging to workqueues, irqs, etc
  • test dependencies: save your System.map from your build
  • purpose: attempt to track down hangs to specific driver callbacks
  • testing by
    • James Cameron, three units, testconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog for first run only, 0 to pm_async,
      • C1 SKU201, hang 1, log, hang 2, log (http://dev.laptop.org/~quozl/z/1RkTrN.txt), hang 3, a repeating invalid page fault with storage LED blinking, FIQ entry but insane, log, fiq, hang 4, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 5, in resume, log, stopped.
      • C2 SKU202, hang 1, with spontaneous kdb entry, and a backtrace, log, hang 2, log, hang 3, not with the workqueue debugging active at the time, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 4, now in EARLY resume, log, stopped.
      • B1, hang 1, log, hang 2, log, hang 3, log, stopped.
      • B1 (pass), B4 (pass), B1 (pass), with -ENODEV return from serial_pxa_init(), and no calls to mmp2_add_uart(), ongoing, 3483 cycles, 23 hours,
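
The System.map saved from the build is what turns a raw address in a hang log into a function name. This is a minimal sketch of that lookup, assuming a map sorted by address and addresses written as 8 lowercase hex digits; `lookup_symbol` is a hypothetical helper, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print the last symbol in System.map whose address is
# less than or equal to the given address. Assumes the map is sorted by
# address and that the map and the query both use 8 lowercase hex digits.
lookup_symbol() {
    map_file=$1
    addr=$2
    # Appending "" forces awk to compare the addresses as strings, which
    # orders equal-width lowercase hex values correctly.
    awk -v addr="$addr" '
        ($1 "") <= (addr "") { sym = $3 }
        END { if (sym != "") print sym }
    ' "$map_file"
}
```

It would be called as `lookup_symbol System.map <address>`, with the address taken from the backtrace or workqueue debug output. It prints nothing if the address is below the first symbol in the map.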

0a896953 - arm-3.0-wip-wfi

  • olpc-ec: consolidate code and ensure TxFIFO is empty before writing to it
  • EC logged in S7, runin 0.17.0 or 0.16.7 (not C2 builds) if possible
  • Kernels tested briefly over the past ~12 hours suggest that their fixes may resolve all known easily reproducible hangs. This needs to be proved or disproved.
  • If you have enough systems, try using stock runin 0.17.0-style settings as well as anything known to cause particular types of failures. (EC failures seem to be common within a few hours with 10s on/10s off suspend, only using runin-main/gtk/common/tests/sus/battery.)
  • testing by
    • James Cameron, six units, defconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog, 0 to pm_async, started around Sat Jan 7 08:24:00 UTC 2012,
      • C1 SKU201, 534 tests, 530 pass, 4 fail,
      • C1 SKU202, 2777 tests, 2711 pass, 1 WARNING at olpc_ec_1_75_cmd (log), 66 fail,
      • one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
      • one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
    • Samuel Greenfeld, five C1 units, runin 0.16.7 setup like 0.17 (light sensor test disabled), 0 to pm_async
      • one C1 SKU201 hang during resume, normal runin settings, see host log
      • one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing, see host log ec. The unit was discovered after it had powered off due to critical battery failure (runin discharge test).
      • one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing + cl2-4_0_3_07rsmith-586 EC code (S47 debug), see host log ec log.
      • one B1 SKU199 hang during resume, normal runin settings with 10s suspend timing, see host log.
    • Jon Nettleton
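
To compare units with very different test counts, the pass/fail tallies above can be reduced to a failure percentage. This is a minimal sketch; `fail_rate` is a hypothetical helper and the two-decimal rounding is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: failed tests as a percentage of total tests.
fail_rate() {
    # $1: total number of tests, $2: number of failed tests
    awk -v total="$1" -v failed="$2" \
        'BEGIN { printf "%.2f\n", 100 * failed / total }'
}
```

For the figures reported above, `fail_rate 534 4` prints 0.75 for the C1 SKU201 and `fail_rate 2777 66` prints 2.38 for the C1 SKU202.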

ebf24ea6 - arm-3.0-wip-wfi

196c2f806 - arm-3.0-wip-wfi

  • http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=196c2f80
  • testing by
    • James Cameron, with a local debugging patch, saw two instances of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201 and C1 SKU202. Two test units with full runin 0sec/3sec (extreme) suspend and resume failed to show any hangs. Five test units are currently running full runin 10sec/10sec (aggressive) suspend and resume.
      • C1 SKU202 reported EC related kernel problems, see log, should be fixed in later kernels.

46e079fe - arm-3.0-wip-wfi

  • http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=46e079fe
  • EC logged in S7
  • Should fix the hang caused by suspend being aborted due to a wakeup event. May also help with other IRQ hangs as I have fixed a previous mistake I made when moving the audio island code.
  • Testing aborted suspends can be done by running this command and then hitting the keyboard right away.

echo $(cat /sys/power/wakeup_count) > /sys/power/wakeup_count && echo mem > /sys/power/state

  • testing by
    • James Cameron, with a local debugging patch, saw one instance of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201, see log tail, and several instances of the hang in dpm_resume().
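
For repeated aborted-suspend checks, the one-line command above can be wrapped in a small function. This is a minimal sketch: the function name `try_suspend` and the `POWER_DIR` override (useful for a dry run against a scratch directory) are assumptions for illustration and are not part of runin.

```shell
#!/bin/sh
# Hypothetical wrapper around the wakeup_count handshake shown above.
# POWER_DIR is an assumed override for dry runs; real systems use /sys/power.
POWER_DIR="${POWER_DIR:-/sys/power}"

try_suspend() {
    # Read the current number of wakeup events.
    count=$(cat "$POWER_DIR/wakeup_count") || return 2
    # Writing the count back only succeeds if no wakeup event arrived since
    # the read. After a successful write, the kernel aborts the following
    # suspend if a wakeup event (such as a key press) is reported.
    if echo "$count" > "$POWER_DIR/wakeup_count"; then
        echo mem > "$POWER_DIR/state"
    else
        echo "suspend not attempted: wakeup event pending" >&2
        return 1
    fi
}
```

On a test unit this would be run as root by calling `try_suspend` and then hitting the keyboard right away, as described above.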

bfc1b92b - arm-3.0-wip-wfi

b7f22e1d - arm-3.0-wip-wfi

10ebd28f - arm-3.0-wip-wfi

ae48be89 - arm-3.0-wip-wfi

  • a tar.gz by James,
  • testing by
    • Richard?
    • James Cameron, 10sec/10sec, with audio disabled, patch,

5ba0b446 - arm-3.0-wip-wfi

  • olpc-ec: allow unknown commands to be executed
  • test dependencies: EC logging (S7), kernel logging with EC debug enabled ("echo 1 > /sys/module/olpc_ec_1_75/parameters/ec_debug"), pm_async disabled.
  • purpose: improve EC driver; fewer (or no?) races/crashes.
  • testing by
    • Samuel Greenfeld
      • C1 SKU 201 EC communications failure (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin. host ec
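
The kernel-side test dependencies listed above (EC debug enabled, pm_async disabled) can be applied in one step before starting a run. This is a minimal sketch: `prepare_ec_test` and the `SYSFS_ROOT` override are hypothetical names, and EC logging on the serial side (S7) still has to be set up separately.

```shell
#!/bin/sh
# SYSFS_ROOT is an assumed override for dry runs; real systems use /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

prepare_ec_test() {
    # Enable EC debug messages in the kernel log (path quoted in the text above).
    echo 1 > "$SYSFS_ROOT/module/olpc_ec_1_75/parameters/ec_debug" || return 1
    # Disable asynchronous suspend/resume ("0 to pm_async").
    echo 0 > "$SYSFS_ROOT/power/pm_async" || return 1
}
```

Both writes need root, and the settings do not persist across a reboot, so the function would have to be called again after each restart of a test unit.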

ff199462 - arm-3.0-wip-wfi

  • olpc-ec: ensure gpio cmd is left low if something screws up
  • EC logged in S7
  • Still has FIQ debugger
  • Should help reset EC bus to avoid subsequent failures after the first command fails.
  • testing by
    • Samuel Greenfeld
      • C1 hang on resume (#3), OLS runin disabled, Q4C11 normal EC code host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done ... 72.616ms ... mmp-camera mmp-camera.0: resume) [1]
      • C1 hang on resume (#6), OLS runin disabled, Q4C11 normal EC code
      • C1 hang on resume (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin host ec

3d4cf36c - arm-3.0-wip-wfi

  • mmp2_fiq_debugger.c add file missing from other merge
  • EC logged in S7
  • This enables the FIQ debugger. Reproduce a kernel hang, then send a Break from a serial console, which should drop you to a debug prompt. Run bt to see what is going on.
  • testing by
    • James Cameron, 10sec/10sec,
      • C1 SKU201 hang at mmp-camera mmp-camera.0: resume, host,
      • C1 SKU202 hang at mmp-camera mmp-camera.0: resume, host,
      • C1 SKU202 hang at mmp-camera mmp-camera.0: resume, host,
      • no response to BREAK,
    • James Cameron, 0sec/3sec, FIQ was verified as working before starting,
      • C1 SKU201 host (known audio problem that doesn't need to be reported again),
      • C1 SKU202 host (known audio problem),
      • B4, and two B1 hung, no response to BREAK.

26f404e3 - arm-3.0-wip-wfi

  • Revert "olpc-ec: don't process/ack packets when there's an underrun error"
  • EC logged in S7
  • further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf. Test that audio pop during suspend/resume is gone from jnettlet.
  • testing by
    • Samuel Greenfeld
      • 3xC1 running os23 with olpc-runin-tests-0.16.7-1 installed instead of bringup build, Q4C11, all tests (2 with EC serial), 3xC1 running with battery test disabled. Test in progress.
        • 1xC1 SKU201 (#6) running aggressive (10s on/10s off) suspend failed after 15 minutes during resume, power & full battery LEDs on, all other LEDs off. All tests except the battery test were running on this unit at the time of failure. host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done)
        • 1xC1 SKU201 (#3) running aggressive (10s on/10s off) suspend failed at a point TBD with EC communications failure and the eMMC root filesystem remounted read-only. It continued suspend & resume testing after the failure point(s). host ec
    • James Cameron
      • C1 C1 B4 B1 B1 B1, os20, runin 0.16.7, 10sec/10sec, hangs still occur (one, two); audio pops during suspend and resume are still present.

7dad6c10 - arm-3.0-wip-wfi (BROKEN)

afa391a5 - arm-3.0-wip-wfi

9177e6a8 - arm-3.0-wip-wfi

  • olpc-1.75: back off the hardware clock gating for MMC devices
  • purpose: kill off the SET_BLOCK_COUNT errors seen by Quozl in the prior tests
  • result of test: no change (the patch did not affect the outcome).
  • testing by
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 passed, ec host,
      • C1 SKU202 hung, after 11 minutes, SET_BLOCK_COUNT eMMC failure at kernel timestamp 1159.762854, host
      • B1 SKU199 hung, after 2.5 hours, SET_BLOCK_COUNT eMMC failure at kernel timestamp 7409.481875, host
      • B1 SKU199 hung,
      • B4 SKU199 hung, (connecting serial port afterwards did not show streaming eMMC messages),
      • B1 SKU198 hung.
    • James Cameron, Q4C11, os20, single, dortc,
      • C1 SKU202 manually stopped early, had been needing keyboard wakeup,
      • B1 SKU199 manually stopped early,
      • B1 SKU199 manually stopped early, had been needing keyboard wakeup,
      • B4 SKU199 manually stopped early.
  • purpose: test theory that battery state of charge changes are associated with hangs
    • Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test disabled
      • 2xC1 SKU201 hung on resume after 10.5 hours host#3/ec#8 & host #6
      • 1xC1 SKU202 hung after 11.75 hours due to disabling pm_async or keyboard events during runin host#4, possibly while resetting system #3 in front of it.
      • All systems then reset with pm_async disabled, 4 C1 SKU201 & 3 C1 SKU 202 total.
      • 1xC1 SKU202 hung citing phantom keyboard events followed by an illegal instruction, pm_async disabled [2]
      • 1xC1 SKU201 hung with MMC problems, pm_async disabled [3]
    • Samuel Greenfeld, os23, 3 B1 SKU198 4 B1 SKU199, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test enabled, runin-camera, runin-wlan disabled, pm_async enabled
      • 1xB1 SKU 199 failure to properly handle EC interrupt on resume [4]
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin, without runin-battery, without battery inserted.
      • C1 SKU201 hung, after 67 minutes, host, C1 SKU202 hung, after 15 minutes, host, B1 SKU199 x 2, B4 SKU199 hung, after 90 minutes, host tail, B1 SKU198.
  • purpose: disable asynchronous S/R
  • purpose: no-suspend-contention runin branch
    • five units run to 100 suspend cycles, 2.5 hours each, no issues.

58360582 - arm-3.0-wip-wfi

  • Revert "olpc-ec-1-75: clean up cmd state locking and other things"
  • purpose: test the old EC driver across runin and s/r.
  • result of test: no change (the patch did not affect the outcome).
  • additional tests: against normal runin, and 10sec/10sec S/R runin.
  • testing by
    • Richard Smith, os21, C1 SKU201, three runs, against two-stage runin, pass.
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 stopped, SOC display loss at EC timestamp 5380673 kernel timestamp 745.92 (same symptom as seen in #4239902), ec host,
      • C1 SKU202 stopped, eMMC failure, host
    • James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
      • C1 SKU201 hung, ec host,
      • C1 SKU202 hung, within 30 minutes, host
      • B4 SKU199 hung, after two hours,
      • B1 SKU199 was going fine for 2:15, but manually stopped, host
      • B1 SKU199 hung, after one hour,
      • B1 SKU198 hung.
    • Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build
      • C1 SKU201 hung on resume after 21 cycles (10s on/10s suspend), host#3 ec#8
      • C1 SKU201 hung on resume overnight (10s on/10s suspend), host#6
      • C1 SKU202 EC communications failure overnight host#1 ec#7
      • Three remaining C1 systems running default runin suspend cycle timings did not hang after 18 hours.

2d8e7cc - arm-3.0-wip-wfi

  • sdhci: ignore interrupts received after suspend
  • zImage-2d8e7cc-wip-wfi
  • os: os20, ofw: q4c08 or q4c09, runin in build, runin-fscheck disabled, runin-battery disabled, runin-sus set to 10sec/10sec
  • purpose, test looking for non-ec related hangs.
  • result of test: no change (the patch did not affect the outcome).
  • testing by
    • Samuel Greenfeld
      • Passed, os21 not os20, B1 test bed, seven units.
    • James Cameron,
      • C1 SKU201 host ec, hung at 31 cycles 23:46:01 remaining,
      • C1 SKU202 host, hung at 36 cycles 23:43:41 remaining,
      • B1 SKU199 host, hung at 1737 cycles 11:25:02 remaining,
      • B4 SKU199, hung at 398 cycles 12:03:29 remaining,

6f125d7 - arm-3.0-wip

  • sdhci: ignore interrupts received after suspend (same commit comment but different git branch than above)
  • zImage-6f125d7-wip
  • os: os20, ofw: q4c08, runin in build, runin-sus disabled.
  • purpose: verification of a build for manufacturing testing.
  • result of test: success, the kernel is stable for runin testing in manufacturing if used without suspend and resume.
  • testing by
    • Samuel Greenfeld
      • Four C1 SKU 201, Three C1 SKU202 passed 24 hr testing
    • James Cameron,
      • C1 SKU201 host ec, passed,
      • C1 SKU202 host, passed.
  • was added to build os23.
  • os23 testing by
    • James Cameron, C1 SKU201, C1 SKU202, B4, B1, B1, B1, all passed 24 hr testing.

4239902

  • os20 q4c09 runin 0.16.7 10sec 10sec 24hrs
  • testing by
    • James Cameron, fail, one unit hung, all units lost EC communications (blank SOC display, hang after runin pass),