XO-1.75/Kernel/Testing: Difference between revisions
Revision as of 21:02, 19 January 2012
Kernel developers, put your new kernel in a new section here. Latest at top. Testers place links to reports indented under the kernel.
See also:
9a3a8436 - arm-3.0-wip-graphics
- link to git hash via http
- Jon Nettleton's graphics RPM repo is required (ABI breakage)
- purpose: test the new graphics driver in runin on a kernel that is hoped to be reasonably stable
- testing by
- Samuel Greenfeld
- 2xSKU 202, 2xSKU 201 passed 21 hours of testing with 10s suspend, ~3600 suspend cycles each
- 1xSKU 201 hung after ~20 hours of testing with 10s suspend
- 5xSKU 200 ran normal runin settings for 21 hours, no issues, ~58 suspend cycles each
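As a rough cross-check of the cycle counts above, the average time per suspend/resume cycle can be derived from the run length and the cycle count. This is a minimal sketch using only the figures quoted in this section; the function name avg_cycle_seconds is made up for illustration and is not part of runin.

```shell
# avg_cycle_seconds HOURS CYCLES
# Average seconds per suspend/resume cycle (integer division, truncated).
# Hypothetical helper for illustration only.
avg_cycle_seconds() {
    hours=$1
    cycles=$2
    echo $(( hours * 3600 / cycles ))
}

avg_cycle_seconds 21 3600   # 10s suspend units: prints 21
avg_cycle_seconds 21 58     # normal runin settings: prints 1303
```

About 21 seconds per cycle on the 10s suspend units, against roughly 22 minutes per cycle with normal runin settings.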
daf43181 - arm-3.0-wip-wfi
- TTY: serial_core: Fix crash if DCD drop during suspend
- make sure no_console_suspend is enabled
- see if this helps or fixes serial hangs
- testing by
- James Cameron, os24 remastered with testconfig kernel, without no_console_suspend, runin 9641e47 (aggressive, 4min watchdog, 0 to pm_async, no camera),
- Samuel Greenfeld
- C1 SKU 202 mmc communication problem (no reboot by runin)
- C1 SKU 202 hang
- C1 SKU 201 hang
- C1 SKU 201 hang
- C2 hang followed by power off; no serial connection
fc631432 - arm-3.0-wip-wfi
- testconfig: reenable no_console_suspend
- EC logged in S7 or S47.
- Disable runin camera test (disabled mmp_camera driver)
- purpose, gather more information about why we hang, since removing no_console_suspend led to hanging significantly less often.
- testing by
- Samuel Greenfeld
- Seems to hang on all test systems within 2 hours with 10s suspend cycles (multiple failure modes, including #11528), possibly aggravated by sending serial data to the XO during the suspend/resume process. But failures have also been seen after rebooting systems and ignoring them.
cfb3aa76 - arm-3.0-wip-wfi
- cfg80211: amend regulatory NULL dereference fix
- EC logged in S7 or S47. Do not use no_console_suspend.
- purpose: fixes a wireless oops; still need to verify TxFIFO sanity check patch
- testing by
- Samuel Greenfeld, 5 C1s, 1 B1, 10s aggressive suspend cycles
- SKU202 #0 failed with 0:8:26 (hours:minutes:seconds) remaining, after 3139 cycles; power LED on, full charge seen
- SKU199 #6 failed with 0:53:10 (hours:minutes:seconds) remaining, after 2990 cycles; power LED on, battery LED off (discharging)
- SKU201 #4 failed second run with 22:51 (hours:mins) remaining, after 441 suspend cycles
- SKU202 #0 failed second run with 21:31 (hours:mins) remaining, after 322 cycles
- SKU202 #1 failed second run with 18:36 (hours:mins) remaining, after 689 cycles
- SKU199 #6 failed second run with 13:29 (hours:mins) remaining, after 1354 cycles
- SKU201 #4 failed third run with 20:08 (hours:mins) remaining, after 510 cycles
- SKU202 #0 failed third run with 18:42 (hours:mins) remaining, after 1694 cycles
- Logs available here
be581eb4 - arm-3.0-wip-wfi
- testconfig: turn off no_console_suspend
- EC logged in S7 or S47, aggressive suspending in runin
- do not use no_console_suspend, use same configuration as caused hangs previously,
- purpose: Verify the 'olpc-ec: do a full reset if the TxFIFO sanity check fails' patch.
- testing by
f1f8f7fe - arm-3.0-wip, EC 4.01
- git hash,
- rpm,
- build,
- do not use no_console_suspend, include light sensor test
- make sure to manually upgrade the EC firmware to cl2-4_0_4_01
- purpose, test if actually getting the EC to be quiet during suspend improves things
- testing by
- John Watlington:
- SKU201 (3) 24hr cycles, 600+ cycles ongoing;
- SKU202 (3) 24hr cycles, 600+ cycles ongoing
- (12x) SKU198/199/just plain weird, ongoing
- (10x) SKU200, ongoing;
- (4x) SKU201, ongoing;
- (2x) SKU202, ongoing;
- (2x) SKU203, ongoing;
- (2x) SKU204, ongoing;
f1f8f7fe - arm-3.0-wip
- git hash,
- rpm,
- build,
- do not use no_console_suspend, use same configuration as caused hangs previously,
- purpose, test if debug console messages may cause hangs,
- testing by
- Jon Nettleton, C1, 13:20 hours / 1966 cycles ongoing, os20,
- James Cameron,
- C1 SKU201, C1 SKU202, B1, all passed,
- C1 SKU201, C1 SKU202, B1, B1, B4, with os24, all passed.
- John Watlington, with light-sensor test:
- C1 SKU201 (2) 24hr loop + 2000 cycles stopped;
- C1 SKU202 (2) 24hr loop + 2100 cycles stopped
- (10x) SKU198/199/just plain weird, (2) 24hr loop, stopped
- (8x) SKU200, (2) 24 hr loop, stopped;
- (3x) SKU201, (2) 24 hr loop, stopped;
- SKU202, (2) 24hr loop, stopped;
- SKU198, hang in first 24 hrs. All hangs showed either a blank screen with the power light on, or the unit turned off.
- SKU199, hang in first 24 hrs
- (2x) SKU200, hang in first 24 hrs
- SKU201, hang in first 24 hrs
- SKU202, hang in first 24 hrs
- Samuel Greenfeld, Q4C12, patched 0.17.0 runin to support kernel RPM version
- C1 SKU202 failure after 1805 cycles (cause unclear but EC locked up) #11571
- C1 SKU202 failure after 3455 cycles (EC transmission queue full followed by null pointer dereference) #11573
- B1 SKU199 failure after 2062 cycles (cause unknown; no serial logs available) #11572
- 3xC1 SKU201, 2xB1 SKU 198, 2xB1 SKU 199 passed 24 hours of testing.
- All 10 systems mentioned above re-imaged to os24 with normal 24-hour runin settings; passed.
- Chia-Hsiu, 6 x C2, 2 x C1, using os24, ongoing
- one 24 run completed successfully
- C2 SKU200 x7 @ Miami office using os24 - ongoing
- 20 hrs with >50 s/r cycles
1220526f - arm-3.0-wip-wfi
- debug: add debugging to workqueues, irqs, etc
- test dependencies: save your System.map from your build
- purpose: attempt to track down hangs to specific driver callbacks
- testing by
- James Cameron, three units, testconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog for first run only, 0 to pm_async,
- C1 SKU201, hang 1, log, hang 2, log (http://dev.laptop.org/~quozl/z/1RkTrN.txt), hang 3, a repeating invalid page fault with storage LED blinking, FIQ entry but insane, log, fiq, hang 4, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 5, in resume, log, stopped.
- C2 SKU202, hang 1, with spontaneous kdb entry, and a backtrace, log, hang 2, log, hang 3, not with the workqueue debugging active at the time, log, restarted with CONFIG_VIDEO_MMP_CAMERA=n and CONFIG_VIDEO_OV7670=n, hang 4, now in EARLY resume, log, stopped.
- B1, hang 1, log, hang 2, log, hang 3, log, stopped.
- B1 (pass), B4 (pass), B1 (pass), with -ENODEV return from serial_pxa_init(), and no calls to mmp2_add_uart(), ongoing, 3483 cycles, 23 hours,
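The saved System.map can be used to turn an address from a logged backtrace into a symbol name. This is a minimal sketch, assuming the usual three-column "address type name" layout sorted by address, and 32-bit addresses as on this ARM kernel. lookup_symbol is a hypothetical helper, not an existing tool.

```shell
# lookup_symbol MAPFILE HEXADDR
# Print the last symbol in MAPFILE whose address is <= HEXADDR
# (HEXADDR given without a 0x prefix). Prints an empty line if the
# address lies below the first symbol.
# Assumption: MAPFILE is sorted by address, one "addr type name" per line.
lookup_symbol() {
    map=$1
    target=$(( 0x$2 ))
    best=
    while read -r addr type name; do
        # Stop at the first symbol that starts above the target address.
        [ $(( 0x$addr )) -le "$target" ] || break
        best=$name
    done < "$map"
    echo "$best"
}
```

Usage would be along the lines of `lookup_symbol System.map c0123456`, where the address is only an example value.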
0a896953 - arm-3.0-wip-wfi
- olpc-ec: consolidate code and ensure TxFIFO is empty before writing to it
- EC logged in S7, runin 0.17.0 or 0.16.7 (not C2 builds) if possible
- Brief testing of recent kernels over the past ~12 hours suggests that their fixes may resolve all known easily reproducible hangs. This needs to be proved or disproved.
- If you have enough systems, try using stock runin 0.17.0-style settings as well as anything known to cause particular types of failures. (EC failures seem to be common within a few hours with 10s on/10s off suspend, only using runin-main/gtk/common/tests/sus/battery.)
- testing by
- James Cameron, six units, defconfig, runin 0.17.0, 10sec/10sec, 60sec watchdog, 0 to pm_async, started around Sat Jan 7 08:24:00 UTC 2012,
- C1 SKU201, 534 tests, 530 pass, 4 fail,
- C1 SKU202, 2777 tests, 2711 pass, 1 WARNING at olpc_ec_1_75_cmd (log), 66 fail,
- 50 hang during resume,
- 16 fail with repeating eMMC SET_BLOCK_COUNT,
- see tail of each fail in one place
- one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
- one B1 hang with repeating eMMC SET_BLOCK_COUNT, see host log,
- Samuel Greenfeld, five C1 units, runin 0.16.7 setup like 0.17 (light sensor test disabled), 0 to pm_async
- one C1 SKU201 hang during resume, normal runin settings, see host log
- one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing, see host log ec. The unit was discovered after it had powered off due to critical battery failure (runin discharge test).
- one C1 SKU201 hang during resume, normal runin settings with 10s suspend timing + cl2-4_0_3_07rsmith-586 EC code (S47 debug), see host log ec log.
- one B1 SKU199 hang during resume, normal runin settings with 10s suspend timing, see host log.
- Jon Nettleton
ebf24ea6 - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=ebf24ea6
- test dependencies: EC logged in S7, use whatever magic possible to get the EC errors
- Another attempt to fix the problem of EC communication failures
- testing by
196c2f806 - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=196c2f80
- testing by
- James Cameron, with a local debugging patch, saw two instances of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201 and C1 SKU202. Two test units running full runin with 0sec/3sec (extreme) suspend and resume showed no hangs. Five test units are currently running full runin with 10sec/10sec (aggressive) suspend and resume.
- C1 SKU202 reported EC related kernel problems, see log, should be fixed in later kernels.
46e079fe - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=46e079fe
- EC logged in S7
- Should fix the hang caused by suspend being aborted due to a wakeup event. May also help with other IRQ hangs as I have fixed a previous mistake I made when moving the audio island code.
- Testing aborted suspends can be done by running this command and then hitting the keyboard right away.
echo $(cat /sys/power/wakeup_count) > /sys/power/wakeup_count && echo mem > /sys/power/state
- testing by
- James Cameron, with a local debugging patch, saw one instance of eMMC "error -110 sending SET_BLOCK_COUNT command", on C1 SKU201, see log tail, and several instances of the hang in dpm_resume().
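The aborted-suspend command above can be wrapped so that it is clear which step refused to proceed. This is a minimal sketch under the same assumptions as the one-liner. try_suspend and its optional directory argument are hypothetical, added only so the sequence can be dry-run against a scratch directory instead of /sys/power.

```shell
# try_suspend [POWER_DIR]
# Read the wakeup count, write it back, and request suspend only if the
# write-back succeeded. POWER_DIR defaults to /sys/power; the argument
# is a hypothetical convenience for dry runs.
try_suspend() {
    power=${1:-/sys/power}
    count=$(cat "$power/wakeup_count") || return 1
    if echo "$count" > "$power/wakeup_count"; then
        # A wakeup event arriving from here on should abort the suspend.
        echo mem > "$power/state"
    else
        echo "suspend not attempted: wakeup count changed" >&2
        return 1
    fi
}
```

On a test unit, run `try_suspend` as root and hit the keyboard right away, as with the original command.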
bfc1b92b - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=bfc1b92b
- EC logged in S7
- Should fix the (known) cause of failing EC commands on s/r. Please test both this and the prior kernel.
- testing by
b7f22e1d - arm-3.0-wip-wfi
- http://dev.laptop.org/git/olpc-kernel/commit/?h=arm-3.0-wip-wfi&id=b7f22e1d
- EC logged in S7
- Should fix the problem where, upon a single EC error, all further EC commands fail. Please test!
- testing by
10ebd28f - arm-3.0-wip-wfi
- olpc-ec: log a warning if we manage to process an EC command deep inside of suspend,
- a tar.gz by James,
- testing by
- Richard?
ae48be89 - arm-3.0-wip-wfi
- a tar.gz by James,
- testing by
5ba0b446 - arm-3.0-wip-wfi
- olpc-ec: allow unknown commands to be executed
- test dependencies: EC logging (S7), kernel logging with EC debug enabled ("echo 1 > /sys/module/olpc_ec_1_75/parameters/ec_debug"), pm_async disabled.
- purpose: improve EC driver; fewer (or no?) races/crashes.
- testing by
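The kernel-side test dependencies listed for this kernel (EC debug logging on, pm_async off) can be applied in one step before starting runin. This is a minimal sketch. prepare_ec_test and its optional root prefix are hypothetical and exist only so the function can be tried against a scratch tree; EC serial logging (S7) still has to be set up separately.

```shell
# prepare_ec_test [ROOT]
# Enable olpc-ec debug output and disable asynchronous suspend/resume.
# ROOT is an optional path prefix (hypothetical, for dry runs);
# leave it empty on a real unit and run as root.
prepare_ec_test() {
    root=${1:-}
    echo 1 > "$root/sys/module/olpc_ec_1_75/parameters/ec_debug" &&
    echo 0 > "$root/sys/power/pm_async"
}
```

The function returns non-zero if either file cannot be written, for example when the olpc_ec_1_75 parameter is not present on the running kernel.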
ff199462 - arm-3.0-wip-wfi
- olpc-ec: ensure gpio cmd is left low if something screws up
- EC logged in S7
- Still has FIQ debugger
- Should help reset EC bus to avoid subsequent failures after the first command fails.
- testing by
- Samuel Greenfeld
- C1 hang on resume (#3), OLS runin disabled, Q4C11 normal EC code host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done ... 72.616ms ... mmp-camera mmp-camera.0: resume) [1]
- C1 hang on resume (#6), OLS runin disabled, Q4C11 normal EC code
- C1 hang on resume (#3), OLS runin disabled, Q4C11 modified EC code ecimage-0.3.07pgf-668.bin host ec
3d4cf36c - arm-3.0-wip-wfi
- mmp2_fiq_debugger.c add file missing from other merge
- EC logged in S7
- This enables the FIQ debugger. Reproduce kernel hangs, then from a serial console send a Break, which should drop you to a debug prompt. Run bt to see what is going on.
- testing by
- James Cameron, 10sec/10sec,
- James Cameron, 0sec/3sec, FIQ was verified as working before starting,
26f404e3 - arm-3.0-wip-wfi
- Revert "olpc-ec: don't process/ack packets when there's an underrun error"
- EC logged in S7
- further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf. Test that audio pop during suspend/resume is gone from jnettlet.
- testing by
- Samuel Greenfeld
- 3xC1 running os23 with olpc-runin-tests-0.16.7-1 installed instead of bringup build, Q4C11, all tests (2 with EC serial), 3xC1 running with battery test disabled. Test in progress.
- 1xC1 SKU201 (#6) running aggressive (10s on/10s off) suspend failed after 15 minutes during resume, power & full battery LEDs on, all other LEDs off. All tests except the battery test were running on this unit at the time of failure. host (mmp2_pm_finish: Enable audio island ... mmp2_pm_finish: Done)
- 1xC1 SKU201 (#3) running aggressive (10s on/10s off) suspend failed at a point TBD with EC communications failure and the eMMC root filesystem remounted read-only. It continued suspend & resume testing after the failure point(s). host ec
- James Cameron
7dad6c10 - arm-3.0-wip-wfi (BROKEN)
- XO-1.75: always send the suspend hint command to the EC, even on C-series
- EC logged in S7
- further code to work around EC communications race; discussion with pgf brought up another issue, and include patch from pgf.
- testing by
- Samuel Greenfeld
- 1xC1 SKU202 tested; logs "olpc-ec-1.75: SSP reports TX underrun" and "SSP reports RX overrun" every three seconds shortly after the kernel starts initializing. This causes the XO to never fully boot, making this kernel unusable. log with EC at S7 log with EC at S47
afa391a5 - arm-3.0-wip-wfi
- Revert "olpc-1.75: back off the hardware clock gating for MMC devices"
- EC logged in S7
- test code to work around an EC communications race; needs further work, but this should hopefully keep things from hanging.
- testing by
9177e6a8 - arm-3.0-wip-wfi
- olpc-1.75: back off the hardware clock gating for MMC devices
- purpose: kill off the SET_BLOCK_COUNT errors seen by Quozl in the prior tests
- result of test: no change (the patch did not affect the outcome).
- testing by
- James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
- C1 SKU201 passed, ec host,
- C1 SKU202 hung, after 11 minutes, SET_BLOCK_COUNT eMMC failure at kernel timestamp 1159.762854, host
- B1 SKU199 hung, after 2.5 hours, SET_BLOCK_COUNT eMMC failure at kernel timestamp 7409.481875, host
- B1 SKU199 hung,
- B4 SKU199 hung, (connecting serial port afterwards did not show streaming eMMC messages),
- B1 SKU198 hung.
- James Cameron, Q4C11, os20, single, dortc,
- C1 SKU202 manually stopped early, had been needing keyboard wakeup,
- B1 SKU199 manually stopped early,
- B1 SKU199 manually stopped early, had been needing keyboard wakeup,
- B4 SKU199 manually stopped early.
- purpose: test theory that battery state of charge changes are associated with hangs
- Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test disabled
- 2xC1 SKU201 hung on resume after 10.5 hours host#3/ec#8 & host #6
- 1xC1 SKU202 hung after 11.75 hours due to disabling pm_async or keyboard events during runin host#4, possibly while resetting system #3 in front of it.
- All systems then reset with pm_async disabled, 4 C1 SKU201 & 3 C1 SKU 202 total.
- 1xC1 SKU202 hung citing phantom keyboard events followed by an illegal instruction, pm_async disabled [2]
- 1xC1 SKU201 hung with MMC problems, pm_async disabled [3]
- Samuel Greenfeld, os23, 3 B1 SKU198 4 B1 SKU199, olpc-runin-tests-0.16.7-1 installed instead of bringup build, runin-battery test enabled, runin-camera, runin-wlan disabled, pm_async enabled
- 1xB1 SKU 199 failure to properly handle EC interrupt on resume [4]
- James Cameron, Q4C11, os20, 10sec/10sec S/R runin, without runin-battery, without battery inserted.
- purpose: disable asynchronous S/R
- purpose: no-suspend-contention runin branch
- five units run to 100 suspend cycles, 2.5 hours each, no issues.
58360582 - arm-3.0-wip-wfi
- Revert "olpc-ec-1-75: clean up cmd state locking and other things"
- purpose: test the old EC driver across runin and s/r.
- result of test: no change (the patch did not affect the outcome).
- additional tests: against normal runin, and 10sec/10sec S/R runin.
- testing by
- Richard Smith, os21, C1 SKU201, three runs, against two-stage runin, pass.
- James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
- James Cameron, Q4C11, os20, 10sec/10sec S/R runin,
- Samuel Greenfeld, os23, 4 C1 SKU201 2 C1 SKU202, olpc-runin-tests-0.16.7-1 installed instead of bringup build
- C1 SKU201 hung on resume after 21 cycles (10s on/10s suspend), host#3 ec#8
- C1 SKU201 hung on resume overnight (10s on/10s suspend), host#6
- C1 SKU202 EC communications failure overnight host#1 ec#7
- Three remaining C1 systems running default runin suspend cycle timings did not hang after 18 hours.
2d8e7cc - arm-3.0-wip-wfi
- sdhci: ignore interrupts received after suspend
- zImage-2d8e7cc-wip-wfi
- os: os20, ofw: q4c08 or q4c09, runin in build, runin-fscheck disabled, runin-battery disabled, runin-sus set to 10sec/10sec
- purpose, test looking for non-ec related hangs.
- result of test: no change (the patch did not affect the outcome).
- testing by
6f125d7 - arm-3.0-wip
- sdhci: ignore interrupts received after suspend (same commit comment but different git branch than above)
- zImage-6f125d7-wip
- os: os20, ofw: q4c08, runin in build, runin-sus disabled.
- purpose: verification of a build for manufacturing testing.
- result of test: success, the kernel is stable for runin testing in manufacturing if used without suspend and resume.
- testing by
- was added to build os23.
- os23 testing by
- James Cameron, C1 SKU201, C1 SKU202, B4, B1, B1, B1, all passed 24 hr testing.
4239902
- os20 q4c09 runin 0.16.7 10sec 10sec 24hrs
- testing by
- James Cameron, fail, one unit hung, all units lost EC communications (blank SOC display, hang after runin pass),
- C1 SKU201 pass, host log at point of SOC loss, ec log at point of SOC loss
- C1 SKU202 hang at 1821 cycles, remaining 10:35:38, no serial cable was attached,
- B4 SKU199 pass,
- B1 SKU199 pass,
- B1 SKU199 pass,
- B1 SKU198 pass,