This page details information about the USB suspend/resume bug:
When we start communicating on ep2 after resume (that is, after the host controller has been shut down and reinitialized), the Transaction Error (XactErr) bit is set in the status field of the corresponding qTD structure. The host fails to transmit any more packets on ep2 after this point. Communication with the device via ep0 is normal, and the USB configuration phase works normally, which seems to indicate that the data path is OK. All transactions are EHCI.
The kernel detects the error through the status field of the qTD token for the failed command. The EHCI spec describes the bit as follows:
Page 54 of EHCI spec 1.0
Table 3-16. qTD Token (DWord 2) (cont.)
Bits 7:0, Status field:
Bit 3: Transaction Error (XactErr). Set to a one by the Host Controller during status update in the case where the host did not receive a valid response from the device (Timeout, CRC, Bad PID, etc.). Refer to Section 4.15.1.1 for summary of the conditions that affect this bit. If the host controller sets this bit to a one, then it remains a one for the duration of the transfer.
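When reading a dumped qTD token, it can help to decode the whole DWord rather than the XactErr bit alone. The following is a minimal sketch, assuming the field layout of the EHCI 1.0 qTD Token (status in bits 7:0, PID code in bits 9:8, CERR in bits 11:10, total bytes in bits 30:16). The function name and the sample token value are hypothetical and are not taken from the kernel or from the traces on this page.

```python
# Status byte (bits 7:0) of the qTD Token, per EHCI 1.0 Table 3-16.
STATUS_BITS = {
    7: "Active",
    6: "Halted",
    5: "Data Buffer Error",
    4: "Babble Detected",
    3: "Transaction Error (XactErr)",
    2: "Missed Micro-Frame",
    1: "Split Transaction State (SplitXstate)",
    0: "Ping State (P) / ERR",
}

PID_CODES = {0: "OUT", 1: "IN", 2: "SETUP", 3: "reserved"}


def decode_qtd_token(token):
    """Split a 32-bit qTD token into the fields relevant to this bug."""
    status = token & 0xFF
    flags = [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if status & (1 << bit)
    ]
    return {
        "status_flags": flags,
        "xact_err": bool(status & (1 << 3)),
        "halted": bool(status & (1 << 6)),
        "pid": PID_CODES[(token >> 8) & 0x3],
        "cerr": (token >> 10) & 0x3,            # remaining error count
        "total_bytes": (token >> 16) & 0x7FFF,  # bytes still to transfer
    }


if __name__ == "__main__":
    # Hypothetical token: 512 bytes left, IN transfer, CERR at 0,
    # Halted and XactErr both set.
    sample = 0x02000148
    for key, value in decode_qtd_token(sample).items():
        print(f"{key}: {value}")
```

A token with both Halted and XactErr set and CERR at zero would be consistent with the controller having exhausted its retries on that qTD.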
To reproduce, you need two machines: one OLPC XO laptop and another host.
- Associate both with a wireless AP.
- On the host, run: ping -f xo-laptop
- On the XO laptop, run: echo mem > /sys/power/state
Note that the ping has to start before the suspend attempt. If the flood starts after the system has suspended (after olpc_do_sleep is printed on the console), the issue doesn't occur.
- OS: Build 544
- Firmware: build q2c22
- New traces (the USB line appears clean on them): http://dev.laptop.org/~jcrouse/OLPC_Suspend_Resume_Cycle_Issue_01.zip
Note: The old traces were incorrect. Please don't use them.
- Ellisys traces: http://dev.laptop.org/~rsmith/crash.zip (shows 5 successful suspend/resumes and 1 failure)
- The error condition happens on the way down (suspend), not on the way up (resume). This is suggested, though not proven, by the fact that flood pings do not cause the error condition once the machine has suspended.
- It would be good to have a more detailed look at what XactErr means on the EHCI controller in the Geode. Exactly which conditions, in more detail than the EHCI spec gives, can cause it? The spec lists a CRC error as one cause, but no CRC error is seen on the analyzer.
- Given an XactErr on the bus, what is the proper way to recover from it as quickly as possible? We cannot afford to reset the wireless device and reload its firmware.
- When CERR is cleared in the kernel structure, we have infinite retries
- After a fake suspend (echo test > /sys/power/olpc-pm), communication with the wireless device is lost (cause unknown).
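The infinite retries seen with CERR cleared match how the EHCI spec defines the field: CERR is a 2-bit down counter, and a value of zero programmed at setup means the controller does not count errors and places no limit on retries. The toy model below sketches that behaviour for transaction errors only. It is an illustration under simplifying assumptions (one transaction per qTD, no other error types), not the Geode controller's logic or kernel code, and run_qtd and its parameters are hypothetical names.

```python
def run_qtd(cerr, transaction_ok, max_attempts=1000):
    """Toy model of EHCI retry handling for a single-transaction qTD.

    cerr           -- initial CERR value (0..3) programmed by the driver
    transaction_ok -- callable(attempt_number) -> True if the attempt succeeds
    max_attempts   -- cap so the model itself terminates

    Returns (outcome, attempts), where outcome is "complete", "halted",
    or "still retrying".
    """
    if not 0 <= cerr <= 3:
        raise ValueError("CERR is a 2-bit field")
    count_errors = cerr != 0  # CERR == 0 at setup: errors are not counted
    for attempt in range(1, max_attempts + 1):
        if transaction_ok(attempt):
            return ("complete", attempt)
        if count_errors:
            cerr -= 1
            if cerr == 0:
                # Counter went from one to zero: qTD is halted with XactErr.
                return ("halted", attempt)
    return ("still retrying", max_attempts)


if __name__ == "__main__":
    always_fails = lambda attempt: False
    print(run_qtd(3, always_fails))      # halts after three failures
    print(run_qtd(0, always_fails, 50))  # never halts; hits the model's cap
```

In this model, a non-zero CERR bounds the damage of a persistent error by halting the queue, while CERR at zero keeps the qTD active indefinitely, which matches the observation above.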