User:JordanCrouse/USBBug


This page collects information about the USB suspend/resume bug.

Summary

When we start communicating on ep2 after resume (that is, after the host controller has been shut down and reinitialized), the Transaction Error (XactErr) bit gets set in the status field of the corresponding qTD structure. The host fails to transmit any more packets on ep2 after this point. Communication with the device via ep0 is normal and the configuration phase of USB works, which seems to indicate that the datapath is OK. All transactions are EHCI.

The kernel detects the error through the status field of the current qTD token structure for the failed command. The EHCI spec describes the bit as follows:


EHCI spec 1.0, page 54, Table 3-16. qTD Token (DWord 2) (cont.), bits 7:0 (Status field):

Bit 3: Transaction Error (XactErr). Set to a one by the Host Controller during status update in the case where the host did not receive a valid response from the device (Timeout, CRC, Bad PID, etc.). Refer to Section 4.15.1.1 for summary of the conditions that affect this bit. If the host controller sets this bit to a one, then it remains a one for the duration of the transfer.
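To make the bit position concrete, here is a minimal sketch that decodes the status byte of a qTD token value, for example one copied out of a kernel debug dump. Only bit 3 is quoted on this page; the other bit names are our reading of the same table in the EHCI spec and should be checked against it. The names QTD_STATUS_BITS, decode_qtd_status and has_xact_err are made up for this example and are not kernel identifiers.

```python
# Status-field bit names for bits 7:0 of the qTD Token (DWord 2).
# Bit 3 is quoted above; the remaining names are our reading of
# EHCI 1.0 Table 3-16 and are assumptions to verify against the spec.
QTD_STATUS_BITS = {
    7: "Active",
    6: "Halted",
    5: "Data Buffer Error",
    4: "Babble Detected",
    3: "Transaction Error (XactErr)",
    2: "Missed Micro-Frame",
    1: "Split Transaction State (SplitXstate)",
    0: "Ping State (P) / ERR",
}

XACT_ERR_BIT = 3


def decode_qtd_status(token):
    """Return the names of the status bits set in a 32-bit qTD token."""
    status = token & 0xFF  # the status field is the low byte
    return [
        name
        for bit, name in sorted(QTD_STATUS_BITS.items(), reverse=True)
        if status & (1 << bit)
    ]


def has_xact_err(token):
    """True if the Transaction Error (XactErr) bit is set."""
    return bool(token & (1 << XACT_ERR_BIT))
```

With a hypothetical token whose low byte is 0x48, decode_qtd_status reports Halted and Transaction Error (XactErr), which is the kind of state we would expect to see on ep2 after the failure.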

Reproduce

You need two machines: one OLPC and another host.

  • Associate both machines with the same wireless AP.
  • On the host, run: ping -f xo-laptop
  • On the XO laptop, run: echo mem > /sys/power/state

Note that the flood ping has to start before the suspend attempt. If the flood starts after the system has suspended (after olpc_do_sleep is printed on the console), the issue doesn't occur.
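Because the ordering matters, it can help to delay the suspend on the XO so there is time to start the flood ping on the host first. The following is only a sketch under that assumption: suspend_after and its parameters are hypothetical names, and the write is the same one that echo mem > /sys/power/state performs.

```python
import time


def suspend_after(delay_s, state_path="/sys/power/state"):
    """Wait delay_s seconds, then request suspend-to-RAM.

    The delay is there so that `ping -f xo-laptop` can be started on
    the host before the XO begins to suspend. state_path is a
    parameter only so the helper can be tried against a scratch file.
    """
    time.sleep(delay_s)
    with open(state_path, "w") as state_file:
        state_file.write("mem\n")  # same effect as: echo mem > /sys/power/state


# On the XO, as root (hypothetical usage):
#     suspend_after(5)
```

The delay value is arbitrary; it only needs to be long enough for the flood ping to be running when the suspend begins.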

Trac Bugs

http://dev.laptop.org/ticket/1752

http://dev.laptop.org/ticket/2621

Tools

Analyzers used:

  • LeCroy
  • Ellisys

Software

  • OS: Build 544
  • Firmware: build q2c22

Observations

Note: The old traces were incorrect. Please don't use them.

Assumptions

  • The error condition happens on the way down, not on the way up (possibly proven by flood pings not causing the error condition after the machine has suspended)

Questions

  • It would be good to have a more detailed look at what XactErr means on the EHCI controller in the Geode. Exactly which conditions (in more detail than the EHCI spec gives) can cause it? The spec lists a CRC error among the causes, but no CRC error shows up on the analyzer.
  • Given an XactErr on the bus, what are the proper ways to recover from it as quickly as possible? We cannot afford to reset the wireless device and reload its firmware.

Related?

  • When CERR is cleared in the kernel structure, we get infinite retries.
  • After a fake suspend (echo test > /sys/power/olpc-pm), communication with the wireless device is lost (cause unknown).
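On the CERR observation: as we read EHCI 1.0 Table 3-16, CERR is the two-bit error counter in bits 11:10 of the same qTD token DWord, and a value of zero programmed by software means the host controller does not count errors, so there is no limit on retries. That would match what we see. The sketch below shows how the field sits in the token; qtd_cerr and with_cerr are made-up helper names, and the bit positions are an assumption to check against the spec.

```python
# CERR (error counter) field of the qTD Token (DWord 2).
# Assumed to occupy bits 11:10, per our reading of EHCI 1.0 Table 3-16.
CERR_SHIFT = 10
CERR_MASK = 0x3


def qtd_cerr(token):
    """Extract the 2-bit CERR value from a 32-bit qTD token."""
    return (token >> CERR_SHIFT) & CERR_MASK


def with_cerr(token, cerr):
    """Return a copy of token with the CERR field set to cerr (0..3)."""
    if not 0 <= cerr <= CERR_MASK:
        raise ValueError("CERR is a 2-bit field")
    cleared = token & ~(CERR_MASK << CERR_SHIFT)
    return cleared | (cerr << CERR_SHIFT)
```

For example, with_cerr(token, 0) models the cleared-CERR case from the first bullet, and with_cerr(token, 3) models a token that allows three consecutive errors before the controller gives up.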