B4 Suspend ECR

From OLPC
Revision as of 09:20, 3 October 2007 by Wad (talk | contribs)
  This page is monitored by the OLPC team.

This page documents a hardware modification to repair a design flaw in the B3/B4/C1 laptop motherboards. It is primarily a fix for crashing during suspend/resume (Trac ticket #1835), but includes changes which should improve the reliability of a laptop under all conditions, especially when running from low battery.

This modification started as a search for the cause of an intermittent USB problem which occurred when reconnecting to the wireless networking interface after a resume. The USB controller was behaving in a way that indicated a hardware problem. A number of developers from Cozybit, AMD, Marvell, RedHat and OLPC worked on the problem for over a month, identifying aggravating conditions and finally reproducing it reliably (Trac ticket #1752) by flood pinging (sending simple "are you there" packets as fast as the network allows) a machine that was repeatedly suspending and resuming.

While testing for this problem, Chris Ball at OLPC was frustrated by another problem (Trac ticket #1835) which appeared while he was trying to reproduce #1752: a crash of the machine earlier in the resume process. This problem was more unit dependent: it had never been noticed by other developers working on the problem. Mitch Bradley developed a test in Open Firmware (wackup) which allowed fast and easy reproduction of this earlier crash, using the wireless networking interface to resume the machine a second after it entered suspend.

Using this test, OLPC and Quanta began widespread testing of laptops. While some laptops ran for over a hundred thousand suspend/resume cycles before crashing, all eventually crashed. Two different crashes were occurring: one where the processor never succeeded in turning off the Microphone LED (the first action on boot, as the microphone power enabling the LED defaults on at power-up), and one where it crashed soon afterwards. The former was labeled 1835a and the latter 1835b.

Problem 1835a turned out to be a clocking issue: the system clock to the PCI interface on the Southbridge didn't have enough time to stabilize before the on-chip power-good signal was asserted. This problem mainly afflicted systems using a particular clock generator (used in half of the B3s, but not in B4/C build machines), but was present in all laptops.

In searching for the cause of 1835b, it was noticed that the problems got worse when running from a low battery. This led to the discovery that the switch on the +3.3V line used in suspend/resume was not being properly turned on when the battery was low. This is corrected by changing to a slightly better transistor. At the same time, the target voltage of the WLAN_3.3V supply was increased by 60 mV to ensure sufficient voltage after the switch.

Problem 1835b turned out to be a number of problems, all afflicting the power supplies of the laptop under varying conditions. We had three optional power supplies/switches which were rarely (or never) exercised, but which caused transient dips in critical supplies. These were:

  • SD card power - caused a dip in the +3.3V rail
  • Camera power - caused a dip in the +3.3V rail
  • DCON power - caused a dip in the main memory supply

But we still saw a variety of crashes which had been combined into 1835b. The most worrisome were those where the laptop rebooted or powered off, or showed symptoms of memory corruption. These were due to insufficient high-frequency bypassing on different power rails around the laptop.



This page is still in flux. Unless you are a developer working with OLPC, please wait for a final fix!

Production Change

In production, the CPU supervisor used in the ECO of existing machines will be replaced with a cheaper circuit based on an RC circuit and a buffer with hysteresis. This provides a relatively precise delay of the reset signal to the Southbridge, at a reduced cost. If the voltage does temporarily drop below the design specifications, there is a good chance that the system will continue to operate normally.

This circuit is included here for reference only, as it is much more complicated to implement.


Final ECO

The final ECO includes ECO #42 and replaces Q1 with an AO3420 (or another N-channel MOSFET with an Rds of 25 mOhm or less at a Vgs of 5V). Since the AO3420 is not available in the retail market, I am looking for a suitable alternative.

ECO #43

This consists of six parts:

  • Tweaking the compensation of the WLAN_3.3V power supply by changing R20 to a 33K resistor (SMD0402, 5%).
  • Increasing the voltage of the WLAN_3.3V power supply. There are two common methods:
    • Increasing R22 to 32.37K by adding a 768 ohm resistor (SMD0402, 1%) in series to the existing 31.6K resistor
    • Decreasing R21 by adding a 330Kohm (SMD0402, 1%) resistor in parallel with the existing 10K resistor
  • Installing a 3.10V (3.05 to 3.15V) CPU supervisor, with a delay of at least 16 ms, driving the PWG signal on the motherboard (this is version A; version B uses the production ECO circuit).
  • Modifying the DCON power switch for soft turn-on, by adding a 4.7 uF, 10V capacitor across R141.
  • Modifying the SD Card power switch for soft turn-on, by adding a 0.1 uF, 6.3V capacitor between Q49, pin 2 and ground.
  • Modifying the enable circuitry for the Camera Power switch for staggered turn-on. This consists of cutting the trace underneath U48, between pins 3 and 4. A 4.7K resistor is then added between pin 3 and pin 4 (conveniently connected to a test pad near pin 3) and a 0.1 uF capacitor is added between pin 3 and ground (available at U48, pin 2).

The first three of these were described as ECO #42. The remainder are the same on B4 and C1 builds.
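
As a rough illustration of the staggered turn-on, the 4.7K resistor and 0.1 uF capacitor added at U48 can be treated as a simple first-order RC delay on the enable line. The sketch below computes the time constant from those two values. The assumption that pin 4 is the driving side, the ideal step-response model, and the 50% threshold are all illustrative guesses and are not taken from the schematic or from any datasheet.

```python
import math

def time_constant(r_ohms, c_farads):
    """Time constant (seconds) of a series-R, shunt-C network."""
    return r_ohms * c_farads

def time_to_fraction(tau, fraction):
    """Time for an ideal RC step response to reach `fraction` of its final value.

    Uses v(t)/V = 1 - exp(-t/tau), so t = -tau * ln(1 - fraction).
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return -tau * math.log(1.0 - fraction)

# Component values from the camera power switch modification.
CAMERA_R_OHMS = 4.7e3       # 4.7K resistor between U48 pins 3 and 4
CAMERA_C_FARADS = 0.1e-6    # 0.1 uF capacitor from U48 pin 3 to ground

# Hypothetical switching threshold as a fraction of the drive voltage.
# The real enable threshold of U48 is not given on this page.
ASSUMED_THRESHOLD_FRACTION = 0.5

tau = time_constant(CAMERA_R_OHMS, CAMERA_C_FARADS)
delay = time_to_fraction(tau, ASSUMED_THRESHOLD_FRACTION)

print(f"time constant: {tau * 1e6:.0f} us")
print(f"delay to assumed threshold: {delay * 1e6:.0f} us")
```

The same two helper functions could be used for the DCON and SD card soft turn-on capacitors once the resistance seen by each capacitor is known. Those resistor values are not listed here.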

ECO #42

This consists of three parts:

  • Tweaking the compensation of the WLAN_3.3V power supply by changing R20 to a 33K resistor (SMD0402, 5%).
  • Increasing the voltage of the WLAN_3.3V power supply. There are two common methods:
    • Increasing R22 to 32.37K by adding a 768 ohm resistor (SMD0402, 1%) in series to the existing 31.6K resistor
    • Decreasing R21 by adding a 330Kohm (SMD0402, 1%) resistor in parallel with the existing 10K resistor
  • Installing a 3.10V (3.05 to 3.15V) CPU supervisor, with a delay of at least 16 ms, driving the PWG signal on the motherboard

The first two changes are the same for all builds. R20, R21, and R22 are all located right next to U15, the WLAN_3.3V regulator, on the top of the motherboard.
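
The resistor arithmetic behind the two voltage-increase methods can be checked with a short sketch. The series and parallel formulas are standard. The feedback-divider relation used afterwards (output proportional to 1 + R22/R21, with R22 as the upper leg and R21 as the lower leg) is an assumption about the U15 circuit, not something stated on this page, so the relative gain figures are only indicative.

```python
def series(*resistors):
    """Equivalent resistance of resistors connected in series."""
    return sum(resistors)

def parallel(*resistors):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors)

def divider_gain(r_top, r_bottom):
    """Vout/Vref for an ASSUMED standard regulator feedback divider."""
    return 1.0 + r_top / r_bottom

# Existing values from the ECO description (ohms).
R22_OLD = 31_600.0
R21_OLD = 10_000.0

# Method 1: add 768 ohm in series with R22.
r22_new = series(R22_OLD, 768.0)        # 32368 ohm, i.e. about 32.37K

# Method 2: add 330K in parallel with R21.
r21_new = parallel(R21_OLD, 330_000.0)  # a little under 10K

gain_old = divider_gain(R22_OLD, R21_OLD)
gain_method1 = divider_gain(r22_new, R21_OLD)
gain_method2 = divider_gain(R22_OLD, r21_new)

print(f"R22 after method 1: {r22_new:.0f} ohm")
print(f"R21 after method 2: {r21_new:.1f} ohm")
print(f"relative gain change, method 1: {gain_method1 / gain_old - 1:.4f}")
print(f"relative gain change, method 2: {gain_method2 / gain_old - 1:.4f}")
```

Under the assumed topology both methods raise the divider gain, which matches the stated intent of increasing the WLAN_3.3V output.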

C56, on the bottom of the motherboard underneath the Southbridge, must be removed.

The third change differs from B4 to C1 builds.

Older Versions

1835 ECO II

With the original ECO, laptops refused to boot when powered by battery. This eventually led to a more exhaustive study of the laptop power supplies, but initially resulted in a second ECO, using a CPU supervisor with a lower voltage threshold (2.85V - 3.0V), the Micrel MIC811S.

This generally stopped the rebooting problems, but indicated a larger problem: a significant droop on +3.3V! On some laptops, the rebooting continued (Trac ticket #3537).

1835 ECO

The original ECO, which established this as a valid fix for Trac ticket #1835, was the installation of a 3.10V (3.05 to 3.15V) CPU supervisor, with a delay of at least 16 ms between "power good" and deassertion of the Southbridge reset signal (PWG) on the motherboard.

The supervisor used was an MCP130T-315 from Microchip.