Wireless Issues Apr08

From OLPC
Revision as of 04:04, 23 April 2008 by Leejc (talk | contribs) (Marked up ticket numbers)

This is an attempt to start a detailed taxonomy of the network-related problems we face as of April 2008. It is a network-centric view (please see the "focus" list below), so the level of detail is higher on network issues and lower or absent on other aspects (for example, suspend and resume, datastore, etc).

DISCLAIMER:
Please contribute, so this gets bigger than the perspective of one person.

An issue here may be:

  • an implementation issue (a bug).
  • a design issue (a poor choice).

In terms of details, expect to find the following:

  • Focus: Libertas driver/firmware
  • Slightly blurred: School server and middleware issues
  • Completely blurred: Application issues, UI issues, hardware issues.


Applications and Sugar

Deals with: problems that should be fixed/enhanced in activities (Read, Chat, etc) or in the User Interface (mesh view).

Summary: Some of our collaboration issues are related to application bugs. One good example is #6774. We don't seem to have many of the UI issues on the mesh view as we've had in the past (dialogue boxes for entering keys, etc) but the information provided in the home view and the blinking of some circles may be incorrect/incoherent. We need to check this.

Tickets:

  • <trac>6774</trac> - Read fails to transfer document when using salut. [this seems specific to Read]
  • <trac>5459</trac> - second circle in sugar home view provides false information [apparently a UI issue - check overall UI behaviour]
  • what else?

School server

Deals with: applications that are part of or related to the school server.

Summary: As I gather, there are two major problems here: (1) DHCP fails sometimes and we don't know exactly why (we should try to find the root cause) and (2) ejabberd fails frequently. The latter is clearly a limitation of the software: an attentive reading of the lists and tickets supports the conclusion that the ejabberd server is not stable.

Tickets:

  • <trac>4153</trac> - Connect to linklocal instead of school mesh, DHCP failure. [why?]
  • <trac>5963</trac> - Laptop fails to reliably connect with school mesh portal [this is closed and it is a duplicate of <trac>4153</trac>].
  • <trac>6287</trac> - Associating with one mesh prevents you from successfully with a different one. [bad title, this is another instance of "cannot find the school server"]
  • <trac>5908</trac> - Laptop unable to connect to schoolserver jabber server. [The real question of the reliability of ejabberd should be addressed, instead of suggesting that a reactive protocol is not a good choice (even if this is the case)].

Middleware

Deals with: everything that cannot be fixed in the libertas driver/firmware or in the applications belongs here (NetworkManager, Sugar Presence Service, Telepathy salut, Telepathy gabble, Avahi/MDNS).

Summary: We have two major categories of problems here: (1) a scalability issue, due to the use of MDNS and to the avahi implementation, and (2) bugs in presence/telepathy-salut/gabble that need to be addressed. A decision should be made on whether we are switching to another presence mechanism (like cerebro), because if we are not, we need to keep optimizing and fixing the current one. Take the <trac>6572</trac> optimization, for example: do we need it or not?

Tickets:

  • <trac>5335</trac> - More mdns traffic then expected [anecdotal, could be merged with 5078]
  • <trac>5078</trac> - A more mesh-friendly presence protocol for salut [same issue as above]
  • <trac>6553</trac> - No XOs in the mesh view and avahi seemed crashed [no build information in the ticket]
  • <trac>6572</trac> - Replace key with hash to reduce avahi TXT size [will we invest time in avahi?]
  • <trac>6889</trac> - Using wired connection, gabble does not attempt reconnect to jabber server
  • <trac>6888</trac> - Laptop connects to presence server, but not seen by other laptops
  • <trac>6886</trac> - Laptop stops running Gabble and reverts to Salut
  • <trac>6881</trac> - Laptop unable to connect to schoolserver presence service
  • <trac>6882</trac> - Laptop was running both salut and gabble at same time
  • <trac>6883</trac> - Other laptops aren't displayed in Neighborhood View
  • <trac>6884</trac> - Incorrect number of laptops shown in neighborhood view
  • <trac>6855</trac> - Need to extend the network scan to look for school server as AP [what is requested? - actually this is asking salut to shut up in infra mode, so it seems to be another instance of "shut salut down"]
  • <trac>6750</trac> - Incorrect wireless setting after resume [NM and suspend/resume interaction problem]
  • <trac>6872</trac> - 703 and 702 builds - network manager keeps trying to connect to mesh network even after associated w/ AP [at first, not consistent with my observations]
  • what else?
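
To make the <trac>6572</trac> question more concrete, here is a minimal sketch of why replacing a key with its hash shrinks what each laptop has to advertise. The key below is random stand-in data and SHA-1 is chosen only for illustration; this page does not say which key format or hash function the ticket actually proposes.

```python
import base64
import hashlib
import os


def encoded_key(raw_key: bytes) -> str:
    """Return the key in a printable form, as it might be advertised."""
    return base64.b64encode(raw_key).decode("ascii")


def key_digest(raw_key: bytes) -> str:
    """Return a fixed-length hex digest of the key (SHA-1 is an assumption)."""
    return hashlib.sha1(raw_key).hexdigest()


# Hypothetical 400-byte key: the real key length is not given on this page.
raw_key = os.urandom(400)

full = encoded_key(raw_key)
short = key_digest(raw_key)

print("advertised key length:   ", len(full))
print("advertised digest length:", len(short))
```

The digest has the same length whatever the key size, so the saving grows with the key. Whether that saving is worth the work depends on the decision about the presence mechanism discussed above.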

Hardware

Deals with: Problems that cannot/should not be fixed by software.

Summary: There are some reports of bad wireless interfaces (poor radio sensitivity). Are they frequent? It doesn't seem so. We cannot expect every device to have the same range. The only report I've seen was not validated (see 4068). My feeling is: low priority.

Tickets:

  • <trac>4901</trac> - Ctest cannot see wireless APs [antenna sensitivity reduced, *when compared to other XO*]
  • <trac>4068</trac> - Range of communication between 2 XOs is limited only to 20 meters [I tested these two XOs and they behaved as expected]

Libertas driver/firmware

Deals with: issues that need to be fixed/enhanced in the libertas driver/firmware.

Since this is the focal point of this page, a more detailed itemization follows:

Infra mode issues

Deals with: issues that prevent association or correct operation under infrastructure mode.

Summary: We currently have a known compatibility issue with pre-N routers (<trac>5527</trac>) and an assortment of association problems (support for cloaked access points, failure to associate on channels other than 1, 6, or 11, and others). There seem to be no clear patterns or major issues.

Tickets:

  • <trac>6279</trac> - Cannot see Linksys AP on channel 9
  • <trac>2097</trac> - Can't do DHCP at vmware @ 5CC
  • <trac>5527</trac> - G1G1 users complain that the XO affects their local network
  • <trac>4975</trac> - Association fails
  • <trac>6811</trac> - WLAN doesn't reassociate with known access points
  • <trac>6117</trac> - Can't connect to Access Point if SSID is not broadcast [same as 6537]
  • <trac>6537</trac> - Support for Cloaked Access Points [a duplicate of 6117, but more detailed discussion]

Path discovery issues

Deals with: issues (design and bugs) in the reactive path discovery mechanism.

Summary: Apart from a bug (<trac>6589</trac>), the issue here is the inherent burstiness of a reactive protocol. Optimizations are being studied and include changing the route expiration time (from 10 to 20 s), some other timing tweaks (rreq_delay), and possibly adjustments to the link costs.

Tickets:

  • <trac>6589</trac> - xo stops responding to mesh path requests frames
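
As a rough illustration of why the route expiration time matters, the sketch below counts path discoveries for a single long-lived flow. It assumes a deliberately simplified model in which a route always expires after a fixed lifetime and must then be rediscovered; the function name and the 60-second flow are illustrative only, and the model ignores rreq_delay and link costs.

```python
import math


def discoveries(flow_seconds: float, route_lifetime_seconds: float) -> int:
    """Count path discoveries needed to keep one flow alive.

    Simplified model (an assumption, not the firmware's exact behaviour):
    one discovery at the start, then one more each time the route expires.
    """
    if route_lifetime_seconds <= 0:
        raise ValueError("route lifetime must be positive")
    return math.ceil(flow_seconds / route_lifetime_seconds)


# Compare the current and the proposed expiration times for a 60 s flow.
for lifetime in (10, 20):
    print(f"lifetime {lifetime:>2} s -> {discoveries(60, lifetime)} discoveries")
```

Under this model, doubling the lifetime halves the number of discovery bursts per flow, at the cost of keeping a stale route for longer when the topology changes.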

Improvements to scalability

Deals with: everything that can improve scalability (by freeing airtime or implementing various adaptive behaviours).

Summary: Air time is precious, so control over management frames is very important. Control over probe response retries was introduced in 22.p6, and an Adaptive Contention Window based on the number of neighbours was introduced in 22.p8. Control over beacon frequency was fixed in 22.p8. Current research on this item is focusing on the route discovery mechanism.

Tickets:

  • <trac>4927</trac> - [firmware] beacon interval gets reset by other operations [beacon control is fixed in 22.p8]

Active antenna

Deals with: problems specific to the use of the standalone active antennae

Summary:

Tickets:

  • programming

Improvements to testability

Deals with: issues on testing capabilities

Summary: Two issues are just waiting for driver patch approval: (1) capturing traffic from an XO is ineffective (because it keeps sending out beacon frames during the capture) and (2) NIC statistics, which are important debugging information, are garbage.

Tickets:

  • <trac>6709</trac> - beaconing while monitoring [driver patch waiting for approval]
  • <trac>6666</trac> - ethtool -S msh0 returning noise [driver patch waiting for approval]

Interface issues

Deals with: issues where the network design choices conflict with other design choices.

Summary: This covers the interface with the suspend/resume feature. Right now, the activity on this front is the introduction of a multicast filter in the firmware (22.p8), so that an XO will wake up only for certain multicast frames (not for all of them). This needs support in the kernel (the driver should inform the firmware of the multicast addresses), otherwise collaboration will break (<trac>6818</trac>).

Tickets:

  • <trac>6818</trac> - Driver does not set link level multicast addresses into firmware when ip address assigned to mesh interface [Mesh view not working with 22.p8/p9]
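
For reference, the link-level multicast addresses that such a filter matches are derived from the IP multicast group by the standard Ethernet mapping. The sketch below shows that mapping; the helper name is illustrative, and the mDNS groups are used only as examples, since this page does not list which groups the driver must actually register.

```python
import ipaddress


def multicast_mac(address: str) -> str:
    """Map an IP multicast group to its link-level multicast address.

    IPv4: 01:00:5e followed by the low 23 bits of the group address.
    IPv6: 33:33 followed by the low 32 bits of the group address.
    """
    ip = ipaddress.ip_address(address)
    if not ip.is_multicast:
        raise ValueError(f"{address} is not a multicast address")
    value = int(ip)
    if ip.version == 4:
        low = value & 0x7FFFFF  # keep only the low 23 bits
        octets = [0x01, 0x00, 0x5E,
                  (low >> 16) & 0x7F, (low >> 8) & 0xFF, low & 0xFF]
    else:
        low = value & 0xFFFFFFFF  # keep only the low 32 bits
        octets = [0x33, 0x33,
                  (low >> 24) & 0xFF, (low >> 16) & 0xFF,
                  (low >> 8) & 0xFF, low & 0xFF]
    return ":".join(f"{octet:02x}" for octet in octets)


# Example groups: the IPv4 and IPv6 mDNS groups used by Avahi.
print(multicast_mac("224.0.0.251"))
print(multicast_mac("ff02::fb"))
```

If the firmware filter is not told about an address like these, frames sent to the corresponding group will not wake a suspended XO, which is consistent with the mesh view problem reported in the ticket.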

Miscellanea

Deals with: issues that seem related to driver/firmware but cannot be clearly classified

Summary:

Tickets:

  • <trac>6529</trac> - Multicast ping over eth0 (not mesh) sometimes produces duplicate packets [this may provide us with useful information, but it is hardly an issue by itself]
  • <trac>6527</trac> - Mesh does not forward multicast packets (most of the time) - [same as above - duplicated ipv6 pings are not an issue in themselves]

Tickets in Limbo

I am writing another page with a list of tickets that we should close or update (and possibly bring to this page).