Wireless Issues Apr08

From OLPC

This is an attempt to start a detailed taxonomy of the network-related problems we face as of April 2008. It is a network-centric view (please see the "focus" box, below), so the level of detail is higher for network issues and lower or absent for other aspects (for example, suspend and resume, datastore, etc.).

An issue here may be one of:

  • implementation issues (bugs).
  • design issues (poor choices).

In terms of details, expect to find the following:

  • Focus: Libertas driver/firmware
  • Slightly blurred: School server and middleware issues
  • Completely blurred: Application issues, UI issues, hardware issues.


Applications and Sugar

Deals with: problems that should be fixed/enhanced in activities (Read, Chat, etc) or in the User Interface (mesh view).

Summary: There are some interface issues to address. The circle in the palette does not always show the correct information, and there are sometimes two circles (*). The blinking of the icons is not always correct either. Some of the UI issues are related to the way we store past connectivity information in the networks.cfg file (which easily becomes outdated). (*) Two circles make sense if you are, for example, an MPP, but that is not the case here.


Tickets:

  • <trac>6774</trac> - Read fails to transfer document when using salut. [this seems specific to Read]
  • <trac>5459</trac> - second circle in sugar home view provides false information [apparently a UI issue - check overall UI behaviour]
  • what other open tickets fall in this category?

School server

Deals with: applications that are part of or related to the school server.

Summary: As I gather, there are two major problems here: (1) DHCP sometimes fails and we don't know exactly why (we need to find the root cause) and (2) ejabberd fails frequently (it is reportedly unstable software).


Tickets:

  • <trac>4153</trac> - Connect to linklocal instead of school mesh, DHCP failure. [why?]
  • <trac>5963</trac> - Laptop fails to reliably connect with school mesh portal [this is closed and it is a duplicate of <trac>4153</trac>].
  • <trac>6287</trac> - Associating with one mesh prevents you from successfully with a different one. [bad title, this is another instance of "cannot find the school server"]
  • <trac>5908</trac> - Laptop unable to connect to schoolserver jabber server. [The real question of the reliability of ejabberd should be addressed].
  • what other open tickets fall in this category?

Middleware

Deals with: everything that cannot be fixed in the libertas driver/firmware or in the applications (NetworkManager, Sugar Presence Service, Telepathy salut, Telepathy gabble, Avahi/mDNS).

Summary: We have two major categories of problems here: (1) a scalability issue, due to the use of mDNS and to the Avahi implementation, and (2) bugs in presence/telepathy-salut/gabble that are being addressed. A decision should be made on whether we are switching to Cerebro.


Tickets:

  • <trac>5335</trac> - More mdns traffic then expected [anecdotal, could be merged with 5078]
  • <trac>5078</trac> - A more mesh-friendly presence protocol for salut [same issue as above]
  • <trac>6553</trac> - No XOs in the mesh view and avahi seemed crashed [no build information in the ticket]
  • <trac>6572</trac> - Replace key with hash to reduce avahi TXT size [will we invest time in avahi?]
  • <trac>6889</trac> - Using wired connection, gabble does not attempt reconnect to jabber server
  • <trac>6888</trac> - Laptop connects to presence server, but not seen by other laptops
  • <trac>6886</trac> - Laptop stops running Gabble and reverts to Salut
  • <trac>6881</trac> - Laptop unable to connect to schoolserver presence service
  • <trac>6882</trac> - Laptop was running both salut and gabble at same time
  • <trac>6883</trac> - Other laptops aren't displayed in Neighborhood View
  • <trac>6884</trac> - Incorrect number of laptops shown in neighborhood view
  • <trac>6855</trac> - Need to extend the network scan to look for school server as AP [actually this is asking salut to shut up in infra mode]
  • <trac>6750</trac> - Incorrect wireless setting after resume [NM and suspend/resume interaction problem]
  • <trac>6872</trac> - 703 and 702 builds - network manager keeps trying to connect to mesh network even after associated w/ AP [at first, not consistent with my observations]
  • <trac>6855</trac> - Need to extend the network scan to look for school server as AP [what is requested? - it seems another instance of shut salut down]
  • what other open tickets fall in this category?

Hardware

Deals with: Problems that cannot/should not be fixed by software.

Summary: There are some reports of bad wireless interfaces (poor radio sensitivity). Are they frequent? It doesn't seem so. We cannot expect every device to have the same range. The only report I've seen was not validated (see 4068). My feeling is: low priority.

Tickets:

  • <trac>4901</trac> - Ctest cannot see wireless APs [antenna sensitivity reduced, *when compared to other XO*]
  • <trac>4068</trac> - Range of communication between 2 XOs is limited only to 20 meters [I tested these two XOs and they behaved as expected]
  • what other open tickets fall in this category?

Libertas driver/firmware

Deals with: issues that need to be fixed/enhanced in the libertas driver/firmware.

Since this is the focal point of this page, a more detailed itemization follows:

Infra mode issues

Deals with: issues that prevent association or correct operation under infrastructure mode.

Summary: Two major areas: (1) There are issues with pre-N APs (like an AirGo model) (ticket #5527) and (2) an environment with a lot of Linksys WRT54G units is still a hostile scenario (due to the way they automatically detect WDS peers), but this can be worked around by disabling the lazy WDS feature on the AP. There is also an assortment of anecdotal reports of incompatibility or reassociation problems, but they don't seem different from what you'd get with any other wireless device.

Tickets:

  • <trac>6279</trac> - Cannot see Linksys AP on channel 9
  • <trac>2097</trac> - Can't do DHCP at vmware @ 5CC
  • <trac>5527</trac> - G1G1 users complain that the XO affects their local network
  • <trac>4975</trac> - Association fails
  • <trac>6811</trac> - WLAN doesn't reassociate with known access points
  • <trac>6117</trac> - Can't connect to Access Point if SSID is not broadcast [same as 6537]
  • <trac>6537</trac> - Support for Cloaked Access Points [a duplicate of 6117, but more detailed discussion]
  • what other open tickets fall in this category?

Path discovery issues

Deals with: issues (design and bugs) in the reactive path discovery mechanism.

Summary: In terms of known bugs, we recently addressed a problem in which the XO did not always answer RREQs (#6589). Other than that, we are dealing with optimizations now. The main cause of problems seems to be the burstiness of the path discovery mechanism. Tests reveal that there are many possible improvements here (dynamically changing the contention window size, the route expiration time or the RREQ delay, to name a few), but the trend seems to be that we need to _adapt_ to dense mesh environments (like a school).
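
To make the density argument concrete, here is a rough back-of-the-envelope sketch. It is not a model of the actual firmware: it assumes that each path discovery is flooded and that every node rebroadcasts the RREQ a fixed number of times, and the function name and all parameters are hypothetical. It only illustrates why a longer route expiration time reduces RREQ load in a dense mesh.

```python
def rreq_frames_per_second(nodes: int,
                           active_paths: int,
                           route_expiry_s: float,
                           rebroadcasts_per_node: int = 1) -> float:
    """Estimate RREQ frames per second under a simple flooding assumption.

    Assumption (not taken from the firmware): every node rebroadcasts each
    RREQ `rebroadcasts_per_node` times, and every active path is rediscovered
    once per route expiration period.
    """
    if route_expiry_s <= 0:
        raise ValueError("route_expiry_s must be positive")
    # One discovery costs one flood across all nodes.
    frames_per_discovery = nodes * rebroadcasts_per_node
    # Each active path triggers one discovery per expiration period.
    return frames_per_discovery * active_paths / route_expiry_s


# Hypothetical classroom: 50 XOs and 20 active paths.
short_expiry = rreq_frames_per_second(50, 20, route_expiry_s=10)
long_expiry = rreq_frames_per_second(50, 20, route_expiry_s=40)
```

Under these assumptions the load scales linearly with the number of nodes and inversely with the expiration time, which is one way to see why a fixed setting tuned for a sparse mesh may not suit a school.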

Tickets:

  • <trac>6589</trac> - xo stops responding to mesh path requests frames
  • what other open tickets fall in this category?

Improvements to scalability

Deals with: everything that can improve scalability (by freeing airtime or implementing various adaptive behaviours).

Summary: Air time is precious. Control over management frames is very important, and we had significant improvements in this item. Control over probe response retries was introduced in 22.p6, and an Adaptive Contention Window based on the number of neighbours was introduced in 22.p8. Control over beacon frequency was fixed in 22.p8. Current research in this item is focusing on the route discovery mechanism.
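
The idea of a contention window that adapts to the number of neighbours can be sketched as follows. This is only an illustration and not the algorithm used by the firmware: the doubling rule, the `per_step` threshold and the function name are assumptions, and the default bounds (31 and 1023) are the usual 802.11b values.

```python
def contention_window(neighbours: int,
                      cw_min: int = 31,
                      cw_max: int = 1023,
                      per_step: int = 8) -> int:
    """Pick a contention window that grows with the number of neighbours.

    Hypothetical rule: start at cw_min and move to the next window size
    (2 * cw + 1) for every `per_step` neighbours, never exceeding cw_max.
    """
    if neighbours < 0:
        raise ValueError("neighbours must be non-negative")
    cw = cw_min
    for _ in range(neighbours // per_step):
        if cw >= cw_max:
            break
        cw = min(2 * cw + 1, cw_max)
    return cw


# A sparse mesh keeps the small window; a dense one backs off further.
sparse = contention_window(3)
dense = contention_window(40)
```

A larger window spreads transmissions from many neighbours over more slots, which lowers the chance of collisions at the cost of some extra delay per frame.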

Tickets:

  • <trac>4927</trac> - [firmware] beacon interval gets reset by other operations [beacon control is fixed in 22.p8]
  • what other open tickets fall in this category?

Interface issues

Deals with: issues where the network design choices conflict with other design choices.

Summary: Interaction with the suspend/resume feature. Right now, the activity on this front is the introduction of a multicast filter in the firmware (22.p8), so that an XO will wake up only on certain multicast frames (not on all of them). This needs support in the kernel (the driver should pass the multicast addresses to the firmware), otherwise collaboration will break (<trac>6818</trac>).
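
For reference, the link-level multicast addresses that such a filter would need follow from the IP multicast groups through the standard Ethernet mappings (RFC 1112 for IPv4, RFC 2464 for IPv6). The sketch below only shows that mapping. It is not the driver code, the helper names are hypothetical, and the choice of the mDNS groups as the example is an assumption about which groups matter for collaboration.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IP multicast group to its link-level multicast address."""
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    raw = addr.packed
    if addr.version == 4:
        # RFC 1112: 01:00:5e followed by the low 23 bits of the group.
        mac = bytes([0x01, 0x00, 0x5E, raw[1] & 0x7F, raw[2], raw[3]])
    else:
        # RFC 2464: 33:33 followed by the low 32 bits of the group.
        mac = bytes([0x33, 0x33]) + raw[-4:]
    return ":".join(f"{b:02x}" for b in mac)


def filter_table(groups):
    """Return the sorted, de-duplicated link-level addresses for the groups."""
    return sorted({multicast_mac(g) for g in groups})


# Example: the mDNS groups (IPv4 and IPv6).
table = filter_table(["224.0.0.251", "ff02::fb"])
```

If the firmware filter does not contain these addresses, the corresponding frames would not wake the XO, which is consistent with the collaboration breakage described above.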

Tickets:

  • <trac>6818</trac> - Driver does not set link level multicast addresses into firmware when ip address assigned to mesh interface [Mesh view not working with 22.p8/p9]
  • <trac>6528</trac>, <trac>4616</trac> - System often drops the packet that wakes the laptop from suspend.
  • what other open tickets fall in this category?

Miscellanea

Deals with: issues that seem related to driver/firmware but cannot be clearly classified

Summary: We recently implemented a multicast filter in order to support suspend/resume (#4616, #6818). I believe that once idle suspend is activated by default, we will need to put a lot of effort into testing whether the network behaves as expected (this is something we started but have not finished).

New driver patches, necessary to switch to firmware release 22.p10, are currently going through approval. That release fixes many bugs and implements new features (refer to #6931).

Tickets:

  • <trac>6529</trac> - Multicast ping over eth0 (not mesh) sometimes produces duplicate packets [this may provide us with useful information, but it is hardly an issue by itself]
  • <trac>6527</trac> - Mesh does not forward multicast packets (most of the time) [same as above - duplicated IPv6 pings are not an issue in themselves]
    This bug has nothing to do with "duplicated" packets. If the mesh isn't sending multicast packets to their destinations, then of course there's a problem; the presence service depends on the network to carry multicast packets to their destinations! 209.237.225.236 13:09, 8 May 2008 (EDT)
  • what other open tickets fall in this category?

Feature requests

Deals with: requested/suggested features that are not implemented in the firmware/driver.

Summary: We currently do not support power save mode or cloaked APs. The first is not easy to implement in the current firmware (according to Marvell). The second needs changes to the UI as well as support for active scanning in our startup mechanism.

Tickets:

  • <trac>5418</trac> - [firmware] powersave mode non-functional. [firmware does not support powersaving]
  • what other open tickets fall in this category?