Wireless Issues Apr08: Difference between revisions

From OLPC
(Don't ignore 6527; also 6528 and 4616 (system drops packet causing a resume))
 
(3 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This is an '''attempt''' to start a detailed taxonomy of the '''network-related''' problems we face as of '''April-08.'''
This is an '''attempt''' to start a detailed taxonomy of the '''network-related''' problems we face as of '''April-08.'''
It is a '''network-centric view''' (please see the "focus" box, bellow) so, detailing is higher on the network issues and lower or absent on another aspects (example, suspend and resume, datastore, etc).
It is a '''network-centric view''' (please see the "focus" box, bellow) so, detailing is higher on the network issues and lower or absent on another aspects (example, suspend and resume, datastore, etc).

'''DISCLAIMER:'''
Please contribute, so this gets bigger than the perspective of one person.


An issue here, may be:
An issue here, may be:
Line 18: Line 15:
'''Deals with:''' problems that should be fixed/enhanced in activities (Read, Chat, etc) or in the User Interface (mesh view).
'''Deals with:''' problems that should be fixed/enhanced in activities (Read, Chat, etc) or in the User Interface (mesh view).


'''Summary:''' Some of our collaboration issues are related to application bugs. One good example is #6774. We don't seem to have many of the UI issues on the mesh view as we've had in the past (dialogue boxes for entering keys, etc) but the information provided in the home view and the blinking of some circles may be incorrect/incoherent. We need to check this.
'''Summary:''' There are some interface issues to address. The circle in the pallet don't always bring the correct information and there are sometimes two circles (*). The blinking of the icons is not always correct too. Some of the UI issues are related to the way we store past connectivity information in the networks.cfg file (that gets obsoleted easily).
(*) Two circles make sense if you are, for example, an MPP, but that's not the case.



'''Tickets:'''
'''Tickets:'''
* <trac>6774</trac> - Read fails to transfer document when using salut. [this seems specific to Read]
* <trac>6774</trac> - Read fails to transfer document when using salut. [this seems specific to Read]
* <trac>5459</trac> - second circle in sugar home view provides false information [apparently an UI issue - check overall UI behaviour]
* <trac>5459</trac> - second circle in sugar home view provides false information [apparently an UI issue - check overall UI behaviour]
* what other open tickets fall in this category?
* what else?


==School server==
==School server==
'''Deals with:''' applications that are part of or related to the school server.
'''Deals with:''' applications that are part of or related to the school server.


'''Summary:''' As I gather, there are two major problem here: (1) DHCP fails sometimes and we don't know exaclty why (we should try to find the root cause) and (2) Ejabberd fails frequently. This is clearly a limitation of the software. Attentive reading through the lists and tickets support that the ejjaber server is not stable.
'''Summary:''' As I gather, there are two major problem here: (1) DHCP fails sometimes and we don't know exactly why (we need to find the root cause) and (2) Ejabberd fails frequently (it is reportedly an unstable software).



'''Tickets:'''
'''Tickets:'''
Line 34: Line 34:
* <trac>5963</trac> - Laptop fails to reliably connect with school mesh portal [this is closed and it is a duplicate of <trac>4153</trac>].
* <trac>5963</trac> - Laptop fails to reliably connect with school mesh portal [this is closed and it is a duplicate of <trac>4153</trac>].
* <trac>6287</trac> - Associating with one mesh prevents you from successfully with a different one.[bad title, this is another instance of "cannot find the school server"]
* <trac>6287</trac> - Associating with one mesh prevents you from successfully with a different one.[bad title, this is another instance of "cannot find the school server"]
* <trac>5908</trac> - Laptop unable to connect to schoolserver jabber server. [The real question of reliability of ejjaberd should be addressed, instead of suggesting that a reactive protocol is not a good choice (even if this is the case)].
* <trac>5908</trac> - Laptop unable to connect to schoolserver jabber server. [The real question of reliability of ejjaberd should be addressed].
* what other open tickets fall in this category?


==Middleware==
==Middleware==
Line 40: Line 41:


'''Summary:'''
'''Summary:'''
We have two major categories of problems here. (1) we have a scalability issue, due to the use of MDNS and due to the avahi implementation and (2) we have bugs in the presence/telepathy-salut/gabble to address. A decision should be made if we are switching to another presence mechanism (like cerebro) because if we are not, we need to keep optimizing/fixing the current one. Take, for example, the <trac>6572</trac> optimization, do we need it or not?
We have two major categories of problems here. (1) we have a scalability issue, due to the use of MDNS and due to the avahi implementation and (2) we have bugs in the presence/telepathy-salut/gabble that are being addressed. A decision should be made if we are switching to cerebro.



'''Tickets:'''
'''Tickets:'''
Line 58: Line 60:
* <trac>6872</trac> - 703 and 702 builds - network manager keeps trying to connect to mesh network even after associated w/ AP [at first, not consistent with my observations]
* <trac>6872</trac> - 703 and 702 builds - network manager keeps trying to connect to mesh network even after associated w/ AP [at first, not consistent with my observations]
*<trac>6855</trac> - Need to extend the network scan to look for school server as AP [what is requested? - it seems another instance of shut salut down]
*<trac>6855</trac> - Need to extend the network scan to look for school server as AP [what is requested? - it seems another instance of shut salut down]
* what other open tickets fall in this category?
* what else?


==Hardware==
==Hardware==
Line 69: Line 71:
* <trac>4901</trac> - Ctest cannot see wireless APs [antenna sensitivity reduced, *when compared to other XO*]
* <trac>4901</trac> - Ctest cannot see wireless APs [antenna sensitivity reduced, *when compared to other XO*]
* <trac>4068</trac> - Range of communication between 2 XOs is limited only to 20 meters [I tested this two XOs and they behaved as expected]
* <trac>4068</trac> - Range of communication between 2 XOs is limited only to 20 meters [I tested this two XOs and they behaved as expected]
* what other open tickets fall in this category?


==Libertas driver/firmware==
==Libertas driver/firmware==
Line 78: Line 81:
'''Deals with''' issues that prevent association or correct operation under infrastructure mode
'''Deals with''' issues that prevent association or correct operation under infrastructure mode


'''Summary:''' Two major areas: (1) There are issues with preN APs (like an AirGo model) (ticket #5527) and (2) an environment with a lot of Linksys WRT54g is still a hostile scenario (due to the way they automatically detect wds peers), but this can be worked around by disabling the lazy wds feature on the AP. There is also an assortment of anedoctal reports on incompatibility or reassociation problems but they don't seem different from what you'd get in any other wireless device.
'''Summary:''' We currently have a known compatibility issues with preN routers (<trac>5527</trac>) and an assortment of association problems (support for cloaked access point, failure to associate in channel different from 1,6,or 11 and others). No clear patterns or major issues it seems.


'''Tickets:'''
'''Tickets:'''
Line 88: Line 91:
* <trac>6117</trac> - Can't connect to Access Point if SSID is not broadcast [same as 6537]
* <trac>6117</trac> - Can't connect to Access Point if SSID is not broadcast [same as 6537]
* <trac>6537</trac> - Support for Cloaked Access Points [a duplicate of 6117, but more detailed discussion]
* <trac>6537</trac> - Support for Cloaked Access Points [a duplicate of 6117, but more detailed discussion]
* what other open tickets fall in this category?


===Path discovery issues===
===Path discovery issues===
'''Deals with:''' issues (design and bugs) in the reactive path discovery mechanism.
'''Deals with:''' issues (design and bugs) in the reactive path discovery mechanism.


'''Summary:''' In terms of known bugs, we recently addressed a problem that the XO not always answered to RREQs (#6589). Other than that, we are dealing with optmizations now. Main cause of problems seems to be the burstiness of the path discovery mechanism. Tests reveal that there are many possible improvements here, dynamically changing contention window size, expiration route time or rreq delay to name a few - but the trend seems to be that we need to _adapt_ to dense mesh envoronments (like a school).
'''Summary:''' Apart from a bug (<trac>6589</trac>) the issue here is the inherent burstiness of a reactive protocol. Optimizations are being studied and include changing the route expiration time (from 10 to 20s) and some other timing tweaks (rreq_delay) and possibly adjustments in the link costs.


'''Tickets:'''
'''Tickets:'''
* <trac>6589</trac> - xo stops responding to mesh path requests frames
* <trac>6589</trac> - xo stops responding to mesh path requests frames
* what other open tickets fall in this category?


===Improvements to scalability===
===Improvements to scalability===
'''Deals with:''' everything that can improve scalability (by freeing airtime or implementing various adaptive behaviours).
'''Deals with:''' everything that can improve scalability (by freeing airtime or implementing various adaptive behaviours).


'''Summary:''' Air time is precious - control over management frames is very important and we had significant improvements in this item. Control over probe response retries was introduced in 22.p6 and Adaptive Contention Window based on the number of neighbours was introduced in 22p8. Control over beacon frequency were fixed in 22.p8. Current research in this item is focusing on the route discovery mechanism. (firmware 22.p8 is still not in released builds due to lack of driver support for some features - refer to http://wiki.laptop.org/go/Wireless_Driver_Required_Changes)
'''Summary:''' Air time is precious - control over management frames is very important and we had significant improvements in this item. Control over probe response retries was introduced in 22.p6 and Adaptive Contention Window based on the number of neighbours was introduced in 22p8. Control over beacon frequency were fixed in 22.p8. Current research in this item is focusing on the route discovery mechanism.


'''Tickets:'''
'''Tickets:'''
* <trac>4927</trac> - [firmware] beacon interval gets reset by other operations [beacon control is fixed in 22.p8]
* <trac>4927</trac> - [firmware] beacon interval gets reset by other operations [beacon control is fixed in 22.p8]
* what other open tickets fall in this category?

===Active antenna===
'''Deals with:''' problems specific to the use of the standalone active antennae

'''Summary:'''

'''Tickets:'''
* programming

===Improvements to testability===
'''Deals with:''' issues on testing capabilities

'''Summary:''' With recent bug fixes we are ok in this item: (1) ethtool -S <trac>6666</trac> was fixed in the driver and (2) Bug <trac>6709</trac> fixed in firmware 22.p8 (not yet released to a build).

'''Tickets:'''
* <trac>6709</trac> - beaconing while monitoring [driver patch waiting for approval]
* <s><trac>6666</trac> - ethtool -S msh0 returning noise [driver patch waiting for approval]</s> fixed


===Interface issues===
===Interface issues===
Line 129: Line 118:
'''Tickets:'''
'''Tickets:'''
* <trac>6818</trac> - Driver does not set link level multicast addresses into firmware when ip address assigned to mesh interfaceMesh view not working with 22.p8/p9
* <trac>6818</trac> - Driver does not set link level multicast addresses into firmware when ip address assigned to mesh interfaceMesh view not working with 22.p8/p9
* <trac>6528</trac>, <trac>4616</trac> - System often drops the packet that wakes the laptop from suspend.
* what other open tickets fall in this category?


===Miscellanea===
===Miscellanea===
Line 134: Line 125:


'''Summary:'''
'''Summary:'''
We recently implemented a multicast filter in order to support suspend/resume (#4616, #6818). I believe that once idle suspend is activated by default we will need to put a lot of effort to test if the network is behaving like expected (this is something we started but not finished).

We are currently in the process of approval of new driver patches necessary to switch to firmware release 22.p10, that fixes many bugs and implement new features (refer to #6931).


'''Tickets:'''
'''Tickets:'''
* <trac>6529</trac> - Multicast ping over eth0 (not mesh) sometimes produces duplicate packets [this may provide us with useful information, but it is hardly an issue by itself]
* <trac>6529</trac> - Multicast ping over eth0 (not mesh) sometimes produces duplicate packets [this may provide us with useful information, but it is hardly an issue by itself]
* <trac>6527</trac> - Mesh does not forward multicast packets (most of the time) - [same as above - duplicated ipv6 pings are not an issue in itself]
* <trac>6527</trac> - Mesh does not forward multicast packets (most of the time) - [same as above - duplicated ipv6 pings are not an issue in itself] <b>This bug has nothing to do with "duplicated" packets. If the mesh isn't sending multicast packets to their destinations, then of course there's a problem; the presence service depends on the network to carry multicast packets to their destinations! [[User:209.237.225.236|209.237.225.236]] 13:09, 8 May 2008 (EDT)</b>
* what other open tickets fall in this category?


===Feature requests===
===Feature requests===
'''Deals with:''' requested/suggested features that are not implemented in the firmware
'''Deals with:''' requested/suggested features that are not implemented in the firmware/driver.


'''Summary:''' We currently do not support power save mode and cloaked APs. The first is not easy to implement in the current firmware (according to Marvell). The second needs change to the UI and always support for active scan in our starting mechanism.
'''Summary:''' Just started that item. Right now it is limited to the power saving mode (not implemented due to complexity)


'''Tickets:'''
'''Tickets:'''
*<trac>5418</trac> - [firmware] powersave mode non-functional. [firmware does not support powersaving]
*<trac>5418</trac> - [firmware] powersave mode non-functional. [firmware does not support powersaving]
* what other open tickets fall in this category?

Latest revision as of 17:09, 8 May 2008

This is an attempt to start a detailed taxonomy of the network-related problems we face as of April-08. It is a network-centric view (please see the "focus" box, below), so the level of detail is higher on the network issues and lower or absent on other aspects (for example, suspend and resume, datastore, etc.).

An issue here may be:

  • implementation issues (bugs).
  • design issues (poor choices).

In terms of details, expect to find the following:

  • Focus: Libertas driver/firmware
  • Slightly blurred: School server and middleware issues
  • Completely blurred: Application issues, UI issues, hardware issues.


Applications and Sugar

Deals with: problems that should be fixed/enhanced in activities (Read, Chat, etc) or in the User Interface (mesh view).

Summary: There are some interface issues to address. The circle in the palette doesn't always show the correct information and there are sometimes two circles (*). The blinking of the icons is not always correct either. Some of the UI issues are related to the way we store past connectivity information in the networks.cfg file (which easily becomes outdated). (*) Two circles make sense if you are, for example, an MPP, but that's not the case here.


Tickets:

  • <trac>6774</trac> - Read fails to transfer document when using salut. [this seems specific to Read]
  • <trac>5459</trac> - second circle in sugar home view provides false information [apparently a UI issue - check overall UI behaviour]
  • what other open tickets fall in this category?

School server

Deals with: applications that are part of or related to the school server.

Summary: As I gather, there are two major problems here: (1) DHCP sometimes fails and we don't know exactly why (we need to find the root cause) and (2) ejabberd fails frequently (it is reportedly unstable software).


Tickets:

  • <trac>4153</trac> - Connect to linklocal instead of school mesh, DHCP failure. [why?]
  • <trac>5963</trac> - Laptop fails to reliably connect with school mesh portal [this is closed and it is a duplicate of <trac>4153</trac>].
  • <trac>6287</trac> - Associating with one mesh prevents you from successfully with a different one.[bad title, this is another instance of "cannot find the school server"]
  • <trac>5908</trac> - Laptop unable to connect to schoolserver jabber server. [The real question of the reliability of ejabberd should be addressed].
  • what other open tickets fall in this category?

Middleware

Deals with: Everything that cannot be fixed in the libertas driver/firmware or at the application belongs here. (NetworkManager, Sugar Presence Service, Telepathy salut, Telepathy gabble, Avahi/MDNS)

Summary: We have two major categories of problems here: (1) a scalability issue, due to the use of mDNS and to the Avahi implementation, and (2) bugs in presence/telepathy-salut/gabble that are being addressed. A decision should be made on whether we are switching to cerebro.


Tickets:

  • <trac>5335</trac> - More mdns traffic then expected [anecdotal, could be merged with 5078]
  • <trac>5078</trac> - A more mesh-friendly presence protocol for salut [same issue as above]
  • <trac>6553</trac> - No XOs in the mesh view and avahi seemed crashed [no build information in the ticket]
  • <trac>6572</trac> - Replace key with hash to reduce avahi TXT size [will we invest time in avahi?]
  • <trac>6889</trac> - Using wired connection, gabble does not attempt reconnect to jabber server
  • <trac>6888</trac> - Laptop connects to presence server, but not seen by other laptops
  • <trac>6886</trac> - Laptop stops running Gabble and reverts to Salut
  • <trac>6881</trac> - Laptop unable to connect to schoolserver presence service
  • <trac>6882</trac> - Laptop was running both salut and gabble at same time
  • <trac>6883</trac> - Other laptops aren't displayed in Neighborhood View
  • <trac>6884</trac> - Incorrect number of laptops shown in neighborhood view
  • <trac>6855</trac> - Need to extend the network scan to look for school server as AP [actually this is asking salut to shut up in infra mode]
  • <trac>6750</trac> - Incorrect wireless setting after resume [NM and suspend/resume interaction problem]
  • <trac>6872</trac> - 703 and 702 builds - network manager keeps trying to connect to mesh network even after associated w/ AP [at first, not consistent with my observations]
  • <trac>6855</trac> - Need to extend the network scan to look for school server as AP [what is requested? - it seems another instance of shut salut down]
  • what other open tickets fall in this category?

Hardware

Deals with: Problems that cannot/should not be fixed by software.

Summary: There are some reports of bad wireless interfaces (poor radio sensitivity). Are they frequent? It doesn't seem so. We cannot expect every device to have the same range. The only report I've seen was not validated (see 4068). My feeling is: low priority.

Tickets:

  • <trac>4901</trac> - Ctest cannot see wireless APs [antenna sensitivity reduced, *when compared to other XO*]
  • <trac>4068</trac> - Range of communication between 2 XOs is limited only to 20 meters [I tested these two XOs and they behaved as expected]
  • what other open tickets fall in this category?

Libertas driver/firmware

Those that need to be fixed/enhanced in the libertas driver/firmware.

Since this is the focal point of this page a more detailed itemization follows:

Infra mode issues

Deals with issues that prevent association or correct operation under infrastructure mode

Summary: Two major areas: (1) there are issues with preN APs (like an AirGo model) (ticket #5527) and (2) an environment with a lot of Linksys WRT54g units is still a hostile scenario (due to the way they automatically detect WDS peers), but this can be worked around by disabling the lazy WDS feature on the AP. There is also an assortment of anecdotal reports of incompatibility or reassociation problems, but they don't seem different from what you'd get with any other wireless device.

Tickets:

  • <trac>6279</trac> - Cannot see Linksys AP on channel 9
  • <trac>2097</trac> - Can't do DHCP at vmware @ 5CC
  • <trac>5527</trac> - G1G1 users complain that the XO affects their local network
  • <trac>4975</trac> - Association fails
  • <trac>6811</trac> - WLAN doesn't reassociate with known access points
  • <trac>6117</trac> - Can't connect to Access Point if SSID is not broadcast [same as 6537]
  • <trac>6537</trac> - Support for Cloaked Access Points [a duplicate of 6117, but more detailed discussion]
  • what other open tickets fall in this category?

Path discovery issues

Deals with: issues (design and bugs) in the reactive path discovery mechanism.

Summary: In terms of known bugs, we recently addressed a problem where the XO did not always answer RREQs (#6589). Other than that, we are now dealing with optimizations. The main cause of problems seems to be the burstiness of the path discovery mechanism. Tests reveal that there are many possible improvements here (dynamically changing the contention window size, the route expiration time or the RREQ delay, to name a few), but the trend seems to be that we need to _adapt_ to dense mesh environments (like a school).

Tickets:

  • <trac>6589</trac> - xo stops responding to mesh path requests frames
  • what other open tickets fall in this category?

Improvements to scalability

Deals with: everything that can improve scalability (by freeing airtime or implementing various adaptive behaviours).

Summary: Air time is precious: control over management frames is very important, and we have made significant improvements in this area. Control over probe response retries was introduced in 22.p6, and an adaptive contention window based on the number of neighbours was introduced in 22.p8. Control over beacon frequency was fixed in 22.p8. Current research in this area is focusing on the route discovery mechanism.

Tickets:

  • <trac>4927</trac> - [firmware] beacon interval gets reset by other operations [beacon control is fixed in 22.p8]
  • what other open tickets fall in this category?

Interface issues

Deals with: issues where the network design choices conflict with other design choices.

Summary: Interface with the suspend/resume feature. Right now, the activity on this front is the introduction of a multicast filter in the firmware (22.p8), so that an XO will wake up only for certain multicast frames (not for all of them). This needs support in the kernel (the driver should inform the firmware of the multicast addresses), otherwise collaboration will break (<trac>6818</trac>).
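As a rough illustration of what such a multicast filter has to work with, the sketch below maps IP multicast groups to their link-level multicast addresses (the standard IPv4 and IPv6 Ethernet mappings) and checks whether an incoming frame should wake the laptop. The `WakeFilter` class and its method names are hypothetical; the actual firmware and driver interface is not described on this page.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Return the link-level multicast MAC address for an IP multicast group."""
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    if addr.version == 4:
        # IPv4: prefix 01:00:5e followed by the low 23 bits of the group.
        low = int(addr) & 0x7FFFFF
        octets = [0x01, 0x00, 0x5E,
                  (low >> 16) & 0xFF, (low >> 8) & 0xFF, low & 0xFF]
    else:
        # IPv6: prefix 33:33 followed by the low 32 bits of the group.
        low = int(addr) & 0xFFFFFFFF
        octets = [0x33, 0x33,
                  (low >> 24) & 0xFF, (low >> 16) & 0xFF,
                  (low >> 8) & 0xFF, low & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)


class WakeFilter:
    """Hypothetical model of a wake-on-multicast filter (not the real API)."""

    def __init__(self) -> None:
        self.allowed = set()

    def join(self, group: str) -> None:
        # The driver would pass this link-level address down to the firmware.
        self.allowed.add(multicast_mac(group))

    def should_wake(self, dst_mac: str) -> bool:
        # Wake only for frames addressed to a registered multicast address.
        return dst_mac.lower() in self.allowed


if __name__ == "__main__":
    f = WakeFilter()
    f.join("224.0.0.251")   # mDNS over IPv4
    f.join("ff02::fb")      # mDNS over IPv6
    print(sorted(f.allowed))
    print(f.should_wake("01:00:5E:00:00:FB"))
    print(f.should_wake("01:00:5e:00:00:01"))
```

If the driver never registers these addresses, a filter of this kind drops the mDNS frames that presence relies on, which is consistent with the collaboration breakage reported in <trac>6818</trac>.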

Tickets:

  • <trac>6818</trac> - Driver does not set link level multicast addresses into firmware when ip address assigned to mesh interface; Mesh view not working with 22.p8/p9
  • <trac>6528</trac>, <trac>4616</trac> - System often drops the packet that wakes the laptop from suspend.
  • what other open tickets fall in this category?

Miscellanea

Deals with: issues that seem related to driver/firmware but cannot be clearly classified

Summary: We recently implemented a multicast filter in order to support suspend/resume (#4616, #6818). I believe that once idle suspend is activated by default we will need to put a lot of effort into testing whether the network behaves as expected (this is something we started but did not finish).

We are currently in the process of getting approval for the new driver patches necessary to switch to firmware release 22.p10, which fixes many bugs and implements new features (refer to #6931).

Tickets:

  • <trac>6529</trac> - Multicast ping over eth0 (not mesh) sometimes produces duplicate packets [this may provide us with useful information, but it is hardly an issue by itself]
  • <trac>6527</trac> - Mesh does not forward multicast packets (most of the time) - [same as above - duplicated ipv6 pings are not an issue in itself] This bug has nothing to do with "duplicated" packets. If the mesh isn't sending multicast packets to their destinations, then of course there's a problem; the presence service depends on the network to carry multicast packets to their destinations! 209.237.225.236 13:09, 8 May 2008 (EDT)
  • what other open tickets fall in this category?

Feature requests

Deals with: requested/suggested features that are not implemented in the firmware/driver.

Summary: We currently do not support power save mode or cloaked APs. The first is not easy to implement in the current firmware (according to Marvell). The second needs changes to the UI and also support for active scan in our starting mechanism.

Tickets:

  • <trac>5418</trac> - [firmware] powersave mode non-functional. [firmware does not support powersaving]
  • what other open tickets fall in this category?