Presentations/August 2008 Networking Talk

On August 26, 2008, Ricardo Carrano presented "Recent Investigations and Future Developments in the Wireless Front".

Slides are here.

Notes

ricardo's talk on networking
quasi-transcription notes by cscott
implementation suggestions
I1: detect and adapt
a user mode daemon that estimates network environment
sparse vs dense, etc
infra vs mesh
xo vs active antenna
and tweaks various network parameters to match
long digression as we fail to agree what a "dense mesh" means
i contend the measure should be local, others argue for measures based on total numbers of connected machines, etc
another parameter for estimation: overall noise level -- quiet, or noisy like 1cc?
"Mesh Adaptation Daemon"
next slide. parameters to be measured by the MAD:
idle denseness / active denseness / congestion
mobility / link quality
ac powered / battery powered / low battery
e.g., we forward packets for the mesh only if we have sufficient battery.
Density vs Multicast Rate: increase speed (which also increases error rate) as density increases.
AC powered: if AC powered, we can also assume mobility is low, which then means we can increase route expiration time and rreq_delay.
we can also use path errors, and denseness/congestion status (as well as power status) to estimate mobility
back to density vs multicast rate. increased speed also decreases reception distance, so increase speed only if we think we're dense enough
power vs metrics: if a node runs on battery, we should advertise worse metrics, so that it is not preferred for routes.
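A minimal sketch of the kind of decision table the MAD could apply, based only on the rules noted above. Every name, threshold, and rate here is a hypothetical placeholder (the talk did not give concrete values), except the 10 s and 20 s route expiration times, which come from the I10 discussion further down.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    neighbours: int       # locally observed neighbour count (density estimate)
    on_ac_power: bool
    battery_percent: int


@dataclass
class Settings:
    forward_for_mesh: bool      # do we relay other nodes' packets?
    multicast_rate_mbps: float  # higher rate only when the mesh is dense
    route_expiration_s: int     # longer when we assume low mobility
    metric_penalty: int         # extra cost advertised so routes avoid us


def adapt(env, dense_threshold=10, low_battery_percent=20):
    """Map an estimated environment to network settings (placeholder values)."""
    dense = env.neighbours >= dense_threshold
    low_battery = (not env.on_ac_power) and env.battery_percent < low_battery_percent
    return Settings(
        # Forward for the mesh only if we have sufficient battery.
        forward_for_mesh=not low_battery,
        # Higher multicast rate shortens range, so use it only when dense.
        multicast_rate_mbps=11.0 if dense else 2.0,
        # AC powered implies low mobility, so routes can live longer.
        route_expiration_s=20 if env.on_ac_power else 10,
        # Battery-powered nodes advertise a worse metric.
        metric_penalty=0 if env.on_ac_power else 1,
    )
```

A real daemon would also have to estimate mobility, congestion, and noise, which this sketch leaves out.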
OK, moving on to implementation point 2
I2: Management traffic
we're talking about beacons, probe request/response, etc.
reduce the amount of traffic we generate here
graph of beacon frequency vs number of nodes
for # of nodes from 1-10
1 XO: 9Hz beacon
10 XOs: 12 Hz
anecdotally: 50 XOs: <20 Hz.
but still: 1Hz would be enough. That would save 1% of airtime.
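A back-of-the-envelope check of the airtime figure. The beacon size (100 bytes), the 1 Mb/s beacon rate, and the 192 microsecond long preamble are assumptions, not numbers from the slides.

```python
def beacon_airtime_fraction(beacon_hz, frame_bytes=100, rate_mbps=1.0, preamble_us=192.0):
    """Fraction of airtime spent on beacons at a given aggregate beacon rate."""
    # bytes * 8 bits / (Mb/s) gives microseconds on the air for the frame body.
    frame_us = frame_bytes * 8 / rate_mbps + preamble_us
    return beacon_hz * frame_us / 1_000_000


# Savings when going from the measured 12 Hz (10 XOs) down to 1 Hz.
saving = beacon_airtime_fraction(12) - beacon_airtime_fraction(1)
```

Under these assumptions each beacon occupies about 1 ms, so `saving` comes out at roughly 0.011, i.e. on the order of the 1% mentioned above.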
next slide: probe storms.
one XO sends a probe, everyone tries to respond at once and fails, and then they all start retrying, etc.
the slide shows only 10 laptops, saturating the network during one of these storms.
proposal: only retry twice, not 9 times.
this should raise the number of laptops we can support from roughly 20 to 25. not a huge improvement, but worthwhile.
michalis: the source of these probe storms is network manager scans; NM scans every 2 seconds
if not associated, then 2 seconds, if associated then it backs off to 2 minutes or so.
unless you are in ad hoc mode, you can get away with a totally passive scan; we should do this.
proposal: we should be doing a passive scan.
we have a switch to do a partly-passive scan: we send out the probes, but we disable the responses from the XO. this isn't turned on by default.
what we *do* currently is reduce the number of retries from 10 (the default) to 2.
Implementation Proposal 3
I3: Rate Adaptation Logic
XO can transmit frames at many data rates; we should use the highest we can get away with
the higher the rate, the less airtime it consumes (but the higher the probability of corruption)
Marvell's firmware uses ARF, the first algorithm created to do rate adaptation.
we try to broadcast at highest rate. if it fails three times, fail down to next lower rate, repeat.
if we are successful 10 times, then try to increase the rate by one step.
main issue: no distinction between failures due to noise and those due to congestion.
so in a congested environment, we fail and thus lower the rate, which makes things worse: now even more congestion!
so more transmissions fail, and we lower the rate even further, etc.
this is mesh mode only; in infrastructure mode the AP mediates the rate adaptation algorithm.
in this next slide, CBR = constant bitrate. We're adding a steady stream of 1500 byte frames at 50ms intervals.
This problem can't be fixed in the current generation of the marvell chipset, due to memory limitations
the workaround for the current generation is MAD (again)
we estimate congestion, and determine when the rate adaptation algorithm is just making things worse, and set the hardware to forbid rates below (say) 22Mb.
this prevents us from falling all the way down to 1Mb and making the congestion 50x worse.
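The step-down/step-up rule and the proposed rate floor can be sketched as follows. This is an illustration, not Marvell's firmware: the rate list is the standard 802.11b/g set (an assumption), the failure and success counts are taken as consecutive (also an assumption), and `floor_mbps` is a hypothetical parameter standing in for the limit the MAD would impose.

```python
# Standard 802.11b/g data rates in Mb/s (assumed; the notes do not list them).
RATES_MBPS = [1, 2, 5.5, 6, 9, 11, 12, 18, 24, 36, 48, 54]


class ArfSketch:
    """ARF-style rate control using the thresholds quoted in the notes."""

    def __init__(self, rates=RATES_MBPS, fail_limit=3, success_limit=10, floor_mbps=None):
        self.rates = list(rates)
        self.fail_limit = fail_limit
        self.success_limit = success_limit
        # Hypothetical MAD-imposed floor: never step below this rate.
        self.floor_index = 0 if floor_mbps is None else self.rates.index(floor_mbps)
        self.index = len(self.rates) - 1  # start at the highest rate
        self.failures = 0
        self.successes = 0

    def report(self, ok):
        """Record one transmission result and return the rate to use next."""
        if ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.success_limit and self.index < len(self.rates) - 1:
                self.index += 1  # 10 successes in a row: try one step up
                self.successes = 0
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.fail_limit:
                # 3 failures in a row: one step down, but not below the floor
                self.index = max(self.floor_index, self.index - 1)
                self.failures = 0
        return self.rates[self.index]


def rate_after_failures(count, floor_mbps=None):
    """Rate reached after `count` back-to-back failures (e.g. under congestion)."""
    arf = ArfSketch(floor_mbps=floor_mbps)
    rate = arf.rates[arf.index]
    for _ in range(count):
        rate = arf.report(False)
    return rate
```

With no floor, `rate_after_failures(60)` ends at 1 Mb/s; with an illustrative floor of 11 Mb/s it stops at 11. The floor value is chosen only because it appears in the assumed rate list.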
Implementation note 6
I6: Metrics
Costs associated with probe requests at various bit rates.
Currently 54Mbps=11, 36=28, 11=46, 1=64.
Proposed values: 54Mb=963, 36=1073, 11=1997, 1=12906
and for active antenna: 54Mb=962, 36=1072, 11=1996, 1=12905. (ie, one better)
this prefers routes via the active antenna
also, the difference between 11Mb and 1Mb more accurately reflects the amount of airtime taken by the lower rate.
Better yet, use MAD to take other metrics into account, like battery and mobility.
http://wiki.laptop.org/go/Path_discovery_metric
it's not all about airtime, although airtime is important.
also: queuing frames at intermediate nodes: memory and CPU requirements of this.
we renormalize the costs so that everything is time based
factoring in the times required to queue a hop, so that they are directly comparable to the airtimes.
by biasing the active antenna slightly down, we go via the active antenna when it's convenient.
some confusion here
ricardo clarified that *path metrics* are not a reasonable means to fix congestion issues
even though other *network parameters* can be used to address congestion (like, say, beacon rate)
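To see what the proposed per-hop costs change, here is a small comparison using the numbers quoted above. It assumes the cost of a path is the sum of its per-hop costs, which the notes do not state explicitly; `path_cost` is a hypothetical helper.

```python
# Per-hop costs keyed by link rate in Mb/s, as quoted in the notes.
CURRENT = {54: 11, 36: 28, 11: 46, 1: 64}
PROPOSED = {54: 963, 36: 1073, 11: 1997, 1: 12906}


def path_cost(table, hop_rates):
    """Total cost of a path, assuming costs simply add up hop by hop."""
    return sum(table[rate] for rate in hop_rates)


one_slow_hop = [1]        # a single 1 Mb/s link
six_fast_hops = [54] * 6  # six 54 Mb/s links in a row

# Current costs: 64 for the slow hop vs 66 for six fast hops.
current_prefers_slow = path_cost(CURRENT, one_slow_hop) < path_cost(CURRENT, six_fast_hops)
# Proposed costs: 12906 for the slow hop vs 5778 for six fast hops.
proposed_prefers_slow = path_cost(PROPOSED, one_slow_hop) < path_cost(PROPOSED, six_fast_hops)
```

Under the additive assumption, the current table picks the single 1 Mb/s hop over six 54 Mb/s hops, while the proposed table picks the faster multi-hop route.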
Implementation note 7
I7: NWB efficiency
NWB = Network Wide Broadcast
We are using a simple flood fill algorithm when we need to reach all the nodes in the mesh.
we can't remove broadcast entirely, because some information inherently needs to reach all the nodes: presence info, and path discovery mechanism.
proposal: SBA (Scalable Broadcast Algorithm)
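A rough sketch contrasting plain flooding with a coverage-based pruning rule in the spirit of SBA: a node skips its rebroadcast when every one of its neighbours is already known to have been covered by a transmission it heard. This is a simplification under ideal conditions (no loss, no random delay, neighbour lists known in advance), not the full algorithm, and the function names are hypothetical.

```python
from collections import deque


def flood(graph, source):
    """Plain flooding: every reached node rebroadcasts exactly once."""
    seen = {source}
    queue = deque([source])
    transmissions = 0
    while queue:
        node = queue.popleft()
        transmissions += 1
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return transmissions, seen


def coverage_pruned(graph, source):
    """Rebroadcast only if some neighbour is not yet known to be covered."""
    covered = {node: set() for node in graph}  # what each node knows is covered
    received = {source}
    scheduled = {source}
    queue = deque([source])
    transmissions = 0
    while queue:
        node = queue.popleft()
        if node != source and graph[node] <= covered[node]:
            continue  # all neighbours already heard it from someone else
        transmissions += 1
        for nb in graph[node]:
            received.add(nb)
            # The receiver learns that the sender and its neighbours are covered.
            covered[nb] |= graph[node] | {node}
            if nb not in scheduled:
                scheduled.add(nb)
                queue.append(nb)
    return transmissions, received


# Five nodes that can all hear each other, and a four-node chain.
full_mesh = {n: {m for m in range(5) if m != n} for n in range(5)}
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
```

In the fully connected example flooding costs five transmissions and the pruned version one; in the chain the saving is only the last node's rebroadcast.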
Skipping I8, I9, which are nortel recommendations.
I10: Route Expiration Time
Paths time out after X seconds.
X=10, according to ricardo.
Slide: colorful graph
10 laptops, pinging a multicast address once a second
x axis is real time, showing periodicity of the network utilization
y-axis is airtime utilization.
tradeoff between timeout and mobility
however, we redo path discovery if we see the path is broken
so this is really a path optimality tradeoff: how long do we keep using a suboptimal path which is not completely broken?
proposal: immediately double the route timeout to 20 s
iwpriv msh0 route_exp_time 20 <- something like this.
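The timeout behaviour can be pictured with a toy route cache. This is not the firmware's data structure; `RouteCache` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class RouteCache:
    """Toy route table whose entries expire a fixed time after installation."""

    def __init__(self, expiration_s=10.0):
        self.expiration_s = expiration_s
        self._routes = {}  # destination -> (next hop, time installed)

    def install(self, dest, next_hop, now):
        self._routes[dest] = (next_hop, now)

    def lookup(self, dest, now):
        """Return the next hop, or None if a new path discovery is needed."""
        entry = self._routes.get(dest)
        if entry is None:
            return None
        next_hop, installed = entry
        if now - installed >= self.expiration_s:
            del self._routes[dest]  # expired: forces rediscovery
            return None
        return next_hop

    def mark_broken(self, dest):
        """A detected path error drops the route without waiting for the timeout."""
        self._routes.pop(dest, None)
```

With `expiration_s=10` a route installed at t=0 is gone at t=10; with 20 it is still usable at t=15, so steady traffic triggers rediscovery about half as often, while broken paths are still dropped immediately.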
I11: Contention window
how long we wait to see if the airtime is being utilized
currently XO uses [7,15] window
standard values are [31, 1023]
proposal: use the standard.
one experiment: we are retrying 71% of the time. switching to standard value dropped this to 25%.
for scaling to larger numbers of contending nodes, we may need to investigate more sophisticated contention management strategies
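A simplified model of why the small window hurts. It assumes each contender picks a backoff slot uniformly from 0..CW and counts a "collision" whenever another station picks the same slot as ours; real 802.11 contention is more involved (frozen counters, carrier sense), so treat this as a sketch. Both function names are hypothetical.

```python
def collision_probability(contenders, cw):
    """Chance that at least one other station picks the same slot as ours."""
    if contenders < 1:
        raise ValueError("need at least one contender")
    slots = cw + 1  # slots are drawn uniformly from 0..cw
    return 1.0 - (1.0 - 1.0 / slots) ** (contenders - 1)


def backoff_windows(cw_min, cw_max):
    """Contention window after each successive failed attempt (doubling)."""
    windows = [cw_min]
    while windows[-1] < cw_max:
        windows.append(min(2 * (windows[-1] + 1) - 1, cw_max))
    return windows


xo_first_try = collision_probability(10, 7)         # about 0.70
standard_first_try = collision_probability(10, 31)  # about 0.25
```

For 10 contenders the model gives about 0.70 with CW=7 and about 0.25 with CW=31, the same order as the retry rates reported above, though the notes do not say how many nodes that experiment used. `backoff_windows(7, 15)` returns `[7, 15]`, so the XO window can only double once, whereas `backoff_windows(31, 1023)` has six steps to spread contenders out.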
skipping diagnose and test slide.
princeton slide: what they've got ready to go
hash cache: more efficient than squid
squid: 10% of storage required in memory for index
tcp improvements: tell tcp up front what bandwidth to expect.
applicable to single hop stuff; may be applicable to multi hop mesh (not clear)
planet lab: mechanism to deploy and manage school servers
next slide: thin firmware
it's in 2.6.27.
thin firmware enables: XO as access point
we can also then run open80211s (o11s)
slide lists the stuff which is implemented to date.
digression here about open80211s; apparently the o11s implementation adds even more management traffic to the spec
trying to allow multiple essids to share the same spectrum
802.11s targeting in-home multimedia networks
8.2 recommendations.
wireless: new driver in 2.6.25; firmware 22.p18.
collaboration: need to generate failure logs and send them to collabora