Presentations/August 2008 Networking Talk: Difference between revisions

Latest revision as of 19:56, 26 August 2008

On August 26, 2008, Ricardo Carrano presented "Recent Investigations and Future Developments in the Wireless Front".

Slides are here.

Notes

ricardo's talk on networking

quasi-transcription notes by cscott

implementation suggestions

I1: detect and adapt

a user mode daemon that estimates network environment

sparse vs dense, etc

infra vs mesh

xo vs active antenna

and tweaks various network parameters to match

long digression as we fail to agree what a "dense mesh" means

i contend the measure should be local, others argue for measures based on total numbers of connected machines, etc

another parameter for estimation: overall noise level -- quiet, or noisy like 1cc?

"Mesh Adaptation Daemon"

next slide. parameters to be measured by the MAD:

idle denseness / active denseness / congestion

mobility / link quality

ac powered / battery powered / low battery

we forward packets for the mesh only if we have sufficient battery, eg.

Density vs Multicast Rate: increase speed (which also increases error rate) as density increases.

AC powered: if AC powered, we can also assume mobility is low, which then means we can increase route expiration time and rreq_delay.

we can also use path errors, and denseness/congestion status (as well as power status) to estimate mobility

back to density vs multicast rate. increased speed also decreases reception distance, so increase speed only if we think we're dense enough

power vs metrics: if a node runs on battery, we should advertise worse metrics, so that it is not preferred for routes.
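
A rough sketch of the kind of decision table the MAD could apply, using only the rules noted above. All class, field and function names are hypothetical, and doubling the route expiration time for low-mobility nodes is an assumption for illustration, not something the talk specified.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Estimates the daemon would gather (hypothetical field names)."""
    dense: bool        # local density estimate
    ac_powered: bool
    low_battery: bool


@dataclass
class Settings:
    """Knobs mentioned in the notes; values are placeholders."""
    raise_multicast_rate: bool
    route_expiration_s: int
    forward_for_mesh: bool
    advertise_worse_metric: bool


def choose_settings(env: Environment, base_expiration_s: int = 10) -> Settings:
    # AC power is taken as a hint that the node is not moving,
    # so routes can be kept for longer.
    low_mobility = env.ac_powered
    return Settings(
        # A higher multicast rate shortens reception distance, so only
        # raise it when we think neighbours are close together.
        raise_multicast_rate=env.dense,
        # Assumption: "longer" is modelled as doubling the base value.
        route_expiration_s=base_expiration_s * 2 if low_mobility else base_expiration_s,
        # Only forward packets for the mesh with sufficient battery.
        forward_for_mesh=not env.low_battery,
        # Battery-powered nodes advertise worse metrics so routes avoid them.
        advertise_worse_metric=not env.ac_powered,
    )


plugged_in = choose_settings(Environment(dense=True, ac_powered=True, low_battery=False))
on_battery = choose_settings(Environment(dense=False, ac_powered=False, low_battery=True))
```

A real daemon would also need the congestion, link quality and path error inputs listed above; they are left out here to keep the sketch short.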

OK, moving on to implementation point 2

I2: Management traffic

we're talking about beacons, probe request/response, etc.

reduce the amount of traffic we generate here

graph of beacon frequency vs number of nodes

for # of nodes from 1-10

1 XO: 9Hz beacon

10 XOs: 12 Hz

anecdotally: 50 XOs: <20 Hz.

but still: 1Hz would be enough. That would save 1% of airtime.
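
The 1% figure can be sanity-checked with a small calculation. The per-beacon airtime of about 1 ms is an assumption (it was not given in the talk), and the 12 Hz value is the aggregate rate from the 10 XO measurement above.

```python
def beacon_airtime_fraction(beacons_per_second: float, beacon_airtime_s: float) -> float:
    """Fraction of each second that the channel spends carrying beacons."""
    return beacons_per_second * beacon_airtime_s


# Assumption: roughly 1 ms on air per beacon (not a figure from the talk).
ASSUMED_BEACON_AIRTIME_S = 0.001

before = beacon_airtime_fraction(12, ASSUMED_BEACON_AIRTIME_S)  # 10 XOs as measured
after = beacon_airtime_fraction(1, ASSUMED_BEACON_AIRTIME_S)    # proposed 1 Hz
saved = before - after
```

Under that assumption the saving comes out at about one percent of airtime, in line with the note above.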

next slide: probe storms.

one XO sends a probe, everyone tries to respond at once, fails, and then we start trying to retry, etc.

the slide shows only 10 laptops, saturating the network during one of these storms.

proposal: only retry twice, not 9 times.

this should improve us from 20 to 25 laptops, roughly. not a huge improvement, but worthwhile.

michalis: the source of these probe storms is network manager scans; NM scans every 2 seconds

if not associated, then 2 seconds, if associated then it backs off to 2 minutes or so.

unless you are in ad hoc mode, you can get away with a totally passive scan; we should do this.

proposal: we should be doing a passive scan.

we have a switch to do a partly-passive scan: we send out the probes, but we disable the responses from the XO. this isn't turned on by default.

what we *do* currently is reduce the number of retries from 10 (the default) to 2.
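
To see why the retry limit matters, here is a worst-case count of probe response transmissions, assuming every responder collides and uses up all of its retries. The function name is made up for illustration.

```python
def worst_case_responses(responders: int, retries: int) -> int:
    """Upper bound on probe responses sent for a single probe request.

    Assumes every responder fails each attempt and exhausts its retries,
    so each one transmits once plus `retries` more times.
    """
    return responders * (1 + retries)


with_nine_retries = worst_case_responses(10, 9)
with_two_retries = worst_case_responses(10, 2)
```

This is only an upper bound; real storms depend on how many responses actually collide.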

Implementation Proposal 3

I3: Rate Adaptation Logic

XO can transmit frames at many data rates; we should use the highest we can get away with

the higher the rate, the less airtime it consumes (but the higher the probability of corruption)

Marvell's firmware uses ARF, the first algorithm created to do rate adaptation.

we try to broadcast at highest rate. if it fails three times, fail down to next lower rate, repeat.

if we are successful 10 times, then try to increase the rate by one step.

main issue: no distinction between failures due to noise and those due to congestion.

so in a congestion environment, we fail and thus lower the rate, which makes things worse: now even more congestion!

so more transmissions fail, and we lower the rate even further, etc.
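
A minimal sketch of the ARF behaviour as described in these notes: step down after three failures, step up after ten successes. This is not Marvell's firmware code. It assumes the counts are consecutive, and the rate ladder is just an illustrative subset of rates mentioned in these notes.

```python
class Arf:
    """Toy model of ARF-style rate adaptation (hypothetical class)."""

    def __init__(self, rates_mbps, down_after=3, up_after=10):
        self.rates = sorted(rates_mbps)
        self.index = len(self.rates) - 1  # start at the highest rate
        self.down_after = down_after
        self.up_after = up_after
        self.failures = 0
        self.successes = 0

    @property
    def rate(self):
        return self.rates[self.index]

    def report(self, success: bool) -> None:
        """Feed in the outcome of one transmission attempt."""
        if success:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.up_after:
                self.successes = 0
                self.index = min(self.index + 1, len(self.rates) - 1)
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.down_after:
                self.failures = 0
                self.index = max(self.index - 1, 0)


arf = Arf([1, 11, 36, 54])
for _ in range(9):        # nine losses in a row, whatever their cause
    arf.report(False)
rate_after_losses = arf.rate
```

Nothing in `report` knows why a frame was lost, so nine collisions in a busy mesh push the sender to the slowest rate just as nine noise-corrupted frames would.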

this is mesh mode only; in infrastructure mode the AP mediates the rate adaptation algorithm.

in this next slide, CBR = constant bitrate. We're adding a steady stream of 1500 byte frames at 50ms intervals.

This problem can't be fixed in the current generation of the marvell chipset, due to memory limitations

the workaround for the current generation is MAD (again)

we estimate congestion, and determine when the rate adaptation algorithm is just making things worse, and set the hardware to forbid rates below (say) 22Mb.

this prevents us from falling all the way down to 1Mb and making the congestion 50x worse.
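
A quick check of the scale involved, counting payload bits only (preamble, headers and ACKs are ignored, so real ratios will be somewhat smaller). The helper names are hypothetical; the 1500 byte frame size is the one used in the CBR test above.

```python
def payload_airtime_ms(frame_bytes: int, rate_mbps: float) -> float:
    """Time on air for the payload bits only, in milliseconds."""
    return frame_bytes * 8 / (rate_mbps * 1000)


def apply_rate_floor(chosen_mbps: float, floor_mbps: float) -> float:
    """Hypothetical helper: never transmit below the floor set by the daemon."""
    return max(chosen_mbps, floor_mbps)


slow = payload_airtime_ms(1500, 1)    # 12.0 ms at 1 Mb/s
fast = payload_airtime_ms(1500, 54)
ratio = slow / fast                   # about 54 for payload bits alone
floored = apply_rate_floor(1, 22)     # rate adaptation asked for 1, floor keeps 22
```

The ratio between the slowest and fastest rate is about 54 on this simplified model, the same ballpark as the "50x" above.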

Implementation note 6

I6: Metrics

Costs associated with probe requests at various bit rates.

Currently 54Mbps=11, 36=28, 11=46, 1=64.

Proposed values: 54Mb=963, 36=1073, 11=1997, 1=12906

and for active antenna: 54Mb=962, 36=1072, 11=1996, 1=12905. (ie, one better)

this prefers routes via the active antenna

also, the difference between 11Mb and 1Mb more accurately reflects the amount of airtime taken by the lower rate.
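
One way to read the two tables is to ask how many 54Mb hops a route may contain and still beat a single 1Mb hop. The sketch below assumes a path's cost is simply the sum of its per-hop costs; the function names are made up for illustration.

```python
# Per-hop costs keyed by rate in Mb/s, copied from the notes above.
CURRENT = {54: 11, 36: 28, 11: 46, 1: 64}
PROPOSED = {54: 963, 36: 1073, 11: 1997, 1: 12906}


def path_cost(hop_rates, table):
    """Assumption: the cost of a path is the sum of its hop costs."""
    return sum(table[rate] for rate in hop_rates)


def max_fast_hops_beating_one_slow_hop(table, fast=54, slow=1):
    """Largest n such that n fast hops cost strictly less than one slow hop."""
    return (table[slow] - 1) // table[fast]


current_limit = max_fast_hops_beating_one_slow_hop(CURRENT)    # 5
proposed_limit = max_fast_hops_beating_one_slow_hop(PROPOSED)  # 13
```

With the current values a sixth fast hop already loses to one direct 1Mb link; with the proposed values a route can use up to 13 fast hops before the slow link wins.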

Better yet, use MAD to take other metrics into account, like battery and mobility.

http://wiki.laptop.org/go/Path_discovery_metric

it's not all about airtime, although airtime is important.

also: queuing frames at intermediate nodes: memory and CPU requirements of this.

we renormalize the costs so that everything is time based

factoring in the times required to queue a hop, so that they are directly comparable to the airtimes.

by biasing the active antenna slightly down, we go via the active antenna when it's convenient.

some confusion here

ricardo clarified that *path metrics* are not a reasonable means to fix congestion issues

even though other *network parameters* can be used to address congestion (like, say, beacon rate)

Implementation note 7

I7: NWB efficiency

NWB = Network Wide Broadcast

We are using a simple flood fill algorithm when we need to reach all the nodes in the mesh.

we can't remove broadcast entirely, because some information inherently needs to reach all the nodes: presence info, and path discovery mechanism.

proposal: SBA (Scalable Broadcast Algorithm)
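
The talk did not go into how SBA works. As general background, its core idea is that a node skips the rebroadcast when every one of its neighbours has already been covered by the copies it has heard. The sketch below is a simplified version of that test only: it assumes each node knows its neighbours' neighbour lists, and it leaves out the random delay that SBA uses to collect duplicates before deciding. The function name and the topology are made up.

```python
def should_rebroadcast(me, neighbors, heard_from):
    """Decide whether `me` still needs to repeat a broadcast.

    neighbors:  dict mapping each node to the set of its neighbours
                (assumed known from two-hop hello exchanges)
    heard_from: nodes from which `me` has already heard this broadcast
    """
    uncovered = set(neighbors[me])
    for sender in heard_from:
        uncovered.discard(sender)        # the sender itself has the packet
        uncovered -= neighbors[sender]   # and so does everyone it reaches
    return bool(uncovered)


# Hypothetical topology: A, B, C hear each other; D only hears B.
topology = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

b_repeats = should_rebroadcast("B", topology, ["A"])  # True: D is still uncovered
c_repeats = should_rebroadcast("C", topology, ["A"])  # False: A already reached B
```

Under plain flooding both B and C would repeat the packet; here only B does.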

Skipping I8, I9, which are nortel recommendations.

I10: Route Expiration Time

Paths time out after X seconds.

X=10, according to ricardo.

Slide: colorful graph

10 laptops, pinging a multicast address once a second

x axis is real time, showing periodicity of the network utilization

y-axis is airtime utilization.

tradeoff between timeout and mobility

however, we redo path discovery if we see the path is broken

so this is really a path optimality tradeoff: how long do we keep using a suboptimal path which is not completely broken.

proposal: immediately double the route timeout to 20 s

iwpriv msh0 route_exp_time 20 <- something like this.

I11: Contention window

how long we wait to see if the airtime is being utilized

currently XO uses [7,15] window

standard values at [31, 1023]

proposal: use the standard.

one experiment: we are retrying 71% of the time. switching to standard value dropped this to 25%.

for scaling to larger numbers of contending nodes, we may need to investigate more sophisticated contention management strategies
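
A simplified model of why the larger window helps. It assumes every contending node draws one backoff slot uniformly from 0..cw at the same moment, and that the attempt is clean only if a single node holds the lowest slot. It also assumes the usual doubling of the window after each failed attempt. The function names are made up, and this is not a measurement of the XO.

```python
def p_clean_win(nodes: int, cw: int) -> float:
    """Probability that exactly one node picks the lowest backoff slot."""
    slots = cw + 1
    return sum(
        # some node picks slot k and every other node picks a later slot
        nodes * (1 / slots) * ((slots - 1 - k) / slots) ** (nodes - 1)
        for k in range(slots)
    )


def backoff_windows(cw_min: int, cw_max: int) -> list:
    """Window sizes used on successive retries, doubling up to cw_max."""
    cw = cw_min
    windows = [cw]
    while cw < cw_max:
        cw = min(2 * (cw + 1) - 1, cw_max)
        windows.append(cw)
    return windows


xo_first_try = p_clean_win(10, 7)          # current XO minimum window
standard_first_try = p_clean_win(10, 31)   # standard minimum window
xo_ladder = backoff_windows(7, 15)             # [7, 15]
standard_ladder = backoff_windows(31, 1023)    # [31, 63, 127, 255, 511, 1023]
```

With ten contenders the first attempt is noticeably more likely to be collision-free with the standard window, and the standard range leaves room for five doublings where the XO range allows only one.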

skipping diagnose and test slide.

princeton slide: what they've got ready to go

hash cache: more efficient than squid

squid: 10% of storage required in memory for index

tcp improvements: tell tcp up front what bandwidth to expect.

applicable to single hop stuff; may be applicable to multi hop mesh (not clear)

planet lab: mechanism to deploy and manage school servers

next slide: thin firmware

it's in 2.6.27.

thin firmware enables: XO as access point

we can also then run open80211s (o11s)

slide lists the stuff which is implemented to date.

digression here about open80211s; apparently the o11s implementation adds even more management traffic to the spec

trying to allow multiple essids to share the same spectrum

802.11s targeting in-home multimedia networks

8.2 recommendations.

wireless: new driver in 2.6.25; firmware 22.p18.

collaboration: need to generate failure logs and send them to collabora
