Presentations/August 2008 Networking Talk
Latest revision as of 19:56, 26 August 2008
On August 26, 2008, Ricardo Carrano presented "Recent Investigations and Future Developments in the Wireless Front".
Slides are here.
Notes
- Ricardo's talk on networking
- quasi-transcription notes by cscott
- implementation suggestions
- I1: detect and adapt
- a user-mode daemon that estimates the network environment
- sparse vs. dense, etc.
- infrastructure vs. mesh
- XO vs. active antenna
- and tweaks various network parameters to match
- long digression as we failed to agree on what a "dense mesh" means
- I contend the measure should be local; others argue for measures based on the total number of connected machines, etc.
- another parameter to estimate: the overall noise level (quiet, or noisy like 1cc?)
- "Mesh Adaptation Daemon"
- Next slide: parameters to be measured by the MAD:
- idle denseness / active denseness / congestion
- mobility / link quality
- AC powered / battery powered / low battery
- e.g., we forward packets for the mesh only if we have sufficient battery.
- Density vs. multicast rate: increase the multicast rate (which also increases the error rate) as density increases.
- AC powered: if a node is AC powered, we can also assume its mobility is low, which means we can increase the route expiration time and rreq_delay.
- we can also use path errors and denseness/congestion status (as well as power status) to estimate mobility.
- Back to density vs. multicast rate: a higher rate also decreases reception distance, so increase the rate only if we think the mesh is dense enough.
- power vs metrics: if a node runs on battery, we should advertise worse metrics, so that it is not preferred for routes.
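A minimal sketch of the MAD's decision step, to make the I1 adjustments concrete. Everything here is hypothetical: the Environment and Settings types, using the local neighbor count as the density proxy, the threshold of 8 neighbors, the 11 and 2 Mb/s multicast rates, and the metric penalty of 100. Only the direction of each adjustment comes from the notes (the 10 s and 20 s route expiration times are the current and proposed values from I10). Measuring the inputs, which is the hard part, is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Power(Enum):
    AC = "ac"
    BATTERY = "battery"
    LOW_BATTERY = "low_battery"


@dataclass
class Environment:
    neighbors: int  # locally observed neighbor count, used as a (hypothetical) density proxy
    power: Power


@dataclass
class Settings:
    multicast_rate_mbps: float
    route_exp_time_s: int
    forward_mesh_traffic: bool
    metric_penalty: int


# Hypothetical threshold: the talk did not settle on what "dense" means.
DENSE_NEIGHBORS = 8


def choose_settings(env: Environment) -> Settings:
    """Map an estimated environment to network parameters (illustrative policy only)."""
    # Higher multicast rates shorten reception range, so only speed up when dense.
    rate = 11.0 if env.neighbors >= DENSE_NEIGHBORS else 2.0
    # AC power suggests a stationary node, so routes can be kept longer.
    exp_time = 20 if env.power is Power.AC else 10
    # Stop relaying mesh traffic when the battery is low.
    forward = env.power is not Power.LOW_BATTERY
    # Battery-powered nodes advertise worse metrics so routes tend to avoid them.
    penalty = 0 if env.power is Power.AC else 100
    return Settings(rate, exp_time, forward, penalty)


print(choose_settings(Environment(neighbors=12, power=Power.AC)))
```

Keeping the policy a pure function of the estimate makes it easy to test separately from the (much harder) measurement code.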
- OK, moving on to implementation point 2
- I2: Management traffic
- we're talking about beacons, probe request/response, etc.
- reduce the amount of traffic we generate here
- graph of beacon frequency vs. number of nodes
- for 1 to 10 nodes:
- 1 XO: 9 Hz beacon
- 10 XOs: 12 Hz
- anecdotally, 50 XOs: <20 Hz.
- but still: 1 Hz would be enough. That would save 1% of airtime.
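A back-of-the-envelope check of the beacon airtime figure. The 100-byte beacon size and the 1 Mb/s beacon rate are assumptions, not numbers from the talk; the 192 µs term is the 802.11b long preamble plus PLCP header. Contention overhead (DIFS, backoff) is ignored, so this underestimates the real cost.

```python
# 802.11b long preamble plus PLCP header, in microseconds.
PLCP_LONG_US = 192


def frame_airtime_us(frame_bytes: int, rate_mbps: float) -> float:
    """Airtime of one frame in microseconds: PLCP overhead + payload bits / rate."""
    return PLCP_LONG_US + frame_bytes * 8 / rate_mbps


def beacon_airtime_fraction(beacons_per_s: float, frame_bytes: int,
                            rate_mbps: float = 1.0) -> float:
    """Fraction of each second spent sending beacons (ignores DIFS and backoff)."""
    return beacons_per_s * frame_airtime_us(frame_bytes, rate_mbps) / 1e6


# Assumed 100-byte beacon at 1 Mb/s: compare 12 Hz (10 XOs) against 1 Hz.
saving = beacon_airtime_fraction(12, 100) - beacon_airtime_fraction(1, 100)
print(f"airtime saved: {saving:.2%}")
```

Under these assumptions the saving comes out at roughly 1% of airtime, in line with the figure quoted on the slide; a larger beacon (mesh beacons carry extra elements) would make the saving proportionally bigger.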
- next slide: probe storms.
- one XO sends a probe, everyone tries to respond at once, fails, and then we start trying to retry, etc.
- the slide shows just 10 laptops saturating the network during one of these storms.
- proposal: only retry twice, not 9 times.
- this should raise our limit from roughly 20 to 25 laptops; not a huge improvement, but worthwhile.
- Michalis: the source of these probe storms is NetworkManager scans; NM scans every 2 seconds
- if not associated, then every 2 seconds; if associated, it backs off to 2 minutes or so.
- unless you are in ad hoc mode, you can get away with a totally passive scan; we should do this.
- proposal: we should be doing a passive scan.
- we have a switch to do a partly passive scan: we send out the probes, but we disable the responses from the XO. This isn't turned on by default.
- what we *do* currently is reduce the number of retries from 10 (the default) to 2.
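A crude worst-case count of probe-storm traffic, to show why both the retry count and the scan interval matter. It assumes each scan sends a single probe request on the mesh channel and that every response attempt collides; real scans probe several channels and some responses get through, so treat the numbers as an upper bound for comparing settings rather than a prediction. The retry counts (9 vs. 2) and scan intervals (2 s vs. about 2 min) come from the notes; the function names are hypothetical.

```python
def worst_case_responses(nodes: int, retries: int) -> int:
    """Upper bound on probe-response transmissions triggered by one probe request.

    Assumes every other node answers and every attempt collides, so each
    responder sends its initial response plus all of its retries.
    """
    responders = nodes - 1
    return responders * (1 + retries)


def storm_frames_per_minute(nodes: int, retries: int, scan_interval_s: float) -> float:
    """Worst-case response frames per minute if every node probes once per scan."""
    scans_per_minute = 60 / scan_interval_s
    return nodes * scans_per_minute * worst_case_responses(nodes, retries)


for retries in (9, 2):
    for interval in (2, 120):  # unassociated vs. associated NM scan interval
        print(retries, interval, storm_frames_per_minute(10, retries, interval))
```

The count grows with the square of the number of nodes (every node probes, every other node answers), which is why storms appear even at modest laptop counts.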
- Implementation Proposal 3
- I3: Rate Adaptation Logic
- the XO can transmit frames at many data rates; we should use the highest one we can get away with
- the higher the rate, the less airtime a frame consumes (but the higher the probability of corruption)
- Marvell's firmware uses ARF, the first algorithm created for rate adaptation.
- we try to transmit at the highest rate; if that fails three times, fall back to the next lower rate, and repeat.
- if we succeed 10 times, try to increase the rate by one step.
- main issue: there is no distinction between failures due to noise and those due to congestion.
- so in a congested environment, we fail and thus lower the rate, which makes things worse: frames at a lower rate occupy the medium longer, so now there is even more congestion!
- so more transmissions fail, and we lower the rate even further, etc.
- this is mesh mode only; in infrastructure mode the AP mediates the rate adaptation algorithm.
- in the next slide, CBR = constant bit rate: we add a steady stream of 1500-byte frames at 50 ms intervals.
- this problem can't be fixed in the current generation of the Marvell chipset, due to memory limitations.
- the workaround for the current generation is MAD (again):
- we estimate congestion, determine when the rate adaptation algorithm is just making things worse, and set the hardware to forbid rates below (say) 22 Mb.
- this prevents us from falling all the way down to 1 Mb and making the congestion 50x worse.
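A sketch of ARF as described above (three failures step the rate down, ten successes step it up, counting consecutive results), plus the MAD workaround of a rate floor. The rate ladder is illustrative, not the Marvell firmware's actual table, and `min_rate` is a hypothetical parameter standing in for whatever knob the hardware exposes; it must be one of the listed rates.

```python
# Illustrative rate ladder in Mb/s (not the firmware's real table).
RATES_MBPS = [1, 2, 5.5, 11, 22, 36, 54]


class Arf:
    """Auto Rate Fallback with an optional floor set by a MAD-like controller."""

    def __init__(self, rates, fail_limit=3, success_limit=10, min_rate=None):
        self.rates = sorted(rates)
        self.fail_limit = fail_limit
        self.success_limit = success_limit
        # Index of the lowest rate we may fall back to (hypothetical floor).
        self.floor = 0 if min_rate is None else self.rates.index(min_rate)
        self.idx = len(self.rates) - 1  # start at the highest rate
        self.fails = 0
        self.successes = 0

    @property
    def rate(self):
        return self.rates[self.idx]

    def on_result(self, ok: bool) -> None:
        """Update counters after one transmission attempt at the current rate."""
        if ok:
            self.fails = 0
            self.successes += 1
            if self.successes >= self.success_limit:
                self.successes = 0
                if self.idx < len(self.rates) - 1:
                    self.idx += 1  # probe one step up
        else:
            self.successes = 0
            self.fails += 1
            if self.fails >= self.fail_limit:
                self.fails = 0
                # ARF cannot tell collisions from noise, so it always steps down,
                # but never below the floor.
                self.idx = max(self.idx - 1, self.floor)


plain, floored = Arf(RATES_MBPS), Arf(RATES_MBPS, min_rate=22)
for _ in range(30):  # a burst of collisions under congestion
    plain.on_result(False)
    floored.on_result(False)
print(plain.rate, floored.rate)
```

Under a run of congestion-induced failures the plain version walks all the way down the ladder, while the floored version stops at 22 Mb/s; both still climb back up after ten consecutive successes.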
- Implementation note 6
- I6: Metrics
- Costs associated with probe requests at various bit rates.
- Currently 54Mbps=11, 36=28, 11=46, 1=64.
- Proposed values: 54Mb=963, 36=1073, 11=1997, 1=12906
- and for active antenna: 54Mb=962, 36=1072, 11=1996, 1=12905. (ie, one better)
- this prefers routes via the active antenna
- also, the difference between 11Mb and 1Mb more accurately reflects the amount of airtime taken by the lower rate.
- Better yet, use MAD to take other metrics into account, like battery and mobility.
- http://wiki.laptop.org/go/Path_discovery_metric
- it's not all about airtime, although airtime is important.
- also: queuing frames at intermediate nodes, and the memory and CPU requirements of this.
- we renormalize the costs so that everything is time based
- factoring in the times required to queue a hop, so that they are directly comparable to the airtimes.
- by biasing the active antenna slightly down, we go via the active antenna when it's convenient.
- some confusion here
- ricardo clarified that *path metrics* are not a reasonable means to fix congestion issues
- even though other *network parameters* can be used to address congestion (like, say, beacon rate)
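A sketch of how the two cost tables rank candidate paths, summing per-hop costs and picking the minimum. The per-rate values are copied from the notes; applying the active antenna's one-point discount to each hop the antenna transmits, and the two example topologies, are assumptions for illustration.

```python
CURRENT = {54: 11, 36: 28, 11: 46, 1: 64}
PROPOSED = {54: 963, 36: 1073, 11: 1997, 1: 12906}

Hop = tuple[int, bool]  # (rate in Mb/s, transmitted by an active antenna?)


def link_cost(table: dict[int, int], rate: int, from_antenna: bool) -> int:
    # Active antennas advertise a cost one lower, so they win ties.
    return table[rate] - (1 if from_antenna else 0)


def path_cost(table: dict[int, int], hops: list[Hop]) -> int:
    return sum(link_cost(table, rate, ant) for rate, ant in hops)


def best_path(table: dict[int, int], paths: dict[str, list[Hop]]) -> str:
    return min(paths, key=lambda name: path_cost(table, paths[name]))


rate_choice = {"direct_1mb": [(1, False)], "two_hops_11mb": [(11, False), (11, False)]}
print(best_path(CURRENT, rate_choice))   # direct_1mb (64 < 92)
print(best_path(PROPOSED, rate_choice))  # two_hops_11mb (3994 < 12906)

relay_choice = {"via_xo": [(54, False), (54, False)], "via_antenna": [(54, True), (54, False)]}
print(best_path(PROPOSED, relay_choice))  # via_antenna (1925 < 1926)
```

With the current table a single slow 1 Mb hop beats two 11 Mb hops even though it takes far more airtime; the proposed table reverses that, and the one-point discount only matters as a tie-breaker between otherwise equal paths.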
- Implementation note 7
- I7: NWB efficiency
- NWB = Network Wide Broadcast
- We are using a simple flood fill algorithm when we need to reach all the nodes in the mesh.
- we can't remove broadcast entirely, because some information inherently needs to reach all the nodes: presence info, and path discovery mechanism.
- proposal: SBA (Scalable Broadcast Algorithm)
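A simplified sketch of the neighbor-coverage idea behind SBA, compared with plain flooding. A node that hears a broadcast rebroadcasts only if it has a neighbor that the sender it heard could not have reached. Real SBA also needs two-hop neighbor knowledge exchanged in hello messages and a random backoff during which duplicates from further neighbors shrink the uncovered set; this version decides immediately using only the first sender, which still reaches every connected node but saves fewer transmissions. The graph representation and function names are hypothetical.

```python
from collections import deque


def should_rebroadcast(node, heard_from, neighbors):
    """Rebroadcast only if some neighbor of `node` is not covered by a sender we heard."""
    covered = set()
    for sender in heard_from:
        covered |= neighbors[sender] | {sender}
    return bool(neighbors[node] - covered - {node})


def simulate(neighbors, source, use_sba=True):
    """Return (transmissions, nodes reached) for one network-wide broadcast."""
    received = {source}
    transmissions = 0
    queue = deque([source])
    while queue:
        tx = queue.popleft()
        transmissions += 1
        for n in neighbors[tx]:
            if n in received:
                continue  # duplicates are dropped
            received.add(n)
            if not use_sba or should_rebroadcast(n, [tx], neighbors):
                queue.append(n)
    return transmissions, received


# Symmetric adjacency sets: a 5-node clique (dense) and a 4-node chain (sparse).
clique = {i: {j for j in range(5) if j != i} for i in range(5)}
line = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(simulate(clique, 0, use_sba=False)[0], simulate(clique, 0)[0])
print(simulate(line, 0, use_sba=False)[0], simulate(line, 0)[0])
```

The savings are largest in dense meshes, where most nodes' neighbors already heard the original transmission; in a sparse chain nearly every node still has to relay.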
- Skipping I8 and I9, which are Nortel recommendations.
- I10: Route Expiration Time
- Paths time out after X seconds.
- X = 10, according to Ricardo.
- Slide: colorful graph
- 10 laptops, pinging a multicast address once a second
- x-axis is real time, showing the periodicity of the network utilization
- y-axis is airtime utilization.
- tradeoff between timeout and mobility
- however, we redo path discovery if we see the path is broken
- so this is really a path optimality tradeoff: how long do we keep using a suboptimal path that is not completely broken?
- proposal: immediately double the route timeout to 20 s
- iwpriv msh0 route_exp_time 20 <- something like this.
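A minimal sketch of a route cache with an expiration time, to make the I10 tradeoff concrete: an entry is dropped either when it ages past `exp_time_s` or immediately on a path error, and a miss means path discovery must run again. The class and its methods are hypothetical (not the firmware's data structure); the 20 s default is the proposed value. For a long-lived flow whose path never breaks, the timeout alone forces one rediscovery every `exp_time_s` seconds, so doubling it from 10 s to 20 s halves those periodic path-discovery floods (360 vs. 180 per hour).

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RouteTable:
    exp_time_s: float = 20.0
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    _routes: dict = field(default_factory=dict, init=False)

    def add(self, dest: str, next_hop: str) -> None:
        self._routes[dest] = (next_hop, self.clock())

    def lookup(self, dest: str) -> Optional[str]:
        """Next hop for dest, or None if path discovery is needed."""
        entry = self._routes.get(dest)
        if entry is None:
            return None
        next_hop, installed_at = entry
        if self.clock() - installed_at >= self.exp_time_s:
            del self._routes[dest]  # expired: even a working path is rediscovered
            return None
        return next_hop

    def path_error(self, dest: str) -> None:
        # A broken path is dropped right away, regardless of the timeout.
        self._routes.pop(dest, None)


def discoveries_per_hour(exp_time_s: float) -> float:
    """Periodic rediscoveries for one unbroken, long-lived route."""
    return 3600 / exp_time_s
```

Because broken paths are handled by `path_error` anyway, a longer timeout mainly costs optimality (sticking with a path that still works but is no longer the best), not reachability.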
- I11: Contention window
- how long we wait to see whether the medium is in use: each node draws a random backoff from this window before transmitting
- currently the XO uses a [7, 15] window
- the standard values are [31, 1023]
- proposal: use the standard.
- in one experiment, we were retrying 71% of the time; switching to the standard values dropped this to 25%.
- for scaling to larger numbers of contending nodes, we may need to investigate more sophisticated contention management strategies
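A sketch comparing the two windows under a simplified slotted model: stations draw a backoff uniformly from [0, CW], and the first transmission collides when the smallest draw is shared by two or more stations. The model ignores carrier-sense timing, hidden nodes, and traffic load, and the growth rule (CW doubles plus one after each failure, capped at CWmax) is standard 802.11 DCF behavior rather than anything measured on the XO. The function names are hypothetical.

```python
def cw_sequence(cw_min: int, cw_max: int, attempts: int) -> list[int]:
    """Contention window used on each successive attempt (binary exponential backoff)."""
    seq, cw = [], cw_min
    for _ in range(attempts):
        seq.append(cw)
        cw = min(2 * cw + 1, cw_max)
    return seq


def first_attempt_collision_prob(n: int, cw: int) -> float:
    """P(smallest backoff among n stations is not unique), draws uniform on [0, cw]."""
    w = cw + 1
    # Sum over slots s: one station picks s, all others pick a later slot.
    p_unique_min = sum(n * (1 / w) * ((w - 1 - s) / w) ** (n - 1) for s in range(w))
    return 1 - p_unique_min


print(cw_sequence(7, 15, 4), cw_sequence(31, 1023, 7))
for n in (2, 10):
    print(n, first_attempt_collision_prob(n, 7), first_attempt_collision_prob(n, 31))
```

With [7, 15] the window can only double once, so under heavy contention stations keep colliding; the standard range starts wider and can back off much further, at the cost of more idle slots when few nodes are contending.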
- skipping diagnose and test slide.
- Princeton slide: what they've got ready to go
- HashCache: much more efficient than Squid
- Squid: needs an in-memory index amounting to 10% of storage
- TCP improvements: tell TCP up front what bandwidth to expect.
- applicable to single-hop links; may be applicable to the multi-hop mesh (not clear)
- PlanetLab: mechanism to deploy and manage school servers
- next slide: thin firmware
- it's in 2.6.27.
- thin firmware enables the XO to act as an access point
- we can also then run open80211s (o11s)
- the slide lists what has been implemented to date.
- digression here about open80211s; apparently the o11s implementation adds even more management traffic on top of what the spec requires
- trying to allow multiple ESSIDs to share the same spectrum
- 802.11s is targeting in-home multimedia networks
- 8.2 recommendations:
- wireless: new driver in 2.6.25; firmware 22.p18.
- collaboration: need to generate failure logs and send them to Collabora