Path Discovery Mechanism/Sanity

From OLPC
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

This page is dedicated to the following investigation: Does the Path Discovery Mechanism (PDM) behaves as expected in terms of the path length?

There's been observations that the implemented PDM may wrongly result in unnecessarily long path lengths. The canonical example is the ultra-dense mesh cloud of a classroom, where all the nodes are within the transmission range of each other. In such a scenario, does the PDM form multihop paths too often? How often?

And if so, can we expect that all the paths will be single hop in this scenario? Should we try to fix this or reduce this and how could we do it?

In trying to answer the questions above, others will came. For instance: how data traffic and PDM traffic influence each other. Will the burstiness of PDM's traffic significantly degrade data throughput? And will data traffic prevent path formation or induce unnecessarily long paths?

This page is dedicated to the formulation of this questions and, hopefully, to some of the answers.

The inpatient might prefer to skip to the Summary, by the end of the page.


Question #1 - Is it true that multihop paths form even if all the nodes are close to each other?

To answer this initial question, we'll use a Ten Kids Under the Tree testbed in which ten XOs are placed in a very dense configuration (less than 2 square-meters area) and every node are within transmission range of each other.

In such a situation one would expect the paths to be mostly, if not all, single-hop paths, because a successful transmission at 54Mbps between any two nodes is theoretically always possible. A path with two or mode hops will have to be explained and, the longer the path, the more under optimal and suspicious the results would be.

The test

The tests were performed in a quiet environment (noise floor < -96dbm), and build 703 (Update.1-703) with libertas firmware 22.p10 was used. In order to use such firmware release, a special test kernel is necessary. Also, the following configurations were applied:

  • mesh ttl was set to 5
  • multicast pings were enabled

The workload used for this particular test were multicast pings (addressed to 224.0.0.1, default size, 1Hz). The test was performed three times, (1 minute for each test).

What are we measuring ?

One possibility would be just log the forwarding table of each node during the test and then check the path lengths. That's a good test. But this method was designed to be non-intrusive and, because of that, applicable to any traffic captures available.

So, to have a taste (or measure, if you will) of how (in)sane the PDM is in terms of path length, we'll analyse traffic captures and calculate a simple index for every path - we'll call this index the Data Path Length Index, or DPLI.

The DPLI is calculated over the average hop count for each data frame exchanged between two nodes (see example in table 1). Of course this index is not strictly necessary (the table itself brings interesting information) and of course it has no meaning by itself. The DLPI is not an actual path length, but an indicator of the actual path lengths that formed during the test. Apart from confirming that multihop frames are being formed, it gives us some baselines for future comparisons and make it easier to analyse overall results.

Table 1, shows the DPLIs for a node (B) in a real test. We can see that the paths to C, J and H are pure single hop paths. But for all other neighbours a longer path did exist at some point. One should keep in mind that the paths are constantly rediscovered because of (1) the expiration (which defaults to 10 seconds) of a path or (2) of a failure to forward traffic over the path. We can also clearly note that the path from B to D was much longer than any other path in this test, and this may be worth investigating.


Table 1 Node B's DPLIs for each neighbour
neigh 1hop 2hops 3hops 4hops 5hops DPLI
G 90.0% 10.0% 0.0% 0.0% 0.0% 1.05
A 82.1% 17.9% 0.0% 0.0% 0.0% 1.09
J 100.0% 0.0% 0.0% 0.0% 0.0% 1.00
D 59.0% 24.0% 4.0% 6.0% 7.0% 1.33
F 87.1% 8.6% 4.3% 0.0% 0.0% 1.08
H 100.0% 0.0% 0.0% 0.0% 0.0% 1.00
E 89.6% 10.4% 0.0% 0.0% 0.0% 1.05
I 86.5% 13.5% 0.0% 0.0% 0.0% 1.07
C 100.0% 0.0% 0.0% 0.0% 0.0% 1.00


Test results

Table 2, consolidates the results for the 3 runs of the multicast ping test described above. It clearly indicates that multihop paths do form in the Ten Kids Under the Tree scenario.

Table 2 Consolidated result for the ten kids under the tree scenario
node DPLI stdev
A 1.07 0.08
C 1.05 0.07
B 1.07 0.08
E 1.04 0.06
D 1.06 0.07
G 1.05 0.07
F 1.05 0.07
I 1.04 0.05
H 1.06 0.07
J 1.06 0.08

Global result: 1.05 (stdev = 0.01)

The answer

Yes. It is true that multihop paths form even in a Kids Under the Tree scenario. Exactly 9.99% of the data frames captured in this test were transmitted over a multihop path. Is it excessive? What are the root causes for this? Keep reading ;-)

Question #2 - Does PDM traffic self-interfere?

We already confirmed that multihop paths will form in the Kids Under the Tree scenario or in a classroom, now it is time to investigate this deeper.

With 10 XOs the measured DPLI was 1.05, maybe not too high. Can it become intolerable in case more nodes are added or more traffic is generated?

Another suspicion worth investigating is that path discovery traffic can self-interfere. There is a lot of sense to this, since PDM is a bursty traffic in nature.

The test

To investigate this, we varied the route expiration time in the ten kids under the tree scenario and repeated the same test as above. We calculated the DPLI and also the percentage of data frames that were transmitted over multihop paths.

Test results

Table 3 shows what we've got. Increasing the route expiration time will cause a decrease in the path length. The percentage of frames transmitted over a multihop path drops from 9.99 to 6.24 percent if we increase the route expiration time from 10 seconds (the default) to 30 seconds. And if we fix it in 5s, this percentage grows to more than 16%.

Table 3 - The self-interference of PDM traffic
Route exp time (s) Mean DPLI Stdev DPLI Multihop frames %
5 1.09 0.02 16.58
10 1.05 0.01 9.99
15 1.05 0.01 8.72
20 1.05 0.01 9.65
25 1.04 0.01 7.44
30 1.03 0.01 6.24

It seems fair to say that increasing the number of XOs in a classroom will increase the PDM traffic and cause longer paths as a result.

The answer

Yes. The more you have path discovery traffic the longer your paths will get. But why exctly? That's question #3

Question #3 - Why more traffic will cause longer paths?

Since PDM traffic self-interferes it seems safe to assume that any traffic will potentially interfere with the path discovery traffic and, as a result, increase the average path length. But why so?

There nay be more than one answer here. Here is one:

Answer #3.1: Not all nodes decode the same frames (even in a dense scenario)

A common misunderstanding about dense wireless environments is that every node will capture the same traffic simply because they are very close and share a common transmission area. If this was the case, either a frame would be received by all the nodes or a collision would render a frame useless to all of them. But this is not true at all.

  • Obstructions and reflections and the resulting multipath effect change the probability that a frame is successfully decoded by a node. Even nodes some centimetres apart will experience different signal levels.
  • The internal condition of the node (from the signal levels to the protocol stack) may cause a frame that was successfully decoded by a nearby neighbour to to be lost.
  • Slight differences in the antenna gain and in the radio sensitivity between two neighbours (to name only two items) are normal and may also explain some differences.
  • And, more interestingly, even if none of the above held true, and we had this perfect scenario (no obstructions or reflections, no differences in the devices or in the internal states), two nodes would still not always successfully decode the same subset of the transmitted frames because of the capture effect capture effect (not to be confused with the channel capture effect).

The above can be easily proved. It is enough to compare traffic captures and check for differences and that's what we did in the following tests.

Test #1

In this experiment, 4 XOs (B,D,G,I) captured traffic while 2 other XOs (A,J) exchanged UDP datagrams. Each node transmitted 100 datagrams to a multicast address. So, ideally, each of the four sniffers would capture 400 UDP frames (100 from A, 100 from J, the 100 from J retransmitted by A, and the 100 from A retransmitted by J). The results are shown in Table 5.

Table 5 Number of captured frames
node fromA from J retransmitted by J retransmitted by A
A - 97 - -
B 98 100 98 97
D 98 100 98 97
G 98 98 97 97
I 98 100 97 97
J 98 - - -


Note that we are not interested in the fact that not all sniffers captured all the transmitted frames, but in the fact that nodes G and I captured less frames,than nodes B and D though they were sitting on the same table. Actually, less is not the point. The point is that they captured a different subset of the transmitted datagrams. Both G and I captured 97 frames retransmitted by J, but only 96 frames in common. A slight difference that nevertheless proves our point.

Stressing the point: we not just saying that some frames will be lost (and they will), we are actually saying that they may be lost to some of the nodes in our dense environment. It's this difference in perspective that will explain, at least in part, the multihop paths. But is this difference specific to a class of traffic? To prove that this can affect any traffic irrespectively we performed the following test.

Test #2

This time nodes A and J exchanged traffic while E and F captured the frames. Table 6 shows the difference of the traffic captured by E and F, paying special attention to PDM traffic.

Table 6 Comparison on the number captured frames
Node total captured frames from A to A PREQs from A PREQs from A (54Mbps) from J to J PREQs from J PREQs from J (54Mbps)
E 13,618 1678 469 145 29 1790 430 145 36
F 13,496 1687 466 150 32 1681 428 140 36

Table 6 shows that is possible that a PREQ be successfully decoded by a group of nodes, but not by other nodes like, for instance, the intended destination of the PREQ (we'll see why this is relevant shortly). But is this related to the modulation technique or data rate used by? Table 7 proves that this can be observed irrespective of the datarate:

Table 7 - Captured PDM
Node PREQs PREPs 54 36 11 1
E 7334 326 4525 1741 1947 2992
F 7298 328 4484 1735 1923 2978


Note that though F decoded 0.9% less frames than E that does not mean that it will perform worse in respect of every frame. Node F actually captured a little more of the PREPs transmitted at 54Mbps.

Also interesting is the fact that the frames transmitted at more the robust datarates of 1Mbps (like the PREPs in table 7) or 2Mbps (like the UDP datagrams of the test #1) are not free of this effect either. We have no reason to believe that the differences in the set of decoded frames are dependent to datarate but neither we state that they are not. This is just irrelevant for now.


So it is true that not all nodes will decode the same frames, still why this will cause multihop paths?

Let's study an example. The following scenario is possible because it is true that not all nodes will decode the same frames (even if close to each other):

Node "A" broadcasts a path request at 54Mbps to find a path to node "J". Node "J" does not decode this frame, but some other nodes around, say node "C", do. Some milliseconds before that, node "C" will rebroadcast the PREQ from "A", and this time, "J" may listen and respond with a "PREP". The result is a two hops path, where a single hop would be preferred (in principle).

Note that it doesn't make any difference if B receives or not any of the other PREQs sent from A (at the datarates of 36, 11 or 1Mbps). And this is where the metrics calculation enters stage - a two 54Mbps PREQ (second line of table 4) will beat one 36Mbps (third line of table 8).

Table 8 - Possible combinations of link costs for mesh ttl5 - 12 lower costs
total cost datarates for each link
13 54
26 54,54
28 36
39 54,54,54
41 54,36
46 11
52 54,54,54,54
54 54,54,36
56 36,36
59 54,11
64 1
65 54,54,54,54,54


In short

The fact that not all nodes decode the same subset of frames is one cause for the multihop paths observed in the ten kids under the tree (or the classroom) scenario. We don't have a prove (yet?) that this is the only possible cause, but we already know that this is a cause.

Question #4 - Can we fix or reduce this (supposedly unnecessary multihop paths) by adjusting metric costs?

In answer #3.1 we pointed out two causes (acting together) for the creation of the multihop paths in the very dense scenario of a classroom: The fact that nodes have a different perspective of the traffic and the way paths are selected in the PDM. Could we tune our metrics differently so to remove multihop paths?

Before we try to answer this, we should acknowledge that even if we used hop count as a metric the fact that nodes do not decode the same subset of the transmitted frames is in itself a cause for multihop paths. In fact, this is the very reason that we need multihop paths. This is a feature, not a bug!

So, no: changing metrics will never completely eliminate the issue. The question is: how we can try to reduce its probability (and can we afford to do it?)

If we argue that in a ultra dense mesh cloud mutihops paths must be avoided at any cost, that's a completely different issue. That's the equivalent of proposing that ad-hoc links be used, instead of a true manet. But we can argue that in a classroom multihop links are something unnecessary and it would be good to reduce them (since removing is not viable) without killing the mesh. How can we do that?

We may have many ways to achieve this, based on the adjustments of metrics (the present questions) or based on any other technique (that's question #5).

If fixing is not possible, how about reducing?

Let's give a more detailed look at the scenario proposed in answer #3.1. In that example, a two multihop path formed because:

  • The target "J" did not decode the 54Mbps PREQ (with cost 13) from "A"
  • "A" won't retransmit a 54Mbps PREQ, unless the whole PDM cycle fails. Instead it will send, a cluster of PREQs back-to-back. (Consisting of three more PREQs, at 36, 11 and 1Mbps).
  • "C", an intermediary node, decoded this original PREQ from "A" and retransmitted it in a new cluster (the 54Mbps PREQ on this new cluster will have cost of 26).
  • Though "J" may have decoded the 36Mbps PREQ from "A", its associated cost is 28 so, when it receives the PREQ from "C" with the cost 26, the two hops path (via "C") will be selected.

If the cost associated to the 36Mbps PREQ sent from "A" and received by "J" was smaller than 26, that particular multihop path wouldn't form.

So, it is possible to adjust metrics to avoid this. Actually we did not give an example for the selection of longer than two hops paths. And if we did, there is a chance that we could change the metrics somehow to fix this.

This approach clearly does not help us now. It was good to prove a point, but to understand how can we change cost in order to reduce the multihop paths requires another, more generic, method. Fortunately it is not hard to propose one.

It seems that what we'd really like in a dense scenario is a hop count metrics

In the end, if we expect nodes to be able to communicate directly, and we consider that multihop paths in a classroom are fundamentally bad, what we want is to switch the metrics to a hop count. Note that this is not the same as switching to mesh ttl 1 (what actually kills the mesh).

And here's why. The greater the differences in link costs associated to each datarate the farther you are getting from hopcount metrics.

In other words, if we broadcast PREQs at a single rate, or if we broadcast them at different rates but with the same associated cost, that would be equivalent to implement a hop count metric. Likewise, if we reduce the differences between the associated costs for each datarate we would have something closer to a hop count metric.

We just gave an example of this when we said that if PREQs at 36Mbps advertised a cost inferior to 26 the two hops path (13 +13) would not form.

Of course there is a reason for a link of 36Mpbs to be more costly than a 54Mpbs capable link. And that's airtime. One simple example: a 1000 bytes frame transmitted at 1Mbps will take about 8.19ms to transmit, while it would take about 174us to transmit the same frame at 54Mbps (47 times less).

But, back to the metrics. If the costs of all the PREQs in a cluster were the same, you would actually have a hop count going on. Though this is not an option under normal conditions it seems reasonable in a classroom.

Now we need to step back and look at the big picture: we are implying that in a classroom, or in a ten nodes under the tree scenario, any two links are equally capable i.e. that we don't think that a multihop path is a good thing at all. If we assume this, we could just switch to a hop count metric. But note that we could still have multihop paths, they would just be less probable (back to the example in #3.1, if "J" does not receive any of the PREQs from "A", there will always be node "C" to help).

Question #5 - What else could we do?

Switching to hop count in a classroom may not be as simple as it sounds, not to mention that it is not a tested scenario. Fortunately that's not the only thing we can do.

As we said, PDM traffic self-interferes and this is another chance to attack the problem: reduce PDM airtime consumption.

But this is the subject of another page. Let's summarize what we've discovered so far.

Summary

It is true that multihop paths form where you wouldn't expect them - in the ultra dense mesh cloud of a classroom. We saw reasons to believe that the more nodes you add to the mesh the longer the paths will get. We identified at least one cause for this, and this cause is composed of two contributing factors (1) the way that link costs are calculated and (2) the fact that not all nodes will decode the same frames (even being quite close to each other).

We also found out that this phenomenon cannot be completely eliminated, but we identified two ways to reduce its occurrence in a classroom. The first is to turn to a hopcount-like metrics and the second is to reduce the path discovery mechanism traffic.

Next, we need to analyse the pros and cons of each approach. Hopefully we'll get concrete evidence to support the use one of them, both or even none of them.