Content Distribution Architecture

From OLPC
Revision as of 16:19, 31 July 2008 by Jg (talk | contribs) (School Type)

Taxonomy

Network

Networking technologies differ in their fundamental capabilities. The main distinctions are:

  1. Broadcast only - the technology can only broadcast (e.g. video channels, or data carried in the vertical retrace interval).
  2. Multicast - the technology supports multicast traffic (traffic can be grouped by interest).
  3. Unicast - the technology supports unicast connections.

Connectivity

Latency

  1. Low - what is experienced in most of the developed world, using broadband technologies.
  2. High - some wireless technologies impose high latencies, often because data was an afterthought in either the wireless technology or the wireless deployment. In addition, paths that cross one or more satellite links are common in some countries.
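
To see why latency matters as much as raw capacity, recall that a single TCP connection cannot deliver more than one window of data per round trip. The following is a rough sketch of that bound only; the 64 KiB window and both round-trip times are illustrative assumptions, not measurements from any deployment.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one connection's throughput: one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


# Assumed: a 64 KiB window, i.e. no TCP window scaling in use.
WINDOW = 64 * 1024

# Assumed round-trip times, chosen only to contrast the two latency classes.
for label, rtt in [("low latency, 20 ms", 0.020), ("satellite path, 600 ms", 0.600)]:
    mbit = max_tcp_throughput_bps(WINDOW, rtt) / 1e6
    print(f"{label}: at most {mbit:.2f} Mbit/s per connection")
```

Under these assumptions the bound shrinks in direct proportion to the round-trip time, whatever the capacity of the link.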

Capacity

Regulatory

Regulatory constraints may raise additional issues. Some countries have signed extremely poor wide area peering agreements, such that bandwidth out of the country is hideously expensive, putting an extremely high premium on in-country mirroring of content.

Some countries have no in-country peering between ISP's, so that international back-haul is sometimes the norm, rather than the exception, for traffic between ISP's operating in the same country. Longer term solutions are up to the local governments involved. Organizations like the Packet Clearing House may be very useful for tapping the expertise required to restore sanity to such network situations, though the country involved will also need to *want* to solve these problems, as the economic interests of the ISP's may not map to the interests of the country. These interests go well beyond Ministries of Education, so proper solutions may be time consuming and politically difficult. Note that such situations may complicate content distribution, requiring multiple delivery networks.

School Type

Schools come in a number of varieties, each sharing certain characteristics. The kind of connectivity may also vary depending on the environment.

Small Rural Isolated

In many countries, there are a large number of remote schools. While the topography may at times allow them to be interconnected, in some rain-forest areas this may be infeasible, and these schools are the most likely to lack any internet connectivity. While single satellite ground stations may be feasible in some areas (served by satellites such as IPStar), these services are not available in many parts of the world (e.g. South America, Africa, much of Asia). Similarly, if cell phone service is available, schools may have minimal connectivity, but with bandwidth limited for cost and/or regulatory reasons.

Small Rural Clustered

It may be much more economic to provide network backhaul to one school (either via satellite or other technologies), and then interconnect the schools via point to point wireless service (e.g. 802.11 or Canopy or Meraki technologies). Uplinks to satellite in particular can interfere with each other; by aggregating uplink backhaul (even if the school caches are able to share multicast content and have their own downlinks), more efficient use of the scarce satellite resource is possible, along with local use of the network interconnecting nearby villages. Reliability may be increased as well, by providing failover for uplinks, which might not be economically viable if an uplink were required at each school. Since there are so many schools (and the number of students served at each school is low), the economic reality is that they will be very cost sensitive.

In this case, we would like a set of schools to share caches (enlarging their effective size). Content may be best sent to proxy caches by multicast (e.g. by satellite).
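
One possible way for a cluster of schools to share caches without storing the same object several times is to make exactly one cache responsible for each URL. The sketch below uses rendezvous (highest-random-weight) hashing for this. That technique is an assumption chosen for illustration, not something this page prescribes, and the cache names are hypothetical.

```python
import hashlib
from typing import Sequence


def owner_cache(url: str, caches: Sequence[str]) -> str:
    """Return the cache responsible for url, chosen by rendezvous hashing."""
    if not caches:
        raise ValueError("at least one cache is required")

    def weight(cache: str) -> int:
        # Every (cache, url) pair gets a pseudo-random score; highest score wins.
        digest = hashlib.sha256(f"{cache}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(caches, key=weight)


# Hypothetical cluster of three school caches.
cluster = ["school-a", "school-b", "school-c"]
for url in ["http://example.org/unit1.pdf", "http://example.org/news.html"]:
    print(url, "->", owner_cache(url, cluster))
```

A useful property of this choice is that if a school's cache goes away, only the URL's it owned move to another cache; the assignments of all other URL's are unchanged.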

Town

City

Technologies

There are a number of technologies we'll have to discuss; they are described here in short form so that the subsequent discussion can be lucid.

Content Distribution Networks (CDN's)

There are many content distribution networks, of which Akamai may be the best known. There are less well known open source CDN's as well, including CoDeeN and Coral. These provide geographic distribution and replication of content. CDN's may prefetch content they have reason to believe will become popular (to handle the flash crowd effect); topical news certainly requires such distribution, as do the core educational curricula: on the first day of school, the first unit of a national curriculum had better be well distributed.

With thousands, tens of thousands, and eventually hundreds of thousands of schools in a country, CDN's will certainly become necessary.

Enabling publishing of interesting content and activities is also essential; creative teachers and students are everywhere, and a flash crowd could easily bring a school's network connection to its knees.

CDN's generally use caching web proxies as components. They may use techniques such as multicast or other peer-to-peer technologies (e.g. DHT's) to locate and distribute content efficiently.

Caching Proxies

The web has been designed to support caching of content. Caches are most often seen at organizational boundaries (though caching is also under the covers of all web browsers), where they are also often used for content policy enforcement (no video, no pornography, etc.). They are often also used to provide geographic locality without a CDN (large corporations may run many proxies). Proxy configuration may be explicit (where your browser is configured to use a particular proxy, either statically or programmatically), or implicit, by intercepting port 80 traffic (a transparent proxy, which, when abused, can be pernicious).

Caching proxy research was a hot topic in the 1990's, but has been moribund in this decade. The most commonly seen open source proxies include the Squid proxy cache and the Apache web server's mod_cache module. A number of commercial caching proxies are used as well.

The differing rates of improvement of RAM and disk, however, have presented a major problem: Squid cannot support a large disk cache (e.g. hundreds of gigabytes), because it stores URL's in RAM, and the amount of RAM required quickly becomes prohibitive. Neither Squid nor any existing commercial proxy can be used if a large cache is desired. However, HashCache, a new caching proxy developed at Princeton by Anirudh Badam, does not have this limitation and, if licensing terms can be reached, will provide a viable solution.
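
The scale of the problem can be seen with some back-of-the-envelope arithmetic. This is a sketch under stated assumptions: the cache size, the average object size and the per-object index cost below are illustrative figures, not measured values for Squid or any other proxy.

```python
def index_ram_bytes(cache_bytes: int, avg_object_bytes: int,
                    index_bytes_per_object: int) -> int:
    """RAM needed when every cached object costs a fixed amount of index memory."""
    objects = cache_bytes // avg_object_bytes
    return objects * index_bytes_per_object


# All three figures are assumptions for illustration only.
CACHE_BYTES = 500 * 2**30      # a 500 GiB disk cache
AVG_OBJECT_BYTES = 10 * 2**10  # 10 KiB average cached object
INDEX_BYTES_PER_OBJECT = 100   # in-RAM index cost per cached object

ram = index_ram_bytes(CACHE_BYTES, AVG_OBJECT_BYTES, INDEX_BYTES_PER_OBJECT)
print(f"{ram / 2**30:.2f} GiB of RAM just for the index")
```

Because the index grows linearly with the number of cached objects, a cache of many small objects can demand more RAM than an inexpensive school server is likely to have.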

Off-line Caching

Particularly early in a deployment in a rural area (a good example is Peru), internet access may not exist at all. We need a mechanism that presents the same basic access to information whether the school is on-line or not, so that uniform documentation may be written. It is not acceptable to need two different names for the same piece of content depending on whether the school is on-line or not; more precisely, there must be a single name that resolves to the same content when a school is off-line as when it is on-line.

Content may be updated on a daily, weekly, or even monthly (or longer) basis. Some schools literally take weeks to travel to (by air, river, 4 wheel drive, horse/mule/donkey and foot).

New school book editions come out once per academic year; (subsets of) Wikipedia should be snapshotted regularly. News sources, to be topical, need to be updated frequently. Teachers (and students) should be able to request content from a larger index for later delivery.
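
The requirements above (one name per piece of content, on-line or off-line, plus requests for later delivery) can be sketched as a small resolver. This is a minimal illustration, not a design for the actual system: the class, its method names and the in-memory store are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


class ContentResolver:
    """Resolve one name to the same content whether on-line or off-line."""

    def __init__(self, fetch: Optional[Callable[[str], bytes]] = None):
        self.store: Dict[str, bytes] = {}  # content already held at the school
        self.pending: List[str] = []       # names requested for later delivery
        self.fetch = fetch                 # None means the school is off-line

    def resolve(self, name: str) -> Optional[bytes]:
        if name in self.store:
            return self.store[name]        # the local copy always wins
        if self.fetch is not None:
            self.store[name] = self.fetch(name)
            return self.store[name]
        if name not in self.pending:
            self.pending.append(name)      # remember the request
        return None

    def deliver(self, name: str, data: bytes) -> None:
        """Called when content arrives, e.g. with a periodic physical update."""
        self.store[name] = data
        if name in self.pending:
            self.pending.remove(name)


# An off-line school: the first request is queued, then satisfied on delivery.
school = ContentResolver()
print(school.resolve("http://example.org/unit1.html"), school.pending)
school.deliver("http://example.org/unit1.html", b"<html>Unit 1</html>")
print(school.resolve("http://example.org/unit1.html"), school.pending)
```

The caller uses the same name in every case; only the path by which the content reaches the school differs.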

Similarly, content published by teachers and children at a school, and the backups they generate, should be able to automatically "trickle" up to higher levels and enter archival storage accessible to a CDN, so that locally generated content at a remote school can become available to others in country (or out). This should be seen as a bi-directional connection, not as a uni-directional publishing model; because the CDN then serves the content, it also handles the flash crowd problem. Note that this can/should support administrative use as well, so that local school information is available to regional and national administrations.

Several pieces of existing technology address parts of this problem, such as the Wizzy Digital Courier (based on wwwoffle).

Issues

  1. The differing rates of improvement of RAM and disk have presented a major problem: Squid cannot support a large disk cache (e.g. hundreds of gigabytes), because it stores URL's in RAM, and the amount of RAM required quickly becomes prohibitive. However, HashCache, a new caching proxy developed at Princeton by Anirudh Badam, does not have this limitation and, if licensing terms can be reached, will provide a viable solution.
  2. Many peer-to-peer networking technologies currently available (e.g. BitTorrent) perform very poorly when islands of higher bandwidth are joined by low bandwidth and/or high latency connections; we really do not want content from a school to transit the network out of the school more than once.
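
The goal in the second issue, that content leaves a school at most once, can be illustrated with a toy model in which an upstream cache answers all repeat requests. The class names are hypothetical, and the sketch deliberately ignores updates and cache expiry.

```python
class SchoolOrigin:
    """A school's server, counting transfers over its scarce uplink."""

    def __init__(self, documents):
        self.documents = dict(documents)
        self.uplink_transfers = 0

    def send(self, name):
        self.uplink_transfers += 1
        return self.documents[name]


class UpstreamCache:
    """A cache beyond the school's uplink (e.g. a regional or CDN node)."""

    def __init__(self, origin):
        self.origin = origin
        self.copies = {}

    def get(self, name):
        if name not in self.copies:
            # Only the first request crosses the school's uplink.
            self.copies[name] = self.origin.send(name)
        return self.copies[name]


school = SchoolOrigin({"story.html": b"<html>Our village</html>"})
regional = UpstreamCache(school)
for _ in range(1000):  # a flash crowd of readers elsewhere
    regional.get("story.html")
print("uplink transfers:", school.uplink_transfers)
```

However many readers arrive, the school's uplink carries each document once; a peer-to-peer swarm that repeatedly pulls pieces from the school itself would not have this property.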

--Jim 17:34, 30 July 2008 (UTC)