From OLPC

Latest revision as of 21:34, 4 August 2008

Content Distribution Architecture

Taxonomy

To make sure we share a common vocabulary, the first part of this document defines the terms used.

Network

There are a number of different networking technologies, and they differ in their fundamental capabilities. The delivery modes a technology may support include:

  1. Broadcast only - the technology may support only broadcast (e.g. video channels, or vertical retrace based data traffic).
  2. Multicast - the technology may support multicast traffic (traffic can be grouped as to interest).
  3. Unicast - the technology may support unicast connections.

Latency

  1. Low - what is experienced in most of the developed world, using broadband technologies.
  2. High - Most current wireless technologies impose high latencies, often because data has been an afterthought in either the wireless technology or in the wireless deployment. These latencies can be expected to drop as future wireless technologies deploy. One or more satellite hops are common in some countries, for example, from the country to the rest of the world, and then in country to schools. As of this writing, this is the case in Peru for the Ministry of Education; a school is at least two satellite hops from the rest of the Internet.

Connectivity

  1. None - sometimes referred to as "sneakernet". No real time access to the internet is possible, though there may be periodic drops of data, from USB keys or disk drives, with periodicity anywhere from day(s) to month(s). The lack of internet capability may be driven by cost, power, or technical grounds.
  2. Cellphone - intermittent connectivity may be provided by cell phone technologies. Data service over cellphone technologies is typified by limited bandwidth and relatively poor latency. The IP service provided rarely supports multicast or broadcast.
  3. Satellite - connectivity to schools may be provided by satellite. Some satellites (e.g. IPstar, see below) provide broadcast or multicast capability, which may be very useful for transmitting video or topical information needed at multiple schools. Broadcast capability may be possible using vertical retrace technologies in other satellites. Note that many schools near each other may interfere on their uplinks, and therefore aggregating schools' uplink traffic where possible is desirable to make the most of what is an expensive, scarce resource. Latency will be high; the speed of light ensures this.
  4. DSL - In many urban areas, DSL service is available, with typical DSL characteristics (potentially low latency, moderate bandwidth). These systems typically do not support IP multicast or broadcast.
  5. Cable modem - In some urban areas, cable TV systems may have been deployed. They typically have low latency, and higher bandwidth than DSL systems. These systems typically do not support IP multicast or broadcast.
  6. Fiber - dream on, though in urban areas just being wired for network a school might end up with fiber if very lucky indeed.
  7. Point to point wireless - Schools may be connected to other schools or other nodes in the network with a variety of wireless technologies (e.g. 802.11, Canopy, Meraki). These may enable (relatively) high bandwidth, low latency communications among schools (allowing for shared web caching, for example), and optimize the use of scarce uplink satellite bandwidth (and allow engineering of a cost effective redundantly connected network). They may also optimize the use of backhaul via other technologies (e.g. DSL, cable, fiber).
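
The claim that the speed of light guarantees high satellite latency can be checked with a short calculation. The sketch below assumes geostationary satellites and, as a best case, ground stations directly beneath them; the function name is illustrative only, and real paths add processing and queuing delay on top of this lower bound.

```python
# Lower bound on propagation delay over geostationary satellite hops.
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum
GEO_ALTITUDE_KM = 35_786            # geostationary altitude above the equator


def min_round_trip_seconds(hops: int) -> float:
    """Best-case round-trip time through `hops` satellite hops.

    One hop is a trip up to the satellite and back down to the ground.
    A round trip crosses every hop twice (request and reply).
    Assumes ground stations sit directly below the satellite.
    """
    one_way_per_hop = 2 * GEO_ALTITUDE_KM / SPEED_OF_LIGHT_KM_S
    return 2 * hops * one_way_per_hop


for hops in (1, 2):
    rtt_ms = min_round_trip_seconds(hops) * 1000
    print(f"{hops} hop(s): at least {rtt_ms:.0f} ms round trip")
```

Under these assumptions a single hop already costs close to half a second per round trip, and a school two hops away from the rest of the Internet pays about twice that before any other delay is counted.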

Capacity

The capacity of a network technology can vary widely: for technological reasons of its channel capacity, because of the engineering tradeoffs made in the network design, and for regulatory reasons.

For example, the marginal cost of unused bandwidth in a cell phone or other network is zero, but that bandwidth is a scarce resource at some times of day. We are likely to see situations where a school's use of bandwidth off hours is essentially unlimited (to the capacity of the channel and network), while during the day, when there are competing commercial users, bandwidth may be severely limited. Large content bundles, therefore, may need to be queued for later delivery, even though some connectivity exists during school hours, which we may need to reserve for interactive use.

Queuing for later transfer of large objects is therefore likely needed even in the case of schools with connectivity.
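
A minimal sketch of such queuing follows, assuming a simple policy: small objects go out immediately, large ones wait for an off-peak window. The class name, the 5 MiB threshold and the 20:00 to 06:00 window are hypothetical values chosen for illustration, not figures from any deployment.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TransferScheduler:
    """Send small objects at once; hold large ones for off-peak hours."""
    large_bytes: int = 5 * 1024 * 1024   # hypothetical size threshold
    offpeak_start: int = 20              # hypothetical: 20:00 local time
    offpeak_end: int = 6                 # hypothetical: 06:00 local time
    pending: deque = field(default_factory=deque)

    def is_offpeak(self, hour: int) -> bool:
        # The off-peak window is assumed to wrap around midnight.
        return hour >= self.offpeak_start or hour < self.offpeak_end

    def submit(self, name: str, size: int, hour: int) -> str:
        """Return "send" for immediate transfer or "queued" for later."""
        if size < self.large_bytes or self.is_offpeak(hour):
            return "send"
        self.pending.append((name, size))
        return "queued"

    def drain(self, hour: int) -> list:
        """Release queued objects, oldest first, once off-peak has begun."""
        released = []
        while self.pending and self.is_offpeak(hour):
            released.append(self.pending.popleft())
        return released


scheduler = TransferScheduler()
print(scheduler.submit("lesson.html", 20_000, hour=10))
print(scheduler.submit("textbooks.zip", 200 * 1024 * 1024, hour=10))
print(scheduler.drain(hour=22))
```

A real scheduler would also need to respect the measured capacity of the link and survive restarts; the point here is only that daytime bandwidth stays free for interactive use.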

Regulatory

There may be additional issues due to regulatory constraints. There are countries which signed extremely poor wide area peering agreements, such that bandwidth out of the country is hideously expensive, putting an extremely high premium on in-country mirroring of content.

Some countries have no in-country peering between ISP's, sometimes resulting in international back-haul being the norm for interoperation between ISP's operating in country, rather than the exception. Longer term solutions are up to the local governments involved; organizations like the Packet Clearing House may be very useful to tap the expertise required to restore sanity to such network situations, though the country involved will also need to *want* to solve these problems, as the economic interests of the ISP's may not map to the interests of the country involved. These interests go well beyond Ministries of Education, so proper solutions may be time consuming and politically difficult. Note that in this example, such situations may complicate content distribution, requiring multiple delivery networks.

School Type

Schools come in a number of varieties, that share certain characteristics. The kind of connectivity may also vary depending on environment.

Small Rural Isolated Schools

In many countries, there are a large number of remote schools. While topographically they may at times be able to be interconnected, in some rain-forest areas it may be infeasible to interconnect them, and these have the highest probability of lacking any internet connectivity. While single satellite ground stations may be feasible in some areas (provided by satellites such as IPstar), these services are not available in many parts of the world (e.g. South America, Africa, much of Asia). Similarly, if cell phone service is available, they may have minimal connectivity, but limited bandwidth due to cost and/or regulatory reasons.

These schools range in size from the one room school house to small schools of less than a hundred children, depending on demographic distribution. These schools are also least likely to have conventional electric power sources, and fuel for such sources is very expensive.

Small Rural Clustered schools

It may be much more economic to provide network backhaul to a single school (either via satellite or other technologies), and then interconnect the schools via point to point wireless service (e.g. 802.11 or Canopy or Meraki technologies). Uplinks to satellite in particular can interfere with each other; by aggregating uplink backhaul (even if the school caches may be able to share multicasted content and have their own downlinks) more efficient use of the scarce satellite resource is possible, along with local use of the network interconnecting nearby villages. Reliability may be increased as well, by providing failover for uplinks that might not be economically viable if one is required at each school. Since there are so many schools (and the number of students served at each school is low) the economic reality is that they will be very cost sensitive.

In this case, we would like a set of schools to share caches (enlarging their effective size). Content may be best sent to proxy caches/CDN's by multicast (for example by satellite, if there are multiple downlinks). It may be that a single CDN node can serve a cluster of schools which simply have proxy servers for each school, if the cluster is interconnected with good bandwidth connections. But minimizing long-haul traffic (whether satellite or land-line) is clearly key for economic efficiency. And efficient use of satellite up-links is vital.
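
One way a cluster could share caches is to make each school's cache responsible for a slice of the URL space, so that an object is stored once in the cluster rather than once per school. The sketch below uses rendezvous (highest random weight) hashing for this; that technique is an assumption chosen for illustration, not something this document or any particular proxy prescribes, and the cache names are hypothetical.

```python
import hashlib
from typing import Sequence


def owner_cache(url: str, caches: Sequence[str]) -> str:
    """Pick the school cache responsible for `url` by rendezvous hashing.

    Every cache gets a score derived from (cache, url); the highest
    score wins. All schools compute the same answer independently, and
    removing a cache only moves the URLs that cache owned.
    """
    def score(cache: str) -> int:
        digest = hashlib.sha256(f"{cache}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(caches, key=score)


cluster = ["school-a", "school-b", "school-c"]   # hypothetical cache names
print(owner_cache("http://example.org/unit1.pdf", cluster))
```

A school that misses locally would ask the owning cache over the local wireless links before using the long-haul uplink.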

Town

Towns come in all sizes. The primary difference between this and a city is that if a town is small enough, it may still have a single school with all grades, while being relatively large, (of order 1000 children). This will pose similar issues to cities in terms of internal scaling of network and server infrastructure, but may increase the amount of content needed at a single site (since all grades are being served in the same location).

City

In cities, schools tend to be larger, and often segregated to sets of classes (elementary, middle, high schools, etc.). They often also run two shifts. Schools as large as 3000 students are observed in some countries. Here, economics work for us, both in provision of power and backhaul of networking, though size will present other software scaling challenges in the school, both for the 802.11g the laptops use and for servers.

Technologies

There are a number of technologies we'll have to discuss; they are described here in short form so that the subsequent discussion can be lucid.

Content Distribution Networks (CDN's)

There are many content distribution networks, of which Akamai may be best known. There are open source CDN's as well, less well known, including CoDeeN, the CoBlitz large file sharing system built on CoDeeN, and Coral. These provide geographic distribution and replication of content, of arbitrary sizes. CDN's may prefetch content they have reason to believe will become popular (to handle the flash crowd effect); certainly topical news requires such distribution, as does the core educational curriculum; on the first day of school, the first unit of a national curriculum had better be well distributed.

With thousands, tens of thousands, and eventually hundreds of thousands of schools in a country, CDN's will certainly become necessary, and may be necessary to provide good service in isolated schools or school clusters.

Enabling publishing of interesting content and activities, and its entry into the CDN is also essential; creative teachers and students are everywhere, and even a very small flash crowd could easily bring an isolated school's network connection to its knees, appearing as a denial of service attack.
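
The effect of a small flash crowd on a thin uplink is easy to estimate. The sketch below compares serving every request directly from the school with sending the object out once to a CDN; the 64 kbit/s uplink, the 5 MB object and the 50 requesters are illustrative assumptions, not measurements.

```python
def uplink_seconds(object_bytes: int, uplink_bits_per_s: int, transfers: int) -> float:
    """Time the school's uplink is busy sending `transfers` copies."""
    return transfers * object_bytes * 8 / uplink_bits_per_s


OBJECT_BYTES = 5_000_000     # assumed size of a popular student project
UPLINK_BPS = 64_000          # assumed uplink speed of an isolated school
REQUESTERS = 50              # assumed size of a small flash crowd

direct = uplink_seconds(OBJECT_BYTES, UPLINK_BPS, REQUESTERS)
via_cdn = uplink_seconds(OBJECT_BYTES, UPLINK_BPS, 1)
print(f"served directly: {direct / 3600:.1f} hours of uplink time")
print(f"sent once to a CDN: {via_cdn / 60:.1f} minutes of uplink time")
```

With these numbers the direct case occupies the uplink for most of a day, which is why the content should leave the school only once.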

CDN's generally use caching web proxies as components. They may also use enhancements to the DNS system for reliability and security (e.g. CoDNS). They may use techniques such as multicast, but more typically use other peer-to-peer technologies (e.g. DHT's or other techniques) to locate and distribute content efficiently; since multicast is seldom available in the full internet, there is still a premium on transmitting the content between nodes in a CDN network over each link just once.

Caching Proxies

HTTP has been designed to support caching of content. Caching proxies are most often seen at organizational boundaries (though caching is also under the covers of all web browsers), where they are also often used for content policy enforcement (no video, no pornography, etc.). They are often also used to provide geographic locality without a CDN (large corporations may run many proxies). Proxy configuration may be explicit (where your browser is configured to use a particular proxy, either statically or programmatically), or by intercepting port 80 traffic (a transparent proxy, which when abused, can be pernicious).

Caching proxy research was a hot topic in the 1990's, but has been moribund in this decade. The most commonly seen open source proxies include the Squid proxy cache and the Apache web server's mod_cache module. There are a number of commercial caching proxies used as well.

The differing rates of improvement of RAM vs. disk, however, have presented a major problem: Squid cannot be used to support a large disk cache (e.g. hundreds of gigabytes), as it stores URL's in RAM, and the amount of RAM required quickly becomes prohibitive. For example, Squid on OLPC's servers can only support a 10 gigabyte cache given its memory usage, while a single disk drive may have hundreds of gigabytes available for caching sitting idle. However, a new caching proxy developed at Princeton by Anirudh Badam, HashCache, does not have this limitation, and if licensing terms can be reached, will provide a viable solution.
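
The way an in-RAM index scales with disk cache size can be sketched with simple arithmetic. The mean object size and the RAM needed per cached object below are illustrative assumptions, not measured Squid figures; only the linear relationship matters.

```python
def index_ram_bytes(cache_bytes: int, mean_object_bytes: int, ram_per_entry: int) -> int:
    """RAM needed when every cached object keeps an index entry in memory."""
    objects = cache_bytes // mean_object_bytes
    return objects * ram_per_entry


GB = 1024 ** 3
MEAN_OBJECT_BYTES = 10 * 1024   # assumed mean size of a cached web object
RAM_PER_ENTRY = 100             # assumed bytes of RAM per index entry

for cache_gb in (10, 500):
    ram = index_ram_bytes(cache_gb * GB, MEAN_OBJECT_BYTES, RAM_PER_ENTRY)
    print(f"{cache_gb} GB disk cache needs about {ram / 1024 ** 2:.0f} MiB of index RAM")
```

Whatever the real per-entry cost is, filling a disk fifty times larger needs fifty times the RAM, which a small school server does not have.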

Off-line Caching

Particularly early in a deployment in a rural area (a good example is Peru), internet access may not exist at all. We need a mechanism that presents the same basic access to information whether the school is on-line or not, so that uniform documentation may be written: it is not acceptable to have two different names for the same piece of content depending on whether the school is on-line or not. More precisely, there must be a single name that can be resolved to the same content when a school is off-line as when it is on-line.
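
The single-name requirement can be sketched as a resolver that always looks in a local store first and only then tries the network. Everything below is hypothetical: the class name, the on-disk layout and the `fetch` callback are assumptions made for illustration, not an existing OLPC or wwwoffle interface.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


class ContentResolver:
    """Resolve one stable name to content, whether on-line or off-line."""

    def __init__(self, store_dir: Path,
                 fetch: Optional[Callable[[str], bytes]] = None):
        self.store_dir = Path(store_dir)
        self.store_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch   # None models a school with no connectivity

    def _path(self, name: str) -> Path:
        # Hash the name so any URL maps to a safe local file name.
        return self.store_dir / hashlib.sha256(name.encode("utf-8")).hexdigest()

    def install(self, name: str, data: bytes) -> None:
        """Load content delivered off-line, e.g. from a USB key."""
        self._path(name).write_bytes(data)

    def get(self, name: str) -> bytes:
        """Return the content for `name`, preferring the local copy."""
        path = self._path(name)
        if path.exists():
            return path.read_bytes()
        if self.fetch is None:
            raise LookupError(f"{name!r} is not stored locally and the school is off-line")
        data = self.fetch(name)
        path.write_bytes(data)   # keep a copy for later off-line use
        return data
```

Documentation can then cite the same name everywhere; an off-line school that lacks the object gets a clear error that could feed a request queue for the next delivery.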

Content may be updated on a daily, weekly, or even monthly (or longer) basis. Some schools literally take weeks to travel to (by air, river, 4 wheel drive, horse/mule/donkey and foot).

New school book editions come out on an academic year basis; (subsets of) Wikipedia should be snapshotted on a regular basis. News sources, to be topical, need to be updated frequently. Teachers (and students) should be able to request content from a larger index for later delivery.

Similarly, content published and backups generated by teachers and children at a school should be able to automatically "trickle" up to higher levels and enter archival storage accessible to a CDN, so that locally generated content at a remote school can become available to others in country (or out); this should be seen as a bi-directional connection, not as a uni-directional publishing model, and is vital to avoid poorly connected schools' networks becoming unusable due to interesting content being originated there. Note that this can easily support administrative use as well, so that local school information is available to regional and national administrations.

There are several pieces of existing technology in this area, such as the Wizzy Digital Courier (based on wwwoffle).

Storage - Archival and "In the Cloud"

Note that CDN's themselves do not store original content: they are oriented towards its replication and delivery to the edge of the network. As some/many/most schools may not be well connected (or connected at all), we cannot use their school servers for archival storage of published content, and to attempt to do so is just setting the schools up to face "flash crowd" problems.

Published content should therefore "trickle back" from the schools into dedicated storage systems.

There are a number of such systems in existence, both commercial and non commercial.

For truly "archival" storage of content, in the sense used by libraries, in which information must never be lost despite the failure of servers and infrastructure, the LOCKSS system is worthy of serious note. LOCKSS (Lots of Copies Keep Stuff Safe) is an international non-profit community initiative that provides tools and support so libraries can easily and cost-effectively preserve today’s web-published materials for tomorrow’s readers. Such systems are clearly appropriate for content that is archival in nature (for example, text books and reference works) and does not change often. LOCKSS servers can (and should) be established at appropriate locations in country; they are dedicated appliances that are intentionally very easy to install.

For less critical information (we would guess that backups of children's and teachers' content fall into this category), there are commercial services (Amazon Simple Storage Service (Amazon S3) and Google itself are examples), or countries can provide such systems themselves, which may be necessary given the high cost of international peering charges some countries face.

Issues

  1. The differing rates of improvement of RAM vs. disk have presented a major problem: Squid cannot be used to support a large disk cache (e.g. hundreds of gigabytes), as it stores URL's in RAM, and the amount of RAM required quickly becomes prohibitive. However, a new caching proxy developed at Princeton by Anirudh Badam, HashCache, does not have this limitation, and if licensing terms can be reached, will provide a viable solution.
  2. Many peer-to-peer networking technologies currently available (e.g. bittorrent) perform very poorly in the face of low bandwidth and/or high latency connections to islands of higher bandwidth; we really do not want content from a school to transit the network out of the school more than once. While we may rail against Comcast's poor network design (and the fact that they interfered with bittorrent without informing customers), the reality of our network environment will often require such restrictions. A good CDN which supports upload of school content is clearly essential; a CDN is in fact another peer-to-peer technology, just one engineered in advance with knowledge of the network topology, something that bittorrent does not have available.
  3. Management of school servers will shortly become essential as they deploy, and management of CDN's is a component of this. The best ratio of network managers to systems managed is PlanetLab's, which manages approximately 1000 systems in 450 machine rooms around the world using only one or two operators. This depends on Linux containerization technology (currently vserver). We need to track this and determine whether it is feasible for our school servers, particularly once containerization technology is in mainline Linux.

--Jim 17:34, 30 July 2008 (UTC)