Content Distribution Architecture: Difference between revisions
Revision as of 19:31, 31 July 2008
Content Distribution Architecture
Taxonomy
Network
There are a number of different networking technologies, which differ in their fundamental capabilities. These include:
- Broadcast only - the technology can only broadcast (e.g. video channels, or vertical retrace based data traffic)
- Multicast - the technology supports multicast traffic (traffic can be grouped by interest)
- Unicast - the technology supports unicast connections.
Connectivity
Latency
- Low - what is experienced in most of the developed world, using broadband technologies.
- High - some wireless technologies impose high latencies, often because data has been an afterthought in either the wireless technology or the wireless deployment. In addition, one or more satellite links are common in some countries.
Capacity
Regulatory
There may be additional issues due to regulatory constraints. Some countries have signed extremely poor wide area peering agreements, such that bandwidth out of the country is hideously expensive, putting an extremely high premium on in-country mirroring of content.
Some countries have no in-country peering between ISP's, so that international back-haul is the norm, rather than the exception, for traffic between ISP's operating in the same country. Longer term solutions are up to the local governments involved; organizations like the Packet Clearing House may be very useful for tapping the expertise required to restore sanity to such network situations, though the country involved will also need to *want* to solve these problems, as the economic interests of the ISP's may not map to the interests of the country. These interests go well beyond Ministries of Education, so proper solutions may be time consuming and politically difficult. Note that such situations may complicate content distribution, requiring multiple delivery networks.
School Type
Small Rural Isolated
Small Rural clustered
Town
City
Technologies
There are a number of technologies we'll have to discuss; they are described here in short form so that the subsequent discussion can be lucid.
Content Distribution Networks (CDN's)
There are many content distribution networks, of which Akamai may be the best known. There are also open source CDN's, less well known, including CoDeeN and Coral. These provide geographic distribution and replication of content. CDN's may prefetch content they have reason to believe will become popular (to handle the flash crowd effect); certainly topical news requires such distribution, as do the core educational curricula: on the first day of school, the first unit of a national curriculum had better be well distributed.
With thousands, tens of thousands, and eventually hundreds of thousands of schools in a country, CDN's will certainly become necessary.
Enabling publishing of interesting content and activities is also essential; creative teachers and students are everywhere, and a flash crowd could easily bring a school's network connection to its knees.
CDN's generally use caching web proxies as components. They may use techniques such as multicast or other peer-to-peer technologies (e.g. DHT's) to locate and distribute content efficiently.
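To make the DHT idea concrete, here is a minimal sketch of consistent hashing, one common building block for deciding which cache node is responsible for a given URL. It is only an illustration: it is not taken from CoDeeN, Coral, or any other CDN mentioned here, and the node names, the URL, and the `HashRing` class are hypothetical.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the hash ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Assigns each URL to one cache node (illustrative, not a full DHT)."""

    def __init__(self, nodes, replicas=64):
        # Each node is placed on the ring several times to even out the load.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, url: str) -> str:
        # The owner is the first node point after the URL's own position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self._points, _point(url)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical cache nodes and content URL.
ring = HashRing(["cache-a.example", "cache-b.example", "cache-c.example"])
owner = ring.node_for("http://curriculum.example/unit1.pdf")
```

The useful property is that adding or removing one node only moves the URLs that node was responsible for; everything else stays where it was, which matters when links are slow and re-fetching is expensive.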
Caching Proxies
The web has been designed to support caching of content. Caching proxies are most often seen at organizational boundaries (though a cache is also under the covers of every web browser), where they are also often used for content policy enforcement (no video, no pornography, etc.). They are also often used to provide geographic locality without a CDN (large corporations may run many proxies). Proxy configuration may be explicit (where your browser is configured to use a particular proxy, either statically or programmatically), or implicit, by intercepting port 80 traffic (a transparent proxy, which when abused, can be pernicious).
Caching proxy research was a hot topic in the 1990's, but has been moribund in this decade. The most commonly seen open source proxies include the Squid proxy cache and the Apache web server's mod_cache module. There are a number of commercial caching proxies used as well.
The differing rates of improvement of RAM vs. disk, however, have presented a major problem: Squid cannot be used to support a large disk cache (e.g. hundreds of gigabytes), as it stores URL's in RAM, and the amount of RAM required quickly becomes prohibitive. Neither Squid nor any existing commercial proxy can be used if a large cache is desired. However, HashCache, a new caching proxy developed at Princeton by Anirudh Badam, does not have this limitation and, if licensing terms can be reached, will provide a viable solution.
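As a small illustration of explicit, programmatic proxy configuration, the sketch below uses Python's standard `urllib.request` module to build a client that sends its HTTP requests through a named proxy. The proxy address is a made-up placeholder, and `make_proxy_opener` is a hypothetical helper, not part of any tool discussed here.

```python
import urllib.request

# Hypothetical proxy address; a real deployment would supply its own.
SCHOOL_PROXY = "http://proxy.school.example:3128"


def make_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends plain HTTP requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


opener = make_proxy_opener(SCHOOL_PROXY)
# opener.open("http://www.example.org/") would now be routed via the proxy.
```

A transparent proxy needs no such client-side step, which is exactly why it is convenient and also why it can be abused: the client is never told that its traffic is being intercepted.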
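A back-of-the-envelope calculation shows why an in-RAM index scales badly with disk size. The sketch below is only arithmetic: the mean object size and the per-object index overhead are assumed placeholder values, not measurements of Squid or of any other proxy, and should be replaced with figures from a real deployment.

```python
GIB = 1024 ** 3


def index_ram_bytes(disk_cache_bytes, mean_object_bytes, index_bytes_per_object):
    """Estimate the RAM needed to keep one index entry per cached object."""
    objects = disk_cache_bytes // mean_object_bytes
    return objects * index_bytes_per_object


# Assumed inputs: a 500 GiB disk cache, 16 KiB mean object size,
# and 100 bytes of in-memory index per object.
ram = index_ram_bytes(500 * GIB, 16 * 1024, 100)
print(f"{ram / GIB:.2f} GiB")  # prints "3.05 GiB"
```

Under these assumptions the index alone needs about 3 GiB of RAM, and the requirement grows linearly with the size of the disk cache.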
Off-line Caching
Particularly early in a deployment in a rural area (a good example is Peru), internet access may not exist at all. We need a mechanism that presents the same basic access to information whether the school is on-line or not, so that uniform documentation may be written: it is not acceptable to have two different names for the same piece of content depending on whether the school is on-line or not (or more precisely, there must be a single name that resolves to the same content when a school is off-line as when it is on-line).
Content may be updated on a daily, weekly, or even monthly (or longer) basis. Some schools literally take weeks to travel to (by air, river, four-wheel drive, horse/mule/donkey and foot).
New school book editions come out each academic year; (subsets of) Wikipedia should be snapshotted regularly. News sources, to be topical, need to be updated frequently. Teachers (and students) should be able to request content from a larger index for later delivery.
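The single-name requirement can be sketched as a local store keyed by the content's URL: the same call answers from disk when the school is off-line and falls back to the network when it is on-line. This is a minimal sketch under stated assumptions, not a description of any existing tool; the `ContentStore` class and the `fetch` callable are hypothetical, and cache freshness and expiry are deliberately ignored.

```python
import hashlib
from pathlib import Path


class ContentStore:
    """Resolves a URL to the same bytes whether on-line or off-line."""

    def __init__(self, root, fetch=None):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        # fetch is a callable(url) -> bytes, or None when the school is off-line.
        self.fetch = fetch

    def _path(self, url):
        # The URL is the one and only name; its hash is just a safe file name.
        return self.root / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put(self, url, body):
        """Store content under its URL, e.g. from a physically delivered update."""
        self._path(url).write_bytes(body)

    def get(self, url):
        path = self._path(url)
        if path.exists():
            return path.read_bytes()
        if self.fetch is None:
            raise LookupError(f"{url} is not in the local store and no network is available")
        body = self.fetch(url)
        self.put(url, body)  # keep a copy for the next off-line period
        return body
```

Documentation can then refer to a piece of content by its URL alone; whether the bytes arrive over the network or from a locally delivered update is invisible to the reader.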
Similarly, content published by teachers and children at a school, and the backups they generate, should be able to automatically "trickle" up to higher levels and enter archival storage accessible to a CDN, so that locally generated content at a remote school can become available to others in country (or out). This should be seen as a bi-directional connection, not as a uni-directional publishing model; it also handles the flash crowd problem. Note that this can/should support administrative use as well, so that local school information is available to regional and national administrations.
There are several pieces of existing technology, such as the Wizzy Digital Courier (based on wwwoffle).
Issues
- The differing rates of improvement of RAM vs. disk have presented a major problem: Squid cannot be used to support a large disk cache (e.g. hundreds of gigabytes), as it stores URL's in RAM, and the amount of RAM required quickly becomes prohibitive. HashCache, a new caching proxy developed at Princeton by Anirudh Badam, does not have this limitation and, if licensing terms can be reached, will provide a viable solution.
- Many peer-to-peer networking technologies currently available (e.g. BitTorrent) perform very poorly when islands of higher bandwidth are joined by low bandwidth and/or high latency connections; we really do not want content from a school to transit the network out of the school more than once.
--Jim 17:34, 30 July 2008 (UTC)