Apache Proxy CRCsync

From OLPC

This is the homepage of the Apache CRCSync project -- a modern HTTP proxy designed to radically increase the efficiency of HTTP traffic over bandwidth- and latency-constrained connections.

This matters most for schools with very limited connectivity and many users hungry for knowledge. It also matters for users with costly connectivity (such as 3G modems) and bandwidth-hungry devices (such as iPhones).

This project is strongly inspired by Andrew Tridgell's rproxy -- as such, this page borrows some of rproxy's notes:

Problem statement

Caches are used to good effect on today's web to improve response times and reduce network usage. For any given resource, such as an HTML page or an image, the client remembers the last instance it retrieved and may use it to satisfy future requests. However, the current system is all-or-nothing: the resource must either be exactly the same as the cached instance, or it is downloaded from scratch.

The web is dominated today by dynamic content: many pages are assembled from databases or are customized for each visitor. In the existing HTTP caching system, this means that many resources cannot be cached at all, even if 80 or 90% of the HTML payload is a template that has not changed at all.

Proposed solution

A far better approach would be for the client to download only a description of the changes from the old instance to the new one: a `diff' or `delta'.

Some people have proposed that the server should send the resource as an unchanging template plus variable values, or that the server should retain all old instances and so calculate the differences. These techniques have some value, but they constrain the server-side developer and seem unlikely to be widely adopted.

The rproxy extensions to HTTP allow the server to generate a diff relative to the cached instance in a way that is completely general, and transparent to both the server and user agent.
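
As a rough illustration of the idea (not the rproxy or crcsync wire format), the sketch below assumes the client sends a digest for each block of its cached copy, and the server describes the new instance as literal bytes plus references to blocks the client already holds. The names signature, make_delta and apply_delta, the 8-byte block size, and the use of SHA-1 digests are all hypothetical choices made for this example. Note that the server never needs to keep the old instance.

```python
import hashlib

BLOCK = 8  # hypothetical block size, kept tiny for illustration


def signature(cached: bytes, block: int = BLOCK) -> dict:
    """Client side: map the digest of each full block of the cached copy to its index."""
    sig = {}
    for i in range(0, len(cached) - block + 1, block):
        digest = hashlib.sha1(cached[i:i + block]).digest()
        sig.setdefault(digest, i // block)
    return sig


def make_delta(new: bytes, sig: dict, block: int = BLOCK) -> list:
    """Server side: encode `new` as ("ref", block_index) and ("lit", bytes) items."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        idx = None
        if pos + block <= len(new):
            # Naive scan: the digest is recomputed at every offset.
            idx = sig.get(hashlib.sha1(new[pos:pos + block]).digest())
        if idx is None:
            literal.append(new[pos])  # no cached block starts here
            pos += 1
        else:
            if literal:
                delta.append(("lit", bytes(literal)))
                literal = bytearray()
            delta.append(("ref", idx))  # client already has this block
            pos += block
    if literal:
        delta.append(("lit", bytes(literal)))
    return delta


def apply_delta(cached: bytes, delta: list, block: int = BLOCK) -> bytes:
    """Client side: rebuild the new instance from the cached copy plus the delta."""
    out = bytearray()
    for kind, value in delta:
        if kind == "ref":
            out += cached[value * block:(value + 1) * block]
        else:
            out += value
    return bytes(out)


cached = b"AAAAAAAABBBBBBBBCCCCCCCC"
new = b"AAAAAAAAxyBBBBBBBBCCCCCCCC"
delta = make_delta(new, signature(cached))
rebuilt = apply_delta(cached, delta)
```

In this toy case only the two inserted bytes travel as literal data; the rest of the page is expressed as references to cached blocks. Recomputing a full digest at every offset is slow, which is why practical implementations rely on a checksum that can be updated incrementally as the window slides.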

The CRCSync proxy

The rproxy codebase is now a bit dated, and its reliance on the rsync protocol is problematic, as the rsync protocol is encumbered by software patents. To resolve this, Rusty Russell has been working on a new implementation based on a rolling CRC algorithm, called crcsync, and there are efforts underway to integrate it into the appropriate modules of Apache 2.2.x, as Apache is an excellent and modular HTTP proxy.
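
To show what "rolling" means here, the following minimal sketch slides a fixed-width window over a byte string and updates the checksum in constant time per step instead of rehashing the whole window. It deliberately swaps in a simple polynomial rolling hash rather than the CRC that crcsync uses, so it should be read as an illustration of the rolling property only; BASE, MOD, window_hash and rolling_hashes are hypothetical names chosen for this example.

```python
BASE = 257            # hypothetical polynomial base
MOD = (1 << 61) - 1   # hypothetical large prime modulus


def window_hash(data: bytes) -> int:
    """Hash a whole window from scratch (used to seed and to cross-check)."""
    h = 0
    for b in data:
        h = (h * BASE + b) % MOD
    return h


def rolling_hashes(data: bytes, width: int):
    """Yield (offset, hash) for every `width`-byte window, O(1) work per slide."""
    if width <= 0 or len(data) < width:
        return
    top = pow(BASE, width - 1, MOD)  # weight of the byte leaving the window
    h = window_hash(data[:width])
    yield 0, h
    for i in range(width, len(data)):
        h = (h - data[i - width] * top) % MOD  # drop the outgoing byte
        h = (h * BASE + data[i]) % MOD         # bring in the incoming byte
        yield i - width + 1, h


data = b"the quick brown fox"
windows = list(rolling_hashes(data, 4))
```

Each yielded value equals the hash that would be obtained by hashing that window from scratch, which is what lets a sender test every byte offset of new content against the block checksums of a cached copy at low cost.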

Protocol