Yellow Treaps
Here on my (User:CScott) thoughts on solving the partitioning problem in yellow with treaps. It assumes you already know the basic principles of yellow and of treaps. Comments welcome. In particular, there are resource-numbering schemes which attempt to provide a compact interval representation even in the fact of arbitrary insertion/deletion; I'd be very interested in seeing a counter proposal for partitioning based on these. Also, there is a trivial "git-style hashed object store" with reverse indices that answer the question "is there a newer version of X" in a simpler manner. These have O(N) complexity to answer questions about N objects. The algorithms below have roughly O(lg(# upstream objects)), and may involve multiple protocol round-trips. Things to think about...
Michael Stone has written up a more basic (but mathematical) description of yellow which you may be interested in.
Notes on the 'yellow' distributed versioned store
Files have a (locally-unique, not globally-unique) name, n, and a version set vs, which is a set of <person, version#> pairs. The person identifier is a hash of their public key. At certain points the person will be required to sign an object or a communication to prove that they are who they say they are. We will call the pair <person, name>, where 'person' is the original creator, the 'global name' of the object; it is globally-unique. The version number is a locally monotonic integer. More later.
We design a protocol to answer two questions efficiently:
a) do you have all the objects I have?, and b) are there any newer versions of these objects?
We exchange partitions of our object space. We will treap-structure our partitions to allow efficient comparison. Treaps have two keys for each node, a tree key and a heap key, and they maintain both tree constraints and heap constraints, to ensure a canonical tree structure for a given set of nodes. The "heap key" will always be a hash of the tree key; the hash is uniformly distributed, which will tend to balance our tree. The open question is: what should the tree key be?
I propose that the version number be assigned as follows. Let the most-recently assigned version # be called V. V is zero if we've never assigned versions before. Let the new version # be:
V' = max(current local seconds since the epoch, V+1)
This doesn't require global timestamp synchronization, and the protocol is robust against arbitrarily poor time-keeping by any local node, but it tends to give favorable ordering properties on the treaps as described below.
Answering "do you have all the objects that I have?"
For this query, we can identify an object as a pair <global name, version set>. We can expect that the backup store might have older versions of objects which we don't have; we want to ignore their presence. It might also have "newer" versions of things: we don't care about those, either. We just want to ensure that our local state is reconstructable upstream.
We will exchange a treap of the objects (<global name, version set, contents>) in our local store. We will compare this to the upstream treap, ensuring a) that our local store is a subset of the upstream store, and b) that our particular treap is reconstructable.
If we assigned treap keys arbitrarily, we'd expect the treap comparison to be inefficient. For every pair of adjacent objects in our store, we'd expect arbitrarily many objects between them in the upstream store (which has many more objects). Instead, we want to assign treap keys such that our local objects are highly likely to remain adjacent; this means that entire subtrees of our treap can be found in the upstream treap, making for efficient subset testing. Further, upstream needs to store whatever internal treap nodes we have which it does not, in order to allow it to reconstruct our treap contents on a backup restore. This factor also favors treaps which share as much internal structure as possible.
We construct a string by encoding the version set as a list of <person, version #> pairs, ordered so that the *largest* version # comes first. We use a lexicographic sort on this encoding as a treap key. This will tend to group the objects for which a given person made the "last modification" together. By making modifications to the files in one's local store, one tends to move those objects together. [Notice that the version set uniquely identifies the global name; there cannot be different global names with identical version sets.]
In this scheme, if one never shares files, one's treap nodes will be highly likely to be found together in the upstream treap. Shared files which are never modified are likely to be found outside the shared treap subtrees, costing the exchange/creation of log(N) nodes * M for their representation/verification, where N is the number of nodes in the upstream treap. M is as large as the number of shared files, if the shared files were last modified by different people, or as low as 1, if all of our shared files were last modified by a specific person.
Since the tree key sorts first by largest version #, as 'old' objects age out of the local store, the local treap remains compact with respect to the upstream store.
A schoolserver synchronizing with a global store will expect to exchange treap nodes proportional to the number of students it represents ("S#"), since each student's local store will tend to be compact in the global store. We can imagine adding a schoolserver prefix to the student's person identifier to try to make the schoolserver treap compact. However, we'd still end up with O(S#) "holes" in the schoolserver's treap as old objects from each student are aged out from schoolserver storage. The unprefixed cost is O(S# * lg(N_global)), where N_global is the number of objects in the global store, while the cost with the schoolserver prefix is O(lg(N_global) + S# * lg(N_schoolserver)), where N_schoolserver is the number of objects in the school server's store. The savings may not be worth the difficulty of reassociating students who transfer between schools.
Known difficulties: if file sharing is wide-spread, the objects in a given treap (schoolserver or student) may well not be compact, and this algorithm will not perform any more efficiently than simply exchanging the hashes of all the leaf nodes (O(N)). If the schoolserver responds to every treap node request with the node hash of a "covering node" in its store (the lowest treap node which contains all the items in the set named by the local treap node), these hints can be stored locally and used to speed up synchronization even if the local treap is non-compact in the schoolserver's treap. However, this "covering node" hint might be very broad and subject to frequent invalidation, since local treaps may span different subtreaps of the schoolserver treap. In the extreme case, the only "covering node" which could be returned would be the root of the schoolserver's treap, which would be invalidated as soon as another item was added to the schoolserver. We do not yet have a good analysis of how effective (or not) this algorithm would be in worst, common, or randomized cases, although the best case is obviously good!
Answering "do you have any newer versions of these objects?"
If we are looking for a more recent version of a given object, the tree ordering above does not serve our needs. We maintain a second index of our objects, where the tree key is a lexicographic ordering of <global name, version set>. The version set is encoded as a list of <version #, name>, ordered so that largest version # is first. This keeps objects with the same name together, and allows a reasonably efficient range search for objects with a given global name "newer than" a specified version set. For each node in the treap, we keep a "version summary," S, of the objects below it:
S = { (p1, v1), ..., (pN, vN) }, where p1 through pN are unique and for each (p_i, v_i) in S, there exists an object 'o' represented by the summary which has (p_i, v_i) in its version set. for each (p_i, v_i) in a version set of an object represented by the summary, there exists some (p_i, v_j) in S, where v_j >= v_i.
Now we can compare our treaps recursively from the top down. Looking at pair of <upstream, local> nodes, we can efficiently determine whether there may be newer versions of any of our local node by looking at the node summaries. If there may be, then we can efficiently partition the treaps and recurse. (We can partition the local treap with a node from the global heap in log(N) time, where N is the number of nodes in the local treap.) Once we drill down to the level of a single global name, we can pull out all version sets which are "newer than" ours, where vs1 "is newer than or the same as" vs2 if:
for all (p_i, v_i) in vs2, there exists (p_i, v_j) in vs1 and v_j >= v_i
Notes
- These two algorithms/indices can be used in combination. For example, upstream can ask local "do you have newer versions of anything", and only if it does perform the 'what do you have' algorithm.
- We shouldn't trust the local data store's "have i backed this up" flag. We should really do the full "do you have what I have" algorithm each time, because:
- one of friends might back up our (shared) files for us. so we might not know it yet, but our files might already be backed up.
- the schoolserver might have a (minor or catastrophic) accident, and might forget (certain things/everything) it knows. Even if we think that we've backed up something, it might not actually be present on the schoolserver (any more), or it might not have successfully made it to the global store. Even if the file made it to the global store before the schoolserver had its accident, the schoolserver won't necessarily know which parts of the global store it should have locally cached.
- So update all of the "have i backed this up" flags every time we sync.
Open questions
We would like to be able to securely delete files from the global store, for example to protect a student from discovery. We can do this either by using a separate 'person' identifier for each object and "forgetting" the private key for the 'person' used for that object, preventing future linking of the object with its creator, or by keeping a "version history" for objects, so that the current version and the previous versions can all be securely deleted. This should follow updates made by the user, but not necessarily delete file versions that were shared with friends (and perhaps locally modified by them).
SJ would like every object to maintain metadata of the form:
(UID, name, revision, time, links)
where 'links' is a set of typed links to sourced, referenced, or quoted material used to make the object (cf Xanadu). The UID/name pair may be replaced by a (global name, current name) pair, where the global name is as defined above, and preserving a separate 'current name' field to reflect the fact that the object might have been renamed since created. We shouldn't have to artificially disambiguate documents with the same global name which aren't actually related; the metadata is sufficient for this, and the efficiency of the algorithms isn't too badly affected by the conflict. The 'revision' can be our 'version set', which solves the problem of creating globally unique revision strings without explicit synchronization. The 'time' and 'links' are high-level metadata which would allow application-level merging or display. SJ wants to do some common-case history browsing in the journal; this metadata would be sufficient for this.
Given a set of links embedded in an object, the "do you have newer" algorithm above provides an efficient mechanism for the application to query for updates to the referenced objects. If we are just querying for an object at a time, a simple reverse index would be sufficient -- the strength of these algorithms is their efficiency in querying for updates on a set of multiple objects. Ideally we'd query for updates on all objects in the journal at once, and set UI flags for those which have been modified.
Use cases (from SJ)
- Backup synchronization:
(XO x 100) -> SS (SS x 100,000) -> google
- File updates:
(set of objects) -> find updates -> then application-level apply/merge
Wikipedia example:
SJ creates a wiki article on puppies. Its global name is <SJ, "Puppies"> and its version set is { <SJ, 1> } smithbone looks for an article on puppies and gets version { <SJ, 1> }.
Now cscott edits the article. The new version set for cscott's version is { <SJ, 1>, <CScott, 1> }. The object's metadata independently contains the information that CScott based his version on <SJ, 1>.
If smithbone looked at the wiki article at this point, his machine would query for updates to <SJ, "Puppies">, { <SJ, 1> } and find cscott's version: <SJ, "Puppies">, { <SJ, 1> }. This is clearly newer than SJ's version, so it would download the new version and show it to smithbone.
But if, instead, cjb also edited the article based on SJ's version, creating version { <SJ, 1>, <cjb, 1> }, smithbones's query would find both: <SJ, "Puppies">, { <SJ, 1>, <CScott, 1> } and <SJ, "Puppies">, { <SJ, 1>, <cjb, 1> } These are obviously conflicting versions, so the application would refer to the high-level object information. It sees that both of the versions were created from <SJ, 1>, and flags this as a conflict. Smithbone does a merge starting from SJ's version, creating: <SJ, "Puppies">, { <SJ, 1>, <CScott, 1>, <cjb, 1>, <smithbone, 1> } with metadata indicating that this version was a merge of cscott and cjb's versions.
If SJ were to look at his puppies article again, he'd find that there were newer versions out there. The low-level update algorithm would return 3 versions as newer: <SJ, "Puppies">, { <SJ, 1>, <CScott, 1> } <SJ, "Puppies">, { <SJ, 1>, <cjb, 1> } <SJ, "Puppies">, { <SJ, 1>, <CScott, 1>, <cjb, 1>, <smithbone, 1> } It's clear in this case just from the version set that smithbone's version is latest. However, cjb might have done his own merge of CScott's changes, and the update algorithm might also have included: <SJ, "Puppies">, { <SJ, 1>, <CScott, 1>, <cjb, 2> } The high-level metainformation would need to be consulted to determine that smithbone's merge and cjb's merge were identical (if indeed they were). It might create a unified version <SJ, "Puppies">, { <SJ, 1>, <CScott, 1>, <cjb, 2>, <smithbone, 1> } to record this discovery (or trivial merge).
From here these editing steps can be repeated, starting from the version { <SJ, 1>, <CScott, 1>, <cjb, 2>, <smithbone, 1> } which is now on SJ's machine. The update protocols would now ignore { <SJ, 1>, <CScott, 1> } and { <SJ, 1>, <cjb, 1> } as irrelevant when looking for further updates, although dcbw's edit of SJ's original { <SJ, 1>, <dcbw, 1> } would still show up as relevant.