We used [SAGIV] as our starting point for concurrent b-tree algorithms. However, we made significant changes and (we feel) improvements. We have particularly tried to simplify the algorithms to reduce the implementation and especially debugging effort. There are also a lot of complex details involved in building a complete implementation. The goal of this document is to describe and explain the major design decisions and any subtleties that arose.
Instead of storing the full text of each key, store the length of match with the previous key (in this block) and the text of the key which doesn't match.
In non-leaf blocks, only store enough of the split-key to differentiate the two sequential blocks at the lower level.
The first 20 bytes of the block header format is the same for all types of blocks.
We use the B-link tree format (see [SAGIV]). Each leaf block contains some number of (key,value) pairs; each non-leaf (ie. index) block contains (key, down-pointer) pairs. Each block contains one additional key value, the SPLIT key, that is strictly greater than any other key in the block. The tree is augmented by chaining together the nodes of each level of the tree in split-key order (the NEXT pointers).
We allow the tree to be missing any index-insertion or block-deletion updates so long as certain key-order invariants are maintained:
We originally had index nodes in (value,key) order because it made insertions simple, but abandoned that because it made all the insertion/deletion propagation code special-case. In fact, because of the asymmetry of INSERT and DELETE, either one or the other will require that a single operation sometimes modify two blocks (see section Insertion method). However, using (key,value) ordering at all levels of the tree simplified the code considerably.
Split keys seem to be an accepted method for B-tree organization. In a Blink-tree, where one might potentially have to chain forward to find a desired key, the split key allows one to determine when the search can be terminated. Having the split key at the end of the block saves one block access per search. (If the split key were at the front of the block, one would have to access the next block in the chain to determine that a search could terminate.)
It is useful to define a largest key to be the split key of the last block at any given level; the code is made more complex if the last key is some special value (eg. the null string) that doesn't follow lexicographic order, not to mention the problems that arise with inserting into empty (last) blocks. We've reserved the set of strings starting with `0xff' as split keys only; they cannot be used as real key values. Last-key values are of the form xFF<level>. This makes the last key in level N+1 strictly larger than any in level N, so the end-of-chain key for level N need not be treated specially when it is inserted as a key at level N+1.
We make the split keys at index levels act like those at the leaf level, that is, the split key in a block NEVER CHANGES. The advantage is that inserts and deletes cannot cause split-key changes, avoiding the need for code to propagate these changes, which are a pain to test. It also makes the levels genuinely independent, a useful conceptual simplification.
The disadvantage is that this introduces a dead zone between the last key-value pair of an index block and its split key. This complicates the code slightly, and makes the average search path slightly longer than log(N) (by a constant factor of 1+(1/BF), where BF is the branching factor of the tree). More annoyingly, it interacts with insertions for block splits (see INSERTION).
Insertion in a index needs to appear atomic so that the key-order invariant is maintained. Since we are using key/value-ordered index blocks, parent-update insertion has a special extra step: we insert the new key and pointer in index block B, then swap value fields with the next entry in sequence. Unfortunately, when the insertion is the end of the block, the next entry is in the next block B', so two blocks must be modified. (The advantage of value/key ordering is that insertion never modifies more than one block; however, you get screwed in the code for deletions instead.) The code checks for this case and locks both blocks before proceeding. (An alternate solution considered is to back up the split key of B so that the new key would go at the front of B', but that introduces other problems.)
While the above method preserves correctness (even under concurrency), there is unfortunately a 2nd-order screw WHICH STILL NEEDS TO BE IMPLEMENTED: how to write the 2 blocks in an order that preserves correctness in case of disk crash. If only B is written, the database will contain 2 pointers to the block whose split caused the insertion; if only B' is written, the pointer correctness will be violated. (It would seem that this kind of problem is inherent to any operation that needs to maintain consistent changes across multiple blocks.) The fix seems to be to surround the write with a notation in the root that says this special case is being exercised and indicates what blocks were being modified (B,B') and what the pointers should be:
We elect to write out block B' first for consistency with the case where B' is a new block created by a block split. The screw case will create a damaged DB, but the danger of deleting a doubly-pointed-at block is avoided. This is adequate information to recover directly. (JONATHAN?)
(Other alternatives were considered. Backing up the split key at next level allows a subsequent insertion to be atomic, but the backing-up operation has the same two-block consistency problem in keeping ITS parent index consistent. If the parent key isn't backed up, subsequent splits of the following block will result in incorrect updates at the next level.)
Deleting can be divided into two parts: the block to be deleted must be unlinked from both parent and predecessor, after which the block can be reclaimed.
At the moment, we've decided to simplify the delete operation by requiring that both the parent and the previous block be available for modification, else the deletion fails (is deferred). This makes block deletion an atomic change, which avoids several problems introduced by permitting partial deletions (see CONCURRENCY).
Alternatives: [WANG] gives a clever way to do deletion w/o PREV, by swapping blocks. This seemed too hard to make bulletproof for concurrency so we stuck with the PREV method, which after all only does extra block accesses 1/BF of the time.
What about the order for writing the parent and predecessor blocks in a delete? We elect to write out the PARENT block first, so that if the system should crash the chain will still be intact. The block that we were trying to delete will still be in the chain, but it cannot be used, it is "dead." section Deferred Index Updates and concurrency discusses how to deal with "dead" blocks.
One major change from [SAGIV] is that SAGIV assumed multiple copies of a block could exist, which makes reclaiming deleted blocks complex (he suggests a timeout-based protocol). We opted instead to track how many pointers to each block are extant by adding a lock for that purpose, the NAME lock (essentially, a DELETE lock). A routine must get NAME lock on a pointer BEFORE releasing READ/WRITE on the block from which the pointer is obtained. (SAGIV's method is almost certainly more efficient in the sense that our method incurs overhead on every operation as opposed to just on the problem case of a READ request during a WRITE; several empirical studies of such tradeoffs support this conclusion.) On the other hand, NAME lock is useful for other things, such as insuring that the block you are PREVing from can't be deleted while you're looking for its parent or predecessor ...
It is worth remembering that INSERT and DELETE are not symmetric, in the sense that a postponed insertion is NOT equivalent to deleting a (KEY,PTR) pair. The latter operation leaves a block whose pointer is missing unaccessible via the index, while the former leaves the block accessible through the NEXT pointer chain.
This asymmetry has been the death of more proposals for fixing various problems involving concurrency than I care to recall!
EXAMPLE:
Therefore, it is not possible for a subsequent delete to "cancel" an insert; the deletes must either wait for the relevant inserts to complete or else do special work to maintain correctness.
We currently do not support deletion of the last block at any level, that is, the one with the end-of-block key. This is because this deletion requires special case to retain the end-of-block key. One way to achieve this is to copy forward the contents of the next-to-last block, and deleting that instead. [There are details to be worked out, eg. preservation of correctness during this operation.]
[DESCRIBE HOW IT WORKS]
SCREW CASE FIX UNDERSTOOD BUT NOT YET IMPLEMENTED: if down-pointers are missing to blocks immediately after the first block of a level, PREV will miss those blocks. The problem occurs when a PREV-BLOCK of block B at level N occurs, causing a PREV at level N+1. If down-pointers are missing, the block B' associated with B's split key may be some predecessor of B rather than B itself. In this case PREVing at level N+1 is wasteful but correct; it would require fewer block accesses to chain from B' than from PREV of that entry. However, if the B' entry is the FIRST at level N+1, PREV will erroneously conclude that it is at the start of the chain. It either has to (a) always look at B's entry at level N+1 to check that the current block's ptr is really up-to-date, or (b) just check when it hits the START-OF-CHAIN.
We guarantee that the root block number never changes and thus can be used as a unique identifier for a given WB-tree. Other systems provide a unique tree ID by introducing a level of indirection in root references; this is inefficient, as root references are frequent. When the root is split, we allocate a new block to hold the data that would have remained in the root block, then use the old root block for the new root. This does mean that one cannot depend on a block's ID being unchanged if it splits!
NOT IMPLEMENTED.
There are other possibilities for tree layouts. It is possible that some of these may simplify operations that in the current layout are complex, such as the insert-screw-case. Other possible choices include:
My (RJZ's) favorite alternate assumption is allowing multiple versions of blocks to exist, as in [SAGIV]. This would mean:
Perhaps we can explore the interrelationships in some methodical way someday ...
In order to be able to explicitly know when a block is safe to delete, we insist that a user must get NAME lock on a pointer BEFORE releasing READ/WRITE on the block from which the pointer is obtained. NAME lock is useful for other things, such as
Blocked operations are at the moment simply going to fail with an error code of RETRYERR, meaning that they can be safely retried later. The current idea is to use this whenever a READ-WRITE conflict occurs. (This would not be necessary using SAGIV's method.) However, since various other lockout and wait conditions can occur -- waits for block reads, waits on NAME locks, waits on interlocked operations -- some such facility would be needed anyway, so it seems reasonable to try to use it to handle READ-WRITE blocking as well.
NOT REALLY IMPLEMENTED YET due to the complexity of reorganizing the code to pass up the appropriate information in all cases. (Top-level routines return these codes but internal routines don't really use it yet, so retry-ability isn't really there yet.)
The basic strategy is to allow a key insert/delete that causes a block to split/become empty to complete, but to then queue up the parent-update/block delete on some process that will retry deferred updates in the background until they succeed.
The problem is that deletes contain two separate operations that can wait: the parent update and the predecessor update. The two updates can be done separately if the block is first marked "dead" so that no insertions into it can occur. Unfortunately, there is a class of rare and complex screw cases involving the correct ordering of deletes and inserts that happen to involve the same key. One might for example delete a block, deleting its key, and then subsequent splits can reintroduce and re-delete it. If operations at the index level are deferred, the ordering of these deferred operations determines whether the resulting tree is correct.
Making block-delete atomic (as defined above) greatly simplifies this process. The block being inserted/deleted can then serve as its own semaphore.
We have decided to adopt a lazy update strategy. That is, rather than keeping around queues of pending parent-updates, we just throw them away if they can't be executed immediately. We can get away with this because the updates for level N+1 can be reconstructed from the chain at level N.
Now, a deferred update only affects performance if the affected path is actually encountered: if we find we have to chain across two blocks, it means a parent update hasn't occurred yet; if we encounter an empty block, it means a delete update hasn't occurred. Our idea is to fix these "inefficiencies" on demand, that is, only when we run into one will we expend the effort to fix it. For the moment, assume ALL block deletes and insert updates are deferred.
The basic algorithm is:
One major concern has been that an I/O failure during a block delete can leave a "dead" block, that is, one which can't be reached from its parent level. It is still in a chain but there is no down pointer to it and no search can terminate at that node.The problem is that once a block becomes dead we need to prevent it from being inserted into or restored into use, because that could result in entries being out-of-order (suppose that, while the block's dead, some inserts of keys less than its split key occur. They'll be directed into the next block by the index, and be posted there. If we then restore the block, the key ordering will be incorrect.) But it turns out that we can prevent this by observing which B-tree operations can encounter which types of deferred-update situation, as follows:
If we think of the tree as a set of nested SPANS, where the SPAN of an index entry is the set of entries in the blocks it SPANS, we note that during operations using search only - GET,PUT,REM -- the locus of search stays strictly within the SPAN of some entry in the root. We enforce that a block not be deleted until its parent pointers are completely updates, that is, its pointer has a span of exactly one block. Now suppose a block delete fails halfway, leaving a empty block without any parent pointer. Such a block is unreachable by FIND-NODE and hence by INSERT! This means that if INSERT encounters an empty block, it must be valid to insert into it.
This has several interesting consequences:
[Having realized this, we can actually let block-deletes be two-part operations (again). The only additional complexity is that then we'd need to implement a method for detecting dead blocks, that is, differentiating them from empty blocks in the middle of large spans, which we mustn't try to delete. We just check if the block is continued in some span! HOW TO DO THAT EFFICIENTLY?]
In detail, the method of detecting deferred operations goes like this:
------------
[Other strategies were considered but were either significantly more complex or over-heady, or introduced unnecessary delays:
One hack is to serialize the queue of postponed index INSERT and DELETE operations by brute force: to do them in exactly the order they arrived in. Possibly simpler alternative method: sort by key and timestamp them!
I think we also decided that the interpreter could simply devote a process to each deferred operation, since we want to shift resources toward accomplishing the deferred operations if too many queue up.
[WANG] maintains a queue of deleted operation on the DESTINATION block. This has the disadvantage that whenever a block is split or merged the deferred-operation queues have to be split or merged as well -- ugh!.]
This is a perennial nuisance -- There is a complex system in place which hasn't really been evaluated or tuned, and there is also a proposal by Jonathan to simply use the first free (or free-able) buffer one finds.
It seems to me there was some case where it wasn't OK to just do a release to #f and an update from there -- but I can't think of it offhand ...
Note: this is a different sort of referral from the "deferred index updates" discussed earlier. Those deferred operations were correctness- preserving; these are not. The idea here is that we can reduce i/o traffic if we can safely lose a "small" number of updates.
The general idea here is that we can reduce I/O traffic by deferring writes of leaf blocks, in the sense that the updates can be lost without compromising the structure of the database. This only works where the data updates in question are not critical to database or application. Currently, we always defer leaf updates -- both PUTs and REMoves -- to data trees, where a data tree is any tree not of type DIRECTORY. (DIRECTORY leaf blocks are written to disk after every update.) The idea is that the user application should have control over how often the data blocks are written.
Also, referral of PUTs and DELETES should be separately controllable. This feature is needed for example by the database itself in maintaining the free list: we can afford to defer INSERTS of deleted block, because the worst that could happen is that a free block gets lost. But DELETES from the free list must update immediately, else a block could be doubly-allocated.
The handle field `wcb' has the following boolean values:
These bits are set as follows:
The state of these bits can be changed at any time using
(han:set-wcb! han new-bits). Directory trees force
wcb-sap and wcb-sar when opened or created. Also,
open_seg forces the free list write control bits to be as shown
above, regardless of the block type of the free-list.
Calling flush-ents flushs some modified blocks. It is thread
safe and can be called by a timer interrupt.
Works great
NOT IMPLEMENTED YET: We could actually be cleverer and cache the parents of every node, and even the PREVs, using the same TRY-GET-ENT protocol!
Multiple READ access is not supported yet. READ-READ conflicts can occur. We seem to recall noting that this limitation kept us from encountering some other problem, whose identity is lost for the moment.
Outline:
Current problems:
Last week (1/8) we concluded that (all efforts to date notwithstanding) there were still failure cases in free-block management under concurrent operation, because of problems like:
On the positive side, we realized that we NEED NOT allow enough free blocks for a free-list insert to split the whole tree -- any split except the leaf split can simply be postponed if there isn't a free block available at that time! (And similarly for any insert-updates that happen to be triggered by free-list accesses.)
Come to think of it, if we happen to run out of blocks during a free-list insert, its OK to let the insert fail, we just lose one disk block!! That may just be the answer!!
Two routines have been built for purging buffers.
FLUSH-BUFFER(ENT) writes out ENT if it is dirty and unlocked. It returns TERMINATED if ENT is locked, RETRYERR if the write is attempted and fails, and SUCCESS otherwise.
PURGE-BUFFER(ENT) writes out ENT if it is dirty and then frees up the buffer. This IGNORES the access status of the buffer, so it should not be called by users; it always returns SUCCESS.
Use (DO-SEG-BUFFERS SEG# FUNC) to apply a function to all the buffers of a given segment; for example, (DO-SEG-BUFFERS SEG# FLUSH-BUFFER) can be used to guarantee that segment SEG#'s disk file is up to date. DO-SEG-BUFFERS halts if FUNC returns other than SUCCESS; the result of FUNC is returned. SUCCESS is returned if all buffers have been successfully processed. To process all segments, use SEG #=-1.
(CHECK-BUFFER ENT) checks that the buffer is written and unlocked, and repairs those that are not.
The problem: Calls need to return adequate information to distinguish:
The solution: use a bounded range of negative values as "failure" codes, leaving the non-negative return values available as "success" codes. The canonical success code is 0, but a routine that needs to return a value (string length, ENT pointer) can do so and have that value interpreted as a "success" code as well.
There are "degrees" of failure, and the negative codes are partitioned according to increasingly severity, as follows:
| value | C name | Meaning |
| 0 | success | successful execution |
| -1 | notpres | successful execution, no data present or no change made |
| -2 | terminated | failure, no damage, caller can retry operation |
| -10 | retryerr | failure, no damage, caller can retry operation |
| -15 | argerr | failure, no damage, call was in error |
| -20 | noroom | failure, no damage, out of room in file |
| -30 | typerr | failure, file or object was not of correct type |
| -40 | ioerr | i/o error, DB may be damaged |
| -45 | strangerr | internal error, DB may be damaged |
| -90 | unkerr | placeholder code |
| -100 | maxerr |
The first class represent operations that completed without error. The second class represent operations that failed to complete, but are guaranteed to leave the DB in a correct state and are retry-able (or easily correctable). The third class represent operations that failed to complete, did not damage the database, but are not easily fixable or restartable. The last class represent error conditions in which the DB was corrupted, or during which DB corruption was detected.
The predicate (ERR? code) returns #t if the return code is within the range NOTPRES-MAXERR; the predicate (REALERR? code) returns #t if CODE is an actual error, as opposed to a "not there" or "stop processing" message.
The 256.B length limit for value strings was a barrier to many
possible applications for WB. The db:get and db:put!
procedures in `db.c' work with value strings up to 64770.B in
length (see section SCM Record Operations).
For a value string of length L:
The rest are proposed extensions to unlimited value string length.
make_seg() or block-size
argument to make-seg is the size of all WB blocks (pages) in
that segment.
by Jonathan Finger
Use 2 btrees which I will refer to as main and slave
notation D[123,4567] will translate into a string D \3 1 2 3 \4 4 5 6 7 where \3 is ascii 3 etc (length of following string of digits) this will result in keys sorting in numerical order.
Store the data in D[2357,0] = first 250 bytes D[2357,1] = second 250 bytes . . . D[2357,249] = last 250 bytesWhen a new value is set, if it is > 254 bytes in length then you reuse the D[2357] names adding or deleting as needed. Note that it is much more efficient to overwrite existing nodes since (most) of them will be the same size and add or delete at the end. If the new value < 255 bytes delete D[2357] and store the value in the master btree.
Master btree
C0 = (flag)244
Slave btree
V[244,C1] = 245
V[245,C2] = 246
.
.
.
V[244 + N,CN] = data
A get now consistes of breaking up the key and following the chain.
The put, delete, next, and prev code will get a bit messy since
there will be multiple cases to consider.
I believe this scheme will give unlimited key and data length. Performance probably will not be great but may be acceptable. This should give good key compression.
I think most of the items in RJZ's list have been done.
RJZ Modified 4/8/1993
A comprehensive B-tree bibliography can be found at: http://www.informatik.uni-trier.de/~ley/db/access/btree.html