[Contents]   [Back]   [Prev]   [Up]   [Next]   [Forward]  


Theory

We used [SAGIV] as our starting point for concurrent b-tree algorithms. However, we made significant changes and (we feel) improvements. We have particularly tried to simplify the algorithms to reduce the implementation and especially debugging effort. There are also a lot of complex details involved in building a complete implementation. The goal of this document is to describe and explain the major design decisions and any subtleties that arose.

B-tree Structure and Access

Definitions

B+-tree

block

root block

internal block

leaf block

prefix compression

Instead of storing the full text of each key, store the length of match with the previous key (in this block) and the text of the key which doesn't match.

suffix compression

In non-leaf blocks, only store enough of the split-key to differentiate the two sequential blocks at the lower level.

Block Format

The first 20 bytes of the block header format is the same for all types of blocks.

0 blk:id
The ID of this block.
4 blk:top-id
The ID of the root block of this tree.
8 blk:nxt-id
The ID of the next block in the chain (at same level as this one).
12 blk:time
The 32-bit time/date when this block was last modified.
16 blk:end
Two-byte length of data in block, including header.
18 blk:level
Block level; 0 is leaf.
19 blk:typ
Block type, one of:
20
Start of data for SEQ-TYP blocks. Data spans to blk:end.
20 leaf-split-key
22
Start of data for other block types. Data spans to blk:end.

Tree format

The WB-tree (the "Wanna B-tree")

We use the B-link tree format (see [SAGIV]). Each leaf block contains some number of (key,value) pairs; each non-leaf (ie. index) block contains (key, down-pointer) pairs. Each block contains one additional key value, the SPLIT key, that is strictly greater than any other key in the block. The tree is augmented by chaining together the nodes of each level of the tree in split-key order (the NEXT pointers).

We allow the tree to be missing any index-insertion or block-deletion updates so long as certain key-order invariants are maintained:

A
Blocks are chained in order of increasing split key at each level, and all valid blocks appear in the chain.
B
Split keys are unique at each level.
C
Every DOWN pointer from level N+1 points to a block B at the next lower level N whose split-key S is less than or equal to the key S1 associated with the pointer.
C1
If the split key for block B (at level N) is strictly less than the key associated with the ptr from level N+1 (S < S1), then it must be the case that
  1. a block B' with that split key S1 exists at level N;
  2. B' is reachable from B by following NEXT pointers; and
  3. no pointers to either B' nor any blocks between B and B' exist in level N+1. We call the subchain from B through B' the SPAN of the (key,ptr) entry (split(B'),B).
C2
It is invalid for a down pointer to contain a key not present as a split key at the next level. Note that a key in an index must match its block's split key exactly; if a key K is less than the split-key S of the block B it points to, searches for intervening keys will be misdirected (to the next block); and if K is greater than S, then splits of the block after B at key values K' where S<K'<K will be mis-inserted in B's paren, because K' should logically go AFTER K, but K'<K. The notion of the span of an index entry is useful. We note that each block split can be thought of as an EXTENSION of some span at the next-higher level, while each PARENT-INSERT-UPDATE can be thought of as a corresponding span REDUCTION. A span that has only one block in it will be called FULLY REDUCED; a b-tree is fully reduced when all its spans are fully reduced, meaning that all pending/deferred insert-updates have been performed. Lastly, we can express rule C1 more succinctly in terms of spans:
C,C2
SPANs must be well-formed (span must be closed; keys must match exactly)
C1
The SPANS of the entries at any index must not overlap.

Key/Value Order; Uniform Leaf/Node Format

We originally had index nodes in (value,key) order because it made insertions simple, but abandoned that because it made all the insertion/deletion propagation code special-case. In fact, because of the asymmetry of INSERT and DELETE, either one or the other will require that a single operation sometimes modify two blocks (see section Insertion method). However, using (key,value) ordering at all levels of the tree simplified the code considerably.

Split keys

Split keys seem to be an accepted method for B-tree organization. In a Blink-tree, where one might potentially have to chain forward to find a desired key, the split key allows one to determine when the search can be terminated. Having the split key at the end of the block saves one block access per search. (If the split key were at the front of the block, one would have to access the next block in the chain to determine that a search could terminate.)

Last-key values

It is useful to define a largest key to be the split key of the last block at any given level; the code is made more complex if the last key is some special value (eg. the null string) that doesn't follow lexicographic order, not to mention the problems that arise with inserting into empty (last) blocks. We've reserved the set of strings starting with `0xff' as split keys only; they cannot be used as real key values. Last-key values are of the form xFF<level>. This makes the last key in level N+1 strictly larger than any in level N, so the end-of-chain key for level N need not be treated specially when it is inserted as a key at level N+1.

Fixed split keys/Independence of levels

We make the split keys at index levels act like those at the leaf level, that is, the split key in a block NEVER CHANGES. The advantage is that inserts and deletes cannot cause split-key changes, avoiding the need for code to propagate these changes, which are a pain to test. It also makes the levels genuinely independent, a useful conceptual simplification.

The disadvantage is that this introduces a dead zone between the last key-value pair of an index block and its split key. This complicates the code slightly, and makes the average search path slightly longer than log(N) (by a constant factor of 1+(1/BF), where BF is the branching factor of the tree). More annoyingly, it interacts with insertions for block splits (see INSERTION).

Insertion method

Insertion in a index needs to appear atomic so that the key-order invariant is maintained. Since we are using key/value-ordered index blocks, parent-update insertion has a special extra step: we insert the new key and pointer in index block B, then swap value fields with the next entry in sequence. Unfortunately, when the insertion is the end of the block, the next entry is in the next block B', so two blocks must be modified. (The advantage of value/key ordering is that insertion never modifies more than one block; however, you get screwed in the code for deletions instead.) The code checks for this case and locks both blocks before proceeding. (An alternate solution considered is to back up the split key of B so that the new key would go at the front of B', but that introduces other problems.)

While the above method preserves correctness (even under concurrency), there is unfortunately a 2nd-order screw WHICH STILL NEEDS TO BE IMPLEMENTED: how to write the 2 blocks in an order that preserves correctness in case of disk crash. If only B is written, the database will contain 2 pointers to the block whose split caused the insertion; if only B' is written, the pointer correctness will be violated. (It would seem that this kind of problem is inherent to any operation that needs to maintain consistent changes across multiple blocks.) The fix seems to be to surround the write with a notation in the root that says this special case is being exercised and indicates what blocks were being modified (B,B') and what the pointers should be:

We elect to write out block B' first for consistency with the case where B' is a new block created by a block split. The screw case will create a damaged DB, but the danger of deleting a doubly-pointed-at block is avoided. This is adequate information to recover directly. (JONATHAN?)

(Other alternatives were considered. Backing up the split key at next level allows a subsequent insertion to be atomic, but the backing-up operation has the same two-block consistency problem in keeping ITS parent index consistent. If the parent key isn't backed up, subsequent splits of the following block will result in incorrect updates at the next level.)

Deletion

Deleting can be divided into two parts: the block to be deleted must be unlinked from both parent and predecessor, after which the block can be reclaimed.

Unlinking the block

At the moment, we've decided to simplify the delete operation by requiring that both the parent and the previous block be available for modification, else the deletion fails (is deferred). This makes block deletion an atomic change, which avoids several problems introduced by permitting partial deletions (see CONCURRENCY).

Alternatives: [WANG] gives a clever way to do deletion w/o PREV, by swapping blocks. This seemed too hard to make bulletproof for concurrency so we stuck with the PREV method, which after all only does extra block accesses 1/BF of the time.

Crashes during DELETE

What about the order for writing the parent and predecessor blocks in a delete? We elect to write out the PARENT block first, so that if the system should crash the chain will still be intact. The block that we were trying to delete will still be in the chain, but it cannot be used, it is "dead." section Deferred Index Updates and concurrency discusses how to deal with "dead" blocks.

Reclaiming the block

One major change from [SAGIV] is that SAGIV assumed multiple copies of a block could exist, which makes reclaiming deleted blocks complex (he suggests a timeout-based protocol). We opted instead to track how many pointers to each block are extant by adding a lock for that purpose, the NAME lock (essentially, a DELETE lock). A routine must get NAME lock on a pointer BEFORE releasing READ/WRITE on the block from which the pointer is obtained. (SAGIV's method is almost certainly more efficient in the sense that our method incurs overhead on every operation as opposed to just on the problem case of a READ request during a WRITE; several empirical studies of such tradeoffs support this conclusion.) On the other hand, NAME lock is useful for other things, such as insuring that the block you are PREVing from can't be deleted while you're looking for its parent or predecessor ...

Non-symmetry of INSERT and DELETE

It is worth remembering that INSERT and DELETE are not symmetric, in the sense that a postponed insertion is NOT equivalent to deleting a (KEY,PTR) pair. The latter operation leaves a block whose pointer is missing unaccessible via the index, while the former leaves the block accessible through the NEXT pointer chain.

This asymmetry has been the death of more proposals for fixing various problems involving concurrency than I care to recall!

EXAMPLE:

  1. Start with blk p with split key k at level n;
  2. split p at k', adding block p';
  3. Now neither deleting p nor p' yields the original index contents:

Therefore, it is not possible for a subsequent delete to "cancel" an insert; the deletes must either wait for the relevant inserts to complete or else do special work to maintain correctness.

Non-delete of last block in the chain

We currently do not support deletion of the last block at any level, that is, the one with the end-of-block key. This is because this deletion requires special case to retain the end-of-block key. One way to achieve this is to copy forward the contents of the next-to-last block, and deleting that instead. [There are details to be worked out, eg. preservation of correctness during this operation.]

Prev

[DESCRIBE HOW IT WORKS]

SCREW CASE FIX UNDERSTOOD BUT NOT YET IMPLEMENTED: if down-pointers are missing to blocks immediately after the first block of a level, PREV will miss those blocks. The problem occurs when a PREV-BLOCK of block B at level N occurs, causing a PREV at level N+1. If down-pointers are missing, the block B' associated with B's split key may be some predecessor of B rather than B itself. In this case PREVing at level N+1 is wasteful but correct; it would require fewer block accesses to chain from B' than from PREV of that entry. However, if the B' entry is the FIRST at level N+1, PREV will erroneously conclude that it is at the start of the chain. It either has to (a) always look at B's entry at level N+1 to check that the current block's ptr is really up-to-date, or (b) just check when it hits the START-OF-CHAIN.

Root Block protocol

Root uniqueness

We guarantee that the root block number never changes and thus can be used as a unique identifier for a given WB-tree. Other systems provide a unique tree ID by introducing a level of indirection in root references; this is inefficient, as root references are frequent. When the root is split, we allocate a new block to hold the data that would have remained in the root block, then use the old root block for the new root. This does mean that one cannot depend on a block's ID being unchanged if it splits!

ROOT DELETE and Reducing number of levels in a tree

NOT IMPLEMENTED.

Other tree organizations

There are other possibilities for tree layouts. It is possible that some of these may simplify operations that in the current layout are complex, such as the insert-screw-case. Other possible choices include:

My (RJZ's) favorite alternate assumption is allowing multiple versions of blocks to exist, as in [SAGIV]. This would mean:

Perhaps we can explore the interrelationships in some methodical way someday ...

Concurrency

Name access (and deleted-block reclamation)

In order to be able to explicitly know when a block is safe to delete, we insist that a user must get NAME lock on a pointer BEFORE releasing READ/WRITE on the block from which the pointer is obtained. NAME lock is useful for other things, such as

Fail-out protocol/access conflict strategy

Blocked operations are at the moment simply going to fail with an error code of RETRYERR, meaning that they can be safely retried later. The current idea is to use this whenever a READ-WRITE conflict occurs. (This would not be necessary using SAGIV's method.) However, since various other lockout and wait conditions can occur -- waits for block reads, waits on NAME locks, waits on interlocked operations -- some such facility would be needed anyway, so it seems reasonable to try to use it to handle READ-WRITE blocking as well.

NOT REALLY IMPLEMENTED YET due to the complexity of reorganizing the code to pass up the appropriate information in all cases. (Top-level routines return these codes but internal routines don't really use it yet, so retry-ability isn't really there yet.)

Deferred Index Updates and concurrency

The basic strategy is to allow a key insert/delete that causes a block to split/become empty to complete, but to then queue up the parent-update/block delete on some process that will retry deferred updates in the background until they succeed.

The problem is that deletes contain two separate operations that can wait: the parent update and the predecessor update. The two updates can be done separately if the block is first marked "dead" so that no insertions into it can occur. Unfortunately, there is a class of rare and complex screw cases involving the correct ordering of deletes and inserts that happen to involve the same key. One might for example delete a block, deleting its key, and then subsequent splits can reintroduce and re-delete it. If operations at the index level are deferred, the ordering of these deferred operations determines whether the resulting tree is correct.

Making block-delete atomic (as defined above) greatly simplifies this process. The block being inserted/deleted can then serve as its own semaphore.

We have decided to adopt a lazy update strategy. That is, rather than keeping around queues of pending parent-updates, we just throw them away if they can't be executed immediately. We can get away with this because the updates for level N+1 can be reconstructed from the chain at level N.

Now, a deferred update only affects performance if the affected path is actually encountered: if we find we have to chain across two blocks, it means a parent update hasn't occurred yet; if we encounter an empty block, it means a delete update hasn't occurred. Our idea is to fix these "inefficiencies" on demand, that is, only when we run into one will we expend the effort to fix it. For the moment, assume ALL block deletes and insert updates are deferred.

The basic algorithm is:

  1. If we have to chain forward in searching for a key, there must be pointer missing at the next level (deferred insert). So we attempt to insert it. Forward chaining to reach the NEXT and PREV of a key is normal and hence shouldn't cause update attempts.
  2. Whenever we reach an empty block, we attempt to delete it, except for the case where we are doing an insert which should go in that block. The delete attempt will fail UNLESS the relevant parent updates are complete, that is, if and only if the block is within a fully reduced span.

One major concern has been that an I/O failure during a block delete can leave a "dead" block, that is, one which can't be reached from its parent level. It is still in a chain but there is no down pointer to it and no search can terminate at that node.The problem is that once a block becomes dead we need to prevent it from being inserted into or restored into use, because that could result in entries being out-of-order (suppose that, while the block's dead, some inserts of keys less than its split key occur. They'll be directed into the next block by the index, and be posted there. If we then restore the block, the key ordering will be incorrect.) But it turns out that we can prevent this by observing which B-tree operations can encounter which types of deferred-update situation, as follows:

If we think of the tree as a set of nested SPANS, where the SPAN of an index entry is the set of entries in the blocks it SPANS, we note that during operations using search only - GET,PUT,REM -- the locus of search stays strictly within the SPAN of some entry in the root. We enforce that a block not be deleted until its parent pointers are completely updates, that is, its pointer has a span of exactly one block. Now suppose a block delete fails halfway, leaving a empty block without any parent pointer. Such a block is unreachable by FIND-NODE and hence by INSERT! This means that if INSERT encounters an empty block, it must be valid to insert into it.

This has several interesting consequences:

  1. This means that only the operations that can possibly chain ACROSS spans have to worry about dead blocks: the NEXT and PREV operations. (What exactly is the effect on PREV?)
  2. This means that "dead" blocks can't be deleted (unless we look for them specially). But they (a) are only created by a rare kind of event, and (b) won't hurt anything, so long as we arrange that NEXT and PREVs ignore the empty blocks (ie. ignore the value of their split keys).
  3. The way we have defined deferred-delete reconstruction, if the block isn't within a fully-reduced span, the delete simply can't succeed. This means that there is no point in trying to delete a blank block when we're chaining though a span, that is, we should only try to delete empty blocks encountered (1) when following a DOWN pointer in FIND-ENT, or (2) due to a NEXT operation (or PREV?).

[Having realized this, we can actually let block-deletes be two-part operations (again). The only additional complexity is that then we'd need to implement a method for detecting dead blocks, that is, differentiating them from empty blocks in the middle of large spans, which we mustn't try to delete. We just check if the block is continued in some span! HOW TO DO THAT EFFICIENTLY?]

In detail, the method of detecting deferred operations goes like this:

  1. RECONSTRUCTING DEFERRED PARENT-UPDATES (missing DOWN pointers): Whenever we have to follow the NEXT pointer of a block B (in FIND-NODE), we should attempt a PARENT-INSERT-UPDATE using the (key,value) pair (split(B),next(B)). FIND-NODE needs to be fixed to chain right rather than down when the "dead zone" is encountered; and it should not attempt a PARENT-INSERT-UPDATE in this case. PARENT-INSERT-UPDATE can fail if:
    1. some other process is already doing the update (the first one to lock the parent wins, the other will fail out);
    2. the key split(B) is already present (means a DELETE is pending -- this shouldn't really occur, though);
    3. the update requires a block split but no blocks are available.
  2. RECONSTRUCTING DEFERRED BLOCK DELETES: Whenever we reach an empty block in FIND-NODE or a NEXT/PREV operation, we'll attempt a PARENT-DELETE-UPDATE. PARENT-DELETE-UPDATE should fail when
    1. the key is found but points to a different block (meaning the containing span isn't fully reduced);
    2. the block is NAME-locked (this means its safe to leave name-locks around, the worst that can happen is they cause block deletes to be deferred some);
    3. someone else is doing a PARENT-DELETE-UPDATE on this block (this can be guaranteed by having the delete process first name-lock the block being deleted;
    4. some block needing to be modified (parent or prev) is locked.
  3. We need to fix the code that does the NEXT-KEY operation to not stop at potentially-dead blocks. CHAIN-FIND need NOT ignore blank blocks. In practice, we'll want to trigger the update routines at the normal times as well, ie. try insert-updates after block splits and delete-updates after a block becomes empty. There are a number of new statistics we should keep; these include: (The number of chain-forwards is also a measure of the overhead.) [Note: this mechanism also supports a simple queuing method: keep a list of the block numbers at which updates were deferred, and in times of low usage do FINDs to them, which will update them as a side-effect ...]

------------

[Other strategies were considered but were either significantly more complex or over-heady, or introduced unnecessary delays:

One hack is to serialize the queue of postponed index INSERT and DELETE operations by brute force: to do them in exactly the order they arrived in. Possibly simpler alternative method: sort by key and timestamp them!

I think we also decided that the interpreter could simply devote a process to each deferred operation, since we want to shift resources toward accomplishing the deferred operations if too many queue up.

[WANG] maintains a queue of deleted operation on the DESTINATION block. This has the disadvantage that whenever a block is split or merged the deferred-operation queues have to be split or merged as well -- ugh!.]

Buffer, I/O, and Free-List Management

Reclaiming Buffers

age vs. level vs. dirtiness

This is a perennial nuisance -- There is a complex system in place which hasn't really been evaluated or tuned, and there is also a proposal by Jonathan to simply use the first free (or free-able) buffer one finds.

update-access

It seems to me there was some case where it wasn't OK to just do a release to #f and an update from there -- but I can't think of it offhand ...

Deferred writes of data blocks

Note: this is a different sort of referral from the "deferred index updates" discussed earlier. Those deferred operations were correctness- preserving; these are not. The idea here is that we can reduce i/o traffic if we can safely lose a "small" number of updates.

Description and purpose

The general idea here is that we can reduce I/O traffic by deferring writes of leaf blocks, in the sense that the updates can be lost without compromising the structure of the database. This only works where the data updates in question are not critical to database or application. Currently, we always defer leaf updates -- both PUTs and REMoves -- to data trees, where a data tree is any tree not of type DIRECTORY. (DIRECTORY leaf blocks are written to disk after every update.) The idea is that the user application should have control over how often the data blocks are written.

Also, referral of PUTs and DELETES should be separately controllable. This feature is needed for example by the database itself in maintaining the free list: we can afford to defer INSERTS of deleted block, because the worst that could happen is that a free block gets lost. But DELETES from the free list must update immediately, else a block could be doubly-allocated.

Implementation

The handle field `wcb' has the following boolean values:

  1. SAP: save block after PUTs
  2. SAR: save block after REMOVEs
  3. SAC: force block save after cached block changes
  4. FAC: flush buffer entirely after cached block changes (not currently implemented -- future functionality)

These bits are set as follows:

DIRECTORY
SAP=SAR=1;
FREE LIST
SAR=1; SAP=SAC=0;
USER DATA
SAP=SAR=SAC=0;

The state of these bits can be changed at any time using (han:set-wcb! han new-bits). Directory trees force wcb-sap and wcb-sar when opened or created. Also, open_seg forces the free list write control bits to be as shown above, regardless of the block type of the free-list.

Calling flush-ents flushs some modified blocks. It is thread safe and can be called by a timer interrupt.

Caching of last (leaf) block used

Works great

NOT IMPLEMENTED YET: We could actually be cleverer and cache the parents of every node, and even the PREVs, using the same TRY-GET-ENT protocol!

Multiple Read Access

Multiple READ access is not supported yet. READ-READ conflicts can occur. We seem to recall noting that this limitation kept us from encountering some other problem, whose identity is lost for the moment.

Free-block management

Outline:

Current problems:

Last week (1/8) we concluded that (all efforts to date notwithstanding) there were still failure cases in free-block management under concurrent operation, because of problems like:

On the positive side, we realized that we NEED NOT allow enough free blocks for a free-list insert to split the whole tree -- any split except the leaf split can simply be postponed if there isn't a free block available at that time! (And similarly for any insert-updates that happen to be triggered by free-list accesses.)

Come to think of it, if we happen to run out of blocks during a free-list insert, its OK to let the insert fail, we just lose one disk block!! That may just be the answer!!

Buffer management routines

Two routines have been built for purging buffers.

FLUSH-BUFFER(ENT) writes out ENT if it is dirty and unlocked. It returns TERMINATED if ENT is locked, RETRYERR if the write is attempted and fails, and SUCCESS otherwise.

PURGE-BUFFER(ENT) writes out ENT if it is dirty and then frees up the buffer. This IGNORES the access status of the buffer, so it should not be called by users; it always returns SUCCESS.

Use (DO-SEG-BUFFERS SEG# FUNC) to apply a function to all the buffers of a given segment; for example, (DO-SEG-BUFFERS SEG# FLUSH-BUFFER) can be used to guarantee that segment SEG#'s disk file is up to date. DO-SEG-BUFFERS halts if FUNC returns other than SUCCESS; the result of FUNC is returned. SUCCESS is returned if all buffers have been successfully processed. To process all segments, use SEG #=-1.

(CHECK-BUFFER ENT) checks that the buffer is written and unlocked, and repairs those that are not.

Other issues to document

Error Handling

The problem: Calls need to return adequate information to distinguish:

  1. restartable operations, for example
  2. non-restartable failures (eg. disk i/o error)
  3. null results (eg. key not found)
  4. "real values", eg. the length of a returned string, or an ENT pointer vs. NULL.

The solution: use a bounded range of negative values as "failure" codes, leaving the non-negative return values available as "success" codes. The canonical success code is 0, but a routine that needs to return a value (string length, ENT pointer) can do so and have that value interpreted as a "success" code as well.

There are "degrees" of failure, and the negative codes are partitioned according to increasingly severity, as follows:
value C name Meaning
0 success successful execution
-1 notpres successful execution, no data present or no change made
-2 terminated failure, no damage, caller can retry operation
-10 retryerr failure, no damage, caller can retry operation
-15 argerr failure, no damage, call was in error
-20 noroom failure, no damage, out of room in file
-30 typerr failure, file or object was not of correct type
-40 ioerr i/o error, DB may be damaged
-45 strangerr internal error, DB may be damaged
-90 unkerr placeholder code
-100 maxerr

The first class represent operations that completed without error. The second class represent operations that failed to complete, but are guaranteed to leave the DB in a correct state and are retry-able (or easily correctable). The third class represent operations that failed to complete, did not damage the database, but are not easily fixable or restartable. The last class represent error conditions in which the DB was corrupted, or during which DB corruption was detected.

The predicate (ERR? code) returns #t if the return code is within the range NOTPRES-MAXERR; the predicate (REALERR? code) returns #t if CODE is an actual error, as opposed to a "not there" or "stop processing" message.

Longer Value Fields

The 256.B length limit for value strings was a barrier to many possible applications for WB. The db:get and db:put! procedures in `db.c' work with value strings up to 64770.B in length (see section SCM Record Operations).

For a value string of length L:

0.B <= L <= 255.B
The value string is stored literally with given key.
256.B <= L <= 64770.B
The first 255 bytes of the value string is stored literally with the given key. Successive 255 byte chunks of the value string are stored as L/255-1 sequential key-value pairs. The key for each chunk is the given key with the index byte 1 <= k <= 254 appended to it. Note that use of bt:scan is complicated by long value strings.

Proposed Extensions

The rest are proposed extensions to unlimited value string length.

256.B <= L <= bsiz - 20
The b-tree value fields points to a type datalong (1) block, which contains the value. The bsiz argument to make_seg() or block-size argument to make-seg is the size of all WB blocks (pages) in that segment.
bsiz < L
Value points to head of a chain or tree of datalong blocks. Retrieved value assembly limits practical size. Storage efficiency is good for L >> bsiz. An interesting variant is to have two trees, one for datalongs and the other for everything else. If the two trees are in separate segments (stored in separate files), then the datalong segment blocksize can be optimized without impacting speed for non-datalong operations.
bsiz < L
Value designates external file containing data. storage efficiency is good for L >> bsiz. Retrieved value needs no assembly; use mmap().

Unlimited Length Keys and Values

by Jonathan Finger

Use 2 btrees which I will refer to as main and slave

notation D[123,4567] will translate into a string D \3 1 2 3 \4 4 5 6 7 where \3 is ascii 3 etc (length of following string of digits) this will result in keys sorting in numerical order.

  1. Data (the easy part) The first byte of data is a flag. IF the data length < 255 THEN store in bytes 1 - (length_of_data + 1) ELSE In slave btree increment value in key D (D for data) As an example assume the new value is 2357 and the data is 10,000 bytes
    Store the data in
    D[2357,0] = first 250 bytes
    D[2357,1] = second 250 bytes
      .
      .
      .
    D[2357,249] = last 250 bytes
    
    When a new value is set, if it is > 254 bytes in length then you reuse the D[2357] names adding or deleting as needed. Note that it is much more efficient to overwrite existing nodes since (most) of them will be the same size and add or delete at the end. If the new value < 255 bytes delete D[2357] and store the value in the master btree.
  2. Keys (the hard part) IF key < 250 bytes THEN Store in master file as usual ELSE (key >= 250 bytes) break up the key into chunks C0 ... CN; the first 250 bytes long and the rest 220 bytes long. Create the entries (assume current value of key V is 243)
    Master btree
    C0 = (flag)244
    Slave btree
    V[244,C1] = 245
    V[245,C2] = 246
        .
        .
        .
    V[244 + N,CN] = data
    
    A get now consistes of breaking up the key and following the chain. The put, delete, next, and prev code will get a bit messy since there will be multiple cases to consider.

I believe this scheme will give unlimited key and data length. Performance probably will not be great but may be acceptable. This should give good key compression.

To Be Done

I think most of the items in RJZ's list have been done.

RJZ Modified 4/8/1993

B-tree maintenacne
I/O
Concurrency
Error handling
Miscellany
Design/Documentation

Miscellany

"Largest keys" (End-of-chain marker strings).
Because we've reserved keys with a first byte of FF for split keys, these keys are unavailable to the user.
NULL keys
WB supports use of the null string as a key (Sliced Bread treats the key specially).
Searching on minimum and maximum keys
Given that the null string may be used as a key, there needs to be a way to specify a "least" key such that NEXT(least) yields the first key, even if it is NULL. The special key strings with length s of -2 and -1 are provided to represent the minimum and maximum key values, respectively. These keys are only useful with NEXT and PREV; data cannot be associated with these keys.
Need to document TSCAN, TSTATS.

Bibliography

A comprehensive B-tree bibliography can be found at: http://www.informatik.uni-trier.de/~ley/db/access/btree.html

[BM72]
R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1:173-189, 1972.
[SAGIV]
Yehoshua Sagiv. Concurrent Operations on B*-trees with Overtaking. JCSS 33(2): 275-296 (1986)
[WANG]
W.E. Weihl and P. Wang. Multi-version memory: Software cache management for concurrent B-trees. In Proc. 2nd IEEE Symp. Parallel and Distributed Processing, 650-655, 1990.


[Contents]   [Back]   [Prev]   [Up]   [Next]   [Forward]