B-tree and UB-tree
|Rudolf Bayer (2008), Scholarpedia, 3(11):7742.||doi:10.4249/scholarpedia.7742||revision #137338 [link to/cite this article]|
The B-tree is a dynamic high performance data structure to organize and manage large datasets which are stored on pseudorandom access devices like disks, (Bayer and McCreight 1972).
The UB-tree is a multidimensional generalization of the B-tree.
Invented in 1969, B-trees are still the prevailing data structure for indexes in relational databases and many file systems (Comer 1979), (Weikum and Vossen 2002). Large means that the index is too large for main memory and must be stored on a secondary store like a hard disk (HD). Only a small amount of main memory is needed for caching the upper levels of a B-tree and the search path through the tree.
An index is an ordered set of index elements of the form (x, α), where x is the key, like a name, and α the associated information. α typically is the desired information itself, if it is very short like a phone number, or a pointer to it (like a universal resource locator, URL), in which case one additional level of indirection is needed. The index is ordered by some order on the key set, like the alphabetic order on names.
The secondary store is assumed to provide direct access to chunks of data (disk blocks or Web-pages), if their reference, e.g. HD-address or URL, is known. Data retrieval happens in such chunks of data.
With B-trees, data retrieval is very fast and nearly independent of the size of the dataset, since the index and therefore the retrieval time grows only logarithmically with the size of the dataset. The base of that logarithm is very large, at least 1.000 for today’s storage architectures. In practical terms this means: Consider an index for all pages in the Web, presently estimated at 1011 pages. Even if the Web would grow by a factor of 1.000, this means that the corresponding retrieval time would only grow by one disk access, and there would be no noticeable slowdown in the access time.
If the world was perfectly static, such a performance could also be achieved with other indexing techniques. However, operational data of industrial enterprises or public institutions or the Web are highly dynamic and change quickly. The B-tree can easily cope with that, since it is a self-organizing structure, which reorganizes itself with every insertion or deletion of data. Therefore, it allows permanent continuous processing without any interruptions for reorganization.
Dynamics of B-trees
B-trees are multiway search trees such that preorder traversal yields the keys in increasing (or decreasing) order. Fig. 1 shows an example of a B-tree with numbers as keys. To find a key x and the associated data, one proceeds from the root and retrieves on each level that child node, which leads towards x. Since B-trees typically have a height of only 3 to 5, at most 5 nodes (resp. disk blocks) must be retrieved in the worst case. In computing environments with high access rates the standard caching strategies of the operating system tend to reduce that access by at least 2 HD-accesses. From a practical point of view B-trees, therefore, guarantee an access time of less than 10 ms even for extremely large datasets.
Insertion and deletion
The essential idea to understand B-trees is the insertion algorithm, in particular the split process: The nodes of a B-tree have a fixed capacity to contain up to 2k (k is a natural number) index elements and 2k+1 pointers to child nodes. The optimal k is determined by complexity analysis.
Starting with an empty B-tree, one creates a root node and inserts elements, which are kept sorted in the node. In general, when a node is full, i.e. contains 2k elements, and the next element must be inserted into it, the node is split into two half full nodes, each containing k elements, and the middle of the of 2k+1 elements (together with a pointer to the new node) is inserted into the parent node of the split node in proper sort order. This splitting process may propagate recursively on the path towards the root. If even the root must be split, a new root is created containing only the one middle element and pointers to its two children. Such a root split is the only event which increases the height of a B-tree, and obviously is extremely rare.
Deletion of an index element may cause a node P to contain only k-1 elements. If a sibling node Q (left or right) of P contains only k elements, P and Q are merged. Otherwise P and Q are merged only briefly (now containing more than 2k elements) in main memory and are immediately split again in the middle.
Figure 1 shows an example of a B-tree of height 2 and with k=2. For simplicity we use natural numbers as keys and omit showing the associated information, since it has no influence on the structure of the B-tree. Observe that all leaf nodes are exactly on the same level and all paths from the root to a leaf node have the same length. This is an obvious consequence of the split algorithm.
Now, assume that keys 3 and 19 are inserted into the B-tree of Fig. 1. They end up in the first and fourth leaf nodes without modification of the structure of the B-tree. Then, inserting key 9 into the tree results in a split of the second leaf node, the middle key 8 is moved to its parent, the root, causing a split of the root. The resulting B-tree is shown in Figure 2.
Deletion of the keys just inserted would result in the original tree. Observe that tree transformations involve only nodes along the search path, and at most two nodes per level are involved. Thus, the insertion and deletion effort is bounded by the height of the tree, but in most cases it is much smaller. The given example shows the worst case with modifications along the complete search path.
Properties of B-trees
From the insertion and deletion algorithms the following properties of B-trees can be derived. They result in a static definition of a non empty B-tree. Let h and k be natural numbers. Then a B-tree in class τ(k,h) is defined as follows:
- Each path from the root to any leaf has the same length h (the height of the tree)
- Each node except the root and the leaves has at least k+1 child nodes. The root is itself a leaf (h=1) or has at least 2 child nodes
- Each node has at most 2k+1 child nodes.
From this definition further properties of B-trees are derived, in particular tight bounds Imin and Imax for the minimal and maximal number of index elements in a B-tree: Imin = 2(k+1)h-1-1 and Imax = (2k+1)h-1
Therefore a B-tree of height 5 with k=500 contains at least 10024 > 1012 (or one trillion) and up to 1015 index elements. In comparison, Google presently indexes only about 40 billion (4*1010) Web-pages.
Variants of B-trees
They arise as a slight variant of B-trees as follows: When splitting a leaf node, the middle index element (x, α) remains on the right split off sibling, and only its key x is moved to the parent node. The upper levels of the B+-tree behave exactly like an original B-tree.
This has the practical consequence, that the intermediate nodes contain less data, since the associated information α is not needed there, and therefore they have a higher branching degree for a given node capacity resp. size of a HD-page. This results in an even shallower tree and in improved performance for a very small increase in storage overhead.
For k=1 a node contains at most 2 keys and 3 pointers to children. Such a B-tree is structurally identical to the so-called 2-3-trees or red/black trees. If the node contains only one key and pointers to two children, it looks like the node of a standard binary search tree. If it contains 2 keys and 3 pointers, the node looks like the left tree in Figure 3. One may introduce an additional pointer and may represent such a node as the node of a binary tree like the right part of Figure 3.
Now, each node contains 2 pointers and the B-tree has turned into a binary tree. The horizontal pointer plays a special role and simulates, that the 2 nodes arose from a page of the B-tree. The algorithms and properties of general B-trees may easily be transformed into those for binary B-trees, which are an excellent search structure for main memory. In addition, they have the significant SW-engineering advantage, that is the same algorithms can basically be used for main memory and secondary storage indexing.
Every node Q of a B-tree is the root of a subtree(Q) (except for the leaf nodes). Like in all search trees, all keys in subtree(Q) of a B-tree are within an interval, whose bounds are two adjacent keys in the parent node of Q. In case of lexicographic orders, that means that all keys of a node have a longest common prefix. Obviously it suffices to store this common prefix only once per node or to even derive it from the structure of the B-tree and the search path. This technique is known as head-compression or front-compression.
Similarly, to direct the search through the levels of the tree, one does not need to store the complete key, but only enough to distinguish it from its successor key, the rest is discarded. This technique is known as tail-compression or end-compression. Consider a node containing the following key words:
- Database, Datacompression, Datadefinition, Dataobject, Dataorganization, Datasource, Datastore, Datastructure,
Prefix compression with the common prefix Data yields e.g. the following node content:
- Data: base, compression, definition, object, organization, source, store, structure
If the above node is a leaf, which must be split in the middle like in a B+-tree, then it suffices to move a shortest prefix (the separator) which separates two adjacent tails to the parent node. To split between the tails object and organization above, this would be the separator or. Combining head and tail compression one needs to store only those shortest separators in the non-leaf nodes, i.e. very few characters for each key, on the average between 1 and 3 depending on the alphabet used. This technique further reduces the amount of data that must be stored in the interior nodes of a B+-tree, increasing the degree of branching, thereby reducing the height of the tree and improving performance. These are the basic ideas underlying Prefix B-trees originally described in (Bayer and Unterauer 1977).
B-trees in Relational Data Base Systems (RDBS)
B-trees as Primary Indexes
State of the art in RDBS is to use the primary key of a relation as the key for any variant of a B-tree. The data of a tuple is stored in the leaves of the B-tree, i.e. the complete table is stored in the B-tree. Alternatively, the tuple itself is stored in a separate data structure and the associated information in the B-tree is just a pointer to the tuple. In many RDBS applications, tuple access via the primary key is the most frequent retrieval operation and is made very fast by a B-tree index.
B-trees as Secondary Indexes
Arbitrary attributes (columns) of a relational table may also be indexed by a B-tree. In this case, the associated information is a set of primary key values or artificial internal identifiers, also called database identifiers, which are then used to access the tuple itself.
Both, primary and secondary indexes are ideal to answer SQL queries of the following forms quickly, where R is a relation, A is an attribute of R, and c and d are constants:
- select * from R where A = c;
- select * from R where A is between c and d;
B-trees as Join Indexes
Relational table joins result from queries like
- select * from R1, R2 where R1.A = R2.B;
Here, for a given tuple from R1 and known value for the attribute R1.A the matching tuples from the second relation R2 must be fetched. The access to R2 is obviously a point query and is very fast, if an index on the attribute B of R2 is available. Of course, this technique also works symmetrically with respect to R1 and R2.
The use of B-trees in RDBS is ubiquitous and manifold, their best use must be determined by the RDBS optimizer and is a complex task way beyond this survey.
Some relational queries can even be answered by the index itself without accessing the relation at all, e.g.
- select count(*) from Employees;
Parallel Processing with B-trees
In many applications, like banking, electronic shopping, Web or library queries, highly parallel processing within the databases is needed (Bayer and Schkolnick 1977), (Weikum and Vossen 2002). This must be compatible with parallel updates and in addition requires non-stop operation. B-trees are ideally suited for such DB-applications for the following two reasons:
- The root must be visited by every search and update process, but even for most updates it only needs to be read. Thus, the root – and maybe some of its children – are read hotspots, and therefore they are always cached in main memory by the standard caching techniques of the database system and the operating system. This allows extremely fast and highly parallel processing.
- On the other hand, most updates and structural transformations of B-trees are limited to the leaves and the lower levels of the tree, where they interfere very little.
To account for this processing scenario a variety of specialized synchronization techniques were developed for read- and update-transactions. They work with different types of locks, mostly on the node granularity, and follow locking protocols which take advantage of the special properties of B-trees. The first such synchronization protocol was developed in (Bayer and Schkolnick 1977) for System R of IBM Research. (Weikum and Vossen 2002) devotes an entire chapter to this topic.
UB-trees for Multidimensional Applications
Classical B-trees were designed for one dimensional, linearly ordered key spaces. Here, B-trees are excellent for point queries and interval queries. But many applications are multidimensional, e.g. geographic maps, (2-dim), GPS data (3-dim, because of altitude in addition to latitude and longitude), or Data Warehouse (DWH) queries like asking for
- the sales in a geographic area for a certain product group in a certain month (4-dim).
Range queries in such spaces correspond to multidimensional rectangles, and the multidimensional points in those rectangles must be retrieved.
By a mathematical transformation, such multidimensional data spaces can be linearized and then represented in ordinary B-trees. The overall resulting data structure is called UB-tree for Universal B-tree (Markl et al. 2000),(Website for UB-trees). The transformations used are space filling curves, and the linear order of the multidimensional keys is the order of the points on that curve.
Two examples of such curves are the Peano-curve or Z-curve and the Hilbert-curve. The order defined by the Z-curve is called Z-order. Points in multidimensional space are transformed to their one dimensional Z-coordinate and inserted into the B-tree using the Z-order. Therefore, the points in the node of a B-tree correspond to an interval on the Z-curve. An interval of the Z-curve covers a multidimensional subspace, a so-called Z-region. Therefore, a B-tree node corresponds exactly to a Z-curve interval and also to a Z-region. Figure 4 shows the interval [4:20] of the Z-curve (numbering starts with 0 in the left upper corner) in a 2-dimensional 8x8 space and the region covered by it with its characteristic shape.
Now, let us look at a multidimensional range query: To answer such a query, only those UB-tree nodes must be retrieved, whose Z-regions intersect the query box. There are basically two problems to solve, the mathematical algorithms
- to compute the Z-coordinate of a multidimensional point and
- to determine those UB-tree nodes, whose Z-regions intersect the query box specified by the range query.
These two algorithms are complicated, but they have linear running time, i.e. they are very efficient. In addition, both tasks can be solved by preprocessing the data before accessing the B-tree index. Therefore, multidimensional indexing can be done efficiently and easily by using an existing B-tree implementation, e.g. in a RDBS: One simply adds a preprocessor which performs the above two tasks, runs completely in main memory, and does not cause any I/O overhead. The combination of a standard B-tree with an appropriate preprocessor results in the UB-tree.
Figure 5 shows a coarse example of a very small database with a query box, which is intersected by only 5 regions.
A further consequence is: The set of nodes which must be retrieved is proportional to the set of datapoints in the query box, i.e. to the size of the answer for a query, and therefore it is essentially independent of the size of the overall database. This is illustrated intuitively by Figure 6: the region pattern results from a database with 50.000 points. The red regions are the only ones which must be retrieved from the database, since they intersect the querybox. For details see (Web Site for UB-trees).
All Relational Databases are Multidimensional
So far we only discussed multidimensional point data. An important observation is that all relational databases are multidimensional. To explain this view, consider an entity/relationship model with two entities, e.g. Authors and Books. Since some authors wrote several books and some books have several authors, there is an n:m relationship wrote between Authors and Books. The standard representation of such a model in a relational database is as follows:
- Use a table, also called Authors, for the entity Authors, with primary key A
- Use a table, also called Books, for the entity Books, with primary key B
- Use a table, also called wrote, for the n:m relationship wrote, with foreign keys A of Authors and B of Books.
Thus wrote has the composite key (A, B) as its primary key, and its tuples can obviously be considered as points in a 2-dimensional dataspace. Whenever relational queries must join the two tables Authors and Books via the table wrote, typically 2-dimensional range queries on the relation wrote result.
Extended Multidimensional Objects
To index extended multidimensional objects of arbitrary shape, a simple transformation does the trick: surround the object by a tight multidimensional rectangle R, the bounding box. In case of a d-dimensional space, transform the bounding box into a point in 2d-dimensional space, also called the dual space, and store this point in an UB-tree. Figure 7 shows the transformation of several intervals with coordinates [x, y]. Each interval in one-dimensional space is transformed into points (x, y) in 2-dimensional space. Since x<y, these points are all located in the subspace below the diagonal. Range queries for containment, exclusion or intersection with objects in d-space are transformed into range queries in 2d-space analogously. Details are found on (Web Site for UB-trees).
- Bayer, R., McCreight, E.: Organization and Maintenance of Large Ordered Indexes. Acta Informatica 1, 3 (1972), pp. 173- 189
- Weikum G., Vossen, G.: Transactional Information Systems. Morgan Kaufmann Publishers 2002, ISBN 1-55860-508-8
- Bayer, R., Schkolnick M.: Concurrency of Operations on B-trees. Acta Informatica 9,1 (1977), pp.1-21
- Comer, D.: The Ubiquitous B-tree. ACM Computing Surveys 11, 1 (June 1979), pp. 121-137
- Web Site for UB-trees: http://mistral.in.tum.de/
- Markl, V. et al.: Integrating the UB-Tree into a Database System Kernel, Proceedings of the Conference on very large databases (VLDB), Cairo, Egypt, September 2000.
- Howard Eichenbaum (2008) Memory. Scholarpedia, 3(3):1747.
- Hermann Haken (2008) Self-organization. Scholarpedia, 3(8):1401.
- Arkady Pikovsky and Michael Rosenblum (2007) Synchronization. Scholarpedia, 2(12):1459.
- Gray, J., Reuter, A.: Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers, San Mateo 1993, ISBN 1-555860.
- Ramakrishnan, R.: Database Management Systems. McGraw-Hill 1998, ISBN 0-07-050775-9
- Zhang, R. et al.: Generalized Multidimensional Data Mapping and Query Processing. ACM Transactions on Database Systems, 30(3), Sept. 2005, pp. 661-697.
- Bayer, R., Unterauer, K.: Prefix B-Trees. ACM Transactions on Database Systems 2, 1 (March 1977), pp. 11-26.
- Kemper, A., Eickler, A.: Datenbanksysteme. Oldenbourg Verlag, München 2001, ISBN 3-486-25706-4
- Bayer, R.: The universal B-tree for multidimensional indexing: General Concepts. World Wide Computing and its Applications ‘97, Tsukuba 1997, pp. 10-11.