BSTs & Database Indexing
In the previous lesson, we explored general binary trees and their hierarchical nature. Now, we’ll dive into a specific type of binary tree with a powerful property: the Binary Search Tree (BST). BSTs are designed for efficient searching, insertion, and deletion of elements, making them fundamental for tasks like database indexing, where quick data retrieval is paramount.
Binary Search Tree
Keys in the left subtree are smaller, keys in the right subtree are larger.
Searching for 40 would follow the path: ...
🔍 What is a Binary Search Tree (BST)?
A Binary Search Tree is a node-based binary tree data structure which has the following properties:
- The left subtree of a node contains only nodes with keys less than the node’s key.
- The right subtree of a node contains only nodes with keys greater than the node’s key.
- Both the left and right subtrees must also be binary search trees.
- There must be no duplicate keys (though some implementations allow duplicates by placing them on one side consistently).
These properties ensure that an in-order traversal of a BST will yield elements in sorted order.
Analogy: Imagine a library where books are organized by a very specific rule: if a book’s title comes alphabetically before the current section’s title, you go left. If it comes after, you go right. This dramatically narrows down your search with each decision.
🚀 Core BST Operations & Complexity
Let’s look at the essential operations on a BST and their typical complexities. We’ll use recursive implementations, which are common for tree operations.
1. Search
To search for a key, you start at the root:
- If the key is equal to the current node’s data, you’ve found it.
- If the key is smaller, go to the left child.
- If the key is larger, go to the right child.
- If you reach
NULL
, the key is not in the tree.
2. Insert
To insert a new key, you search for the appropriate place where it should be:
- If the tree is empty, the new key becomes the root.
- If the key is smaller than the current node’s data, insert into the left subtree.
- If the key is larger, insert into the right subtree.
- Once an empty spot (NULL pointer) is found, create a new node there.
3. Delete
Deletion is the most complex operation, as you need to maintain the BST properties:
- Case 1: Node has no children (Leaf Node): Simply remove it.
- Case 2: Node has one child: Replace the node with its child.
- Case 3: Node has two children: Find the in-order successor (smallest node in the right subtree) or in-order predecessor (largest node in the left subtree), copy its data to the node being deleted, and then delete the successor/predecessor (which will be an easier Case 1 or 2 deletion).
#include <stdio.h>
#include <stdlib.h>
typedef struct BSTNode {
int key;
struct BSTNode *left;
struct BSTNode *right;
} BSTNode;
// Helper: Create a new BST node
BSTNode* create_bst_node(int key) {
BSTNode* newNode = (BSTNode*) malloc(sizeof(BSTNode));
if (newNode == NULL) { /* Handle error */ exit(EXIT_FAILURE); }
newNode->key = key;
newNode->left = NULL;
newNode->right = NULL;
return newNode;
}
// Helper: Find the in-order successor (smallest in right subtree)
BSTNode* find_min(BSTNode* node) {
BSTNode* current = node;
while (current && current->left != NULL)
current = current->left;
return current;
}
// Search for a key (returns node or NULL)
BSTNode* search_bst(BSTNode* root, int key) {
if (root == NULL || root->key == key)
return root;
if (key < root->key)
return search_bst(root->left, key);
return search_bst(root->right, key);
}
// Insert a new key
BSTNode* insert_bst(BSTNode* node, int key) {
if (node == NULL)
return create_bst_node(key);
if (key < node->key)
node->left = insert_bst(node->left, key);
else if (key > node->key) // Handle duplicates by not inserting
node->right = insert_bst(node->right, key);
return node;
}
// Delete a key
BSTNode* delete_bst(BSTNode* root, int key) {
if (root == NULL) return root;
if (key < root->key)
root->left = delete_bst(root->left, key);
else if (key > root->key)
root->right = delete_bst(root->right, key);
else { // Node with only one child or no child
if (root->left == NULL) {
BSTNode* temp = root->right;
free(root);
return temp;
} else if (root->right == NULL) {
BSTNode* temp = root->left;
free(root);
return temp;
}
// Node with two children: Get the in-order successor (smallest in the right subtree)
BSTNode* temp = find_min(root->right);
root->key = temp->key; // Copy successor's content to this node
root->right = delete_bst(root->right, temp->key); // Delete the successor
}
return root;
}
// In-order traversal to print BST (should be sorted)
void inorder_print(BSTNode* root) {
if (root != NULL) {
inorder_print(root->left);
printf("%d ", root->key);
inorder_print(root->right);
}
}
// Main for demonstration
int main() {
BSTNode* root = NULL;
root = insert_bst(root, 50);
insert_bst(root, 30);
insert_bst(root, 20);
insert_bst(root, 40);
insert_bst(root, 70);
insert_bst(root, 60);
insert_bst(root, 80);
printf("In-order traversal of BST: ");
inorder_print(root); // Expected: 20 30 40 50 60 70 80
printf("\n");
BSTNode* found = search_bst(root, 40);
printf("Search for 40: %s\n", found != NULL ? "Found" : "Not Found"); // Found
found = search_bst(root, 90);
printf("Search for 90: %s\n", found != NULL ? "Found" : "Not Found"); // Not Found
root = delete_bst(root, 20); // Delete leaf
printf("In-order after deleting 20: ");
inorder_print(root);
printf("\n");
root = delete_bst(root, 70); // Delete node with two children
printf("In-order after deleting 70: ");
inorder_print(root);
printf("\n");
// Free memory (simplified, full recursive free needed)
// For a robust solution, you'd typically have a function for post-order deletion.
// This is just for demonstration.
return 0;
}
class BSTNode:
def __init__(self, key):
self.key = key
self.left = None
self.right = None
class BST:
def __init__(self):
self.root = None
def _insert(self, node, key):
if node is None:
return BSTNode(key)
if key < node.key:
node.left = self._insert(node.left, key)
elif key > node.key: # No duplicates inserted
node.right = self._insert(node.right, key)
return node
def insert(self, key):
self.root = self._insert(self.root, key)
def _search(self, node, key):
if node is None or node.key == key:
return node
if key < node.key:
return self._search(node.left, key)
return self._search(node.right, key)
def search(self, key):
return self._search(self.root, key)
def _find_min(self, node):
current = node
while current.left is not None:
current = current.left
return current
def _delete(self, node, key):
if node is None:
return node
if key < node.key:
node.left = self._delete(node.left, key)
elif key > node.key:
node.right = self._delete(node.right, key)
else: # Node to be deleted found
if node.left is None: # Case 1 or 2 (no left child)
temp = node.right
node = None # For garbage collection
return temp
elif node.right is None: # Case 2 (no right child)
temp = node.left
node = None
return temp
# Case 3 (two children)
temp = self._find_min(node.right) # Find in-order successor
node.key = temp.key # Copy successor's key to this node
node.right = self._delete(node.right, temp.key) # Delete the successor
return node
def delete(self, key):
self.root = self._delete(self.root, key)
def _inorder_print(self, node, result):
if node:
self._inorder_print(node.left, result)
result.append(node.key)
self._inorder_print(node.right, result)
def inorder_print(self):
result = []
self._inorder_print(self.root, result)
return result
# Main for demonstration
bst = BST()
bst.insert(50)
bst.insert(30)
bst.insert(20)
bst.insert(40)
bst.insert(70)
bst.insert(60)
bst.insert(80)
print(f"In-order traversal of BST: {bst.inorder_print()}") # Expected: [20, 30, 40, 50, 60, 70, 80]
found_node = bst.search(40)
print(f"Search for 40: {'Found' if found_node else 'Not Found'}")
found_node = bst.search(90)
print(f"Search for 90: {'Found' if found_node else 'Not Found'}")
bst.delete(20)
print(f"In-order after deleting 20: {bst.inorder_print()}")
bst.delete(70)
print(f"In-order after deleting 70: {bst.inorder_print()}")
class BSTNode {
constructor(key) {
this.key = key;
this.left = null;
this.right = null;
}
}
class BinarySearchTree {
constructor() {
this.root = null;
}
// Insert a key
insert(key) {
const newNode = new BSTNode(key);
if (this.root === null) {
this.root = newNode;
return;
}
let current = this.root;
while (true) {
if (key < current.key) {
if (current.left === null) {
current.left = newNode;
return;
}
current = current.left;
} else if (key > current.key) { // No duplicates inserted
if (current.right === null) {
current.right = newNode;
return;
}
current = current.right;
} else { // Key already exists
return;
}
}
}
// Search for a key (returns node or null)
search(key) {
let current = this.root;
while (current) {
if (key === current.key) {
return current;
}
if (key < current.key) {
current = current.left;
} else {
current = current.right;
}
}
return null;
}
// Helper: Find the in-order successor (smallest in right subtree)
_findMinNode(node) {
if (node.left === null) {
return node;
}
return this._findMinNode(node.left);
}
// Delete a key
delete(key) {
this.root = this._deleteNode(this.root, key);
}
_deleteNode(node, key) {
if (node === null) {
return null;
}
if (key < node.key) {
node.left = this._deleteNode(node.left, key);
return node;
} else if (key > node.key) {
node.right = this._deleteNode(node.right, key);
return node;
} else { // Node to be deleted found
// Case 1: No children or one child
if (node.left === null) {
return node.right;
} else if (node.right === null) {
return node.left;
}
// Case 3: Two children
// Find in-order successor (smallest in right subtree)
const temp = this._findMinNode(node.right);
node.key = temp.key; // Copy successor's key to this node
node.right = this._deleteNode(node.right, temp.key); // Delete the successor
return node;
}
}
// In-order traversal to print BST (should be sorted)
inorderTraversal() {
const result = [];
this._inorder(this.root, result);
return result;
}
_inorder(node, result) {
if (node) {
this._inorder(node.left, result);
result.push(node.key);
this._inorder(node.right, result);
}
}
}
// Main for demonstration
const bst = new BinarySearchTree();
bst.insert(50);
bst.insert(30);
bst.insert(20);
bst.insert(40);
bst.insert(70);
bst.insert(60);
bst.insert(80);
console.log("In-order traversal of BST:", bst.inorderTraversal()); // Expected: [20, 30, 40, 50, 60, 70, 80]
let foundNode = bst.search(40);
console.log("Search for 40:", foundNode ? "Found" : "Not Found");
foundNode = bst.search(90);
console.log("Search for 90:", foundNode ? "Found" : "Not Found");
bst.delete(20);
console.log("In-order after deleting 20:", bst.inorderTraversal());
bst.delete(70);
console.log("In-order after deleting 70:", bst.inorderTraversal());
Complexity (BST Operations):
-
Search, Insert, Delete:
- Average Case: O(log N) - If the tree is reasonably balanced, each comparison halves the search space.
- Worst Case: O(N) - If the tree becomes skewed (e.g., elements inserted in strictly ascending or descending order), it degenerates into a linked list.
-
Space Complexity: O(N) for storing N nodes. (Recursive calls add O(H) space to the call stack, where H is height).
Note: The efficiency of BSTs hinges entirely on their balance. A perfectly balanced BST guarantees O(log N) operations, similar to binary search on a sorted array. However, a poorly balanced BST can perform as badly as a linked list for all operations, losing its primary advantage.
⚖️ Self-Balancing BSTs
To overcome the worst-case O(N) performance of a simple BST, specialized “self-balancing” BSTs were developed. These trees automatically perform rotations and reconfigurations during insertions and deletions to maintain a roughly logarithmic height.
Common types of self-balancing BSTs:
- AVL Trees: Ensure that for any node, the height difference between its left and right subtrees is at most 1.
- Red-Black Trees: Use color properties (red/black) on nodes to maintain balance and ensure that the longest path from the root to a leaf is no more than twice the length of the shortest path. (More complex to implement but very common).
Complexity (Self-Balancing BSTs):
- Search, Insert, Delete: Always O(log N) - The self-balancing mechanism guarantees logarithmic height.
- Space Complexity: O(N)
Note: Red-Black trees are often preferred in practice (e.g., in Java’s TreeMap
and TreeSet
, C++‘s std::map
and std::set
) because while AVL trees are more strictly balanced, Red-Black trees require fewer rotations on average during modifications, leading to better amortized performance in many real-world scenarios.
💾 Database Indexing with BSTs
One of the most crucial real-world applications of tree structures, particularly self-balancing BSTs (or their variations like B-trees and B+ trees), is database indexing.
Imagine a large database table with millions of rows. If you want to find a record by a specific column (e.g., WHERE user_id = 12345
), scanning every row (a linear scan, O(N)) would be incredibly slow.
How Indexing Works (Conceptual):
- Index Creation: When you create an index on a column (e.g.,
user_id
), the database builds a tree-like data structure (often a B-tree or B+ tree, which are multi-way search trees optimized for disk I/O). - Key-Pointer Pairs: This tree stores the indexed column’s values (keys) along with pointers (or references) to the actual rows in the table where those values are found.
- Efficient Search: When a query filters by the indexed column, the database traverses this index tree. Because it’s a search tree, it can quickly navigate from the root down to the desired key in O(log N) time (where N is the number of entries in the index).
- Direct Row Access: Once the key is found in the index, the database uses the associated pointer to jump directly to the relevant row(s) in the table, avoiding a full table scan.
Performance Comparison: O(N) vs. O(log N)
How operation time scales with input size for linear scan versus indexed search.
Benefits of Database Indexing:
- Faster Data Retrieval: Dramatically reduces query execution time for
SELECT
statements withWHERE
clauses on indexed columns. - Faster Sorting: Can speed up
ORDER BY
clauses. - Faster Joins: Improves performance of
JOIN
operations between tables.
Trade-offs/Considerations:
- Storage Overhead: Indexes consume disk space.
- Write Performance:
INSERT
,UPDATE
,DELETE
operations become slower because the database also has to update the index tree(s) in addition to the actual table data. - Choosing Indexes: Not every column needs an index. Columns frequently used in
WHERE
,ORDER BY
, orJOIN
clauses are good candidates.
Interesting Note: While conceptual BSTs are typically binary, database indexing often uses B-trees or B+ trees. These are not binary; they are multi-way search trees where a node can have many children (e.g., hundreds). This is because they are optimized for disk I/O: a single node can store many keys, meaning fewer disk reads are needed to traverse down the tree, which is crucial given the high latency of disk access.