Tree Structures Lesson 8

BSTs & Database Indexing

intermediate ⏱️ 50 minutes 🌍 Speeding up database lookups, implementing dictionaries/maps, symbol tables in compilers

In the previous lesson, we explored general binary trees and their hierarchical nature. Now, we’ll dive into a specific type of binary tree with a powerful property: the Binary Search Tree (BST). BSTs are designed for efficient searching, insertion, and deletion of elements, making them fundamental for tasks like database indexing, where quick data retrieval is paramount.

Binary Search Tree

Keys in the left subtree are smaller, keys in the right subtree are larger.

50 30 20 40 70 60 80

Searching for 40 would follow the path: ...


🔍 What is a Binary Search Tree (BST)?

A Binary Search Tree is a node-based binary tree data structure which has the following properties:

  1. The left subtree of a node contains only nodes with keys less than the node’s key.
  2. The right subtree of a node contains only nodes with keys greater than the node’s key.
  3. Both the left and right subtrees must also be binary search trees.
  4. There must be no duplicate keys (though some implementations allow duplicates by placing them on one side consistently).

These properties ensure that an in-order traversal of a BST will yield elements in sorted order.

Analogy: Imagine a library where books are organized by a very specific rule: if a book’s title comes alphabetically before the current section’s title, you go left. If it comes after, you go right. This dramatically narrows down your search with each decision.


🚀 Core BST Operations & Complexity

Let’s look at the essential operations on a BST and their typical complexities. We’ll use recursive implementations, which are common for tree operations.

To search for a key, you start at the root:

  • If the key is equal to the current node’s data, you’ve found it.
  • If the key is smaller, go to the left child.
  • If the key is larger, go to the right child.
  • If you reach NULL, the key is not in the tree.

2. Insert

To insert a new key, you search for the appropriate place where it should be:

  • If the tree is empty, the new key becomes the root.
  • If the key is smaller than the current node’s data, insert into the left subtree.
  • If the key is larger, insert into the right subtree.
  • Once an empty spot (NULL pointer) is found, create a new node there.

3. Delete

Deletion is the most complex operation, as you need to maintain the BST properties:

  • Case 1: Node has no children (Leaf Node): Simply remove it.
  • Case 2: Node has one child: Replace the node with its child.
  • Case 3: Node has two children: Find the in-order successor (smallest node in the right subtree) or in-order predecessor (largest node in the left subtree), copy its data to the node being deleted, and then delete the successor/predecessor (which will be an easier Case 1 or 2 deletion).
Binary Search Tree Operations

        
          
#include <stdio.h>
#include <stdlib.h>

typedef struct BSTNode {
  int key;
  struct BSTNode *left;
  struct BSTNode *right;
} BSTNode;

// Helper: Create a new BST node
BSTNode* create_bst_node(int key) {
  BSTNode* newNode = (BSTNode*) malloc(sizeof(BSTNode));
  if (newNode == NULL) { /* Handle error */ exit(EXIT_FAILURE); }
  newNode->key = key;
  newNode->left = NULL;
  newNode->right = NULL;
  return newNode;
}

// Helper: Find the in-order successor (smallest in right subtree)
BSTNode* find_min(BSTNode* node) {
  BSTNode* current = node;
  while (current && current->left != NULL)
      current = current->left;
  return current;
}

// Search for a key (returns node or NULL)
BSTNode* search_bst(BSTNode* root, int key) {
  if (root == NULL || root->key == key)
      return root;
  if (key < root->key)
      return search_bst(root->left, key);
  return search_bst(root->right, key);
}

// Insert a new key
BSTNode* insert_bst(BSTNode* node, int key) {
  if (node == NULL)
      return create_bst_node(key);
  if (key < node->key)
      node->left = insert_bst(node->left, key);
  else if (key > node->key) // Handle duplicates by not inserting
      node->right = insert_bst(node->right, key);
  return node;
}

// Delete a key
BSTNode* delete_bst(BSTNode* root, int key) {
  if (root == NULL) return root;

  if (key < root->key)
      root->left = delete_bst(root->left, key);
  else if (key > root->key)
      root->right = delete_bst(root->right, key);
  else { // Node with only one child or no child
      if (root->left == NULL) {
          BSTNode* temp = root->right;
          free(root);
          return temp;
      } else if (root->right == NULL) {
          BSTNode* temp = root->left;
          free(root);
          return temp;
      }
      // Node with two children: Get the in-order successor (smallest in the right subtree)
      BSTNode* temp = find_min(root->right);
      root->key = temp->key; // Copy successor's content to this node
      root->right = delete_bst(root->right, temp->key); // Delete the successor
  }
  return root;
}

// In-order traversal to print BST (should be sorted)
void inorder_print(BSTNode* root) {
  if (root != NULL) {
      inorder_print(root->left);
      printf("%d ", root->key);
      inorder_print(root->right);
  }
}

// Main for demonstration
int main() {
  BSTNode* root = NULL;
  root = insert_bst(root, 50);
  insert_bst(root, 30);
  insert_bst(root, 20);
  insert_bst(root, 40);
  insert_bst(root, 70);
  insert_bst(root, 60);
  insert_bst(root, 80);

  printf("In-order traversal of BST: ");
  inorder_print(root); // Expected: 20 30 40 50 60 70 80
  printf("\n");

  BSTNode* found = search_bst(root, 40);
  printf("Search for 40: %s\n", found != NULL ? "Found" : "Not Found"); // Found
  found = search_bst(root, 90);
  printf("Search for 90: %s\n", found != NULL ? "Found" : "Not Found"); // Not Found

  root = delete_bst(root, 20); // Delete leaf
  printf("In-order after deleting 20: ");
  inorder_print(root);
  printf("\n");

  root = delete_bst(root, 70); // Delete node with two children
  printf("In-order after deleting 70: ");
  inorder_print(root);
  printf("\n");

  // Free memory (simplified, full recursive free needed)
  // For a robust solution, you'd typically have a function for post-order deletion.
  // This is just for demonstration.
  return 0;
}
        
      

        
          
class BSTNode:
  def __init__(self, key):
      self.key = key
      self.left = None
      self.right = None

class BST:
  def __init__(self):
      self.root = None

  def _insert(self, node, key):
      if node is None:
          return BSTNode(key)
      if key < node.key:
          node.left = self._insert(node.left, key)
      elif key > node.key: # No duplicates inserted
          node.right = self._insert(node.right, key)
      return node

  def insert(self, key):
      self.root = self._insert(self.root, key)

  def _search(self, node, key):
      if node is None or node.key == key:
          return node
      if key < node.key:
          return self._search(node.left, key)
      return self._search(node.right, key)

  def search(self, key):
      return self._search(self.root, key)

  def _find_min(self, node):
      current = node
      while current.left is not None:
          current = current.left
      return current

  def _delete(self, node, key):
      if node is None:
          return node

      if key < node.key:
          node.left = self._delete(node.left, key)
      elif key > node.key:
          node.right = self._delete(node.right, key)
      else: # Node to be deleted found
          if node.left is None: # Case 1 or 2 (no left child)
              temp = node.right
              node = None # For garbage collection
              return temp
          elif node.right is None: # Case 2 (no right child)
              temp = node.left
              node = None
              return temp
          
          # Case 3 (two children)
          temp = self._find_min(node.right) # Find in-order successor
          node.key = temp.key # Copy successor's key to this node
          node.right = self._delete(node.right, temp.key) # Delete the successor

      return node

  def delete(self, key):
      self.root = self._delete(self.root, key)

  def _inorder_print(self, node, result):
      if node:
          self._inorder_print(node.left, result)
          result.append(node.key)
          self._inorder_print(node.right, result)

  def inorder_print(self):
      result = []
      self._inorder_print(self.root, result)
      return result

# Main for demonstration
bst = BST()
bst.insert(50)
bst.insert(30)
bst.insert(20)
bst.insert(40)
bst.insert(70)
bst.insert(60)
bst.insert(80)

print(f"In-order traversal of BST: {bst.inorder_print()}") # Expected: [20, 30, 40, 50, 60, 70, 80]

found_node = bst.search(40)
print(f"Search for 40: {'Found' if found_node else 'Not Found'}")
found_node = bst.search(90)
print(f"Search for 90: {'Found' if found_node else 'Not Found'}")

bst.delete(20)
print(f"In-order after deleting 20: {bst.inorder_print()}")

bst.delete(70)
print(f"In-order after deleting 70: {bst.inorder_print()}")

        
      

        
          
class BSTNode {
  constructor(key) {
      this.key = key;
      this.left = null;
      this.right = null;
  }
}

class BinarySearchTree {
  constructor() {
      this.root = null;
  }

  // Insert a key
  insert(key) {
      const newNode = new BSTNode(key);
      if (this.root === null) {
          this.root = newNode;
          return;
      }

      let current = this.root;
      while (true) {
          if (key < current.key) {
              if (current.left === null) {
                  current.left = newNode;
                  return;
              }
              current = current.left;
          } else if (key > current.key) { // No duplicates inserted
              if (current.right === null) {
                  current.right = newNode;
                  return;
              }
              current = current.right;
          } else { // Key already exists
              return;
          }
      }
  }

  // Search for a key (returns node or null)
  search(key) {
      let current = this.root;
      while (current) {
          if (key === current.key) {
              return current;
          }
          if (key < current.key) {
              current = current.left;
          } else {
              current = current.right;
          }
      }
      return null;
  }

  // Helper: Find the in-order successor (smallest in right subtree)
  _findMinNode(node) {
      if (node.left === null) {
          return node;
      }
      return this._findMinNode(node.left);
  }

  // Delete a key
  delete(key) {
      this.root = this._deleteNode(this.root, key);
  }

  _deleteNode(node, key) {
      if (node === null) {
          return null;
      }

      if (key < node.key) {
          node.left = this._deleteNode(node.left, key);
          return node;
      } else if (key > node.key) {
          node.right = this._deleteNode(node.right, key);
          return node;
      } else { // Node to be deleted found
          // Case 1: No children or one child
          if (node.left === null) {
              return node.right;
          } else if (node.right === null) {
              return node.left;
          }

          // Case 3: Two children
          // Find in-order successor (smallest in right subtree)
          const temp = this._findMinNode(node.right);
          node.key = temp.key; // Copy successor's key to this node
          node.right = this._deleteNode(node.right, temp.key); // Delete the successor
          return node;
      }
  }

  // In-order traversal to print BST (should be sorted)
  inorderTraversal() {
      const result = [];
      this._inorder(this.root, result);
      return result;
  }

  _inorder(node, result) {
      if (node) {
          this._inorder(node.left, result);
          result.push(node.key);
          this._inorder(node.right, result);
      }
  }
}

// Main for demonstration
const bst = new BinarySearchTree();
bst.insert(50);
bst.insert(30);
bst.insert(20);
bst.insert(40);
bst.insert(70);
bst.insert(60);
bst.insert(80);

console.log("In-order traversal of BST:", bst.inorderTraversal()); // Expected: [20, 30, 40, 50, 60, 70, 80]

let foundNode = bst.search(40);
console.log("Search for 40:", foundNode ? "Found" : "Not Found");
foundNode = bst.search(90);
console.log("Search for 90:", foundNode ? "Found" : "Not Found");

bst.delete(20);
console.log("In-order after deleting 20:", bst.inorderTraversal());

bst.delete(70);
console.log("In-order after deleting 70:", bst.inorderTraversal());

        
      

Complexity (BST Operations):

  • Search, Insert, Delete:

    • Average Case: O(log N) - If the tree is reasonably balanced, each comparison halves the search space.
    • Worst Case: O(N) - If the tree becomes skewed (e.g., elements inserted in strictly ascending or descending order), it degenerates into a linked list.
  • Space Complexity: O(N) for storing N nodes. (Recursive calls add O(H) space to the call stack, where H is height).

Note: The efficiency of BSTs hinges entirely on their balance. A perfectly balanced BST guarantees O(log N) operations, similar to binary search on a sorted array. However, a poorly balanced BST can perform as badly as a linked list for all operations, losing its primary advantage.


⚖️ Self-Balancing BSTs

To overcome the worst-case O(N) performance of a simple BST, specialized “self-balancing” BSTs were developed. These trees automatically perform rotations and reconfigurations during insertions and deletions to maintain a roughly logarithmic height.

Common types of self-balancing BSTs:

  • AVL Trees: Ensure that for any node, the height difference between its left and right subtrees is at most 1.
  • Red-Black Trees: Use color properties (red/black) on nodes to maintain balance and ensure that the longest path from the root to a leaf is no more than twice the length of the shortest path. (More complex to implement but very common).

Complexity (Self-Balancing BSTs):

  • Search, Insert, Delete: Always O(log N) - The self-balancing mechanism guarantees logarithmic height.
  • Space Complexity: O(N)

Note: Red-Black trees are often preferred in practice (e.g., in Java’s TreeMap and TreeSet, C++‘s std::map and std::set) because while AVL trees are more strictly balanced, Red-Black trees require fewer rotations on average during modifications, leading to better amortized performance in many real-world scenarios.


💾 Database Indexing with BSTs

One of the most crucial real-world applications of tree structures, particularly self-balancing BSTs (or their variations like B-trees and B+ trees), is database indexing.

Imagine a large database table with millions of rows. If you want to find a record by a specific column (e.g., WHERE user_id = 12345), scanning every row (a linear scan, O(N)) would be incredibly slow.

How Indexing Works (Conceptual):

  1. Index Creation: When you create an index on a column (e.g., user_id), the database builds a tree-like data structure (often a B-tree or B+ tree, which are multi-way search trees optimized for disk I/O).
  2. Key-Pointer Pairs: This tree stores the indexed column’s values (keys) along with pointers (or references) to the actual rows in the table where those values are found.
  3. Efficient Search: When a query filters by the indexed column, the database traverses this index tree. Because it’s a search tree, it can quickly navigate from the root down to the desired key in O(log N) time (where N is the number of entries in the index).
  4. Direct Row Access: Once the key is found in the index, the database uses the associated pointer to jump directly to the relevant row(s) in the table, avoiding a full table scan.

Performance Comparison: O(N) vs. O(log N)

How operation time scales with input size for linear scan versus indexed search.

1 10 100 1K 10K Input Size (N) High Medium Low Time/Operations O(N) - Linear Scan O(log N) - Indexed Search

Benefits of Database Indexing:

  • Faster Data Retrieval: Dramatically reduces query execution time for SELECT statements with WHERE clauses on indexed columns.
  • Faster Sorting: Can speed up ORDER BY clauses.
  • Faster Joins: Improves performance of JOIN operations between tables.

Trade-offs/Considerations:

  • Storage Overhead: Indexes consume disk space.
  • Write Performance: INSERT, UPDATE, DELETE operations become slower because the database also has to update the index tree(s) in addition to the actual table data.
  • Choosing Indexes: Not every column needs an index. Columns frequently used in WHERE, ORDER BY, or JOIN clauses are good candidates.

Interesting Note: While conceptual BSTs are typically binary, database indexing often uses B-trees or B+ trees. These are not binary; they are multi-way search trees where a node can have many children (e.g., hundreds). This is because they are optimized for disk I/O: a single node can store many keys, meaning fewer disk reads are needed to traverse down the tree, which is crucial given the high latency of disk access.