Hash Tables & Caching
We’ve covered ordered data structures like arrays, linked lists, and trees, which offer varying trade-offs for different operations. Now, we introduce Hash Tables, a data structure renowned for its incredible average-case O(1) performance for insertion, deletion, and lookup. This near-instant access makes them fundamental for high-speed applications like caching, where quickly finding data is paramount.
Hash Table (Chaining)
Keys are hashed to an index; a good hash function distributes them evenly across the buckets. Collisions are resolved by appending entries to the bucket's linked list.
🔑 What is a Hash Table?
A Hash Table (also known as a Hash Map, Dictionary, or Associative Array) is a data structure that maps keys to values. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
Core Components:
- Array (or Buckets/Slots): The underlying storage where the key-value pairs (or pointers to them) are actually stored.
- Hash Function: A function that takes a key as input and returns an integer (the hash code), which is then typically mapped to an array index. A good hash function:
- Is fast to compute.
- Distributes keys evenly across the array.
- Produces the same hash for the same input key (deterministic).
- Collision Resolution Strategy: A method to handle hash collisions – when the hash function generates the same index for two different keys.
Analogy: Imagine a giant post office with many mailboxes. Instead of having mailboxes sorted alphabetically, you have a sorting machine (hash function) that tells you exactly which mailbox number (index) to put a letter into based on the recipient’s name. If two names end up in the same mailbox (collision), there’s a specific rule for finding them inside that mailbox.
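The hash-function properties above (fast, evenly distributed, deterministic) can be sketched in a few lines. The names here are illustrative; production tables use stronger functions such as FNV-1a or SipHash.

```python
def simple_hash(key: str) -> int:
    """Deterministic: the same key always yields the same hash code."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF  # keep the value in 32 bits
    return h

def bucket_index(key: str, capacity: int) -> int:
    """Map the hash code onto an array index."""
    return simple_hash(key) % capacity

# The same key always lands in the same bucket:
print(bucket_index("alice", 8) == bucket_index("alice", 8))  # True
```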
💥 Hash Collisions & Resolution
The ideal hash function would map every unique key to a unique index. In reality, with a finite-sized array and a vastly larger space of possible keys, this is impossible (Pigeonhole Principle). Thus, hash collisions are inevitable.
A hash collision occurs when two different keys hash to the same index in the array. Handling these collisions efficiently is crucial for a hash table’s performance.
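The pigeonhole argument is easy to demonstrate: with more keys than buckets, some pair must share an index. This sketch uses Python's built-in hash() for illustration (string hashing is randomized per process, so the colliding pair varies from run to run).

```python
capacity = 10  # only 10 buckets
seen = {}
collision = None
for i in range(100):  # 100 keys -> a collision is guaranteed
    key = f"user{i}"
    idx = hash(key) % capacity
    if idx in seen:
        collision = (seen[idx], key, idx)
        break
    seen[idx] = key

# Two different keys, same bucket index (exact pair varies per run):
print(collision)
```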
Common Collision Resolution Strategies:
- Chaining (Open Hashing):
- Each array slot (bucket) stores a pointer to a linked list (or another dynamic structure) of all key-value pairs that hash to that index.
- Insert: Hash the key, go to the index, append to the linked list.
- Search/Delete: Hash the key, go to the index, traverse the linked list to find the desired key.
- Pros: Simple, never “full,” handles many collisions gracefully.
- Cons: Requires extra space for pointers, cache performance can be worse due to non-contiguous linked list nodes.
- Open Addressing (Closed Hashing):
- All key-value pairs are stored directly within the hash table’s array. When a collision occurs, the algorithm probes (searches) for the next available empty slot in the array.
- Probe Sequences:
- Linear Probing: search (index + 1) % capacity, (index + 2) % capacity, etc.
- Quadratic Probing: search (index + 1^2) % capacity, (index + 2^2) % capacity, etc.
- Double Hashing: use a second hash function to determine the step size for probing.
- Delete: Requires “lazy deletion” (marking a slot as deleted instead of truly removing it) to not break search chains.
- Pros: Better cache performance (contiguous storage), no pointer overhead.
- Cons: Can suffer from clustering (long chains of occupied slots), table can become truly “full,” deletion is complex.
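The probing and lazy-deletion mechanics above can be sketched as follows. This is a minimal illustration with hypothetical names, not a production implementation (no resizing, no clustering mitigation).

```python
_DELETED = object()  # tombstone marker so probe chains stay intact

class ProbingTable:
    """Open addressing with linear probing and lazy deletion."""
    def __init__(self, capacity=8):
        self.slots = [None] * capacity

    def _probe(self, key):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key, value):
        first_tombstone = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                # Reuse an earlier tombstone if we passed one
                target = first_tombstone if first_tombstone is not None else i
                self.slots[target] = (key, value)
                return
            if slot is _DELETED:
                if first_tombstone is None:
                    first_tombstone = i
            elif slot[0] == key:
                self.slots[i] = (key, value)  # update existing key
                return
        if first_tombstone is not None:
            self.slots[first_tombstone] = (key, value)
            return
        raise RuntimeError("table full")

    def search(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                return None  # an empty slot ends the probe chain
            if slot is not _DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                return
            if slot is not _DELETED and slot[0] == key:
                self.slots[i] = _DELETED  # lazy deletion: mark, don't clear
                return

t = ProbingTable()
t.insert("apple", 10)
t.insert("banana", 20)
t.delete("apple")
print(t.search("apple"), t.search("banana"))  # None 20
```

Note why the tombstone matters: if delete simply cleared the slot to None, a later search for a key stored further along the probe chain would stop early at the hole and wrongly report "not found."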
Hash Collision Resolution
Strategies to handle multiple keys hashing to the same array index.
1. Chaining (Open Hashing)
2. Open Addressing (Linear Probing)
Both methods aim to maintain O(1) average time, but performance degrades with many collisions.
🚀 Hash Table Operations & Complexity
Let’s look at a conceptual implementation using chaining, as it’s often simpler to understand initially.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define TABLE_SIZE 10 // Size of our hash table array
// Node for a linked list in each bucket
typedef struct Entry {
char* key;
int value;
struct Entry* next;
} Entry;
// Hash table structure (array of linked list heads)
typedef struct {
Entry* buckets[TABLE_SIZE];
} HashTable;
// Simple hash function for strings
unsigned int hash(const char* key) {
unsigned int hash_val = 0;
while (*key != '\0') {
hash_val = (hash_val << 5) + *key++;
}
return hash_val % TABLE_SIZE;
}
// Initialize hash table
void init_hash_table(HashTable* ht) {
for (int i = 0; i < TABLE_SIZE; i++) {
ht->buckets[i] = NULL;
}
}
// Insert a key-value pair
// Insert a key-value pair (updates the value if the key already exists)
void insert_ht(HashTable* ht, const char* key, int value) {
    unsigned int index = hash(key);
    // Check for an existing key first (update value if found)
    Entry* current = ht->buckets[index];
    while (current != NULL) {
        if (strcmp(current->key, key) == 0) {
            current->value = value; // Update existing value
            return;
        }
        current = current->next;
    }
    // Not found: create a new entry
    Entry* newEntry = (Entry*) malloc(sizeof(Entry));
    if (newEntry == NULL) { /* Handle error */ exit(EXIT_FAILURE); }
    newEntry->key = strdup(key); // Duplicate the key string
    newEntry->value = value;
    // Add the new entry to the front of the linked list
    newEntry->next = ht->buckets[index];
    ht->buckets[index] = newEntry;
}
// Search for a key (returns pointer to entry or NULL)
Entry* search_ht(HashTable* ht, const char* key) {
unsigned int index = hash(key);
Entry* current = ht->buckets[index];
while (current != NULL) {
if (strcmp(current->key, key) == 0) {
return current;
}
current = current->next;
}
return NULL;
}
// Delete a key
void delete_ht(HashTable* ht, const char* key) {
unsigned int index = hash(key);
Entry* current = ht->buckets[index];
Entry* prev = NULL;
while (current != NULL) {
if (strcmp(current->key, key) == 0) {
if (prev == NULL) { // Deleting the first node in the list
ht->buckets[index] = current->next;
} else {
prev->next = current->next;
}
free(current->key);
free(current);
return;
}
prev = current;
current = current->next;
}
}
// Cleanup
void free_hash_table(HashTable* ht) {
for (int i = 0; i < TABLE_SIZE; i++) {
Entry* current = ht->buckets[i];
while (current != NULL) {
Entry* next = current->next;
free(current->key);
free(current);
current = next;
}
ht->buckets[i] = NULL;
}
}
int main() {
HashTable ht;
init_hash_table(&ht);
insert_ht(&ht, "apple", 10);
insert_ht(&ht, "banana", 20);
insert_ht(&ht, "cherry", 30);
insert_ht(&ht, "date", 40); // Will likely collide with another key due to small TABLE_SIZE
Entry* found = search_ht(&ht, "banana");
printf("Found banana: %d\n", found ? found->value : -1); // 20
found = search_ht(&ht, "grape");
printf("Found grape: %d\n", found ? found->value : -1); // -1
delete_ht(&ht, "apple");
found = search_ht(&ht, "apple");
printf("Found apple after delete: %d\n", found ? found->value : -1); // -1
free_hash_table(&ht);
return 0;
}
# Python's dictionary is implemented as a hash table.
class MyHashTable:
def __init__(self):
self._data = {} # Python dict handles hashing and collisions for us
def insert(self, key, value): # O(1) average
self._data[key] = value
def search(self, key): # O(1) average
return self._data.get(key) # Returns value or None
def delete(self, key): # O(1) average
if key in self._data:
del self._data[key]
return True
return False # Key not found
def display(self):
print(self._data)
# Main for demonstration
ht = MyHashTable()
ht.insert("apple", 10)
ht.insert("banana", 20)
ht.insert("cherry", 30)
ht.insert("date", 40)
print(f"Found banana: {ht.search('banana')}") # 20
print(f"Found grape: {ht.search('grape')}") # None
ht.delete("apple")
print(f"Found apple after delete: {ht.search('apple')}") # None
ht.display() # {'banana': 20, 'cherry': 30, 'date': 40}
// JavaScript's Map object provides hash table-like functionality.
class MyHashTable {
constructor() {
this._data = new Map(); // JavaScript Map handles hashing and collisions
}
insert(key, value) { // O(1) average
this._data.set(key, value);
}
search(key) { // O(1) average
return this._data.get(key); // Returns value or undefined
}
delete(key) { // O(1) average
return this._data.delete(key); // Returns true if deleted, false if not found
}
has(key) { // O(1) average
return this._data.has(key);
}
display() {
console.log(Object.fromEntries(this._data));
}
}
// Main for demonstration
const ht = new MyHashTable();
ht.insert("apple", 10);
ht.insert("banana", 20);
ht.insert("cherry", 30);
ht.insert("date", 40);
console.log("Found banana:", ht.search("banana")); // 20
console.log("Found grape:", ht.search("grape")); // undefined
ht.delete("apple");
console.log("Found apple after delete:", ht.search("apple")); // undefined
ht.display(); // { banana: 20, cherry: 30, date: 40 }
Complexity (Hash Table Operations - Chaining):
- Insert, Search, Delete:
- Average Case: O(1) - Assuming a good hash function that distributes keys evenly. Most operations only involve hashing and a quick check of a short linked list.
- Worst Case: O(N) - If all keys hash to the same bucket (very poor hash function or pathological input), the hash table degrades to a single linked list, requiring linear traversal.
- Space Complexity: O(N) for storing N key-value pairs.
Interesting Note: The performance of a hash table is highly dependent on two factors:
- Quality of the hash function: A good hash function is crucial for uniform distribution.
- Load Factor: The ratio of the number of elements to the number of buckets (N / M). A high load factor means more collisions and longer chains (for chaining), reducing performance. When the load factor exceeds a certain threshold, hash tables are usually resized (rehashed into a larger array), which is an O(N) operation but amortized over many insertions.
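The resize policy can be sketched in a few lines. The threshold of 0.75 below is a common choice (it is what Java's HashMap uses by default), but the exact value varies by implementation.

```python
def load_factor(n_items, n_buckets):
    """N / M: items stored per bucket, on average."""
    return n_items / n_buckets

buckets = 8
items = 0
for _ in range(100):
    items += 1
    if load_factor(items, buckets) > 0.75:
        buckets *= 2  # a real table would also rehash every entry here (O(N))
print(buckets)  # 256: doubled several times to keep the load factor low
```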
⚡ Caching with Hash Tables
One of the most powerful and ubiquitous applications of hash tables is in caching. Caching is the process of storing copies of data in a temporary storage area (the cache) so that future requests for that data can be served faster.
How Hash Tables power Caches:
- Key: The unique identifier for the data you want to cache (e.g., a URL, a user ID, a database query string).
- Value: The actual data you’re caching (e.g., a web page, user object, query result).
- Fast Lookups: When a program needs data, it first checks the cache using the data’s key. The hash table’s O(1) average lookup time allows for near-instant checks.
- Fast Inserts/Updates: When new data is fetched or existing data changes, it’s quickly stored or updated in the cache’s hash table.
Example: LRU (Least Recently Used) Cache. A common caching strategy is LRU: when the cache is full, the least recently used item is evicted to make space for a new one. This is often implemented using a combination of a Doubly Linked List (to track order of use) and a Hash Table (for O(1) access to the list nodes).
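A compact LRU sketch: Python's collections.OrderedDict is internally a hash table plus a doubly linked list, so it mirrors the classic two-structure design without writing the list by hand.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now most recently used
cache.put("c", 3)   # evicts "b", the least recently used
print(cache.get("b"), cache.get("a"))  # None 1
```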
Benefits of Caching:
- Reduced Latency: Data is fetched from fast memory (RAM) instead of slow disk or network.
- Reduced Load: Fewer requests hit the original data source (database, API).
- Improved User Experience: Faster response times for users.
Practical Usages:
- Web Browsers: Cache web content (images, scripts, HTML) locally.
- Web Servers: Cache API responses or frequently accessed database query results.
- Databases: In-memory caches for frequently accessed data blocks.
- Operating Systems: Page caching, file system metadata caching.
- Memoization: Optimizing recursive functions by storing results of expensive function calls (key = function args, value = result).
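Memoization is the smallest possible caching example: the function's arguments are the key, its return value is the value. Python's functools.lru_cache wraps a function with exactly the hash-table cache described above.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache keyed by the call arguments
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Without the cache this is O(2^n); with it, each fib(k) is computed once.
print(fib(60))  # 1548008755920, returned almost instantly
```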
Note: The efficiency of caching systems often comes down to the hash table’s ability to provide fast lookups. The trade-off is memory usage. Caching typically aims to hit the sweet spot of storing enough data to be effective without consuming excessive memory that could lead to other performance problems. Caching algorithms (like LRU, LFU, FIFO) determine what to keep in the hash table when space runs out.