Bloom Filters & Web Crawling
We’ve explored deterministic data structures that give precise answers (e.g., “Is this item in the set? Yes/No”). Now, we introduce Bloom Filters: a fascinating probabilistic data structure that tells you “possibly in the set” or “definitely not in the set.” This trade-off (accepting a small chance of error) enables incredibly space-efficient and time-efficient membership testing, making them invaluable for massive-scale applications like Web Crawling.
Bloom Filter Mechanics
Uses multiple hash functions and a bit array for probabilistic membership testing.
Bit Array (`m=20` bits)
Hash Functions (`k=4` functions)
False positives are possible (all bits set by chance). False negatives are never possible.
🧐 What is a Bloom Filter?
A Bloom Filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
Key Properties:
- Space-Efficient: Uses significantly less memory than traditional hash sets or other data structures for the same number of elements.
- Time-Efficient: Membership queries (check operations) are very fast, O(k), where
k
is the number of hash functions, regardless of the number of elements in the filter. - Probabilistic:
- False Positives are Possible: It might claim an element is in the set when it’s actually not. This is known as a false positive probability (FPP).
- False Negatives are NOT Possible: It will never claim an element is not in the set if it actually is.
- Deletion is Not Supported: Elements cannot be reliably removed from a standard Bloom Filter because removing bits would inadvertently affect other elements.
Core Components:
- Bit Array (m bits): A fixed-size array of
m
bits, all initially set to 0. - Hash Functions (k hash functions): A set of
k
independent hash functions, each mapping an element to a position within them
-bit array. These hash functions should ideally distribute keys uniformly.
Analogy: Imagine a library where when you borrow a book, you put k
sticky notes on k
different pages chosen by some rule (hash functions). When someone asks if a book is in the library, you check those k
pages. If all k sticky notes are there, you say “probably yes” (it’s either the book or a false positive). If any sticky note is missing, you say “definitely no.” You never remove sticky notes, so if you return a book, its sticky notes stay, potentially indicating a book that isn’t truly there anymore.
🛠️ Bloom Filter Operations
1. Add an Element
To add an element x
to the Bloom Filter:
- Feed
x
to each of thek
hash functions. - Each hash function generates an index (a position in the bit array).
- Set the bits at all
k
generated indices to 1.
2. Check for Membership
To check if an element y
is in the Bloom Filter:
- Feed
y
to each of thek
hash functions. - Each hash function generates an index.
- Check the bits at all
k
generated indices. - If all
k
bits are 1: The element might be in the set (a “possible positive” or “true positive” or “false positive”). - If any of the
k
bits is 0: The element is definitely not in the set (a “true negative”).
import mmh3 # MurmurHash3, a good non-cryptographic hash function
from math import log, ceil
class BloomFilter:
def __init__(self, capacity, false_pos_prob):
"""
:param capacity: Max number of items expected to be stored in bloom filter
:param false_pos_prob: The desired false positive probability
"""
self.capacity = capacity
self.false_pos_prob = false_pos_prob
# Size of bit array 'm'
self.m = self.get_size(capacity, false_pos_prob)
# Number of hash functions 'k'
self.k = self.get_hash_count(self.m, capacity)
# Bit array (using a list of booleans or ints for simplicity)
self.bit_array = [0] * self.m
@staticmethod
def get_size(n, p):
"""
Calculates the optimal size of bit array (m)
n: capacity
p: false_pos_prob
"""
return int(-(n * log(p)) / (log(2) ** 2))
@staticmethod
def get_hash_count(m, n):
"""
Calculates the optimal number of hash functions (k)
m: size of bit array
n: capacity
"""
return int((m / n) * log(2))
def _get_hashes(self, item):
"""
Generates k hash indices for a given item.
Using mmh3 with different seeds to simulate multiple hash functions.
"""
indices = []
for i in range(self.k):
# Using i as a seed to get different hash outputs
h = mmh3.hash(item, i) % self.m
indices.append(h)
return indices
def add(self, item):
"""Adds an item to the bloom filter."""
indices = self._get_hashes(item)
for i in indices:
self.bit_array[i] = 1
def check(self, item):
"""Checks if an item is possibly in the bloom filter."""
indices = self._get_hashes(item)
for i in indices:
if self.bit_array[i] == 0:
return False # Definitely not in the set
return True # Possibly in the set (could be a false positive)
# Main for demonstration
# Expect to store up to 10 items, with a 1% false positive rate
bf = BloomFilter(capacity=10, false_pos_prob=0.01)
print(f"Bloom Filter initialized with m={bf.m} bits, k={bf.k} hash functions.")
items_to_add = ["apple", "banana", "cherry", "date", "elderberry"]
for item in items_to_add:
bf.add(item)
print(f"Added: '{item}'")
print("\n--- Checking Membership ---")
print(f"Is 'apple' in BF? {bf.check('apple')}") # True (correct)
print(f"Is 'grape' in BF? {bf.check('grape')}") # False (correct, or True if F.P.)
print(f"Is 'date' in BF? {bf.check('date')}") # True (correct)
print(f"Is 'zucchini' in BF? {bf.check('zucchini')}") # False (correct, or True if F.P.)
# Demonstrate potential false positive (will depend on hash functions and m/k)
# For example, after adding 'apple', 'banana', 'cherry', 'date', 'elderberry',
# let's say 'mango' hashes to the same bit positions as a combination of existing items.
# This might return True for 'mango' even though it wasn't added.
# For this small example, a false positive might not happen reliably.
// For hashing, we'll use a simple approach for demonstration.
// In a real Bloom Filter, use robust hash functions like MurmurHash3.
// A common trick is to use two strong hash functions (h1, h2)
// to simulate 'k' hash functions: hash_i(x) = (h1(x) + i * h2(x)) % m
class BloomFilter {
constructor(capacity, falsePosProb) {
this.capacity = capacity;
this.falsePosProb = falsePosProb;
this.m = Math.ceil((capacity * Math.log(falsePosProb)) / (Math.log(2) * -Math.log(2)));
this.k = Math.round((this.m / capacity) * Math.log(2));
this.bitArray = new Array(this.m).fill(0);
}
// A simple, non-cryptographic hash function for demonstration
_simpleHash(str, seed) {
let hash = seed;
for (let i = 0; i < str.length; i++) {
hash = (hash * 31 + str.charCodeAt(i)) | 0; // Bitwise OR 0 to ensure 32-bit integer
}
return Math.abs(hash); // Ensure positive
}
_getHashes(item) {
const indices = [];
// Simulate k hash functions using different seeds
for (let i = 0; i < this.k; i++) {
indices.push(this._simpleHash(item, i) % this.m);
}
return indices;
}
add(item) {
const indices = this._getHashes(item);
for (const i of indices) {
this.bitArray[i] = 1;
}
}
check(item) {
const indices = this._getHashes(item);
for (const i of indices) {
if (this.bitArray[i] === 0) {
return false; // Definitely not in the set
}
}
return true; // Possibly in the set
}
}
// Main for demonstration
// Expect to store up to 10 items, with a 1% false positive rate
const bf = new BloomFilter(10, 0.01);
const itemsToAdd = ["apple", "banana", "cherry", "date", "elderberry"];
for (const item of itemsToAdd) {
bf.add(item);
}
console.log("\n--- Checking Membership ---");
Complexity (Bloom Filter Operations):
- Add: O(k) - We run
k
hash functions and setk
bits. - Check: O(k) - We run
k
hash functions and checkk
bits. - Space Complexity: O(m) - Proportional to the size of the bit array, independent of the number of elements actually stored (up to capacity).
Note: The formula for optimal m
(bit array size) and k
(number of hash functions) depends on the capacity
(number of elements expected) and the desired false_pos_prob
. Choosing these values correctly is crucial to achieve the desired balance between space, speed, and accuracy.
🕷️ Web Crawling & Bloom Filters
Web crawlers (like those used by search engines) explore the internet by following links from web pages. A massive challenge for crawlers is to avoid visiting the same URL multiple times. Re-visiting URLs wastes bandwidth, time, and server resources.
The Problem: Storing every visited URL in a traditional HashSet
(which uses hash tables) would quickly consume an enormous amount of memory, as the number of unique URLs can be in the billions or trillions.
The Bloom Filter Solution:
- A Bloom Filter is maintained for all URLs already visited.
- When the crawler encounters a new URL:
- It first
checks
the Bloom Filter. - If the Bloom Filter says “definitely not visited”: The crawler can confidently add the URL to its queue for future processing and also
add
the URL to the Bloom Filter. - If the Bloom Filter says “possibly visited”: The crawler can then perform a more expensive, precise check (e.g., against a persistent database of visited URLs). If it turns out to be a false positive, it simply means an extra (but rare) database lookup. If it’s a true positive, the URL is skipped.
- It first
Why this works for Web Crawling:
- Space Savings: The Bloom Filter uses significantly less memory than storing full URLs in a hash table.
- Speed: Membership tests are very fast, so the crawler can quickly filter out most already-visited URLs.
- Acceptable False Positives: A small percentage of false positives means a tiny fraction of URLs might be re-fetched or trigger an unnecessary database lookup. This is usually acceptable given the massive scale and the memory/speed benefits.
Practical Usages (Beyond Web Crawling):
- Database Query Optimization: Checking if a row might exist in a large table before performing an expensive disk read.
- Spell Checkers: Storing a dictionary of valid words to quickly check if a word is likely misspelled.
- Caching Proxies: Determining if a URL has already been requested before.
- Detecting Network Worms/Viruses: Quickly identifying known malicious patterns.
- Preventing Username Collisions: Checking if a username is “possibly” taken before a more expensive database check.
Note: Designing a Bloom Filter for a specific application involves a careful trade-off. You need to estimate the maximum number of elements you expect (capacity
), and then choose an acceptable false_pos_prob
. These two parameters directly determine the size of the bit array (m
) and the number of hash functions (k
), impacting both memory usage and performance.