2. The stream model
• Data sequentially enters at a rapid rate from one or more inputs
• We cannot store the entire stream
• Processing in real-time
• Limited memory (usually sublinear in the size of the stream)
• Goal: Compute a function of the stream, e.g., median, number of
distinct elements, longest increasing subsequence, filtering
An approximate answer is usually acceptable
4. Bloom Filter
• A Bloom filter is a space-efficient probabilistic data structure that is
used to test whether an element is a member of a set.
• For example, checking whether a username is available is a set membership
problem, where the set is the list of all registered usernames.
• The price we pay for this efficiency is that the structure is probabilistic:
a query may return a false positive.
5. Motivation:
The “Set Membership” Problem
• x: An Element
• S: A Set of elements (Finite)
• Input: x, S
• Output:
• True (if x in S)
• False (if x not in S)
Solution: Binary search on a sorted array of size |S|. Runtime complexity: O(log |S|) per query
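The exact (non-streaming) solution can be sketched as follows. This is a minimal illustration that assumes the whole set fits in memory and has been sorted once; the name `contains` and the sample usernames are made up for the example.

```python
from bisect import bisect_left

def contains(sorted_items, x):
    """Exact set membership via binary search on a sorted list: O(log |S|)."""
    i = bisect_left(sorted_items, x)  # leftmost index where x could be inserted
    return i < len(sorted_items) and sorted_items[i] == x

# Hypothetical example set; sorting is done once up front.
registered = sorted(["carol", "alice", "bob"])
print(contains(registered, "bob"))   # True
print(contains(registered, "dave"))  # False
```

Unlike the streaming approach, this stores every element of S, so memory grows linearly with |S|.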
Streaming Algorithm:
• Limited Space/item
• Limited Processing time/item
• Approximate answer based on a summary/sketch
of the data stream in the memory.
6. Bloom Filter
• Given a set of keys S that we want to filter
• Create a bit array B of n bits, initially all false (0s) (complexity: O(n))
• Choose k independent and uniform hash functions $h_0, h_1, \dots, h_{k-1}$,
each of which outputs a value in the range {0, 1, …, n−1}
F F F F F F F F F F
0 1 2 3 4 5 6 7 8 9
n = 10
7. Bloom Filter
• For each element $s \in S$, the bits at positions
$h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true, i.e., $B[h_i(s)] = 1$ for every $i$
• Complexity of insertion: O(k)
Example with k = 3: inserting $s_1$ with $h_0(s_1) = 1$, $h_1(s_1) = 4$, $h_2(s_1) = 6$ sets bits 1, 4 and 6:
F T F F T F T F F F
0 1 2 3 4 5 6 7 8 9
8. Bloom Filter
• For each element $s \in S$, the bits at positions
$h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true.
Example with k = 3: inserting $s_2$ with $h_0(s_2) = 4$, $h_1(s_2) = 7$, $h_2(s_2) = 9$ sets bits 4, 7 and 9 (bit 4 was already set by $s_1$):
F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9
Note: A particular Boolean value may be set to True several times.
9. Algorithm to Approximate Set Membership Query
Input: x (may or may not be an element of S)
Output: Boolean
For all i ∈ {0, 1, …, k−1}:
    if B[h_i(x)] is False
        return False
return True
Bit array after inserting $s_1$ and $s_2$ (k = 3):
F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9
Example queries: x = $s_1$ and x = $s_3$
Runtime complexity: O(k)
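The creation, insertion and query steps can be put together in a short sketch. The slides do not say how the k hash functions are obtained; salting SHA-256 with the index i is an assumption made here for illustration, and the names `BloomFilter`, `add` and `might_contain` are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: n bits and k hash functions."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.bits = [False] * n  # bit array B, initially all False: O(n)

    def _positions(self, item):
        # h_i(item) for i = 0..k-1. Salting SHA-256 with i is an assumed
        # way to derive k hash functions; the slides do not specify one.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        for p in self._positions(item):  # insertion: O(k)
            self.bits[p] = True

    def might_contain(self, item):
        # False is definite; True may be a false positive. Query: O(k)
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(n=1000, k=3)
for name in ["alice", "bob"]:
    bf.add(name)
print(bf.might_contain("alice"))  # True: inserted items are always found
print(bf.might_contain("zed"))    # usually False; True would be a false positive
```

A Python list of booleans is used only for readability; a real implementation would pack the bits to get the space savings.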
10. Algorithm to Approximate Set Membership Query
• An empty Bloom filter is a bit array of n bits, all set to zero.
• Example: Suppose we want to insert “geeks” into the filter, using
3 hash functions and a bit array of length 10, all set to 0 initially.
First we calculate the hashes as follows:
• h1(“geeks”) % 10 = 1
• h2(“geeks”) % 10 = 4
• h3(“geeks”) % 10 = 7
• The bits at indices 1, 4 and 7 are then set to 1.
11. Algorithm to Approximate Set Membership Query
• Next we insert “nerd”; similarly, we calculate its hashes:
• h1(“nerd”) % 10 = 3
• h2(“nerd”) % 10 = 5
• h3(“nerd”) % 10 = 4
• The bits at indices 3, 5 and 4 are then set to 1 (bit 4 was already set by “geeks”).
12. Algorithm to Approximate Set Membership Query
• False Positive in Bloom Filters
• Example: Suppose we want to check whether “cat” is present or not.
We calculate its hashes using h1, h2 and h3:
• h1(“cat”) % 10 = 1
• h2(“cat”) % 10 = 3
• h3(“cat”) % 10 = 7
• If we check the bit array, the bits at these indices are all set to 1, yet we
know that “cat” was never added to the filter. The bits at indices 1 and 7
were set when we added “geeks”, and bit 3 was set when we added “nerd”.
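The “geeks” / “nerd” / “cat” example can be replayed directly. In this sketch the hash positions are copied from the example rather than computed, so the table `POSITIONS` stands in for h1, h2 and h3; the helper names are made up.

```python
# Hash positions copied from the worked example (3 hash functions, 10 bits).
POSITIONS = {
    "geeks": [1, 4, 7],
    "nerd": [3, 5, 4],
    "cat": [1, 3, 7],
}

bits = [0] * 10  # empty filter

def add(word):
    for p in POSITIONS[word]:
        bits[p] = 1

def might_contain(word):
    return all(bits[p] == 1 for p in POSITIONS[word])

add("geeks")
add("nerd")
print(bits)                  # [0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
print(might_contain("cat"))  # True, although "cat" was never added
```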
13. Algorithm to Approximate Set Membership Query
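Bit array after inserting $s_1$ and $s_2$ (k = 3):
F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9
$h_0(s_1) = 1$, $h_1(s_1) = 4$, $h_2(s_1) = 6$
$h_0(s_2) = 4$, $h_1(s_2) = 7$, $h_2(s_2) = 9$
Query x: $h_0(x) = 9$, $h_1(x) = 6$, $h_2(x) = 1$
All three bits are set although x was never inserted: False Positive!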
14. Error Types
• False Negative: Answering “is not there” for an element which “is there”
• Never happens with a Bloom filter
• False Positive: Answering “is there” for an element which “is not there”
• Might happen. How likely?
15. Probability of false positives
F T F T F F T F T F
S1
S2
n = size of bit array
m = number of items
k = number of hash functions
Consider a particular bit j, $0 \le j \le n-1$
Probability that $h_i(x)$ does not set bit j after hashing only 1 item:
$P(h_i(x) \ne j) = 1 - \frac{1}{n}$
Probability that $h_i$ does not set bit j after hashing m items:
$P(\forall x \in \{S_1, S_2, \dots, S_m\}: h_i(x) \ne j) = \left(1 - \frac{1}{n}\right)^m$
16. Probability of false positives
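Probability that none of the hash functions sets bit j after hashing m items:
$P(\forall x \in \{S_1, S_2, \dots, S_m\}, \forall i \in \{1, 2, \dots, k\}: h_i(x) \ne j) = \left(1 - \frac{1}{n}\right)^{km}$
We know that, for large n, $\left(1 - \frac{1}{n}\right)^n \approx \frac{1}{e} = e^{-1}$
$\Rightarrow \left(1 - \frac{1}{n}\right)^{km} = \left(\left(1 - \frac{1}{n}\right)^n\right)^{km/n} \approx \left(e^{-1}\right)^{km/n} = e^{-km/n}$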
17. Probability of false positives
Probability that bit j is not set: $P(\text{Bit } j = F) = e^{-km/n}$
Approximate probability of a false positive, i.e., the probability that all k bits of a new element are already set: $\left(1 - e^{-km/n}\right)^k$
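The approximation can be checked numerically. The sketch below evaluates the formula and compares it with a small simulation in which uniformly random positions stand in for ideal hash functions; the sizes n = 8000, m = 1000, k = 3 and the function names are illustrative assumptions, not values from the slides.

```python
import math
import random

def false_positive_rate(n, m, k):
    """Approximate false-positive probability (1 - e^{-km/n})^k."""
    return (1.0 - math.exp(-k * m / n)) ** k

def simulate(n, m, k, trials=20_000, seed=0):
    """Empirical false-positive rate, using random positions as 'hashes'."""
    rng = random.Random(seed)
    bits = [False] * n
    for _ in range(m * k):  # m items, k hash values each
        bits[rng.randrange(n)] = True
    # A new element is a false positive if all k of its positions are set.
    hits = sum(
        all(bits[rng.randrange(n)] for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials

n, m, k = 8_000, 1_000, 3
print(false_positive_rate(n, m, k))  # value of the formula
print(simulate(n, m, k))             # should land close to the formula
```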
18. Analysis: Throwing Darts (1)
• More accurate analysis for the number of false positives
• Consider: If we throw m darts into n equally likely targets, what is
the probability that a target gets at least one dart?
• In our case:
• Targets = bits/buckets
• Darts = hash values of items
19. Analysis: Throwing Darts (2)
• We have m darts, n targets
• What is the probability that a target gets at least one dart?
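Probability that some target X is not hit by a single dart: $1 - \frac{1}{n}$
Probability that target X is not hit by any of the m darts: $\left(1 - \frac{1}{n}\right)^m = \left(1 - \frac{1}{n}\right)^{n(m/n)}$
Since $\left(1 - \frac{1}{n}\right)^n \to \frac{1}{e}$ as $n \to \infty$, this is approximately $e^{-m/n}$
Probability that at least one dart hits target X: $1 - \left(1 - \frac{1}{n}\right)^{n(m/n)} \approx 1 - e^{-m/n}$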
20. Analysis: Throwing Darts (3)
• Fraction of 1s in the array B = probability of false positive = $1 - e^{-m/n}$
• Example: $10^9$ darts, $8 \cdot 10^9$ targets
• Fraction of 1s in B = $1 - e^{-1/8} = 0.1175$
• Compare with our earlier estimate: 1/8 = 0.125
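The dart example is easy to verify. This short check computes both the exact expression $1 - (1 - 1/n)^m$ and the approximation $1 - e^{-m/n}$ for the numbers above; `log1p` and `expm1` are used only to keep the exact value numerically stable for such a large n.

```python
import math

m = 10**9      # darts (hash values of items)
n = 8 * 10**9  # targets (bits)

# Exact: 1 - (1 - 1/n)^m, computed as -(exp(m * log(1 - 1/n)) - 1).
exact = -math.expm1(m * math.log1p(-1.0 / n))
# Approximation: 1 - e^{-m/n}.
approx = 1.0 - math.exp(-m / n)

print(round(exact, 4), round(approx, 4))  # 0.1175 0.1175
print(m / n)                              # 0.125, the earlier rough estimate
```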
21. First Cut Solution (3)
|S| = 1 billion email addresses
|B|= 1GB = 8 billion bits
If the email address is in S, then it surely hashes to a bucket that has
the bit set to 1,
so it always gets through (no false negatives)
22. Properties of Bloom Filters
• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large
number of elements.
• Adding an element never fails. However, the false positive rate increases steadily as elements are added
until all bits in the filter are set to 1, at which point all queries yield a positive result.
• Bloom filters never generate a false negative result, i.e., they never tell you that a username doesn’t exist when it
actually exists.
• Deleting elements from the filter is not possible: if we delete a single element by clearing the bits at the
indices generated by its k hash functions, we might also delete other elements. Example: if we
delete “geeks” (from the example above) by clearing the bits at indices 1, 4 and 7, we might end up deleting “nerd”
as well, because the bit at index 4 becomes 0 and the Bloom filter then claims that “nerd” is not present.
• Space efficiency: If we want to store a large list of items in a set for the purpose of set membership, we can
store it in a hash map, a simple array, or a linked list. All these methods require storing the item itself, which is
not very memory efficient. For example, if we want to store “geeks” in a hash map we have to store the actual
string “geeks” as a key-value pair {some_key : “geeks”}.
Bloom filters do not store the data item at all.
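The deletion problem listed among the properties above can be reproduced with the bit positions from the earlier example (1, 4, 7 for the first word and 3, 5, 4 for “nerd”). This is a sketch only: the positions are hard-coded instead of hashed, and `naive_delete` is a hypothetical helper that shows what goes wrong.

```python
# Hash positions copied from the earlier worked example (10-bit array).
POSITIONS = {"geeks": [1, 4, 7], "nerd": [3, 5, 4]}
bits = [0] * 10

def add(word):
    for p in POSITIONS[word]:
        bits[p] = 1

def might_contain(word):
    return all(bits[p] == 1 for p in POSITIONS[word])

def naive_delete(word):
    # Clearing the bits also clears any bit shared with other elements.
    for p in POSITIONS[word]:
        bits[p] = 0

add("geeks")
add("nerd")
before = might_contain("nerd")
naive_delete("geeks")          # clears bit 4, which "nerd" also uses
after = might_contain("nerd")
print(before, after)           # True False: a false negative was introduced
```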
Editor's Notes
Consider: |S| = m, |B| = n
Given a set of keys S that we want to filter:
Create a bit array B of n bits, initially all 0s
Choose a hash function h with range [0, n)
Question: Where does the randomness come from? Answer: From the set {S1, S2, …, Sm}. Imagine that you try m times to set the bit at position j to True using a particular (i-th) hash function h_i (the function itself is fixed, so it contributes no randomness). The last line is the probability that you fail to set that bit to True.
In the first equation, the reason the (1 − 1/n)^m term is multiplied k times is that when you use k hash functions to hash m items, you try k times per item to set the bit at position j to True. Once again, there is no randomness in h_i here; you are simply taking more trials to set bit j to True.