Streaming Algorithms: Part 2
The stream model
• Data sequentially enters at a rapid rate from one or more inputs
• We cannot store the entire stream
• Processing in real-time
• Limited memory (usually sublinear in the size of the stream)
• Goal: Compute a function of the stream, e.g., median, number of distinct elements, longest increasing subsequence, filtering
• An approximate answer is usually sufficient
Bloom Filter
• A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.
• For example, checking whether a username is available is a set membership problem, where the set is the list of all registered usernames.
• The price we pay for this efficiency is that the answer is probabilistic: a query may return a false positive.
Motivation:
The “Set Membership” Problem
• x: An Element
• S: A Set of elements (Finite)
• Input: x, S
• Output:
• True (if x in S)
• False (if x not in S)
Exact solution: Binary search on a sorted array of size |S|. Runtime complexity: O(log |S|), but the whole set must be kept in memory.
Streaming Algorithm:
• Limited Space/item
• Limited Processing time/item
• Approximate answer based on a summary/sketch
of the data stream in the memory.
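As a baseline for comparison, the exact binary-search solution mentioned above can be sketched as follows. This is a minimal illustration, not part of the original slides; the function name `contains_sorted` and the sample usernames are assumptions made for the example.

```python
from bisect import bisect_left


def contains_sorted(sorted_items, x):
    """Exact membership test by binary search; sorted_items must be sorted."""
    i = bisect_left(sorted_items, x)
    return i < len(sorted_items) and sorted_items[i] == x


# Hypothetical data: the whole set has to be stored, which a stream may not allow.
usernames = sorted(["carol", "alice", "bob"])
print(contains_sorted(usernames, "bob"))   # True
print(contains_sorted(usernames, "dave"))  # False
```

The answer is always exact, but the memory used grows linearly with |S|, which is what the streaming setting tries to avoid.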
Bloom Filter
• Given a set of keys S that we want to filter
• Create a bit array B of n bits, initially all false (0s) (complexity: O(n))
• Choose k independent and uniform hash functions $h_0, h_1, \dots, h_{k-1}$, each of which outputs a value in the range {0, 1, …, n−1}
Example with n = 10:
Index: 0 1 2 3 4 5 6 7 8 9
B:     F F F F F F F F F F
Bloom Filter
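• For each element $s \in S$, the bits at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true, i.e., $B[h_i(s)] = 1$ for every i
• Complexity of insertion: O(k)
Example with k = 3: inserting $s_1$ with $h_0(s_1) = 1$, $h_1(s_1) = 4$, $h_2(s_1) = 6$ sets bits 1, 4 and 6 to T.
Index:  0 1 2 3 4 5 6 7 8 9
Before: F F F F F F F F F F
After:  F T F F T F T F F F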
Bloom Filter
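• For each element $s \in S$, the bits at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true.
Example with k = 3: inserting $s_2$ with $h_0(s_2) = 4$, $h_1(s_2) = 7$, $h_2(s_2) = 9$ sets bits 7 and 9 to T; bit 4 was already set by $s_1$.
Index:  0 1 2 3 4 5 6 7 8 9
Before: F T F F T F T F F F
After:  F T F F T F T T F T
Note: A particular bit may be set to True several times.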
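The two insertions shown on the slides can be replayed in a few lines. This is only a sketch: the hash positions are copied from the slides into a lookup table (the name `positions` is an assumption), whereas a real filter would compute them with hash functions.

```python
n = 10
B = [False] * n  # bit array, initially all False

# Hash positions taken from the slides (k = 3); hard-coded for illustration only.
positions = {"s1": [1, 4, 6], "s2": [4, 7, 9]}


def insert(bits, item):
    """Set the k bits selected by the item's hash positions."""
    for j in positions[item]:
        bits[j] = True  # setting an already-True bit changes nothing


insert(B, "s1")
insert(B, "s2")
print("".join("T" if b else "F" for b in B))  # FTFFTFTTFT
```

Bit 4 is written twice (once per item), which illustrates the note that a bit may be set to True several times.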
Algorithm to Approximate Set Membership Query
Input: x (may or may not be an element of S)
Output: Boolean
For all i ∈ {0, 1, …, k−1}
    if $B[h_i(x)]$ is False
        return False
return True
Runtime complexity: O(k)
Example with k = 3, after inserting $s_1$ and $s_2$:
Index: 0 1 2 3 4 5 6 7 8 9
B:     F T F F T F T T F T
The figure queries $x = s_1$ and $x = s_3$.
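Putting construction, insertion and query together, a small Bloom filter might look like the sketch below. The slides do not prescribe any particular hash functions; deriving k positions from salted SHA-256 digests is an assumption made here for illustration, as are the class and method names.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: n bits, k salted hash functions."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.bits = [False] * n  # O(n) initialisation

    def _positions(self, item):
        # Assumption: the i-th hash is SHA-256 of "i:item", reduced modulo n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        # O(k): set every bit selected by the k hash functions.
        for j in self._positions(item):
            self.bits[j] = True

    def might_contain(self, item):
        # O(k): False means "definitely absent", True means "possibly present".
        return all(self.bits[j] for j in self._positions(item))


bf = BloomFilter(n=1000, k=3)
for name in ["alice", "bob"]:
    bf.add(name)
print(bf.might_contain("alice"))  # True, inserted items are always found
print(bf.might_contain("zoe"))    # usually False, but a false positive is possible
```

The method is named `might_contain` rather than `contains` to make the one-sided error visible at the call site.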
Algorithm to Approximate Set Membership Query
• An empty Bloom filter is a bit array of n bits, all set to zero.
• Example: Suppose we want to add “geeks” to the filter, using 3 hash functions and a bit array of length 10, all set to 0 initially. First we calculate the hashes as follows:
• h1(“geeks”) % 10 = 1
• h2(“geeks”) % 10 = 4
• h3(“geeks”) % 10 = 7
• We then set the bits at indices 1, 4 and 7 to 1.
Algorithm to Approximate Set Membership Query
• Next we add “nerd”; similarly, we calculate the hashes:
• h1(“nerd”) % 10 = 3
• h2(“nerd”) % 10 = 5
• h3(“nerd”) % 10 = 4
• We set the bits at indices 3, 5 and 4 to 1 (bit 4 was already set by “geeks”).
Algorithm to Approximate Set Membership Query
• False Positive in Bloom Filters
• Example: Suppose we want to check whether “cat” is present. We calculate the hashes using h1, h2 and h3:
• h1(“cat”) % 10 = 1
• h2(“cat”) % 10 = 3
• h3(“cat”) % 10 = 7
• If we check the bit array, the bits at these indices are all set to 1, so the filter reports “cat” as present, even though “cat” was never added. The bits at indices 1 and 7 were set when we added “geeks”, and bit 3 was set when we added “nerd”.
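The “geeks” / “nerd” / “cat” walk-through can be reproduced directly. In this sketch the hash values are hard-coded from the slides in a table named `HASHES` (an illustrative assumption), so the outcome is deterministic.

```python
# Hash positions (already reduced modulo 10) copied from the example above.
HASHES = {
    "geeks": [1, 4, 7],
    "nerd": [3, 5, 4],
    "cat": [1, 3, 7],
}

bits = [0] * 10


def add(word):
    for j in HASHES[word]:
        bits[j] = 1


def might_contain(word):
    return all(bits[j] for j in HASHES[word])


add("geeks")
add("nerd")
print(bits)                  # [0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
print(might_contain("cat"))  # True, although "cat" was never added
```

Every bit that “cat” checks was set by one of the other two words, which is exactly how a false positive arises.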
Algorithm to Approximate Set Membership Query
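Example with k = 3, after inserting $s_1$ and $s_2$:
Index: 0 1 2 3 4 5 6 7 8 9
B:     F T F F T F T T F T
• $s_1$: $h_0(s_1) = 1$, $h_1(s_1) = 4$, $h_2(s_1) = 6$
• $s_2$: $h_0(s_2) = 4$, $h_1(s_2) = 7$, $h_2(s_2) = 9$
• Query x: $h_0(x) = 9$, $h_1(x) = 6$, $h_2(x) = 1$. All three bits are set, so the filter answers True although x was never inserted: a false positive.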
Error Types
• False negative: answering “is not there” for an element which “is there”
  • Never happens for a Bloom filter
• False positive: answering “is there” for an element which “is not there”
  • Can happen. How likely?
Probability of false positives
n = size of bit array
m = number of items
k = number of hash functions
Consider a particular bit j, $0 \le j \le n-1$.
Probability that $h_i(x)$ does not set bit j after hashing only 1 item:
$$P\big(h_i(x) \neq j\big) = 1 - \frac{1}{n}$$
Probability that $h_i$ does not set bit j after hashing m items:
$$P\big(\forall x \in \{S_1, S_2, \dots, S_m\}: h_i(x) \neq j\big) = \left(1 - \frac{1}{n}\right)^{m}$$
Probability of false positives
Probability that none of the k hash functions sets bit j after hashing m items:
$$P\big(\forall x \in \{S_1, S_2, \dots, S_m\},\ \forall i \in \{0, 1, \dots, k-1\}: h_i(x) \neq j\big) = \left(1 - \frac{1}{n}\right)^{km}$$
We know that, for large n,
$$\left(1 - \frac{1}{n}\right)^{n} \approx \frac{1}{e} = e^{-1}$$
$$\Rightarrow \left(1 - \frac{1}{n}\right)^{km} = \left(\left(1 - \frac{1}{n}\right)^{n}\right)^{km/n} \approx \left(e^{-1}\right)^{km/n} = e^{-km/n}$$
Probability of false positives
Probability that bit j is not set:
$$P(\text{Bit } j = F) \approx e^{-km/n}$$
Approximate probability of a false positive, i.e., the probability that all k bits of a new element are already set:
$$\left(1 - e^{-km/n}\right)^{k}$$
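The formula above is easy to evaluate numerically. The sketch below (function names are illustrative assumptions) computes both the exponential approximation and the form before the approximation $(1 - 1/n)^{km} \approx e^{-km/n}$ is applied, using the slide example n = 10, m = 2, k = 3.

```python
import math


def fp_rate_approx(n, m, k):
    """Approximate false positive probability (1 - e^(-km/n))^k."""
    return (1.0 - math.exp(-k * m / n)) ** k


def fp_rate_power(n, m, k):
    """Same estimate before replacing (1 - 1/n)^(km) by e^(-km/n)."""
    return (1.0 - (1.0 - 1.0 / n) ** (k * m)) ** k


# n = 10 bits, m = 2 items, k = 3 hash functions
print(round(fp_rate_approx(10, 2, 3), 3))  # 0.092
print(round(fp_rate_power(10, 2, 3), 3))   # 0.103
```

For such a tiny n the two values differ noticeably; the exponential approximation is intended for large n.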
Analysis: Throwing Darts (1)
• A more accurate analysis of the number of false positives
• Consider: If we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart?
• In our case:
• Targets = bits/buckets
• Darts = hash values of items
Analysis: Throwing Darts (2)
• We have m darts, n targets
• What is the probability that a target gets at least one dart?
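• Probability that some target X is not hit by a single dart: $1 - 1/n$
• Probability that at least one of the m darts hits target X:
$$1 - \left(1 - \frac{1}{n}\right)^{m} = 1 - \left(1 - \frac{1}{n}\right)^{n(m/n)}$$
• Since $(1 - 1/n)^{n}$ tends to $1/e$ as $n \to \infty$, this is approximately
$$1 - e^{-m/n}$$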
Analysis: Throwing Darts (3)
• Fraction of 1s in the array B = probability of a false positive = $1 - e^{-m/n}$ (with a single hash function, so one dart per item)
• Example: $10^9$ darts, $8 \cdot 10^9$ targets
• Fraction of 1s in B = $1 - e^{-1/8} = 0.1175$
• Compare with our earlier estimate: 1/8 = 0.125
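The numbers in the darts example can be checked with a short calculation. This sketch uses `math.log1p` and `math.expm1` to evaluate $1 - (1 - 1/n)^m$ without losing precision for very large n; the variable names are assumptions for illustration.

```python
import math

m = 10**9      # darts (hash values of items)
n = 8 * 10**9  # targets (bits)

# 1 - (1 - 1/n)^m, computed stably as -expm1(m * log1p(-1/n))
before_limit = -math.expm1(m * math.log1p(-1.0 / n))

# Limit form 1 - e^(-m/n)
limit_form = 1.0 - math.exp(-m / n)

print(round(limit_form, 4))  # 0.1175
print(m / n)                 # 0.125, the earlier rough estimate
```

At this scale the two forms agree to many decimal places, so the simpler expression $1 - e^{-m/n}$ loses nothing in practice.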
First Cut Solution (3)
|S| = 1 billion email addresses
|B| = 1 GB = 8 billion bits
If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives)
Properties of Bloom Filters
• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
• Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
• Bloom filters never generate a false negative result, i.e., they never tell you that a username doesn’t exist when it actually exists.
• Deleting elements from the filter is not possible: if we delete a single element by clearing the bits at the indices generated by the k hash functions, we may also delete other elements. Example: if we delete “geeks” (from the example above) by clearing the bits at 1, 4 and 7, we may end up deleting “nerd” as well, because the bit at index 4 becomes 0 and the Bloom filter then claims that “nerd” is not present.
• Space efficiency: If we want to store a large list of items in a set for membership testing, we can store it in a hash map, a simple array or a linked list. All of these require storing the items themselves, which is not very memory efficient. For example, to store “geeks” in a hash map we have to store the actual string “geeks” as a key-value pair {some_key : “geeks”}.
• Bloom filters do not store the data items at all.
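The deletion problem described above can be demonstrated with the hash positions from the earlier “geeks” / “nerd” example (bits 1, 4, 7 and 3, 5, 4). This is a sketch with hard-coded positions; the table name `HASHES` and the helper are assumptions for illustration.

```python
# Hash positions (modulo 10) from the earlier worked example.
HASHES = {"geeks": [1, 4, 7], "nerd": [3, 5, 4]}

bits = [0] * 10
for word in HASHES:
    for j in HASHES[word]:
        bits[j] = 1


def might_contain(word):
    return all(bits[j] for j in HASHES[word])


print(might_contain("nerd"))  # True, as expected after insertion

# Naive "delete": clear the bits of "geeks", including the shared bit 4.
for j in HASHES["geeks"]:
    bits[j] = 0

print(might_contain("nerd"))  # False: a false negative has been introduced
```

Clearing a shared bit breaks the no-false-negatives guarantee, which is why a plain Bloom filter does not support deletion.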

Unit 5 Streams2.pptx


Editor's Notes

  1. Consider |S| = m and |B| = n. Given a set of keys S that we want to filter, create a bit array B of n bits, initially all 0s, and choose a hash function h with range [0, n).
  2. Question: Where does the randomness come from? Answer: From the set {S1, S2, …, Sm}. Imagine you are trying m times to set the bit at position j to True using a fixed (i-th) hash function hi; the function itself is not random. The last line is the probability that all m attempts fail to set that bit to True.
  3. In the first equation, the term (1 − 1/n)^m is multiplied k times because hashing m items with k hash functions gives you k times as many attempts to set the bit at position j to True. Once again, there is no randomness in hi here; you are simply taking more trials to set bit j to True.
  4. Unfortunately some spam email will get through
  5. Unfortunately some spam email will get through
  6. http://people.cs.umass.edu/~mcgregor/711S12/sketches1.pdf http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf