SlideShare a Scribd company logo
1 of 9
Download to read offline
Machine Learning in Spam
Filtering
John Levine, IETF 94
1
Spam filtering
● It’s hard
● Spammers adapt
● So filters have to adapt too
2
Metadata vs. contents
● Metadata
○ Connecting IP, Envelope to/from, Timing, etc.
○ Usually cheap
● Contents
○ Naughty words, statistical patterns, etc.
○ Always expensive, since it requires message receipt
3
Content filtering
● Originally static words or regexps
○ viagra
○ v[1i][a@]gr[a@]
○ v[. ]*[1i][. ]*[a@][. ]*g[. ]*(a[. ]*)?r[. ]*[a@]
● Now usually dynamic
4
Bayesian filtering
● Paul Graham, A Plan for Spam
● Bayesian filtering on word sequences
5
Bayesian filtering
● Tokenize the message
○ All headers? Some headers? Body? Body minus attachments? Decoded attachments?
● Look up word sequences in database, compute score
● Do filtering
● Tune filters when they’re wrong
6
Filter tuning
● Tune when user reports spam/not-spam
● Auto-tune as mail goes by
● Per system? Per user? Shared among multiple systems?
7
Spamassassin spamassassin.apache.org
● The de facto standard filterware
● Big perl module with plugins
● Fairly sophisticated bayesian filters
● Tuning is your problem via sa-learn
8
Bulk counting
● Characterize message as a checksum
● Razor razor.sourceforge.net
○ Spam reported manually or automatically
○ Shared database of checksums of spam
● Distributed Checksum Clearinghouse www.dcc-servers.net
○ Count them all
○ Whitelist legit bulk mail
○ IP reputation add-on
9

More Related Content

Similar to Spam Detection Using Machine Learning (ML)

Building a CRM on top of ElasticSearch
Building a CRM on top of ElasticSearchBuilding a CRM on top of ElasticSearch
Building a CRM on top of ElasticSearchMark Greene
 
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...Derek Buitenhuis
 
2 Laymans Course - LAMP V2.pptx
2 Laymans Course - LAMP V2.pptx2 Laymans Course - LAMP V2.pptx
2 Laymans Course - LAMP V2.pptxssuser2f0fb0
 
An Introduction to Natural Language Processing
An Introduction to Natural Language ProcessingAn Introduction to Natural Language Processing
An Introduction to Natural Language ProcessingTyrone Systems
 
Ever Present Persistence - Established Footholds Seen in the Wild
Ever Present Persistence - Established Footholds Seen in the WildEver Present Persistence - Established Footholds Seen in the Wild
Ever Present Persistence - Established Footholds Seen in the WildCTruncer
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programmingJuggernaut Liu
 

Similar to Spam Detection Using Machine Learning (ML) (8)

Building a CRM on top of ElasticSearch
Building a CRM on top of ElasticSearchBuilding a CRM on top of ElasticSearch
Building a CRM on top of ElasticSearch
 
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
 
Taming Text
Taming TextTaming Text
Taming Text
 
2 Laymans Course - LAMP V2.pptx
2 Laymans Course - LAMP V2.pptx2 Laymans Course - LAMP V2.pptx
2 Laymans Course - LAMP V2.pptx
 
An Introduction to Natural Language Processing
An Introduction to Natural Language ProcessingAn Introduction to Natural Language Processing
An Introduction to Natural Language Processing
 
Ever Present Persistence - Established Footholds Seen in the Wild
Ever Present Persistence - Established Footholds Seen in the WildEver Present Persistence - Established Footholds Seen in the Wild
Ever Present Persistence - Established Footholds Seen in the Wild
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programming
 
ANTLR4 in depth
ANTLR4 in depthANTLR4 in depth
ANTLR4 in depth
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Recently uploaded (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Spam Detection Using Machine Learning (ML)

  • 1. Machine Learning in Spam Filtering John Levine, IETF 94 1
  • 2. Spam filtering ● It’s hard ● Spammers adapt ● So filters have to adapt too 2
  • 3. Metadata vs. contents ● Metadata ○ Connecting IP, Envelope to/from, Timing, etc. ○ Usually cheap ● Contents ○ Naughty words, statistical patterns, etc. ○ Always expensive, since it requires message receipt 3
  • 4. Content filtering ● Originally static words or regexps ○ viagra ○ v[1i][a@]gr[a@] ○ v[. ]*[1i][. ]*[a@][. ]*g[. ]*(a[. ]*)?r[. ]*[a@] ● Now usually dynamic 4
  • 5. Bayesian filtering ● Paul Graham, A Plan for Spam ● Bayesian filtering on word sequences 5
  • 6. Bayesian filtering ● Tokenize the message ○ All headers? Some headers? Body? Body minus attachments? Decoded attachments? ● Look up word sequences in database, compute score ● Do filtering ● Tune filters when they’re wrong 6
  • 7. Filter tuning ● Tune when user reports spam/not-spam ● Auto-tune as mail goes by ● Per system? Per user? Shared among multiple systems? 7
  • 8. Spamassassin spamassassin.apache.org ● The de facto standard filterware ● Big perl module with plugins ● Fairly sophisticated bayesian filters ● Tuning is your problem via sa-learn 8
  • 9. Bulk counting ● Characterize message as a checksum ● Razor razor.sourceforge.net ○ Spam reported manually or automatically ○ Shared database of checksums of spam ● Distributed Checksum Clearinghouse www.dcc-servers.net ○ Count them all ○ Whitelist legit bulk mail ○ IP reputation add-on 9