SlideShare a Scribd company logo
1 of 23
Download to read offline
Building a scalable distributed  WWW search engine  … NOT in Perl! Presented by Alex Chudnovsky (http://www.majectic12.co.uk)  at Birmingham Perl Mongers User Group  (http://birmingham.pm.org) V1.0 27/07/05
Contents ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
History (of my work in area of information retrieval) ,[object Object],[object Object],[object Object]
Goals ,[object Object],[object Object],[object Object]
Architecture ,[object Object],[object Object],[object Object],[object Object]
Data collection (crawling) Base Issues URLs to crawl and receives compressed pages Distributed c rawlers – receive lists of URLs to crawl, crawl them and send back compressed data. In the future will do distributed indexing Note: this stage is optional if you already have data to index, ie list of products with their descriptions
Crawler screenshot 1
Crawler screenshot 2
Crawler screenshot 3
Crawler screenshot 4
Crawler screenshot 5
Current Stats Source:  http://www.majestic12.co.uk/projects/dsearch/stats.php  as of 27/07/05
Indexing Indexing is a process of turning words into numbers and creating inverted index. Data barrel Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs) Birmingham – 0 Perl  – 1 Mongers  – 2 City  – 3  Inverted Index (Each of the WordID has list of  (ideally sorted) DocIDs) 0  -> 0, 1 1  -> 0, 2 2  -> 0, 3  -> 1, 2 Note: if you use database then it make sense to have clustered index on WordID
Merging Individual indexed barrels Single searchable index Note: this stage is not necessary if just one barrel is used as there will be no need to remap all Ids from local to their global equivalents.
Searching Searching is a process of finding documents that contain words from search query Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs) Birmingham – 0 Perl  – 1 Mongers  – 2 City  – 3  Inverted Index (lists DocIDs for each of the WordID) 0  -> 0, 1 1  -> 0, 2 2  -> 0, 3  -> 1, 2 Note: if you use database then it make sense to cluster on WordID Search query:  “Birmingham Perl” WordIDs: 0, 1 Intersection of DocIDs present in both lists (implementation of boolean AND logic): Not matched! 2 n/a Not matched! n/a 1 Matched! 0 0 Result 1 (Perl) 0 (Brum)
Search engine screenshot 1
Search engine screenshot 2
Implementation ,[object Object],[object Object],[object Object]
Why not Perl? (using C # instead) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Conclusions ,[object Object],[object Object],[object Object],[object Object],[object Object]
Credits ,[object Object],[object Object],* Volunteers running crawler and who crawled at least 1 mln URLs as of 27/07/05
Recommended reading ,[object Object],[object Object]
Join! Join the project  (unmetered broadband required!):  majestic12.co.uk Your name could be here!

More Related Content

What's hot

EuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devicesEuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devicesHua Chu
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixStitch Fix Algorithms
 
Documenting an API written in Django Rest Framework
Documenting an API written in Django Rest FrameworkDocumenting an API written in Django Rest Framework
Documenting an API written in Django Rest Frameworksmirolo
 
Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Richard Boulton
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
 
Data Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPData Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPJiangjun Huang
 
Week 2-after
Week 2-afterWeek 2-after
Week 2-afterjnand
 
Get Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataGet Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataCrate.io
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
An Introduction to MongoDB Compass
An Introduction to MongoDB CompassAn Introduction to MongoDB Compass
An Introduction to MongoDB CompassMongoDB
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataVivian S. Zhang
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jWilliam Lyon
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariKarissa Rae McKelvey
 
Final Presentation IRT - Jingxuan Wei V1.2
Final Presentation  IRT - Jingxuan Wei V1.2Final Presentation  IRT - Jingxuan Wei V1.2
Final Presentation IRT - Jingxuan Wei V1.2JINGXUAN WEI
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 

What's hot (20)

EuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devicesEuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devices
 
SFrame
SFrameSFrame
SFrame
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch Fix
 
Documenting an API written in Django Rest Framework
Documenting an API written in Django Rest FrameworkDocumenting an API written in Django Rest Framework
Documenting an API written in Django Rest Framework
 
Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8
 
Optimizing Spark
Optimizing SparkOptimizing Spark
Optimizing Spark
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Data Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPData Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCP
 
Week 2-after
Week 2-afterWeek 2-after
Week 2-after
 
Get Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataGet Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor Data
 
Sphinx
SphinxSphinx
Sphinx
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
An Introduction to MongoDB Compass
An Introduction to MongoDB CompassAn Introduction to MongoDB Compass
An Introduction to MongoDB Compass
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 
Final Presentation IRT - Jingxuan Wei V1.2
Final Presentation  IRT - Jingxuan Wei V1.2Final Presentation  IRT - Jingxuan Wei V1.2
Final Presentation IRT - Jingxuan Wei V1.2
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 

Viewers also liked

C1 Introducere Sistem1
C1 Introducere Sistem1C1 Introducere Sistem1
C1 Introducere Sistem1antropologie
 
Specificatii Eseu Fotografic Observatie Antropologie 2009
Specificatii Eseu Fotografic Observatie Antropologie 2009Specificatii Eseu Fotografic Observatie Antropologie 2009
Specificatii Eseu Fotografic Observatie Antropologie 2009antropologie
 
C3 EvoluţIa Sistemului1
C3 EvoluţIa Sistemului1C3 EvoluţIa Sistemului1
C3 EvoluţIa Sistemului1antropologie
 
Nätverket 24 timmarswebben
Nätverket 24 timmarswebbenNätverket 24 timmarswebben
Nätverket 24 timmarswebbenBjörn Hagström
 
C3 4 EvoluţIa Sistemului1
C3 4 EvoluţIa Sistemului1C3 4 EvoluţIa Sistemului1
C3 4 EvoluţIa Sistemului1antropologie
 

Viewers also liked (9)

C1 Introducere Sistem1
C1 Introducere Sistem1C1 Introducere Sistem1
C1 Introducere Sistem1
 
Kick off presentation
Kick off presentationKick off presentation
Kick off presentation
 
Specificatii Eseu Fotografic Observatie Antropologie 2009
Specificatii Eseu Fotografic Observatie Antropologie 2009Specificatii Eseu Fotografic Observatie Antropologie 2009
Specificatii Eseu Fotografic Observatie Antropologie 2009
 
Page Rank
Page RankPage Rank
Page Rank
 
C3 EvoluţIa Sistemului1
C3 EvoluţIa Sistemului1C3 EvoluţIa Sistemului1
C3 EvoluţIa Sistemului1
 
Nätverket 24 timmarswebben
Nätverket 24 timmarswebbenNätverket 24 timmarswebben
Nätverket 24 timmarswebben
 
Kathleen & nina
Kathleen & ninaKathleen & nina
Kathleen & nina
 
C3 4 EvoluţIa Sistemului1
C3 4 EvoluţIa Sistemului1C3 4 EvoluţIa Sistemului1
C3 4 EvoluţIa Sistemului1
 
Diviziuni Sociale
Diviziuni SocialeDiviziuni Sociale
Diviziuni Sociale
 

Similar to Www Search Engine But Not In Perl

MongoATL: How Sourceforge is Using MongoDB
MongoATL: How Sourceforge is Using MongoDBMongoATL: How Sourceforge is Using MongoDB
MongoATL: How Sourceforge is Using MongoDBRick Copeland
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?Örjan Lundberg
 
Accra MongoDB User Group
Accra MongoDB User GroupAccra MongoDB User Group
Accra MongoDB User GroupMongoDB
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxsmile790243
 
Beginning MEAN Stack
Beginning MEAN StackBeginning MEAN Stack
Beginning MEAN StackRob Davarnia
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearchMinsoo Jun
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8dallemang
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At CraigslistJeremy Zawodny
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilSunita Shrivastava
 
Techorama - Evolvable Application Development with MongoDB
Techorama  - Evolvable Application Development with MongoDBTechorama  - Evolvable Application Development with MongoDB
Techorama - Evolvable Application Development with MongoDBbwullems
 
Mongo db first steps with csharp
Mongo db first steps with csharpMongo db first steps with csharp
Mongo db first steps with csharpSerdar Buyuktemiz
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialPHP Support
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBMongoDB
 

Similar to Www Search Engine But Not In Perl (20)

MongoATL: How Sourceforge is Using MongoDB
MongoATL: How Sourceforge is Using MongoDBMongoATL: How Sourceforge is Using MongoDB
MongoATL: How Sourceforge is Using MongoDB
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?
 
Accra MongoDB User Group
Accra MongoDB User GroupAccra MongoDB User Group
Accra MongoDB User Group
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
 
Mongo db
Mongo dbMongo db
Mongo db
 
Beginning MEAN Stack
Beginning MEAN StackBeginning MEAN Stack
Beginning MEAN Stack
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch Council
 
Techorama - Evolvable Application Development with MongoDB
Techorama  - Evolvable Application Development with MongoDBTechorama  - Evolvable Application Development with MongoDB
Techorama - Evolvable Application Development with MongoDB
 
Mongo db first steps with csharp
Mongo db first steps with csharpMongo db first steps with csharp
Mongo db first steps with csharp
 
MongoDB Basics
MongoDB BasicsMongoDB Basics
MongoDB Basics
 
Uma SunilKumar Resume
Uma SunilKumar ResumeUma SunilKumar Resume
Uma SunilKumar Resume
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js Tutorial
 
Reume IT
Reume ITReume IT
Reume IT
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
 
Search Approach - ES, GraphDB
Search Approach - ES, GraphDBSearch Approach - ES, GraphDB
Search Approach - ES, GraphDB
 
NoSql Databases
NoSql DatabasesNoSql Databases
NoSql Databases
 

Recently uploaded

Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServiceRenan Moreira de Oliveira
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 

Recently uploaded (20)

Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 

Www Search Engine But Not In Perl

  • 1. Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky (http://www.majectic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org) V1.0 27/07/05
  • 2.
  • 3.
  • 4.
  • 5.
  • 6. Data collection (crawling) Base Issues URLs to crawl and receives compressed pages Distributed c rawlers – receive lists of URLs to crawl, crawl them and send back compressed data. In the future will do distributed indexing Note: this stage is optional if you already have data to index, ie list of products with their descriptions
  • 12. Current Stats Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05
  • 13. Indexing Indexing is a process of turning words into numbers and creating inverted index. Data barrel Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs) Birmingham – 0 Perl – 1 Mongers – 2 City – 3 Inverted Index (Each of the WordID has list of (ideally sorted) DocIDs) 0 -> 0, 1 1 -> 0, 2 2 -> 0, 3 -> 1, 2 Note: if you use database then it make sense to have clustered index on WordID
  • 14. Merging Individual indexed barrels Single searchable index Note: this stage is not necessary if just one barrel is used as there will be no need to remap all Ids from local to their global equivalents.
  • 15. Searching Searching is a process of finding documents that contain words from search query Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs) Birmingham – 0 Perl – 1 Mongers – 2 City – 3 Inverted Index (lists DocIDs for each of the WordID) 0 -> 0, 1 1 -> 0, 2 2 -> 0, 3 -> 1, 2 Note: if you use database then it make sense to cluster on WordID Search query: “Birmingham Perl” WordIDs: 0, 1 Intersection of DocIDs present in both lists (implementation of boolean AND logic): Not matched! 2 n/a Not matched! n/a 1 Matched! 0 0 Result 1 (Perl) 0 (Brum)
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. Join! Join the project (unmetered broadband required!): majestic12.co.uk Your name could be here!