Www Search Engine But Not In Perl
1. Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky (http://www.majestic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org). V1.0, 27/07/05
2.
3.
4.
5.
6. Data collection (crawling)
Base: issues URLs to crawl and receives compressed pages.
Distributed crawlers: receive lists of URLs to crawl, crawl them and send back compressed data. In the future they will also do distributed indexing.
Note: this stage is optional if you already have data to index, for example a list of products with their descriptions.
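The exchange between the base and a crawler can be sketched as below. This is a minimal illustration, not the project's actual protocol: the function names (crawl_batch, receive_batch, fake_fetch), the example.com URLs and the choice of zlib for compression are all assumptions, and the network fetch is replaced by a stand-in so the sketch runs offline.

```python
import zlib
from typing import Callable, Dict, List


def crawl_batch(urls: List[str], fetch: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Crawler side (hypothetical): fetch each URL and return compressed pages."""
    return {url: zlib.compress(fetch(url)) for url in urls}


def receive_batch(batch: Dict[str, bytes]) -> Dict[str, bytes]:
    """Base side (hypothetical): decompress the pages a crawler sent back."""
    return {url: zlib.decompress(data) for url, data in batch.items()}


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP download, so no network access is needed."""
    return f"<html>page at {url}</html>".encode("utf-8")


# The base issues a list of URLs; the crawler returns compressed data.
urls = ["http://example.com/a", "http://example.com/b"]
sent = crawl_batch(urls, fake_fetch)
pages = receive_batch(sent)
```

Compressing on the crawler side keeps the amount of data sent back to the base small, which matters when many crawlers report to one base.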
12. Current Stats Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05
13. Indexing
Indexing is the process of turning words into numbers and creating an inverted index.
Data barrel:
  Doc #0: Birmingham Perl Mongers
  Doc #1: Birmingham City
  Doc #2: Perl City
Lexicon (maps words to their numeric WordIDs):
  Birmingham – 0
  Perl – 1
  Mongers – 2
  City – 3
Inverted index (each WordID has a list of, ideally sorted, DocIDs):
  0 -> 0, 1
  1 -> 0, 2
  2 -> 0
  3 -> 1, 2
Note: if you use a database, it makes sense to have a clustered index on WordID.
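The lexicon and inverted index for the three example documents can be built with a short sketch like the one below. The function name build_index and the whitespace tokenisation are assumptions made for illustration; a real indexer would also normalise case and punctuation.

```python
from typing import Dict, List, Tuple


def build_index(docs: List[str]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Return (lexicon, inverted index) for a list of documents.

    DocIDs are the positions of the documents in the list, so every
    posting list comes out sorted by DocID.
    """
    lexicon: Dict[str, int] = {}          # word -> WordID
    inverted: Dict[int, List[int]] = {}   # WordID -> sorted DocIDs
    for doc_id, text in enumerate(docs):
        for word in text.split():
            # Assign the next free WordID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            postings = inverted.setdefault(word_id, [])
            # Record each document only once per word.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return lexicon, inverted


docs = ["Birmingham Perl Mongers", "Birmingham City", "Perl City"]
lexicon, inverted = build_index(docs)
```

For these documents the result reproduces the slide: the lexicon maps Birmingham, Perl, Mongers and City to 0, 1, 2 and 3, and WordID 3 (City) lists DocIDs 1 and 2.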
14. Merging
Individual indexed barrels -> single searchable index
Note: this stage is not necessary if just one barrel is used, as there will be no need to remap IDs from local to their global equivalents.
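The remapping of local IDs to global ones can be sketched as follows. This is one possible approach under stated assumptions, not necessarily how the project does it: each barrel is represented as a hypothetical tuple (lexicon, inverted index, document count), global DocIDs are obtained by adding a running offset, and global WordIDs are assigned in order of first appearance.

```python
from typing import Dict, List, Tuple

Barrel = Tuple[Dict[str, int], Dict[int, List[int]], int]


def merge_barrels(barrels: List[Barrel]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Merge barrels with local IDs into one index with global IDs."""
    global_lexicon: Dict[str, int] = {}
    global_index: Dict[int, List[int]] = {}
    doc_offset = 0  # global DocID of the current barrel's local Doc #0
    for lexicon, inverted, doc_count in barrels:
        for word, local_word_id in lexicon.items():
            # Remap the local WordID to its global equivalent.
            global_word_id = global_lexicon.setdefault(word, len(global_lexicon))
            postings = global_index.setdefault(global_word_id, [])
            # Remap local DocIDs; offsets only grow, so lists stay sorted.
            postings.extend(doc_offset + d for d in inverted.get(local_word_id, []))
        doc_offset += doc_count
    return global_lexicon, global_index


# Barrel A holds "Birmingham Perl Mongers"; barrel B holds
# "Birmingham City" and "Perl City", each with its own local IDs.
barrel_a: Barrel = ({"Birmingham": 0, "Perl": 1, "Mongers": 2},
                    {0: [0], 1: [0], 2: [0]}, 1)
barrel_b: Barrel = ({"Birmingham": 0, "City": 1, "Perl": 2},
                    {0: [0], 1: [0, 1], 2: [1]}, 2)
merged_lexicon, merged_index = merge_barrels([barrel_a, barrel_b])
```

Note that "Perl" has local WordID 1 in one barrel and 2 in the other; after merging both map to the same global WordID.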
15. Searching
Searching is the process of finding documents that contain the words from the search query.
Documents:
  Doc #0: Birmingham Perl Mongers
  Doc #1: Birmingham City
  Doc #2: Perl City
Lexicon (maps words to their numeric WordIDs):
  Birmingham – 0
  Perl – 1
  Mongers – 2
  City – 3
Inverted index (lists DocIDs for each WordID):
  0 -> 0, 1
  1 -> 0, 2
  2 -> 0
  3 -> 1, 2
Note: if you use a database, it makes sense to cluster on WordID.
Search query: “Birmingham Perl”
WordIDs: 0, 1
Intersection of DocIDs present in both lists (implementation of boolean AND logic):
  0 (Brum)   1 (Perl)   Result
  0          0          Matched!
  1          n/a        Not matched!
  n/a        2          Not matched!
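The boolean AND step can be sketched as a merge-style intersection of two sorted DocID lists. The function names intersect_sorted and search are illustrative assumptions, and the lexicon and inverted index are hard-coded from the example above so the block stands alone.

```python
from typing import Dict, List


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Return the DocIDs present in both sorted lists."""
    i = j = 0
    matched: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matched.append(a[i])  # DocID is in both lists: matched
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # a[i] cannot appear later in b
        else:
            j += 1                # b[j] cannot appear later in a
    return matched


def search(query: str, lexicon: Dict[str, int],
           inverted: Dict[int, List[int]]) -> List[int]:
    """Return DocIDs of documents containing every word of the query."""
    word_ids = []
    for word in query.split():
        if word not in lexicon:
            return []             # an unknown word can never match
        word_ids.append(lexicon[word])
    if not word_ids:
        return []
    result = list(inverted[word_ids[0]])
    for word_id in word_ids[1:]:
        result = intersect_sorted(result, inverted[word_id])
    return result


lexicon = {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3}
inverted = {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
hits = search("Birmingham Perl", lexicon, inverted)
```

Here hits contains only DocID 0, the one document listed under both WordID 0 and WordID 1. Keeping the DocID lists sorted is what lets the intersection run in a single pass over both lists.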