Basics of Information
Retrieval
Lillian N. Cassel
Some of these slides are taken or adapted from
Source: http://www.stanford.edu/class/cs276/cs276-2006-syllabus.html
Basic ideas
 Information overload
 The challenging byproduct of the information
age
 Huge amounts of information available -- how
to find what you need when you need it
 Think about addresses, e-mail messages, files
of interesting articles, etc.
 Information retrieval is the formal study of
efficient and effective ways to extract the right
bit of information from a collection.
 The web is a special case, as we will discuss.
Some distinctions
 Data, information, knowledge
 How do you distinguish among them?
 http://www.systems-thinking.org/dikw/dikw.htm
 Information sources
 Very well organized, indexed, controlled
 Totally unorganized, uncharacterized,
uncontrolled
 Something in between
Databases
 Databases hold specific data items
 Organization is explicit
 Keys relate items to each other
 Queries are constrained, but effective in retrieving the
data that is there
 Databases generally respond to specific queries with
specific results
 Browsing is difficult
 Searching for items not anticipated by the designers
can be difficult
The Web
 The Web contains many kinds of elements
 Organization?
 There are no keys to relate items to each other
 Queries are unconstrained; effectiveness depends on
the tools used.
 Web search tools generally respond to general queries
with specific results
 Browsing is possible, though somewhat complicated
 There are no designers of the overall Web structure.
 Describe how you frequently use the web
 What works easily?
 What has been difficult?
Digital Library
 Something in between the very structured
database and the unstructured Web.
 Content is controlled. Someone makes the
entries. (Maybe a lot of people make the
entries, but there are rules for admission.)
 Searching and browsing are somewhat open,
not controlled by fixed keys and anticipated
queries.
 Nature of the collection regulates indexing
somewhat.
In all cases
 Trying to connect an information user to the
specific information wanted.
 Concerned with efficiency and effectiveness
 Effectiveness - how well did we do?
 Efficiency - how well did we use available
resources?
Effectiveness
 Two measures:
 Precision
 Of the results returned, what percentage are meaningful
to the goal of the query?
 Recall
 Of the materials available that match the query, what
percentage were returned?
 Ex. Search returns 590,000 responses and 195 are
relevant. How well did we do?
 Not enough information.
 Did the 590,000 include all relevant responses? If so,
recall is perfect.
 195/590,000 is not good precision!
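The arithmetic behind this example can be sketched in Python. The function names are illustrative, and the recall figure assumes the "if so" case above, i.e. that the collection holds exactly 195 relevant items; the slide does not give that number.

```python
def precision(relevant_returned: int, total_returned: int) -> float:
    """Fraction of the returned results that are relevant."""
    return relevant_returned / total_returned if total_returned else 0.0


def recall(relevant_returned: int, total_relevant: int) -> float:
    """Fraction of all relevant items in the collection that were returned."""
    return relevant_returned / total_relevant if total_relevant else 0.0


# Figures from the slide: 590,000 results returned, 195 of them relevant.
p = precision(195, 590_000)

# Assumption: the collection holds exactly 195 relevant items, so none were missed.
r = recall(195, 195)

print(f"precision = {p:.4%}")
print(f"recall    = {r:.0%}")
```

Under that assumption recall is 100% while precision is roughly 0.03%, which is why the two measures have to be read together.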
The process
Query entered -> Query interpreted -> Index searched -> Items retrieved -> Results ranked
The Collection
 Where does the collection come from?
 How is the index created?
 Those are important distinguishing
characteristics
 Inverted Index -- Ordered list of terms
related to the collected materials. Each
term has an associated pointer to the
related material(s).
 www.cs.cityu.edu.hk/~deng/5286/T51.doc
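A minimal sketch of the inverted index described above, assuming a tiny made-up collection and plain whitespace tokenization (both are illustrative choices, not taken from the slides). Each term points to the IDs of the documents that contain it.

```python
from collections import defaultdict

# Hypothetical three-document collection, keyed by document ID.
documents = {
    1: "information retrieval is the study of finding information",
    2: "the web is a special case of a collection",
    3: "a digital library is an organized collection",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of IDs of documents containing it."""
    postings: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the terms to match the "ordered list of terms" in the definition.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(documents)
print(index["collection"])
print(index["information"])
```

Looking up a term is then a single dictionary access instead of a scan over every document.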
Crawling the web
 "Crawling" is a misnomer: the spider or robot does not actually
move about the web
 Program sends a normal request for the page, just as
a browser would.
 Retrieve the page and parse it.
 Look for anchors -- pointers to other pages.
• Put them on a list of URLs to visit
 Extract key words (possibly all words) to use as index
terms related to that page
 Take the next URL and do it again
 Actually, the crawling and processing are parallel
activities
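The loop on this slide can be sketched as follows. To keep the example runnable without network access, the page request is replaced by a lookup in a hypothetical in-memory dictionary (FAKE_WEB); the URLs, class name, and function name are all assumptions made for illustration. The sketch is also sequential, whereas real crawling and processing run in parallel.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs offline.
# A real crawler would send a normal page request here, as a browser would.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a> home page',
    "http://example.org/a.html": '<a href="/b.html">B</a> about retrieval',
    "http://example.org/b.html": '<a href="/">home</a> about indexing',
}


class PageParser(HTMLParser):
    """Collect anchor targets and visible words from one page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seed: str, limit: int = 10) -> dict[str, list[str]]:
    """Visit pages breadth-first and return the words found at each URL."""
    to_visit = deque([seed])
    seen = {seed}
    page_words: dict[str, list[str]] = {}
    while to_visit and len(page_words) < limit:
        url = to_visit.popleft()
        html = FAKE_WEB.get(url)           # stand-in for the page request
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        page_words[url] = parser.words     # candidate index terms for this page
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative anchors
            if absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return page_words


pages = crawl("http://example.org/")
print(sorted(pages))
```

The `seen` set keeps the crawler from revisiting a page when two pages link to each other, as the home page and b.html do here.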
Responding to search queries
 Use the query string provided
 Form a boolean query
 Join all words with AND? With OR?
 Find the related index terms
 Return the information available about the
pages that correspond to the query terms.
 Many variations on how to do this. Usually
proprietary to the company.
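The AND/OR choice can be illustrated with set operations over an inverted index. The index contents and the function name below are hypothetical, and this is only the simplest variation; commercial engines use their own, usually unpublished, methods.

```python
# Hypothetical inverted index: term -> set of page IDs containing that term.
INDEX = {
    "digital": {1, 3},
    "library": {1, 2, 3},
    "web": {2, 4},
}


def boolean_search(query: str, index: dict[str, set[int]], mode: str = "AND") -> set[int]:
    """Join all query words with AND (intersection) or OR (union)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    if mode == "OR":
        return set.union(*postings)
    raise ValueError(f"unsupported mode: {mode}")


print(sorted(boolean_search("digital library", INDEX, "AND")))
print(sorted(boolean_search("digital web", INDEX, "OR")))
```

Joining with AND narrows the result set (favoring precision), while OR widens it (favoring recall).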
Making the connections
 Stemming
 Making sure that simple variations in word form are
recognized as equivalent for the purpose of the search:
exercise, exercises, exercised, for example.
 Indexing
 A keyword or group of selected words
 Any word (more general)
 How to choose the most relevant terms to use as index
elements for a set of documents.
 Build an inverted file for the chosen index terms.
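As a rough illustration of stemming, the sketch below strips one common English suffix from a word. It is a deliberately naive stand-in, not the Porter stemmer or any other published algorithm, and the suffix list is an assumption chosen to cover the example words.

```python
SUFFIXES = ("ing", "ed", "es", "s", "e")  # illustrative list, checked in this order


def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ["exercise", "exercises", "exercised"]}
print(stems)
```

All three forms reduce to the same stem, so a query for any one of them would match index entries built from the others.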
The Vector model
 Let
 N be the total number of documents in the collection
 n_i be the number of documents that contain the term k_i
 freq(i,j) be the raw frequency of k_i within document d_j
 A normalized tf (term frequency) factor is given by
 tf(i,j) = freq(i,j) / max_l freq(l,j)
 where the maximum is computed over all terms l that
occur within the document d_j
 The idf (inverse document frequency) factor is computed as
 idf(i) = log(N / n_i)
 The log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of
information associated with the term k_i.
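The two factors can be computed directly from their definitions. In the sketch below the sample collection is made up, the natural logarithm is an assumption (the slide says only "log"), and multiplying tf by idf is the usual tf-idf weighting, which the slide does not spell out. Documents are assumed to be non-empty.

```python
import math
from collections import Counter

# Hypothetical tokenized collection: one list of terms per document d_j.
docs = [
    ["web", "search", "web"],
    ["search", "index"],
    ["web", "crawler"],
]


def tf_idf(collection: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, weight = tf(i,j) * idf(i)."""
    n_docs = len(collection)                   # N
    doc_freq = Counter()                       # n_i for every term k_i
    for doc in collection:
        doc_freq.update(set(doc))
    weights = []
    for doc in collection:
        freq = Counter(doc)                    # freq(i,j)
        max_freq = max(freq.values())          # maximum over the terms in d_j
        weights.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return weights


w = tf_idf(docs)
print(round(w[0]["web"], 3))
print(round(w[1]["index"], 3))
```

A term that appears in every document gets idf = log(1) = 0, so it contributes nothing to the weight no matter how often it occurs.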
Anatomy of a web page
 Metatags: Information about the page
 Primary source of indexing information for a search
engine.
 Ex. Title. Never mind what has an H1 tag (though that
may be considered), what is between the <title> and </title>
tags?
 Other tags provide information about the page. This is
easier for the search engine to use than determining
the meaning of the text of the page.
 Dealing with the cheaters
 False information provided in the web page to make
the search engine return this page
 False metatags, invisible words (repeated many times),
etc
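To show how a program might read this information, the sketch below pulls the title and the name/content pairs of meta tags out of a page using Python's standard html.parser module. The sample page and the class name are hypothetical, and a real search engine would also have to decide how far to trust what it finds.

```python
from html.parser import HTMLParser

# Hypothetical page used only for illustration.
PAGE = """<html><head>
<title>Basics of Information Retrieval</title>
<meta name="description" content="Introductory slides on IR">
<meta name="keywords" content="retrieval, indexing, web">
</head><body><h1>Welcome</h1></body></html>"""


class HeadParser(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


parser = HeadParser()
parser.feed(PAGE)
parser.close()
print(parser.title)
print(parser.meta["keywords"])
```

Because the page author controls every one of these values, the same few lines that make indexing easy also make the false metatags mentioned above easy to supply.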
Standard Metatags
 The Dublin Core (http://dublincore.org/)
15 common items to use in labeling any web
document
Title          Contributor      Source
Creator        Date             Language
Subject        Resource type    Relation
Description    Format           Coverage
Publisher      Identifier       Rights
Hubs and authorities
 Hub points to a lot of other places.
 CITIDEL is a hub for computing information
 NSDL is a hub for science, technology, engineering
and mathematics education.
 Authorities are pointed to by a lot of other places.
 W3C.org is an authority for information about the web.
 When Hub or Authority status is captured, the search
can be more accurate.
 If several pages match a query, and one is an authority
page, it will be ranked higher.
 When a hub matches a query, the pages it points to are
likely to be relevant.
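One way hub and authority status can be computed from link structure is the iterative scheme usually called HITS, sketched below on a made-up four-page graph. The page names and iteration count are assumptions, and the slides do not say which algorithm any particular search engine uses.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub": ["a", "b", "c"],
    "a": ["c"],
    "b": ["c"],
    "c": [],
}


def hits(links: dict[str, list[str]], iterations: int = 20):
    """Return (hub_scores, authority_scores) after repeated mutual updates."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


hub_scores, auth_scores = hits(LINKS)
print(max(hub_scores, key=hub_scores.get))
print(max(auth_scores, key=auth_scores.get))
```

In this graph the page named "hub" points to the most pages and gets the top hub score, while "c" is pointed to by the most pages and gets the top authority score.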
Some Digital Library examples
 Between the chaos of the Web and the strict structure
of a database, the digital library contains an
organized collection.
 We saw the digital collection at the Falvey library
session.
 See also:
 NSDL www.nsdl.org
 And the computing component, CITIDEL:
citidel.villanova.edu
 American Memory
http://memory.loc.gov/ammem/index.html
Conclusions
 The plan was to introduce the basic concepts
of information retrieval in a form accessible to
most students, before you have read anything
about it.
 We will look more deeply at these subjects in
the coming weeks.
 A word about the pattern for these slides …

Information retrieval is the process of accessing data resources, usually documents or other unstructured data, for the purpose of sharing knowledge.

  • 1. Basics of Information Retrieval Lillian N. Cassel Some of these slides are taken or adapted from Source: http://www.stanford.edu/class/cs276/cs276-2006-syllabus.html
  • 2. Basic ideas  Information overload  The challenging byproduct of the information age  Huge amounts of information available -- how to find what you need when you need it  Think about addresses, e-mail messages, files of interesting articles, etc.  Information retrieval is the formal study of efficient and effective ways to extract the right bit of information from a collection.  The web is a special case, as we will discuss.
  • 3. Some distinctions  Data, information, knowledge  How do you distinguish among them?  http://www.systems-thinking.org/dikw/dikw.htm  Information sources  Very well organized, indexed, controlled  Totally unorganized, uncharacterized, uncontrolled  Something in between
  • 4. Databases  Databases hold specific data items  Organization is explicit  Keys relate items to each other  Queries are constrained, but effective in retrieving the data that is there  Databases generally respond to specific queries with specific results  Browsing is difficult  Searching for items not anticipated by the designers can be difficult
  • 5. The Web  The Web contains many kinds of elements  Organization?  There are no keys to relate items to each other  Queries are unconstrained; effectiveness depends on the tools used.  Web queries generally respond to general queries with specific results  Browsing is possible, though somewhat complicated  There are no designers of the overall Web structure.  Describe how you frequently use the web  What works easily?  What has been difficult?
  • 6. Digital Library  Something in between the very structured database and the unstructured Web.  Content is controlled. Someone makes the entries. (Maybe a lot of people make the entries, but there are rules for admission.)  Searching and browsing are somewhat open, not controlled by fixed keys and anticipated queries.  Nature of the collection regulates indexing somewhat.
  • 7. In all cases
      • Trying to connect an information user to the specific information wanted.
      • Concerned with efficiency and effectiveness
          • Effectiveness - how well did we do?
          • Efficiency - how well did we use available resources?
  • 8. Effectiveness
      • Two measures:
          • Precision: Of the results returned, what percentage are meaningful to the goal of the query?
          • Recall: Of the materials available that match the query, what percentage were returned?
      • Ex. Search returns 590,000 responses and 195 are relevant. How well did we do?
          • Not enough information to judge recall.
          • Did the 590,000 include all relevant responses? If so, recall is perfect.
          • 195/590,000 (about 0.03%) is not good precision!
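The two measures on this slide can be written as short functions. The sketch below uses the slide's numbers for precision. For recall it has to assume the total number of relevant items in the collection, which the slide does not give; the function names and that total are illustrative only.

```python
def precision(relevant_returned: int, total_returned: int) -> float:
    """Fraction of the returned results that are relevant."""
    return relevant_returned / total_returned if total_returned else 0.0


def recall(relevant_returned: int, total_relevant: int) -> float:
    """Fraction of all relevant items in the collection that were returned."""
    return relevant_returned / total_relevant if total_relevant else 0.0


# Numbers from the slide: 590,000 responses, 195 of them relevant.
p = precision(195, 590_000)
print(f"precision = {p:.6f}")

# ASSUMPTION: the collection holds exactly 195 relevant items, i.e. the
# search missed nothing. The slide leaves this number unknown.
r = recall(195, 195)
print(f"recall = {r:.2f}")
```

If the collection actually held 390 relevant items, the same search would have recall 0.5 while precision stayed unchanged, which is why the slide says there is not enough information.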
  • 10. The Collection
      • Where does the collection come from?
      • How is the index created?
      • Those are important distinguishing characteristics
      • Inverted Index -- Ordered list of terms related to the collected materials. Each term has an associated pointer to the related material(s).
      • www.cs.cityu.edu.hk/~deng/5286/T51.doc
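A minimal sketch of an inverted index as described on this slide: an ordered list of terms, each pointing to the documents that contain it. The document IDs, the sample texts, and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Order the terms alphabetically, as in an "ordered list of terms".
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical two-document collection.
docs = {
    "d1": "information retrieval basics",
    "d2": "web crawling and retrieval",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```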
  • 11. Crawling the web
      • "Crawling" is a misnomer: the spider or robot does not actually move about the web
      • Program sends a normal request for the page, just as a browser would.
      • Retrieve the page and parse it.
          • Look for anchors -- pointers to other pages.
              • Put them on a list of URLs to visit
          • Extract key words (possibly all words) to use as index terms related to that page
      • Take the next URL and do it again
      • Actually, the crawling and processing are parallel activities
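The crawl loop above can be sketched as follows. To keep the example runnable without network access, it reads from a hypothetical in-memory `PAGES` dictionary instead of sending real HTTP requests, and it runs sequentially, whereas the slide notes that real crawlers fetch and process in parallel. All page names and contents are made up.

```python
from collections import deque
from html.parser import HTMLParser

# ASSUMPTION: a tiny in-memory "web" standing in for real HTTP requests.
PAGES = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a> retrieval basics',
    "b.html": '<a href="a.html">A</a> web crawling',
    "c.html": "inverted index",
}


class AnchorParser(HTMLParser):
    """Collects anchor targets and the words of the page text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for name, v in attrs if name == "href" and v)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start, fetch=PAGES.get):
    """Visit pages breadth-first; return {url: words found on that page}."""
    to_visit = deque([start])
    seen = {start}
    page_terms = {}
    while to_visit:
        url = to_visit.popleft()
        html = fetch(url)          # "request the page"
        if html is None:           # page not found: skip it
            continue
        parser = AnchorParser()
        parser.feed(html)          # "retrieve the page and parse it"
        parser.close()
        page_terms[url] = parser.words
        for link in parser.links:  # "put them on a list of URLs to visit"
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return page_terms


result = crawl("a.html")
print(result)
```

The `seen` set keeps the crawler from revisiting `a.html` when `b.html` links back to it.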
  • 12. Responding to search queries
      • Use the query string provided
      • Form a Boolean query
          • Join all words with AND? With OR?
      • Find the related index terms
      • Return the information available about the pages that correspond to the query terms.
      • Many variations on how to do this. Usually proprietary to the company.
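As a rough illustration of the AND/OR choice, the sketch below answers a query against a small hand-built index. The index contents and the `search` function are hypothetical; real engines use proprietary variations, as the slide says.

```python
def search(index: dict[str, set[str]], query: str, mode: str = "AND") -> set[str]:
    """Return IDs of documents matching all (AND) or any (OR) query terms."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    if mode == "OR":
        return set.union(*postings)
    raise ValueError("mode must be 'AND' or 'OR'")


# Hypothetical index: term -> set of document IDs containing it.
index = {
    "information": {"d1", "d3"},
    "retrieval": {"d1", "d2"},
    "web": {"d2"},
}

print(search(index, "information retrieval", "AND"))
print(search(index, "information retrieval", "OR"))
```

AND narrows the result to documents containing every term, while OR widens it to documents containing at least one.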
  • 13. Making the connections
      • Stemming
          • Making sure that simple variations in word form are recognized as equivalent for the purpose of the search: exercise, exercises, exercised, for example.
      • Indexing
          • A keyword or group of selected words
          • Any word (more general)
          • How to choose the most relevant terms to use as index elements for a set of documents.
          • Build an inverted file for the chosen index terms.
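To make the stemming idea concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer or any production algorithm; the suffix list and the minimum stem length are assumptions chosen so that the slide's three example words collapse to one form.

```python
# ASSUMPTION: a toy suffix list; real stemmers use far richer rules.
SUFFIXES = ("es", "ed", "e", "s")  # longer suffixes are tried first


def naive_stem(word: str) -> str:
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("exercise", "exercises", "exercised"):
    print(w, "->", naive_stem(w))
```

All three forms reduce to the same stem, so a query for any one of them would match documents indexed under the others.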
  • 14. The Vector model
      • Let
          • $k_i$ be an index term and $d_j$ a document
          • $N$ be the total number of documents in the collection
          • $n_i$ be the number of documents which contain $k_i$
          • $freq(i,j)$ be the raw frequency of $k_i$ within $d_j$
      • A normalized tf (term frequency) factor is given by
          • $tf(i,j) = freq(i,j) / \max_l freq(l,j)$
          • where the maximum is computed over all terms $k_l$ which occur within the document $d_j$
      • The idf (inverse document frequency) factor is computed as
          • $idf(i) = \log(N / n_i)$
          • The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
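The two factors defined on this slide can be computed directly. The sketch below also multiplies them into a single tf-idf weight, which is the usual combination but is an assumption here, since the slide only defines the factors. The natural logarithm and the sample documents are likewise assumptions.

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return {doc_id: {term: tf(i,j) * idf(i)}} using the slide's formulas."""
    n_docs = len(docs)                      # N
    doc_freq = Counter()                    # n_i for each term
    for terms in docs.values():
        doc_freq.update(set(terms))         # count each term once per document

    weights = {}
    for doc_id, terms in docs.items():
        freq = Counter(terms)               # freq(i,j)
        max_freq = max(freq.values(), default=1)
        weights[doc_id] = {
            # tf(i,j) = freq(i,j) / max_l freq(l,j);  idf(i) = log(N / n_i)
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        }
    return weights


# Hypothetical collection of three already-tokenized documents.
docs = {
    "d1": ["web", "web", "search"],
    "d2": ["web", "index"],
    "d3": ["library", "index", "index"],
}
w = tf_idf(docs)
for doc_id, term_weights in w.items():
    print(doc_id, term_weights)
```

A term that appears in every document gets idf = log(1) = 0, so it contributes nothing to distinguishing one document from another.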
  • 15. Anatomy of a web page
      • Metatags: Information about the page
          • Primary source of indexing information for a search engine.
          • Ex. Title. Never mind what has an H1 tag (though that may be considered), what is between the <title> and </title> tags?
          • Other tags provide information about the page. This is easier for the search engine to use than determining the meaning of the text of the page.
      • Dealing with the cheaters
          • False information provided in the web page to make the search engine return this page
          • False metatags, invisible words (repeated many times), etc.
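A small sketch of how an indexer might pull the title and metatags out of a page, using Python's standard `html.parser`. The sample page and the metatag names in it (including the `DC.creator` label) are made up for illustration.

```python
from html.parser import HTMLParser


class HeadParser(HTMLParser):
    """Collects the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page; the H1 text is ignored, the title and metatags are kept.
page = """<html><head>
<title>Basics of Information Retrieval</title>
<meta name="keywords" content="retrieval, indexing, search">
<meta name="DC.creator" content="Lillian N. Cassel">
</head><body><h1>Welcome</h1></body></html>"""

parser = HeadParser()
parser.feed(page)
parser.close()
print(parser.title)
print(parser.meta)
```

Because these values are supplied by the page author, an indexer that trusts them blindly is exposed to the false metatags mentioned above.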
  • 16. Standard Metatags
      • The Dublin Core (http://dublincore.org/)
      • 15 common items to use in labeling any web document:
          • Title, Creator, Subject, Description, Publisher
          • Contributor, Date, Resource Type, Format, Identifier
          • Source, Language, Relation, Coverage, Rights
  • 17. Hubs and authorities
      • Hub points to a lot of other places.
          • CITIDEL is a hub for computing information
          • NSDL is a hub for science, technology, engineering and mathematics education.
      • Authorities are pointed to by a lot of other places.
          • W3C.org is an authority for information about the web.
      • When Hub or Authority status is captured, the search can be more accurate.
          • If several pages match a query, and one is an authority page, it will be ranked higher.
          • When a hub matches a query, the pages it points to are likely to be relevant.
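The slide does not name an algorithm for capturing hub and authority status. One well-known approach is the HITS iteration, sketched here on a made-up link graph. The page names, the iteration count, and the normalization are assumptions for illustration.

```python
import math


def hits(links: dict[str, list[str]], iterations: int = 20):
    """Return (hub, authority) scores for a graph given as {page: [targets]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that point to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph: "portal" links out widely, "w3c" is widely linked to.
links = {
    "portal": ["w3c", "spec", "tutorial"],
    "blog": ["w3c"],
    "tutorial": ["w3c"],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```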
  • 18. Some Digital Library examples
      • Between the chaos of the Web and the strict structure of a database, the digital library contains an organized collection.
      • We saw the digital collection at the Falvey library session.
      • See also:
          • NSDL www.nsdl.org
          • And the computing component, CITIDEL: citidel.villanova.edu
          • American Memory http://memory.loc.gov/ammem/index.html
  • 19. Conclusions
      • The plan was to introduce the basic concepts of information retrieval in a form accessible to most students, before you have read anything about it.
      • We will look more deeply at these subjects in the coming weeks.
      • A word about the pattern for these slides …