Slideshow Transcript
- Slide 1: Web Retrieval and Mining Overview. Source: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).
- Slide 2: Information Retrieval. Methods for finding information in documents. Started in the 1970s and 1980s. “Methods”: algorithms and heuristics. “Finding”: Query – Document, Document – Document, etc. “Documents”: texts.
- Slide 3: The Web is different. Massive (thousands of millions of documents). Dynamic (updates, deletes). Distributed. Variable quality. Malicious behavior.
- Slide 4: Web IR topics. Web Search (crawling, indexing, querying). Web Mining. Adversarial Web IR. Distributed Web IR. Evaluation.
- Slide 5: Web search
- Slide 6: Main goals. Precision: relevant documents returned / documents returned. Recall: relevant documents returned / relevant documents. Freshness. Performance/scalability.
- Slide 7: Main goals
- Slide 8: Two phases of search. Off-line: crawling and indexing. On-line: querying and ranking.
- Slide 9: Search phases
- Slide 10: Web crawling. Download pages following rules. Applications: create index for search; find particular information items; find/report problems. Constraints: robot exclusion protocol and politeness; deep web.
- Slide 11: Web indexing. Logical view: tokenization, stopword removal, stemming. Creation of an inverted index.
- Slide 12: Inverted index
- Slide 13: Challenges of indexing. Index compression. Efficiency in top-K searches. Sorting. Index distribution: by terms, by documents.
- Slide 14: Web querying and ranking. Keyword-based search is the dominant paradigm. No large-scale open-domain QA systems (yet). Relevance: vector space model and variants; query expansion; latent semantic indexing.
- Slide 15: Web ranking. Quality is the main problem. Link ranking. Hypothesis 1: topical locality of links. Hypothesis 2: a link implies endorsement. PageRank. HITS.
- Slide 16: HITS
- Slide 17: Rank manipulation. “The bubble of Web visibility”. Content spam: keyword stuffing, content hiding. Link spam: link farms. Cloaking.
- Slide 18: Web mining
- Slide 19: Content mining. Extraction of knowledge from Web pages, BUT ... HTML is physical formatting. There is information loss.
- Slide 20: Information loss
- Slide 21: Aspects of content mining. Information extraction: revert information loss. Content classification: topic, genre, sentiment analysis.
- Slide 22: Link mining. Scale-free networks.
- Slide 23: Macroscopic view. Bow-tie structure.
- Slide 24: Usage mining. Logfile analysis. Query logs. Privacy issues.
- Slide 25: Emerging topics. Mobile Web. Semantic Web. ...
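
The precision and recall ratios from Slide 6 can be illustrated with a small sketch. The function name `precision_recall` and the document IDs are hypothetical, chosen only for illustration. Returning 0.0 for an empty set is an assumption made here, not something the slides specify.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for a single query.

    returned: document IDs the search engine returned
    relevant: document IDs judged relevant for the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents returned
    # Assumption: define the ratio as 0.0 when its denominator is empty.
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: 4 documents returned, 3 relevant overall, 2 in common.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up example, two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).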
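
The indexing steps listed on Slide 11 (tokenization, stopword removal, stemming, then building an inverted index) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the stopword list is a tiny made-up sample, `crude_stem` is a naive suffix stripper standing in for a real stemmer, and all function names are hypothetical rather than taken from the slides.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def crude_stem(token):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def logical_view(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in logical_view(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_and(index, query):
    """Return the documents that contain every term of the query."""
    postings = [set(index.get(t, ())) for t in logical_view(query)]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    docs = {1: "Crawling the Web", 2: "Indexing of Web pages", 3: "Ranking pages"}
    index = build_inverted_index(docs)
    print(index["web"])
    print(search_and(index, "web pages"))
```

The query passes through the same `logical_view` as the documents, so query terms and index terms are normalized identically.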
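
Slides 15 and 16 name HITS as a link-ranking method without giving its update rules. The sketch below follows the commonly described formulation: a page's authority score is the sum of the hub scores of the pages linking to it, its hub score is the sum of the authority scores of the pages it links to, and both are normalized after each round. The function names, the fixed iteration count, and the toy graph are assumptions made for illustration.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()} if norm else dict(scores)


def hits(graph, iterations=50):
    """Run HITS on `graph`, a dict mapping each page to the pages it links to.

    Returns (hub, authority) score dictionaries.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: add up hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: add up the fresh authority scores of the pages linked to.
        new_hub = {n: sum(new_auth[d] for d in graph.get(n, ())) for n in nodes}
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return hub, auth


if __name__ == "__main__":
    # Hypothetical toy graph: pages "a" and "b" both link to page "c".
    hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
    print("best authority:", max(auth, key=auth.get))
    print("hub scores:", {k: round(v, 3) for k, v in sorted(hub.items())})
```

In the toy graph, "c" ends up as the only authority, and "a" and "b" share equal hub scores, which matches the intuition that a link acts as an endorsement.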


