Web Information Retrieval and Mining

From ChaToX, 10 months ago

Talk based on: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).

CC Attribution License

Slideshow Transcript

  1. Slide 1: Web Retrieval and Mining Overview. Source: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).
  2. Slide 2: Information Retrieval. Methods for finding information in documents (started in the 1970s and 1980s). “Methods”: algorithms and heuristics. “Finding”: query–document, document–document, etc. “Documents”: texts.
  3. Slide 3: The Web is different. Massive (thousands of millions of documents); dynamic (updates, deletes); distributed; variable quality; malicious behavior.
  4. Slide 4: Web IR topics. Web search (crawling, indexing, querying); Web mining; adversarial Web IR; distributed Web IR; evaluation.
  5. Slide 5: Web search
  6. Slide 6: Main goals. Precision: relevant documents returned / documents returned. Recall: relevant documents returned / relevant documents. Freshness. Performance/scalability.
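
The two ratios on this slide can be sketched in a few lines. This is a minimal illustration, assuming documents are identified by hypothetical string ids; the function name and the sample ids are not from the talk.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    `returned` holds the ids the engine returned, `relevant` the ids
    judged relevant. Both are hypothetical inputs for illustration.
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three relevant overall, two in common.
p, r = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

Here precision is 2/4 and recall is 2/3: returning more documents can only keep or raise recall, but it tends to lower precision.
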
  7. Slide 7: Main goals
  8. Slide 8: Two phases of search. Off-line: crawling and indexing. On-line: querying and ranking.
  9. Slide 9: Search phases
  10. Slide 10: Web crawling. Download pages following rules. Applications: create index for search; find particular information items; find/report problems. Constraints: robot exclusion protocol and politeness; deep web.
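
As a rough sketch of the robot exclusion constraint, Python's standard `urllib.robotparser` module can check whether a URL may be fetched. The robots.txt content, the crawler name `example-crawler`, and the `example.com` URLs are all made up for illustration. The `Crawl-delay` line is a common but non-standard extension, and reading it back assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site root before fetching any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_policy(robots_txt):
    """Parse robots.txt text into a policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)
allowed = policy.can_fetch("example-crawler", "http://example.com/index.html")
blocked = policy.can_fetch("example-crawler", "http://example.com/private/data.html")
delay = policy.crawl_delay("example-crawler")  # seconds to wait between requests
```

A polite crawler would skip URLs for which `can_fetch` returns `False` and sleep for `delay` seconds between requests to the same host.
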
  11. Slide 11: Web indexing. Logical view: tokenization, stopword removal, stemming. Creation of an inverted index.
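
The indexing steps on this slide can be sketched as a small pipeline. Everything here is an assumption made for illustration: the stopword list is tiny, the stemmer is a toy suffix stripper (not the Porter algorithm or any stemmer named in the talk), and the three documents are invented.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}  # tiny illustrative list


def stem(token):
    """Toy stemmer: strip a few common suffixes (not the Porter algorithm)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def logical_view(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(logical_view(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


docs = {
    1: "Crawling the Web",
    2: "Indexing and searching the Web",
    3: "Web crawlers and indexes",
}
index = build_inverted_index(docs)
```

With these inputs, `index["web"]` lists all three documents, while "indexing" and "indexes" collapse to the single term `index`, which is the point of stemming.
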
  12. Slide 12: Inverted index
  13. Slide 13: Challenges of indexing. Index compression. Efficiency in top-K searches (sorting). Index distribution: by terms or by documents.
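
One way to see why top-K efficiency is tied to sorting: the engine only needs the K best results, so it can avoid sorting every candidate. The sketch below uses Python's `heapq.nlargest` for this. It illustrates the general idea only, not the method used by any particular engine, and the scores are invented.

```python
import heapq


def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    Uses a bounded heap, so the cost is roughly O(n log k) instead of
    the O(n log n) of sorting all n candidates.
    """
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical relevance scores for four candidate documents.
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}
best = top_k(scores, 2)
```
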
  14. Slide 14: Web querying and ranking. Keyword-based search is the dominant paradigm (no large-scale open-domain QA systems yet). Relevance: vector space model and variants; query expansion; latent semantic indexing.
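
A minimal sketch of the vector space model: the query and each document become term-frequency vectors, and documents are ranked by cosine similarity to the query. Raw term counts are an assumption made to keep the example short (real systems usually add a weighting such as TF-IDF), and the documents are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (doc_id, similarity) pairs, most similar first."""
    q = Counter(query.lower().split())
    scored = [
        (doc_id, cosine(q, Counter(text.lower().split())))
        for doc_id, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "web search engines",
    "d2": "web mining and usage mining",
    "d3": "cooking recipes",
}
ranking = rank("web mining", docs)
```

Document `d2` ranks first because it shares both query terms, `d1` shares only one, and `d3` shares none and scores zero.
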
  15. Slide 15: Web ranking. Quality is the main problem. Link ranking: hypothesis 1, topical locality of links; hypothesis 2, a link implies endorsement; PageRank; HITS.
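
The endorsement hypothesis is what PageRank builds on: a page is important if important pages link to it. Below is a minimal power-iteration sketch. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a small base score (random jump).
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's score among the pages it endorses.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without out-links: spread its score over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "c" is linked from three pages, "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this graph `c` ends up with the highest score and `d`, which nobody links to, with the lowest. The scores sum to 1.
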
  16. Slide 16: HITS
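
HITS gives every page two scores: a hub score (it links to good authorities) and an authority score (good hubs link to it), each defined in terms of the other. The sketch below iterates the two updates with normalization. The iteration count and the small graph are assumptions for illustration, and the code is not taken from the slide.

```python
import math


def hits(links, iterations=20):
    """Return (hubs, authorities) score dictionaries for a link graph.

    `links` maps each page to the list of pages it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages linking to it.
        auths = {node: 0.0 for node in nodes}
        for source, targets in links.items():
            for target in targets:
                auths[target] += hubs[source]
        # Hub: sum of the authority scores of pages it links to.
        hubs = {node: sum(auths[t] for t in links.get(node, [])) for node in nodes}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm
    return hubs, auths


# Hypothetical graph: three hub pages pointing at two authority pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hubs, auths = hits(graph)
```

Here `a1` is the strongest authority because all three hubs link to it, and `h1` and `h2` are better hubs than `h3` because they point to both authorities.
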
  17. Slide 17: Rank manipulation. “The bubble of Web visibility”. Content spam: keyword stuffing, content hiding. Link spam: link farms. Cloaking.
  18. Slide 18: Web mining
  19. Slide 19: Content mining. Extraction of knowledge from Web pages. BUT ... HTML is physical formatting: there is information loss.
  20. Slide 20: Information loss
  21. Slide 21: Aspects of content mining. Information extraction: revert information loss. Content classification: topic, genre, sentiment analysis.
  22. Slide 22: Link mining. Scale-free networks.
  23. Slide 23: Macroscopic view. Bow-tie structure.
  24. Slide 24: Usage mining. Logfile analysis; query logs; privacy issues.
  25. Slide 25: Emerging topics. Mobile Web; Semantic Web; ...