2. What is web mining?
• Web mining is the use of data mining techniques to automatically
discover and extract information from web documents and services.
• Discovering knowledge from and about the WWW is one of the basic
abilities of an intelligent agent.
4. Web Mining vs. Data Mining
• Structure (or lack of it)
• Textual information and linkage structure
• Scale
• Data generated per day is comparable to largest conventional data
warehouses
• Speed
• Often need to react to evolving usage patterns in real-time (e.g.,
merchandising)
5. Web Mining topics
• Web graph analysis
• Power Laws and The Long Tail
• Structured data extraction
• Web advertising
• Systems Issues
6. Size of the Web
• Number of pages
• Technically, infinite
• Much duplication (30-40%)
• Best estimate of “unique” static HTML pages comes from search engine
claims
• Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
• Google recently announced that their index contains 1 trillion pages
• How to explain the discrepancy?
7. The web as a graph
• Pages = nodes, hyperlinks = edges
• Ignore content
• Directed graph
• High linkage
• 10-20 links/page on average
• Power-law degree distribution
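The graph view above can be sketched in a few lines. This is a minimal illustration, not a crawler: the page names and the `links` dictionary are made up, and the helper names `degree_counts` and `degree_histogram` are hypothetical.

```python
from collections import Counter

# Hypothetical toy web: each page maps to the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

def degree_counts(links):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    out_degree = {page: len(targets) for page, targets in links.items()}
    in_degree = Counter()
    for page in links:
        in_degree[page] += 0  # make sure pages with no in-links still appear
    for targets in links.values():
        for target in targets:
            in_degree[target] += 1
    return dict(in_degree), out_degree

def degree_histogram(degrees):
    """Count how many pages have each degree value."""
    return dict(Counter(degrees.values()))

in_deg, out_deg = degree_counts(links)
print(in_deg)
print(out_deg)
print(degree_histogram(in_deg))
```

On a real crawl, plotting the degree histogram on log-log axes is one common way to check for the power-law shape mentioned above.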
10. Measures
• Structure
• In-degrees
• Out-degrees
• Number of pages per site
• Usage patterns
• Number of visitors
• Popularity e.g., products, movies, music
12. Measures
• Shelf space is a scarce commodity for traditional retailers
• Also: TV networks, movie theaters,…
• The web enables near-zero-cost dissemination of information about
products
• More choice necessitates better filters
• Recommendation engines (e.g., Amazon)
• How “Into Thin Air” made “Touching the Void” a bestseller
14. Two approaches for analyzing data
• Machine Learning approach
• Emphasizes sophisticated algorithms e.g., Support Vector Machines
• Data sets tend to be small, fit in memory
• Data Mining approach
• Emphasizes big data sets (e.g., in the terabytes)
• Data cannot even fit on a single disk!
• Necessarily leads to simpler algorithms
15. View of mining system
[Figure: multiple nodes, each with its own CPU, Mem, and Disk, …]
16. Issues
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server!
• Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
• Without breaking the bank!
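One way to picture mining across many servers is "split the data, count locally, merge the partial results". The sketch below simulates that on one machine. The slides do not name a framework, so the function names (`split`, `local_count`, `merge`) and the sample page views are assumptions for illustration only.

```python
from collections import Counter
from functools import reduce

def split(records, n_workers):
    """Deal records out round-robin, one chunk per (simulated) server."""
    return [records[i::n_workers] for i in range(n_workers)]

def local_count(chunk):
    """Work done independently on each server: count its own records."""
    return Counter(chunk)

def merge(a, b):
    """Combine two partial counts; Counter addition sums matching keys."""
    return a + b

# Hypothetical page-view records.
page_views = ["/index.html", "/a.html", "/index.html",
              "/b.html", "/a.html", "/index.html"]

chunks = split(page_views, 3)
partials = [local_count(chunk) for chunk in chunks]
total = reduce(merge, partials, Counter())
print(total.most_common())
```

The per-chunk step needs no communication between servers, which is why simple counting-style algorithms scale to data sets that no single disk can hold.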
17. What should it do?
• Finding relevant information
• Low precision and unindexed information
• Creating new knowledge out of available information on the web
• A data-triggered process
• Personalizing the information
• Personal preference in content and presentation of the information
• Learning about the consumers
• What does the customer want to do?
18. Direct vs Indirect web mining
• Web mining techniques can be used to solve the information
overload problems:
Directly
Address the problem with web mining techniques
E.g. a newsgroup agent classifies whether a news item is relevant
Indirectly
Used as part of a bigger application that addresses problems
E.g. used to create index terms for a web search service
19. Web Mining Categories
• Web Content Mining
Discovering useful information from web page
contents/data/documents.
• Web Structure Mining
Discovering the model underlying link structures (topology)
on the Web. E.g. discovering authorities and hubs
• Web Usage Mining
Extraction of interesting knowledge from logging information
produced by web servers.
Usage data from logs, user profiles, user sessions, cookies, user
queries, bookmarks, mouse clicks and scrolls, etc.
20. Types
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
22. Web Content Data Structure
• Web content consists of several types of data
• Text, image, audio, video, hyperlinks.
• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML
pages
Note: much of the Web content data is unstructured text data.
23. Web Content Mining
• Unstructured Documents
Bag of words to represent unstructured documents
Takes a single word as a feature
Ignores the sequence in which words occur
Features could be
Boolean
Word either occurs or does not occur in a document
Frequency based
Frequency of the word in a document
Variations of the feature selection include
Removing the case, punctuation, infrequent words and stop words
Features can be reduced using different feature selection techniques:
Information gain, mutual information, cross entropy.
Stemming, which reduces words to their morphological roots.
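The bag-of-words representation can be sketched as follows. This is a minimal example: the stop-word list is a tiny made-up sample, the function names are hypothetical, and stemming and feature selection are left out.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Remove case and punctuation, then drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def frequency_features(text):
    """Frequency-based features: how often each word occurs."""
    return Counter(tokenize(text))

def boolean_features(text, vocabulary):
    """Boolean features: does each vocabulary word occur at all?"""
    present = set(tokenize(text))
    return {word: word in present for word in vocabulary}

doc = "The web is a graph, and the graph of the Web is large."
vocab = ["web", "graph", "mining"]

freq = frequency_features(doc)
flags = boolean_features(doc, vocab)
print(freq)
print(flags)
```

Word order is discarded on purpose: both feature dictionaries would be identical for any reordering of the same words.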
24. Web Content Mining
• Semi-Structured Documents
Uses richer representations for features
Due to the additional structural information in the hypertext
document (typically HTML and hyperlinks)
Uses common data mining methods (whereas
unstructured documents tend to use text mining methods)
Application:
Hypertext classification or categorization and clustering,
learning relations between web documents,
learning extraction patterns or rules, and
finding patterns in semi-structured data.
25. Web Content Mining: DB View
• The database techniques on the Web are related to the problems of managing
and querying the information on the Web.
• DB view tries to infer the structure of a Web site or transform a Web site to
become a database
Better information management
Better querying on the Web
• Can be achieved by:
Finding the schema of Web documents
Building a Web warehouse
Building a Web knowledge base
Building a virtual database
26. Web Content Mining: DB View
• DB view mainly uses the Object Exchange Model (OEM)
Represents semi-structured data by a labeled graph
The data in the OEM is viewed as a graph, with objects as the vertices
and labels on the edges
Each object is identified by an object identifier (oid) and has a
value that is either atomic or complex
• Process typically starts with manual selection of Web sites for
doing Web content mining
• Main application:
• The task of finding frequent substructures in semi-structured data
• The task of creating a multi-layered database
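A labeled graph in the spirit of OEM might be represented as below. This is only a sketch under assumptions: the `OEMObject` class, the `&n` oid notation, and the sample page data are hypothetical, and a complex value is modeled simply as a list of (label, child oid) edges.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float]
# A complex value is a list of labeled edges: (label, oid of the child object).
Complex = List[Tuple[str, str]]

@dataclass
class OEMObject:
    oid: str
    value: Union[Atomic, Complex]

    def is_atomic(self) -> bool:
        return not isinstance(self.value, list)

# Hypothetical data describing one web page with a title and two links.
objects: Dict[str, OEMObject] = {
    o.oid: o
    for o in [
        OEMObject("&1", [("title", "&2"), ("link", "&3"), ("link", "&4")]),
        OEMObject("&2", "Web Mining"),
        OEMObject("&3", "http://example.com/a"),
        OEMObject("&4", "http://example.com/b"),
    ]
}

def children(objects, oid, label):
    """Follow every edge with the given label out of object `oid`."""
    obj = objects[oid]
    if obj.is_atomic():
        return []
    return [objects[child].value
            for edge_label, child in obj.value
            if edge_label == label]

print(children(objects, "&1", "link"))
```

Because no schema is fixed in advance, two pages can carry different sets of labels, which is what makes the model suitable for semi-structured data.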
29. Web Structure Mining
• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs, authorities, etc.) based on
the incoming and outgoing links.
• Application:
• Discovering micro-communities in the Web
• Measuring the “completeness” of a Web site
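Hubs and authorities can be scored with a HITS-style iteration. The slides do not name a specific algorithm, so treat this as one possible sketch: the toy `links` graph, the iteration count, and the function names are assumptions.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=20):
    """Iteratively compute hub and authority scores for a directed graph."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, []))
                   for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical toy graph: two pages that point at content pages.
links = {
    "portal": ["news", "docs"],
    "blog": ["news"],
}

hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

A good hub points to many good authorities, and a good authority is pointed to by many good hubs; the loop alternates between the two definitions until the scores settle.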
30. Web Usage Mining
• Tries to predict user behavior from interaction
with the Web
• Wide range of data (logs)
Web client data
Proxy server data
Web server data
• Two common approaches
Map the Web server usage data into relational tables, then
apply adapted data mining techniques
Use the log data directly, with special pre-processing
techniques
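The first approach, mapping server log data into relational tables, might look like the sketch below. It assumes log lines in the Common Log Format; the sample lines, the `hits` table layout, and the `load_log` helper are made up for illustration.

```python
import re
import sqlite3

# Assumed input format: Common Log Format lines.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 512',
    '10.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /products.html HTTP/1.0" 200 512',
]

def load_log(lines):
    """Parse log lines into an in-memory relational table named `hits`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, url TEXT, status INTEGER)"
    )
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None:
            continue  # skip lines that do not match the assumed format
        conn.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
            (m["host"], m["time"], m["method"], m["url"], int(m["status"])),
        )
    return conn

conn = load_log(sample_log)
rows = conn.execute(
    "SELECT url, COUNT(*) FROM hits GROUP BY url ORDER BY COUNT(*) DESC, url"
).fetchall()
print(rows)
```

Once the data is in a table, ordinary SQL aggregation (here, hits per URL) becomes the starting point for further mining.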
31. Web Usage Mining
[Figure: Pre-Processing produces the User Session File; Pattern Discovery produces Rules and Patterns; Pattern Analysis produces Interesting Knowledge]
33. Use of Multi-Layer Meta-Web
• Benefits of Multi-Layer Meta-Web:
• Multi-dimensional Web info summary analysis
• Approximate and intelligent query answering
• Web high-level query answering (WebSQL, WebML)
• Web content and structure mining
• Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
• Benefits even if it is partially constructed
• Benefits may justify the cost of tool development,
standardization and partial restructuring
34. Web Search Products and Services
Alta Vista
DB2 text extender
Excite
Fulcrum
Glimpse (Academic)
Google!
Infoseek Internet
Infoseek Intranet
Inktomi (HotBot)
Lycos
PLS
Smart (Academic)
Oracle text extender
Verity
Yahoo!
35. Web Usage Mining
• Typical problems:
• Distinguishing among unique users, server sessions,
episodes, etc. in the presence of caching and proxy
servers
• Often Usage Mining uses some background or domain
knowledge
E.g. site topology, Web content, etc.
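A common heuristic for separating users and sessions can be sketched as follows. The slides do not prescribe a method, so everything here is an assumption: users are approximated by the (IP address, user agent) pair, a gap of more than 30 minutes starts a new session, and the records are invented.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed threshold

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (timestamp, ip, user_agent, url) records into sessions.

    A user is approximated by the (ip, user_agent) pair, which only
    partly separates people who share a proxy.
    """
    by_user = defaultdict(list)
    for ts, ip, agent, url in sorted(requests):
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for user, visits in by_user.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))  # gap too long: close session
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

# Hypothetical records: timestamps are seconds since the start of the log.
requests = [
    (0, "10.0.0.1", "Firefox", "/index.html"),
    (60, "10.0.0.1", "Firefox", "/products.html"),
    (30, "10.0.0.1", "Chrome", "/index.html"),
    (4000, "10.0.0.1", "Firefox", "/index.html"),
]

sessions = sessionize(requests)
for user, urls in sessions:
    print(user, urls)
```

Requests served from a browser or proxy cache never reach the server log, so sessions rebuilt this way can miss page views; background knowledge such as site topology is one way to infer the missing steps.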
36. Web Usage Mining
• Applications:
• Two main categories:
Learning a user profile (personalized)
Web users would be interested in techniques that learn their
needs and preferences automatically
Learning user navigation patterns (impersonalized)
Information providers would be interested in techniques that
improve the effectiveness of their Web site