Web mining

By: Mudit Dholakia
Guide: Dr. Amit Ganatra Sir

What is web mining?
• Web mining is the use of data mining techniques to automatically
discover and extract information from web documents and services.
• Discovering knowledge from and about the WWW is one of the basic
abilities of an intelligent agent.

Web Mining vs. Data Mining
• Structure (or lack of it)
  • Textual information and linkage structure
• Scale
  • Data generated per day is comparable to the largest conventional data
    warehouses
• Speed
  • Often need to react to evolving usage patterns in real time (e.g.,
    merchandising)

Web Mining topics
• Web graph analysis
• Power Laws and The Long Tail
• Structured data extraction
• Web advertising
• Systems Issues

Size of the Web
• Number of pages
  • Technically, infinite
  • Much duplication (30-40%)
  • Best estimate of “unique” static HTML pages comes from search engine
    claims
    • Until last year, Google claimed 8 billion (?), Yahoo claimed 20 billion
    • Google recently announced that their index contains 1 trillion pages
    • How to explain the discrepancy?

The web as a graph
• Pages = nodes, hyperlinks = edges
  • Ignore content
  • Directed graph
• High linkage
  • 10-20 links/page on average
  • Power-law degree distribution
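
The graph view above can be sketched in a few lines of Python. This is only an illustration: the page names and the `degree_counts` helper are hypothetical, and a real crawl would not fit in an in-memory dictionary. It stores the directed graph as adjacency lists and derives the out-degree, in-degree, and average links per page.

```python
from collections import Counter

# Hypothetical toy web graph: each page maps to the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

def degree_counts(links):
    """Return (out_degree, in_degree) dictionaries for a directed graph."""
    # Out-degree: number of hyperlinks leaving each page.
    out_degree = {page: len(targets) for page, targets in links.items()}
    # In-degree: number of hyperlinks pointing at each page.
    in_degree = Counter()
    for targets in links.values():
        in_degree.update(targets)
    # Pages that nobody links to still get an in-degree of 0.
    for page in links:
        in_degree.setdefault(page, 0)
    return out_degree, dict(in_degree)

out_degree, in_degree = degree_counts(links)
average_links_per_page = sum(out_degree.values()) / len(out_degree)
print(out_degree, in_degree, average_links_per_page)
```

Counting how many pages share each in-degree value gives the degree distribution that the power-law claim refers to.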

Measures
• Structure
  • In-degrees
  • Out-degrees
  • Number of pages per site
• Usage patterns
  • Number of visitors
  • Popularity, e.g., products, movies, music

The Long Tail
• Shelf space is a scarce commodity for traditional retailers
  • Also: TV networks, movie theaters, …
• The web enables near-zero-cost dissemination of information about
products
• More choice necessitates better filters
  • Recommendation engines (e.g., Amazon)
  • How Into Thin Air made Touching the Void a bestseller

Searching the Web
[Figure: content aggregators, the Web, and content consumers]

Two approaches for analyzing data
• Machine Learning approach
  • Emphasizes sophisticated algorithms, e.g., Support Vector Machines
  • Data sets tend to be small and fit in memory
• Data Mining approach
  • Emphasizes big data sets (e.g., in the terabytes)
  • Data cannot even fit on a single disk!
  • Necessarily leads to simpler algorithms

View of mining system
[Figure: a row of machines, each with its own CPU, memory (Mem), and disk]

Issues
• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server!
  • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  • Without breaking the bank!

What should it do?
• Finding relevant information
  • Low precision and unindexed information
• Creating new knowledge out of available information on the web
  • A data-triggered process
• Personalizing the information
  • Personal preference in content and presentation of the information
• Learning about the consumers
  • What does the customer want to do?

Direct vs. Indirect Web Mining
• Web mining techniques can be used to solve information
overload problems:
  • Directly: the problem is addressed with web mining techniques
    E.g., a newsgroup agent classifies whether a news item is relevant
  • Indirectly: web mining is used as part of a bigger application that
    addresses the problem
    E.g., it is used to create index terms for a web search service

Web Mining Categories
• Web Content Mining
Discovering useful information from web page
contents/data/documents.
• Web Structure Mining
Discovering the model underlying link structures (topology)
on the Web. E.g. discovering authorities and hubs
• Web Usage Mining
Extraction of interesting knowledge from logging information
produced by web servers.
Usage data from logs, user profiles, user sessions, cookies, user
queries, bookmarks, mouse clicks and scrolls, etc.

Types
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining

[Figure: an IR system takes a query and a document source and returns ranked documents; a clustering system takes a similarity measure and a document source and returns groups of documents]

Web Content Data Structure
• Web content consists of several types of data
• Text, image, audio, video, hyperlinks.
• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML
pages
Note: much of the Web content data is unstructured text data.

Web Content Mining
• Unstructured Documents
  • Bag of words is used to represent unstructured documents
    • Takes a single word as a feature
    • Ignores the sequence in which words occur
  • Features can be
    • Boolean: the word either occurs or does not occur in a document
    • Frequency based: the frequency of the word in a document
  • Variations of the feature selection include
    • Removing case, punctuation, infrequent words, and stop words
  • Features can be reduced using different feature selection techniques:
    • Information gain, mutual information, cross entropy
    • Stemming, which reduces words to their morphological roots
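
As a rough illustration of the bag-of-words representation described above, the following Python sketch builds both Boolean and frequency-based features. The stop-word list, the function names, and the sample sentence are assumptions made for this example. Stemming and the statistical feature selection techniques are left out.

```python
import string
from collections import Counter

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]

def frequency_features(text):
    """Frequency-based features: how often each word occurs."""
    return Counter(tokenize(text))

def boolean_features(text, vocabulary):
    """Boolean features: whether each vocabulary word occurs at all."""
    present = set(tokenize(text))
    return {word: word in present for word in vocabulary}

doc = "The Web is a graph, and the graph is large."
freq = frequency_features(doc)
flags = boolean_features(doc, ["web", "graph", "mining"])
print(freq)
print(flags)
```

Word order is discarded in both representations, which is exactly the simplification the bag-of-words model makes.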

Web Content Mining
• Semi-Structured Documents
  • Uses richer representations for features, due to the additional
    structural information in the hypertext document (typically HTML
    and hyperlinks)
  • Uses common data mining methods (whereas unstructured documents
    might use more text mining methods)
  • Applications:
    • Hypertext classification or categorization and clustering
    • Learning relations between web documents
    • Learning extraction patterns or rules
    • Finding patterns in semi-structured data

Web Content Mining: DB View
• The database techniques on the Web are related to the problems of managing
and querying the information on the Web.
• DB view tries to infer the structure of a Web site or transform a Web site to
become a database
Better information management
Better querying on the Web
• Can be achieved by:
Finding the schema of Web documents
Building a Web warehouse
Building a Web knowledge base
Building a virtual database

Web Content Mining: DB View
• DB view mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data by a labeled graph
  • The data in the OEM is viewed as a graph, with objects as the vertices
    and labels on the edges
    • Each object is identified by an object identifier (oid)
    • Its value is either atomic or complex
• Process typically starts with manual selection of Web sites for
doing Web content mining
• Main applications:
  • Finding frequent substructures in semi-structured data
  • Creating a multi-layered database
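
A minimal sketch of an OEM-style labeled graph, assuming a plain Python dictionary keyed by oid. The `OEMObject` class, the `children` helper, the `&n` oid spelling, and the sample data are hypothetical names chosen for this illustration and are not taken from any OEM implementation. An object holds either an atomic value or, if complex, a list of labeled edges to other objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float]

@dataclass
class OEMObject:
    oid: str
    # Atomic value, or a list of (label, child oid) edges for a complex object.
    value: Union[Atomic, List[Tuple[str, str]]] = field(default_factory=list)

    def is_complex(self) -> bool:
        return isinstance(self.value, list)

def children(db: Dict[str, OEMObject], oid: str, label: str) -> List[OEMObject]:
    """Follow every edge with the given label out of the object `oid`."""
    obj = db[oid]
    if not obj.is_complex():
        return []
    return [db[child] for edge_label, child in obj.value if edge_label == label]

# Hypothetical site: a root object with two pages, each with an atomic title.
db = {
    "&1": OEMObject("&1", [("page", "&2"), ("page", "&3")]),
    "&2": OEMObject("&2", [("title", "&4")]),
    "&3": OEMObject("&3", [("title", "&5")]),
    "&4": OEMObject("&4", "Web mining"),
    "&5": OEMObject("&5", "Data mining"),
}

titles = [
    title.value
    for page in children(db, "&1", "page")
    for title in children(db, page.oid, "title")
]
print(titles)
```

Because labels live on the edges rather than in a fixed schema, two pages may carry different sets of labels without any change to the data structure.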

Taxonomies
• Ranking
• Graph Search
• Communities
• Hyperlink Induced Topic Search
• SEO
• Hub & Authorities

Web Structure Mining
• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs, authorities, etc.) based on
the incoming and outgoing links.
• Applications:
  • Discovering micro-communities in the Web
  • Measuring the “completeness” of a Web site
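
One common way to find hubs and authorities from link structure is the iterative scheme used by Hyperlink-Induced Topic Search (HITS). The sketch below is a simplified version under stated assumptions: the toy graph, the page names, and the fixed iteration count are all illustrative, and the slides do not specify an implementation. A page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a directed graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: add up hub scores of pages that link in.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: add up authority scores of pages linked to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {page: v / auth_norm for page, v in auth.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}
    return hub, auth

# Hypothetical graph: two portals and a blog point at two content pages.
links = {
    "portal1": ["paper", "tutorial"],
    "portal2": ["paper", "tutorial"],
    "blog": ["paper"],
    "paper": [],
    "tutorial": [],
}
hub, auth = hits(links)
print(max(auth, key=auth.get), sorted(hub, key=hub.get, reverse=True)[:2])
```

In this toy graph the pages with many in-links from good hubs score highest as authorities, while the portals that link to them score highest as hubs.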

Web Usage Mining
• Tries to predict user behavior from interaction
with the Web
• Wide range of data (logs)
  • Web client data
  • Proxy server data
  • Web server data
• Two common approaches
  • Map the usage data of the Web server into relational tables before
    applying adapted data mining techniques
  • Use the log data directly by applying special pre-processing
    techniques
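
The first approach, mapping server usage data into table-like rows, might look like the sketch below. It assumes the server writes the Common Log Format, which the slides do not state. The regular expression, the `parse_line` helper, and the sample log line are illustrative.

```python
import re
from datetime import datetime

# Assumed input: Common Log Format, e.g.
# host ident user [time] "METHOD url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Turn one log line into a row (dict), or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    row["status"] = int(row["status"])
    # A size of "-" means no bytes were returned.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row

sample = '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
row = parse_line(sample)
print(row)
```

Each returned dictionary has the same keys, so a list of them can be loaded into a relational table with one column per key.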

Web Usage Mining
[Figure: the three stages Pre-Processing, Pattern Discovery, and Pattern Analysis, which produce a user session file, rules and patterns, and interesting knowledge, respectively]

XML View
[Figure: a stack of layers (Layer0, Layer1, ..., Layern) labeled “Generalized Descriptions” and “More Generalized Descriptions”]

Use of Multi-Layer Meta Web
• Benefits of Multi-Layer Meta-Web:
• Multi-dimensional Web info summary analysis
• Approximate and intelligent query answering
• Web high-level query answering (WebSQL, WebML)
• Web content and structure mining
• Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
• Benefits even if it is partially constructed
• Benefits may justify the cost of tool development,
standardization and partial restructuring

Web Search Products and Services
• Alta Vista
• DB2 text extender
• Excite
• Fulcrum
• Glimpse (Academic)
• Google!
• Infoseek Internet
• Infoseek Intranet
• Inktomi (HotBot)
• Lycos
• PLS
• Smart (Academic)
• Oracle text extender
• Verity
• Yahoo!

Web Usage Mining
• Typical problems:
• Distinguishing among unique users, server sessions,
episodes, etc. in the presence of caching and proxy
servers
• Often Usage Mining uses some background or domain
knowledge
E.g. site topology, Web content, etc.
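
Session identification is often approximated with a timeout heuristic. The sketch below assumes that a user can be keyed by a string such as IP address plus user agent, and that a gap of more than 30 minutes starts a new session. Both are assumptions for illustration, not rules from the slides, and neither fully solves the caching and proxy problems noted above.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (user_key, timestamp, url) requests into per-user sessions."""
    sessions = []
    last_seen = {}  # user_key -> (time of last request, index of open session)
    for user, when, url in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(user)
        if previous is None or when - previous[0] > timeout:
            # First request from this user, or the gap exceeded the timeout.
            sessions.append({"user": user, "urls": []})
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index]["urls"].append(url)
        last_seen[user] = (when, index)
    return sessions

# Hypothetical requests; the user key combines IP address and browser.
t0 = datetime(2024, 1, 1, 9, 0)
requests = [
    ("10.0.0.1|Firefox", t0, "/"),
    ("10.0.0.1|Firefox", t0 + timedelta(minutes=5), "/products"),
    ("10.0.0.1|Chrome", t0 + timedelta(minutes=6), "/"),
    ("10.0.0.1|Firefox", t0 + timedelta(hours=2), "/contact"),
]
sessions = sessionize(requests)
print(sessions)
```

Background knowledge such as site topology could refine this further, for example by splitting a session when a request could not have been reached by a link from the previous page.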

Web Usage Mining
• Applications:
• Two main categories:
  • Learning a user profile (personalized)
    Web users would be interested in techniques that learn their
    needs and preferences automatically
  • Learning user navigation patterns (impersonalized)
    Information providers would be interested in techniques that
    improve the effectiveness of their Web site
