Web Mining
By: Mudit Dholakia
Guide: Dr. Amit Ganatra
What is web mining?
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services.
• Discovering knowledge from and about the WWW is one of the basic abilities of an intelligent agent.
[Figure: WWW → Knowledge]
Web Mining vs. Data Mining
• Structure (or lack of it)
  • Textual information and linkage structure
• Scale
  • Data generated per day is comparable to the largest conventional data warehouses
• Speed
  • Often need to react to evolving usage patterns in real time (e.g., merchandising)
Web Mining topics
• Web graph analysis
• Power Laws and The Long Tail
• Structured data extraction
• Web advertising
• Systems Issues
Size of the Web
• Number of pages
  • Technically, infinite
  • Much duplication (30-40%)
• Best estimate of “unique” static HTML pages comes from search engine claims
  • Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
  • Google recently announced that its systems had seen 1 trillion unique URLs (a count of known URLs, not necessarily of indexed pages)
  • How to explain the discrepancy?
The web as a graph
• Pages = nodes, hyperlinks = edges
• Ignore content
• Directed graph
• High linkage
• 10-20 links/page on average
• Power-law degree distribution
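The graph view above can be made concrete with a small sketch. The page names and links below are hypothetical, and the helper names `degree_counts` and `degree_histogram` are chosen only for illustration. The sketch stores the web as an adjacency list, computes in-degrees and out-degrees, and tallies how many pages share each degree, which is the quantity one would plot to look for a power law on real crawl data.

```python
from collections import Counter

# Hypothetical toy web: each key is a page, each value lists the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

def degree_counts(graph):
    """Return (in_degree, out_degree) dicts for a directed graph."""
    out_degree = {page: len(targets) for page, targets in graph.items()}
    in_degree = Counter()
    for targets in graph.values():
        in_degree.update(targets)
    # Pages that nobody links to still get an explicit in-degree of 0.
    for page in graph:
        in_degree.setdefault(page, 0)
    return dict(in_degree), out_degree

def degree_histogram(degrees):
    """Map each degree value to the number of pages that have it."""
    return dict(Counter(degrees.values()))

in_deg, out_deg = degree_counts(links)
print(in_deg)
print(degree_histogram(in_deg))
```

With only four pages the histogram says nothing about power laws; the same two functions applied to a large crawl would give the counts to plot on log-log axes.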
Structure of Web graph
Power-law degree distribution
Measures
• Structure
• In-degrees
• Out-degrees
• Number of pages per site
• Usage patterns
• Number of visitors
• Popularity e.g., products, movies, music
The Long Tail
• Shelf space is a scarce commodity for traditional retailers
• Also: TV networks, movie theaters,…
• The web enables near-zero-cost dissemination of information about
products
• More choice necessitates better filters
• Recommendation engines (e.g., Amazon)
• How Into Thin Air made Touching the Void a bestseller
Searching the Web
[Figure: content aggregators, the Web, and content consumers]
Two approaches for analyzing data
• Machine Learning approach
• Emphasizes sophisticated algorithms e.g., Support Vector Machines
• Data sets tend to be small, fit in memory
• Data Mining approach
• Emphasizes big data sets (e.g., in the terabytes)
• Data cannot even fit on a single disk!
• Necessarily leads to simpler algorithms
View of mining system
[Figure: several machines side by side, each with its own CPU, memory (Mem) and disk]
Issues
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server!
• Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
• Without breaking the bank!
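One common way to organize such work is to split the data into shards, let each machine process its own shard, and then merge the partial results. The sketch below simulates that pattern in a single process; the shard contents and the function names `count_shard` and `merge_counts` are assumptions made for illustration, and a real deployment would run the per-shard step on separate servers.

```python
from collections import Counter
from functools import reduce

def count_shard(lines):
    """Per-shard step: count words in one shard, as one machine would."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def merge_counts(left, right):
    """Merge step: combine two partial counts into one."""
    return left + right

# Hypothetical data, already partitioned into three shards.
shards = [
    ["web mining at scale", "mining the web"],
    ["the web is large"],
    ["mining large data sets"],
]

partial = [count_shard(shard) for shard in shards]  # would run in parallel
total = reduce(merge_counts, partial, Counter())
print(total.most_common(3))
```

Because each partial count depends only on its own shard, no machine ever needs to hold the full data set, which is why simple, mergeable computations suit this setting.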
What should it do?
• Finding relevant information
• Low precision and unindexed information
• Creating new knowledge out of available information on the web
• A data-triggered process
• Personalizing the information
• Personal preference in content and presentation of the information
• Learning about the consumers
• What does the customer want to do?
Direct vs. Indirect Web Mining
• Web mining techniques can be used to address the information overload problem:
  • Directly: the problem is addressed with web mining techniques themselves
    E.g., a newsgroup agent classifies whether a news item is relevant
  • Indirectly: web mining is used as part of a bigger application that addresses the problem
    E.g., it is used to create index terms for a web search service
Web Mining Categories
• Web Content Mining
Discovering useful information from web page
contents/data/documents.
• Web Structure Mining
Discovering the model underlying link structures (topology)
on the Web. E.g. discovering authorities and hubs
• Web Usage Mining
Extraction of interesting knowledge from logging information
produced by web servers.
Usage data from logs, user profiles, user sessions, cookies, user
queries, bookmarks, mouse clicks and scrolls, etc.
Types
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
[Figure: IR system. A query and a document source go in; ranked documents come out.]
[Figure: clustering system. A similarity measure and a document source go in; groups of documents come out.]
Web Content Data Structure
• Web content consists of several types of data
• Text, image, audio, video, hyperlinks.
• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML pages
Note: much of the Web content data is unstructured text data.
Web Content Mining
• Unstructured Documents
  • Bag of words to represent unstructured documents
    • Takes each single word as a feature
    • Ignores the sequence in which words occur
  • Features could be
    • Boolean: a word either occurs or does not occur in a document
    • Frequency based: the frequency of the word in a document
  • Variations of the feature selection include
    • Removing case, punctuation, infrequent words and stop words
  • Features can be reduced using:
    • Feature selection measures: information gain, mutual information, cross entropy
    • Stemming, which reduces words to their morphological roots
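The bag-of-words representation described above can be sketched in a few lines. The stop-word list here is a tiny illustrative set, not a standard one, and the function names are assumptions. The sketch lowercases the text, strips punctuation, drops stop words, and then builds both kinds of features: frequency based and Boolean.

```python
import string
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}

def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]

def frequency_features(text):
    """Frequency-based features: word -> number of occurrences."""
    return Counter(tokenize(text))

def boolean_features(text, vocabulary):
    """Boolean features: word -> whether it occurs in the document."""
    present = set(tokenize(text))
    return {word: word in present for word in vocabulary}

doc = "The web is a graph. Mining the web graph is mining links."
freq = frequency_features(doc)
flags = boolean_features(doc, ["web", "links", "audio"])
print(freq)
print(flags)
```

Word order is discarded on purpose: two documents with the same words in a different order get identical features. Stemming is left out to keep the sketch dependency-free.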
Web Content Mining
• Semi-Structured Documents
  • Uses richer representations for features, thanks to the additional structural information in the hypertext document (typically HTML and hyperlinks)
  • Uses common data mining methods (whereas unstructured documents might call for more text mining methods)
  • Applications:
    • Hypertext classification or categorization, and clustering
    • Learning relations between web documents
    • Learning extraction patterns or rules
    • Finding patterns in semi-structured data
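As an illustration of the richer features that HTML structure makes available, the sketch below pulls the page title and each hyperlink with its anchor text out of a page, using Python's standard `html.parser` module. The class name `PageFeatures` and the sample page are hypothetical, and real pages would need more robust handling (nested or malformed tags, relative URLs).

```python
from html.parser import HTMLParser

class PageFeatures(HTMLParser):
    """Collect the <title> text and (href, anchor text) pairs from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []          # list of (href, anchor text)
        self._in_title = False
        self._href = None        # href of the <a> currently open, if any
        self._anchor = []        # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

page = (
    "<html><head><title>Web Mining</title></head><body>"
    '<p>See <a href="/hits">hubs and <b>authorities</b></a>.</p>'
    "</body></html>"
)
parser = PageFeatures()
parser.feed(page)
print(parser.title, parser.links)
```

Title words and anchor text can then be treated as separate feature groups alongside the body text, something a plain bag of words cannot express.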
Web Content Mining: DB View
• The database techniques on the Web are related to the problems of managing
and querying the information on the Web.
• DB view tries to infer the structure of a Web site or transform a Web site to
become a database
Better information management
Better querying on the Web
• Can be achieved by:
Finding the schema of Web documents
Building a Web warehouse
Building a Web knowledge base
Building a virtual database
Web Content Mining: DB View
• DB view mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data by a labeled graph
  • The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges
    • Each object is identified by an object identifier (oid)
    • Its value is either atomic or complex
• Process typically starts with manual selection of Web sites for doing Web content mining
• Main applications:
  • Finding frequent substructures in semi-structured data
  • Creating a multi-layered database
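A minimal sketch of an OEM-style labeled graph follows. It assumes that a complex value is represented as a list of (edge label, target oid) pairs and an atomic value as a plain string or number; the oids, labels and values are hypothetical, and `OEMObject` and `children` are illustrative names rather than part of any OEM library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float]
# Assumption: a complex value is a list of (edge label, target oid) pairs.
Complex = List[Tuple[str, str]]

@dataclass
class OEMObject:
    oid: str
    value: Union[Atomic, Complex]

    def is_atomic(self) -> bool:
        return not isinstance(self.value, list)

# Hypothetical data: one complex object pointing at three atomic ones.
store: Dict[str, OEMObject] = {
    "&1": OEMObject("&1", [("title", "&2"), ("author", "&3"), ("author", "&4")]),
    "&2": OEMObject("&2", "Web Mining"),
    "&3": OEMObject("&3", "Author A"),
    "&4": OEMObject("&4", "Author B"),
}

def children(store: Dict[str, OEMObject], oid: str, label: str) -> list:
    """Follow every edge with the given label and return the target values."""
    obj = store[oid]
    if obj.is_atomic():
        return []
    return [store[target].value for edge, target in obj.value if edge == label]

print(children(store, "&1", "author"))
```

Because the labels live on the edges rather than in a fixed schema, the same object may carry any number of `author` edges or none at all, which is what makes the model suit semi-structured data.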
Taxonomies
• Ranking
• Graph Search
• Communities
• Hyperlink Induced Topic Search
• SEO
• Hub & Authorities
Web Structure Mining
• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs and authorities) based on the incoming and outgoing links
• Applications:
  • Discovering micro-communities in the Web
  • Measuring the “completeness” of a Web site
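Hubs and authorities are usually computed with the HITS (Hyperlink-Induced Topic Search) iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The sketch below is a minimal version under assumptions: the graph is a hypothetical toy example, the iteration count is fixed rather than tested for convergence, and scores are normalized to unit Euclidean length each round.

```python
import math

def hits(graph, iterations=50):
    """Return (hub, authority) score dicts for a directed graph given as
    {page: [pages it links to]}."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum the authority scores of all pages linked to.
        hub = {page: sum(auth[t] for t in graph.get(page, [])) for page in pages}
        # Normalize both score vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth

# Hypothetical toy graph: two portal pages pointing at two papers.
graph = {
    "portal1": ["paperA", "paperB"],
    "portal2": ["paperA", "paperB"],
    "paperA": [],
    "paperB": ["paperA"],
}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph the portals come out as the strongest hubs and `paperA` as the strongest authority, matching the intuition that good hubs point to good authorities.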
Web Usage Mining
• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  • Web client data
  • Proxy server data
  • Web server data
• Two common approaches
  • Map the Web server's usage data into relational tables, then apply an adapted data mining technique
  • Use the log data directly, with special pre-processing techniques
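The first approach, mapping server logs into relational rows, might look like the sketch below. It assumes the server writes the Common Log Format; the regular expression, the field names and the sample line are illustrative, and other log formats would need a different pattern.

```python
import re
from datetime import datetime

# Assumed input: one Common Log Format line per request.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Turn one log line into a dict (one relational row), or None if it
    does not match the expected format."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row

sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
row = parse_line(sample)
print(row["host"], row["path"], row["status"])
```

Each resulting dict has the same columns, so the rows can be loaded into a relational table and queried or mined with standard tools.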
Web Usage Mining
[Figure: usage mining pipeline. Pre-Processing → Pattern Discovery → Pattern Analysis, producing in turn a user session file, rules and patterns, and interesting knowledge.]
XML View
[Figure: layers Layer 0, Layer 1, ..., Layer n, labeled with generalized and more generalized descriptions]
Use of Multi-Layer Meta Web
• Benefits of Multi-Layer Meta-Web:
• Multi-dimensional Web info summary analysis
• Approximate and intelligent query answering
• Web high-level query answering (WebSQL, WebML)
• Web content and structure mining
• Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
• Benefits even if it is partially constructed
• Benefits may justify the cost of tool development,
standardization and partial restructuring
Web Search Products and Services
• AltaVista
• DB2 Text Extender
• Excite
• Fulcrum
• Glimpse (Academic)
• Google
• Infoseek Internet
• Infoseek Intranet
• Inktomi (HotBot)
• Lycos
• PLS
• SMART (Academic)
• Oracle text extender
• Verity
• Yahoo!
Web Usage Mining
• Typical problems:
  • Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
• Usage mining often uses some background or domain knowledge
  • E.g., site topology, Web content, etc.
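A simple way to split a user's requests into server sessions is a time-out heuristic: a gap longer than some threshold starts a new session. The sketch below assumes a 30-minute threshold and a hypothetical `user_key` (for example an IP address combined with a user agent); both are assumptions, and as noted above, proxies and caching mean such a key does not reliably identify one person.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not from the slides

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, path) requests into per-user sessions.

    Returns {user_key: [[path, ...], ...]}, one inner list per session.
    """
    by_user = defaultdict(list)
    for user, timestamp, path in requests:
        by_user[user].append((timestamp, path))

    sessions = {}
    for user, hits in by_user.items():
        hits.sort()  # chronological order
        user_sessions = [[hits[0][1]]]
        for (previous, _), (current, path) in zip(hits, hits[1:]):
            if current - previous > timeout:
                user_sessions.append([])  # long gap: start a new session
            user_sessions[-1].append(path)
        sessions[user] = user_sessions
    return sessions

# Hypothetical requests from two users.
t0 = datetime(2024, 1, 1, 9, 0)
requests = [
    ("u1", t0, "/"),
    ("u1", t0 + timedelta(minutes=5), "/products"),
    ("u1", t0 + timedelta(hours=2), "/"),
    ("u2", t0 + timedelta(minutes=1), "/about"),
]
sessions = sessionize(requests)
print(sessions)
```

Site topology can refine the heuristic, for instance by starting a new session when a requested page is not reachable by a link from any page already in the current one.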
Web Usage Mining
• Applications:
• Two main categories:
  • Learning a user profile (personalized)
    Web users would be interested in techniques that learn their needs and preferences automatically
  • Learning user navigation patterns (impersonalized)
    Information providers would be interested in techniques that improve the effectiveness of their Web site
References
• www.cs.jyu.fi/ai/vagan/Web_Mining.ppt
• www.infolab.stanford.edu/~ullman/mining/webMiningOverview.ppt
• www.psl.cs.columbia.edu/classes/.../Presentation_Jagriti_Mishra.ppt
Thank You
