Web scraping is the extraction of data from websites. It is typically done by a program that automates visiting pages, parsing their HTML, and pulling out the desired data. Common uses include collecting product prices, tracking competitor activity, and compiling market research reports.
There are a variety of tools and techniques that can be used for web scraping. Some popular tools include Python libraries such as BeautifulSoup and Scrapy, as well as online services such as ScraperAPI and Octoparse. The specific tool or technique that is best for a particular task will depend on the complexity of the data that needs to be extracted and the frequency with which it needs to be updated.
Web scraping should be used responsibly and ethically: follow the terms of service of the websites being scraped, avoid overloading them with requests, and credit the original source of the data. It is a powerful tool, but it should never harm the websites it reads from.
3. Topic
HTTP Programming
• Methods
• Cookie
• Session
HTTP Tools
• Chrome Developer Tools
• Postman
HTTP access with PHP
• fopen()
• file_get_contents()
• cURL
HTML access with DOM
PHP I/O Stream
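A minimal sketch of the stream-based access routes listed above (fopen() and file_get_contents()), plus a PHP I/O stream. It assumes a `data:` URI as a stand-in for a real URL so that it runs without network access; the HTML payload and the User-Agent string are made up for illustration.

```php
<?php
// Assumption: a data: URI replaces a real address such as "https://example.com/"
// so this sketch runs offline. The payload is a made-up HTML page.
$url = 'data://text/plain;base64,' . base64_encode('<html><body><h1>Hello</h1></body></html>');

// A stream context carries HTTP options (method, headers, timeout).
// The "http" options only take effect for http:// and https:// URLs.
$context = stream_context_create([
    'http' => [
        'method'  => 'GET',
        'header'  => "User-Agent: ExampleScraper/0.1\r\n", // hypothetical agent name
        'timeout' => 10,
    ],
]);

// 1) file_get_contents(): read the whole resource into a string in one call.
$viaFileGetContents = file_get_contents($url, false, $context);

// 2) fopen(): open a stream and read it chunk by chunk.
$handle = fopen($url, 'r', false, $context);
$viaFopen = '';
while (!feof($handle)) {
    $viaFopen .= fread($handle, 8192);
}
fclose($handle);

// 3) PHP I/O stream: php://memory is a temporary read/write buffer.
$memory = fopen('php://memory', 'w+');
fwrite($memory, $viaFopen);
rewind($memory);
$copy = stream_get_contents($memory);
fclose($memory);

var_dump($viaFileGetContents === $viaFopen, $copy === $viaFopen);
```

The third route, cURL (curl_init(), curl_setopt(), curl_exec()), needs the curl extension and a reachable server, so it is left out of this offline sketch.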
4. • Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).
https://en.wikipedia.org/wiki/Data_extraction
5. Techniques
• HTTP programming
• DOM parsing
• Text pattern matching (regular expressions)
• Etc.
https://en.wikipedia.org/wiki/Web_scraping
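As an illustration of the text pattern matching technique, the sketch below pulls prices out of an HTML fragment with a regular expression. The markup, the class name `price`, and the product data are hypothetical; in practice the string would come from an HTTP request.

```php
<?php
// Hypothetical HTML fragment (assumption for illustration only).
$html = <<<'HTML'
<ul>
  <li class="item">Keyboard <span class="price">$49.90</span></li>
  <li class="item">Mouse <span class="price">$19.50</span></li>
</ul>
HTML;

// Capture the number inside every <span class="price">$...</span>.
// $matches[0] holds the full matches, $matches[1] the first capture group.
preg_match_all('/<span class="price">\$([0-9]+\.[0-9]{2})<\/span>/', $html, $matches);

// Convert the captured strings to floats.
$prices = array_map('floatval', $matches[1]);

print_r($prices);
```

Regular expressions work well for small, regular fragments like this one. For nested or irregular markup, DOM parsing is usually the more robust choice.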
26. PHP DOM
CLASS TYPE
• DOMDocument: represents an entire HTML or XML document
• DOMNodeList: represents an ordered list of nodes
• DOMNode: a single node in the document; each item of a DOMNodeList is a DOMNode
• DOMElement: extends DOMNode and represents an element
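The four classes can be seen together in a short sketch. The HTML string is a made-up example, and the code assumes PHP's DOM extension is enabled, which it is in most default builds.

```php
<?php
// Made-up HTML for illustration.
$html = '<html><body><p id="intro">Hello</p><p>World</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// DOMDocument: the whole parsed document.
$doc = new DOMDocument();
$doc->loadHTML($html);

// DOMNodeList: the ordered list returned by getElementsByTagName().
$paragraphs = $doc->getElementsByTagName('p');

$texts = [];
foreach ($paragraphs as $node) {
    // Each item is a DOMElement, which extends DOMNode.
    $texts[] = $node->textContent;
}

// item() gives positional access to the list; getAttribute() is a DOMElement method.
$first = $paragraphs->item(0);
$id = $first->getAttribute('id');

echo implode(', ', $texts), "\n";
echo $id, "\n";
```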
27. DOMXPath
Expression   Description
nodename     Selects all nodes with the name "nodename"
/            Selects from the root node
//           Selects nodes in the document from the current node that match the selection, no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
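The expressions in this table can be tried out with DOMXPath::query(). This is a minimal sketch on a made-up HTML snippet; the element names and the `id` value are assumptions for illustration. The relative expressions `.` and `..` are evaluated against an explicitly passed context node.

```php
<?php
// Made-up HTML for illustration.
$html = '<html><body><div id="menu"><a href="/home">Home</a><a href="/about">About</a></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "/" starts at the root; the root element of a parsed HTML document is <html>.
$root = $xpath->query('/html')->item(0);

// "//" followed by a node name: every <a> element, wherever it is.
$links = $xpath->query('//a');

// "@" selects attributes: the href attribute of every link.
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

// "." and ".." are relative to the context node given as second argument.
$firstLink = $links->item(0);
$self   = $xpath->query('.', $firstLink)->item(0);  // the link itself
$parent = $xpath->query('..', $firstLink)->item(0); // its parent <div>

echo $root->nodeName, "\n";
echo implode(', ', $hrefs), "\n";
echo $parent->getAttribute('id'), "\n";
```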
28. DOMXPath
Path Expression       Description
bookstore             Selects all nodes with the name "bookstore"
/bookstore            Selects the root element bookstore
bookstore/book        Selects all book elements that are children of bookstore
//book                Selects all book elements, no matter where they are in the document
bookstore//book       Selects all book elements that are descendants of the bookstore element, no matter where they are under it
//@lang               Selects all attributes that are named lang
/bookstore/book[1]    Selects the first book element that is a child of the bookstore element
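The path expressions above can be checked against a small bookstore document. The XML below is invented for illustration (titles, prices, and the nested `section` element are assumptions). The relative paths are evaluated with the document itself passed as the context node, so that `bookstore` resolves from the top of the tree.

```php
<?php
// Invented sample document; one book is nested deeper to show "/" versus "//".
$xml = <<<'XML'
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
  <section>
    <book><title lang="fr">Le Petit Prince</title><price>12.50</price></book>
  </section>
</bookstore>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Absolute paths start at the root of the document.
$rootCount     = $xpath->query('/bookstore')->length; // 1
$allBooks      = $xpath->query('//book')->length;     // 3: includes the nested book
$langAttrCount = $xpath->query('//@lang')->length;    // 3

// Relative paths, with the document node passed explicitly as context.
$childBooks      = $xpath->query('bookstore/book', $doc)->length;  // 2: direct children only
$descendantBooks = $xpath->query('bookstore//book', $doc)->length; // 3: any depth

// Predicates are 1-based: book[1] is the first book child of bookstore.
$firstTitle = $xpath->query('/bookstore/book[1]/title')->item(0)->textContent;

// Collect the value of every lang attribute, in document order.
$langs = [];
foreach ($xpath->query('//@lang') as $attr) {
    $langs[] = $attr->nodeValue;
}

echo "$childBooks direct, $allBooks total, first: $firstTitle\n";
```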