Web scraping with php

Web scraping is the process of extracting data from websites. It is typically done using a computer program that automates the process of visiting websites, parsing the HTML code, and extracting the desired data. Web scraping can be used for a variety of purposes, such as collecting product prices, tracking competitor activity, or creating market research reports.

There are a variety of tools and techniques that can be used for web scraping. Some popular tools include Python libraries such as BeautifulSoup and Scrapy, as well as online services such as ScraperAPI and Octoparse. The specific tool or technique that is best for a particular task will depend on the complexity of the data that needs to be extracted and the frequency with which it needs to be updated.

It is important to use web scraping responsibly and ethically. This means following the terms of service of the websites that are being scraped, not overloading the websites with requests, and giving credit to the original source of the data. Web scraping can be a powerful tool, but it is important to use it in a way that does not harm the websites that are being scraped.

Topics
HTTP Programming
• Methods
• Cookie
• Session
HTTP Tools
• Chrome Developer Tools
• Postman
HTTP access with PHP
• fopen()
• file_get_contents()
• cURL
HTML access with DOM
PHP I/O Stream

• Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).
https://en.wikipedia.org/wiki/Data_extraction

Techniques
• HTTP programming
• DOM parsing
• Text pattern matching (regular expressions)
• Etc.
https://en.wikipedia.org/wiki/Web_scraping
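
As an illustration of the text pattern matching technique listed above, the following is a minimal sketch that pulls link targets and link texts out of an HTML fragment with a regular expression. The HTML string and the variable names are made up for the example. Regular expressions tend to break on real-world markup, so this suits only small, predictable fragments.

```php
<?php
// Hypothetical HTML fragment; in practice it would come from an HTTP response.
$html = '<ul><li><a href="/page/1">First</a></li><li><a href="/page/2">Second</a></li></ul>';

// Capture the href value (group 1) and the inner content (group 2) of each anchor.
// The "i" flag ignores case, the "s" flag lets "." match newlines.
$pattern = '/<a\s+href="([^"]*)"[^>]*>(.*?)<\/a>/is';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $match) {
    $links[] = [
        'href' => $match[1],
        'text' => trim(strip_tags($match[2])),
    ];
}

print_r($links);
```

For anything beyond a fixed pattern like this, DOM parsing is usually the more robust of the techniques listed.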

HTTP Programming
• Methods
  • GET
  • POST
• Cookie
• Session

HTTP Request & Response

Request:
GET /index.html HTTP/1.1
Host: www.example.com

Response:
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 138
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
<head>
<title>An Example Page</title>
</head>
<body> Hello World, this is a very simple HTML document. </body>
</html>
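
To make the structure of a response concrete, here is a minimal sketch that splits a raw HTTP/1.x response into its status line, headers, and body. The function name parseHttpResponse() is hypothetical and the sample response is shortened for the example. The sketch does not handle chunked transfer encoding or repeated header names.

```php
<?php
/**
 * Split a raw HTTP/1.x response into status line, headers, and body.
 * Illustrative helper (hypothetical name), not a complete HTTP parser.
 */
function parseHttpResponse(string $raw): array
{
    // The header block and the body are separated by the first empty line.
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');

    $lines = explode("\r\n", $head);
    $statusLine = array_shift($lines); // e.g. "HTTP/1.1 200 OK"
    [$version, $code, $reason] = array_pad(explode(' ', $statusLine, 3), 3, '');

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so store them in lower case.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'version' => $version,
        'status'  => (int) $code,
        'reason'  => $reason,
        'headers' => $headers,
        'body'    => $body,
    ];
}

// Shortened sample response, assumed for the example.
$raw = "HTTP/1.1 200 OK\r\n"
     . "Content-Type: text/html; charset=UTF-8\r\n"
     . "Content-Length: 13\r\n"
     . "Connection: close\r\n"
     . "\r\n"
     . "<html></html>";

$response = parseHttpResponse($raw);
echo $response['status'], ' ', $response['headers']['content-type'], PHP_EOL;
```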

Hands-On #1

Request:
GET / HTTP/1.1
Host: testing-ground.scraping.pro

Response headers:
HTTP/1.1 200 OK
Date: Fri, 18 May 2018 05:17:36 GMT
Server: Apache/2.2.22 (Debian)
X-Powered-By: PHP/5.4.4-14+deb7u12
Vary: Accept-Encoding
Content-Length: 3701
Content-Type: text/html
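
The same exchange can be reproduced from PHP by writing the request text to a plain TCP socket, much like typing it into telnet. This is a minimal sketch: buildGetRequest() and rawHttpGet() are hypothetical helper names, and the "Connection: close" header is an addition so that the response can be read until the server closes the connection.

```php
<?php
// Build the raw text of a minimal HTTP/1.1 GET request.
function buildGetRequest(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n" // assumption: ask the server to close after responding
         . "\r\n";                 // an empty line ends the header block
}

// Send the request over a plain TCP socket and return the raw response text.
function rawHttpGet(string $host, string $path = '/', int $port = 80, float $timeout = 10.0): string
{
    $socket = @fsockopen($host, $port, $errorCode, $errorMessage, $timeout);
    if ($socket === false) {
        throw new RuntimeException("Connection failed: {$errorMessage} ({$errorCode})");
    }

    fwrite($socket, buildGetRequest($host, $path));
    $response = stream_get_contents($socket); // read until the server closes
    fclose($socket);

    return $response === false ? '' : $response;
}

// Usage (performs a real network request, so it is left commented out):
// echo rawHttpGet('testing-ground.scraping.pro');
```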

HTTP Component
https://www.gammon.com.au/forum/?id=12942

Cookie & Session
http://www.hackingarticles.in/beginner-guide-understand-cookies-session-management/

http://testing-ground.scraping.pro/login
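
A login page like this one typically sets a session cookie that must be sent back on later requests. The sketch below shows the idea in two parts, under stated assumptions: cookieHeader() is a hypothetical helper that turns Set-Cookie values into a Cookie request header, and loginAndFetch() is a hypothetical cURL wrapper that keeps cookies between a POST and a follow-up GET. The form field names and URLs passed to it are placeholders, not the real ones for any site.

```php
<?php
// Turn the values of Set-Cookie response headers into one Cookie request header value.
function cookieHeader(array $setCookieValues): string
{
    $cookies = [];
    foreach ($setCookieValues as $value) {
        // Only the leading "name=value" pair is sent back; attributes such as
        // Path, Expires, or HttpOnly stay on the client.
        $pair = trim(explode(';', $value, 2)[0]);
        if (strpos($pair, '=') === false) {
            continue;
        }
        [$name, $cookieValue] = explode('=', $pair, 2);
        $cookies[trim($name)] = trim($cookieValue);
    }

    $parts = [];
    foreach ($cookies as $name => $cookieValue) {
        $parts[] = "{$name}={$cookieValue}";
    }
    return implode('; ', $parts);
}

// Log in with a POST, then fetch a second page on the same cURL handle so the
// session cookie from the first response is sent with the second request.
function loginAndFetch(string $loginUrl, array $formFields, string $pageUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '', // empty string enables in-memory cookie handling
    ]);

    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($formFields));
    curl_exec($ch);

    curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back to GET
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    $html = curl_exec($ch);

    return $html === false ? '' : $html;
}

echo cookieHeader(['PHPSESSID=abc123; path=/', 'theme=dark; HttpOnly']), PHP_EOL;

// Usage with placeholder field names and URLs (left commented out, needs network):
// $html = loginAndFetch($loginUrl, ['username' => '...', 'password' => '...'], $pageUrl);
```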

How to get HTML content with PHP
• fopen()
• file_get_contents()
• cURL
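
The three approaches above can be sketched side by side as follows. The function names, the timeout values, and the "ExampleScraper/1.0" user agent are assumptions made for the example. fopen() and file_get_contents() can only read URLs when allow_url_fopen is enabled, and the cURL variant needs the curl extension.

```php
<?php
// 1. fopen(): open the URL (or file) as a stream and read it in chunks.
function fetchWithFopen(string $url): string
{
    $handle = fopen($url, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$url}");
    }

    $content = '';
    while (!feof($handle)) {
        $chunk = fread($handle, 8192);
        if ($chunk === false) {
            break;
        }
        $content .= $chunk;
    }
    fclose($handle);

    return $content;
}

// 2. file_get_contents(): read the whole resource in one call.
function fetchWithFileGetContents(string $url): string
{
    // The "http" options only apply when $url is an http(s) URL.
    $context = stream_context_create([
        'http' => ['timeout' => 10, 'user_agent' => 'ExampleScraper/1.0'],
    ]);

    $content = file_get_contents($url, false, $context);
    if ($content === false) {
        throw new RuntimeException("Cannot read {$url}");
    }
    return $content;
}

// 3. cURL: more control over redirects, timeouts, headers, and cookies.
function fetchWithCurl(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
    ]);

    $content = curl_exec($ch);
    if ($content === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $content;
}

// Usage (network access required, so left commented out):
// echo fetchWithCurl('http://testing-ground.scraping.pro/');
```

As a rough guide, file_get_contents() is the shortest option for a simple GET, while cURL is the usual choice once cookies, POST data, or custom headers are involved.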

HTML DOM Access
• getElementById
• getElementsByTagName
• DOMXPath
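
A minimal sketch of the three access methods listed above, run against a small made-up HTML string (the markup and variable names are assumptions for the example):

```php
<?php
// Hypothetical markup; in practice this would be the downloaded page.
$html = '<html><body><div id="main"><p>First</p><p>Second</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

// getElementById: one element, or null when the id does not exist.
$main = $doc->getElementById('main');

// getElementsByTagName: a DOMNodeList of every matching element.
$texts = [];
foreach ($doc->getElementsByTagName('p') as $paragraph) {
    $texts[] = $paragraph->textContent;
}

// DOMXPath: query by path, here the second <p> inside the div with id "main".
$xpath = new DOMXPath($doc);
$second = $xpath->query('//div[@id="main"]/p[2]')->item(0);

echo $main->tagName, PHP_EOL;
echo implode(', ', $texts), PHP_EOL;
echo $second->textContent, PHP_EOL;
```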

PHP DOM Class Types
• DOMDocument
  • Represents an entire HTML or XML document
• DOMNodeList
  • Represents an ordered list of nodes
• DOMNode
  • A single node; each item in a DOMNodeList is a DOMNode
• DOMElement
  • Extends DOMNode and represents an element
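
How these classes relate can be checked directly with instanceof. The following sketch uses a made-up two-link document:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><a href="/a">A</a><a href="/b">B</a></body></html>');

$list  = $doc->getElementsByTagName('a'); // DOMNodeList
$first = $list->item(0);                  // a DOMNode that is also a DOMElement

$hrefs = [];
foreach ($list as $node) {
    // getAttribute() is defined on DOMElement, not on the DOMNode base class.
    if ($node instanceof DOMElement) {
        $hrefs[] = $node->getAttribute('href');
    }
}

var_dump($list instanceof DOMNodeList);
var_dump($first instanceof DOMNode);
var_dump($first instanceof DOMElement);
var_dump($list->length);
print_r($hrefs);
```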

DOMXPath
Expression   Description
nodename     Selects all nodes with the name "nodename"
/            Selects from the root node
//           Selects nodes in the document from the current node that match the selection, no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes

DOMXPath
Path Expression     Description
bookstore           Selects all nodes with the name "bookstore"
/bookstore          Selects from the root node
bookstore/book      Selects all book elements that are children of bookstore
//book              Selects all book elements, no matter where they are in the document
bookstore//book     Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element
//@lang             Selects all attributes that are named lang
/bookstore/book[1]  Selects the first book element that is the child of the bookstore element
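
Several of the path expressions above can be tried with DOMXPath against a small bookstore document. The XML below is sample data made up for this sketch:

```php
<?php
// Made-up sample data matching the element names used in the table.
$xml = <<<XML
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// //book: every book element, wherever it is in the document.
$allBooks = $xpath->query('//book');

// bookstore//book, evaluated relative to the document node.
$descendantBooks = $xpath->query('bookstore//book');

// //@lang: every attribute named lang.
$langAttributes = $xpath->query('//@lang');

// /bookstore/book[1]: the first book child of bookstore (XPath positions start at 1).
$firstBook  = $xpath->query('/bookstore/book[1]')->item(0);
$firstTitle = $xpath->evaluate('string(title)', $firstBook); // relative to $firstBook

echo $allBooks->length, ' books, first title: ', $firstTitle, PHP_EOL;
```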

Web scraping with php

1. Web data extraction with PHP, Chakrit Phain, Softnix Technology Co., Ltd.

2. goo.gl/ytLxtS

8. Hands-On #1: HTTP with telnet

12. Tools

13. Hands-On #2: HTTP cookie

16. fopen()

17. Hands-On #3: fopen()

18. file_get_contents()

19. Hands-On #4: file_get_contents()

20. cURL

21. Hands-On #5: cURL

22. Document Object Model (DOM)

24. DOMDocument::getElementById

25. DOMDocument::getElementsByTagName

29. DOM Example

30. DOM Example //article/div[2]/div[7]//a

31. Hands-On #6: IMDb

32. PHP I/O Stream
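
The slide gives only the title, so the following is a generic sketch of PHP's built-in I/O stream wrappers rather than the deck's own example. It writes to an in-memory stream (php://memory), reads the data back line by line, and prints the result through php://output. The sample lines are made up.

```php
<?php
// php://memory behaves like a file but lives entirely in RAM.
$buffer = fopen('php://memory', 'w+');
fwrite($buffer, "first line\nsecond line\n");

// Move the pointer back to the start before reading.
rewind($buffer);

$lines = [];
while (($line = fgets($buffer)) !== false) {
    $lines[] = rtrim($line, "\n");
}
fclose($buffer);

// php://output writes to the same place as echo and print.
$out = fopen('php://output', 'w');
fwrite($out, implode(' | ', $lines) . PHP_EOL);
fclose($out);
```

The same file functions (fopen, fread, fwrite, fgets) also work on http:// URLs and local files, which is why fopen() can be used to download a page.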