7. Application layer protocol
HTTP is the application-layer protocol used to exchange or transfer hypertext
HTTP documentation (RFC 2616): http://www.w3.org/Protocols/rfc2616/rfc2616.html
sequences of characters
12. Using regular expressions (RE) to parse HTTP response headers
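A minimal sketch of the idea in Python, assuming the response head has already been read into a string. The sample response, the two patterns, and the `parse_response_head` helper are illustrative assumptions, not the code used in the talk.

```python
import re

# Hypothetical raw response head, as it might arrive from a socket.
RAW_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 1234\r\n"
    "\r\n"
)

# Status line: HTTP version, three-digit status code, optional reason phrase.
STATUS_LINE = re.compile(r"HTTP/(\d+\.\d+) (\d{3}) ?([^\r\n]*)")
# Header line: a field name (token characters), a colon, then the field value.
HEADER_LINE = re.compile(r"([!#$%&'*+.^_`|~0-9A-Za-z-]+):[ \t]*([^\r\n]*)")

def parse_response_head(raw):
    """Return (version, status_code, reason, headers) from a raw response head."""
    lines = raw.split("\r\n")
    status = STATUS_LINE.fullmatch(lines[0])
    if status is None:
        raise ValueError("malformed status line")
    headers = {}
    for line in lines[1:]:
        if line == "":  # an empty line ends the header section
            break
        match = HEADER_LINE.fullmatch(line)
        if match:
            # Field names are case-insensitive, so store them in lower case.
            headers[match.group(1).lower()] = match.group(2).rstrip()
    return status.group(1), int(status.group(2)), status.group(3), headers

version, code, reason, headers = parse_response_head(RAW_HEAD)
```

With the headers in a dictionary, values such as `headers["content-length"]` or `headers["content-type"]` tell the client how many bytes of body to read and how to decode them.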
13. Great! Now we're able to parse the HTTP response headers… so what?
- We can properly process the response body!
Ah, I see! … And what would I do that for?
- Let me show you!
16. Internet bots (web robots, WWW robots or
bots) are software applications that run
automated tasks over the Internet
A Web crawler is an Internet bot that
systematically browses the World Wide Web,
typically for the purpose of Web indexing
Web scraping is a computer software technique
of extracting information from websites
17. A Web crawler starts with a list of URLs to visit. As the
crawler visits these URLs, it identifies all
the hyperlinks in each page and adds them to the list of
URLs to visit
hyperlinks0 = getAllLexemes(rsp.Body, "href=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
hyperlinks1 = getAllLexemes(rsp.Body, "src=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
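The crawl loop described above can be sketched in Python as follows. To keep the example self-contained, a hypothetical in-memory dictionary (`FAKE_WEB`) stands in for real HTTP requests, and `fetch`, `crawl` and `LINK_RE` are illustrative names rather than the helpers used in the talk. The pattern is a simplified version of the two expressions above.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Matches href="..." or src="..." and captures the quoted target.
LINK_RE = re.compile(r'(?:href|src)="([^"]*)"')

# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <img src="http://example.com/logo.png">',
    "http://example.com/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.com/b.html": "no links here",
}

def fetch(url):
    """Return the page body, or an empty string for unknown URLs."""
    return FAKE_WEB.get(url, "")

def crawl(seeds, max_pages=100):
    """Visit URLs breadth-first, queueing every newly discovered link once."""
    frontier = deque(seeds)  # the list of URLs still to visit
    seen = set(seeds)        # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for target in LINK_RE.findall(fetch(url)):
            absolute = urljoin(url, target)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited

order = crawl(["http://example.com/"])
```

The `seen` set keeps the crawler from looping on pages that link back to each other, and `max_pages` bounds the crawl. A real crawler would also need to respect robots.txt and rate limits.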
18. Web Scraping: A simple yet powerful approach to
extracting information from web pages can be based on the
regular-expression matching facilities of programming
languages (for instance C++, Perl or Python)
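As a rough illustration of regex-based scraping in Python, the sketch below pulls a page title and a list of name/price pairs out of a hypothetical HTML snippet. The markup, the class names, and the `scrape` helper are assumptions made for this example only.

```python
import re

# Hypothetical page body, e.g. the response body of an HTTP request.
PAGE = """<html><head><title>Book shop</title></head>
<body>
<li class="item"><span class="name">Dune</span> <span class="price">9.99</span></li>
<li class="item"><span class="name">Emma</span> <span class="price">4.50</span></li>
</body></html>"""

# Non-greedy match so the capture stops at the first closing tag.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# One item: a name span followed by a price span.
ITEM_RE = re.compile(
    r'<span class="name">([^<]*)</span>\s*<span class="price">([0-9.]+)</span>'
)

def scrape(page):
    """Return the page title and a list of (name, price) tuples."""
    title_match = TITLE_RE.search(page)
    title = title_match.group(1).strip() if title_match else None
    items = [(name, float(price)) for name, price in ITEM_RE.findall(page)]
    return title, items

title, items = scrape(PAGE)
```

Patterns like these are tied to the exact markup of one site and break when the layout changes, so they suit quick extraction jobs better than general HTML parsing.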