regular expressions and the world wide web

Sergio Burdisso

Importance of regular expressions on the web

REGULAR EXPRESSIONS, EXTRAORDINARY POWER
UNSL
2013
Burdisso Sergio - sergio.burdisso@gmail.com

 I have 20 minutes to cover everything about using regular expressions (REs) on the Web (W3)

 HTTP
 Internet bots
 Web Crawler
 Web Scraping

HyperText Transfer Protocol (HTTP)
[Diagram: a Web Browser and the WWW (The Web) exchanging a Request and a Response, both over HTTP]

 Application layer protocol
 HTTP is the protocol used to exchange or transfer hypertext
 HTTP messages are sequences of characters
 HTTP documentation: http://www.w3.org/Protocols/rfc2616/rfc2616.html

 First Things First…
Regular Expressions Are Awesome!
 Gather text
 Replace / Transform text
 Search / Validate text
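
The three uses listed above can be sketched in a few lines. This is only an illustration in Python's `re` module; the sample text, the simplified e-mail pattern and the helper name `is_phone` are assumptions made for the example, not part of the talk.

```python
import re

# Sample input, invented for this sketch.
text = "Contact: alice@example.com, bob@example.org. Phone: 555-0100."

# Gather: collect every substring that looks like an e-mail address.
# (Deliberately simplified pattern, not a full RFC-compliant one.)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
emails = EMAIL.findall(text)

# Replace / transform: mask every digit in the text.
masked = re.sub(r"\d", "#", text)

# Search / validate: does the WHOLE string have the shape ddd-dddd?
def is_phone(s):
    return re.fullmatch(r"\d{3}-\d{4}", s) is not None
```

Note the difference between gathering (`findall` scans for matches anywhere) and validating (`fullmatch` requires the entire string to fit the pattern).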

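 POSIX regular expressions (standard)
▪ ^ . [ ] [^ ] ( ) * {m,n} ? + | $
 regex.h
 pattern = "([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})" // POSIX has no \d, so [0-9] is used; compile with REG_EXTENDED
 regcomp(regex_t *regex, pattern, cflags);
 regex.re_nsub = 4 // number of parenthesized subexpressions
 regexec(regex, text, nmatch, pmatch[], eflags);
 pmatch[1..re_nsub].rm_so, pmatch[1..re_nsub].rm_eo // start and end offsets of each subexpression; each octet value must be <= 255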
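
The same idea can be tried out quickly in Python. This is a minimal sketch, not the C code from the slide: `parse_ipv4` is a hypothetical helper name, `Pattern.groups` plays the role of `re_nsub`, and `Match.span(i)` corresponds to `pmatch[i].rm_so` / `pmatch[i].rm_eo`.

```python
import re

# Four parenthesized subexpressions, one per octet (IPV4.groups == 4).
IPV4 = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")

def parse_ipv4(text):
    """Return the four octets as ints, or None if text is not a valid address."""
    m = IPV4.fullmatch(text)
    if m is None:
        return None
    octets = [int(g) for g in m.groups()]
    # The regex only checks the shape; the numeric range is checked separately.
    if any(o > 255 for o in octets):
        return None
    return octets
```

The range check is kept outside the pattern on purpose: a regex that only accepts 0 to 255 is possible but much harder to read.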

 Using REs to parse HTTP response headers
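
A rough sketch of what such parsing could look like, assuming the whole response is already available as one string. The names `parse_response`, `STATUS_LINE` and `HEADER_LINE` are invented for this example, and the patterns are simplified (no folded headers, no repeated header names).

```python
import re

# Status line, e.g. "HTTP/1.1 200 OK": version digits, status code, reason phrase.
STATUS_LINE = re.compile(r"HTTP/(\d)\.(\d) (\d{3}) ?([^\r\n]*)")
# One "Name: value" header per line; surrounding blanks and the trailing CR are dropped.
HEADER_LINE = re.compile(r"^([\w-]+):[ \t]*(.*?)[ \t]*\r?$", re.MULTILINE)

def parse_response(raw):
    """Split a raw HTTP response into (status code, reason, headers dict, body)."""
    # The header block ends at the first empty line (CRLF CRLF).
    head, _, body = raw.partition("\r\n\r\n")
    status = STATUS_LINE.match(head)
    if status is None:
        raise ValueError("not an HTTP response")
    headers = {name.lower(): value for name, value in HEADER_LINE.findall(head)}
    return int(status.group(3)), status.group(4), headers, body

# Sample response, invented for this sketch.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)
code, reason, headers, body = parse_response(sample)
```

Header names are lower-cased because HTTP header names are case-insensitive.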

Great! Now we’re able to parse the HTTP response headers… so what?
- We can properly process the response body!
Ah, I see! … And what would I do that for?
- Let me show you!

Regular Expressions cartoon from xkcd
Web Scraping
(we will see!)

 Internet bots (web robots, WWW robots or bots) are software applications that run automated tasks over the Internet
 A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing
 Web scraping is a computer software technique of extracting information from websites

 A Web crawler starts with a list of URLs to visit. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit
hyperlinks0 = getAllLexemes(rsp.Body, "href=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
hyperlinks1 = getAllLexemes(rsp.Body, "src=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
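
The visit-and-collect loop described above can be sketched as follows. This is an illustration only: `extract_links` and `crawl` are hypothetical names, `fetch` stands for any function that returns a page body for a URL (it is passed in so the sketch runs without network access), and the pattern is a simplified variant of the ones on the slide.

```python
import re
from urllib.parse import urljoin

# Captures the quoted value of href="..." or src="..." attributes.
LINK = re.compile(r'(?:href|src)="([^"\r\n]*)"', re.IGNORECASE)

def extract_links(base_url, html):
    """Return absolute URLs found in href/src attributes, in order, without duplicates."""
    seen = []
    for target in LINK.findall(html):
        absolute = urljoin(base_url, target)  # resolves /path, //host/path, relative paths
        if absolute not in seen:
            seen.append(absolute)
    return seen

def crawl(seed_urls, fetch, limit=10):
    """Visit pages breadth-first; fetch is an assumed callable: URL -> page body (str)."""
    to_visit = list(seed_urls)
    visited = []
    while to_visit and len(visited) < limit:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.append(url)
        # Every new hyperlink found on the page joins the list of URLs to visit.
        for link in extract_links(url, fetch(url)):
            if link not in visited and link not in to_visit:
                to_visit.append(link)
    return visited
```

A real crawler would also respect robots.txt and rate limits; both are left out here to keep the sketch short.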

 Web Scraping: A simple yet powerful approach to
extract information from web pages can be based on
regular expression matching facilities of programming
languages (for instance C++, Perl or Python)

Regular Expressions cartoon from xkcd
WebScraping wScraping(8, "http://emails.com/victim");
wScraping.findAll(
"^(?n:(?<address1>(\d{1,5}( 1/[234])?(\x20[A-Z]([a-z])+)+ )|(P\.O\. Box \d{1,5}))\s{1,2}(?i:(?<address2>(((APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)\x20\w{1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)\.?)\s{1,2})?)(?<city>[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2}),\x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\x20(?<zipcode>(?!0{5})\d{5}(-\d{4})?))$"
);
We’ve saved the day!
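
Named groups are what make a pattern like this useful for scraping: each match comes back as labelled fields. Below is a much smaller sketch of the same technique, covering only city, state and ZIP code. The pattern, the sample text and the name `scrape_addresses` are assumptions for illustration. Python writes named groups as `(?P<name>...)` rather than `(?<name>...)`.

```python
import re

# Simplified stand-in for the full address pattern: "City, ST 12345[-6789]".
CITY_STATE_ZIP = re.compile(
    r"(?P<city>[A-Z][a-z]+(?: [A-Z][a-z]+){0,2}), "
    r"(?P<state>[A-Z]{2}) "
    r"(?P<zipcode>(?!0{5})\d{5}(?:-\d{4})?)"  # lookahead rejects the ZIP 00000
)

def scrape_addresses(page_text):
    """Return one dict per match, keyed by group name."""
    return [m.groupdict() for m in CITY_STATE_ZIP.finditer(page_text)]

# Sample page text, invented for this sketch.
page = (
    "<p>Office: 12 Main St, San Luis, CA 94105-1234</p>"
    "<p>Warehouse: Austin, TX 00000</p>"
)
found = scrape_addresses(page)
```

Unlike the pattern on the slide, this one accepts any two capital letters as a state; the slide's alternation of real state codes is stricter.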

Everybody stand back!
We know regular expressions
The end
Thank you for your patience!
