SlideShare a Scribd company logo
1 of 20
REGULAR EXPRESSIONS, EXTRAORDINARY POWER
UNSL
2013
Burdisso Sergio - sergio.burdisso@gmail.com
 I have 20 min to cover all about using REs on
theW3
 HTTP
 Internet bots
 Web Crawler
 Web Scraping
HyperText Transfer Protocol
WWW (The Web)
Web Browser
Request
Response
HTTP
HTTP
 Application layer protocol
 HTTP is the protocol to exchange or transfer hypertext
Http documentation: http://www.w3.org/Protocols/rfc2616/rfc2616.html
sequences of characters
 HTTP Response example
Header
Body
EXTRAORDINARY POWER
 FirstThings First…
Regular ExpressionsAre
Awesome!
 Gather text
 Replace /Transform text
 Search /Validate text
 POSIX regular expressions (standard)
▪ ^. [ ] [^ ] (0) * {m,n} ? +|$
 regex.h
 pattern = "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})"
 regcomp(regex_t *regex, pattern, cflags);
 regex.re_nsub = 4 //Number of parenthesized subexpressions
 regexec(regex, text, pmatch[])
 pmatch[nsub].rm_so, pmatch[nsub].rm_eo <= 255
 Making use of RE to parse HTTP responses headers
Great! Now we’re able to parse the http response headers… so what?
-We can properly process the response body!
Ah I see! … and what would I do that for?
-Let me show you!
Just like spiders on the web!
Regular Expressions cartoon from xkcd
Web Scraping
(we will see!)
 Internet bots (web robots,WWW robots or
bots) are software applications that run
automated tasks over the Internet
 A Web crawler is an Internet bot that
systematically browses theWorld Wide Web,
typically for the purpose ofWeb indexing
 Web scraping is a computer software technique
of extracting information from websites
 A Web Crawler Starts with a list of URLs to visit. As the
crawler visits these URLs, it identifies all
the hyperlinks in the page and adds them to the list of
URLs to visit
hyperlinks0 = getAllLexemes(rsp.Body, "href="((http:)?//([^/rn]*))?(/?[^"]*)"");
hyperlinks1= getAllLexemes(rsp.Body, "src="((http:)?//([^/rn]*))?(/?[^"]*)"");
 Web Scraping: A simple yet powerful approach to
extract information from web pages can be based on
regular expression matching facilities of programming
languages (for instance C++, Perl or Python)
Regular Expressions cartoon from xkcd
WebScraping wScraping (8, "http://emails.com/victim");
wScraping.findAll(
"^(?n:(?<address1>(d{1,5}( 1/[234])?(x20[A-Z]([a-z])+)+ )|(P.O. Box
d{1,5}))s{1,2}(?i:(?<address2>(((APT|B
LDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)x20w{
1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR).?)s{1,2})?)(?<
city>[A-Z]([a-z])+(.?)(x20[A-Z]([a-z])+){0,2}),
x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL
N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]
|T[NX]|UT|V[AIT]|W[AIVY])x20(?<zipcode>(?!0{5})d{5}(-d {4})?))$"
);
We’ve saved the day!
Everybody stand back!
We know regular expressions
The end
Thank you for your patience!

More Related Content

What's hot

CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourPeter Friese
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaVincent Terrasi
 
Combinators - Lightning Talk
Combinators - Lightning TalkCombinators - Lightning Talk
Combinators - Lightning TalkMike Harris
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Sammy Fung
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoSammy Fung
 
サービスリニューアルしてわかったRailsのReactとの付き合い方
サービスリニューアルしてわかったRailsのReactとの付き合い方サービスリニューアルしてわかったRailsのReactとの付き合い方
サービスリニューアルしてわかったRailsのReactとの付き合い方Haruhiko Kobayashi
 
Golang slidesaudrey
Golang slidesaudreyGolang slidesaudrey
Golang slidesaudreyAudrey Lim
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...Anton
 
01 ElasticSearch : Getting Started
01 ElasticSearch : Getting Started01 ElasticSearch : Getting Started
01 ElasticSearch : Getting StartedOpenThink Labs
 
Reactをproductionに導入して変わったこと
Reactをproductionに導入して変わったことReactをproductionに導入して変わったこと
Reactをproductionに導入して変わったことHaruhiko Kobayashi
 
Downloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyDownloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyErin Shellman
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseBruce McPherson
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with PythonPaul Schreiber
 
Do something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appDo something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appBruce McPherson
 
Document Conversion & Retrieve and Rank 一問一答
Document Conversion & Retrieve and Rank 一問一答Document Conversion & Retrieve and Rank 一問一答
Document Conversion & Retrieve and Rank 一問一答Hisashi Komine
 

What's hot (20)

CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 Hour
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and Kibana
 
Combinators - Lightning Talk
Combinators - Lightning TalkCombinators - Lightning Talk
Combinators - Lightning Talk
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
 
サービスリニューアルしてわかったRailsのReactとの付き合い方
サービスリニューアルしてわかったRailsのReactとの付き合い方サービスリニューアルしてわかったRailsのReactとの付き合い方
サービスリニューアルしてわかったRailsのReactとの付き合い方
 
CouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce ViewsCouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce Views
 
Golang slidesaudrey
Golang slidesaudreyGolang slidesaudrey
Golang slidesaudrey
 
hySON - D2Fest
hySON - D2FesthySON - D2Fest
hySON - D2Fest
 
hySON
hySONhySON
hySON
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
 
01 ElasticSearch : Getting Started
01 ElasticSearch : Getting Started01 ElasticSearch : Getting Started
01 ElasticSearch : Getting Started
 
Reactをproductionに導入して変わったこと
Reactをproductionに導入して変わったことReactをproductionに導入して変わったこと
Reactをproductionに導入して変わったこと
 
Downloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyDownloading the internet with Python + Scrapy
Downloading the internet with Python + Scrapy
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as database
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
Do something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appDo something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing app
 
Document Conversion & Retrieve and Rank 一問一答
Document Conversion & Retrieve and Rank 一問一答Document Conversion & Retrieve and Rank 一問一答
Document Conversion & Retrieve and Rank 一問一答
 
Routing @ Scuk.cz
Routing @ Scuk.czRouting @ Scuk.cz
Routing @ Scuk.cz
 

Similar to regular expressions and the world wide web

Software Analysis for the Web: Achievements and Prospects
Software Analysis for the Web: Achievements and ProspectsSoftware Analysis for the Web: Achievements and Prospects
Software Analysis for the Web: Achievements and ProspectsAli Mesbah
 
Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails SiddheshSiddhesh Bhobe
 
Top 10 Security Vulnerabilities (2006)
Top 10 Security Vulnerabilities (2006)Top 10 Security Vulnerabilities (2006)
Top 10 Security Vulnerabilities (2006)Susam Pal
 
Wss Object Model
Wss Object ModelWss Object Model
Wss Object Modelmaddinapudi
 
Writing Secure Code for WordPress
Writing Secure Code for WordPressWriting Secure Code for WordPress
Writing Secure Code for WordPressShawn Hooper
 
C#Web Sec Oct27 2010 Final
C#Web Sec Oct27 2010 FinalC#Web Sec Oct27 2010 Final
C#Web Sec Oct27 2010 FinalRich Helton
 
Securing Java EE Web Apps
Securing Java EE Web AppsSecuring Java EE Web Apps
Securing Java EE Web AppsFrank Kim
 
Extracting data from text documents using the regex
Extracting data from text documents using the regexExtracting data from text documents using the regex
Extracting data from text documents using the regexSteve Mylroie
 
Html intake 38 lect1
Html intake 38 lect1Html intake 38 lect1
Html intake 38 lect1ghkadous
 
Taking AJAX to the Next Level
Taking AJAX to the Next LevelTaking AJAX to the Next Level
Taking AJAX to the Next LevelStephen.Walther
 
Microsoft ASP.NET: Taking AJAX to the Next Level
Microsoft ASP.NET: Taking AJAX to the Next LevelMicrosoft ASP.NET: Taking AJAX to the Next Level
Microsoft ASP.NET: Taking AJAX to the Next Levelgoodfriday
 
Introduction to Web Architecture
Introduction to Web ArchitectureIntroduction to Web Architecture
Introduction to Web ArchitectureChamnap Chhorn
 
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharperGDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharpergranicz
 

Similar to regular expressions and the world wide web (20)

Senior Project Documentation.
Senior Project Documentation.Senior Project Documentation.
Senior Project Documentation.
 
Software Analysis for the Web: Achievements and Prospects
Software Analysis for the Web: Achievements and ProspectsSoftware Analysis for the Web: Achievements and Prospects
Software Analysis for the Web: Achievements and Prospects
 
Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails Siddhesh
 
Top 10 Security Vulnerabilities (2006)
Top 10 Security Vulnerabilities (2006)Top 10 Security Vulnerabilities (2006)
Top 10 Security Vulnerabilities (2006)
 
Wss Object Model
Wss Object ModelWss Object Model
Wss Object Model
 
Writing Secure Code for WordPress
Writing Secure Code for WordPressWriting Secure Code for WordPress
Writing Secure Code for WordPress
 
Web services intro.
Web services intro.Web services intro.
Web services intro.
 
Os Pruett
Os PruettOs Pruett
Os Pruett
 
C#Web Sec Oct27 2010 Final
C#Web Sec Oct27 2010 FinalC#Web Sec Oct27 2010 Final
C#Web Sec Oct27 2010 Final
 
Api
ApiApi
Api
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Mashup
MashupMashup
Mashup
 
Securing Java EE Web Apps
Securing Java EE Web AppsSecuring Java EE Web Apps
Securing Java EE Web Apps
 
Extracting data from text documents using the regex
Extracting data from text documents using the regexExtracting data from text documents using the regex
Extracting data from text documents using the regex
 
Html intake 38 lect1
Html intake 38 lect1Html intake 38 lect1
Html intake 38 lect1
 
Taking AJAX to the Next Level
Taking AJAX to the Next LevelTaking AJAX to the Next Level
Taking AJAX to the Next Level
 
Microsoft ASP.NET: Taking AJAX to the Next Level
Microsoft ASP.NET: Taking AJAX to the Next LevelMicrosoft ASP.NET: Taking AJAX to the Next Level
Microsoft ASP.NET: Taking AJAX to the Next Level
 
Introduction to Web Architecture
Introduction to Web ArchitectureIntroduction to Web Architecture
Introduction to Web Architecture
 
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharperGDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
 
Switch to Backend 2023
Switch to Backend 2023Switch to Backend 2023
Switch to Backend 2023
 

Recently uploaded

How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

regular expressions and the world wide web

  • 1. REGULAR EXPRESSIONS, EXTRAORDINARY POWER UNSL 2013 Burdisso Sergio - sergio.burdisso@gmail.com
  • 2.  I have 20 min to cover all about using REs on theW3
  • 3.  HTTP  Internet bots  Web Crawler  Web Scraping
  • 5.
  • 6. WWW (The Web) Web Browser Request Response HTTP HTTP
  • 7.  Application layer protocol  HTTP is the protocol to exchange or transfer hypertext Http documentation: http://www.w3.org/Protocols/rfc2616/rfc2616.html sequences of characters
  • 8.  HTTP Response example Header Body
  • 10.  FirstThings First… Regular ExpressionsAre Awesome!  Gather text  Replace /Transform text  Search /Validate text
  • 11.  POSIX regular expressions (standard) ▪ ^. [ ] [^ ] (0) * {m,n} ? +|$  regex.h  pattern = "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})"  regcomp(regex_t *regex, pattern, cflags);  regex.re_nsub = 4 //Number of parenthesized subexpressions  regexec(regex, text, pmatch[])  pmatch[nsub].rm_so, pmatch[nsub].rm_eo <= 255
  • 12.  Making use of RE to parse HTTP responses headers
  • 13. Great! Now we’re able to parse the http response headers… so what? -We can properly process the response body! Ah I see! … and what would I do that for? -Let me show you!
  • 14. Just like spiders on the web!
  • 15. Regular Expressions cartoon from xkcd Web Scraping (we will see!)
  • 16.  Internet bots (web robots,WWW robots or bots) are software applications that run automated tasks over the Internet  A Web crawler is an Internet bot that systematically browses theWorld Wide Web, typically for the purpose ofWeb indexing  Web scraping is a computer software technique of extracting information from websites
  • 17.  A Web Crawler Starts with a list of URLs to visit. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit hyperlinks0 = getAllLexemes(rsp.Body, "href="((http:)?//([^/rn]*))?(/?[^"]*)""); hyperlinks1= getAllLexemes(rsp.Body, "src="((http:)?//([^/rn]*))?(/?[^"]*)"");
  • 18.  Web Scraping: A simple yet powerful approach to extract information from web pages can be based on regular expression matching facilities of programming languages (for instance C++, Perl or Python)
  • 19. Regular Expressions cartoon from xkcd WebScraping wScraping (8, "http://emails.com/victim"); wScraping.findAll( "^(?n:(?<address1>(d{1,5}( 1/[234])?(x20[A-Z]([a-z])+)+ )|(P.O. Box d{1,5}))s{1,2}(?i:(?<address2>(((APT|B LDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)x20w{ 1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR).?)s{1,2})?)(?< city>[A-Z]([a-z])+(.?)(x20[A-Z]([a-z])+){0,2}), x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD] |T[NX]|UT|V[AIT]|W[AIVY])x20(?<zipcode>(?!0{5})d{5}(-d {4})?))$" ); We’ve saved the day!
  • 20. Everybody stand back! We know regular expressions The end Thank you for your patience!