All you need to know about crawlers
Noemi Ferrera
@TheTestLynx
About me
● Currently working @Amazon
○ Disclaimer: I am not representing Amazon, nor talking about anything to do with my current, previous or future experience within Amazon.
● Development and testing professionally since 2009
○ IBM, Microsoft, Dell, Netease…
● Over 20 presentations worldwide
● Author of the book “How to Test a Time Machine”
● Contact:
○ https://thetestlynx.wordpress.com
○ @thetestlynx on Twitter
Agenda
● What’s a crawler?
● Why and when do we need a crawler?
● Types of crawlers
● Components of a crawler
○ View/node
○ Arcs/links
○ Visited storage
○ Heat map
● Example
● What can go wrong
● What you need to succeed
What’s a crawler?
Definition
A crawler is an automated system that iterates through the parts of an application with the objective of finding issues or exploring it.
The application can be a web application, but also any other type of application.
Why and when do we need a crawler?
● Discovery testing
● Finding particular common issues (e.g. 404 errors)
● Quick coverage
● Generally runs in production, or late in pre-production
Types of Crawlers
● UI vs API
● View First vs Arc First
● Exhaustive vs Shortcutted
● Random vs Smart
Types of crawlers
UI vs API
● UI
○ Uses the UI to navigate through the application
○ Closer to the user’s behaviour
○ Checks elements, not only links
● API
○ Uses the API to navigate through the application
○ Faster to run
○ Focuses mostly on links and API points
Types of crawlers
View first vs Arc first
● View first
○ Focuses first on the view, then navigates
○ Better when the application has many checks but not much navigation
● Arc first
○ Focuses first on the navigation, then checks the view
○ Better when views have few things to check but a long list of navigation points
Types of crawlers
Exhaustive vs Shortcutted
● Exhaustive
○ Aims to visit the entire application
○ Better for smaller applications, or when there is a lot of time to cover it all
○ Might make too many calls or take too long to find issues
● Shortcutted
○ Stops after a number of visits
○ Better if the application is too big and there is not enough time to cover it all
○ Might end after visiting the important parts of the application
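As a rough illustration of the difference, the sketch below crawls a small in-memory graph. The `app` dictionary, the `crawl` function and its `max_visits` parameter are invented for this example and are not part of the talk; a real crawler would discover the links from the application itself.

```python
from collections import deque


def crawl(app, start, max_visits=None):
    """Visit views reachable from start; stop early if max_visits is set."""
    visited = []
    queue = deque([start])
    while queue:
        if max_visits is not None and len(visited) >= max_visits:
            break  # shortcutted: the visit budget is used up
        view = queue.popleft()
        if view in visited:
            continue
        visited.append(view)
        queue.extend(link for link in app.get(view, []) if link not in visited)
    return visited


# Hypothetical application: each view maps to the views it links to.
app = {
    "home": ["products", "about"],
    "products": ["cart", "home"],
    "about": [],
    "cart": ["checkout"],
    "checkout": [],
}

exhaustive = crawl(app, "home")                 # visits all five views
shortcutted = crawl(app, "home", max_visits=3)  # stops after three visits
```

With the budget of three visits, the shortcutted run never reaches `cart` or `checkout`, which is the trade-off to keep in mind when choosing the limit.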
Types of crawlers
Random vs Smart
● Random
○ Could be partially random
○ Likely needs to be shortcutted
● Smart
○ Uses some logic to give priority to parts of the application
○ Might end after visiting the important parts of the application
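One possible way to picture the two strategies is as two ways of ordering the links found in a view. The function names, the example links and the priority values below are assumptions made for this sketch, not something taken from the talk.

```python
import random


def random_order(links, seed=None):
    """Random crawler: visit the links in a shuffled order."""
    shuffled = list(links)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def smart_order(links, priority):
    """Smart crawler: visit the links with the highest priority first."""
    return sorted(links, key=lambda link: priority.get(link, 0), reverse=True)


links = ["/about", "/checkout", "/blog", "/cart"]
# Hypothetical priorities, for example derived from usage data.
priority = {"/checkout": 10, "/cart": 8, "/about": 1}

smart = smart_order(links, priority)
shuffled = random_order(links, seed=42)  # a fixed seed makes a run repeatable
```

Passing a seed is a design choice worth considering for random crawlers: it keeps the exploration random while letting a failing run be reproduced.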
Components of a crawler
Key concepts
● View/node
● Arcs/links
● Visited storage
● Heat map
Components of a crawler
View/Node
● How to tell when you are in a different view?
○ Website: the URL
○ External links: avoid navigating to them
○ Games or harder apps
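For a website, a minimal sketch of both ideas could look like this. The helper names `view_key` and `is_external` are hypothetical, and treating the fragment and a trailing slash as irrelevant is an assumption that does not hold for every application (for example, single-page apps that route on the fragment).

```python
from urllib.parse import urldefrag, urlparse


def view_key(url):
    """Identify a view by its URL without the fragment and trailing slash."""
    without_fragment, _ = urldefrag(url)
    return without_fragment.rstrip("/")


def is_external(url, base_url):
    """Return True when url points to a different host than base_url."""
    return urlparse(url).netloc != urlparse(base_url).netloc


base = "https://www.selenium.dev"
same_view = view_key("https://www.selenium.dev/documentation/#intro") == view_key(
    "https://www.selenium.dev/documentation"
)
external = is_external("https://github.com/SeleniumHQ", base)
internal = is_external("https://www.selenium.dev/blog", base)
```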
Components of a crawler
Arcs/links
● How to navigate?
○ Clicks
○ API calls
○ Swiping and other actions
○ VR apps: other interactions
● Clickable objects?
○ Websites: href
○ All DOM objects
■ Containers?
○ Moving/changing objects
○ Dynamism
○ Hidden elements?
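For the simplest case, websites and href, the arcs of a view can be collected from the page source. The sketch below uses only the standard library; the `HrefCollector` class and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without a target
            self.hrefs.append(href)


# Hypothetical page source; a crawler would read this from the application.
page = """
<a href="/docs">Docs</a>
<a>No target</a>
<div onclick="open()">Clickable container</div>
<a href="https://example.com/blog">Blog</a>
"""

collector = HrefCollector()
collector.feed(page)
found = collector.hrefs
```

Note that the clickable `div` is not found this way, which is why looking only at href may miss containers and other dynamic objects.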
Components of a crawler
Visited storage
[Diagram: the visited storage first holds index.html, then index.html and the second view.]
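A visited storage can be as small as a set plus the order of the visits. The `VisitedStorage` class below is a hypothetical sketch rather than the talk's implementation; it mirrors the two views named in the diagram.

```python
class VisitedStorage:
    """Remember which views were already explored, in visiting order."""

    def __init__(self):
        self._seen = set()
        self.order = []

    def add(self, view):
        """Record a view; return False if it was already stored."""
        if view in self._seen:
            return False
        self._seen.add(view)
        self.order.append(view)
        return True

    def __contains__(self, view):
        return view in self._seen


visited = VisitedStorage()
first = visited.add("index.html")
second = visited.add("second view")
again = visited.add("index.html")  # a link back to the start page
```

Checking the return value of `add` before exploring a view is what keeps the crawler from looping forever between pages that link to each other.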
Components of a crawler
Heat map
● By usage
● By issues found
● By novelty
● Others
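One way to read this slide is that the heat map gives each view a score, and the crawler visits the hottest views first. The sketch below assumes that reading; the `ViewStats` fields, the weights and the numbers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ViewStats:
    usage: int = 0         # how often users reach the view
    issues_found: int = 0  # issues previously found in the view
    visits: int = 0        # how many times the crawler has been there


def heat(stats, usage_weight=1.0, issue_weight=5.0, novelty_weight=10.0):
    """Combine usage, past issues and novelty into one score."""
    novelty = 1.0 / (1 + stats.visits)  # never-visited views score highest
    return (usage_weight * stats.usage
            + issue_weight * stats.issues_found
            + novelty_weight * novelty)


heat_map = {
    "/checkout": ViewStats(usage=40, issues_found=3, visits=2),
    "/about": ViewStats(usage=5, issues_found=0, visits=0),
    "/blog": ViewStats(usage=15, issues_found=1, visits=9),
}
hottest_first = sorted(heat_map, key=lambda view: heat(heat_map[view]), reverse=True)
```

The weights decide what "smart" means for a given application, so they are worth tuning against the kind of issues the crawler is expected to find.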
Crawling with Selenium
Example
    import sys

    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    import view_class  # the speaker's module (not shown on the slides)

    # Assumption: ViewClass(url) exposes url, count (initially -1) and
    # actions (a dict that starts empty).


    class WebCrawlerSelenium:
        def __init__(self):
            # Start the crawler
            driver = webdriver.Chrome(...)
            self.top_level = 10  # maximum depth before the crawl stops
            url = "https://www.selenium.dev"
            driver.get(url)
            view = view_class.ViewClass(url)
            # Explore the first view
            self.explore(view, [], 0, driver)
            driver.close()

        def explore(self, view, visited, current_level, driver):
            # Explore each level
            current_level = current_level + 1
            if current_level >= self.top_level:
                sys.exit("Max visit reached")
            visited.append(view.url)
            self.check_status(view.url)
            # Initialize the linked views
            if view.count == -1:
                view.count = 0
                self.get_all_href(view, driver)  # adds to view.count
            while view.count > 0:
                self.get_next_view(view, visited, current_level, driver)

        def check_status(self, url):
            # Check status with API
            status_code = requests.get(url).status_code
            if status_code < 200 or status_code >= 400:
                sys.exit("Error on url " + url)

        def get_all_href(self, view, driver):
            # Get all references for the node, finding by <a> tag
            for a_tag in driver.find_elements(By.TAG_NAME, "a"):
                href = a_tag.get_attribute("href")  # full URL
                if href is None or href in view.actions:
                    continue
                # Add to actions: full URL -> href as written in the page
                view.actions[href] = a_tag.get_dom_attribute("href")
                view.count = view.count + 1

        def get_next_view(self, view, visited, current_level, driver):
            # Get the next url, walking the actions from last to first
            view.count = view.count - 1
            sub_url = list(view.actions)[view.count]
            if sub_url in visited:
                return
            # Initialize the view
            subview = view_class.ViewClass(sub_url)
            # Click next action (UI navigation; an API crawler would call requests.get)
            if driver.current_url != view.url:
                driver.get(view.url)  # back to this view after exploring a linked one
            self.try_click(view.actions[sub_url], driver)
            # Explore next (recursive call implied by the slide's callout)
            self.explore(subview, visited, current_level, driver)

        def try_click(self, href, driver):
            # Tries to click the element; we could add other actions here
            xpath = '//a[@href="' + href + '"]'
            try:
                element = driver.find_element(By.XPATH, xpath)
                element.click()
            except Exception:
                print("Could not find the xpath")
What could go wrong
● How to identify views? (already covered)
○ External links?
○ Keep track of visited views
○ Top level
● How to identify navigation points/arcs? (already covered)
○ Partial vs full hrefs
● Forms, e.g. login
● Pop-ups
● Cookies
● Dynamic objects
● Stale links
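The "partial vs full hrefs" problem can often be handled by resolving every href against the URL of the page it was found on. A minimal sketch with the standard library follows; the page URL and the href values are hypothetical.

```python
from urllib.parse import urljoin

page_url = "https://www.selenium.dev/documentation/webdriver/"

# Hypothetical href values as they might appear in the page source.
raw_hrefs = ["/downloads", "elements/", "../grid/", "https://github.com/SeleniumHQ"]

# Partial hrefs are resolved against the page; full hrefs are left unchanged.
full_urls = [urljoin(page_url, href) for href in raw_hrefs]
```

Storing the resolved URL in the visited storage, rather than the raw href, avoids treating `/downloads` and its full URL as two different views.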
What you need to succeed
● Know: graph, trees, types of traversals
○ Tracking visited nodes
● App knowledge
○ Experience or tool (heat map generator…)
● What type of issues are you looking for?
○ API? UI? When do they happen?
● Make sure you cannot cover these with other testing!!!
Summary
● What’s a crawler?
● Why and when do we need a crawler?
○ Discovery
○ Common issues
○ Quick coverage
● Types of crawlers
○ UI / API / Mixed
○ View first / Arc first
○ Exhaustive / Shortcutted
○ Random / Smart
● Components of a crawler
○ View/node
○ Arcs/links
○ Visited storage
○ Heat map
● Example
● What can go wrong
● What you need to succeed
Thank you!
https://thetestlynx.wordpress.com
@thetestlynx on Twitter
Noemi Ferrera
All you need to know about crawlers

  • 1.
  • 2.
  • 3. All you need to know about crawlers Noemi Ferrera @TheTextLynx
  • 4. All you need to know about crawlers
  • 5. All you need to know about crawlers
  • 6. All you need to know about crawlers
  • 7. About me ● Currently working @Amazon Disclaimer: I am not representing Amazon, not talking about anything to do with my current, previous or future experience within Amazon. Development and testing professionally since 2009 IBM, Microsoft, Dell, Netease… ● Over 20 presentations worldwide ● Author of the book “How to Test a Time Machine” ● Contact: https://thetestlynx.wordpress.com @thetestlynx in twitter Noemi Ferrera
  • 8. Agenda ● What’s a crawler? ● Why and when do we need a crawler? ● Types of crawlers ● Components of a crawler ○ View/node ○ Arcs/links ○ Visited storage ○ Heat map ● Example ● What can go wrong ● What you need to succeed All you need to know about crawlers…
  • 9. What’s a crawler? Definition: a crawler is an automatic system that iterates through the parts of an application, with the objective of finding issues or exploring it. It can be a web application, but also other types of applications.
  • 10. Why and when do we need a crawler? ● Discovery testing ● Finding particular common issues (ex. 404) ● Quick coverage ● Generally runs in production, or late in pre-production
  • 11. Types of Crawlers ● UI vs API ● View First vs Arc First ● Exhaustive vs Shortcutted ● Random vs Smart
  • 12. Types of crawlers: UI vs API ● UI: uses the UI to navigate through the application; closer to the user’s behaviour; checks elements, not only links ● API: uses the API to navigate through the application; faster to run; focuses mostly on links and API points
  • 13. Types of crawlers: View First vs Arc First ● View First: focuses first on the view, then navigates; better when the application has many checks but not much navigation ● Arc First: focuses first on the navigation, then checks the view; better when views have few things to check but a long list of navigation points
  • 14. Types of crawlers: Exhaustive vs Shortcutted ● Exhaustive: aims to visit the entire application; better for smaller applications, or when there is a lot of time to cover it all; might make too many calls or take too long finding issues ● Shortcutted: stops after a number of visits; better if the application is too big and there is not enough time to cover it all; might end after visiting the important parts of the application
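
The exhaustive/shortcutted distinction on slide 14 can be sketched as a single traversal with an optional visit budget. This is only an illustration, not the speaker's code: the `crawl` function, the `max_visits` parameter and the in-memory `SITE` graph are all assumptions made for the example, and a real crawler would discover links as it goes instead of reading them from a dictionary.

```python
from collections import deque


def crawl(graph, start, max_visits=None):
    """Breadth-first walk over a dict mapping each view to its linked views.

    max_visits=None behaves as an exhaustive crawler; an integer budget
    turns it into a shortcutted one. (Both names are hypothetical.)
    """
    visited = []          # views in the order they were visited
    seen = {start}        # views already queued, to avoid revisiting
    queue = deque([start])
    while queue:
        if max_visits is not None and len(visited) >= max_visits:
            break         # shortcut: budget spent, stop here
        view = queue.popleft()
        visited.append(view)
        for linked in graph.get(view, []):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return visited


# A made-up four-view application used only for illustration.
SITE = {
    "index": ["about", "docs"],
    "about": ["index"],
    "docs": ["index", "api"],
    "api": [],
}

print(crawl(SITE, "index"))                # exhaustive
print(crawl(SITE, "index", max_visits=2))  # shortcutted
```

With a budget of 2 the walk stops before reaching `api`, which is the trade-off the slide describes: less time spent, but parts of the application stay unvisited.
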
  • 15. Types of crawlers: Random vs Smart ● Random: could be partially random; likely needs to be shortcutted ● Smart: uses some logic to give priority to parts of the application; might end after visiting the important parts of the application
  • 16. Components of a crawler ● View/node ● Arcs/links ● Visited storage ● Heat map Key concepts
  • 17. Components of a crawler ● View/node ● Arcs/links ● Visited storage ● Heat map Key concepts
  • 18. Components of a crawler: View/node ● How to tell when you are in a different view? ○ Website: URL ○ External links: avoid navigating to them ○ Games or harder apps
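
For a website, the two points on slide 18 (the URL identifies the view, external links are not followed) could look roughly like the sketch below. The helper names `view_id` and `is_external` are hypothetical, and the sketch assumes that a `#fragment` or a trailing slash does not change the view, which is not true for every application (single-page apps that route on the fragment, for instance).

```python
from urllib.parse import urldefrag, urlparse


def view_id(url):
    """Identity of a view: the URL without its fragment or trailing slash."""
    without_fragment, _fragment = urldefrag(url)
    return without_fragment.rstrip("/")


def is_external(url, base_url):
    """True when the link points to a different host than the application."""
    return urlparse(url).netloc != urlparse(base_url).netloc


base = "https://www.selenium.dev"
print(view_id("https://www.selenium.dev/documentation/#intro"))
print(is_external("https://github.com/SeleniumHQ", base))
print(is_external("https://www.selenium.dev/blog", base))
```

Storing `view_id(url)` instead of the raw URL in the visited storage keeps the same page from being counted twice under slightly different addresses.
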
  • 19. Components of a crawler ● View/node ● Arcs/links ● Visited storage ● Heat map Key concepts
  • 20. Components of a crawler ● How to navigate? ○ Clicks ○ API calls ○ Swiping and other actions ○ VR apps - other interactions ● Clickable objects? ○ Websites - href ○ All dom objects ■ Containers? ○ Moving/Changing objects Arcs/links
  • 21. Components of a crawler ● How to navigate? ○ Clicks ○ API calls ○ Swiping and other actions ○ VR apps - other interactions ● Clickable objects? ○ Websites - href ○ All dom objects ■ Containers? ○ Moving/Changing objects Arcs/links
  • 22. Components of a crawler ● How to navigate? ○ Clicks ○ API calls ○ Swiping and other actions ○ VR apps - other interactions ● Clickable objects? ○ Websites - href ○ All dom objects ■ Containers? ○ Moving/Changing objects Arcs/links
  • 23. Components of a crawler ● How to navigate? ○ Clicks ○ API calls ○ Swiping and other actions ○ VR apps - other interactions ● Clickable objects? ○ Websites - href ○ All DOM objects ■ Containers? ○ Moving/Changing objects ○ Dynamism ○ Hidden elements? Arcs/links
  • 24. Components of a crawler ● View/node ● Arcs/links ● Visited storage ● Heat map Key concepts index.html
  • 25. Components of a crawler ● View/node ● Arcs/links ● Visited storage ● Heat map Key concepts index.html Second view
  • 26. Components of a crawler ● View/node ● Arcs/links ● Visited storage ● Heat map Key concepts
  • 27. Components of a crawler ● By usage ● By issues found ● By novelty ● Others Heat map
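
One way a heat map can drive a "smart" crawler is to keep the pending views in a priority queue ordered by heat. The sketch below assumes the heat values already exist (from usage, issues found, novelty or anything else) and are passed in as a plain dictionary; `smart_order`, `graph` and `heat` are hypothetical names, and the numbers are invented.

```python
import heapq


def smart_order(graph, start, heat, max_visits):
    """Visit the hottest known view first, up to max_visits views."""
    seen = {start}
    # heapq is a min-heap, so the heat is negated to pop the hottest first.
    heap = [(-heat.get(start, 0), start)]
    visited = []
    while heap and len(visited) < max_visits:
        _, view = heapq.heappop(heap)
        visited.append(view)
        for linked in graph.get(view, []):
            if linked not in seen:
                seen.add(linked)
                heapq.heappush(heap, (-heat.get(linked, 0), linked))
    return visited


# Invented example: the checkout flow is the hottest part of the app.
graph = {
    "index": ["blog", "checkout", "help"],
    "checkout": ["payment"],
}
heat = {"checkout": 9, "payment": 8, "help": 2, "blog": 1}

print(smart_order(graph, "index", heat, max_visits=3))
```

With a budget of three visits the crawler spends them on `index`, `checkout` and `payment`, leaving the colder `blog` and `help` views unvisited.
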
  • 28. Crawling with Selenium: example. Start the crawler and explore the first view.
        import sys
        import requests
        from selenium import webdriver
        from selenium.webdriver.common.by import By
        import view_class  # provides ViewClass (url, count, actions); not shown in the slides

        class WebCrawlerSelenium:
            def __init__(self):
                driver = webdriver.Chrome(...)
                self.top_level = 10
                url = "https://www.selenium.dev"
                driver.get(url)                    # start the crawler
                view = view_class.ViewClass(url)
                self.explore(view, [], 0, driver)  # explore the first view
                driver.close()
  • 29. Crawling with Selenium: example (cont.). Explore each level and initialize the linked views.
            def explore(self, view, visited, current_level, driver):
                current_level = current_level + 1
                if current_level >= self.top_level:
                    sys.exit("Max visit reached")
                visited.append(view.url)
                self.check_status(view.url)
                if view.count == -1:                 # links not collected yet
                    view.count = 0
                    self.get_all_href(view, driver)  # adds to view.count
                while view.count > 0:
                    self.get_next_view(view, visited, current_level, driver)
  • 30. Crawling with Selenium: example (cont. 2). Check the status with the API.
            def check_status(self, url):
                status_code = requests.get(url).status_code
                if status_code < 200 or status_code >= 400:
                    sys.exit("Error on url " + url)
  • 31. Crawling with Selenium: example (cont. 3). Get all references for the node: find by "a" tag, add to actions.
            def get_all_href(self, view, driver):
                for a_tag in driver.find_elements(By.TAG_NAME, "a"):
                    view.count = view.count + 1
                    href = a_tag.get_attribute("href")  # full URL
                    # keep the href as written in the page, to locate the link later
                    view.actions[href] = a_tag.get_dom_attribute("href")
  • 32. Crawling with Selenium: example (cont. 4). Get the next unvisited url, initialize its view, click the action, explore next.
            def get_next_view(self, view, visited, current_level, driver):
                view.count = view.count - 1
                # next linked url that has not been visited yet
                sub_url = next((u for u in view.actions if u not in visited), None)
                if sub_url is None:
                    view.count = 0
                    return
                subview = view_class.ViewClass(sub_url)
                # UI navigation; an API crawler would call requests.get here
                self.try_click(view.actions[sub_url], driver)
                # explore next (implied by the slide notes, call not shown on the slide)
                self.explore(subview, visited, current_level, driver)
  • 33. Crawling with Selenium: example (cont. 5). Try to click the element.
            def try_click(self, href, driver):
                xpath = '//a[@href="' + href + '"]'
                try:
                    element = driver.find_element(By.XPATH, xpath)
                    element.click()
                except Exception:
                    print("Could not find the xpath")
                # other actions could be added here
  • 34. What could go wrong ● How to identify views? (Already covered) ○ External links? ○ Keep track of visited ○ Top level ● How to identify navigation points/arcs? (already covered) ○ Partial vs full hrefs
  • 35. What could go wrong ● How to identify views? ○ External links? ○ Keep track of visited ○ Top level ● How to identify navigation points/arcs? ○ Partial vs full hrefs ● Forms, ex. login
  • 36. What could go wrong ● How to identify views? (Already covered) ○ External links? ○ Keep track of visited ○ Top level ● How to identify navigation points/arcs? (already covered) ○ Partial vs full hrefs ● Forms, ex. login ● Pop-ups
  • 37. What could go wrong ● How to identify views? (Already covered) ○ External links? ○ Keep track of visited ○ Top level ● How to identify navigation points/arcs? (already covered) ○ Partial vs full hrefs ● Forms, ex. login ● Pop-ups ● Cookies ● Dynamic objects ● Stale links
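
The "partial vs full hrefs" pitfall can usually be handled by resolving every href against the URL of the page it was found on before storing or comparing it. A minimal sketch using the standard library follows; `absolute_href` is a hypothetical helper name and the paths are illustrative only.

```python
from urllib.parse import urljoin


def absolute_href(page_url, href):
    """Resolve an href as written in the page against the page's own URL."""
    return urljoin(page_url, href)


page = "https://www.selenium.dev/documentation/"
print(absolute_href(page, "webdriver/"))            # relative to the page
print(absolute_href(page, "/downloads"))            # relative to the site root
print(absolute_href(page, "https://github.com/x"))  # already a full URL
```

Comparing only resolved URLs against the visited storage avoids treating `webdriver/` and its full address as two different views.
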
  • 38. What you need to succeed ● Know: graphs, trees, types of traversals ○ Tracking visited nodes ● App knowledge ○ Experience or tool (heat map generator…) ● What type of issues are you looking for? ○ API? UI? When do they happen? ● Make sure you cannot cover these with other testing!!!
  • 39. Summary ● What’s a crawler? ● Why and when do we need a crawler? ○ Discovery ○ Common issues ○ Quick coverage ● Types of crawlers ● Components of a crawler ● Example ● What can go wrong ● What you need to succeed
  • 40. Summary ● What’s a crawler? ● Why and when do we need a crawler? ● Types of crawlers ○ UI/API/MIXED ○ VIEW FIRST / ARC FIRST ○ EXHAUSTIVE / SHORTCUTTED ○ RANDOM / SMART ● Components of a crawler ● Example ● What can go wrong ● What you need to succeed
  • 41. Summary ● What’s a crawler? ● Why and when do we need a crawler? ● Types of crawlers ● Components of a crawler ○ View/node ○ Arcs/links ○ Visited storage ○ Heat map ● Example ● What can go wrong ● What you need to succeed
  • 42. Summary ● What’s a crawler? ● Why and when do we need a crawler? ● Types of crawlers ● Components of a crawler ● Example ● What can go wrong ● What you need to succeed

Editor's Notes

  1. In this presentation we will see a web app example
  2. VR - looking for a while could be an action