RESEARCH: Validating session isolation for
web crawling to provide data integrity
Table of contents
● Web Rendering, Search Engines, and Web Crawlers
● Research context
● What is Session Isolation?
● Session Isolation in the wild
● Solving Session Isolation
● Tests
● Conclusion
@giacomozecchini
Web Rendering, Search
Engines, and Web Crawlers
We are not in the 90s anymore
As new web rendering patterns gained traction, we
moved from static HTML pages to more complex ways
of rendering content.
https://www.patterns.dev/posts/rendering-introduction/
New rendering patterns emerged
With the widespread adoption of rendering patterns
such as Client-Side Rendering and Progressive
Hydration, search engines were effectively forced
to start rendering web pages in order to retrieve
almost as much content as users get in their
browsers.
https://developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics
Web Rendering Systems to save the day
Search engines have developed their own web
rendering systems (or web rendering services).
A web rendering system is software that renders
a large number of web pages using automated
browsers.
“Googlebot & JavaScript: A Closer Look at the WRS” by Martin Splitt: https://www.youtube.com/watch?v=Qxd_d9m9vzo
Web Crawler tools followed Search Engines
Web crawling tools also started to build
rendering systems to keep up with the evolution
of the web and mimic search engines'
capabilities.
But rendering is hard!
There is no industry standard for rendering
pages, which means that not even leading search
engines such as Google are doing it in the
“correct” way.
Each web rendering system is built to serve
specific use cases, which results in inevitable
tradeoffs.
Research Context
We’ve been using web crawling tools for years
At Merj we’ve been happy users of many web
crawling tools, and over the years we have probably
used all of them at least once.
We’ve been building custom WRS solutions
For use cases such as custom data sources in
complex enterprise data pipelines, we have
been building our own web crawling systems.
Data integrity assurances
The starting point of this research was a recent
project that required us to provide assurances to
a legal and compliance team about the data
quality and integrity of a data source (rendered
pages). These were to be ingested into a machine
learning model.
Data validation process
In addition to other checks present in our data
integrity validation process, we tested the output
of multiple web crawling tools.
We found some unexpected values which varied
across tools.
What is Session Isolation?
What is Session Isolation?
While a page is rendered in an isolated rendering
session, it must not be able to access any data
from previous rendering sessions or be influenced
by the rendering of other pages.
Stateless is a similar concept
This is similar to the concept of “stateless” as
used for web crawlers, where all fetches are
completed without reusing cookies and without
keeping in memory any specific data.
Session Isolation in the Wild
Content customisations based on navigation
Real-world session isolation problems can be
observed in the rendering of pages whose content
is customised based on user navigation.
The “Recently viewed products” feature
A practical example is the "Recently viewed
products" box.
These boxes show the user's recent browsing
history, with links to various products, and can be
found on many websites.
Ikea.com
Asos.com
Adidas.com
Visited pages are saved in memory
For all three of the previous examples, the
"Recently viewed products" box is implemented
by saving the pages visited by the user in the
browser memory.
Saved data may affect rendering
For those web crawlers that render web pages
without isolating the rendering sessions, the data
saved in the browser's memory may affect the
rendering of other web pages of the same
website.
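A toy model can make the leak concrete. The sketch below is not how any of the sites above implement the feature: `render`, the `recently_viewed` key, and the product URLs are hypothetical. It only assumes, as described above, that each rendered page records itself in browser storage and lists the pages recorded before it.

```python
def render(url, storage):
    """Return the links a "Recently viewed products" box would add to `url`.

    `storage` stands in for browser storage (a plain dict in this toy model).
    """
    history = storage.setdefault("recently_viewed", [])
    links = [seen for seen in history if seen != url]  # products seen before this page
    if url not in history:
        history.append(url)  # the page records itself for later visits
    return links


urls = ["/p/chair", "/p/lamp", "/p/desk"]

# Crawler without session isolation: one storage reused for every page.
shared = {}
leaky = {url: render(url, shared) for url in urls}

# Crawler with session isolation: fresh storage for every page.
isolated = {url: render(url, {}) for url in urls}

print(leaky)     # later pages carry links to earlier ones
print(isolated)  # every page renders with an empty box
```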
Tools with session isolation behave differently
The result is different when pages are rendered
by search engines or web crawlers that implement
session isolation correctly.
Additional content and “ghost links”
These different ways of rendering pages produce
additional content and a considerable percentage
of “ghost links”, visible only to web crawlers
affected by session isolation issues.
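One way to quantify ghost links is to compare the links a crawler reports for a page against the links found in an isolated rendering of the same page. The following is a minimal sketch; the function names and the link sets are hypothetical and only illustrate the set difference.

```python
def ghost_links(isolated_links, observed_links):
    """Links present in a crawler's rendering but absent from an isolated rendering."""
    return sorted(set(observed_links) - set(isolated_links))


def ghost_link_share(isolated_links, observed_links):
    """Fraction of the observed unique links that are ghost links (0.0 if none observed)."""
    observed = set(observed_links)
    if not observed:
        return 0.0
    return len(observed - set(isolated_links)) / len(observed)


# Hypothetical link sets for one page.
baseline = ["/", "/category/chairs", "/p/chair"]
observed = ["/", "/category/chairs", "/p/chair", "/p/lamp", "/p/desk"]

print(ghost_links(baseline, observed))       # ['/p/desk', '/p/lamp']
print(ghost_link_share(baseline, observed))  # 0.4
```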
[Figure: the same page rendered without session isolation and with session isolation]
Crawling/rendering order matters
Depending on the crawling/rendering order, a web
crawling tool with session isolation issues may
create arbitrary HTML content that changes every
time.
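The order dependence can be illustrated by enumerating crawl orders in a toy model. This sketch assumes a crawler that reuses one storage for all pages and a widget that lists every page rendered earlier; `crawl` and the page names are hypothetical.

```python
from itertools import permutations


def crawl(order):
    """Render pages in `order` with one shared storage; return the links added to each page."""
    history, output = [], {}
    for page in order:
        output[page] = tuple(history)  # the box shows every page rendered before this one
        history.append(page)
    return output


pages = ["PAGE 1", "PAGE 2", "PAGE 3"]

# Collect every distinct rendering of PAGE 2 across all possible crawl orders.
variants = {crawl(order)["PAGE 2"] for order in permutations(pages)}

for variant in sorted(variants):
    print(variant)
print(len(variants), "different renderings of the same page")  # 5
```

With session isolation, the set would contain a single rendering regardless of the crawl order.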
[Figure: rendered output when crawling starts from PAGE 1 and when it starts from PAGE 3]
Three main implications
● Lack of data integrity
● The rendered pages are not an accurate
representation of what search engines will
render and use
● Developers may waste time (and money)
investigating issues which are not present
Effects on SEOs’ day-to-day
Analyses are based on wrong data, for example:
● Content analysis that includes additional content
● Internal linking analysis with X% arbitrary links
* This additional content and these links are not visible to Google & Co.
Effects on SEOs’ day-to-day
These wrong analyses often translate into:
● Wasted time and money
● Wrong choices
Session isolation isn’t limited to web crawlers
Any system that relies on browser-based
functionality might be affected, such as dynamic
rendering services, web performance analysis
tools, and CI/CD pipeline tests.
If it’s an option, it should be clear
There are some cases where you need to keep
data between renderings for specific tests, but
that option should be explicit and intentional,
not a side effect of a hidden problem.
Solving Session Isolation
Partial or incorrect solutions
There are many partial or incorrect ways of
tackling session isolation for web crawling
purposes. Let’s have a look at some of them.
Partial or incorrect solution #1
Clearing cookies manually after the rendering of a
page. The problem here is that cookies are not
the only Web API that can store data.
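The gap can be shown with a toy model of a browser profile. The sketch below is illustrative only: the profile is a plain dictionary, the area names are just the storage mechanisms covered by this research, and `clear_cookies_only` is a hypothetical helper.

```python
# Storage areas named in this research; a real browser has more.
STORAGE_AREAS = ("cookies", "indexeddb", "localstorage", "sessionstorage")


def clear_cookies_only(profile):
    """Partial solution #1: wipe cookies and leave every other area untouched."""
    profile["cookies"].clear()


def leftover_areas(profile):
    """Names of the storage areas that still hold data."""
    return [area for area in STORAGE_AREAS if profile[area]]


# Hypothetical state of a browser profile after rendering one page.
profile = {area: {"last_page": "/p/chair"} for area in STORAGE_AREAS}

clear_cookies_only(profile)
print(leftover_areas(profile))  # ['indexeddb', 'localstorage', 'sessionstorage']
```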
Partial or incorrect solution #2
Opening and closing the browser for each page
you want to render, manually deleting the folders
where the browser stores data. This option is not
efficient at all.
Partial or incorrect solution #3
Using an incognito profile hides some pitfalls as
well. Within an incognito profile, the rendered
pages might share storage, and cross-tab
communication is possible. This option would
solve the problem only if, again, pages are not
rendered in parallel and the browser is started
and stopped for each page.
The optimal solution
Introduced at BlinkOn 6, Browser Context is an
efficient way to have correct session isolation.
Every Browser Context session runs in a separate
renderer process, isolating the storage (cookies,
cache, local storage, etc.) and preventing
cross-tab communication.
How to use Browser Context effectively
Rendering a single page per Browser Context,
closing it at the end of the rendering, and then
opening a new Browser Context for the next page
will guarantee isolated rendering sessions
without the need to restart the browser every
time.
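The lifecycle described above can be sketched as a small loop. `render_isolated` below uses the method names of Playwright's synchronous Python API (`new_context`, `new_page`, `goto`, `content`, `close`); the `Fake*` classes are hypothetical stand-ins added only so the sketch runs without a real browser, and they do not model real browser behaviour beyond keeping storage per context.

```python
def render_isolated(browser, urls):
    """Render each URL in its own browser context and return {url: html}.

    `browser` is expected to offer Playwright's sync API surface:
    browser.new_context(), context.new_page(), page.goto(), page.content(),
    context.close().
    """
    rendered = {}
    for url in urls:
        context = browser.new_context()  # a new context starts with empty storage
        try:
            page = context.new_page()
            page.goto(url)
            rendered[url] = page.content()
        finally:
            context.close()  # discard everything the page stored
    return rendered


# --- Hypothetical stand-ins so the sketch runs without a real browser ---
class FakePage:
    def __init__(self, storage):
        self._storage, self._url, self._seen_before = storage, None, []

    def goto(self, url):
        self._url = url
        self._seen_before = list(self._storage)  # what earlier pages left behind
        self._storage.append(url)

    def content(self):
        return f"<html>{self._url} leaked={self._seen_before}</html>"


class FakeContext:
    def __init__(self):
        self._storage, self.closed = [], False

    def new_page(self):
        return FakePage(self._storage)

    def close(self):
        self.closed = True


class FakeBrowser:
    def __init__(self):
        self.contexts = []

    def new_context(self):
        self.contexts.append(FakeContext())
        return self.contexts[-1]


browser = FakeBrowser()
html = render_isolated(browser, ["/a", "/b"])
print(html["/b"])  # <html>/b leaked=[]</html>
```

With Playwright installed, the same function should accept the object returned by `sync_playwright().start().chromium.launch()`, so the browser is launched once and only contexts are created and closed per page.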
Data integrity > Performance
Using this solution will have a minimal effect on
the web crawlers' performance. In most real-world
cases, the majority of web crawling tool users
would not trade data integrity for an overall
performance difference of a few seconds.
Documentation and example
Additional documentation and examples on the use of
Browser Context can be found here:
● https://chromedevtools.github.io/devtools-protocol/tot/Target/#method-createBrowserContext
● https://pptr.dev/next/api/puppeteer.browser.createincognitobrowsercontext
● https://playwright.dev/docs/api/class-browsercontext
Tests
Methodology
We set up a testing environment with 1,000 pages
that try to communicate with each other using
storage and cross-tab communication Web APIs.
Avoiding false negatives
Rendering 1,000 pages increases the chances that
two or more pages are rendered in parallel or by
the same browser. Using fewer pages may cause
false negatives if the tested web rendering
system runs on a large number of machines in
parallel.
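A rough calculation shows why the page count matters. The sketch below assumes, purely for illustration, that each page is assigned to one of a fixed number of browser instances uniformly at random; real schedulers may behave differently, and `p_shared_browser` and the worker counts are hypothetical.

```python
from math import prod


def p_shared_browser(pages: int, browsers: int) -> float:
    """Probability that at least two pages land on the same browser instance,
    assuming each page is assigned to a browser uniformly at random."""
    if pages > browsers:
        return 1.0  # more pages than browsers: a shared browser is certain
    # Complement of "every page lands on a different browser".
    return 1.0 - prod((browsers - i) / browsers for i in range(pages))


for pages in (2, 10, 50, 1000):
    print(pages, "pages on 200 browsers:", round(p_shared_browser(pages, 200), 3))
```

Under this assumption, a handful of test pages can easily land on separate browsers and hide an isolation problem, while 1,000 pages on 200 browsers cannot.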
Storage isolation tests
Storage isolation tests are focused on Web APIs
that save or access data from the browser's
memory. The goal of each test is to find race
conditions in accessing data saved from previous
or parallel page renderings.
Test #1 - Cookies
Cookies need no introduction. The Cookie interface lets you read
and write small pieces of information in the browser storage.
Test explanation: When the rendering starts, the page creates and
saves a cookie, then checks whether there are cookies saved by other pages.
Fail criterion: If there are cookies other than the ones created for the
rendered page, the test fails.
Test #2 - IndexedDB
IndexedDB is a transactional database system that lets you store and
retrieve objects in browser storage.
Test explanation: When the rendering starts, the page creates or
connects to an IndexedDB database. It then creates and saves a
record in the database, and finally checks whether there are records
saved by other pages.
Fail criterion: If there are records other than the ones created for the
rendered page, the test fails.
Test #3 - LocalStorage
LocalStorage is a mechanism of the Web Storage API by which
browsers can store key/value pairs. Data persists when the browser is
closed and reopened.
Test explanation: When the rendering starts, the page creates and saves
a data item in LocalStorage, then checks whether there are data
items saved by other pages.
Fail criterion: If there are data items other than the ones created for the
rendered page, the test fails.
Test #4 - SessionStorage
SessionStorage is a mechanism of the Web Storage API by
which browsers can store key/value pairs. Data lasts as long as the tab
or the browser is open and survives page reloads and restores.
Test explanation: When the rendering starts, the page creates and
saves a data item in SessionStorage, then checks whether there
are data items saved by other pages.
Fail criterion: If there are data items other than the ones created for the
rendered page, the test fails.
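The four storage tests share one shape: write a marker, read everything back, fail on anything foreign. The sketch below models that logic with a plain dictionary in place of a real storage API; it is not the code from the test environment, and `run_storage_test` and the page identifiers are hypothetical.

```python
def run_storage_test(page_id, storage):
    """Write this page's marker, then report markers left by other pages."""
    storage[f"marker:{page_id}"] = page_id  # create and save a data item
    foreign = sorted(value for value in storage.values() if value != page_id)
    return {"page": page_id, "failed": bool(foreign), "foreign": foreign}


pages = ["page-001", "page-002", "page-003"]

# Renderer that reuses the same storage across pages.
shared = {}
print([run_storage_test(page, shared)["failed"] for page in pages])  # [False, True, True]

# Renderer that gives each page fresh storage.
print([run_storage_test(page, {})["failed"] for page in pages])      # [False, False, False]
```

Note that the first page passes even when storage is shared, which is one more reason a single-page test would not be conclusive.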
Cross-tab communication tests
Cross-tab communication tests are focused on
Web APIs that send or receive data. The goal of
each test is to find if during rendering a page can
receive messages from other pages rendered in
parallel.
Test #5 - Broadcast Channel
The Broadcast Channel API allows communication between windows,
tabs, frames, iframes, and workers of the same origin.
Test explanation: When the rendering starts, the page connects to the
channel and starts sending its page title as a message to the
channel. If other connected pages are sending messages
through the channel, the page receives and saves them.
Fail criterion: If the rendered page gets even a single message from the
Broadcast Channel sent by other pages, the test fails.
Test #6 - Shared Worker
The Shared Worker is a Worker that allows communication between
windows, tabs, frames, iframes, and workers on the same origin.
Test explanation: When the rendering starts, the page connects to the
Shared Worker, starts sending messages to the Worker, and
then starts listening for messages from other pages sent through
the worker.
Fail criterion: If the rendered page gets even a single message from the
Shared Worker sent by other pages, the test fails.
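Both cross-tab tests reduce to the same check: did the page receive anything it did not send? The sketch below models this with an in-memory channel rather than the real Broadcast Channel or Shared Worker APIs; `ToyChannel`, `run_cross_tab_test`, and the page identifiers are hypothetical.

```python
class ToyChannel:
    """In-memory stand-in for a channel: a post reaches every other subscriber."""

    def __init__(self):
        self.inboxes = {}

    def join(self, page):
        self.inboxes[page] = []

    def post(self, sender, message):
        for page, inbox in self.inboxes.items():
            if page != sender:  # the sender does not receive its own message
                inbox.append(message)


def run_cross_tab_test(pages, channel_for):
    """Each page joins the channel it can reach, posts its title, then checks its inbox."""
    for page in pages:
        channel_for(page).join(page)
    for page in pages:
        channel_for(page).post(page, f"title of {page}")
    # A page fails if it received even a single message from another page.
    return {page: bool(channel_for(page).inboxes[page]) for page in pages}


pages = ["page-001", "page-002", "page-003"]

one_channel = ToyChannel()  # pages rendered in tabs of the same browser session
shared_result = run_cross_tab_test(pages, lambda page: one_channel)

own_channel = {page: ToyChannel() for page in pages}  # one isolated session per page
isolated_result = run_cross_tab_test(pages, own_channel.__getitem__)

print(shared_result)    # every page fails
print(isolated_result)  # every page passes
```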
71% of web crawlers failed at least one test
Test Results
Test                 Web crawlers that failed
Cookie               29%
IndexedDB            64%
LocalStorage         71%
SessionStorage       21%
Broadcast Channel    14%
Shared Worker        14%
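The percentages can be read as whole numbers of tools. The check below assumes they were computed over the 14 tools listed in the status table in the Conclusion, which is an inference rather than something the slides state; it fits the headline figure, since 4 of the 14 tools passed all tests and the remaining 10 are about 71%.

```python
TOOLS_TESTED = 14  # assumption: the number of tools in the status table

reported = {
    "Cookie": 29,
    "IndexedDB": 64,
    "LocalStorage": 71,
    "SessionStorage": 21,
    "Broadcast Channel": 14,
    "Shared Worker": 14,
}


def implied_count(percent, total=TOOLS_TESTED):
    """Whole number of tools closest to a rounded percentage of `total`."""
    return round(percent * total / 100)


for test, percent in reported.items():
    count = implied_count(percent)
    # The count must round back to the reported percentage to be consistent.
    assert round(100 * count / TOOLS_TESTED) == percent
    print(f"{test}: {count} of {TOOLS_TESTED}")
```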
Source Code on GitHub
Replicate the testing environment using the
following code:
https://github.com/merj/test-crawl-session-isolation
Conclusion
Cause of storage isolation problems
It’s hard to pinpoint what causes the storage
isolation issues. Implementations may vary
drastically, and we can only speculate on
the cause.
Cause of cross-tab communication problems
A possible cause for failing the cross-tab
communication tests (Broadcast Channel and
Shared Worker) is that the same browser is used
to render pages in parallel across multiple
windows and/or tabs.
Don’t clean it manually!
Web crawlers might pass all tests included in this
research by manually cleaning every single
storage at the end of every page rendering
session, but this approach is not a secure and
viable solution to guarantee data integrity.
Workarounds are not future proof solutions
The Web APIs and browser interfaces included in
this research aren't the only ones that might
have access to browser memory or cache, and
keeping up with the development of all new
standards and web features is a complex and
time-consuming process.
Our goal is to improve web crawling!
Not all web crawlers have fixed their session
isolation issues yet; some are still investigating.
The Docker crawling test framework supported
those that have fixed the issue, and it might be
included in their future release checks. Some
web crawler vendors involved us throughout the
entire remediation process.
Web Crawler                       Status
Ahrefs                            Fixed - 15 Nov 2022
Botify                            Passed all tests
ContentKing                       Fixed - 27 Oct 2022
FandangoSEO                       Looking into this
JetOctopus                        Looking into this
Lumar (formerly Deepcrawl)        Passed all tests
Netpeak Spider                    Looking into this
OnCrawl                           Passed all tests
Ryte                              Fixed - 10 Oct 2022
Screaming Frog                    Fixed - 17 Aug 2022
SEO PowerSuite WebSite Auditor    Looking into this
SEOClarity                        Looking into this
Sistrix                           Passed all tests
Sitebulb                          Looking into this
Last update: 15 Nov 2022
Final thoughts
● Rendering is hard; we hope that in the future
there will be an industry standard
● Make sure you validate your data
Blog
We published the research on our blog:
https://merj.com/blog/validating-session-isolation-for-web-crawling-to-provide-data-integrity
Thank you for your time!
@GiacomoZecchini on Twitter, Slideshare & Speakerdeck
We work with enterprise clients to support them with SEO Innovation,
Research, & Development. Want to work with us?
rfp@merj.com +44 (0) 203 322 2660
7 Pancras Square
London, N1C 4AG

More Related Content

Similar to Validating Session Isolation for Web Crawling to Provide Data Integrity

You Can Work on the Web Patform! (GOSIM 2023)
You Can Work on the Web Patform! (GOSIM 2023)You Can Work on the Web Patform! (GOSIM 2023)
You Can Work on the Web Patform! (GOSIM 2023)Igalia
 
Headless browser: puppeteer and git client : GitKraken
Headless browser: puppeteer and git client : GitKrakenHeadless browser: puppeteer and git client : GitKraken
Headless browser: puppeteer and git client : GitKrakenSheikhMoonwaraAnjumM
 
Are you there Page Experience? It's Me, DevTools.
Are you there Page Experience? It's Me, DevTools.Are you there Page Experience? It's Me, DevTools.
Are you there Page Experience? It's Me, DevTools.Rachel Anderson
 
Are you there Page Experience? It's me, DevTools
Are you there Page Experience? It's me, DevToolsAre you there Page Experience? It's me, DevTools
Are you there Page Experience? It's me, DevToolsJamie Indigo
 
Web Performance & Search Engines - A look beyond rankings
Web Performance & Search Engines - A look beyond rankingsWeb Performance & Search Engines - A look beyond rankings
Web Performance & Search Engines - A look beyond rankingsGiacomo Zecchini
 
Headless Browser – A Stepping Stone Towards Developing Smarter Web Applicatio...
Headless Browser – A Stepping Stone Towards Developing Smarter Web Applicatio...Headless Browser – A Stepping Stone Towards Developing Smarter Web Applicatio...
Headless Browser – A Stepping Stone Towards Developing Smarter Web Applicatio...pCloudy
 
Stapling and patching the web of now - ForwardJS3, San Francisco
Stapling and patching the web of now - ForwardJS3, San FranciscoStapling and patching the web of now - ForwardJS3, San Francisco
Stapling and patching the web of now - ForwardJS3, San FranciscoChristian Heilmann
 
Web topic 26 browser compatibilty and security
Web topic 26  browser compatibilty and securityWeb topic 26  browser compatibilty and security
Web topic 26 browser compatibilty and securityCK Yang
 
Professional web development with libraries
Professional web development with librariesProfessional web development with libraries
Professional web development with librariesChristian Heilmann
 
rendre AJAX crawlable par les moteurs
rendre AJAX crawlable par les moteursrendre AJAX crawlable par les moteurs
rendre AJAX crawlable par les moteursSerge Esteves
 
Technical Tips: Visual Regression Testing and Environment Comparison with Bac...
Technical Tips: Visual Regression Testing and Environment Comparison with Bac...Technical Tips: Visual Regression Testing and Environment Comparison with Bac...
Technical Tips: Visual Regression Testing and Environment Comparison with Bac...Building Blocks
 
Mastering Mobile Web with 8 Key Rules
Mastering Mobile Web with 8 Key RulesMastering Mobile Web with 8 Key Rules
Mastering Mobile Web with 8 Key RulesMobile Labs
 
AD113 Speed Up Your Applications w/ Nginx and PageSpeed
AD113  Speed Up Your Applications w/ Nginx and PageSpeedAD113  Speed Up Your Applications w/ Nginx and PageSpeed
AD113 Speed Up Your Applications w/ Nginx and PageSpeededm00se
 
Advanced workflows for mobile web design and development
Advanced workflows for mobile web design and developmentAdvanced workflows for mobile web design and development
Advanced workflows for mobile web design and developmentbrucebowman
 
Datasets, APIs, and Web Scraping
Datasets, APIs, and Web ScrapingDatasets, APIs, and Web Scraping
Datasets, APIs, and Web ScrapingDamian T. Gordon
 
Mobile Web Compatibility @ Code Camp Cluj
Mobile Web Compatibility @ Code Camp ClujMobile Web Compatibility @ Code Camp Cluj
Mobile Web Compatibility @ Code Camp ClujIoana Chiorean
 
Top 13 best front end web development tools to consider in 2021
Top 13 best front end web development tools to consider in 2021Top 13 best front end web development tools to consider in 2021
Top 13 best front end web development tools to consider in 2021Samaritan InfoTech
 

Similar to Validating Session Isolation for Web Crawling to Provide Data Integrity (20)

You Can Work on the Web Patform! (GOSIM 2023)
You Can Work on the Web Patform! (GOSIM 2023)You Can Work on the Web Patform! (GOSIM 2023)
You Can Work on the Web Patform! (GOSIM 2023)
 
Headless browser: puppeteer and git client : GitKraken
Headless browser: puppeteer and git client : GitKrakenHeadless browser: puppeteer and git client : GitKraken
Headless browser: puppeteer and git client : GitKraken
 
Are you there Page Experience? It's Me, DevTools.
Are you there Page Experience? It's Me, DevTools.Are you there Page Experience? It's Me, DevTools.
Are you there Page Experience? It's Me, DevTools.
 
Are you there Page Experience? It's me, DevTools
Are you there Page Experience? It's me, DevToolsAre you there Page Experience? It's me, DevTools
Are you there Page Experience? It's me, DevTools
 
Web Performance & Search Engines - A look beyond rankings
Web Performance & Search Engines - A look beyond rankingsWeb Performance & Search Engines - A look beyond rankings
Web Performance & Search Engines - A look beyond rankings
 
Headless Browser – A Stepping Stone Towards Developing Smarter Web Applicatio...
Headless Browser – A Stepping Stone Towards Developing Smarter Web Applicatio...Headless Browser – A Stepping Stone Towards Developing Smarter Web Applicatio...
Headless Browser – A Stepping Stone Towards Developing Smarter Web Applicatio...
 
Stapling and patching the web of now - ForwardJS3, San Francisco
Stapling and patching the web of now - ForwardJS3, San FranciscoStapling and patching the web of now - ForwardJS3, San Francisco
Stapling and patching the web of now - ForwardJS3, San Francisco
 
Web topic 26 browser compatibilty and security
Web topic 26  browser compatibilty and securityWeb topic 26  browser compatibilty and security
Web topic 26 browser compatibilty and security
 
Modern Web Applications
Modern Web ApplicationsModern Web Applications
Modern Web Applications
 
Professional web development with libraries
Professional web development with librariesProfessional web development with libraries
Professional web development with libraries
 
rendre AJAX crawlable par les moteurs
rendre AJAX crawlable par les moteursrendre AJAX crawlable par les moteurs
rendre AJAX crawlable par les moteurs
 
Technical Tips: Visual Regression Testing and Environment Comparison with Bac...
Technical Tips: Visual Regression Testing and Environment Comparison with Bac...Technical Tips: Visual Regression Testing and Environment Comparison with Bac...
Technical Tips: Visual Regression Testing and Environment Comparison with Bac...
 
Mastering Mobile Web with 8 Key Rules
Mastering Mobile Web with 8 Key RulesMastering Mobile Web with 8 Key Rules
Mastering Mobile Web with 8 Key Rules
 
A Period of Transition
A Period of TransitionA Period of Transition
A Period of Transition
 
AD113 Speed Up Your Applications w/ Nginx and PageSpeed
AD113  Speed Up Your Applications w/ Nginx and PageSpeedAD113  Speed Up Your Applications w/ Nginx and PageSpeed
AD113 Speed Up Your Applications w/ Nginx and PageSpeed
 
Advanced workflows for mobile web design and development
Advanced workflows for mobile web design and developmentAdvanced workflows for mobile web design and development
Advanced workflows for mobile web design and development
 
Datasets, APIs, and Web Scraping
Datasets, APIs, and Web ScrapingDatasets, APIs, and Web Scraping
Datasets, APIs, and Web Scraping
 
Mobile Web Compatibility @ Code Camp Cluj
Mobile Web Compatibility @ Code Camp ClujMobile Web Compatibility @ Code Camp Cluj
Mobile Web Compatibility @ Code Camp Cluj
 
Top 13 best front end web development tools to consider in 2021
Top 13 best front end web development tools to consider in 2021Top 13 best front end web development tools to consider in 2021
Top 13 best front end web development tools to consider in 2021
 
Transforming the web into a real application platform
Transforming the web into a real application platformTransforming the web into a real application platform
Transforming the web into a real application platform
 

Recently uploaded

costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Recently uploaded (20)

costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Validating Session Isolation for Web Crawling to Provide Data Integrity

  • 1. RESEARCH: Validating session isolation for web crawling to provide data integrity
  • 2. Table of contents: ● Web Rendering, Search Engines, and Web Crawlers ● Research context ● What is Session Isolation? ● Session Isolation in the wild ● Solving Session Isolation ● Tests ● Conclusion @giacomozecchini
  • 4. We are not in the 90s anymore: As new web rendering patterns gained traction, we moved from static HTML pages to more complex ways of rendering content. https://www.patterns.dev/posts/rendering-introduction/
  • 5. New rendering patterns emerged: With the widespread use of rendering patterns such as Client-Side Rendering and Progressive Hydration, search engines were effectively forced to start rendering web pages so that they could retrieve almost as much content as users get in their browsers. https://developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics
  • 6. Web Rendering Systems to save the day: Search engines have developed their own web rendering systems (or web rendering services). These are pieces of software that can render a large number of web pages by using automated browsers. “Googlebot & JavaScript: A Closer Look at the WRS” by Martin Splitt: https://www.youtube.com/watch?v=Qxd_d9m9vzo
  • 7. Web crawler tools followed search engines: Web crawling tools also started to build rendering systems to keep up with the evolution of the web and mimic search engines' capabilities.
  • 8. But rendering is hard! There is no industry standard for rendering pages, which means that not even leading search engines such as Google can be said to do it in the “correct” way. Each web rendering system is built to serve specific use cases, which results in inevitable tradeoffs.
  • 10. We’ve been using web crawling tools for years: At Merj we’ve been happy users of many web crawling tools, and over the years we have probably used all of them at least once.
  • 11. We’ve been building custom WRS solutions: For use cases such as custom data sources in complex data pipelines for enterprises, we have been building our own web crawling systems.
  • 12. Data integrity assurances: The starting point of this research was a recent project that required us to provide assurances to a legal and compliance team about the data quality and integrity of a data source (rendered pages) that was to be ingested into a machine learning model.
  • 13. Data validation process: In addition to the other checks in our data integrity validation process, we tested the output of multiple web crawling tools. We found some unexpected values, which varied across tools.
  • 14. What is Session Isolation?
  • 15. What is Session Isolation? While rendering a page in an isolated rendering session, the page must not be able to get any data from previous rendering sessions or be influenced by other pages' renderings.
  • 16. Stateless is a similar concept: This is similar to the concept of “stateless” as used for web crawlers, where all fetches are completed without reusing cookies and without keeping any specific data in memory.
  • 18. Content customisations based on navigation: Real-world session isolation problems can be found by observing the rendering of pages whose content is customised based on user navigation.
  • 19. The “Recently viewed products” feature: A practical example is the "Recently viewed products" box. These boxes show the user's recent browsing history, with links to various products, and can be found on many websites.
  • 23. Visited pages are saved in memory: For all three of the previous examples, the "Recently viewed products" box is implemented by saving the pages visited by the user in the browser memory.
  • 24. Saved data may affect rendering: For web crawlers that render web pages without isolating the rendering sessions, the data saved in the browser's memory may affect the rendering of other web pages on the same website.
  • 32. Tools with session isolation behave differently: The result is different if we look at how search engines or web crawlers that implement correct session isolation render pages.
  • 40. Additional content and “ghost links”: These different ways of rendering pages produce additional content and a considerable percentage of “ghost links”, visible only to web crawlers affected by session isolation issues.
  • 42. Crawling/rendering order matters: Depending on the crawling/rendering order, a web crawling tool with session isolation issues may produce arbitrary HTML content that changes every time.
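
The order dependence described above can be illustrated with a small simulation. This is a minimal sketch, assuming a plain Python dict as a stand-in for browser storage; `render` and `crawl` are hypothetical names used only for illustration and do not come from any crawler's code.

```python
def render(page, storage):
    """Simulate rendering a product page that has a 'Recently viewed products' box.

    Returns the links shown in the box, then records this page as visited.
    """
    recently_viewed = list(storage.get("recently_viewed", []))
    links = [f"/product/{p}" for p in recently_viewed]
    storage["recently_viewed"] = recently_viewed + [page]
    return links


def crawl(pages, isolated):
    """Render pages in the given order, with or without session isolation."""
    shared_storage = {}  # one storage reused by every rendering (no isolation)
    rendered = {}
    for page in pages:
        # With isolation, each rendering starts from an empty storage.
        storage = {} if isolated else shared_storage
        rendered[page] = render(page, storage)
    return rendered


shared_forward = crawl([1, 2, 3], isolated=False)
shared_reverse = crawl([3, 2, 1], isolated=False)
isolated_forward = crawl([1, 2, 3], isolated=True)
isolated_reverse = crawl([3, 2, 1], isolated=True)
```

Without isolation, page 3 gets the ghost links `/product/1` and `/product/2` when the crawl starts from page 1, and none when it starts from page 3. With isolation, every page renders with an empty box regardless of order.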
  • 43. Figure panels: Starting from PAGE 1; Starting from PAGE 3
  • 44. Three main implications: ● Lack of data integrity ● The rendered pages are not an accurate representation of what search engines will render and use ● Developers may waste time (and money) investigating issues that are not present
  • 45. Effects on SEOs’ day-to-day: Analyses are based on wrong data, for example: ● Content analysis with additional content ● Internal linking analysis with X% arbitrary links. * This additional content and these links are not visible to Google & Co.
  • 46. Effects on SEOs’ day-to-day: These wrong analyses often translate into: ● Waste of time and money ● Wrong choices
  • 47. Session isolation isn’t limited to web crawlers: All systems that use browser-based functionality might be affected, such as dynamic rendering services, web performance analysis tools, and CI/CD pipeline tests.
  • 48. If it’s an option, it should be clear: There are some cases where you need to keep data for specific tests, but that option should be explicit and intentional, not a side effect of a hidden problem.
  • 50. Partial or incorrect solutions: There are many partial or incorrect ways of tackling session isolation for web crawling purposes. Let’s have a look at some of them.
  • 51. Partial or incorrect solution #1: Clearing cookies manually after rendering a page. The problem here is that cookies are not the only Web API that can store data.
  • 52. Partial or incorrect solution #2: Opening and closing the browser for each page you want to render, and manually deleting the folders where the browser stores data. This option is not efficient at all.
  • 53. Partial or incorrect solution #3: Using the incognito profile hides some possible pitfalls as well. Within an incognito profile, the rendered pages might share storage, and cross-tab communication is possible. This option would solve our problem only if, again, we don’t render pages in parallel and we start and stop the browser for each page.
  • 54. The optimal solution: Introduced at BlinkOn 6, Browser Context is an efficient way to have correct session isolation. Every Browser Context session runs in a separate renderer process, isolating the storage (cookies, cache, local storage, etc.) and preventing cross-tab communication.
  • 56. How to use Browser Context effectively: Rendering a single page per Browser Context, closing it at the end of the rendering, and then opening a new Browser Context for the next page will guarantee isolated rendering sessions without the need to restart the browser every time.
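
The one-page-per-Browser-Context pattern could be sketched as follows. This is a minimal sketch that assumes Playwright for Python as the automation library; the slides do not prescribe one. `render_isolated` and `crawl` are hypothetical helper names, while `new_context`, `new_page`, `goto`, `content`, and `close` are the Playwright async API calls.

```python
import asyncio


async def render_isolated(browser, url):
    """Render one URL in its own browser context and return the resulting HTML."""
    # A fresh context per page: storage is not shared with other renderings.
    context = await browser.new_context()
    try:
        page = await context.new_page()
        await page.goto(url)
        return await page.content()
    finally:
        # Close the context so nothing survives into the next rendering.
        await context.close()


async def crawl(urls):
    """Render each URL in an isolated context, reusing a single browser process."""
    # Assumption: Playwright and a Chromium build are installed locally.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            return {url: await render_isolated(browser, url) for url in urls}
        finally:
            await browser.close()
```

With Playwright installed, `asyncio.run(crawl(["https://example.com/"]))` would render the page in a throwaway context. The browser is started once, so the cost per page is creating and closing a context rather than restarting the browser.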
  • 57. Data integrity > Performance: Using this solution will have a minimal effect on the web crawlers' performance. In most real-world cases, the majority of users of web crawling tools would not accept the loss of data integrity caused by session isolation issues in exchange for an overall performance difference of a few seconds.
  • 58. Documentation and example: Additional documentation and examples on the use of Browser Context can be found here: ● https://chromedevtools.github.io/devtools-protocol/tot/Target/#method-createBrowserContext ● https://pptr.dev/next/api/puppeteer.browser.createincognitobrowsercontext ● https://playwright.dev/docs/api/class-browsercontext
  • 59. Tests
  • 61. Methodology: We set up a testing environment with 1,000 pages that try to communicate with each other using storage and cross-tab communication.
  • 62. Avoiding false negatives: Rendering 1,000 pages increases the chances of having two or more pages rendered at the same time in parallel or by the same browser. Using fewer pages may cause false negatives if the tested web rendering system uses a high number of machines in parallel.
  • 63. Storage isolation tests: Storage isolation tests are focused on Web APIs that save or access data from the browser's memory. The goal of each test is to find race conditions in accessing data saved from previous or parallel page renderings.
  • 64. Test #1 - Cookies: Cookies need no introduction. The Cookie interface lets you read and write small pieces of information in the browser storage. Test explanation: When the rendering starts, the page creates and saves a cookie, then checks whether there are cookies saved by other pages. Fail criterion: If there are cookies other than the ones created for the rendered page, the test fails.
  • 65. Test #2 - IndexedDB: IndexedDB is a transactional database system that lets you store and retrieve objects from browser memory. Test explanation: When the rendering starts, the page creates or connects to an IndexedDB database. It then creates and saves a record in the database, and finally checks whether there are records saved by other pages. Fail criterion: If there are records other than the ones created for the rendered page, the test fails.
  • 66. Test #3 - LocalStorage: LocalStorage is a mechanism of the Web Storage API by which browsers can store key/value pairs. Data persists when the browser is closed and reopened. Test explanation: When the rendering starts, the page creates and saves a data item in Local Storage, and then checks whether there are data items saved by other pages. Fail criterion: If there are data items other than the ones created for the rendered page, the test fails.
  • 67. Test #4 - SessionStorage: SessionStorage is a mechanism of the Web Storage API by which browsers can store key/value pairs. Data lasts as long as the tab or the browser is open and survives page reloads and restores. Test explanation: When the rendering starts, the page creates and saves a data item in Session Storage, and then checks whether there are data items saved by other pages. Fail criterion: If there are data items other than the ones created for the rendered page, the test fails.
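
The four storage tests share the same fail criterion, which can be expressed as a set difference. This is a minimal sketch, assuming each page tags the entries it writes with its own identifier; `foreign_entries`, `storage_test_failed`, and the `page-…` keys are hypothetical and not taken from the published test code.

```python
def foreign_entries(own_keys, observed_keys):
    """Return the storage entries that were not created by the rendered page."""
    return sorted(set(observed_keys) - set(own_keys))


def storage_test_failed(own_keys, observed_keys):
    """Fail criterion: any entry other than the page's own means the test fails."""
    return len(foreign_entries(own_keys, observed_keys)) > 0


# An isolated rendering only ever sees what the page itself wrote.
isolated_failed = storage_test_failed({"page-42"}, {"page-42"})

# A leaking rendering also sees an entry written while rendering another page.
leaked = foreign_entries({"page-42"}, {"page-42", "page-17"})
leaking_failed = storage_test_failed({"page-42"}, {"page-42", "page-17"})
```

The same check applies whether the observed keys come from cookies, IndexedDB records, LocalStorage, or SessionStorage items.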
  • 68. Cross-tab communication tests: Cross-tab communication tests are focused on Web APIs that send or receive data. The goal of each test is to find out whether, during rendering, a page can receive messages from other pages rendered in parallel.
  • 69. Test #5 - Broadcast Channel: The Broadcast Channel API allows communication between windows, tabs, frames, iframes, and workers of the same origin. Test explanation: When the rendering starts, the page connects to the channel and starts sending its page title as a message to the channel. If other connected pages are sending messages through the channel, the page receives and saves them. Fail criterion: If the rendered page gets even a single message from the Broadcast Channel sent by other pages, the test fails.
  • 70. Test #6 - Shared Worker: A Shared Worker is a Worker that allows communication between windows, tabs, frames, iframes, and workers on the same origin. Test explanation: When the rendering starts, the page connects to the Shared Worker, starts sending messages to it, and then listens for messages from other pages sent through the worker. Fail criterion: If the rendered page gets even a single message from the Shared Worker sent by other pages, the test fails.
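
The cross-tab fail criterion can be modelled with a toy channel. This is a minimal sketch, assuming an in-process Python class as a stand-in for a same-origin channel inside one browser context; `ToyChannel`, `ToyPage`, and `run_pages` are hypothetical names, not browser APIs.

```python
class ToyChannel:
    """Toy stand-in for a same-origin channel inside a single browser context."""

    def __init__(self):
        self.listeners = []

    def connect(self, page):
        self.listeners.append(page)

    def post(self, sender, message):
        # As with a broadcast channel, the sender does not receive its own message.
        for page in self.listeners:
            if page is not sender:
                page.received.append(message)


class ToyPage:
    def __init__(self, title, channel):
        self.title = title
        self.received = []
        self.channel = channel
        channel.connect(self)

    def send_title(self):
        self.channel.post(self, self.title)

    def failed(self):
        # Fail criterion: even a single message from another page fails the test.
        return len(self.received) > 0


def run_pages(titles, isolated):
    """Render pages 'in parallel', each on its own channel or all on a shared one."""
    shared = ToyChannel()
    pages = [ToyPage(t, ToyChannel() if isolated else shared) for t in titles]
    for page in pages:
        page.send_title()
    return {page.title: page.failed() for page in pages}


shared_result = run_pages(["Page A", "Page B"], isolated=False)
isolated_result = run_pages(["Page A", "Page B"], isolated=True)
```

When both pages share one channel, each receives the other's title and both fail. When each page has its own channel, no messages cross and both pass.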
  • 71. Test results: 71% of web crawlers failed at least one test. Share of web crawlers that failed each test:
      ● Cookie: 29%
      ● IndexedDB: 64%
      ● LocalStorage: 71%
      ● SessionStorage: 21%
      ● Broadcast Channel: 14%
      ● Shared Worker: 14%
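
The percentages can be cross-checked against the number of tools tested. This is a minimal sketch, assuming the total is the 14 web crawlers listed on the status slide (slide 79); the per-test counts are inferred from the rounded percentages and are not stated in the slides.

```python
TOTAL_CRAWLERS = 14  # assumption: the 14 tools listed on the status slide

# Reported share of web crawlers failing each test, in percent.
reported = {
    "Cookie": 29,
    "IndexedDB": 64,
    "LocalStorage": 71,
    "SessionStorage": 21,
    "Broadcast Channel": 14,
    "Shared Worker": 14,
}


def implied_count(percent, total=TOTAL_CRAWLERS):
    """Number of crawlers implied by a rounded percentage (an inference)."""
    return round(percent * total / 100)


implied = {test: implied_count(pct) for test, pct in reported.items()}

# Round-trip check: each inferred count reproduces the reported percentage.
consistent = all(
    round(100 * count / TOTAL_CRAWLERS) == reported[test]
    for test, count in implied.items()
)

# Four tools are listed as having passed all tests, so ten failed at least one.
failed_at_least_one = round(100 * (TOTAL_CRAWLERS - 4) / TOTAL_CRAWLERS)
```

Under this assumption the figures are self-consistent: 10 of 14 tools failing at least one test gives the headline 71%, matching the four tools listed as "Passed all tests".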
  • 72. Source code on GitHub: Replicate the testing environment using the following code. https://github.com/merj/test-crawl-session-isolation
  • 74. Cause of storage isolation problems: It’s difficult to determine what is causing the storage isolation issues. Implementations may vary drastically, and we can only speculate on the cause.
  • 75. Cause of cross-tab communication problems: A possible cause for failing the cross-tab communication tests (Broadcast Channel and Shared Worker) is having the same browser used to render pages in parallel using multiple windows and/or tabs.
  • 76. Don’t clean it manually! Web crawlers might pass all tests included in this research by manually cleaning every single storage at the end of every page rendering session, but this approach is not a secure and viable solution to guarantee data integrity.
  • 77. Workarounds are not future-proof solutions: The Web APIs and browser interfaces included in the research aren't the only ones that might have access to browser memory/cache, and trying to keep up with the development of all new standards and web features is a complex and time-consuming process.
  • 78. Our goal is to improve web crawling! Not all web crawlers have been able to fix the session isolation issues yet; some are still investigating. The Docker crawling test framework supported those that have fixed their session isolation and might be included in their future release checks. Some web crawlers involved us throughout the entire remediation process.
  • 79. Web crawler status (last update: 15 Nov 2022):
      ● Ahrefs: Fixed - 15 Nov 2022
      ● Botify: Passed all tests
      ● ContentKing: Fixed - 27 Oct 2022
      ● FandangoSEO: Looking into this
      ● JetOctopus: Looking into this
      ● Lumar (formerly Deepcrawl): Passed all tests
      ● Netpeak Spider: Looking into this
      ● OnCrawl: Passed all tests
      ● Ryte: Fixed - 10 Oct 2022
      ● Screaming Frog: Fixed - 17 Aug 2022
      ● SEO PowerSuite WebSite Auditor: Looking into this
      ● SEOClarity: Looking into this
      ● Sistrix: Passed all tests
      ● Sitebulb: Looking into this
  • 80. Final thoughts: ● Rendering is hard; we hope that in the future there will be an industry standard ● Make sure you validate your data
  • 81. Blog: We published the research on our blog: https://merj.com/blog/validating-session-isolation-for-web-crawling-to-provide-data-integrity
  • 82. Thank you for your time! @GiacomoZecchini on Twitter, Slideshare & Speakerdeck. We work with enterprise clients to support them with SEO Innovation, Research, & Development. Want to work with us? rfp@merj.com, +44 (0) 203 322 2660, 7 Pancras Square, London, N1C 4AG