This talk covers the transition of adidas e-commerce from a monolithic architecture to microservices, the key challenges faced by Site Reliability Engineers (SREs), and the insights gained along the way. From scalability concerns and deployment complexity to monitoring and incident response, we’ll examine what it takes to maintain reliability while embracing the flexibility of microservices. Join us for practical lessons, best practices, and success stories from SRE work in the era of microservices.
5. Faster Time to Market
• 6-week release cycle.
Increased order throughput rate
• Fast-selling articles needed a platform that can process a higher throughput rate.
Scaling Infra faster
• Either on-demand scaling or auto scaling.
Reduce operational costs
• Licenses, platform and labor, additionally minimizing vendor lock-in.
THE EVOLVING DEMANDS OF OUR EXPANDING BUSINESS NECESSITATED A REDESIGN
WHY THE SHIFT
ADIDAS AG
7. Microservices
• Individual pieces of business functionality
• End-to-end automation of CI/CD
API-first
• All functionality is exposed via reusable APIs
• Channel agnostic
• SDKs
Cloud native
• Software as a service, leveraging cloud
• Entire platform split into multiple containerized backend services
Headless
• Decouples front-end experience from back-end logic
• Enables flexibility, innovation, and personalized experiences
ADAPTING TO CHANGING TRENDS IN TECH AND CONSUMER BEHAVIOR
IMPLEMENTING ‘MACH’ ARCHITECTURE
9. INTRODUCTION OF MICROSERVICES POSED NEW CHALLENGES
SRE CHALLENGES
Breaking Changes in Deployments
• Forced synchronization of deployments resulted in downtime (~30 min MTTR)
Availability of Service/Observability Tracing
• Promotion service downtime induced latency in the entire Checkout flow (~2 h MTTR)
• Missing observability, tracing and, more importantly, alerting on the Promotions side.
Repeated Code
• Multiple teams doing the same thing differently
Security
• The top 3 of the 5 most bot-impacted sneakers are from adidas.
• Each new endpoint needs to be protected, maintained and consistently reviewed.
• Authentication & Authorization: authentication between microservices must be secured
11. EACH FAILURE IS AN EXPERIENCE :) GIVING US NEW INSIGHTS TO IMPROVE
BUILDING RESILIENCE
Failover
• Standardization of timeouts across services
• Circuit breaking
• Inbuilt tool with feature flag activations
• Canary deployments, versioning of deployments
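To make the circuit-breaking item concrete, here is a minimal sketch in Python. It is not the tooling used at adidas; the class name `CircuitBreaker`, the exception `CircuitOpenError`, and the parameters `failure_threshold` and `reset_after_s` are hypothetical names chosen for illustration. After a number of consecutive failures the breaker opens and fails fast, and after a cool-down it lets one trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Illustrative breaker; names and defaults are assumptions, not adidas code."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Still cooling down: fail fast instead of calling the dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request in `breaker.call(...)` and pass the same standardized timeout value to every client, so that a slow dependency trips the breaker instead of stalling the whole flow.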
Observability
• End-to-end tracing with APM tool connecting different microservices and infra
• OpenTelemetry-based tracing integrated into APM and vendor agnostic
• Unified view of logs separated by different indexes
• One-stop observability dashboard depicting the overall health of each service in a single view.
Security
• Securing all ingress with TLS
• Protecting the public endpoints from bots
• Secured authentication via SOPs
• Single codebase with allowlisted IPs that can be leveraged across microservices
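The shared IP allowlist could look roughly like the sketch below, which uses Python's standard `ipaddress` module. The network ranges and the names `ALLOWED_NETWORKS` and `is_allowed` are placeholders for illustration, not the actual adidas configuration.

```python
import ipaddress

# Placeholder ranges (a private range and a documentation range), assumed for the example.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("10.0.0.0/8", "192.0.2.0/24")
]


def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if client_ip falls inside any allowlisted network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Unparseable input is rejected rather than raising.
        return False
    # Membership is False when the IP version differs from the network's.
    return any(addr in net for net in networks)
```

Keeping this check in one shared module means every microservice evaluates the same list, instead of each team maintaining its own copy.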
13. Incident Debrief
• Mainly focuses on the 5 Whys.
• Derive tasks for each team.
What is stability?
• Any product that is not measured is difficult to manage.
• Defined metrics: MTTD, MTTR, revenue impact.
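As a rough illustration of how MTTD and MTTR can be computed from incident records, consider the sketch below. Definitions vary between organizations; this version assumes both are measured from the moment the incident started, and the `Incident` fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the impact began
    detected: datetime   # when an alert or a person noticed it
    resolved: datetime   # when the impact was mitigated or fixed


def _mean(deltas):
    deltas = list(deltas)
    # Raises ZeroDivisionError when there are no incidents.
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to resolve, assumed here to run from start to resolution."""
    return _mean(i.resolved - i.started for i in incidents)
```

Tracking both values per service over time shows whether improvements in alerting shorten detection and whether resolution gets faster as well.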
Incident Commander
• The incident commander brings the right teams into the call.
• Drives the incident to mitigation or fix.
GETTING THE INCIDENTS TO CLOSURE WITH BLAMELESS POSTMORTEM ANALYSIS
INCIDENT MANAGEMENT
Problem Management
• Missing observability / architectural review / bug fix.
• Process gaps.