OSDC 2019 | Evolution of a Microservice-Infrastructure by Jan Martens

Evolution of a Microservice Infrastructure
Jan Martens
OSDC 2019, Berlin

Details REWE Group
● Turnover: >57 bn
● History: >90 years
● Employees: >345,000
● Industries: Food Retail, Tourism, DIY
● Shops: >15,000

Service/team growth
Year   # Services   # Dev Teams
2014   1            2
2015   40           15
2016   100          28
2017   200          46
2018   270          48

Recap
The current state of 2018, achievements so far
We’re operating a custom Docker-Environment consisting of:
Everything was cool. Developers could bring code live. All was well.

… and looks like
[Diagram: Ingress-Nodes | Consul-Server, a.k.a. Master-Nodes | Docker-Hosts, a.k.a. Worker-Nodes | Other “aaS” | * n | containers of services S1 and S2]

Ingress-Nodes
● Nginx-config written by consul-template on change of Consul-information
● Routes external Hostnames

Docker-Host
● Nginx-config written by consul-template on change of Consul-information
● Routes internal Hostnames to containers
● Runs containers
Consul-Server & Swarm-Master
● Contain knowledge of all services
● Deployments are started from here
● Act as DNS-Servers for service-discovery
Other centrally managed “platform-services”
● Kafka
● Databases
● ELK-Stack
● Prometheus & Grafana
● ...

Deployments
how we do them
● Blue-Green
○ only one color can be active
● Independent set of containers per color
○ started via docker-compose
● Teams can deploy themselves
● Color Switch per Service
● Both colors addressed via the same DNS record

Deployments
[Diagram sequence, labels listed per frame. The frames show the hosts docker-1, docker-2 and docker-3 with containers of S1 and S2, next to the Consul Key/Value store.]
Frame 1: hosts and Consul Key/Value only
Frame 2: deploy basket blue
Frame 3: deploy basket blue | basket
Frame 4: Consul Key/Value: basket: - active: green | deploy basket blue | basket
Frame 5: Consul Key/Value: basket: - active: green | basket | reload Nginx
Frame 6: Consul Key/Value: basket: - active: blue | basket | active: blue
Frame 7: Consul Key/Value: basket: - active: blue | basket | reload Nginx | active: blue
Frame 8: Consul Key/Value: basket: - active: blue | basket | reload Nginx | active: blue | deploy basket blue
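
The blue-green flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dictionary stands in for the Consul Key/Value store, and the names `ServiceState`, `deploy`, `switch_color` and `inactive_color` are hypothetical, not taken from the talk. It shows the two properties named on the slides: each color has its own set of containers, and only one color is active at a time.

```python
from dataclasses import dataclass, field

COLORS = ("blue", "green")


@dataclass
class ServiceState:
    """Blue-green state of one service (hypothetical model, not Consul's schema)."""
    active: str = "green"  # color currently receiving traffic
    containers: dict = field(default_factory=lambda: {"blue": [], "green": []})


def inactive_color(state: ServiceState) -> str:
    """Return the color that is not serving traffic."""
    return "blue" if state.active == "green" else "green"


def deploy(kv: dict, service: str, color: str, hosts: list) -> None:
    """Start an independent set of containers for one color.

    Deploying never changes which color is active.
    """
    if color not in COLORS:
        raise ValueError(f"unknown color: {color}")
    state = kv.setdefault(service, ServiceState())
    state.containers[color] = [f"{service}-{color}@{host}" for host in hosts]


def switch_color(kv: dict, service: str, color: str) -> None:
    """Make `color` the active one; refuse if nothing is deployed for it."""
    state = kv[service]
    if not state.containers[color]:
        raise RuntimeError(f"{service} has no {color} containers")
    state.active = color  # a single field, so only one color can be active


# "deploy basket blue" while green is active, then switch.
kv = {"basket": ServiceState(active="green",
                             containers={"blue": [], "green": ["basket-green@docker-1"]})}
target = inactive_color(kv["basket"])
deploy(kv, "basket", target, ["docker-1", "docker-2"])
still_green = kv["basket"].active
switch_color(kv, "basket", target)
```

Keeping the active color in a single key makes the switch one write; the proxies only have to pick up that one value.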

Request routing
how can services be addressed
● Both colors have the same DNS record
○ Consul will return IPs for all hosts where the Service is running
● Nginx running on each Worker Node
○ routes to color depending on used Port

Request Routing
[Diagram sequence, labels listed per frame. The frames show the hosts docker-1, docker-2 and docker-3 with containers of S1, S2 and basket.]
Frame 1: basket containers
Frame 2: basket-service.rewe
Frame 3: basket-service.rewe | docker-1.rewe | docker-2.rewe
Frame 4: :80
Frame 5: :90
Frame 6: :80 | basket-service.rewe | docker-1.rewe | docker-2.rewe
Problems with Nginx
increased with the size of the environment
● Requests never reached their destination
● Keepalive connections were dropped after a short time
This always happened at the time of deployments
● Consul-template would reload all Nginx instances at the same time

Problems with Nginx
looking for solutions
● Look for a different reverse proxy
○ no reload on config change
○ dynamic configuration

Problems with Nginx
possible replacements

Traefik
● Dynamically configurable
● Live reloading of configuration
● Lots of metrics
● Nice web UI
● Single Go binary

Traefik
how to migrate
1. Install alongside Nginx on Worker and Ingress Nodes
○ listen on different ports
2. Check that configured routes are correct and work
3. Change port mapping host by host
4. Remove Nginx
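
Steps 2 and 3 of the migration can be sketched like this. It is a simplified model, not the tooling used in the talk: route tables are plain dictionaries, and `Host`, `routes_match` and `migrate` are hypothetical names. The point is the order of operations: a host's port mapping is switched only after the new proxy is known to serve the same routes as the old one.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    nginx_routes: dict             # hostname -> backend, as served by the old proxy
    traefik_routes: dict           # hostname -> backend, as served by the new proxy
    port_80_owner: str = "nginx"   # which proxy currently answers on the public port


def routes_match(host: Host) -> bool:
    """Step 2: the new proxy must know exactly the routes of the old one."""
    return host.nginx_routes == host.traefik_routes


def migrate(hosts: list) -> list:
    """Step 3: switch the port mapping one host at a time.

    Stops at the first host whose routes differ, so the remaining
    hosts keep serving through the old proxy.
    """
    switched = []
    for host in hosts:
        if not routes_match(host):
            break
        host.port_80_owner = "traefik"
        switched.append(host.name)
    return switched


routes = {"basket-service.rewe": "basket"}
hosts = [
    Host("docker-1", routes, dict(routes)),
    Host("docker-2", routes, dict(routes)),
    Host("docker-3", routes, {}),  # new proxy has no routes yet
]
switched = migrate(hosts)
```

Stopping at the first mismatch keeps the change reversible: every host is served by exactly one of the two proxies at any time.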

Traefik
how to migrate
[Diagram sequence, labels listed per frame. The frames show basket on docker-1, addressed via basket-service.rewe, docker-1.rewe and docker-2.rewe.]
Frame 1: :80
Frame 2: :80 | :10080

Traefik
Benefits
● Keepalive and connection problems immediately went away
● Almost real time data about service response time
● Web UI to check routes
● Rich access logs

Problems with standalone Swarm
also increased
● Poor container spread
○ all service instances running on one host
● No self healing
● Manual node draining
○ dependent on docker-compose files

Problems with standalone Swarm
looking for solutions
● Look for a different container orchestrator
○ self healing
○ proper container spread

Problems with standalone Swarm
possible replacements

Nomad
● Seamless Consul integration
○ almost no setup needed
● Self healing
● Bin packing
● Single Go binary
● Nice Web UI

Nomad
how to migrate
1. Install alongside Swarm on Worker and Master Nodes
○ agnostic of other Docker Containers
2. Modify deployment Jobs
○ Start new deployments via Nomad
3. Remove Swarm

Deployments
[Diagram sequence, labels listed per frame. The frames show containers of S1 and S2 and the Consul Key/Value store holding basket: - active: green.]
Frame 1: deploy basket blue
Frame 2: deploy basket blue | basket

Nomad
Benefits
● Ensures proper container spread
● Self healing on Node outage
● ACLs
● Job history
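
What "container spread" and "self healing" mean can be illustrated with a naive placement routine. This is explicitly not Nomad's scheduling algorithm; it is a toy model in which `place` and `heal` are hypothetical names and each host is represented by a `Counter` of running instances per service.

```python
from collections import Counter


def place(service: str, count: int, hosts: list, running: dict) -> None:
    """Place `count` instances, each on the host with the fewest instances
    of that service (ties broken by total load, then by host name)."""
    for _ in range(count):
        target = min(
            hosts,
            key=lambda h: (running[h][service], sum(running[h].values()), h),
        )
        running[target][service] += 1


def heal(failed: str, hosts: list, running: dict) -> list:
    """Reschedule every instance of a failed host onto the remaining hosts."""
    lost = running.pop(failed)
    survivors = [h for h in hosts if h != failed]
    for service, count in lost.items():
        place(service, count, survivors, running)
    return survivors


hosts = ["docker-1", "docker-2", "docker-3"]
running = {h: Counter() for h in hosts}
place("basket", 3, hosts, running)
spread = {h: running[h]["basket"] for h in hosts}
place("S1", 2, hosts, running)
survivors = heal("docker-1", hosts, running)
basket_total = sum(running[h]["basket"] for h in survivors)
```

With three hosts and three instances, each host ends up with one instance, which is the opposite of the "all service instances running on one host" problem described for standalone Swarm. After the simulated outage the instance count per service is unchanged.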

Nomad
Benefits
● Not limited to Docker
○ Rkt and LXC
● Not limited to Containers
○ Jar files
○ Binaries
○ VMs

What helped us most?
● Having a centralised deployment-toolset
○ perform all changes for all teams / developers at the same time
● Do Canary-like changes on our infrastructure
○ fully interoperable changes
○ Nginx <-> Traefik

What did we learn?
● You might not need Kubernetes
● Keeping your architecture pluggable helps
● Computing resources are finite
○ Setting resource limits can be difficult
● Distributed systems can be hard

Thank You!
Jan Martens | github.com/jan-martens
www.rewe-digital.com | @rewedigitaltech
All background pictures are licensed under CC0. Source: pexels.com
