This talk is about our journey from Nginx & Docker Swarm to Traefik & Nomad. As load and traffic on our container environment increased over time, we ran into issues that had gone unnoticed when we built the environment. Because of the way we dynamically configured Nginx with consul-template, we started to see many dropped keepalive connections and connection resets. Traffic also wasn’t distributed evenly across our container infrastructure, so single instances received most of the load. This is why we decided to switch to a reverse proxy that can be configured dynamically. We were aware of the shortcomings of Docker Swarm (standalone) and looked for a tool that would distribute containers more evenly and provide self-healing capabilities without forcing us to rebuild everything from scratch. Making these changes under the hood, transparently to our developers, was one of our key objectives.
8. We’re operating a custom Docker environment consisting of:
Everything was cool. Developers could take code live. All was well.
Recap
Our state as of 2018: achievements so far
9. … and looks like
[Diagram: Ingress-Nodes; Consul-Server a.k.a. Master-Nodes; Docker-Hosts a.k.a. Worker-Nodes running multiple instances of services S1 and S2; n × other “aaS”]
10. … and looks like
Ingress-Nodes
● Nginx config written by consul-template whenever Consul information changes
● Routes external hostnames
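consul-template re-renders a file from a template whenever the watched Consul data changes. As a rough illustration of what the rendered Nginx config contains, here is a Python sketch that builds one `upstream` and one `server` block per service from a catalog snapshot. The catalog shape (`service -> [(address, port)]`), the domain and all addresses are hypothetical; the real setup used a consul-template template, not Python.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot of Consul's catalog: service name -> healthy instances.
Catalog = Dict[str, List[Tuple[str, int]]]


def render_nginx_config(catalog: Catalog, domain: str) -> str:
    """Render one upstream and one server block per service.

    Mimics the consul-template approach: the whole file is regenerated on
    every catalog change, and Nginx must be reloaded to pick it up.
    """
    blocks = []
    for service in sorted(catalog):
        if not catalog[service]:
            # Nginx rejects an upstream block without any server entries.
            continue
        servers = "\n".join(
            f"    server {addr}:{port};" for addr, port in sorted(catalog[service])
        )
        blocks.append(f"upstream {service} {{\n{servers}\n}}")
        blocks.append(
            "server {\n"
            "    listen 80;\n"
            f"    server_name {service}.{domain};\n"
            f"    location / {{ proxy_pass http://{service}; }}\n"
            "}"
        )
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    catalog = {
        "s1": [("10.0.0.12", 31001), ("10.0.0.11", 31000)],
        "s2": [("10.0.0.13", 31002)],
    }
    print(render_nginx_config(catalog, "example.internal"))
```

Because the whole file is regenerated, every deployment that changes the catalog ends in an Nginx reload on each node that renders this template.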
11. … and looks like
Docker-Host
● Nginx config written by consul-template whenever Consul information changes
● Routes internal hostnames to containers
● Runs containers
12. … and looks like
Consul-Server & Swarm-Master
● Contain knowledge of all services
● Deployments are started from here
● Act as DNS servers for service discovery
13. … and looks like
Other centrally managed “platform services”
● Kafka
● Databases
● ELK-Stack
● Prometheus & Grafana
● ...
25. ● Both colors share the same DNS record
○ Consul returns the IPs of all hosts where the service is running
● Nginx runs on each Worker Node
○ routes to a color depending on the port used
Request routing
how can services be addressed
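Since both colors resolve to the same DNS record, DNS alone cannot tell them apart; the port has to do that. A minimal Python sketch of the routing rule, assuming the two colors are named blue and green and using hypothetical ports 8080/8081 (the talk does not show the actual Nginx configuration or port numbers).

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping: the port a client connects to selects the color.
PORT_TO_COLOR: Dict[int, str] = {8080: "blue", 8081: "green"}


@dataclass(frozen=True)
class Instance:
    host: str
    port: int
    color: str


def route(request_port: int, instances: List[Instance]) -> List[Instance]:
    """Return the instances eligible for a request arriving on request_port.

    DNS (via Consul) only decides which host receives the request;
    the proxy on that host then filters backends by color.
    """
    color = PORT_TO_COLOR.get(request_port)
    if color is None:
        raise ValueError(f"no color mapped to port {request_port}")
    return [i for i in instances if i.color == color]
```

With this split, switching traffic between colors is a matter of which port clients (or the ingress layer) use, not a DNS change.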
36. ● Requests that never reached their destination
● Keepalive connections dropped after a short time
Always happened during deployments
● consul-template reloaded all Nginx instances at the same time
Problems with Nginx
increased with the size of the environment
37. ● Look for a different reverse proxy
○ no reload on config change
○ dynamic configuration
Problems with Nginx
looking for solutions
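The two requirements here (no reload, dynamic configuration) boil down to replacing the routing table in place instead of restarting worker processes. A toy Python model, not how Traefik or Nginx are implemented: the router object lives for the whole process, and an update (here a plain method call, standing in for a hypothetical Consul watch) only swaps the lookup table, so nothing has to be restarted.

```python
import threading
from typing import Dict, List, Mapping


class DynamicRouter:
    """Routing table that can be replaced at runtime without a restart."""

    def __init__(self, routes: Mapping[str, List[str]]) -> None:
        self._lock = threading.Lock()
        self._routes: Dict[str, List[str]] = {h: list(b) for h, b in routes.items()}
        self._counters: Dict[str, int] = {}

    def update(self, routes: Mapping[str, List[str]]) -> None:
        # Build the new table first, then swap the reference in one step.
        new_table = {host: list(backends) for host, backends in routes.items()}
        with self._lock:
            self._routes = new_table

    def pick(self, hostname: str) -> str:
        # Round-robin over the backends currently known for this hostname.
        with self._lock:
            backends = self._routes.get(hostname)
            if not backends:
                raise LookupError(f"no route for {hostname}")
            n = self._counters.get(hostname, 0)
            self._counters[hostname] = n + 1
            return backends[n % len(backends)]
```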
39. ● Dynamically configurable
● Live reloading of configuration
● Lots of metrics
● Nice web UI
● Single Go binary
Traefik
40. 1. Install alongside Nginx on Worker and Ingress Nodes
○ listen on different ports
2. Check that configured routes are correct and work
3. Change port mapping host by host
4. Remove Nginx
Traefik
how to migrate
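Step 2 (checking that the configured routes are correct) can be automated by comparing both proxies’ views of the world. A Python sketch assuming both route tables have already been extracted into plain dicts (`hostname -> set of backends`); how to extract them (parsing the Nginx config, querying Traefik’s API) is left out because it depends on the versions in use.

```python
from typing import Dict, List, Set

Routes = Dict[str, Set[str]]


def diff_routes(nginx: Routes, traefik: Routes) -> List[str]:
    """Describe every hostname whose backends differ between the two proxies."""
    problems = []
    for host in sorted(nginx.keys() | traefik.keys()):
        if host not in traefik:
            problems.append(f"{host}: missing in Traefik")
        elif host not in nginx:
            problems.append(f"{host}: only in Traefik")
        elif nginx[host] != traefik[host]:
            missing = sorted(nginx[host] - traefik[host])
            extra = sorted(traefik[host] - nginx[host])
            problems.append(f"{host}: missing {missing}, extra {extra}")
    return problems
```

An empty result for a host is a reasonable signal that its port mapping can be switched over (step 3).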
45. ● Keepalive and connection problems immediately went away
● Almost real-time data about service response times
● Web UI to check routes
● Rich access logs
Traefik
Benefits
49. ● Poor container spread
○ all service instances running on one host
● No self-healing
● Manual node draining
○ dependent on docker-compose files
Problems with standalone Swarm
also increased
50. ● Look for a different container orchestrator
○ self-healing
○ proper container spread
Problems with standalone Swarm
looking for solutions
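Self-healing means the orchestrator keeps comparing the desired number of instances with what is actually running and corrects the difference on its own. A toy reconciliation step in Python, assuming hypothetical service names and plain dicts of instance counts; Nomad’s actual scheduler is considerably more involved.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, int], running: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return (action, service, count) steps that bring running in line with desired."""
    actions = []
    for service in sorted(desired.keys() | running.keys()):
        want = desired.get(service, 0)
        have = running.get(service, 0)
        if have < want:
            actions.append(("start", service, want - have))
        elif have > want:
            actions.append(("stop", service, have - want))
    return actions
```

Run periodically, or whenever a node disappears, this loop replaces the manual restarts and node draining that standalone Swarm required.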
52. ● Seamless Consul integration
○ almost no setup needed
● Self-healing
● Bin packing
● Single Go binary
● Nice Web UI
Nomad
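Bin packing and proper container spread sound contradictory. A toy scorer shows how they can be combined: prefer the node that would be fullest after placement (bin packing), but penalise nodes that already run an instance of the same service. Node sizes, the penalty weight and the scoring formula are assumptions for illustration; Nomad’s documentation describes a similar combination of bin packing and job anti-affinity, but its real scoring differs in detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    capacity_mb: int
    used_mb: int = 0
    services: List[str] = field(default_factory=list)


# Hypothetical weight; >= 1.0 so it outweighs the bin-packing score (max 1.0).
ANTI_AFFINITY_PENALTY = 1.0


def score(node: Node, service: str, mem_mb: int) -> float:
    # Bin packing: fraction of the node that would be used after placement.
    packing = (node.used_mb + mem_mb) / node.capacity_mb
    # Spread: penalise every instance of the same service already on this node.
    penalty = ANTI_AFFINITY_PENALTY * node.services.count(service)
    return packing - penalty


def place(nodes: List[Node], service: str, mem_mb: int) -> Node:
    """Place one instance on the best-scoring node that still has room."""
    fitting = [n for n in nodes if n.used_mb + mem_mb <= n.capacity_mb]
    if not fitting:
        raise RuntimeError(f"no node has {mem_mb} MB free for {service}")
    best = max(fitting, key=lambda n: score(n, service, mem_mb))
    best.used_mb += mem_mb
    best.services.append(service)
    return best
```

Placing three instances of one service on three empty, equally sized nodes puts one on each node, whereas a pure bin-packing score would stack all three on the same host, which is the “all service instances running on one host” problem.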
53. 1. Install alongside Swarm on Worker and Master Nodes
○ ignores Docker containers it did not start
2. Modify deployment jobs
○ Start new deployments via Nomad
3. Remove Swarm
Nomad
how to migrate
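Step 2 meant translating each service’s deployment into a Nomad job. A Python sketch that turns a hypothetical compose-style service description into the JSON form of a Nomad job, as accepted by Nomad’s `/v1/jobs` HTTP endpoint. The input format, the `dc1` datacenter and the default count of 1 are assumptions; a real job would also need resources, ports and a Consul service registration.

```python
import json
from typing import Any, Dict


def compose_service_to_nomad_job(
    name: str, service: Dict[str, Any], datacenter: str = "dc1"
) -> Dict[str, Any]:
    """Build a minimal Nomad job payload (JSON form) for one Docker service."""
    # Hypothetical input: compose-style "deploy.replicas", defaulting to 1.
    count = service.get("deploy", {}).get("replicas", 1)
    return {
        "Job": {
            "ID": name,
            "Name": name,
            "Type": "service",
            "Datacenters": [datacenter],
            "TaskGroups": [
                {
                    "Name": name,
                    "Count": count,
                    "Tasks": [
                        {
                            "Name": name,
                            "Driver": "docker",
                            "Config": {"image": service["image"]},
                        }
                    ],
                }
            ],
        }
    }


if __name__ == "__main__":
    svc = {"image": "registry.example/s1:1.2.3", "deploy": {"replicas": 3}}
    print(json.dumps(compose_service_to_nomad_job("s1", svc), indent=2))
```

Generating the job from the existing deployment description kept the change invisible to developers: the deployment tooling changed, their inputs did not.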
65. ● A centralised deployment toolset
○ roll out changes for all teams / developers at the same time
● Canary-like changes to our infrastructure
○ old and new components fully interoperable
○ Nginx <-> Traefik
What helped us most?
69. ● You might not need Kubernetes
● Keeping your architecture pluggable helps
● Computing resources are finite
○ Setting resource limits can be difficult
● Distributed systems can be hard
What did we learn?
71. Thank You!
Jan Martens | github.com/jan-martens
www.rewe-digital.com | @rewedigitaltech
All background pictures are licensed under CC0. Source: pexels.com
Evolution of a Microservice Infrastructure
OSDC 2019, Berlin