This is part one of my Monitoring Distributed Apps series.
Here we explore the premises of distributed application monitoring, focusing on metrics: why we need them, and how Prometheus gradually emerges as a solution.
The video recording is available here: https://youtu.be/lvogDmRN-Hs
6. Cloud Out of the box
✘ VM CPU/Network/Disk stats…
✘ But not memory
➢ requires vendor-specific agents
✘ Gets much better with GKE
➢ Memory usage for pods IF your YAMLs behave
✘ All of the above are metrics… But what about our app?
9. When logs are not enough
✘ Too detailed to see the big picture
✘ Hard to see KPIs / trends
10. Logs to metrics
✘ Log-based metrics in StackDriver
➢ Fragile configuration
➢ Disjoint from the app
➢ Vendor specific
✘ Smart log parsers, e.g. Coralogix
Parsed 100 docs in 0.31 seconds
Parsed 100 docs in 0.38 seconds
Parsed 100 docs in 0.34 seconds
✘ An option when retrofitting monitoring onto an existing app
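As a rough illustration of what a log parser has to do with lines like the ones above, here is a minimal Python sketch. The regular expression and the `parse_log_line` helper are assumptions made for this example; they do not show how StackDriver or Coralogix work internally.

```python
import re

# Hypothetical pattern matching the sample log lines from this slide.
LOG_PATTERN = re.compile(r"Parsed (\d+) docs in ([\d.]+) seconds")


def parse_log_line(line):
    """Return (docs, seconds) for a matching line, or None otherwise."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


log_lines = [
    "Parsed 100 docs in 0.31 seconds",
    "Parsed 100 docs in 0.38 seconds",
    "Parsed 100 docs in 0.34 seconds",
]

# Turn the individual log lines into two running totals, i.e. metrics.
parsed = [p for p in map(parse_log_line, log_lines) if p is not None]
docs_total = sum(docs for docs, _ in parsed)
seconds_total = sum(seconds for _, seconds in parsed)

print(f"docs_parsed_total {docs_total}")
print(f"parse_seconds_total {seconds_total:.2f}")
```

Note how the pattern is tied to the exact wording of the log message: rewording the message in the app silently breaks the metric, which is the fragility mentioned above.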
13. Metric examples
Process A:
2020-07-28T02:32:06Z http_requests_total 1239
2020-07-28T02:32:07Z http_requests_total 1245
Process B:
2020-07-28T02:32:06Z http_requests_total 1185
2020-07-28T02:32:07Z http_requests_total 1185
Now we can aggregate across processes!
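A minimal Python sketch of that aggregation, using the four samples above. The `samples` layout and the `rate_per_second` helper are assumptions for illustration, not a Prometheus API.

```python
from datetime import datetime

# The counter samples from the slide, keyed by process.
samples = {
    "A": [("2020-07-28T02:32:06Z", 1239), ("2020-07-28T02:32:07Z", 1245)],
    "B": [("2020-07-28T02:32:06Z", 1185), ("2020-07-28T02:32:07Z", 1185)],
}


def parse_ts(ts):
    """Parse the UTC timestamps used on the slide."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def rate_per_second(series):
    """Counter increase per second between the first and last sample."""
    (t0, v0), (t1, v1) = series[0], series[-1]
    elapsed = (parse_ts(t1) - parse_ts(t0)).total_seconds()
    return (v1 - v0) / elapsed


# Aggregate across processes: latest totals and per-second rates both add up.
total_requests = sum(series[-1][1] for series in samples.values())
total_rate = sum(rate_per_second(series) for series in samples.values())

print(total_requests)  # 2430
print(total_rate)      # 6.0
```

Because every process exposes the same metric name with plain numbers, summing works without knowing anything about the app's log format.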
22. Wait, there is more!
✘ 10-20k metrics per average K8s node
✘ That’s 1,200,000 samples/minute for a 15-node cluster
➢ Assuming 20k metrics per node and a 15s collection interval
✘ Or 52,560,000,000 samples a month!
✘ And that’s just for an average-sized app
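The arithmetic behind these numbers can be sketched in a few lines of Python. The sketch assumes the upper end of the 10-20k range and an average month of one twelfth of a 365-day year, which is what reproduces the monthly figure above; the constant names are made up for this example.

```python
METRICS_PER_NODE = 20_000        # upper end of the 10-20k range
NODES = 15
SCRAPE_INTERVAL_SECONDS = 15

# Each metric yields one sample per scrape, and there are 4 scrapes a minute.
scrapes_per_minute = 60 // SCRAPE_INTERVAL_SECONDS
samples_per_minute = METRICS_PER_NODE * NODES * scrapes_per_minute

# Average month: 365 days / 12 = 43,800 minutes.
minutes_per_month = 365 * 24 * 60 // 12
samples_per_month = samples_per_minute * minutes_per_month

print(f"{samples_per_minute:,} samples/minute")  # 1,200,000 samples/minute
print(f"{samples_per_month:,} samples/month")    # 52,560,000,000 samples/month
```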
26. Enhance the existing
✘ Both AWS and GCP give you so much for free already
✘ Just add your app metrics
➢ And they support that!
✘ No need to ship system metrics, e.g. K8s - they are already there
29. But it’s costly
✘ GCP StackDriver
➢ $84/month per 1k metrics
➢ Price drops after 300k metrics
✘ AWS CloudWatch
➢ $300/month per 1k metrics
➢ $100/month after first 10k, $50 after the first 240k
✘ Logs are expensive too, btw
➢ GCP SD: $0.50/GB
➢ AWS CW: $0.60/GB + charge of $0.0057 per scanned GB for queries
30. But it’s costly
✘ 20 μSvc app with 1k metrics and 1GB logs/day per μSvc:
➢ 1*20*30 = 600GB/month
➢ 20k metrics/month
✘ Will cost you:
➢ GCP: $300 for logs + $1760 for metrics
➢ AWS: $360 for logs + $4000 for metrics
✘ That’s only half the story!
➢ With containers, metrics are short-lived
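For readers who want to plug in their own numbers, here is a small Python sketch of the log cost for both vendors and the tiered AWS metric cost, using only the prices quoted on the slides. The function name and the tier table layout are assumptions for this example, and real cloud pricing changes over time.

```python
SERVICES = 20
LOG_GB_PER_SERVICE_PER_DAY = 1
METRICS_PER_SERVICE = 1_000
DAYS_PER_MONTH = 30

# (tier size in metrics, dollars per 1k metrics), as quoted for CloudWatch.
AWS_METRIC_TIERS = [
    (10_000, 300),         # first 10k metrics
    (230_000, 100),        # up to 240k metrics in total
    (float("inf"), 50),    # everything beyond 240k
]


def aws_metric_cost(metrics):
    """Monthly cost in dollars, filling the price tiers in order."""
    cost, remaining = 0.0, metrics
    for tier_size, price_per_1k in AWS_METRIC_TIERS:
        billed = min(remaining, tier_size)
        cost += billed / 1000 * price_per_1k
        remaining -= billed
        if remaining <= 0:
            break
    return cost


log_gb = SERVICES * LOG_GB_PER_SERVICE_PER_DAY * DAYS_PER_MONTH
metrics = SERVICES * METRICS_PER_SERVICE

gcp_log_cost = log_gb * 0.50   # $0.50/GB
aws_log_cost = log_gb * 0.60   # $0.60/GB, excluding the per-query scan charge

print(log_gb, metrics)                 # 600 20000
print(gcp_log_cost)                    # 300.0
print(round(aws_log_cost, 2))          # 360.0
print(aws_metric_cost(metrics))        # 4000.0
```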
31. Further considerations
✘ Vendor specific APIs to ship
➢ Challenging for multi-cloud
➢ Gets better with K8s
✘ Limited to 1 minute resolution
34. There are many out there
✘ DataDog, Sysdig, NewRelic, Splunk, SumoLogic, Grafana (hosted)
✘ Once you see the pricing, GCP/AWS $-figures make sense :)
✘ Lots of added features though:
➢ AI-assisted anomaly detection, etc.
✘ Multi-cloud!
✘ But now you need to ship all your system metrics
➢ Can become expensive quickly
39. Grafana Multiple Data sources
[Diagram: one Grafana dashboard drawing on multiple data sources (StackDriver, CloudWatch, Prometheus); other boxes shown: GCP Pub/Sub, AWS SES, Email Parser]
40. Hybrid SaaS
As a hybrid SaaS, or “Option 2.5”, you can:
✘ Set up hosted Grafana at Grafana Labs
✘ Connect it to CloudWatch, StackDriver, etc.
✘ Ship your app-only metrics to Grafana Labs
➢ At $16/month per 1k metrics
✘ Still limited to 1-minute resolution for CloudWatch/StackDriver
43. Where to?
✘ We have 3 billion app / 50 billion system metric samples per month
✘ Storage size per sample matters here
✘ MySQL
➢ ~50 bytes per sample (including indexing, etc)
➢ 2.3TB for 50b samples
✘ ElasticSearch
➢ ~20 bytes per sample
➢ 930GB for 50b samples
44. General purpose DBs are expensive
✘ MySQL
➢ $230 for 1 month retention
➢ $1380 for 3 month retention
✘ ElasticSearch
➢ $93 for 1 month retention
➢ $550 for 3 month retention
✘ That’s just for storage! For one app!
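The storage figures can be approximated with a short Python sketch. Two things here are assumptions inferred from the slide numbers rather than stated on them: a price of about $0.10 per GiB per month, and that the "3 month retention" cost is the cumulative bill over the first three months (holding one, then two, then three months of data). The results land close to the rounded figures above, not exactly on them.

```python
GIB = 2**30
PRICE_PER_GIB_MONTH = 0.10      # assumption inferred from the slide figures
SAMPLES_PER_MONTH = 50e9        # 50 billion system metric samples


def storage_gib(samples, bytes_per_sample):
    """Raw storage needed for the given number of samples, in GiB."""
    return samples * bytes_per_sample / GIB


def retention_cost(samples_per_month, bytes_per_sample, months):
    """Cumulative storage bill while the retention window fills up.

    Assumes that in month k you hold k months of data, so the total is
    (1 + 2 + ... + months) month-sized chunks of storage.
    """
    monthly_gib = storage_gib(samples_per_month, bytes_per_sample)
    gib_months = monthly_gib * months * (months + 1) / 2
    return gib_months * PRICE_PER_GIB_MONTH


for name, bytes_per_sample in [("MySQL", 50), ("ElasticSearch", 20)]:
    size = storage_gib(SAMPLES_PER_MONTH, bytes_per_sample)
    one_month = retention_cost(SAMPLES_PER_MONTH, bytes_per_sample, 1)
    three_months = retention_cost(SAMPLES_PER_MONTH, bytes_per_sample, 3)
    print(f"{name}: {size:,.0f} GiB/month, "
          f"${one_month:,.0f} for 1 month, ${three_months:,.0f} for 3 months")
```

Changing `bytes_per_sample` is the interesting knob: the cost scales linearly with it, which is why storage size per sample matters so much at this volume.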
45. But metrics data is unique
✘ Immutable (no updates)
✘ Write once
✘ Lots of metrics do not change often
✘ And this is why Time Series Databases were born!
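To see why these properties matter for storage, here is a toy Python sketch of delta encoding, one of the ideas time series databases build on. It is an illustration only, not Prometheus's actual on-disk format; the sample data and the `deltas` helper are made up for this example.

```python
def deltas(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


# Six scrapes, 15 seconds apart (hypothetical data).
timestamps = [1595903526 + 15 * i for i in range(6)]
values = [1185, 1185, 1185, 1185, 1190, 1190]

# With a steady scrape interval the delta-of-delta of timestamps is all zeros.
timestamp_dod = deltas(deltas(timestamps))

# A metric that rarely changes produces mostly zero deltas as well.
value_deltas = deltas(values)

print(timestamp_dod)  # [0, 0, 0, 0]
print(value_deltas)   # [0, 0, 0, 5, 0]
```

Long runs of zeros compress extremely well, and since samples are never updated after being written, the encoded blocks never need to be rewritten.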
47. Prometheus at a glance
✘ Not the first TSDB, but it became the gold standard
✘ 1-2 bytes per sample
➢ $30-$60 storage cost for 3 month retention as in the previous example
✘ Can process 1 million samples per minute on your laptop
48. Not just a TSDB
✘ Prometheus discovers:
➢ Your GCE VM
➢ Your GKE pods
✘ Prometheus pulls metrics from targets
✘ Prometheus stores metrics and allows you to query them OR
✘ Federates them further to a central storage
53. Python example
from flask import Flask
from prometheus_client import start_http_server, Summary

app = Flask(__name__)

# A Summary tracks how many requests were timed and their total duration.
REQUEST_TIME = Summary("request_processing_seconds",
                       "Time spent processing request")


@app.route("/")
@REQUEST_TIME.time()  # record the duration of every call to this handler
def hello_world():
    return "Hello, World!\n"


if __name__ == "__main__":
    # Dedicated port! Metrics are served on 8081, the app itself on 8080.
    start_http_server(8081)
    app.run(port=8080)
54. Python example - in action!
$ python app.py &
$ curl localhost:8080
$ curl localhost:8080
$ curl localhost:8081
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 247.0
python_gc_objects_collected_total{generation="1"} 151.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 60.0
python_gc_collections_total{generation="1"} 5.0
python_gc_collections_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="3",version="3.8.3"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.34852352e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.6411008e+07
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.29000000000000004
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024.0
# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count 2.0
request_processing_seconds_sum 1.3547949492931366e-05
# HELP request_processing_seconds_created Time spent processing request
# TYPE request_processing_seconds_created gauge
request_processing_seconds_created 1.5959190974287152e+09
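When Prometheus pulls from the dedicated port, this text is what it reads. A minimal, stdlib-only Python sketch of turning such output into samples follows. `parse_exposition` and its regular expressions are hypothetical helpers that handle only simple `name{labels} value` lines like the ones above; they are not the full exposition format (no escaping, no optional timestamps) and not the parser Prometheus itself uses.

```python
import re

# Simplified patterns; assumptions for this sketch only.
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"   # metric name
    r"(?:\{(?P<labels>.*)\})?"               # optional {label="value",...}
    r"\s+(?P<value>\S+)$"                    # sample value
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE lines and blanks carry no samples
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# A few lines taken from the output above.
scraped = """\
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 60.0
process_open_fds 7.0
request_processing_seconds_count 2.0
"""

for name, labels, value in parse_exposition(scraped):
    print(name, labels, value)
```

The point of the format is that it stays this simple: plain text over HTTP, so any language can expose metrics and any tool can read them.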