This document discusses telemetry in applications: gathering data on application use and performance. Telemetry includes logs, metrics, and traces. Logs provide information for debugging, monitoring, and analysis; metrics measure performance, health, errors, and other indicators; traces provide low-level, per-request debugging information. The document covers how to implement logging, metrics, and tracing, and what kinds of data each provides to consumers such as developers and security teams.
2. Who are we?
• UK-based data security products and services company
• Building security tools to prevent sensitive data leakage and to comply with data security regulations
• Cryptographic tools, security consulting, training
• We are cryptographers, system engineers, applied engineers, infrastructure engineers
• We support the community, speak, teach, and open source a lot
3. What we are going to talk about
• Why do we need telemetry?
• What are the different kinds of telemetry?
• Limits of applicability of the various types of telemetry
• Approaches and mistakes
• Implementation
4. What is telemetry?
«Gathering data on the use of applications and
application components, measurements of start-up
time and processing time, hardware, application
crashes, and general usage statistics.»
5. Why do we need telemetry at all?
Who are the consumers?
− developers
− devops/sysadmins
− analysts
− security staff
What purposes?
− debug
− monitor state and health
− measure and tune performance
− business analysis
− intrusion detection
6. It is worthwhile, indeed
• speeds up the development process
• increases overall stability
• reduces reaction time to crashes and intrusions
• enables adequate business planning
7. It is worthwhile, indeed
• speeds up the development process
• increases overall stability
• reduces reaction time to crashes and intrusions
• enables adequate business planning
• COST of development
• COST of use
8. What data do we have to export?
… we can ask any specialist.
9. What data do we have to export?
… we can ask any specialist.
— ALL!… will be their answer.
26. Logs : kinds of data
• initial information about the application
• state changes (start/ready/…/stop)
• health changes
• audit trail (security-relevant list of activities: financial
operations, health care data transactions, changing keys,
changing configuration)
• user sessions (sign-in attempts, sign-out, actions)
• unexpected actions (wrong URLs, failed sign-ins, etc.)
• various other information in string format
27. Logs : on start
• new state: starting
• application name
• component name
• commit hash / build number
• configuration in use
• deprecation warnings
• running mode
28. Logs : on ready
• new state: ready
• listen interfaces, ports and sockets
• health
29. Logs : on state or health change
• new state
• reason
• URL to documentation
30. Logs : on state or health change
• new state
• reason
• URL to documentation
Use a traffic-light system to highlight health states:
● red — completely unhealthy
● yellow — partially healthy, reduced functionality
● green — completely healthy
31. Logs : on shutdown
• reason
• status of shutdown preparation
• new state: stopped (final goodbye)
32. Logs : each line
• timestamps (ISO8601, TZ, reasonable precision)
• PID
• application/component short name
• application version
• format (JSON, CEF, protobuf)
• severity (CEF: 0→10, RFC 5427: 7→0)
• event code (HTTP style)
• human-readable message
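In Ruby (the language of the implementation slides later in this deck), one such line serialized as JSON might look like the sketch below; the field names are illustrative, not a standard:

```ruby
require "json"

# One log line carrying the fields from this slide: ISO8601 timestamp
# with TZ, PID, component name, version, severity, HTTP-style event
# code, and a human-readable message.
def log_line(component:, version:, severity:, code:, message:)
  {
    ts: Time.now.utc.strftime("%Y-%m-%dT%H:%M:%S.%3N%z"),
    pid: Process.pid,
    component: component,
    version: version,
    severity: severity,   # string level or a numeric scale
    code: code,           # HTTP-style event code, e.g. 503
    msg: message
  }.to_json
end

puts log_line(component: "authsrv", version: "1.4.2",
              severity: "error", code: 503, message: "backend unavailable")
```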
33. Logs : do not export!
• passwords, tokens, any sensitive data — security risks
• private data — legal risks
Use:
− masking
− anonymisation / pseudonymisation
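A minimal masking sketch in Ruby: redact values of sensitive keys before a payload reaches the log. The key list here is illustrative; extend it to match your own data model.

```ruby
# Keys whose values must never appear in logs (illustrative list).
SENSITIVE_KEYS = %w[password token secret api_key card_number].freeze

# Returns a copy of the payload with sensitive values replaced,
# recursing into nested hashes.
def mask(payload)
  payload.to_h.map do |k, v|
    if SENSITIVE_KEYS.include?(k.to_s)
      [k, "***"]
    elsif v.is_a?(Hash)
      [k, mask(v)]
    else
      [k, v]
    end
  end.to_h
end

mask({ user: "alice", password: "hunter2", meta: { token: "abc" } })
# password and meta[:token] are replaced with "***"
```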
36. Logs : CEF
• old (2009), but widely used standard
• simple: easy to generate, easy to parse (supported even by devices without powerful CPUs)
• well documented:
− field name dictionaries
− field types
CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension

Sep 19 08:26:10 host CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
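The header format above can be assembled in a few lines of Ruby; the vendor/product values are taken from the sample line, and per the CEF spec pipes and backslashes in header fields must be escaped:

```ruby
# Escape the characters CEF reserves in header fields.
def cef_escape(field)
  field.to_s.gsub("\\") { "\\\\" }.gsub("|") { "\\|" }
end

# Build a CEF:0 line from header fields plus key=value extensions.
def cef_line(vendor:, product:, version:, sig_id:, name:, severity:, ext: {})
  header = [vendor, product, version, sig_id, name, severity]
             .map { |f| cef_escape(f) }.join("|")
  extension = ext.map { |k, v| "#{k}=#{v}" }.join(" ")
  "CEF:0|#{header}|#{extension}"
end

cef_line(vendor: "security", product: "threatmanager", version: "1.0",
         sig_id: 100, name: "worm successfully stopped", severity: 10,
         ext: { src: "10.0.0.1", dst: "2.1.2.2", spt: 1232 })
# reproduces the sample line above (without the syslog prefix)
```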
37. CEF naming, data formats
+
JSON/protobuf/… transport
=
painless logging
38. Logs : bear in mind [1/3]
• Logs will be read by humans, often when a failure happens, with limited time to react. Be brief and eloquent; give information that may help solve the problem.
• Logs will be searched. Don't be a poet, be a technical specialist: use expected words.
• Logs will be parsed automatically; indeed, they will. There are too many different systems that want telemetry from your application.
• Carefully classify the severity of events. Emitting error messages instead of warnings in non-critical situations will lead to information from the logs being ignored.
39. Logs : bear in mind [2/3]
• Whenever possible, build on existing standards. Grouping event codes along the lines of the HTTP status code table is not a bad idea.
• Logs are the first resource for analyzing security incidents.
• Logs will be archived and stored for a long period of time; it will be almost impossible to cut out pieces of data later.
• Formats, transport protocols, paths, and severity should be configurable.
40. Logs : bear in mind [3/3]
• Your application may run in many different environments with different logging standards (VMs, Docker). The application should be able to direct all logs into one channel; splitting may be an option.
• Do not implement log file rotation yourself. Provide a way to tell your application to gracefully recreate the log file after it has been rotated by an external service.
• When big trouble occurs and nothing works, your application should be able to print readable logs in the simplest manner: to stderr/stdout.
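A sketch of the rotation and fallback advice using only the Ruby standard library, assuming SIGHUP as the "reopen your log file" convention (the path and signal choice are illustrative):

```ruby
require "logger"
require "tmpdir"

LOG_PATH = File.join(Dir.tmpdir, "app.log")  # illustrative path

# Open the log file, falling back to stderr when the file is
# unusable: the last-resort channel when nothing else works.
def open_logger(path)
  Logger.new(File.open(path, "a"))
rescue SystemCallError
  Logger.new($stderr)
end

logger = open_logger(LOG_PATH)

# An external rotator (e.g. logrotate) renames the file, then signals
# us; reopening the log device recreates a fresh file gracefully.
Signal.trap("HUP") { logger.reopen(File.open(LOG_PATH, "a")) }
```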
41. Logs : implementation
• native Ruby methods
• semantic_logger
https://github.com/rocketjob/semantic_logger
(a lot of destinations: DBs, HTTP, UDP, syslog)
• ougai
https://github.com/tilfin/ougai
(JSON)
• httplog
https://github.com/trusche/httplog
(HTTP logging, JSON support)
43. Metrics : approaches
• USE method
Utilization, Saturation, Errors
• Google SRE book
Latency, Traffic, Errors, Saturation
• RED method
Rate, Errors, Duration
44. Metrics : utilization
• Hardware resources: CPU, disk system, network interfaces
• File system: capacity, usage
• Memory: capacity, cache, heap, queue
• Resources: file descriptors, threads, sockets, connections
The average time that the resource was busy servicing work; the usage of a resource. (USE Method)
45. Metrics : traffic, rate
• normal operations:
− requests
− queries
− transactions
− sending network packets
− processing flow bytes
A measure of how much demand is being placed on your system. (Google SRE book)
The number of requests per second your services are serving. (RED Method)
46. Metrics : latency, duration
The time it takes to service a request. (Google SRE book)
• latency of operations:
− requests
− queries
− transactions
− sending network packets
− processing flow bytes
47. Metrics : errors
• error events:
− hardware errors
− software exceptions
− invalid requests / input
− authentication fails
− invalid URLs
The count of error events. (USE Method)
The rate of requests that fail, either explicitly, implicitly, or by policy. (Google SRE book)
48. Metrics : saturation
• calculated value, measure of current load
The degree to which the resource has extra work which it can't service, often queued. (USE Method)
How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained. (Google SRE book)
49. Metrics : saturation
• can be calculated internally or measured externally
• high utilization is a problem
• high saturation is a problem
• a low utilization level does not guarantee that everything is OK
• low saturation (in the case of a correct calculation) most likely indicates that everything is OK
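As a calculated value, saturation can be as simple as queue depth against capacity; a tiny Ruby sketch (names and formula are illustrative — use whatever reflects your system's most constrained resource):

```ruby
# Saturation > 1.0 means more work than the resource can service.
def saturation(queued:, capacity:)
  return 0.0 if capacity.zero?
  queued.to_f / capacity
end

saturation(queued: 30, capacity: 100)   # 0.3: most likely OK
saturation(queued: 150, capacity: 100)  # 1.5: work is piling up in the queue
```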
50. OpenMetrics : based on Prometheus metric types
• Gauge
single numerical value
− memory used
− fan speed
− connections count
• Counter
single monotonically increasing counter
− operations done
− errors occurred
− requests processed
• Histogram
incrementing counters per bucket
− request counts per latency bucket
− CPU load value counts per range bucket
• Summary
similar to the Histogram, but φ-quantiles are calculated on the client side; other quantiles cannot be calculated afterwards
https://openmetrics.io/
https://prometheus.io/docs/concepts/metric_types/
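Gauges and counters can be rendered by hand in the Prometheus text exposition format, as this Ruby sketch shows (in practice a client library such as the prometheus-client gem does this; metric names here are illustrative):

```ruby
# Render one metric in the Prometheus text exposition format:
# HELP and TYPE comment lines, then name{labels} value.
def render_metric(name, type, help, value, labels = {})
  label_str = labels.empty? ? "" :
    "{" + labels.map { |k, v| %(#{k}="#{v}") }.join(",") + "}"
  [
    "# HELP #{name} #{help}",
    "# TYPE #{name} #{type}",
    "#{name}#{label_str} #{value}"
  ].join("\n")
end

puts render_metric("process_open_fds", "gauge",
                   "Number of open file descriptors.", 42)
puts render_metric("http_requests_total", "counter",
                   "Total HTTP requests.", 1027, { method: "get", code: "200" })
```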
58. Metrics : export data
• current state
• current health
• event counters:
− AAA events
− unexpected actions (wrong URLs, failed sign-ins)
− errors during normal operations
• performance metrics
− normal operations
− queues
− utilization, saturation
− query latency
• application info:
− version
− warnings/notifications gauge
59. Metrics : formats
• suggest using Prometheus format
− native for Prometheus
− OpenMetrics — open source specification
− simple and clear
− HTTP-based
− can be easily converted
− libraries exist
• Influx or similar format if you really need to implement
push model
• protobuf / gRPC
− custom
− high load
61. Metrics : bear in mind [1/2]
• Split statistics by type. For example, aggregating successful (relatively long) and failed (relatively short) durations may create the illusion of a performance increase when multiple failures occur.
• Whenever possible, use saturation to determine system load; utilization alone is incomplete information.
• Be sure to export the metrics of the component closest to the user. This makes it possible to evaluate SLIs.
• Implement configurable bucket sizes.
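Configurable bucket sizes can be as simple as passing the boundaries in, as in this Ruby sketch that renders the cumulative _bucket/_sum/_count series Prometheus expects (class and metric names are illustrative):

```ruby
# A histogram whose bucket boundaries are supplied by configuration
# rather than hard-coded.
class LatencyHistogram
  def initialize(buckets)
    @buckets = buckets.sort   # upper bounds, e.g. in seconds
    @observations = []
  end

  def observe(value)
    @observations << value
  end

  # Prometheus buckets are cumulative: each counts all observations
  # less than or equal to its upper bound.
  def render(name)
    lines = @buckets.map do |le|
      count = @observations.count { |v| v <= le }
      %(#{name}_bucket{le="#{le}"} #{count})
    end
    lines << %(#{name}_bucket{le="+Inf"} #{@observations.size})
    lines << "#{name}_sum #{@observations.sum}"
    lines << "#{name}_count #{@observations.size}"
    lines.join("\n")
  end
end

h = LatencyHistogram.new([0.1, 0.5, 1.0])   # configurable boundaries
[0.05, 0.3, 0.7, 2.0].each { |v| h.observe(v) }
puts h.render("request_duration_seconds")
```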
62. Metrics : bear in mind [2/2]
• Export appropriate metrics as buckets. This lowers the polling rate and makes it possible to get percentile statistics.
• Add units to metric names.
• Whenever possible, use SI units.
• Follow the naming standard. The Prometheus "Metric and label naming" document is a good base.
64. Traces : definition
In software engineering, tracing involves a
specialized use of logging to record information
about a program's execution.
…
There is not always a clear distinction between
tracing and other forms of logging, except that the
term tracing is almost never applied to logging that is
a functional requirement of a program.
— Wikipedia
65. Traces : use cases
• Debugging during development
• Measuring and tuning performance
• Analyzing failures and security incidents
https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html
• Approaches
• Library comparison
• Implementation example
• Use cases
67. Traces : spans in trace tree
https://static.googleusercontent.com/media/research.google.com/uk/pubs/archive/36356.pdf
68. Traces : kinds of data
Per request/query tracking:
• trace id
• span id
• parent span id
• application info (product, component)
• module name
• method name
• context data (session/request id, user id, …)
• operation name and code
• start time
• end time
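A per-span record holding these fields can be sketched in Ruby as below. Field names follow the slide; the hex id formats mimic common tracing systems, but real implementations (OpenTracing, OpenCensus) define their own:

```ruby
require "securerandom"

# One span in the trace tree: shares trace_id with the whole tree,
# links to its parent via parent_span_id.
Span = Struct.new(:trace_id, :span_id, :parent_span_id,
                  :component, :operation, :start_time, :end_time,
                  keyword_init: true)

def start_span(component:, operation:, trace_id: nil, parent: nil)
  Span.new(
    trace_id: trace_id || SecureRandom.hex(16),
    span_id: SecureRandom.hex(8),
    parent_span_id: parent&.span_id,   # nil for the root span
    component: component,
    operation: operation,
    start_time: Time.now
  )
end

root  = start_span(component: "api", operation: "GET /users")
child = start_span(trace_id: root.trace_id, parent: root,
                   component: "db", operation: "SELECT users")
child.end_time = Time.now
```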
74. Checklist : Logs
□ Each line:
□ timestamps (ISO8601, TZ, reasonable precision)
□ PID
□ component name
□ severity
□ event code
□ human-readable message
□ Events to log:
□ state changes (start/ready/pause/stop)
□ health changes (new state, reason, doc URL)
□ user sign-in attempts (including failed with reasons), actions, sign-out
□ audit trail
□ errors
□ On start:
□ product name, component name
□ version (+build, +commit hash)
□ running mode (debug/normal, daemon/)
□ deprecation warnings
□ which configuration in use (ENV, file, configuration service)
□ On ready: communication sockets and ports
□ On exit: reason
□ Do not log:
□ passwords, tokens
□ personal data
75. Checklist : Metrics
□ Data to export:
□ application (version, warning/notification)
□ utilization (resources, capacities, usage)
□ saturation (internally calculated or appropriate metrics)
□ rate (operations)
□ errors
□ latencies
□ Split metrics by types
□ Export as buckets when reasonable
□ Configure size of buckets
□ Export metrics for SLI
□ Determine required resolution
□ Normalize, use SI units, add units to names
□ Prefer the poll model if possible
□ Clear counters on restart
76. Links [1/2]
• Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
https://static.googleusercontent.com/media/research.google.com/uk//pubs/archive/36356.pdf
• How to Implement Tracing in a Modern Distributed Application
https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html
• OpenTracing
https://opentracing.io/
• OpenMetrics
https://github.com/RichiH/OpenMetrics
• OpenCensus
https://opencensus.io
77. Links [2/2]
• CEF
https://kc.mcafee.com/resources/sites/MCAFEE/content/live/CORP_KNOWLEDGEBASE/78000/KB78712/en_US/CEF_White_Paper_20100722.pdf
• Metrics : USE method
http://www.brendangregg.com/usemethod.html
• Google SRE book
https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
• Metrics : RED method
https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/
• MS Azure : monitoring and diagnostics
https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring
• Prometheus : Metrics and label names
https://prometheus.io/docs/practices/naming/