4. Join our public Oracle DevRel Workspace
oracledevrel.slack.com
Join the dedicated Slack channel to
be part of the conversation and
raise your questions to our Experts:
Step 1: Access the Slack
OracleDevRel Workspace following
this link:
https://bit.ly/devrel-slack-emea
Step 2: Search for Phil Wilkins
philip.wilkins@oracle.com
Oracle Cloud Free Tier + Special Promo
Try Always Free. No Time Limits.
Free credits you can use for additional OCI services
$300 / $500 in Oracle Cloud Credits – 1 month
12. Highly Pluggable Framework
Input
• TCP/UDP
• Unix Sockets
• HTTP
• Many file formats
• SNMP traps
• OS (Linux/UNIX)
• Log4j, SLF4J, and related frameworks for .NET, JavaScript, etc.
Output
• Log management solutions, e.g. Splunk, cloud-native offerings such as Loggly, Logz.io, etc.
• HTTP
• Prometheus
• Grafana
• Many file formats
• DB (SQL/NoSQL)
• Event streams, e.g. Kafka, Kinesis, MQTT
• Social/notification channels, e.g. Jabber, Slack, email, Twilio, etc.
• Support management tools like PagerDuty
Buffer / Cache
• Custom in memory cache
• Redis & Coherence
Formatter
• XML
• JSON
• CSV/TSV etc.
• Compressed formats
Storage
• S3 buckets
• DB (No)SQL
• File
Filters
• Value based conditions
• REGEX expressions
Parser
• Multiline text to single event
• Event info extraction e.g. date & time
Custom components
Using Ruby Gems, it is possible to build any
custom components on the framework provided (see the sketch below)
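To make the pluggable pipeline concrete, here is a minimal, illustrative Fluentd configuration wiring together an input, parser, filter, buffer, and output. The paths, tags, and patterns are placeholder assumptions, not taken from the deck:

<source>
  @type tail                      # input plugin: follow a log file
  path /var/log/app.log           # placeholder path
  pos_file /var/log/app.log.pos   # tracks how far we have read
  tag app.log
  <parse>
    @type json                    # parser plugin: structure each line
  </parse>
</source>

<filter app.log>
  @type grep                      # filter plugin: value-based condition
  <regexp>
    key level
    pattern /ERROR|WARN/
  </regexp>
</filter>

<match app.log>
  @type file                      # output plugin: write to file storage
  path /var/log/forwarded/app
  <buffer>
    @type memory                  # buffer plugin: in-memory cache
    flush_interval 10s
  </buffer>
</match>

Each block can be swapped for any other plugin of the same role (e.g. the file output for Kafka or Slack) without touching the rest of the pipeline.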
Editor's Notes
12 Factor App was created in 2012 – 10 years ago, and predates the mainstream uptake of K8s
This is true, but most Java apps use logging frameworks. In fact, the language has had its own logging framework: java.util.logging arrived with J2SE 1.4 in 2002
Yes logging is most commonly sent to file BUT logging frameworks allow us to be smarter
Logs are streams – this is true, but our default approach of using files tends to break this thinking. Unless we either
Stop using files
Convert the file into a stream of events
Formatting comes as a result of framework use – rarely active logic to structure
Logs generally have a beginning and an end
But not always – stack dumps
Some apps generate log fragments. Logs have no fixed beginning or end; they flow continuously as long as the app is operating (see the multiline sketch below).
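Stack dumps are the classic case: one logical event spread across many physical lines. As a hedged sketch (the timestamp pattern and paths are assumptions), Fluentd's multiline parser can stitch the fragments back into a single event:

<source>
  @type tail
  path /var/log/app.log            # placeholder path
  pos_file /var/log/app-multi.pos
  tag app.multiline
  <parse>
    @type multiline
    # a new event starts with a timestamp; continuation lines
    # (e.g. stack trace frames) fold into the same event
    format_firstline /^\d{4}-\d{2}-\d{2}/
    format1 /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?<message>.*)/
    time_format %Y-%m-%d %H:%M:%S
  </parse>
</source>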
STDOUT:
- Formatting
- Switching off for production
- Kubernetes handling – pod goes, logs go
- Getting meaning – more work
Not writing log files – well, writing a log stream is key
Writing the logs to stdout has implications and inferences …
It is easy to read this as just using System.out.println
We lose the benefits of the logging framework (consistent structure of the log – point made previously)
Switching off logging that is for debug/dev purposes only becomes harder
If we want to capture and process, we now need to capture stdout – which differs between OSes – more variability, not less
If we take a stream of text, we need to reassert the structure again – additional compute overhead (see the parser sketch below)
Console watching works for unit testing, but when you test in a container or K8s this becomes trickier, as we often want to scroll back through logs – console retention is only so long, ~500 lines
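Reasserting structure over a plain text stream typically means regex parsing downstream – extra compute we avoided while the logging framework still owned the format. A hedged sketch using Fluentd's parser filter (the field names and expression are illustrative assumptions):

<filter app.stdout>
  @type parser
  key_name message                 # the raw text field to re-parse
  <parse>
    @type regexp
    # assumed layout: "LEVEL component: free text"
    expression /^(?<level>\w+) (?<component>[\w.]+): (?<detail>.*)$/
  </parse>
</filter>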
What about SQL and stored procedures?
Captured by the execution environment – need to be careful we don’t always have control of this.
For example, in K8s it captures stdout and writes to its own log files, which are limited and controlled only by K8s and not always tunable. 1.5K is fine UNLESS someone is dumping blobs of content – messages, images, etc.
If your container is killed off, then so is the bulk of the logs – K8s truncates logs
Moving logs for archival – missing an opportunity
In the world of microservices we may be used for multiple apps – how do you duplicate
Watching a live terminal, yes. But by the time we get to a terminal the problem may have long gone, our container recycled.
No consideration for the here and now.
General-purpose data warehouse: Hadoop – dates the statement. Substitute text-based search engines such as Apache Lucene, OpenSearch / Elasticsearch.
Active alerting – yes, but why wait?
https://threadreaderapp.com/thread/1020188389721530368.html
Twitter had an observability team ten years ago
Hungarian-American engineer Rudolf E. Kálmán formalized the idea of observability in a paper describing characteristics of linear dynamic systems in 1961
https://en.wikipedia.org/wiki/Rudolf_E._K%C3%A1lm%C3%A1n
His work was important to NASA
Google’s 4 golden signals
Latency
Traffic
Errors
Saturation
Understand the cause of performance issues
Insight into who / what is interacting with the system(s)
Spotting when unexpected errors occur (e.g. unexpected edge case scenarios)
Performance management – harvesting slow-running queries, scissor lockouts and deadly embraces (deadlocks) on threads, infinite-loop conditions, unresponsive threads
Blend multiple logs to get an end-to-end picture
Understand potential causes of loss of data integrity
What is Fluentd?
Fluentd is an open source log collector, processor, and aggregator that was created back in 2011 by the folks at Treasure Data. Written in Ruby, Fluentd was created to act as a unified logging layer — a one-stop component that can aggregate data from multiple sources, unify the differently formatted data into JSON objects and route it to different output destinations.
Design wise — performance, scalability, and reliability are some of Fluentd’s outstanding features. A vanilla Fluentd deployment will run on ~40MB of memory and is capable of processing above 10,000 events per second. Adding new inputs or outputs is relatively simple and has little effect on performance. Fluentd uses disk or memory for buffering and queuing to handle transmission failures or data overload and supports multiple configuration options to ensure a more resilient data pipeline.
Fluentd has been around for some time now and has developed a rich ecosystem consisting of more than 700 different plugins that extend its functionality. Fluentd is the de-facto standard log aggregator used for logging in Kubernetes and as mentioned above, is one of the widely used Docker images.
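As a hedged illustration of the buffering and retry options mentioned above (the host, intervals, and retry counts are assumptions, not recommendations):

<match app.**>
  @type forward                    # send events to a downstream Fluentd node
  <server>
    host logs.example.internal     # placeholder aggregator host
    port 24224
  </server>
  <buffer>
    @type file                     # on-disk buffer survives restarts
    path /var/log/fluent/buffer
    flush_interval 5s
    retry_wait 2s
    retry_max_times 10             # bounded retries on transmission failure
  </buffer>
</match>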
What is Fluent Bit?
Fluent Bit is an open source log collector and processor also created by the folks at Treasure Data in 2015. Written in C, Fluent Bit was created with a specific use case in mind — highly distributed environments where limited capacity and reduced overhead (memory and CPU) are a huge consideration.
To serve this purpose, Fluent Bit was designed for high performance and comes with a super light footprint, running on ~450KB only. An abstracted I/O handler allows asynchronous and event-driven read/write operations. For resiliency and reliability, various configuration options are available for defining retries and the buffer limit.
Fluent Bit is also extensible, but has a smaller eco-system compared to Fluentd. Inputs include syslog, tcp, systemd/journald but also CPU, memory, and disk. Outputs include Elasticsearch, InfluxDB, file and http. For Kubernetes deployments, a dedicated filter plugin will add metadata to log data, such as the pod’s name and namespace, and the containers name/ID.
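A hedged sketch of what that looks like in Fluent Bit's classic configuration format, with a constrained node shipping CPU metrics and a local log up to a central aggregator (the host, paths, and tags are illustrative assumptions):

[INPUT]
    # built-in CPU metrics input
    Name   cpu
    Tag    node.cpu

[INPUT]
    # placeholder log file path
    Name   tail
    Path   /var/log/app.log
    Tag    node.app

[OUTPUT]
    # ship upstream, e.g. to a Fluentd aggregator
    Name   forward
    Match  node.*
    Host   fluentd.example.internal
    Port   24224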
Any distributed monitoring and log management solution typically follows this sequence of events.
Depending on the toolset and the goal of the monitoring, one or more steps may be fulfilled by a single tool. For example, with the combination of Splunk agents, the Splunk engine & dashboard, all of these stages are covered within a single tool.
Note: Fluentd does NOT provide deep data analysis capabilities – for this we leverage other tools
What do I mean by vendor neutral?
No license cost
Not tied to any specific cloud or platform
Cortex – A horizontally scalable, highly available, multi-tenant, long term Prometheus. (incubating)
Fonio - A Kubernetes operator for running synthetic checks as pods. Works great with Prometheus! (Sandbox)
Kuberhealthy - A Kubernetes operator for running synthetic checks as pods. Works great with Prometheus! (sandbox)
OpenMetrics - Evolving the Prometheus exposition format into a standard (incubating)
Pixie - Open source Kubernetes observability for developers (Sandbox)
Prometheus - The Prometheus monitoring system and time series database. (Graduated)
Skooner - Simple Kubernetes real-time dashboard and management (sandbox)
Thanos - Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project. (incubating)
Trickster - Open Source HTTP Reverse Proxy Cache and Time Series Dashboard Accelerator (Sandbox)
Jaeger - CNCF Jaeger, a Distributed Tracing Platform (Graduated)
OpenTelemetry - OpenTelemetry community content (Incubating)
OpenTracing – Deprecated
------
Observability & Analysis
Now that we've worked our way through the layers of the CNCF landscape, we'll focus on the columns starting with observability and analysis.
Before diving into these categories, let's first define observability and analysis. Observability is a system characteristic describing the degree to which a system can be understood from its external outputs. Measured by CPU time, memory, disk space, latency, errors, etc., computer systems can be more or less observable. Analysis is an activity in which you look at this observable data and make sense of it.
To ensure there is no service disruption, you'll need to observe and analyze every aspect of your application so every anomaly gets detected and rectified right away. This is what this category is all about. It runs across and observes all layers which is why it's on the side and not embedded in a specific layer.
Tools in this category are broken down into logging, monitoring, tracing, and chaos engineering. Please note that the category name is somewhat misleading — although chaos engineering is listed here, consider it a reliability tool rather than an observability or analysis tool.
-----
Monitoring
What it is
Monitoring refers to instrumenting an app to collect, aggregate, and analyze logs and metrics to improve our understanding of its behavior. While logs describe specific events, metrics are a measurement of a system at a given point in time — they are two different things but both necessary to get the full picture of your system's health. Monitoring includes everything from watching disk space, CPU usage, and memory consumption on individual nodes to doing detailed synthetic transactions to see if a system or application is responding correctly and in a timely manner. There are a number of different approaches to monitor systems and applications.
Problem it addresses
When running an application or platform, you want it to accomplish a specific task as designed and ensure it's only accessed by authorized users. Monitoring allows you to know if it is working correctly, securely, cost effectively, only accessed by authorized users, as well as any other characteristic you may be tracking.
How it helps
Good monitoring allows operators to respond quickly, and even automatically, when an incident arises. It provides insights into the current health of a system and watches for changes. Monitoring tracks everything from application health to user behaviour and is an essential part of effectively running applications.
Technical 101
Monitoring in a cloud native context is generally similar to monitoring traditional applications. You need to track metrics, logs, and events to understand the health of your applications. The main difference is that some of the managed objects are ephemeral, meaning they may not be long lasting so tying your monitoring to objects like auto generated resource names won’t be a good long term strategy. There are a number of CNCF projects in this space that largely revolve around Prometheus, the CNCF graduated project.
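Since auto-generated, ephemeral names make poor monitoring keys, pipelines commonly stamp events with stable labels instead. A hedged Fluentd sketch (the tag layout and label values are assumptions):

<filter kube.**>
  @type record_transformer
  <record>
    service ${tag_parts[1]}        # stable service name derived from the tag
    env production                 # static deployment label
  </record>
</filter>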
Chaos Engineering
Chaos Mesh - A Chaos Engineering Platform for Kubernetes. (Incubating)
ChaosBlade - An easy to use and powerful chaos engineering experiment toolkit (Sandbox)
Litmus - Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q (Incubating)
The 12 Factor App doesn’t consider the question of distributed solutions
----------
https://www.hostdime.com/blog/data-egress-fees-cloud/
https://azure.microsoft.com/en-gb/pricing/details/bandwidth/
https://aws.amazon.com/ec2/pricing/on-demand
https://www.oracle.com/uk/cloud/networking/pricing/#fastconnect/
https://cloud.google.com/vpc/network-pricing#all-networking-pricing
https://www.oracle.com/uk/cloud/networking/pricing/
100 Mb leased line from BT Wholesale = £343/month
£343 / (30 days × 24 hours) = £0.4763 per hour
100 Mb/s can peak at ~28 GB per hour
£0.4763 / 28 = £0.017 per GB (at 50% utilization, £0.034)
GBP→USD = 1.28
≈ $0.022 per GB (100% use), $0.043 (50% use)
This approach leverages Fluentd filtering, bringing key log events to the center (see the sketch below)
Keeps to the spirit of the 12 Factor App
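A hedged sketch of that pattern: filter at the edge so only the significant events cross the metered network to the center (the tags, levels, and host are illustrative assumptions):

<filter app.**>
  @type grep
  <regexp>
    key level
    pattern /ERROR|FATAL/          # only significant events leave the node
  </regexp>
</filter>

<match app.**>
  @type forward                    # ship the reduced stream to the center
  <server>
    host central-fluentd.example.internal   # placeholder host
    port 24224
  </server>
</match>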
Single instance = only scale up
Resilient pairing – scaling up, with some accommodation for coping with bursts if run active-active
Distributed – very doable given the small footprint; can scale out and up and distribute widely
Ability to cope with network failure / bandwidth capacity
Fluentd is a bit heavy – so we have Fluent Bit as an option
Walk through the demo resources
set SLACK_TOKEN=xoxp-<redacted>
fluentd -c Demo2/Fluentd/node2-file-source-multi-out-label-pipelines.conf
fluentd -c Demo2/Fluentd/node1-file-source-multi-out-label-pipelines.conf
CD Demo2
log-generator1.bat
log-generator2.bat
Stop – uncomment the suppress filter (see the sketch below)
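The suppress step referenced above deduplicates repeated events; a hedged sketch, assuming the third-party fluent-plugin-suppress gem (the keys and thresholds are illustrative):

<filter **>
  @type suppress
  interval 10                      # suppress window in seconds
  num 3                            # let the first 3 occurrences through
  attr_keys message                # dedupe on this field
</filter>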