Troubleshooting ClickHouse Performance

MinervaDB Inc., 340 S LEMON AVE #9718 WALNUT 91789 CA, US
Shiv Iyer

Intro: About me
● Shiv Iyer
○ Founder and Principal of MinervaDB Inc.
■ MinervaDB Inc. - Consultative Support and Managed Services Provider for MySQL,
MariaDB and PostgreSQL
○ Founder and Principal of ChistaDATA Inc.
■ ChistaDATA Inc. - Consultative Support and Managed Services for ClickHouse
○ Technology Focus
■ Open Source Database Systems: MySQL, MariaDB, PostgreSQL and ClickHouse
■ Full-Stack Performance Troubleshooting and Optimization
■ Capacity Planning and Sizing
○ Follow me on Twitter: @thewebscaledba
○ Email: ceo@minervadb.com / ceo@chistadata.com

Basic understanding of the art and science of
Systems Operations Performance - Before Solving
● How long an operation/process takes to complete - Response Time
● Load on the system, thread handling and queueing:
○ Thread Performance
○ Memory Handling
○ Deadlocks
PLEASE DON'T QUANTIFY HIGH CPU USAGE, EXTENSIVE DISK
OPERATIONS AND LOW NETWORK BANDWIDTH METRICS ON
PERFORMANCE AUDIT PROGRAMS

Components/Pillars of Performance Engineering
● MIPS is the number of millions of
instructions executed per second.
However, a higher MIPS rating does not
by itself mean optimal performance or
an optimal execution plan.
The MIPS rating is only
acceptable

Performance Troubleshooting - How does it begin?
● UNHAPPY BUSINESS
○ Customers spending more time on requests
○ Cost of technology infrastructure is seriously impacting margins and budgets
○ Technology Automation Process Failure
■ Average response time of queries increases significantly, so business
operations are scaled by adding more people
■ Delayed Demand Fulfillment:
● Unhappy customers/suppliers/partners/employees/investors
● Losses / Layoffs / Shutdown
● More pains - Direct/Indirect impact on economy
PERFORMANCE IS A BUSINESS ACCELERATOR AND NOT JUST A
FEATURE. SYSTEMS WHICH MADE AN IMPACT ARE OPTIMAL AND
RELIABLE

UNDERSTANDING CLICKHOUSE AND CHALLENGES
● Open Source Column-oriented Database Management System for Online Analytical
Processing (OLAP) Queries
● Persistent data in ClickHouse is sorted by primary key, which makes OLAP applications
deployed on ClickHouse perform well
● ClickHouse supports Parallel Processing on Multiple Cores
● ClickHouse supports Distributed OLAP Queries
Challenges with large ClickHouse Infrastructure Operations
● OLAP database management systems grow very large over time, and ClickHouse is no
exception (even though ClickHouse provides compelling compression algorithms and
specialized codecs), so troubleshooting ClickHouse query performance is a specialized
skill.

Troubleshooting ClickHouse Performance - Methodology
● Understanding Application Latency - Response Time
○ Time spent for the completion of a process
● Measure the load on ClickHouse infrastructure:
○ Latency of query operations
○ Throughput - Queries Per Minute (QPM)
● Evidence Collection / Performance Forensics Methods:
○ Observability Tools
○ Profiling Techniques
○ Tracing Methods
LATENCY IS A TIME-BASED METRIC IN PERFORMANCE ENGINEERING
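The two load measures above can be sketched in a few lines of Python. This is a minimal illustration, not a ClickHouse API: the `samples` list, its values, and the helper names `percentile` and `queries_per_minute` are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from datetime import datetime

# Hypothetical sample data: (query start time, duration in milliseconds).
samples = [
    (datetime(2022, 3, 1, 6, 11, 5), 120),
    (datetime(2022, 3, 1, 6, 11, 20), 80),
    (datetime(2022, 3, 1, 6, 11, 48), 950),
    (datetime(2022, 3, 1, 6, 12, 3), 200),
    (datetime(2022, 3, 1, 6, 12, 30), 60),
]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def queries_per_minute(records):
    """Count queries per wall-clock minute (throughput, QPM)."""
    counts = {}
    for started, _duration_ms in records:
        minute = started.replace(second=0, microsecond=0)
        counts[minute] = counts.get(minute, 0) + 1
    return counts


durations = [duration_ms for _started, duration_ms in samples]
print("p50 latency (ms):", percentile(durations, 50))
print("p95 latency (ms):", percentile(durations, 95))
for minute, count in sorted(queries_per_minute(samples).items()):
    print(minute.strftime("%H:%M"), "QPM:", count)
```

Reporting a high percentile next to the median keeps slow outliers visible, which a plain average would hide.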

Simple and powerful Linux/UNIX tools for
troubleshooting Systems Operations Performance
Tool Name   Description
top         Top processes by latency and throughput
procstat    Detailed report on individual performance statistics
sar         General purpose system performance monitoring tool
vmstat      Virtual memory statistics collector and systemwide CPU usage aggregator
iostat      Disk I/O performance statistics collector/aggregator
sockstat    Network performance statistics collector

CPU Performance - Cycles Per Instruction

Monitoring Top Processes
By default, top(1) displays the details of the 'top' processes on the system and refreshes this
information every 2.0 seconds, ranking the processes in the list by raw CPU usage percentage. Technically, the
top command tracks detailed throughput-related information about the CPU and processes.

Monitoring CPU usage with bsdsar
System Activity Reporter (SAR) for FreeBSD systems. Detailed analysis of network, CPU,
memory, swap, and NFS usage.

Memory Available, Used and Free - sar -r 1 3
To calculate free memory from the Average values, use the following formula:
● kbmemfree + kbbuffers + kbcached = actual free memory on the system
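As a worked example of the formula above, here is a minimal Python sketch. The numbers are hypothetical stand-ins for the Average row of `sar -r` output, and the function name `actual_free_kb` is chosen only for illustration.

```python
def actual_free_kb(kbmemfree, kbbuffers, kbcached):
    """Memory that is effectively available: free + buffers + page cache (kB)."""
    return kbmemfree + kbbuffers + kbcached


# Hypothetical values taken from an "Average:" row of `sar -r 1 3`.
average = {"kbmemfree": 1_048_576, "kbbuffers": 204_800, "kbcached": 3_145_728}

free_kb = actual_free_kb(**average)
print(f"actual free memory: {free_kb} kB ({free_kb / 1024 / 1024:.2f} GiB)")
```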

vmstat - Reports Virtual Memory Statistics
The vmstat utility reports certain kernel statistics kept about process, virtual
memory, disk, trap and cpu activity.

iostat - Reports I/O Statistics
The iostat utility displays kernel I/O statistics on terminal, device and cpu
operations.

Why is Tracing the Application Infrastructure Important?
● Understand the Execution Plan/Data Access Path of both usual(expected)
and unusual(unexpected) incidents happening on your infrastructure
● Record both successful and unsuccessful events happening on your
infrastructure
● Understanding how system components are consuming available resources
● Threads/Process Handling - Both Latency and Throughput
● Cost efficient Capacity Planning/Sizing

eBPF

DTrace

Using DTrace for Troubleshooting System Performance
If you understand how efficiently the potential of the hardware infrastructure is
consumed, it becomes much easier to set expectations for both latency and throughput.
This is what we call Performance Goal setting. We use DTrace for detailed
analysis of full-stack infrastructure operations, which helps us troubleshoot
purely on the basis of evidence:
● CPU usage and distribution - Process handling and thread activity
● RAM/Memory usage
● Disk I/O operations
● Network Infrastructure I/O
● Full-stack Software Infrastructure Operations

Why is DTrace a super cool tracing tool for troubleshooting?
Provider    Description
profile     Profiling/tracing CPU resource usage patterns. You can use these probes to sample
            some aspect of system state at a fixed time interval; the samples are used to infer
            system behavior / performance forensics.
sysinfo     The sysinfo provider includes probes that correspond to kernel statistics
            classified by the name sys. These statistics are the input for tools such as mpstat.
lockstat    The lockstat provider provides probes that can be used to discern lock contention
            statistics, or to understand virtually any aspect of locking behavior.

DTrace one-liners for Troubleshooting Latency
### Time spent in read(), in nanoseconds, printed as a histogram
# dtrace -n 'syscall::read:entry { self->ts = timestamp; } syscall::read:return /self->ts/ { @ = quantize(timestamp - self->ts); self->ts = 0; }'
### Sum kernel adaptive lock block time by process name (ns)
# dtrace -n 'lockstat:::adaptive-block { @[execname] = sum(arg1); }'
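To read the histogram printed by the first one-liner, it helps to know that quantize() groups values into power-of-two buckets. The Python sketch below imitates that bucketing for illustration only; it is not DTrace code, the latency values are hypothetical, and it simplifies by placing zero and negative values in bucket 0.

```python
from collections import Counter


def quantize_bucket(value_ns):
    """Largest power of two that is <= value_ns (0 for non-positive values)."""
    if value_ns <= 0:
        return 0  # simplification: this sketch does not model negative buckets
    return 1 << (value_ns.bit_length() - 1)


def quantize(values_ns):
    """Count how many values fall into each power-of-two bucket."""
    return Counter(quantize_bucket(v) for v in values_ns)


# Hypothetical read() latencies in nanoseconds.
latencies_ns = [900, 1500, 1800, 2048, 4095, 70000]

for bucket, count in sorted(quantize(latencies_ns).items()):
    print(f"{bucket:>8} | {'@' * count} {count}")
```

Each printed row is labelled with the lower bound of its bucket, so a value of 1500 ns is counted in the 1024 row.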

Tracing ClickHouse Performance
● system.trace_log
○ By default, records the performance metrics of queries that run longer than 1 sec.
○ query_profiler_real_time_period_ns:
■ The clock timer of the query profiler. Real clock timer counts wall-clock time.
■ Recommended values:
● 10000000 nanoseconds (100 times a second) or less for single queries.
● 1000000000 (once a second) for cluster-wide profiling.
○ query_profiler_cpu_time_period_ns:
■ CPU clock timer of the query profiler. This timer counts only CPU time.
■ Recommended values:
● 10000000 nanoseconds (100 times a second) or more for single queries.
● 1000000000 (once a second) for cluster-wide profiling.
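The relationship between a profiler period in nanoseconds and the sampling rate is simple arithmetic, sketched below in Python. The helper names are hypothetical, and the sample count is an idealized estimate: it assumes one sample per full period elapsed, which a real timer will only approximate.

```python
NS_PER_SECOND = 1_000_000_000


def samples_per_second(period_ns):
    """Sampling frequency implied by a profiler period given in nanoseconds."""
    return NS_PER_SECOND / period_ns


def expected_samples(period_ns, query_duration_s):
    """Idealized number of stack samples taken during one query."""
    return int(query_duration_s * NS_PER_SECOND // period_ns)


for period_ns in (10_000_000, 1_000_000_000):
    print(
        f"period {period_ns} ns -> {samples_per_second(period_ns):g} samples/s, "
        f"about {expected_samples(period_ns, 0.5)} samples for a 0.5 s query"
    )
```

With a period of 1000000000 ns, a query that finishes in under a second is expected to produce no samples, which matches the note above that only queries running longer than 1 sec are recorded by default.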

Default configuration of the trace_log section in config.xml
<trace_log>
    <database>system</database>
    <table>trace_log</table>
    <partition_by>toYYYYMM(event_date)</partition_by>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</trace_log>

system.trace_log - Contains stack traces collected
by the sampling query profiler.
SELECT * FROM system.trace_log LIMIT 1 \G
Row 1:
──────
event_date: 2022-03-01
event_time: 2022-03-01 06:11:18
event_time_microseconds: 2022-03-01 06:11:18.116138
timestamp_ns: 3712951364193637139
trace_type: ………….
thread_id: ………………………

system.trace_log
● event_date (Date) — Date of sampling moment.
● event_time (DateTime) — Timestamp of the sampling moment.
● event_time_microseconds (DateTime64) — Timestamp of the sampling moment with microseconds precision.
● timestamp_ns (UInt64) — Timestamp of the sampling moment in nanoseconds.
● revision (UInt32) — ClickHouse server build revision.
When connecting to the server with clickhouse-client, you see a string similar to "Connected to ClickHouse server
version 19.18.1 revision 54429." This field contains the revision, but not the version, of the server.
● trace_type (Enum8) — Trace type:
○ Real represents collecting stack traces by wall-clock time.
○ CPU represents collecting stack traces by CPU time.
○ Memory represents collecting allocations and deallocations when memory allocation exceeds the subsequent
watermark.
○ MemorySample represents collecting random allocations and deallocations.
● thread_id (UInt64) — Thread identifier.
● query_id (String) — Query identifier that can be used to get details about a query that was running from the
query_log system table.
● trace (Array(UInt64)) — Stack trace at the moment of sampling. Each element is a virtual memory address inside
ClickHouse server process

System Tables on ClickHouse to troubleshoot more intuitively
System Table              Description
system.processes          Detailed reporting on both active and idle ClickHouse threads/processes
system.query_log          Log of all the queries executed - start time, end time, duration, errors
system.query_thread_log   Detailed report on threads and queries executed - thread name, thread start
                          time, duration
system.trace_log          Tracing ClickHouse operations to build Data Access Path/Execution Plan

How do you consolidate the effort in ClickHouse
Troubleshooting?
● Diagnostic tools you can use for performance forensics/troubleshooting
● Quantify performance against throughput for proactive capacity
planning/sizing
● What is not performance troubleshooting?
● How you can use the historical performance data to plan for future
● Choosing tools to access only the relevant data
● Root cause analysis in Performance Audit
● Building systems for performance

WE ARE HIRING

Thank you!
