Telemetry Types

Metrics collected by the Rezolus agent, organized by sampler category.

Each sampler can be individually enabled or disabled in the agent config. Metrics are labeled with dimensions such as state, op, and direction. Many metrics are also collected per cgroup.

CPU

Usage

CPU time by state, softirq breakdown

cpu_usage - Per-CPU nanoseconds by state: user, system
softirq - Per-CPU interrupt count by kind: hi, timer, net_tx, net_rx, block, irq_poll, tasklet, sched, hrtimer, rcu
softirq_time - Per-CPU nanoseconds spent in softirq, same kinds
cgroup_cpu_usage - Per-cgroup nanoseconds by state: user, system
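
The cpu_usage values are nanoseconds of CPU time, so a utilization figure comes from the difference between two readings. A minimal sketch in Python, assuming the values are cumulative counters sampled at a known interval; the function name and the sample readings are hypothetical and not part of Rezolus:

```python
def cpu_utilization(prev_ns: int, curr_ns: int, interval_s: float, cores: int = 1) -> float:
    """Fraction of available CPU time consumed between two counter readings.

    prev_ns and curr_ns are assumed to be cumulative nanosecond counters
    (for example cpu_usage with state=user), read interval_s seconds apart.
    """
    if interval_s <= 0 or cores <= 0:
        raise ValueError("interval_s and cores must be positive")
    busy_ns = curr_ns - prev_ns
    available_ns = interval_s * 1e9 * cores
    return busy_ns / available_ns


# Hypothetical readings for one CPU, taken one second apart.
user = cpu_utilization(prev_ns=5_000_000_000, curr_ns=5_250_000_000, interval_s=1.0)
print(f"user: {user:.1%}")
```

For an aggregate across CPUs, sum the per-CPU deltas and pass the number of cores so the result stays in the range 0 to 1.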

Frequency

APERF, MPERF, TSC cycle counters

cpu_aperf - Per-CPU actual performance cycles
cpu_mperf - Per-CPU maximum performance cycles
cpu_tsc - Per-CPU timestamp counter cycles
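
These three counters are usually combined rather than read alone. The sketch below follows the conventional APERF/MPERF arithmetic (the same ratios that tools such as turbostat report) and assumes that MPERF advances at the TSC rate while the core is not idle. The function name and dictionary keys are illustrative, not Rezolus identifiers:

```python
def frequency_stats(d_aperf: int, d_mperf: int, d_tsc: int, interval_s: float) -> dict:
    """Derive frequency figures from counter deltas over one sampling interval."""
    if d_mperf <= 0 or d_tsc <= 0 or interval_s <= 0:
        raise ValueError("deltas and interval must be positive")
    tsc_hz = d_tsc / interval_s  # the TSC ticks at a constant rate
    return {
        # Share of the interval the core spent not idle.
        "busy_fraction": d_mperf / d_tsc,
        # Frequency averaged over the whole interval, idle time included.
        "average_hz": d_aperf / interval_s,
        # Frequency while the core was actually running.
        "running_hz": tsc_hz * d_aperf / d_mperf,
    }


# Hypothetical deltas over a one-second interval.
stats = frequency_stats(1_500_000_000, 1_000_000_000, 2_000_000_000, 1.0)
print(stats)
```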

Performance Counters

Cycles, instructions, branch predictions, cache, TLB

cpu_cycles - Per-CPU cycle count
cpu_instructions - Per-CPU retired instructions
cpu_branch_instructions - Per-CPU branch instructions
cpu_branch_misses - Per-CPU branch mispredictions
cpu_dtlb_miss - Per-CPU data TLB misses (with op: load, store on Intel)
cpu_l3_access - Per-CPU L3 cache accesses
cpu_l3_miss - Per-CPU L3 cache misses
cpu_tlb_flush - Per-CPU TLB flush count by reason: task_switch, remote_shootdown, local_shootdown, etc.
cgroup_cpu_cycles - Per-cgroup cycle count
cgroup_cpu_instructions - Per-cgroup retired instructions
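
Raw counts like these are most often read as ratios. A small sketch, assuming the caller has already computed per-interval deltas keyed by metric name; the helper names and output keys are hypothetical:

```python
from typing import Optional


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def derived_counters(deltas: dict) -> dict:
    """Compute common ratios from per-interval counter deltas."""
    return {
        # Instructions retired per cycle.
        "ipc": ratio(deltas["cpu_instructions"], deltas["cpu_cycles"]),
        # Share of branches that were mispredicted.
        "branch_miss_rate": ratio(deltas["cpu_branch_misses"], deltas["cpu_branch_instructions"]),
        # Share of L3 accesses that missed.
        "l3_miss_ratio": ratio(deltas["cpu_l3_miss"], deltas["cpu_l3_access"]),
    }


# Hypothetical deltas for one CPU over one interval.
sample = {
    "cpu_cycles": 2_000_000,
    "cpu_instructions": 3_000_000,
    "cpu_branch_instructions": 500_000,
    "cpu_branch_misses": 5_000,
    "cpu_l3_access": 40_000,
    "cpu_l3_miss": 10_000,
}
print(derived_counters(sample))
```

The same `ratio` helper applies to the cgroup_cpu_cycles and cgroup_cpu_instructions pair for a per-cgroup IPC.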

Bandwidth & Migrations

CFS throttling, CPU migration events

cpu_cores - Number of online logical cores (gauge)
cpu_migrations - Per-CPU migration count with direction: from, to
cgroup_cpu_bandwidth_* - Per-cgroup CFS quota, period, throttled time, period counts
cgroup_cpu_migrations - Per-cgroup CPU migration count

Scheduler

Runqueue

Scheduling latency, running time, off-CPU time, context switches

scheduler_runqueue_latency - Histogram of time tasks wait in the runqueue (ns)
scheduler_running - Histogram of time tasks spend running on CPU (ns)
scheduler_offcpu - Histogram of time tasks spend off-CPU (ns)
scheduler_context_switch - Per-CPU involuntary context switches
scheduler_runqueue_wait - Per-CPU total nanoseconds spent waiting
cgroup_scheduler_* - Per-cgroup: runqueue_wait, offcpu, context_switch
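
Histogram metrics such as scheduler_runqueue_latency are typically summarized as percentiles. The sketch below assumes a generic representation, a list of (upper bound, count) pairs sorted by upper bound. That layout is an assumption chosen for illustration and is not the format Rezolus exports:

```python
import math
from typing import List, Optional, Tuple


def percentile(buckets: List[Tuple[int, int]], p: float) -> Optional[int]:
    """Upper bound of the bucket that contains the p-th percentile.

    buckets is a list of (upper_bound, count) pairs sorted by upper_bound.
    Returns None when the histogram is empty.
    """
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    total = sum(count for _, count in buckets)
    if total == 0:
        return None
    target = math.ceil(total * p / 100)  # rank of the sample we want
    seen = 0
    for upper_bound, count in buckets:
        seen += count
        if seen >= target:
            return upper_bound
    return buckets[-1][0]


# Hypothetical runqueue latency histogram, bounds in nanoseconds.
latency = [(1_000, 90), (10_000, 9), (100_000, 1)]
print(percentile(latency, 50), percentile(latency, 99))
```

Because the result is a bucket bound, it overstates the true percentile by at most the width of that bucket.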

Block I/O

Latency

I/O latency distributions by operation type

blockio_latency - Histogram (ns) with op: read, write, flush, discard

Requests

Operation counts, bytes transferred, size distributions

blockio_operations - Counter with op: read, write, flush, discard
blockio_bytes - Counter (bytes) with op: read, write, flush, discard
blockio_size - Histogram (bytes) with op: read, write, flush, discard
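
Since blockio_bytes and blockio_operations share the same op label, dividing one by the other gives a mean request size per operation type. A minimal sketch, assuming per-interval deltas held in dictionaries keyed by op; the function name and sample values are hypothetical:

```python
def mean_request_size(d_bytes: dict, d_ops: dict) -> dict:
    """Mean bytes per request for each op that completed at least one request."""
    return {
        op: d_bytes.get(op, 0) / count
        for op, count in d_ops.items()
        if count > 0
    }


# Hypothetical deltas over one interval.
bytes_delta = {"read": 409_600, "write": 1_310_720, "flush": 0}
ops_delta = {"read": 100, "write": 20, "flush": 0}
print(mean_request_size(bytes_delta, ops_delta))
```

The blockio_size histogram gives the full distribution. The mean is only a quick summary and hides a mix of small and large requests.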

Network

Traffic

Aggregate bytes and packets

network_bytes - Counter with direction: receive, transmit
network_packets - Counter with direction: receive, transmit

Interfaces

Drops, transmit errors, timeouts

network_drop - Dropped packets counter
network_transmit_busy - Transmit busy counter
network_transmit_complete - Completed transmissions counter
network_transmit_timeout - Transmit timeout events

Ethtool (ENA)

AWS EC2 Elastic Network Adapter allowance counters

network_ena_bandwidth_allowance_exceeded - Bandwidth allowance exceeded, with direction: receive, transmit
network_ena_pps_allowance_exceeded - Packets-per-second limit exceeded
network_ena_conntrack_allowance_exceeded - Connection tracking limit exceeded
network_ena_linklocal_allowance_exceeded - Link-local traffic limit exceeded

TCP

Traffic

Bytes, packets, and segment size distributions

tcp_bytes - Counter with direction: receive, transmit
tcp_packets - Counter with direction: receive, transmit
tcp_size - Histogram (bytes) with direction: receive, transmit

Latency

Connection establishment, packet delivery, jitter, RTT

tcp_connect_latency - Histogram (ns), time to establish a connection
tcp_packet_latency - Histogram (ns), receive-to-read latency
tcp_jitter - Histogram (ns), inter-packet jitter
tcp_srtt - Histogram (ns), smoothed round-trip time
tcp_retransmit - Counter, retransmitted packets

Syscall

Counts & Latency

Invocation counts and latency distributions by syscall category

syscall - Counter with op label (see categories below)
syscall_latency - Histogram (ns) with op label (same categories)
cgroup_syscall - Per-cgroup counters with same op labels

Syscall categories (op values):

read, write, poll, lock, time, sleep, socket, yield, filesystem, memory, process, query, ipc, timer, event, other

Memory

Meminfo & VMStat

System memory gauges and NUMA allocation counters

memory_total - Gauge (bytes)
memory_free - Gauge (bytes)
memory_available - Gauge (bytes)
memory_buffers - Gauge (bytes)
memory_cached - Gauge (bytes)
memory_numa_hit - Counter, allocations on intended node
memory_numa_miss - Counter, allocations on non-intended node
memory_numa_foreign - Counter, allocations intended for this node, placed elsewhere
memory_numa_interleave - Counter, interleave policy allocations
memory_numa_local - Counter, allocations on local node
memory_numa_other - Counter, allocations on remote node
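
Two derived figures are commonly built from these metrics: memory pressure from the gauges, and NUMA locality from the counters. The sketch below treats memory_available as the basis for "in use", which is one convention among several. The function names and sample values are hypothetical:

```python
from typing import Optional


def memory_used_fraction(total_bytes: int, available_bytes: int) -> float:
    """Share of memory not available for new allocations (gauge readings)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - available_bytes / total_bytes


def numa_local_fraction(d_local: int, d_other: int) -> Optional[float]:
    """Share of allocations served from the local node (counter deltas)."""
    allocations = d_local + d_other
    return d_local / allocations if allocations else None


# Hypothetical readings: 16 GiB total with 4 GiB available.
print(memory_used_fraction(16 * 2**30, 4 * 2**30))
# Hypothetical deltas of memory_numa_local and memory_numa_other.
print(numa_local_fraction(9_500, 500))
```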

GPU

NVIDIA

Memory, power, temperature, clocks, utilization (Linux)

gpu_memory - Per-GPU gauge (bytes) with state: free, used
gpu_power_usage - Per-GPU gauge (milliwatts)
gpu_energy_consumption - Per-GPU counter (millijoules)
gpu_temperature - Per-GPU gauge (Celsius)
gpu_clock - Per-GPU gauge (Hz) with clock: compute, graphics, memory, video
gpu_utilization - Per-GPU gauge (percentage)
gpu_memory_utilization - Per-GPU gauge (percentage)
gpu_pcie_bandwidth - Per-GPU gauge (bytes/sec) with direction: receive
gpu_pcie_throughput - Per-GPU gauge (bytes/sec) with direction: receive, transmit
gpu_sm_utilization - Per-GPU gauge (%), Hopper+ only
gpu_sm_occupancy - Per-GPU gauge (%), Hopper+ only
gpu_dram_bandwidth_utilization - Per-GPU gauge (%), Hopper+ only
gpu_tensor_utilization - Per-GPU gauge (%), Hopper+ only
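
gpu_power_usage is an instantaneous gauge, while gpu_energy_consumption is a counter, so the counter gives a true average over an interval. One millijoule per second equals one milliwatt, which keeps the arithmetic unit-free. A minimal sketch with a hypothetical function name and sample readings:

```python
def average_power_mw(prev_mj: int, curr_mj: int, interval_s: float) -> float:
    """Average power in milliwatts between two energy counter readings.

    prev_mj and curr_mj are cumulative millijoule readings taken
    interval_s seconds apart; mJ / s == mW.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (curr_mj - prev_mj) / interval_s


# Hypothetical readings taken 10 seconds apart: 3,000,000 mJ consumed.
watts = average_power_mw(12_000_000, 15_000_000, 10.0) / 1000
print(f"{watts:.0f} W")
```

Comparing this average with sampled gpu_power_usage values can reveal short power spikes that fall between samples.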

Apple Silicon

Power, clocks, utilization (macOS)

gpu_power_usage - Per-GPU gauge (milliwatts)
gpu_energy_consumption - Per-GPU counter (millijoules)
gpu_clock - Per-GPU gauge (Hz) with clock: graphics
gpu_utilization - Per-GPU gauge (percentage)

Rezolus (self-monitoring)

Resource Usage

Rezolus process CPU, memory, I/O, and context switches

rezolus_cpu_usage - Counter (ns) with state: user, system
rezolus_memory_usage_resident_set_size - Gauge (bytes)
rezolus_memory_page_reclaims - Counter
rezolus_memory_page_faults - Counter
rezolus_blockio_operations - Counter with op: read, write
rezolus_context_switch - Counter with kind: voluntary, involuntary