Prometheus for TiDB (Legacy Notes)

When Prometheus is a good fit (and when it is not), plus the essential concepts you need for TiDB monitoring and troubleshooting.

Prometheus works well for monitoring, alerting, and trend analysis. If you need exact accounting (for example, per-request billing), it is usually the wrong primary system: it only sees values at each scrape, failed scrapes leave gaps, aggregation discards detail, and retention policies eventually delete old data. That makes it hard to treat as a “source of truth”.

In practice:

  • Use logs/events/databases for exact accounting.
  • Use Prometheus for operational monitoring and alerting.

1. What Prometheus is (in one paragraph)

Prometheus is an open-source monitoring and alerting system built around time series. A time series is identified by a metric name plus a set of labels (key/value pairs), and over time it produces samples (timestamp, value). PromQL lets you filter, aggregate, and transform these series to build dashboards and alerts.

Official overview: https://prometheus.io/docs/introduction/overview/#what-is-prometheus

2. Data model (the most important thing)

Prometheus stores metrics as:

metric_name{label_name="label_value", ...}

Each unique label set is a separate time series.
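
As a minimal sketch of this identity rule (plain Python, not a Prometheus client API), a series can be keyed by its metric name plus its sorted label pairs. The helpers series_key, render, and append are hypothetical, and the instance values are made up for illustration.

```python
def series_key(metric_name, labels):
    """Build a hashable identity: metric name plus sorted label pairs."""
    return (metric_name, tuple(sorted(labels.items())))


def render(key):
    """Format a series identity in the metric_name{label="value"} notation."""
    name, pairs = key
    body = ", ".join(f'{k}="{v}"' for k, v in pairs)
    return f"{name}{{{body}}}"


# Toy storage: series identity -> list of (timestamp, value) samples.
storage = {}


def append(metric_name, labels, ts, value):
    """Append one sample to the series identified by name + labels."""
    storage.setdefault(series_key(metric_name, labels), []).append((ts, value))


# Same name, same labels: one series with two samples.
append("tidb_server_connections", {"instance": "tidb-0:10080"}, 0, 12.0)
append("tidb_server_connections", {"instance": "tidb-0:10080"}, 15, 14.0)
# Same name, different label value: a second, separate series.
append("tidb_server_connections", {"instance": "tidb-1:10080"}, 0, 7.0)

for key, samples in storage.items():
    print(render(key), samples)
```

The practical consequence is cardinality: every new label value creates another series, so labels with unbounded values (user IDs, raw SQL text) are best avoided.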

3. Time selection: instant, range, offset

Common PromQL time patterns:

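  • Instant vector: for each matching series, the most recent sample at the evaluation time.
  • Range vector: for each matching series, all samples within a look-back window, written as [duration] (for example [2m]).
  • Offset: shift the selector's evaluation time into the past with the offset modifier (for example offset 5m).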

Examples:

  • tidb_server_handle_query_duration_seconds_bucket{type="select"}
  • tidb_server_handle_query_duration_seconds_bucket{type="select"}[2m]
  • tidb_server_handle_query_duration_seconds_bucket{type="select"} offset 5m
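
The three selectors above can be sketched over a plain list of samples. This is a simplified model, not Prometheus's implementation: the function names are hypothetical, the left-open window boundary is an assumption of this sketch (exact boundary handling has differed across Prometheus versions), and the 5-minute lookback mirrors Prometheus's default lookback delta.

```python
import bisect

LOOKBACK = 300  # seconds; mirrors Prometheus's default 5-minute lookback delta


def instant(samples, t, lookback=LOOKBACK):
    """Most recent sample at or before t, unless it is older than the lookback."""
    timestamps = [ts for ts, _ in samples]
    i = bisect.bisect_right(timestamps, t)
    if i == 0:
        return None
    ts, value = samples[i - 1]
    return (ts, value) if t - ts < lookback else None


def range_window(samples, t, window):
    """All samples with timestamps in (t - window, t], like selector[window]."""
    return [(ts, v) for ts, v in samples if t - window < ts <= t]


def with_offset(selector, samples, t, offset, *args):
    """Evaluate a selector as if the current time were t - offset."""
    return selector(samples, t - offset, *args)


# One series scraped every 15 seconds for 10 minutes (synthetic values).
samples = [(ts, float(ts // 15)) for ts in range(0, 601, 15)]

print(instant(samples, 600))                      # selector
print(len(range_window(samples, 600, 120)))       # selector[2m]
print(with_offset(instant, samples, 600, 300))    # selector offset 5m
```

Note that a range vector cannot be graphed directly; it is normally passed to a function such as rate() that turns the window back into one value per series.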

4. Aggregation: “group by labels”

Aggregation reduces many time series into fewer series by grouping on labels: by (...) keeps only the listed labels, and without (...) drops the listed labels and keeps the rest. Typical operators include sum, avg, max, and min.
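
To make the grouping concrete, here is a minimal sketch of a sum that keeps only selected labels. The sum_by helper and the sample values are hypothetical; real PromQL also handles timestamps and many other operators.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Sum values across series, grouping on the label names in `keep`."""
    groups = defaultdict(float)
    for labels, value in series:
        # Labels not listed in `keep` are dropped; a missing label counts as "".
        key = tuple((name, labels.get(name, "")) for name in keep)
        groups[key] += value
    return dict(groups)


# Four input series: (labels, current value). Values are made up.
series = [
    ({"instance": "tidb-0", "type": "select"}, 120.0),
    ({"instance": "tidb-0", "type": "insert"}, 30.0),
    ({"instance": "tidb-1", "type": "select"}, 80.0),
    ({"instance": "tidb-1", "type": "insert"}, 20.0),
]

by_type = sum_by(series, ["type"])   # two output series, one per type
total = sum_by(series, [])           # one output series, no labels left

print(by_type)
print(total)
```

Four input series collapse to two when grouped by type and to one when no labels are kept, which is why detail lost in aggregation cannot be recovered from the result.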

Operational tip:

  • Start with more labels (more detail) to locate the scope.
  • Aggregate after you know what you’re looking for.

5. Metric types: Counter / Gauge / Histogram / Summary

Prometheus itself stores samples as time series, but on the client side we typically use these semantic types:

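  • Counter: only increases, and resets to zero when the process restarts. Good for totals such as requests or errors; query it with rate() or increase() rather than reading the raw value.
  • Gauge: can go up and down. Good for current state such as connections, queue length, or region count.
  • Histogram: observations counted in cumulative buckets (the le label). Good for latency and size distributions; quantiles are estimated at query time with histogram_quantile.
  • Summary: quantiles computed on the client side. Cheaper to query than histograms, but the precomputed quantiles cannot be meaningfully aggregated across instances.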
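
Counter resets are the reason raw counter values should not be subtracted directly. The sketch below shows the reset-aware idea under simplifying assumptions: any drop is treated as a restart from zero, and the window-boundary extrapolation that Prometheus's rate() and increase() perform is omitted. The function names and sample values are hypothetical.

```python
def increase(values):
    """Total increase over consecutive counter samples, treating a drop as a reset."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter restarted from zero, so everything up to `cur` is new.
            total += cur
    return total


def per_second_rate(samples):
    """Average per-second increase between the first and last sample."""
    values = [v for _, v in samples]
    elapsed = samples[-1][0] - samples[0][0]
    return increase(values) / elapsed


# (timestamp, counter value); the process restarts between t=30 and t=45.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]

naive = samples[-1][1] - samples[0][1]   # last minus first: negative, misleading
print(naive)
print(increase([v for _, v in samples]))
print(per_second_rate(samples))
```

Increments that happen between the last scrape before a restart and the restart itself are still lost, which is one more reason not to use counters for exact accounting.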

In TiDB/TiKV/TiCDC dashboards, you will mostly deal with counters, gauges, and histograms.
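
Since histograms dominate latency panels, it helps to see how a quantile is estimated from cumulative buckets. The following is a simplified sketch of the linear-interpolation approach used for classic histograms, assuming 0 < q <= 1, non-negative bucket bounds, and a final +Inf bucket; the bucket bounds and counts are made up, and edge cases handled by the real histogram_quantile are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound must be the +Inf bucket, whose count
    is the total number of observations.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Cumulative counts per le bound, in seconds (synthetic data).
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

print(histogram_quantile(0.50, buckets))
print(histogram_quantile(0.95, buckets))
print(histogram_quantile(1.00, buckets))
```

The result is an estimate whose accuracy depends on bucket boundaries: a p95 that falls inside a wide bucket can be far from the true value, and anything above the highest finite bound is reported as that bound.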

6. Practical tips for TiDB monitoring

  1. Prefer latency percentiles (p95/p99) over averages for external SLAs.
  2. Use a “zoom-in” workflow: cluster-level → instance/job → query/type labels.
  3. Don’t alert on single-point spikes; require the condition to hold for a for: duration, or alert on trends.
  4. Treat Prometheus as monitoring, not auditing. For strict correctness, use a different system.
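
The effect of a for: duration in tip 3 can be illustrated with a small simulation. This is a simplified model of the inactive, pending, and firing states of an alert rule, not Prometheus's implementation; the alert_states helper, the threshold, and the latency values are all hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state at each evaluation.

    evaluations: list of (timestamp, condition_is_true) in time order.
    The alert fires only once the condition has held continuously for
    at least for_seconds; any false evaluation resets it.
    """
    states = []
    active_since = None
    for ts, breached in evaluations:
        if not breached:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


THRESHOLD = 1.0  # seconds; hypothetical p99 latency limit
# p99 latency evaluated once per minute: one isolated spike, then a sustained breach.
p99 = [0.2, 1.5, 0.2, 1.2, 1.3, 1.4, 1.6, 1.5, 1.7, 0.3]
evaluations = [(i * 60, value > THRESHOLD) for i, value in enumerate(p99)]

states = alert_states(evaluations, for_seconds=300)
for (ts, _), state in zip(evaluations, states):
    print(ts, state)
```

The isolated spike at the second evaluation only reaches pending and never fires, while the sustained breach fires once it has lasted the full five minutes.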