
Prometheus in TiDB: Metrics and Query Interpretation

1. What Prometheus is good at

Prometheus is excellent for operational monitoring, alerting, and trend analysis.

It is not intended for strict billing-grade accounting where 100% event completeness is required.

Core strengths:

  • Time-series data model with metric + labels
  • Powerful query language (PromQL)
  • Flexible visualization (Grafana)
  • Strong ecosystem of exporters and integrations

2. Data model refresher

Prometheus stores data as time series:

<metric_name>{label_1="v1", label_2="v2", ...} -> (t0, v0), (t1, v1), ...

Label dimensions commonly found on TiDB metrics include:

  • instance
  • job
  • sql_type
  • le (histogram bucket upper bound)
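
As a hedged illustration of how these labels are used, the selector below filters one metric along several dimensions at once. The label values (tidb, tidb-.*, internal) are assumptions made for the example and depend on the deployment.

```promql
# Bucket series from instances whose address starts with "tidb-",
# excluding internal SQL. All label values here are hypothetical.
tidb_server_handle_query_duration_seconds_bucket{job="tidb", instance=~"tidb-.*", sql_type!="internal"}
```

The matchers =, !=, =~ and !~ can be combined freely; every series that satisfies all of them is returned.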

[Figure: Prometheus data model]

3. Time selectors in PromQL

3.1 Instant vector

Returns the most recent sample of each matching time series at the evaluation time.

Example:

tidb_server_handle_query_duration_seconds_bucket{sql_type="Select"}

3.2 Range vector

Returns, for each matching time series, all samples in a time window that ends at the evaluation time.

Example:

tidb_server_handle_query_duration_seconds_bucket{sql_type="Select"}[1m]

3.3 Offset

Shifts the evaluation time of a selector into the past by the given duration.

Example:

tidb_server_handle_query_duration_seconds_bucket offset 5m
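
One common use of offset is comparing current behavior with an earlier period. The sketch below assumes the histogram's companion _count series is available (standard for Prometheus histograms) and divides the current query rate by the rate one day earlier; the 1d shift is an arbitrary choice for illustration.

```promql
# Ratio of the current query rate to the rate at the same time yesterday.
# A value near 1 means traffic is similar; 2 means it has doubled.
sum(rate(tidb_server_handle_query_duration_seconds_count[1m]))
/
sum(rate(tidb_server_handle_query_duration_seconds_count[1m] offset 1d))
```

Note that offset attaches directly to the selector, after the range, not to the surrounding function call.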

4. Aggregation and rate functions

4.1 rate()

Computes the average per-second increase of a counter over a range. It accounts for counter resets, so it should be applied to counters rather than gauges.
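
A minimal sketch, assuming the _count series of the query duration histogram is used as a request counter:

```promql
# Queries per second handled by each TiDB instance, averaged over the last minute.
sum(rate(tidb_server_handle_query_duration_seconds_count[1m])) by (instance)
```

The range should span several scrape intervals so that each series contributes at least two samples.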

4.2 increase()

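Computes the total increase of a counter over a range. The result is extrapolated to cover the full window, so it can be fractional even for integer counters.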
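
As a sketch under the same assumption (the _count series serves as a request counter), the following estimates how many queries were handled in the last hour:

```promql
# Approximate number of queries handled in the last hour, per SQL type.
sum(increase(tidb_server_handle_query_duration_seconds_count[1h])) by (sql_type)
```

This is equivalent to multiplying rate() over the same window by the window length in seconds.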

4.3 delta()

Computes the difference between the first and last sample in a range, extrapolated to cover the full window. Intended for gauges.
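
A hedged example, assuming the TiDB process exposes the standard process_resident_memory_bytes gauge and is scraped under job="tidb" (both are assumptions about the deployment):

```promql
# Change in resident memory of each TiDB process over the last 10 minutes, in bytes.
# Positive means memory grew; negative means it shrank.
delta(process_resident_memory_bytes{job="tidb"}[10m])
```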

4.4 irate()

Computes the per-second instant rate from the last two samples in a range. It reacts quickly to short spikes but is noisier than rate().
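
For comparison with rate(), a sketch of the same per-instance query rate using irate():

```promql
# Per-instance query rate based only on the two most recent samples in the window.
# The [1m] range only bounds how far back to look for those two samples.
sum(irate(tidb_server_handle_query_duration_seconds_count[1m])) by (instance)
```

Graphing this next to the rate() version shows how much short-lived spikes are smoothed away by averaging.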

4.5 histogram_quantile()

Estimates a quantile (for example p99.9, written as 0.999) from histogram bucket rates.
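
The estimate comes from linear interpolation inside the bucket that contains the target rank. The worked example in the comments below uses made-up bucket rates purely to show the arithmetic; real TiDB bucket boundaries and values will differ.

```promql
# Hypothetical per-second bucket rates after sum(...) by (le):
#   le="0.1" -> 90, le="0.2" -> 99, le="0.4" -> 100, le="+Inf" -> 100
# Target rank for q = 0.95 is 0.95 * 100 = 95, which falls in the (0.1, 0.2] bucket.
# Interpolation: 0.1 + (0.2 - 0.1) * (95 - 90) / (99 - 90) ≈ 0.156 seconds
histogram_quantile(0.95, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) by (le))
```

Because of this interpolation, the result can never be more precise than the width of the bucket it lands in.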

5. TiDB Duration p99.9 query breakdown

A common query in Grafana is:

histogram_quantile(
  0.999,
  sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) by (le)
)

Interpretation pipeline:

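  1. ...[1m]: collect the cumulative bucket samples from the last minute
  2. rate(...): convert the cumulative bucket counters to per-second bucket growth
  3. sum(...) by (le): sum across all labels except the bucket boundary le, merging instances and SQL types
  4. histogram_quantile(0.999, ...): estimate p99.9 latency by interpolating within the bucket that contains the target rank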
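
The stages above can be inspected one at a time. The expressions below are separate queries, each meant to be run on its own (for example in the Prometheus expression browser), so that the intermediate result of every step is visible.

```promql
# Step 1: raw cumulative bucket samples from the last minute (range vector)
tidb_server_handle_query_duration_seconds_bucket[1m]

# Step 2: per-second growth of every bucket, one series per label combination
rate(tidb_server_handle_query_duration_seconds_bucket[1m])

# Step 3: one series per bucket boundary, merged across all other labels
sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) by (le)

# Step 4: the final p99.9 estimate in seconds
histogram_quantile(0.999, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) by (le))
```

If the final value looks wrong, checking step 3 first usually shows whether the buckets themselves have shifted.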

[Figures: Duration panel query; selector step; rate step; sum by le step; quantile step]

6. Metric types in practice

6.1 Counter

A cumulative value that only increases, except when it resets to zero (for example on process restart).

6.2 Gauge

A value that can go up or down, such as a connection count or memory usage.

6.3 Histogram

Counts observations in cumulative buckets, each labeled with an upper bound (le); useful for latency distributions.

6.4 Summary

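Quantiles are estimated on the client side and exported as precomputed values, which cannot be meaningfully aggregated across instances.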

In TiDB ecosystem practice, histograms are more commonly used for cross-instance quantile analysis in Prometheus + Grafana workflows.
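
Histogram series are plain counters, so they can be summed across instances before any further computation. As a minimal sketch, using the _sum and _count series that Prometheus histograms expose alongside the buckets, the cluster-wide mean query latency can be computed as:

```promql
# Mean query latency in seconds across all instances over the last minute:
# total observed seconds per second, divided by queries per second.
sum(rate(tidb_server_handle_query_duration_seconds_sum[1m]))
/
sum(rate(tidb_server_handle_query_duration_seconds_count[1m]))
```

The same kind of aggregation is not possible with the precomputed quantiles of a summary.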

7. Practical troubleshooting hints

When a latency panel rises unexpectedly:

  1. Verify whether the increase is global or limited to a subset of instance/sql_type values
  2. Check how the bucket distribution (le) has shifted before relying on a single quantile line
  3. Compare rate() and increase() views in the same window
  4. Correlate with TiDB/TiKV resource metrics (CPU, IO, lock wait)
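
For the first hint, one possible approach is to keep an extra label in the aggregation so that the quantile is computed per group. The two queries below are separate and are a sketch rather than the exact panels shipped with TiDB dashboards.

```promql
# p99.9 per TiDB instance: shows whether a single node is the outlier
histogram_quantile(0.999, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) by (le, instance))

# p99.9 per SQL type: shows whether one statement class drives the increase
histogram_quantile(0.999, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) by (le, sql_type))
```

The le label must always remain in the by clause, since histogram_quantile() needs the bucket boundaries to work.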

8. References