
Reading TiCDC monitoring dashboards (Legacy notes)

A practical guide to the most useful TiCDC/TiKV CDC metrics: what they mean and how to use them for troubleshooting.

This page is a legacy note that summarizes how to interpret common TiCDC dashboards. It is intentionally “operational”: what a metric means, how it changes in normal vs abnormal conditions, and what to check next.

1. Preface: what dashboards can (and cannot) tell you

  • Prometheus scrapes on an interval (often 15s), so dashboards are sampling views, not perfect ground truth.
  • Most troubleshooting should correlate multiple signals: lag, throughput, error counters, IO, and queue/backpressure.
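The sampling caveat is worth internalizing: a counter scraped every 15s can only give you the average rate between two samples, so a short burst inside one window is smoothed out. A minimal sketch (the `rate` helper here is illustrative, a simplified analogue of what PromQL's `rate()` computes between adjacent samples):

```python
def rate(samples):
    """samples: list of (timestamp_s, counter_value) pairs from a scraped counter.

    Returns per-second rates between adjacent samples. Anything that happened
    *within* a scrape interval is averaged away -- a 1s burst of 300 events in a
    15s window looks identical to a steady 20/s.
    """
    return [
        (t2, (v2 - v1) / (t2 - t1))
        for (t1, v1), (t2, v2) in zip(samples, samples[1:])
    ]

# A burst of 300 events early in the first window, then nothing:
print(rate([(0, 0), (15, 300), (30, 300)]))  # [(15, 20.0), (30, 0.0)]
```

This is why a "spiky" workload can look flat on a dashboard, and why correlating several signals beats trusting any single panel.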

2. The panels that usually matter first

2.1 Changefeed / lag

Start here when you see “replication is slow”:

  • Checkpoint ts / lag
  • Resolved ts / lag
  • Unresolved regions (if available)
  • Per-capture table distribution (imbalance can cause tail latency)
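If a panel shows raw checkpoint/resolved TSOs rather than a lag value, you can convert them yourself: a TiDB TSO packs a physical Unix timestamp in milliseconds into the high bits and an 18-bit logical counter into the low bits. A small sketch:

```python
PHYSICAL_SHIFT = 18  # low 18 bits of a TSO are the logical counter

def tso_to_unix_ms(tso: int) -> int:
    """Extract the physical part (Unix time in ms) from a TiDB TSO."""
    return tso >> PHYSICAL_SHIFT

def lag_seconds(checkpoint_tso: int, now_unix_ms: int) -> float:
    """Checkpoint lag = wall clock minus the checkpoint's physical timestamp."""
    return (now_unix_ms - tso_to_unix_ms(checkpoint_tso)) / 1000.0
```

For example, a checkpoint TSO whose physical part is 5s behind the current wall clock yields `lag_seconds(...) == 5.0`. Resolved-ts lag is computed the same way from the resolved TSO.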

Practical reasoning:

  • Resolved lag grows linearly → often “some regions/tables are stuck initializing or backpressured”.
  • Checkpoint lag grows but resolved lag is stable → sink/downstream may be slow.
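The two rules above can be sketched as a toy heuristic. The window length, threshold, and return strings are all illustrative; real dashboards need eyeballing, but the decision structure is the same:

```python
def diagnose(resolved_lag: list, checkpoint_lag: list, eps: float = 1.0) -> str:
    """Toy heuristic over two lag series (seconds, oldest first).

    eps is an illustrative growth threshold over the window, in seconds.
    """
    def growing(series):
        return series[-1] - series[0] > eps

    if growing(resolved_lag):
        # Resolved ts is not advancing: some regions/tables are stuck
        # initializing or backpressured upstream of the sink.
        return "upstream/internal: regions or tables stuck"
    if growing(checkpoint_lag):
        # Resolved ts advances but the checkpoint cannot keep up:
        # data is flowing in, but flushing out is slow.
        return "downstream: sink is likely slow"
    return "stable"
```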

2.2 Sink / downstream

Most sustained lag problems end up being downstream bottlenecks:

  • Sink flush latency / batch size
  • Error counters / retries
  • Disk IO (if sorter/redo is enabled and spill-to-disk happens)
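Flush-latency panels are usually rendered from Prometheus histograms via `histogram_quantile`, which linearly interpolates inside cumulative `le` buckets. Knowing what that interpolation does helps you read p99 panels skeptically (e.g. the p99 can never exceed the highest finite bucket bound). A sketch of the same computation over cumulative buckets:

```python
def histogram_quantile(q: float, buckets) -> float:
    """Approximate quantile from cumulative histogram buckets, in the spirit of
    Prometheus's histogram_quantile.

    buckets: sorted list of (upper_bound, cumulative_count), like `le` buckets.
    Linearly interpolates within the bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Linear interpolation inside this bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 100)]` (seconds), the p50 is 0.1s but the p99 interpolates to 0.95s: wide upper buckets make tail quantiles coarse, so a "p99 = 0.95s" panel may really mean "somewhere between 0.5s and 1s".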

2.3 KV client / TiKV CDC

When initialization is slow or when backpressure is suspected:

  • gRPC receive rate / message size
  • Backoff / retry metrics from client-go
  • TiKV-side CDC CPU/network/memory (if available)
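When reading backoff/retry panels, it helps to remember the shape of the delays behind them. The sketch below is a generic exponential backoff with a cap, similar in spirit to client-go's retry behavior (the constants are illustrative, not client-go's actual defaults):

```python
import random

def backoff_delays(base_ms: int = 100, cap_ms: int = 10_000,
                   attempts: int = 6, jitter: bool = False) -> list:
    """Exponential backoff schedule: base * 2^i, clamped at cap_ms.

    With jitter=True each delay is drawn uniformly from [0, clamped delay],
    which spreads retries out and avoids synchronized retry storms.
    """
    delays = []
    for i in range(attempts):
        d = min(cap_ms, base_ms * (2 ** i))
        if jitter:
            d = random.uniform(0, d)
        delays.append(d)
    return delays
```

A sustained high retry rate with delays pinned at the cap is a different signal than a brief burst of short retries: the former usually means a persistent upstream problem (region unavailability, TiKV overload), not a transient blip.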

3. Coordination and control-plane

3.1 PD / etcd health

TiCDC relies on PD/etcd for coordination state. Watch:

  • etcd txn latency / size
  • etcd health checks
  • PD leader changes (correlate with instability)
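Correlating leader changes with lag spikes is mostly eyeballing two panels, but the idea can be made mechanical. A sketch (timestamps in Unix seconds; the 60s window is an arbitrary choice, and proximity in time is a hint, not proof of cause):

```python
def correlate(spikes, leader_changes, window_s: int = 60) -> dict:
    """For each lag-spike timestamp, list PD leader-change events within window_s.

    An empty list for a spike means that spike is probably unrelated to
    control-plane churn and you should look elsewhere (sink, TiKV, sorter).
    """
    return {
        s: [e for e in leader_changes if abs(e - s) <= window_s]
        for s in spikes
    }
```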

4. “How to use this page”

This note is not exhaustive. Use it as a checklist:

  1. Identify if the symptom is lag, errors, or resource saturation.
  2. Determine whether it’s downstream, TiCDC internal, or upstream TiKV.
  3. Confirm with at least one “boundary signal” (e.g., sink flush slow → sorter blocks → upstream recv slows).
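The checklist can be condensed into a triage function. The boolean signal names below are illustrative placeholders, to be fed from whatever panels your dashboards actually expose, and the returned strings are hints, not verdicts:

```python
def triage(checkpoint_lag_growing: bool,
           resolved_lag_growing: bool,
           sink_flush_slow: bool,
           kv_recv_rate_low: bool) -> str:
    """Rough triage following the checklist above. All inputs are hypothetical
    signals you derive from dashboards; thresholds are up to you."""
    if not (checkpoint_lag_growing or resolved_lag_growing):
        return "no lag symptom: check error counters and resource saturation instead"
    if resolved_lag_growing and kv_recv_rate_low:
        return "upstream TiKV: slow scans or backpressured regions"
    if checkpoint_lag_growing and sink_flush_slow:
        return "downstream: sink flush is the bottleneck"
    return "TiCDC internal: check sorter, memory quota, and table scheduling"
```

The boundary-signal step is what the two middle branches encode: a downstream diagnosis should be confirmed by slow sink flushes, and an upstream one by a depressed KV receive rate.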
