Reading TiCDC monitoring dashboards (Legacy notes)
A practical guide to the most useful TiCDC/TiKV CDC metrics: what they mean and how to use them for troubleshooting.
This page is a legacy note and is intentionally "operational": for each panel, what the metric means, how it behaves in normal vs abnormal conditions, and what to check next.
1. Preface: what dashboards can (and cannot) tell you
- Prometheus scrapes on an interval (often 15s), so dashboards are sampling views, not perfect ground truth.
- Most troubleshooting should correlate multiple signals: lag, throughput, error counters, IO, and queue/backpressure.
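Because dashboards are sampled views, a quick sanity check before drawing conclusions is whether two signals actually move together. A minimal sketch (plain Python, no Prometheus client; the series are assumed to be pre-fetched lists of samples taken at the same scrape interval):

```python
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equally-sampled metric series."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Example: does checkpoint lag track sink flush latency?
lag = [1.0, 2.0, 4.0, 8.0, 9.0]      # seconds, one sample per scrape
flush = [0.1, 0.2, 0.4, 0.8, 0.9]    # seconds, same scrape timestamps
print(round(pearson(lag, flush), 3))  # 1.0 (perfect co-movement in this toy data)
```

A correlation near 1.0 does not prove causation, but it tells you which pair of panels is worth reading side by side.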
2. The panels that usually matter first
2.1 Changefeed / lag
Start here when you see “replication is slow”:
- Checkpoint ts / lag
- Resolved ts / lag
- Unresolved regions (if available)
- Per-capture table distribution (imbalance can cause tail latency)
Practical reasoning:
- Resolved lag grows linearly → often “some regions/tables are stuck initializing or backpressured”.
- Checkpoint lag grows but resolved lag is stable → sink/downstream may be slow.
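The two reasoning rules above can be condensed into a small triage helper. A sketch, assuming `checkpoint_lag` and `resolved_lag` are recent lag samples in seconds, oldest first, and the growth threshold is an illustrative constant:

```python
def lag_hint(checkpoint_lag: list[float], resolved_lag: list[float],
             grow_threshold: float = 1.0) -> str:
    """Rough triage from lag trends (last sample minus first sample)."""
    cp_growth = checkpoint_lag[-1] - checkpoint_lag[0]
    rs_growth = resolved_lag[-1] - resolved_lag[0]
    if rs_growth > grow_threshold:
        return "resolved lag growing: check stuck/initializing regions or backpressure"
    if cp_growth > grow_threshold:
        return "checkpoint lag growing, resolved stable: check sink/downstream"
    return "both stable: lag is not the active bottleneck"

print(lag_hint([1.0, 5.0, 20.0], [1.0, 1.0, 1.5]))
# checkpoint lag growing, resolved stable: check sink/downstream
```

A trend over a handful of scrape samples is noisy; treat the output as a pointer to the next panel, not a verdict.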
2.2 Sink / downstream
Most sustained lag problems end up being downstream bottlenecks:
- Sink flush latency / batch size
- Error counters / retries
- Disk IO (if sorter/redo is enabled and spill-to-disk happens)
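Sink flush latency panels are typically `histogram_quantile()` over cumulative `_bucket` series; reproducing that interpolation by hand helps when reading raw bucket counts. A sketch of the same linear interpolation (the bucket bounds and counts are illustrative):

```python
import math

def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate a quantile from cumulative histogram buckets, the way
    Prometheus's histogram_quantile() does (linear interpolation).
    buckets: (upper_bound, cumulative_count), sorted, last bound is +inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +inf bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Flush-latency buckets in seconds (illustrative numbers):
buckets = [(0.1, 50.0), (0.5, 90.0), (math.inf, 100.0)]
print(quantile_from_buckets(0.9, buckets))  # 0.5
```

This also explains a common dashboard artifact: a p99 that sits exactly on a bucket boundary usually means the true value is somewhere inside a wide bucket, not precisely at the boundary.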
2.3 KV client / TiKV CDC
When initialization is slow or when backpressure is suspected:
- gRPC receive rate / message size
- Backoff / retry metrics from client-go
- TiKV-side CDC CPU/network/memory (if available)
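The backoff/retry counters generally reflect capped exponential backoff, so a rising retry rate with flat delays means the client is already at its cap. A sketch of that delay shape (the base and cap constants are illustrative; real clients also add jitter, omitted here for determinism):

```python
def backoff_ms(attempt: int, base_ms: int = 100, cap_ms: int = 10_000) -> int:
    """Capped exponential backoff delay for the given retry attempt (0-based)."""
    return min(cap_ms, base_ms * (2 ** attempt))

print([backoff_ms(a) for a in range(8)])
# [100, 200, 400, 800, 1600, 3200, 6400, 10000]
```

Reading the panel with this shape in mind: a burst of retries early in a changefeed is normal warm-up; retries that persist at the cap indicate a stuck region or an unhealthy TiKV peer.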
3. Coordination and control-plane
3.1 PD / etcd health
TiCDC relies on PD/etcd for coordination state. Watch:
- etcd txn latency / size
- etcd health checks
- PD leader changes (correlate with instability)
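Leader changes can be counted directly from a sampled leader-identity series. A sketch, assuming `samples` holds the leader name observed at each consecutive scrape:

```python
def count_leader_changes(samples: list[str]) -> int:
    """Number of leadership transitions in a chronological sample series."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

print(count_leader_changes(["pd-1", "pd-1", "pd-2", "pd-2", "pd-1"]))  # 2
```

Because of the scrape interval, a flip-and-back between two scrapes can be invisible; treat a nonzero count as a lower bound on actual transitions.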
4. “How to use this page”
This note is not exhaustive. Use it as a checklist:
- Identify if the symptom is lag, errors, or resource saturation.
- Determine whether it’s downstream, TiCDC internal, or upstream TiKV.
- Confirm with at least one “boundary signal” (e.g., sink flush slow → sorter blocks → upstream recv slows).
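The checklist can be sketched as a rule-of-thumb triage function. All thresholds and parameter names here are illustrative assumptions, not TiCDC metric names:

```python
def triage(sink_flush_p99_s: float, kv_recv_rate_mb_s: float,
           resolved_lag_s: float, checkpoint_lag_s: float) -> str:
    """Very rough first-pass localization of a replication-lag symptom."""
    if checkpoint_lag_s > 60 and sink_flush_p99_s > 1.0:
        return "likely downstream: sink flushes are slow"
    if resolved_lag_s > 60 and kv_recv_rate_mb_s < 1.0:
        return "likely upstream: KV client barely receiving data"
    if resolved_lag_s > 60:
        return "likely TiCDC internal: sorter/backpressure between recv and sink"
    return "no obvious lag bottleneck; check errors and resource saturation"

print(triage(sink_flush_p99_s=2.5, kv_recv_rate_mb_s=50.0,
             resolved_lag_s=5.0, checkpoint_lag_s=120.0))
# likely downstream: sink flushes are slow
```

The output is only a starting hypothesis; confirm it with a boundary signal as described above before acting.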