Reading TiCDC monitoring dashboards (Legacy notes)
A practical guide to the most useful TiCDC/TiKV CDC metrics: what they mean and how to use them for troubleshooting.
This page is a legacy note and is intentionally "operational": for each panel, what the metric means, how it behaves in normal vs abnormal conditions, and what to check next.
1. Preface: what dashboards can (and cannot) tell you
- Prometheus scrapes on an interval (often 15s), so dashboards are sampling views, not perfect ground truth.
- Most troubleshooting should correlate multiple signals: lag, throughput, error counters, IO, and queue/backpressure.
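Because dashboards are sampled views, a quick sanity check before drawing conclusions is whether two signals actually move together. A minimal sketch (plain Python, no Prometheus client; the series are assumed to be pre-fetched lists of samples taken at the same scrape interval):

```python
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equally-sampled metric series."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Example: does checkpoint lag track sink flush latency?
lag = [1.0, 2.0, 4.0, 8.0, 9.0]      # seconds, one sample per scrape
flush = [0.1, 0.2, 0.4, 0.8, 0.9]    # seconds, same scrape timestamps
print(round(pearson(lag, flush), 3))  # 1.0 (perfect co-movement in this toy data)
```

A correlation near 1.0 does not prove causation, but it tells you which pair of panels is worth reading side by side.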
2. The panels that usually matter first
2.1 Changefeed / lag
Start here when you see “replication is slow”:
- Checkpoint ts / lag
- Resolved ts / lag
- Unresolved regions (if available)
- Per-capture table distribution (imbalance can cause tail latency)
Practical reasoning:
- Resolved lag grows linearly → often “some regions/tables are stuck initializing or backpressured”.
- Checkpoint lag grows but resolved lag is stable → sink/downstream may be slow.
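The two reasoning rules above can be condensed into a small triage helper. A sketch, assuming `checkpoint_lag` and `resolved_lag` are recent lag samples in seconds, oldest first, and the growth threshold is an illustrative constant:

```python
def lag_hint(checkpoint_lag: list[float], resolved_lag: list[float],
             grow_threshold: float = 1.0) -> str:
    """Rough triage from lag trends (last sample minus first sample)."""
    cp_growth = checkpoint_lag[-1] - checkpoint_lag[0]
    rs_growth = resolved_lag[-1] - resolved_lag[0]
    if rs_growth > grow_threshold:
        return "resolved lag growing: check stuck/initializing regions or backpressure"
    if cp_growth > grow_threshold:
        return "checkpoint lag growing, resolved stable: check sink/downstream"
    return "both stable: lag is not the active bottleneck"

print(lag_hint([1.0, 5.0, 20.0], [1.0, 1.0, 1.5]))
# checkpoint lag growing, resolved stable: check sink/downstream
```

A trend over a handful of scrape samples is noisy; treat the output as a pointer to the next panel, not a verdict.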
2.2 Sink / downstream
Most sustained lag problems end up being downstream bottlenecks:
- Sink flush latency / batch size
- Error counters / retries
- Disk IO (if sorter/redo is enabled and spill-to-disk happens)
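Sink flush latency panels are typically `histogram_quantile()` over cumulative `_bucket` series; reproducing that interpolation by hand helps when reading raw bucket counts. A sketch of the same linear interpolation (the bucket bounds and counts are illustrative):

```python
import math

def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate a quantile from cumulative histogram buckets, the way
    Prometheus's histogram_quantile() does (linear interpolation).
    buckets: (upper_bound, cumulative_count), sorted, last bound is +inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +inf bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Flush-latency buckets in seconds (illustrative numbers):
buckets = [(0.1, 50.0), (0.5, 90.0), (math.inf, 100.0)]
print(quantile_from_buckets(0.9, buckets))  # 0.5
```

This also explains a common dashboard artifact: a p99 that sits exactly on a bucket boundary usually means the true value is somewhere inside a wide bucket, not precisely at the boundary.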
2.3 KV client / TiKV CDC
When initialization is slow or when backpressure is suspected:
- gRPC receive rate / message size
- Backoff / retry metrics from client-go
- TiKV-side CDC CPU/network/memory (if available)
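The backoff/retry counters generally reflect capped exponential backoff, so a rising retry rate with flat delays means the client is already at its cap. A sketch of that delay shape (the base and cap constants are illustrative; real clients also add jitter, omitted here for determinism):

```python
def backoff_ms(attempt: int, base_ms: int = 100, cap_ms: int = 10_000) -> int:
    """Capped exponential backoff delay for the given retry attempt (0-based)."""
    return min(cap_ms, base_ms * (2 ** attempt))

print([backoff_ms(a) for a in range(8)])
# [100, 200, 400, 800, 1600, 3200, 6400, 10000]
```

Reading the panel with this shape in mind: a burst of retries early in a changefeed is normal warm-up; retries that persist at the cap indicate a stuck region or an unhealthy TiKV peer.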
3. Coordination and control-plane
3.1 PD / etcd health
TiCDC relies on PD/etcd for coordination state. Watch:
- etcd txn latency / size
- etcd health checks
- PD leader changes (correlate with instability)
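Leader changes can be counted directly from a sampled leader-identity series. A sketch, assuming `samples` holds the leader name observed at each consecutive scrape:

```python
def count_leader_changes(samples: list[str]) -> int:
    """Number of leadership transitions in a chronological sample series."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

print(count_leader_changes(["pd-1", "pd-1", "pd-2", "pd-2", "pd-1"]))  # 2
```

Because of the scrape interval, a flip-and-back between two scrapes can be invisible; treat a nonzero count as a lower bound on actual transitions.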
4. “How to use this page”
This note is not exhaustive. Use it as a checklist:
- Identify if the symptom is lag, errors, or resource saturation.
- Determine whether it’s downstream, TiCDC internal, or upstream TiKV.
- Confirm with at least one “boundary signal” (e.g., sink flush slow → sorter blocks → upstream recv slows).
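The checklist can be sketched as a rule-of-thumb triage function. All thresholds and parameter names here are illustrative assumptions, not TiCDC metric names:

```python
def triage(sink_flush_p99_s: float, kv_recv_rate_mb_s: float,
           resolved_lag_s: float, checkpoint_lag_s: float) -> str:
    """Very rough first-pass localization of a replication-lag symptom."""
    if checkpoint_lag_s > 60 and sink_flush_p99_s > 1.0:
        return "likely downstream: sink flushes are slow"
    if resolved_lag_s > 60 and kv_recv_rate_mb_s < 1.0:
        return "likely upstream: KV client barely receiving data"
    if resolved_lag_s > 60:
        return "likely TiCDC internal: sorter/backpressure between recv and sink"
    return "no obvious lag bottleneck; check errors and resource saturation"

print(triage(sink_flush_p99_s=2.5, kv_recv_rate_mb_s=50.0,
             resolved_lag_s=5.0, checkpoint_lag_s=120.0))
# likely downstream: sink flushes are slow
```

The output is only a starting hypothesis; confirm it with a boundary signal as described above before acting.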