
TiCDC resolved-ts lag spikes: incremental scan stuck due to slow disk

A real-world incident where slow TiCDC disk I/O (Unified Sorter spill-to-disk) caused backpressure, stalled TiKV incremental scans, and drove changefeed resolved-ts lag up.

1. Symptoms

You can think of resolved-ts as a global watermark flowing from TiKV to TiCDC. If it cannot advance in time, both the changefeed checkpoint and downstream consumption progress stall.

  • A changefeed (or table) resolved-ts is essentially the minimum resolved-ts across all Regions being replicated
  • If some Regions stay in the "CDC initialization not finished" state for a long time (the TiKV incremental/initial scan has not completed), TiCDC will not advance those Regions' resolved-ts
  • The slowest Region's resolved-ts is stuck → the global resolved-ts is stuck → resolved-ts lag keeps increasing over time, often linearly
  • resolved-ts lag is also a lower bound for checkpoint-ts lag, so checkpoint lag can keep increasing as well
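
The min-over-Regions rule above can be sketched in a few lines of Go. The `regionState` type and `changefeedResolvedTs` function are hypothetical simplifications for illustration, not TiCDC's actual types; they only show why one Region stuck in its initial scan pins the whole changefeed's watermark:

```go
package main

import "fmt"

// regionState is a hypothetical, simplified view of what TiCDC tracks per
// Region: whether the initial (incremental) scan finished, and the Region's
// last resolved-ts.
type regionState struct {
	initialized bool
	resolvedTs  uint64
}

// changefeedResolvedTs applies the rule above: the changefeed-level
// resolved-ts is the minimum over all replicated Regions. A Region stuck in
// its initial scan keeps its stale resolved-ts and pins the watermark.
func changefeedResolvedTs(regions []regionState) uint64 {
	min := ^uint64(0)
	for _, r := range regions {
		if r.resolvedTs < min {
			min = r.resolvedTs
		}
	}
	return min
}

func main() {
	regions := []regionState{
		{initialized: true, resolvedTs: 450100},
		{initialized: true, resolvedTs: 450200},
		{initialized: false, resolvedTs: 400000}, // initial scan not finished
	}
	fmt.Println(changefeedResolvedTs(regions)) // pinned at 400000
}
```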

Typical business impact: downstream (MQ / MySQL / other sinks) consumption latency increases and triggers SLA alerts.

2. Quick diagnosis

  1. Besides the changefeed resolved-ts lag growing linearly, the incremental scan speed is also very slow (screenshot: tikv-incremental-scan-speed)

  2. The unresolved region count stays non-zero for a long time and never drops to 0 (screenshot: unresolved-region-count)

  3. With an obvious backlog (large resolved-ts lag), disk write throughput on the TiCDC node should be high; instead the observed write throughput is low (e.g. write speed < 50 MB/s) (screenshot: issue-point-write-speed)

  4. From experience: if TiCDC itself has no obvious bottleneck and the TiKV cluster is large enough, aggregate incremental scan throughput (all TiKVs combined) can reach ~200 GB/s; against that baseline, the observed write throughput is clearly too low.
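
A quick back-of-envelope calculation shows why the observed write speed matters. The numbers below (100 GiB backlog, 50 vs 200 MiB/s) are hypothetical readings, not values from this incident; the point is that drain time scales inversely with disk throughput:

```go
package main

import "fmt"

// drainSeconds gives a back-of-envelope estimate: scanning and spilling
// `backlog` bytes at a sustained throughput of `rate` bytes/s takes
// backlog/rate seconds.
func drainSeconds(backlog, rate float64) float64 {
	return backlog / rate
}

func main() {
	const mib = 1024.0 * 1024.0
	backlog := 100 * 1024 * mib // hypothetical: 100 GiB of events still to scan
	fmt.Printf("at  50 MiB/s: %.0f s\n", drainSeconds(backlog, 50*mib))  // 2048 s (~34 min)
	fmt.Printf("at 200 MiB/s: %.0f s\n", drainSeconds(backlog, 200*mib)) // 512 s (~8.5 min)
}
```

If new writes keep arriving faster than the disk can spill them, the backlog never drains and the lag grows without bound.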

3. Quick mitigation

There is no effective mitigation short of faster storage: unless TiCDC's data directory is moved to a higher-performance disk, the lag will not recover.
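
To confirm the disk is the bottleneck before swapping hardware, a crude sequential-write spot-check with `dd` is often enough. The test path below is a placeholder; point it at the volume actually backing TiCDC's data directory:

```shell
# Rough spot-check of sequential write throughput on the TiCDC data disk.
# TESTFILE is a placeholder path: use a file on the actual data-dir volume.
TESTFILE=/tmp/ticdc-disk-test
# Write 256 MiB and fsync before timing stops (conv=fdatasync), so the page
# cache doesn't flatter the result; dd prints the throughput on its last line.
dd if=/dev/zero of="$TESTFILE" bs=1M count=256 conv=fdatasync 2>&1 | tail -n 1
rm -f "$TESTFILE"
```

A sustained result well below ~200 MB/s on a disk that is supposed to be an SSD is consistent with the symptoms above. For a more rigorous measurement (random I/O, queue depths), use `fio` instead.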

4. Full resolution

  • Move TiCDC's data directory (data-dir) to a high-performance disk (higher IOPS, lower latency, e.g. local NVMe or a high-end SSD)
  • See the official hardware recommendation: a 500 GB+ SSD
  • After upgrading the disk, write speed should recover to ~200 MB/s (screenshot: recovered-write-speed-with-ssd)
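
For TiUP-managed deployments, one way to put TiCDC's sort/spill directory on the faster disk is the `data_dir` field of the `cdc_servers` section in the cluster topology. The host and path below are placeholders, not values from this incident:

```yaml
# tiup cluster topology fragment (host and paths are placeholders)
cdc_servers:
  - host: 10.0.1.20
    # Point the Unified Sorter spill directory at the NVMe/SSD volume
    data_dir: /nvme/ticdc-data
```

The equivalent for a manually started server is the `--data-dir` flag of `cdc server`.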

5. Root-cause chain

5.1 Summary

TiCDC's Unified Sorter is disk-heavy. When TiCDC disk performance is insufficient, spill-to-disk slows down and triggers TiCDC flow control/backpressure, which throttles TiKV's incremental scan sending. The incremental scan cannot finish in time, and resolved-ts lag rises abnormally.

5.2 Detailed mechanism

  1. A slow sorter blocks the processing path (e.g. engine.Add(...)), so the puller consumes more slowly
  2. The upstream KV client / region worker input channel (capacity regionWorkerInputChanSize) fills up and blocks, so gRPC Recv slows down
  3. gRPC flow control propagates the backpressure to TiKV, which blocks when sending scan events at send_all(...).await
  4. The incremental scan cannot finish in time, so the INITIALIZED event is never received/processed by TiCDC
  5. The Region is still treated as "not initialized" → TiCDC does not advance that Region's resolved-ts → the global watermark is pinned by the slowest Region → resolved-ts lag grows linearly
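
The propagation in steps 1-3 is ordinary bounded-channel backpressure. The sketch below is a minimal, hypothetical model (not TiCDC code): one hop with a channel of capacity 32, mirroring the region worker input channel size mentioned above, and a consumer that drains nothing. The producer can enqueue exactly the channel's capacity before it would block:

```go
package main

import "fmt"

// fillWithoutConsumer models one bounded hop in the pipeline: a producer
// tries to enqueue n events into a channel of the given capacity while the
// consumer (here: a stalled sorter) drains nothing. It returns how many
// events fit before the producer would block.
func fillWithoutConsumer(capacity, n int) int {
	ch := make(chan int, capacity)
	sent := 0
	for i := 0; i < n; i++ {
		select {
		case ch <- i:
			sent++
		default:
			// Channel full: the real code blocks here instead, which stalls
			// gRPC Recv and, via gRPC flow control, TiKV's send_all(...).await.
			return sent
		}
	}
	return sent
}

func main() {
	fmt.Println(fillWithoutConsumer(32, 100)) // prints 32
}
```

Because every hop is bounded, a stall at the deepest consumer (the disk) surfaces, hop by hop, as a stall at the original producer (TiKV's scan loop).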

5.3 Visualization

```mermaid
flowchart TB
  %% Data path: TiKV -> TiCDC -> sorter/sink
  subgraph TiKV["TiKV"]
    tikv_scan["CDC initializer<br/>incremental scan loop"]
    tikv_send["CDC gRPC send<br/>sink.send_all await"]
    tikv_scan -->|scan batch to events| tikv_send
  end

  subgraph TiCDC["TiCDC TiFlow"]
    cdc_recv["KV client receive<br/>shared_stream.receive Recv"]
    region_worker["KV client region worker<br/>inputCh=32 sendEvent"]
    puller["Puller<br/>inputCh=1024 runEventHandler"]
    sorter["Sort engine<br/>engine.Add unified pebble"]
    sink["Sink / redo / downstream<br/>disk IO heavy"]
  end

  tikv_send -->|gRPC stream ChangeDataEvents| cdc_recv
  cdc_recv -->|dispatch| region_worker
  region_worker -->|emit to eventCh| puller
  puller -->|consume to engine.Add| sorter
  sorter -->|flush compaction write| sink

  %% Backpressure / flow control: downstream slow -> upstream blocks (where it happens)
  sink -.->|disk slow write flush blocks| sorter
  sorter -.->|engine.Add blocks| puller
  puller -.->|inputCh fills emit blocks| region_worker
  region_worker -.->|inputCh fills sendEvent blocks| cdc_recv
  cdc_recv -.->|Recv slow gRPC flow control| tikv_send
  tikv_send -.->|send_all waits scan loop stalls| tikv_scan
```