
TiCDC log puller memory quota exhaustion: resolved-ts stuck and changefeeds stalled

When a subscription span has a huge number of Regions plus many holes/unlocked ranges, resolved-ts advancement becomes too expensive: dynstream pending grows until it triggers PauseArea, and resolved-ts stops advancing. Fixed by PR #4088.

1. Symptoms

Typical symptoms include:

  • Changefeed resolved-ts / checkpoint-ts stop advancing; downstream latency keeps growing.
  • Log puller (dynstream) pending memory usage keeps increasing and crosses the PauseArea threshold (default >= 80%).
  • After Pause, upstream event pushing gets blocked. From outside it looks like ā€œTiCDC is stuck and stops processing eventsā€, and the blast radius can expand to multiple changefeeds.

Public issue: #4084 Log Puller Memory Quota Exhaustion Leads to All Changefeed Getting Stuck

2. Quick diagnosis

You can usually confirm it quickly with metrics + logs:

  1. dynstream memory / queue metrics

    • log puller dynstream used/max keeps rising toward the 1GB cap (the default area max pending size).
    • pending queue length / event channel size keeps growing and doesn’t recover.
  2. Typical log signals

    • Pause/Resume feedback:

      subscription client pause push region event
      subscription client resume push region event
    • A large number of holes (unsubscribed/uncovered key ranges):

      subscription client holes exist
  3. Business impact

    • resolved-ts lag / checkpoint lag grows linearly
    • downstream delay increases and triggers alerts

3. Quick mitigation

The smallest mitigation is to enable resolved-ts advancement throttling:

  • Change kvclient.advance-interval-in-ms from 0 to 100 (or larger, e.g. 200) to reduce the frequency of global resolved-ts calculations.
  • Restart / rolling restart TiCDC to quickly clear the existing pending backlog (without the config change, the issue often recurs after a restart).

Example (TiCDC server config):

[kv-client]
advance-interval-in-ms = 100

Notes:

  • advance-interval-in-ms = 0 effectively means ā€œadvance resolved-ts as soon as possibleā€. With a very large number of Regions in a single span (e.g. hundreds of thousands), it can cause significant CPU/traversal overhead and queue buildup.
  • A moderate throttle (e.g. 100ms) is usually not noticeable for business latency, but it can greatly reduce per-event overhead.

4. Full resolution

Upgrade to a version that includes PR #4088 (or backport the patch):

PR #4088 includes two key changes:

  1. Default config change: set advance-interval-in-ms default from 0 to 100 to avoid high-frequency advancement overhead in large-Region scenarios.
  2. Advancement logic optimization: when advanceInterval > 0, handleResolvedTs advances only at the throttled cadence and uses a safer min-resolved-ts calculation path to reduce per-resolved-ts overhead.

5. Root-cause chain

5.1 Summary

When a subscription span (usually a table) contains an extremely large number of Regions and there are many holes/unlocked ranges, log puller's resolved-ts advancement repeatedly triggers an expensive ā€œglobal min resolved-tsā€ computation. Processing then falls behind TiKV's incoming resolved-ts rate, dynstream pending keeps growing until it triggers PauseArea, and resolved-ts stops advancing, leaving changefeeds stuck.

5.2 Detailed mechanism

The code links below are pinned to PR #4088’s merge commit for easier cross-checking.

  1. TiKV periodically sends ChangeDataEvents over a CDC gRPC stream. A resolved-ts batch can include a large number of Regions. TiCDC receives it in receiveAndDispatchChangeEvents and routes to dispatchResolvedTsEvent.
  2. dispatchResolvedTsEvent packages the batch into a single dynstream event: regionEvent{resolvedTs, states} where states contains all related Region states, and pushes it into dynstream (same link as above).
  3. When creating a dynstream path for the subscription span, the puller area maxPendingSize defaults to 1GB: Subscribe sets NewAreaSettingsWithMaxPendingSize(1*1024*1024*1024, ...).
  4. When dynstream handles a resolved-ts event, it iterates event.states and calls handleResolvedTs(...) once per Region: see the resolved-ts branch in regionEventHandler.Handle.
  5. Before the fix, the default kvclient.advance-interval-in-ms=0 (ā€œadvance as soon as possibleā€) effectively made handleResolvedTs(...) attempt advancement on every call and trigger the global-min computation hot path (handleResolvedTs → RangeLock.GetHeapMinTs).
  6. GetHeapMinTs() also reads unlockedRanges.getMinTs() (same link), and rangeTsMap.getMinTs() is a full btree Ascend traversal. With many holes/unlocked ranges, it becomes very expensive.
  7. The more Regions a resolved-ts batch contains, the more times handleResolvedTs(...) runs, and with advance-interval-in-ms=0 it can trigger a huge number of btree traversals → processing can’t keep up with incoming rate → dynstream pending grows steadily.
  8. When pending usage crosses 80%, puller memory control triggers PauseArea (see thresholds in ShouldPauseArea). The subscription client sets paused=true and blocks upstream push in pushRegionEventToDS until Resume → the gRPC receive/dispatch chain is effectively stalled → backpressure propagates upstream to TiKV and the blast radius can expand.
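The cost blow-up in steps 4–7 can be sketched as a minimal Go program. The types and names below are illustrative, not TiCDC's actual code: the point is that with advance-interval-in-ms=0, each of the R region states in one resolved-ts batch triggers a full O(N) scan over all tracked ranges (a btree Ascend in TiCDC), so one batch costs roughly R*N range visits.

```go
package main

import "fmt"

// rangeEntry is a stand-in for one entry in the range-ts structure
// (in TiCDC this is a btree node; holes/unlocked ranges live here too).
type rangeEntry struct{ ts uint64 }

// getMinTs mimics rangeTsMap.getMinTs: a full scan over every tracked
// range to find the global minimum resolved-ts.
func getMinTs(ranges []rangeEntry) uint64 {
	min := uint64(1) << 62
	for _, r := range ranges {
		if r.ts < min {
			min = r.ts
		}
	}
	return min
}

func main() {
	const regionsPerBatch = 4 // R: region states in one resolved-ts batch
	const trackedRanges = 5   // N: ranges (including holes) in the RangeLock

	ranges := make([]rangeEntry, trackedRanges)
	for i := range ranges {
		ranges[i] = rangeEntry{ts: uint64(100 + i)}
	}

	// With advance-interval-in-ms=0, every per-region handleResolvedTs
	// call re-runs the global-min computation: R full scans per batch.
	scans := 0
	for i := 0; i < regionsPerBatch; i++ {
		_ = getMinTs(ranges)
		scans++
	}
	fmt.Printf("full scans per batch: %d (~R*N = %d range visits)\n",
		scans, scans*trackedRanges)
}
```

With hundreds of thousands of Regions in one span, both R and N are large, which is why the hot path cannot keep up with TiKV's sending rate.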

5.3 Visualization

flowchart TB
  subgraph TiKV["TiKV"]
    tikv_resolved["CDC stream send<br/>ResolvedTs batch (many Regions)"]
  end

  subgraph TiCDC["TiCDC (log puller / dynstream)"]
    recv["regionRequestWorker.Recv<br/>receiveAndDispatchChangeEvents"]
    pack["dispatchResolvedTsEvent<br/>regionEvent{resolvedTs, states[]}"]
    ds["dynstream pending (1GB max)<br/>MemoryControlForPuller"]
    handler["regionEventHandler.Handle<br/>for state in states: handleResolvedTs"]
    mincalc["RangeLock global min<br/>GetHeapMinTs / rangeTsMap.getMinTs"]
    pause["PauseArea (>=80%)<br/>ResumeArea (<50%)"]
    blocked["subscriptionClient.paused=true<br/>pushRegionEventToDS blocks"]
  end

  tikv_resolved --> recv --> pack --> ds --> handler --> mincalc

  mincalc -.->|too slow| handler
  handler -.->|can’t keep up| ds
  ds -.->|pending grows| pause -.-> blocked -.-> recv
  recv -.->|gRPC backpressure| tikv_resolved

5.4 Why it ā€œdoesn’t recoverā€ quickly after Pause

Even after Pause, it may stay stuck for a long time because:

  • Lower resume threshold (hysteresis): Pause triggers at >= 0.8, but Resume requires < 0.5, so you must drain a large portion of backlog.
  • Backlog is huge and draining is still slow: even if input stops, the accumulated resolved-ts events still need to be processed one by one; if advancement is still expensive, draining is very slow.
  • Potential blocking paths: when the handler enters await mode (waiting for downstream callbacks), dynstream marks the path as blocking. A blocking path won’t be popped, so memory may never drop below the Resume threshold.
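The hysteresis in the first bullet can be sketched as follows. This is an illustrative model of the thresholds described above (pause at >= 80%, resume below 50%), not TiCDC's actual ShouldPauseArea implementation: once paused, the area stays paused until pending drains below the much lower resume threshold.

```go
package main

import "fmt"

const (
	pauseRatio  = 0.8 // pause when pending >= 80% of max
	resumeRatio = 0.5 // resume only when pending < 50% of max
)

// areaMemControl is a toy stand-in for the puller's per-area memory control.
type areaMemControl struct {
	paused  bool
	maxSize int64
}

// onPendingChange applies the hysteresis: the pause and resume
// thresholds differ, so a paused area must drain a large portion of
// its backlog before events flow again.
func (a *areaMemControl) onPendingChange(pending int64) {
	ratio := float64(pending) / float64(a.maxSize)
	switch {
	case !a.paused && ratio >= pauseRatio:
		a.paused = true
	case a.paused && ratio < resumeRatio:
		a.paused = false
	}
}

func main() {
	a := &areaMemControl{maxSize: 1 << 30} // 1GB default area max
	for _, pending := range []int64{700 << 20, 850 << 20, 600 << 20, 400 << 20} {
		a.onPendingChange(pending)
		fmt.Printf("pending=%dMiB paused=%v\n", pending>>20, a.paused)
	}
}
```

Note that 600MiB (~59%) keeps the area paused even though it is well under the 80% pause line; only dropping below 50% resumes it.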

5.5 How #4088 fixes it

PR #4088 makes the hot path lighter via ā€œdefault throttling + logic splitā€:

  • Default advance-interval-in-ms=100, so global min calculation and advancement happen at most once per 100ms instead of on every resolved-ts event.
  • When advanceInterval > 0, handleResolvedTs avoids maintaining/updating min-state for every event and concentrates heavier work only at the throttled advancement point.

6. References