TiCDC log puller memory quota exhaustion: resolved-ts stuck and changefeeds stalled
When a subscription span has a huge number of Regions and many holes/unlocked ranges, resolved-ts advancement becomes too expensive: dynstream pending grows until it triggers PauseArea, and resolved-ts stops advancing. Fixed by PR #4088.
1. Symptoms
Typical symptoms include:
- Changefeed `resolved-ts` / `checkpoint-ts` stop advancing; downstream latency keeps growing.
- Log puller (dynstream) pending memory usage keeps increasing and hits PauseArea (default threshold >= 80%).
- After Pause, upstream event pushing gets blocked. From the outside it looks like "TiCDC is stuck and stops processing events", and the blast radius can expand to multiple changefeeds.
Public issue: #4084 Log Puller Memory Quota Exhaustion Leads to All Changefeed Getting Stuck
2. Quick diagnosis
You can usually confirm it quickly with metrics + logs:
- dynstream memory / queue metrics
  - log puller dynstream `used/max` keeps rising and approaches the 1GB cap (the default area max pending size is 1GB).
  - pending queue length / event channel size keeps growing and doesn't recover.
- Typical log signals
  - Pause/Resume feedback:
    - `subscription client pause push region event`
    - `subscription client resume push region event`
  - A large number of holes (unsubscribed/uncovered key ranges):
    - `subscription client holes exist`
- Business impact
  - resolved-ts lag / checkpoint lag grows linearly
  - downstream delay increases and triggers alerts
3. Quick mitigation
The smallest mitigation is to enable resolved-ts advancement throttling:
- Change `kvclient.advance-interval-in-ms` from `0` to `100` (or larger, e.g. `200`) to reduce the frequency of global resolved-ts calculations.
- Restart / rolling restart TiCDC to quickly clear the existing pending backlog (a restart without the config change often just reproduces the issue).
Example (TiCDC server config):

```toml
[kv-client]
advance-interval-in-ms = 100
```

Notes:
- `advance-interval-in-ms = 0` effectively means "advance resolved-ts as soon as possible". With a very large number of Regions in a single span (e.g. hundreds of thousands), it can cause significant CPU/traversal overhead and queue buildup.
- A moderate throttle (e.g. 100ms) is usually not noticeable for business latency, but it can greatly reduce per-event overhead.
4. Full resolution
Upgrade to a version that includes PR #4088 (or backport the patch):
PR #4088 includes two key changes:
- Default config change: the `advance-interval-in-ms` default goes from `0` to `100`, avoiding high-frequency advancement overhead in large-Region scenarios.
- Advancement logic optimization: when `advanceInterval > 0`, `handleResolvedTs` advances only at the throttled cadence and uses a safer min-resolved-ts calculation path to reduce per-resolved-ts overhead.
5. Root-cause chain
5.1 Summary
When a subscription span (usually a table) contains an extremely large number of Regions and there are many holes/unlocked ranges, the log puller's resolved-ts advancement can repeatedly trigger an expensive "global min resolved-ts" computation. Processing becomes slower than TiKV's incoming resolved-ts rate, dynstream pending keeps growing and triggers PauseArea, and resolved-ts stops advancing (changefeeds get stuck).
5.2 Detailed mechanism
The code links below are pinned to PR #4088's merge commit for easier cross-checking.
- TiKV periodically sends ChangeDataEvents over a CDC gRPC stream. A resolved-ts batch can include a large number of Regions. TiCDC receives it in `receiveAndDispatchChangeEvents` and routes it to `dispatchResolvedTsEvent`.
- `dispatchResolvedTsEvent` packages the batch into a single dynstream event, `regionEvent{resolvedTs, states}`, where `states` contains all related Region states, and pushes it into dynstream (same link as above).
- When creating a dynstream path for the subscription span, the puller area `maxPendingSize` defaults to 1GB: `Subscribe` sets `NewAreaSettingsWithMaxPendingSize(1*1024*1024*1024, ...)`.
- When dynstream handles a resolved-ts event, it iterates `event.states` and calls `handleResolvedTs(...)` once per Region: see the resolved-ts branch in `regionEventHandler.Handle`.
- Before the fix, the default `kvclient.advance-interval-in-ms = 0` ("advance as soon as possible") effectively made `handleResolvedTs(...)` attempt advancement on every call and trigger the global-min computation hot path (`handleResolvedTs` → `RangeLock.GetHeapMinTs`). `GetHeapMinTs()` also reads `unlockedRanges.getMinTs()` (same link), and `rangeTsMap.getMinTs()` is a full btree `Ascend` traversal. With many holes/unlocked ranges, it becomes very expensive.
- The more Regions a resolved-ts batch contains, the more times `handleResolvedTs(...)` runs; with `advance-interval-in-ms = 0` this can trigger a huge number of btree traversals → processing can't keep up with the incoming rate → dynstream pending grows steadily.
- When pending usage crosses 80%, puller memory control triggers `PauseArea` (see the thresholds in `ShouldPauseArea`). The subscription client sets `paused = true` and blocks upstream push in `pushRegionEventToDS` until Resume → the gRPC receive/dispatch chain is effectively stalled → backpressure propagates upstream to TiKV and the blast radius can expand.
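The cost blow-up in the chain above can be sketched in a few lines of Go. This is an illustrative model, not the actual TiCDC code: `getMinTs` stands in for `rangeTsMap.getMinTs`'s full btree traversal, and the counts are made-up but show the O(regions-per-batch × tracked-ranges) shape when the interval is 0.

```go
package main

import "fmt"

// rangeTs models one tracked hole/unlocked range and its resolved-ts.
type rangeTs struct {
	startKey string
	ts       uint64
}

// getMinTs stands in for the full-traversal minimum over all tracked
// ranges (rangeTsMap.getMinTs is a full btree Ascend; no cached minimum).
func getMinTs(ranges []rangeTs) uint64 {
	min := ^uint64(0)
	for _, r := range ranges {
		if r.ts < min {
			min = r.ts
		}
	}
	return min
}

func main() {
	// Illustrative holes/unlocked ranges; real spans can track far more.
	ranges := []rangeTs{{"a", 120}, {"h", 95}, {"p", 110}}
	regionsPerBatch := 1000
	traversals := 0
	// With advance-interval-in-ms = 0, handleResolvedTs attempts
	// advancement for every Region in the batch: one full traversal each.
	for i := 0; i < regionsPerBatch; i++ {
		_ = getMinTs(ranges)
		traversals++
	}
	fmt.Println(traversals, getMinTs(ranges)) // 1000 95
}
```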
5.3 Visualization
```mermaid
flowchart TB
    subgraph TiKV["TiKV"]
        tikv_resolved["CDC stream send<br/>ResolvedTs batch (many Regions)"]
    end
    subgraph TiCDC["TiCDC (log puller / dynstream)"]
        recv["regionRequestWorker.Recv<br/>receiveAndDispatchChangeEvents"]
        pack["dispatchResolvedTsEvent<br/>regionEvent{resolvedTs, states[]}"]
        ds["dynstream pending (1GB max)<br/>MemoryControlForPuller"]
        handler["regionEventHandler.Handle<br/>for state in states: handleResolvedTs"]
        mincalc["RangeLock global min<br/>GetHeapMinTs / rangeTsMap.getMinTs"]
        pause["PauseArea (>=80%)<br/>ResumeArea (<50%)"]
        blocked["subscriptionClient.paused=true<br/>pushRegionEventToDS blocks"]
    end
    tikv_resolved --> recv --> pack --> ds --> handler --> mincalc
    mincalc -.->|too slow| handler
    handler -.->|can't keep up| ds
    ds -.->|pending grows| pause -.-> blocked -.-> recv
    recv -.->|gRPC backpressure| tikv_resolved
```

5.4 Why it "doesn't recover" quickly after Pause
Even after Pause, it may stay stuck for a long time because:
- Lower resume threshold (hysteresis): Pause triggers at `>= 0.8`, but Resume requires `< 0.5`, so a large portion of the backlog must drain first.
- The backlog is huge and draining is still slow: even if input stops, the accumulated resolved-ts events still need to be processed one by one; if advancement is still expensive, draining is very slow.
- Potential blocking paths: when the handler enters await mode (waiting for downstream callbacks), dynstream marks the path as blocking. A blocking path won't be popped, so memory may never drop below the Resume threshold.
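The pause/resume hysteresis described above can be sketched as a tiny state machine. This is a simplified model under the thresholds stated in this document (pause at >= 80%, resume below 50%); the type and method names are illustrative, not TiCDC's actual API.

```go
package main

import "fmt"

// memoryControl models the puller's area memory control with hysteresis:
// once paused, usage must fall well below the pause threshold to resume.
type memoryControl struct {
	max    int64 // max pending size (1GB by default in TiCDC)
	used   int64
	paused bool
}

// update recomputes the pause state from the current usage ratio.
func (m *memoryControl) update(used int64) {
	m.used = used
	ratio := float64(m.used) / float64(m.max)
	switch {
	case !m.paused && ratio >= 0.8:
		m.paused = true // PauseArea threshold crossed
	case m.paused && ratio < 0.5:
		m.paused = false // ResumeArea threshold reached
	}
}

func main() {
	m := &memoryControl{max: 100}
	m.update(85)          // crosses the 80% pause threshold
	fmt.Println(m.paused) // true
	m.update(70)          // still above 50%: stays paused (hysteresis)
	fmt.Println(m.paused) // true
	m.update(40)          // drops below 50%: resumes
	fmt.Println(m.paused) // false
}
```

The gap between the two thresholds is why a paused area stays stuck until a large fraction of the backlog drains.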
5.5 How #4088 fixes it
PR #4088 makes the hot path lighter via "default throttling + logic split":
- Default `advance-interval-in-ms = 100`, so global min calculation and advancement happen at most once per 100ms instead of on every resolved-ts event.
- When `advanceInterval > 0`, `handleResolvedTs` avoids maintaining/updating min-state for every event and concentrates the heavier work at the throttled advancement point only.