
TiCDC architecture model: streaming, semantics, and state (Legacy)

A conceptual view of TiCDC as a streaming system: source/sink abstraction, delivery semantics, and how checkpoint/resolved timestamps are maintained.

This page summarizes TiCDC from a “streaming system” perspective. It focuses on concepts that help you reason about correctness and lag, rather than version-specific configuration details.

1. Streaming vs batch (why CDC is streaming)

Data replication is a form of data processing: moving data from one “end” to another. Processing models are often grouped into:

  • Batch processing: bounded datasets; collect-then-process.
  • Streaming processing: unbounded datasets; event-driven processing with ordering/time semantics.

CDC is naturally a streaming problem: the dataset is unbounded, and correctness often depends on ordering and “time” (e.g., commit timestamps).

2. Source & sink abstraction

TiCDC can be understood via a standard streaming abstraction:

  • Source: produces a stream of change events
  • Sink: consumes the stream and outputs it to a target system

There are two important “source → sink” segments in the TiCDC pipeline:

  1. TiKV CDC component → TiCDC Capture (gRPC change event stream)
  2. TiCDC Puller → downstream sink (MySQL / MQ / etc.)

(Figure: TiCDC data flow model)
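The two "source → sink" segments above can be sketched with a minimal streaming abstraction. This is a conceptual sketch with hypothetical names (`ChangeEvent`, `run_pipeline`), not TiCDC's actual Go interfaces:

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol

@dataclass
class ChangeEvent:
    commit_ts: int              # upstream commit timestamp (TSO)
    key: bytes
    value: Optional[bytes]      # None represents a delete

class Source(Protocol):
    """Produces a stream of change events (e.g., the TiKV-side gRPC stream)."""
    def events(self) -> Iterable[ChangeEvent]: ...

class Sink(Protocol):
    """Consumes events and writes them to a target (MySQL, MQ, ...)."""
    def emit(self, event: ChangeEvent) -> None: ...

def run_pipeline(source: Source, sink: Sink) -> None:
    # One pipeline segment: pull events from the source, push to the sink.
    # In TiCDC the same shape appears twice: TiKV -> Capture, Puller -> downstream.
    for event in source.events():
        sink.emit(event)
```

The point of the abstraction is that correctness questions (ordering, duplicates, watermarks) can be asked of each segment independently.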

3. Delivery semantics: at-most-once vs at-least-once vs exactly-once

In streaming systems, delivery semantics define what happens when failures occur:

  1. At-most-once: events are processed at most once; failures can drop data.
  2. At-least-once: events can be processed multiple times; retries can cause duplicates.
  3. Exactly-once: effects are applied exactly once (requires stronger coordination).

TiCDC is typically described as at-least-once. Retries can introduce duplicates, so correctness relies on idempotency at the sink layer (or on downstream consumers that can handle duplicates).
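A toy sketch of why at-least-once plus an idempotent sink preserves correctness: if each event is applied as an upsert or delete keyed on the row, replaying the same events leaves the state unchanged. The event shape here is assumed for illustration, not TiCDC's wire format:

```python
def apply(state: dict, events) -> dict:
    # Idempotent apply: each event is an upsert/delete keyed by row key,
    # so applying the same event a second time changes nothing.
    for key, value in events:
        if value is None:
            state.pop(key, None)   # delete
        else:
            state[key] = value     # upsert (think REPLACE INTO)
    return state

events = [("k1", "v1"), ("k2", "v2"), ("k1", "v1")]  # duplicate from a retry
once = apply({}, events)
twice = apply(dict(once), events)  # full replay after a failure
assert once == twice == {"k1": "v1", "k2": "v2"}
```

A non-idempotent apply (for example, an increment) would diverge under the same replay, which is why at-least-once systems push idempotency to the sink.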

4. System architecture (roles and responsibilities)

TiCDC runs as a set of stateless nodes (Captures) with coordination stored in PD/etcd. A changefeed can be scheduled across multiple Captures.

Key concepts (high-level):

  • Owner: a special role elected among captures; responsible for global scheduling and coordination.
  • Processor: per-capture runtime that manages table pipelines for a changefeed.
  • Changefeed: a replication task definition (the “pipeline”).
  • Checkpoint TS: the point up to which events have been safely written to the downstream sink; replication can resume from here after a restart.
  • Resolved TS: a global watermark indicating all events before this ts are known/available for processing.

(Figure: TiCDC architecture)
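Conceptually, both timestamps behave like min-watermarks over many parallel units (regions, tables, pipelines): the global value can only advance as fast as the slowest unit. A sketch with made-up numbers:

```python
def global_resolved_ts(per_region_resolved: dict) -> int:
    # The global resolved ts is the minimum over all regions:
    # every event with commit_ts < this value is known to have arrived.
    return min(per_region_resolved.values())

regions = {"r1": 105, "r2": 110, "r3": 98}
assert global_resolved_ts(regions) == 98

# One slow region pins the watermark for the whole changefeed,
# no matter how far ahead the others are:
regions["r3"] = 99
assert global_resolved_ts(regions) == 99
```

Checkpoint TS follows the same min-over-units shape, but measured at the sink side (what has been emitted) rather than the source side (what is known to have arrived), which is why checkpoint ts ≤ resolved ts.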

5. Incremental scan + streaming

When a changefeed starts (or when a table is added/moved), TiCDC/TiKV perform an initialization sequence:

  • Incremental scan: scan historical changes from a start point (implementation details vary by version)
  • Continuous streaming: after the scan, TiKV continuously pushes new changes

Your practical takeaway:

  • If initialization for some regions/tables never completes, the changefeed's resolved ts is pinned at that point, and replication lag grows linearly with wall-clock time.
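The "lag grows linearly" symptom falls directly out of the watermark definition: if one region's initialization never finishes, its resolved ts stays fixed while the clock advances. Illustrative numbers only:

```python
def lag(now_ts: int, resolved_ts: int) -> int:
    # Replication lag is the distance between "now" and the watermark.
    return now_ts - resolved_ts

pinned = 100  # one region stuck mid incremental scan; its resolved ts never moves
# The changefeed's resolved ts cannot pass the stuck region,
# so lag grows one-for-one with elapsed time:
lags = [lag(now, pinned) for now in (110, 120, 130)]
assert lags == [10, 20, 30]
```

In practice this is why a monotonically growing lag graph often points at a few regions or tables that never finished initialization, rather than at overall throughput.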

6. State and fault tolerance (why duplicates can happen)

With at-least-once semantics, failures can cause retries and duplicates. Two common patterns:

  1. Upstream (TiKV) reconnects and re-sends changes after a leader change.
  2. A capture failure causes ownership/scheduling changes and re-creation of processors/table pipelines, which re-pulls changes from the last checkpoint.

As long as applying events downstream is idempotent (or conflicts resolve deterministically), duplicates do not break correctness, but they do increase load on the sink.
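Restart-from-checkpoint is the second duplicate source in the list above: after a processor is re-created, it re-pulls everything since the last checkpoint ts, so events emitted after the checkpoint but before the failure are delivered twice. A toy sketch with hypothetical names:

```python
def replay_from(events, checkpoint_ts: int):
    # After a restart, resume from the last checkpoint: everything with
    # commit_ts > checkpoint_ts is (re-)sent, including events that were
    # already emitted but not yet covered by an advanced checkpoint.
    return [e for e in events if e[0] > checkpoint_ts]

events = [(101, "a"), (102, "b"), (103, "c")]  # (commit_ts, payload)
first_run = replay_from(events, checkpoint_ts=100)   # emits all three
# Failure occurs after emitting everything, but the checkpoint only
# advanced to 101 before the crash:
second_run = replay_from(events, checkpoint_ts=101)
assert second_run == [(102, "b"), (103, "c")]        # 102 and 103 are duplicates
```

This is the standard trade-off of checkpoint-based recovery: advancing the checkpoint less often means cheaper bookkeeping but a larger duplicate window on failure.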
