TiDB DM: Architecture, Components, and Operations Overview

Integrated from DM docs: architecture, quick start, Master/Worker/Syncer, shard DDL, metrics, and troubleshooting.

DM (TiDB Data Migration) is used to continuously migrate MySQL (single instance or sharded cluster) into TiDB.

0. Quick Index

0.1 What DM Solves

  • Full migration: dump + load
  • Incremental replication: binlog -> TiDB
  • Shard merge: multiple upstream tables merged into one downstream table
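The shard-merge idea above boils down to a routing step: a rule maps many upstream shard tables onto one downstream table. The rule below is hypothetical (in DM, routing is configured as route-rules in the task YAML); this Python sketch only mimics the pattern-matching behavior.

```python
import fnmatch

# Hypothetical route rule: merge shard_* / user_* upstream tables into one
# downstream table. Illustrative only; real rules live in the task config.
ROUTE_RULES = [
    {"schema-pattern": "shard_*", "table-pattern": "user_*",
     "target-schema": "user", "target-table": "information"},
]

def route(schema: str, table: str) -> tuple:
    """Return the downstream (schema, table) for an upstream table."""
    for rule in ROUTE_RULES:
        if fnmatch.fnmatch(schema, rule["schema-pattern"]) and \
           fnmatch.fnmatch(table, rule["table-pattern"]):
            return rule["target-schema"], rule["target-table"]
    return schema, table  # no rule matched: keep the original name

print(route("shard_01", "user_0001"))  # both shards land in the same table
print(route("shard_02", "user_0042"))
```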

0.2 When to Prefer DM

  • You need MySQL-to-TiDB migration with a short downtime window
  • You need full load first, then incremental catch-up before cutover
  • Upstream is sharded MySQL and requires routing/merge

0.3 Common Non-DM Scenario

  • Replicating changes out of TiDB to downstream systems (that is TiCDC's job, not DM's)

1. Architecture and Data Flow

Integrated architecture diagram (from DM docs):

All DM Components

1.1 Key Object Relationships

  • Each (task, source) pair corresponds to one subtask; a task spanning N sources splits into N subtasks
  • A subtask runs on the dm-worker bound to its source
  • dm-master handles control plane (scheduling, election, coordination)
  • etcd handles metadata persistence and state sync

1.2 Main Data Flow

  1. Dumper exports upstream snapshot
  2. Loader imports into TiDB
  3. Syncer continuously consumes binlog and applies DML/DDL

If relay is enabled, the flow becomes:

Upstream MySQL -> Relay Log (local on worker) -> Syncer -> TiDB

2. Quick Start (Condensed)

2.1 Preconditions

  • binlog is enabled on upstream MySQL (with binlog_format=ROW)
  • binlog retention covers the catch-up window
  • account privileges and network connectivity are confirmed

2.2 Deployment Example (TiUP)

tiup install dm
tiup dm deploy dm-test v2.0.1 /home/tidb/dm/topology.yaml --user root -p
tiup dm start dm-test
tiup dm display dm-test

2.3 Example Upstream Data

create database user;
create database store;
create database log;

use user;
create table information (id int primary key, name varchar(20));
create table log (id int primary key, name varchar(20));

use log;
create table messages (id int primary key, name varchar(20));

3. DM Master (Control Plane)

DM Master Components

3.1 Etcd and High Availability

  • Master critical states are persisted in etcd
  • With election, only the leader executes key control logic

3.2 OpenAPI / dmctl

  • OpenAPI and dmctl are both operational entry points
  • Requests are handled by the leader; non-leaders forward them

3.3 Election

  • campaignLoop drives continuous leader election
  • The leader starts key components: Scheduler, Pessimist, Optimist

3.4 Scheduler

Scheduler handles:

  • worker registration/offline handling
  • keepalive state observation
  • source add/remove event watching
  • subtask scheduling and migration
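The scheduling step can be sketched as a matching problem: unbound sources get assigned to free workers. A minimal sketch, assuming an in-memory bindings map; the real Scheduler reacts to etcd watch events and persists bindings.

```python
# Illustrative sketch of the scheduler's bind step (not DM's internal API):
# assign each unbound source to a worker that has no source bound yet.
def bind_sources(sources, workers, bindings):
    """bindings: dict source -> worker; mutated in place and returned."""
    free = [w for w in workers if w not in bindings.values()]
    for s in sources:
        if s not in bindings and free:
            bindings[s] = free.pop(0)
    return bindings

b = bind_sources(["mysql-01", "mysql-02"], ["worker-1", "worker-2"], {})
```

On a worker failure, the master would drop that worker's binding and rerun the same matching so the orphaned source migrates to a free worker.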

4. DM Worker (Data Plane)

DM Worker Components

Typical components after worker startup:

  • KeepAlive
  • RelayHandler (optional)
  • SubTasks
  • Syncer

4.1 KeepAlive

  • Uses etcd lease/TTL heartbeat to maintain liveness
  • Default TTL is 1 minute
  • Master can reschedule if a worker becomes unhealthy
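The lease-based liveness check can be sketched without etcd: a worker counts as alive while heartbeats keep arriving within the TTL. The Lease class below is a toy stand-in for an etcd lease, not the real client API.

```python
import time

TTL = 60.0  # seconds; mirrors the default worker keepalive TTL of 1 minute

class Lease:
    """Toy lease: alive while the last heartbeat is within the TTL."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.last_beat = time.monotonic()

    def heartbeat(self):
        # The worker's KeepAlive loop would call this periodically.
        self.last_beat = time.monotonic()

    def alive(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_beat) <= self.ttl

lease = Lease(TTL)
lease.heartbeat()
assert lease.alive()  # master keeps the worker bound
# Once alive() turns False, the master can reschedule the worker's subtasks.
```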

4.2 Relay

  • When enabled, binlog is first written to local relay, then consumed by syncer
  • Improves stability when upstream binlog reading is unstable

Typical directory:

<deploy_dir>/relay_log/
|-- <server-uuid>.000001/
|   |-- mysql-bin.000001
|   `-- relay.meta
`-- server-uuid.index

4.3 SubTask and SourceWorker

  • SubTask is the execution unit after task splitting
  • SourceWorker manages subtask/state/relay lifecycle for one source

5. Syncer (Incremental Replication Engine)

5.1 StreamController

Handles binlog stream control:

  • creates stream from remote or relay
  • resets by position or GTID
  • continuously reads next event

5.2 Main Loop

The Syncer main loop dispatches events to branches:

  • rotate event
  • query event (DDL)
  • rows event (DML)

Key rows events include:

  • WRITE_ROWS_EVENT
  • UPDATE_ROWS_EVENT
  • DELETE_ROWS_EVENT
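The dispatch above can be sketched as one branch per event type. The event dicts and function name are illustrative, not DM's internal API; the event type names mirror MySQL's binlog event types.

```python
# Illustrative dispatch of the Syncer main loop: each binlog event type
# takes its own branch.
def dispatch(event):
    etype = event["type"]
    if etype == "ROTATE_EVENT":
        return ("rotate", event["next_log"])   # switch to the next binlog file
    if etype == "QUERY_EVENT":
        return ("ddl", event["query"])         # DDL goes to syncDDL
    if etype in ("WRITE_ROWS_EVENT", "UPDATE_ROWS_EVENT", "DELETE_ROWS_EVENT"):
        return ("dml", event["rows"])          # DML goes to syncDML
    return ("skip", None)                      # heartbeat, format desc, ...

print(dispatch({"type": "WRITE_ROWS_EVENT", "rows": [(1, "a")]}))
```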

5.3 syncDML

syncDML dispatches jobs from dmlJobCh to:

  • Compactor (optional)
  • Causality
  • DML worker queues

Common tuning knobs:

  • worker-count (parallelism)
  • batch (batch size)
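How worker-count and batch interact can be sketched as hash-partitioned queues that flush once enough jobs accumulate. The constants mirror the task-config knob names; the code itself is illustrative, not DM's implementation.

```python
# Sketch: DML jobs are hashed by key into per-worker queues (worker-count
# parallel executors), and each worker flushes once `batch` jobs accumulate.
WORKER_COUNT = 16   # worker-count: parallel DML executors
BATCH = 100         # batch: statements flushed per execution

def enqueue(jobs):
    """Partition jobs into worker queues; same key -> same queue (ordering)."""
    queues = [[] for _ in range(WORKER_COUNT)]
    for job in jobs:
        queues[hash(job["key"]) % WORKER_COUNT].append(job)
    return queues

def flush_ready(queue):
    # A real worker would execute the batch downstream in one go.
    return len(queue) >= BATCH

queues = enqueue([{"key": "id=1"}, {"key": "id=1"}, {"key": "id=2"}])
```

Raising worker-count widens parallelism across queues; raising batch trades latency for fewer, larger downstream executions.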

5.4 Causality

  • buckets conflicts by PK/UK
  • keeps conflicting events ordered
  • executes non-conflicting events in parallel
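The causality idea can be sketched as key grouping: jobs whose PK/UK values fall into one group stay ordered on one queue, and a job that bridges two existing groups forces a flush barrier before parallel execution resumes. A simplified sketch; DM's real implementation differs in detail.

```python
class Causality:
    """Group DML jobs by PK/UK value; bridging two groups is a conflict."""
    def __init__(self):
        self.relation = {}  # key -> group id

    def detect_conflict(self, keys):
        groups = {self.relation[k] for k in keys if k in self.relation}
        return len(groups) > 1  # keys already belong to different groups

    def add(self, keys):
        """Record a job's keys; return True if queued jobs must flush first."""
        conflict = self.detect_conflict(keys)
        if conflict:
            self.relation.clear()  # barrier: restart grouping after the flush
        group = next((self.relation[k] for k in keys if k in self.relation),
                     len(self.relation))
        for k in keys:
            self.relation[k] = group
        return conflict

c = Causality()
c.add(["id=1"])           # no conflict: can run in parallel
c.add(["id=2"])           # different key: still parallel
c.add(["id=1", "id=2"])   # bridges two groups -> conflict, flush first
```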

5.5 syncDDL / DDLWorker / checkpointWorker

  • syncDDL switches logic by shard mode (pessimist/optimist)
  • DDLWorker handles DDL split/filter/execute and metrics
  • checkpointWorker flushes progress for resumability and final consistency
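The checkpoint flush can be sketched as a two-stage save: the applied binlog position is updated in memory as batches land, and flushed at intervals so a restarted subtask resumes from the last flushed point instead of the beginning. A toy sketch; DM persists checkpoints in a downstream table.

```python
class Checkpoint:
    """Toy checkpoint: in-memory position vs. last persisted position."""
    def __init__(self):
        self.saved = None    # last flushed (persisted) position
        self.pending = None  # position applied but not yet flushed

    def update(self, position):
        # Called as DML/DDL batches are applied downstream.
        self.pending = position

    def flush(self):
        # checkpointWorker would write this to the checkpoint table.
        self.saved = self.pending
        return self.saved

cp = Checkpoint()
cp.update(("mysql-bin.000001", 4))
cp.flush()  # after this, a restart resumes from (mysql-bin.000001, 4)
```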

6. Shard DDL: Pessimist vs Optimist

Dimension        Pessimist Mode                        Optimist Mode
DML impact       Related DML may be blocked            Tries to keep the main DML path unblocked
Coordination     Owner executes downstream DDL once    Schema-state conflict detection and coordination
Key structures   Info / Operation / Lock               Info / Operation (with more schema context)
Typical issue    Lock not resolved for a long time     Schema conflict resolution failure

7. Metrics and Monitoring

7.1 High-Priority Metrics

  • replicate lag
  • remaining time to sync
  • shard lock resolving
  • DML queue remain length
  • relay disk capacity and remaining space

7.2 Two Core Formulas

bytesPerSec      = (totalBinlogSize - lastBinlogSize) / seconds
remainingSeconds = remainingSize / bytesPerSec

lag = now - tsOffset - headerTS
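Plugging illustrative numbers into the formulas above (all values are made up for the worked example):

```python
# Throughput over the last sampling window, then remaining catch-up time.
totalBinlogSize, lastBinlogSize, seconds = 5_000_000, 3_000_000, 10
remainingSize = 6_000_000

bytesPerSec = (totalBinlogSize - lastBinlogSize) / seconds  # bytes/s consumed
remainingSeconds = remainingSize / bytesPerSec              # time to catch up

# Replication lag: wall clock minus clock offset minus the event header
# timestamp of the last applied binlog event.
now, tsOffset, headerTS = 1_700_000_100, 2, 1_700_000_080
lag = now - tsOffset - headerTS

print(bytesPerSec, remainingSeconds, lag)
```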

8. Troubleshooting Runbook (Suggested Order)

  1. First decide whether the issue is in the control plane (master/scheduler/etcd) or the data plane (worker/syncer)
  2. If lag grows, check downstream write capacity, then relay IO, then syncer parallel settings
  3. If stuck on DDL, check shard lock resolving and routing consistency
  4. Use dmctl/OpenAPI for manual intervention only when needed

9. Etcd Key Observability

10. References