CASE: Duration Increase Caused by NIC Bandwidth Saturation
Time: 2021-05-07
1. Problem Symptoms
1.1 Customer-Reported Issue
According to customer feedback, around 13:30 there was an overall SQL slowdown, and they asked for the cause and solution.
1.2 Grafana Monitoring
First, a quick observation confirmed that there was indeed an overall Duration increase;

However, SQL OPS did not increase during the same period, indicating there was no sudden application-side surge in SQL volume;

Looking at memory changes, we found that among the TiDB nodes (IP15, IP16), memory usage on IP15 increased significantly. The issue might be that SQL sent to this node became abnormal;

At the same time, overall TiKV CPU usage also increased, so this was unlikely to be caused by TiKV performance jitter or network jitter;

Looking at network traffic, inbound traffic on node IP15 reached 1G, while outbound traffic on other TiKV nodes increased. This indicates IP15 might have been saturated, which then affected overall cluster performance.

1.3 NIC Bandwidth
This production cluster uses 10GbE NICs;
[tidb@ip-9-23-3-15 log]$ ethtool bond0
Settings for bond0:
......
Speed: 10000Mb/s
......2. Root Cause Identification
2.1 Query Inspection
select query sql_text,
sum_query_time,
mnt as executions,
avg_query_time,
avg_proc_time,
avg_wait_time,
max_query_time,
avg_backoff_time,
(case
when avg_proc_time = 0 then
'point_get or commit'
when (avg_proc_time > avg_wait_time and
avg_proc_time > avg_backoff_time) then
'coprocessor_process'
when (avg_backoff_time > avg_wait_time and
avg_proc_time < avg_backoff_time) then
'backoff'
else
'coprocessor_wait'
end) as type
from (select substr(query, 1, 100) query,
count(*) mnt,
avg(query_time) avg_query_time,
avg(process_time) avg_proc_time,
avg(wait_time) avg_wait_time,
max(query_time) max_query_time,
sum(query_time) sum_query_time,
avg(backoff_time) avg_backoff_time
from information_schema.slow_query
where time >= '2021-05-07 13:20:00'
and time <= '2021-05-07 13:35:00'
and (lower(query) not like 'analyze%' or
lower(query) not like 'alter%')
group by substr(query, 1, 100)) t
order by sum_query_time desc,
executions desc,
avg_query_time desc limit 20;2.2 Output Results
*************************** 1. row ***************************
sql_text: SELECT id AS id, code AS code, relationNo AS relationNo, proposalNo AS proposalNo, ssProposalNo AS
sum_query_time: 904.0470936730003
executions: 1040
avg_query_time: 0.8692760516086542
avg_proc_time: 0.00001153846153846154
avg_wait_time: 0.00000576923076923077
max_query_time: 4.691674843
avg_backoff_time: 0
type: coprocessor_process
*************************** 2. row ***************************
sql_text: SELECT id AS id, orderId AS orderId, type AS type, insuredType AS insuredType, changeInsuredType AS
sum_query_time: 399.65042316099994
executions: 437
avg_query_time: 0.9145318607803202
avg_proc_time: 0.00043935926773455404
avg_wait_time: 0.001796338672768879
max_query_time: 3.965649058
avg_backoff_time: 0.000004576659038901602
type: coprocessor_wait
*************************** 3. row ***************************
sql_text: SELECT id AS id, orderId AS orderId, orderType AS orderType, funNo AS funNo, funUrl AS funUrl, isCo
sum_query_time: 286.730882666
executions: 255
avg_query_time: 1.1244348339843138
avg_proc_time: 0.0003490196078431374
avg_wait_time: 0.0027568627450980394
max_query_time: 5.012617706
avg_backoff_time: 0
type: coprocessor_wait
*************************** 4. row ***************************
sql_text: select id, product_model_id,message_template_id, relation_list,product_statement_title,fixed
sum_query_time: 218.80345473799997
executions: 193
avg_query_time: 1.1336966566735749
avg_proc_time: 0.00021761658031088085
avg_wait_time: 0
max_query_time: 5.437438295
avg_backoff_time: 0
type: coprocessor_process
*************************** 5. row ***************************
sql_text: INSERT INTO insurance_order_to_org (id, orderId, orderType, funNo, funUrl, isCompleted, sort, creat
sum_query_time: 213.487945341
executions: 173
avg_query_time: 1.234034366132948
avg_proc_time: 0
avg_wait_time: 0
max_query_time: 5.538053322
avg_backoff_time: 0
type: point_get or commit
*************************** 6. row ***************************
sql_text: SELECT id AS id, orderId AS orderId, pdfType AS pdfType, createDate AS createDate, updateDate AS up
sum_query_time: 185.341252661
executions: 188
avg_query_time: 0.9858577269202128
avg_proc_time: 0.0002553191489361702
avg_wait_time: 0.002398936170212766
max_query_time: 3.783315653
avg_backoff_time: 0
type: coprocessor_wait
*************************** 7. row ***************************
sql_text: commit;
sum_query_time: 169.47936289400002
executions: 155
avg_query_time: 1.0934152444774194
avg_proc_time: 0
avg_wait_time: 0
max_query_time: 2.877081402
avg_backoff_time: 0
type: point_get or commit
*************************** 8. row ***************************
sql_text: select id, product_id,main_clause_code,main_clause_name,clause_type, liability_type,clause_code,medi
sum_query_time: 157.704614038
executions: 144
avg_query_time: 1.0951709308194444
avg_proc_time: 0.0004097222222222223
avg_wait_time: 0.0001388888888888889
max_query_time: 4.499073241
avg_backoff_time: 0
type: coprocessor_process
*************************** 9. row ***************************
sql_text: INSERT IGNORE INTO rabbitmq_consume_record (id, orderId, pdfType, createDate, updateDate, consumeS
sum_query_time: 137.144392501
executions: 112
avg_query_time: 1.2245035044732142
avg_proc_time: 0
avg_wait_time: 0
max_query_time: 4.619352579
avg_backoff_time: 0
type: point_get or commit
*************************** 10. row ***************************
sql_text: SELECT * FROM `ncisp_orders` LIMIT 8729000, 1000;
sum_query_time: 136.27416871
executions: 1
avg_query_time: 136.27416871
avg_proc_time: 630.249
avg_wait_time: 20.584
max_query_time: 136.27416871
avg_backoff_time: 0
type: coprocessor_process
10 rows in set (1.23 sec) 2.3 Execution Plan
[ogg33@9.23.3.9:4000] [ncisp]> explain SELECT * FROM `ncisp_orders` LIMIT 8729000, 1000;
+----------------------------+------------+-----------+--------------------+----------------------------+
| id | estRows | task | access object | operator info |
+----------------------------+------------+-----------+--------------------+----------------------------+
| Limit_7 | 1000.00 | root | | offset:8729000, count:1000 |
| ©∏©§TableReader_12 | 8730000.00 | root | | data:Limit_11 |
| ©∏©§Limit_11 | 8730000.00 | cop[tikv] | | offset:0, count:8730000 |
| ©∏©§TableFullScan_10 | 8730000.00 | cop[tikv] | table:ncisp_orders | keep order:false |
+----------------------------+------------+-----------+--------------------+----------------------------+
4 rows in set (0.00 sec)
[ogg33@9.23.3.9:4000] [ncisp]> select count(*) from ncisp_orders;
+----------+
| count(*) |
+----------+
| 8789461 |
+----------+
1 row in set (0.83 sec) 2.4 Data Volume Estimation
[ogg33@9.23.3.9:4000] [ncisp]> show table status like '%ncisp_orders'\G
*************************** 1. row ***************************
Name: ncisp_orders
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 8788191
Avg_row_length: 16710
Data_length: 146854445481
Max_data_length: 0
Index_length: 1678429127
Data_free: 0
Auto_increment: NULL
Create_time: 2021-04-26 21:05:17
Update_time: NULL
Check_time: NULL
Collation: utf8mb4_bin
Checksum:
Create_options:
Comment: ∂©µ•±Ì
1 row in set (0.09 sec) The average row length is 16710 bytes (16K). The physical execution of this SQL in the execution plan includes a "full table scan". At the TiKV layer, 8,730,000 rows are scanned, which is approximately 8,730,000 * 16K = 133G after conversion; Converting 10,000M bit/s for a 10GbE NIC yields approximately 120G Bytes/s. This shows that node IP15's NIC was saturated by this SELECT.
3. Solution
It is necessary to ask the application side for the execution purpose of this SQL. If not necessary, try to avoid it or split this SQL into multiple batches of small-volume queries.