🚀TiDB
TiDB

CASE: Duration Increase Caused by NIC Bandwidth Saturation

Time: 2021-05-07

1. Problem Symptoms

1.1 Customer-Reported Issue

According to customer feedback, around 13:30 there was an overall SQL slowdown, and they asked for the cause and solution.

1.2 Grafana Monitoring

First, a quick observation confirmed that there was indeed an overall Duration increase;

01.jpg

However, SQL OPS did not increase during the same period, indicating there was no sudden application-side surge in SQL volume;

02.jpg

Looking at memory changes, we found that among the TiDB nodes (IP15, IP16), memory usage on IP15 increased significantly. The issue might be that SQL sent to this node became abnormal;

03.jpg

At the same time, overall TiKV CPU usage also increased, so this was unlikely to be caused by TiKV performance jitter or network jitter;

04.jpg

Looking at network traffic, inbound traffic on node IP15 reached 1G, while outbound traffic on other TiKV nodes increased. This indicates IP15 might have been saturated, which then affected overall cluster performance.

05.jpg

1.3 NIC Bandwidth

This production cluster uses 10GbE NICs;

[tidb@ip-9-23-3-15 log]$ ethtool bond0 
Settings for bond0: 
        ......
        Speed: 10000Mb/s
        ......

2. Root Cause Identification

2.1 Query Inspection

select query sql_text,
       sum_query_time,
       mnt as executions,
       avg_query_time,
       avg_proc_time,
       avg_wait_time,
       max_query_time,
       avg_backoff_time,
       (case
         when avg_proc_time = 0 then
          'point_get or commit'
         when (avg_proc_time > avg_wait_time and
              avg_proc_time > avg_backoff_time) then
          'coprocessor_process'
         when (avg_backoff_time > avg_wait_time and
              avg_proc_time < avg_backoff_time) then
          'backoff'
         else
          'coprocessor_wait'
       end) as type
  from (select substr(query, 1, 100) query,
               count(*) mnt,
               avg(query_time) avg_query_time,
               avg(process_time) avg_proc_time,
               avg(wait_time) avg_wait_time,
               max(query_time) max_query_time,
               sum(query_time) sum_query_time,
               avg(backoff_time) avg_backoff_time
          from information_schema.slow_query
         where time >= '2021-05-07 13:20:00'
           and time <= '2021-05-07 13:35:00'
           and (lower(query) not like 'analyze%' or
               lower(query) not like 'alter%')
         group by substr(query, 1, 100)) t
 order by sum_query_time desc,
          executions     desc,
          avg_query_time desc limit 20;

2.2 Output Results

*************************** 1. row *************************** 
        sql_text: SELECT  id AS id, code AS code, relationNo AS relationNo, proposalNo AS proposalNo, ssProposalNo AS  
  sum_query_time: 904.0470936730003 
      executions: 1040 
  avg_query_time: 0.8692760516086542 
   avg_proc_time: 0.00001153846153846154 
   avg_wait_time: 0.00000576923076923077 
  max_query_time: 4.691674843 
avg_backoff_time: 0 
            type: coprocessor_process 
*************************** 2. row *************************** 
        sql_text: SELECT  id AS id, orderId AS orderId, type AS type, insuredType AS insuredType, changeInsuredType AS 
  sum_query_time: 399.65042316099994 
      executions: 437 
  avg_query_time: 0.9145318607803202 
   avg_proc_time: 0.00043935926773455404 
   avg_wait_time: 0.001796338672768879 
  max_query_time: 3.965649058 
avg_backoff_time: 0.000004576659038901602 
            type: coprocessor_wait 
*************************** 3. row *************************** 
        sql_text: SELECT  id AS id, orderId AS orderId, orderType AS orderType, funNo AS funNo, funUrl AS funUrl, isCo 
  sum_query_time: 286.730882666 
      executions: 255 
  avg_query_time: 1.1244348339843138 
   avg_proc_time: 0.0003490196078431374 
   avg_wait_time: 0.0027568627450980394 
  max_query_time: 5.012617706 
avg_backoff_time: 0 
            type: coprocessor_wait 
*************************** 4. row *************************** 
        sql_text: select id, product_model_id,message_template_id,         relation_list,product_statement_title,fixed 
  sum_query_time: 218.80345473799997 
      executions: 193 
  avg_query_time: 1.1336966566735749 
   avg_proc_time: 0.00021761658031088085 
   avg_wait_time: 0 
  max_query_time: 5.437438295 
avg_backoff_time: 0 
            type: coprocessor_process 
*************************** 5. row *************************** 
        sql_text: INSERT INTO insurance_order_to_org  (id, orderId, orderType, funNo, funUrl, isCompleted, sort, creat 
  sum_query_time: 213.487945341 
      executions: 173 
  avg_query_time: 1.234034366132948 
   avg_proc_time: 0 
   avg_wait_time: 0 
  max_query_time: 5.538053322 
avg_backoff_time: 0 
            type: point_get or commit 
*************************** 6. row *************************** 
        sql_text: SELECT  id AS id, orderId AS orderId, pdfType AS pdfType, createDate AS createDate, updateDate AS up 
  sum_query_time: 185.341252661 
      executions: 188 
  avg_query_time: 0.9858577269202128 
   avg_proc_time: 0.0002553191489361702 
   avg_wait_time: 0.002398936170212766 
  max_query_time: 3.783315653 
avg_backoff_time: 0 
            type: coprocessor_wait 
*************************** 7. row *************************** 
        sql_text: commit; 
  sum_query_time: 169.47936289400002 
      executions: 155 
  avg_query_time: 1.0934152444774194 
   avg_proc_time: 0 
   avg_wait_time: 0 
  max_query_time: 2.877081402 
avg_backoff_time: 0 
            type: point_get or commit 
*************************** 8. row *************************** 
        sql_text: select id, product_id,main_clause_code,main_clause_name,clause_type, liability_type,clause_code,medi 
  sum_query_time: 157.704614038 
      executions: 144 
  avg_query_time: 1.0951709308194444 
   avg_proc_time: 0.0004097222222222223 
   avg_wait_time: 0.0001388888888888889 
  max_query_time: 4.499073241 
avg_backoff_time: 0 
            type: coprocessor_process 
*************************** 9. row *************************** 
        sql_text: INSERT IGNORE  INTO rabbitmq_consume_record  (id, orderId, pdfType, createDate, updateDate, consumeS 
  sum_query_time: 137.144392501 
      executions: 112 
  avg_query_time: 1.2245035044732142 
   avg_proc_time: 0 
   avg_wait_time: 0 
  max_query_time: 4.619352579 
avg_backoff_time: 0 
            type: point_get or commit 
*************************** 10. row *************************** 
        sql_text: SELECT * FROM `ncisp_orders` LIMIT 8729000, 1000; 
  sum_query_time: 136.27416871 
      executions: 1 
  avg_query_time: 136.27416871 
   avg_proc_time: 630.249 
   avg_wait_time: 20.584 
  max_query_time: 136.27416871 
avg_backoff_time: 0 
            type: coprocessor_process 
10 rows in set (1.23 sec) 

2.3 Execution Plan

[ogg33@9.23.3.9:4000] [ncisp]> explain SELECT * FROM `ncisp_orders` LIMIT 8729000, 1000; 
+----------------------------+------------+-----------+--------------------+----------------------------+ 
| id                         | estRows    | task      | access object      | operator info              | 
+----------------------------+------------+-----------+--------------------+----------------------------+ 
| Limit_7                    | 1000.00    | root      |                    | offset:8729000, count:1000 | 
| ©∏©§TableReader_12           | 8730000.00 | root      |                    | data:Limit_11              | 
|   ©∏©§Limit_11               | 8730000.00 | cop[tikv] |                    | offset:0, count:8730000    | 
|     ©∏©§TableFullScan_10     | 8730000.00 | cop[tikv] | table:ncisp_orders | keep order:false           | 
+----------------------------+------------+-----------+--------------------+----------------------------+ 
4 rows in set (0.00 sec) 
 

[ogg33@9.23.3.9:4000] [ncisp]> select count(*) from ncisp_orders; 
+----------+ 
| count(*) | 
+----------+ 
|  8789461 | 
+----------+ 
1 row in set (0.83 sec) 

2.4 Data Volume Estimation

[ogg33@9.23.3.9:4000] [ncisp]> show table status like '%ncisp_orders'\G  
*************************** 1. row *************************** 
           Name: ncisp_orders 
         Engine: InnoDB 
        Version: 10 
     Row_format: Compact 
           Rows: 8788191 
 Avg_row_length: 16710 
    Data_length: 146854445481 
Max_data_length: 0 
   Index_length: 1678429127 
      Data_free: 0 
 Auto_increment: NULL 
    Create_time: 2021-04-26 21:05:17 
    Update_time: NULL 
     Check_time: NULL 
      Collation: utf8mb4_bin 
       Checksum:  
 Create_options:  
        Comment: ∂©µ•±Ì 
1 row in set (0.09 sec) 

The average row length is 16710 bytes (16K). The physical execution of this SQL in the execution plan includes a "full table scan". At the TiKV layer, 8,730,000 rows are scanned, which is approximately 8,730,000 * 16K = 133G after conversion; Converting 10,000M bit/s for a 10GbE NIC yields approximately 120G Bytes/s. This shows that node IP15's NIC was saturated by this SELECT.

3. Solution

It is necessary to ask the application side for the execution purpose of this SQL. If not necessary, try to avoid it or split this SQL into multiple batches of small-volume queries.