Demystifying ‘Waiting for table metadata lock’ in MySQL 5.7

Avishek Gurung
7 min read · Apr 24, 2021


Background

Back when I was working at an organization, we had a MySQL table that stored voluminous data: every day we would insert close to 2 million rows into it. The table was partitioned by day, which made it easy for us to take incremental backups. To avoid filling up disk space, the table kept only the last 30 days of data, so every day a job would drop the partition that was 30 days old.

mysql> ALTER TABLE table_name DROP PARTITION partition_name;
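
For context, the table was partitioned by day roughly as in the sketch below. The table name, columns, and partition names here are assumptions for illustration; only the day-wise RANGE partitioning matters.

mysql> CREATE TABLE events (
         id BIGINT NOT NULL AUTO_INCREMENT,
         created_at DATETIME NOT NULL,
         payload VARCHAR(255),
         PRIMARY KEY (id, created_at)
       )
       PARTITION BY RANGE (TO_DAYS(created_at)) (
         PARTITION p20210401 VALUES LESS THAN (TO_DAYS('2021-04-02')),
         PARTITION p20210402 VALUES LESS THAN (TO_DAYS('2021-04-03')),
         PARTITION p20210403 VALUES LESS THAN (TO_DAYS('2021-04-04'))
       );

With such a layout, the nightly cleanup job only has to drop the single oldest partition, which is normally a very fast operation.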

Problem

Whenever the job that dropped the partition ran, the applications using this MySQL instance would clog up and run out of connections. Their logs showed that they were unable to make any further connections to the database because the connection pool had been exhausted. The connections the applications already held were never freed, and no new ones could be made since the pool was exhausted. We checked the processes that were running in MySQL.

mysql> SHOW FULL PROCESSLIST;

We found a large number of processes in the state “Waiting for table metadata lock”. The MySQL instance would remain in that state for several hours, and our applications stayed halted the entire time. When we killed the process running the “DROP PARTITION” command, almost all of the processes that had been stuck in the lock state started executing, and the database returned to its normal state. This led us to believe, at first, that the DDL command (“DROP PARTITION”) was causing all the queries to wait for the table metadata lock. There was still an ambiguity, though: past experience had shown that a “DROP PARTITION” never takes several hours; it completes within a few minutes even when the partition is huge. Why, then, would it now take several hours to complete? The problem demanded more insight!

Reproducing the problem

Our first approach to fixing the problem was to understand it (and not to forget it!). And what better way to understand a production problem than to reproduce it in a local environment?

However, let’s first understand what a table metadata lock is, and then we will navigate to the solution. A table metadata lock is not a row-level or table-level data lock; it is a lock that protects a table’s structure (its metadata) from concurrent changes. When a DDL query that alters the structure of the table is run while that lock cannot be granted, DML and SELECT queries against the table will not be executed until the DDL is complete; they sit in the state ‘Waiting for table metadata lock’, which can be viewed with the ‘show full processlist’ command. There are two points to summarise, which will help us understand the table metadata lock better and also allow us to replicate the above production scenario in our local environment. But prior to that, let’s consider a simple table for reference.

terminal-1> select * from employees;
+----+------------+------------+------+
| id | first_name | last_name  | age  |
+----+------------+------------+------+
|  1 | Bill       | Gates      |   40 |
|  2 | Larry      | Page       |   45 |
|  3 | Mark       | Zuckerberg |   45 |
+----+------------+------------+------+

This is a simple table which holds some basic information about employees. We are going to use this table in our examples.
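
If you want to follow along locally, a minimal setup could look like the one below. The column sizes and the inserted rows are assumptions; the article never shows the original table definition.

terminal-1> CREATE TABLE employees (
              id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
              first_name VARCHAR(50),
              last_name VARCHAR(30),
              age INT
            );
terminal-1> INSERT INTO employees (first_name, last_name, age) VALUES
            ('Bill', 'Gates', 40), ('Larry', 'Page', 45), ('Mark', 'Zuckerberg', 45);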

If a transaction that has read from or written to the table is still open and we then try to change the table’s structure, the DDL command will wait for the metadata lock.
Let’s use three MySQL clients to reproduce this scenario.

terminal-1> begin;
terminal-1> select * from employees;
terminal-2> ALTER TABLE employees MODIFY last_name varchar(50);
//This command hangs and does not complete.
terminal-3> show full processlist \G
*************************** 1. row ***************************
Id: 2
User: root
Host: localhost
db: test
Command: Sleep
Time: 58
State:
Info: NULL
*************************** 2. row ***************************
Id: 3
User: root
Host: localhost
db: test
Command: Query
Time: 30
State: Waiting for table metadata lock
Info: ALTER TABLE employees MODIFY last_name varchar(50)
*************************** 3. row ***************************
Id: 4
User: root
Host: localhost
db: test
Command: Query
Time: 0
State: starting
Info: show full processlist

We can see that the DDL command we used to alter the table is waiting for the metadata lock. We would have seen the same result had we used a DDL command to drop a partition of the table; in the example we modify a column rather than drop a partition only for simplicity. Once the transaction completes, the DDL continues and finishes.

terminal-1> commit;
terminal-2> Query OK, 0 rows affected (0.07 sec)
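
As a side note (not something we changed at the time), how long a blocked DDL will wait is governed by the lock_wait_timeout system variable, whose default in MySQL 5.7 is 31536000 seconds, i.e. one whole year. That is why a blocked ALTER can appear to hang forever. A session can cap it, for example:

terminal-2> SET SESSION lock_wait_timeout = 60;
terminal-2> ALTER TABLE employees MODIFY last_name varchar(50);
//Fails with ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction,
//if the metadata lock is not granted within 60 seconds.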

While a DDL query is waiting for the metadata lock, any query against the same table that we run afterwards, be it DML or SELECT, will also wait for the metadata lock.

terminal-1> begin;
terminal-1> select * from employees;
terminal-2> ALTER TABLE employees MODIFY last_name varchar(50);
//This command hangs and does not complete.
terminal-3> select * from employees;
//This command hangs and does not complete.
terminal-4> update employees set first_name = "Billy" where id = 1;
//This command hangs and does not complete.
terminal-5> show full processlist \G
*************************** 1. row ***************************
Id: 2
User: root
Host: localhost
db: test
Command: Sleep
Time: 206
State:
Info: NULL
*************************** 2. row ***************************
Id: 3
User: root
Host: localhost
db: test
Command: Query
Time: 204
State: Waiting for table metadata lock
Info: ALTER TABLE employees MODIFY last_name varchar(50)
*************************** 3. row ***************************
Id: 4
User: root
Host: localhost
db: test
Command: Query
Time: 194
State: Waiting for table metadata lock
Info: select * from employees
*************************** 4. row ***************************
Id: 5
User: root
Host: localhost
db: test
Command: Query
Time: 133
State: Waiting for table metadata lock
Info: update employees set first_name = "Billy" where id = 1
*************************** 5. row ***************************
Id: 6
User: root
Host: localhost
db: test
Command: Query
Time: 0
State: starting
Info: show full processlist

We see that once a DDL command is waiting for the metadata lock, no query against that table issued after it will complete; all of them pile up in the same metadata-lock state. If DML keeps arriving, all the database connections will eventually be exhausted, halting the clients (applications). This is exactly what we had been experiencing on our MySQL instance. There are important points to note here:

  1. If the transaction (Terminal 1) is committed, or the session running it is killed, then the DDL query will proceed. And once the DDL query finishes its execution, all the queued DML and SELECT queries will run.
  2. If the transaction (Terminal 1) is still RUNNING but the DDL query is KILLED, then all the DML and SELECT queries that had been waiting for the metadata lock will proceed (see the sketch after this list).
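
For completeness, here is a sketch of how one could watch the pile-up and unblock it from the fifth client, using the process Ids from the output above. Which session to kill is a judgement call: killing Id 3 (the ALTER) releases the queued queries, as in point 2, while killing Id 2 (the idle transaction) lets the ALTER itself proceed, as in point 1.

terminal-5> select count(*) from information_schema.PROCESSLIST
            where STATE = 'Waiting for table metadata lock';
terminal-5> KILL 3;
//Kills the ALTER; the queued SELECT and UPDATE run immediately.
terminal-5> KILL 2;
//Alternatively, kills the idle transaction; the ALTER then acquires the lock and completes.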

So this is what we had been experiencing. A transaction had been opened and never committed. The “DROP PARTITION” statement started while that transaction was still open, so it went into the metadata-lock wait. From then on, every DML or SELECT query against the table also went into the metadata-lock state. When we killed the “DROP PARTITION” query, all the blocked DML and SELECT statements ran normally, giving us the illusion that the “DROP PARTITION” query was the culprit. In reality, the blame lay with the session that had started a transaction and never committed it.

Catching the culprit!

It was evident that the transaction which had not been committed, and which had been running for several hours, was the culprit behind all the queries going into the metadata-lock state. We queried the INNODB_TRX table to identify that transaction.

terminal-5> select * from information_schema.INNODB_TRX \G
*************************** 1. row ***************************
trx_id: 281479623874352
trx_state: RUNNING
trx_started: 2021-04-24 11:56:33
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 0
trx_mysql_thread_id: 2
trx_query: NULL
trx_operation_state: NULL
trx_tables_in_use: 0
trx_tables_locked: 0
trx_lock_structs: 0
trx_lock_memory_bytes: 1136
trx_rows_locked: 0
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0

We saw that trx_query was NULL. Had a statement of that transaction still been executing, we would have seen its text there; the NULL value signifies that the statement had already finished but the transaction had not yet been committed. We therefore had to identify the last query that ran in the transaction, which we did with the following query.

terminal-5> SELECT performance_schema.threads.PROCESSLIST_ID,
                   performance_schema.threads.THREAD_ID,
                   performance_schema.events_statements_current.SQL_TEXT,
                   information_schema.INNODB_TRX.trx_started
            FROM performance_schema.threads
            INNER JOIN information_schema.INNODB_TRX
                ON performance_schema.threads.PROCESSLIST_ID = information_schema.INNODB_TRX.trx_mysql_thread_id
            INNER JOIN performance_schema.events_statements_current
                ON performance_schema.events_statements_current.THREAD_ID = performance_schema.threads.THREAD_ID;
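
As an aside, MySQL 5.7 can also show metadata-lock holders and waiters directly through performance_schema.metadata_locks, provided the 'wait/lock/metadata/sql/mdl' instrument is enabled (it is off by default in 5.7). We relied on the INNODB_TRX join above, but something along these lines works too:

terminal-5> UPDATE performance_schema.setup_instruments
            SET ENABLED = 'YES', TIMED = 'YES'
            WHERE NAME = 'wait/lock/metadata/sql/mdl';
terminal-5> SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID
            FROM performance_schema.metadata_locks
            WHERE OBJECT_NAME = 'employees';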

Now that we knew which query was the culprit, we could go ahead and fix it. The fix will not be the same for everyone, but once you know the culprit, I am sure it will not take long to fix or tune it.

In our case, it was a report-generation script written in Python, scheduled to run every few hours. Autocommit was set to False, and the script issued a commit only after UPDATE and INSERT statements, never after a SELECT. As a result, the transaction opened by the reads was committed only when the entire report had finished and the connection was garbage-collected. So for as long as this huge report was being generated, that open transaction held the metadata lock, and all the queries queued behind the “DROP PARTITION” sat in the “Waiting for table metadata lock” state!
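
In terms of what the report's database session should do, the fix is simply to make sure that read-only work does not leave a transaction open. A minimal sketch (the prompts are illustrative, not taken from the original script):

//Option 1: keep autocommit on for read-only work, so each SELECT releases its metadata lock immediately.
report-session> SET autocommit = 1;
report-session> select * from employees;

//Option 2: with autocommit off, commit (or roll back) after the reads as well, not only after the writes.
report-session> SET autocommit = 0;
report-session> select * from employees;
report-session> commit;

In our Python script this translated to committing (or closing the connection) right after the read-only queries instead of waiting for the entire report to finish.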
