Debugging Lock Wait Timeouts (in Ruby on Rails)

Ollie Haydon-Mulligan
Published in Source Diving
9 min read · Mar 20, 2023

Lock Wait Timeouts mean frustration for users. If a request ends with a Lock Wait Timeout error, that means a) we failed to fulfil a user’s intended action, b) there was a long delay before the failure, and c) some other transaction was also slow, potentially meaning another user waited a long time.

I recently learnt — through an investigation that rumbled on for a while — that it’s not always straightforward to work out what’s causing Lock Wait Timeouts. So here, I’m going to share how that investigation unfolded, focusing on the techniques that helped.

First though, what is a Lock Wait Timeout?

Some database transactions¹ need to lock one or more rows in a table. A transaction may need to update or delete those rows, for example, so it ‘locks’ them to prevent other transactions making conflicting changes.

Sometimes, a transaction needs to lock a row but it can’t because the row is already locked by another transaction. In that scenario, the transaction has to wait for the lock it needs. If it waits longer than the innodb_lock_wait_timeout value (by default 50 seconds), MySQL throws an error. MySQL itself doesn't necessarily² roll back the whole transaction, but if the error happens inside an ActiveRecord transaction, it will trigger a rollback, like any other error. Meanwhile, the other transaction (which already holds the blocking lock) continues.
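To make that concrete, here’s a minimal way to reproduce the error locally (a sketch, assuming a Recipe ActiveRecord model backed by MySQL and a row with id 1). In one Rails console session, hold a row lock 👇

Recipe.transaction do
  Recipe.lock.find(1) # SELECT ... FOR UPDATE acquires an exclusive row lock
  sleep 60            # keep the transaction open for longer than the timeout
end

Then, in a second console session, try to lock the same row 👇

Recipe.transaction do
  # this blocks waiting for the first session's lock, then raises
  # ActiveRecord::LockWaitTimeout once innodb_lock_wait_timeout is exceeded
  Recipe.lock.find(1)
end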

The tricky thing about debugging LockWaitTimeout errors is that after they happen, the errored transaction is no longer waiting for a lock. At that point, there's no easy way to know which transaction held the lock that blocked the errored transaction, or what that blocking lock was. And that makes it hard to know what caused the Lock Wait Timeout.

1. Checking the locks that are still held

What we can see immediately after a LockWaitTimeout error happens are the transactions that still hold one or more locks. One of those transactions will almost certainly be the one that blocked the transaction that was waiting for a lock. We can't know for certain which was the blocking transaction, but we can at least identify suspicious transactions and get more information about them.

At Cookpad, we were concerned about the number of Lock Wait Timeouts we were seeing across a range of endpoints, generally in a few 20–30 minute spikes each day. We didn’t understand the root cause, so I decided to try and get a clearer picture of the potentially blocking transactions.

To get a list of all currently held locks at a given moment, we can use the SHOW ENGINE INNODB STATUS statement³, which displays “extensive information…about the state of the InnoDB storage engine”. Here’s how I parsed the output to get a list of currently held locks 👇

def record_lock_descriptions
  @_record_lock_descriptions ||= fetch_record_lock_descriptions
end

def fetch_record_lock_descriptions
  LocksDebugging::InnodbStatusReport.new(fetch_innodb_status).lock_descriptions
end

def fetch_innodb_status
  result_array = ActiveRecord::Base.connection.exec_query(
    "SHOW ENGINE INNODB STATUS"
  ).rows.flatten
  _type, _name, status = result_array
  status
end

module LocksDebugging
  class InnodbStatusReport
    SECTION_REGEX = /(?=\n[A-Z ]+\n)/

    def initialize(query_output)
      @query_output = query_output
    end

    def lock_descriptions
      @_lock_descriptions ||= select_lock_descriptions
    end

    private

    attr_reader :query_output

    def select_lock_descriptions
      transactions_section.split("\n").select do |line|
        line.start_with?("RECORD LOCKS") && line.exclude?("waiting")
      end
    end

    def transactions_section
      @_transactions_section ||= find_transactions_section
    end

    def find_transactions_section
      sections.find do |section|
        section.strip.start_with?("TRANSACTIONS")
      end
    end

    def sections
      @_sections ||= query_output.split(SECTION_REGEX)
    end
  end
end

Each record_lock_description looks something like this 👇

RECORD LOCKS space id 64905 page no 3 n bits 72 index PRIMARY of table `global_test`.`recipes` trx id 1833090 lock_mode X locks rec but not gap

It tells us which table the lock is in, and the id of the transaction that holds the lock. The transaction id corresponds to the trx_id column in the information_schema.innodb_trx table, where we can fetch more detail about the transaction, like when it started, how many locks it holds, and what query it is currently executing.
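We can also query that table directly. For example, to look up the transaction from the sample lock description above (a quick sketch from a Rails console) 👇

ActiveRecord::Base.connection.exec_query(<<~SQL).to_a
  SELECT trx_id, trx_started, trx_rows_locked, trx_query
  FROM information_schema.innodb_trx
  WHERE trx_id = '1833090'
SQL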

Here’s how I put that innodb_trx detail together with the lock_descriptions 👇

class LocksSnapshot
  def self.to_h
    new.to_h
  end

  def initialize
    @transactions = fetch_transactions
    @report = build_report
  end

  def to_h
    report.each_with_index.to_h do |lock_summary, index|
      [index.to_s, lock_summary]
    end
  end

  private

  attr_reader :transactions, :report

  def build_report
    record_lock_descriptions.map.with_index do |lock_description, index|
      transaction_id = transaction_ids[index]
      transaction = transactions_by_id[transaction_id]

      {
        description: lock_description,
        transaction_current_query_present: transaction_current_query_present?(transaction),
        transaction_current_query_context: transaction_current_query_context(transaction),
        transaction_started: transaction_started(transaction),
        transaction_record_locks: transaction_record_locks(transaction)
      }.compact
    end
  end

  def record_lock_descriptions
    @_record_lock_descriptions ||= fetch_record_lock_descriptions
  end

  def fetch_record_lock_descriptions
    LocksDebugging::InnodbStatusReport.new(fetch_innodb_status).lock_descriptions
  end

  def fetch_innodb_status
    ActiveRecord::Base.connection.exec_query(
      "SHOW ENGINE INNODB STATUS"
    ).rows[0][2]
  end

  def transaction_ids
    @_transaction_ids ||= build_transaction_ids
  end

  def build_transaction_ids
    record_lock_descriptions.map do |lock_description|
      extract_transaction_id(lock_description)
    end
  end

  def extract_transaction_id(lock_description)
    match = lock_description.match(/trx id (\d+)/)

    if match
      match[1]
    end
  end

  def transactions_by_id
    @_transactions_by_id ||= build_transactions_by_id
  end

  def build_transactions_by_id
    transactions.index_by do |transaction_row|
      transaction_row["trx_id"]
    end
  end

  def fetch_transactions
    return [] if transaction_ids.blank?

    ActiveRecord::Base.connection.exec_query(
      "SELECT * FROM information_schema.innodb_trx WHERE #{transaction_id_condition}"
    ).to_a
  end

  def transaction_id_condition
    ActiveRecord::Base.sanitize_sql("trx_id IN (#{transaction_ids.compact.join(',')})")
  end

  def transaction_current_query_present?(transaction)
    return false if transaction.blank?

    transaction["trx_query"].present?
  end

  def transaction_current_query_context(transaction)
    return unless transaction_current_query_present?(transaction)

    match = transaction["trx_query"].match(/\/\*(\S+)\*\//)

    if match
      match[1]
    end
  end

  def transaction_started(transaction)
    return if transaction.blank?

    transaction["trx_started"]
  end

  def transaction_record_locks(transaction)
    return if transaction.blank?

    transaction["trx_rows_locked"]
  end
end

Note that the snapshot includes a transaction_current_query_context attribute for each lock. This extracts the special comment we prepend (using this gem) to SQL queries triggered within a controller action, background job or Kafka event handler. These SQL comments tell us where the query was triggered in our codebase⁴.
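As an aside: on Rails 7+, the built-in ActiveRecord::QueryLogs mentioned in footnote 4 can add similar comments. A minimal sketch of the configuration, in config/application.rb 👇

# Enable Rails' built-in query comments and choose which tags to include
config.active_record.query_log_tags_enabled = true
config.active_record.query_log_tags = [:application, :controller, :action, :job]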

We then used this LocksSnapshot to add context to the LockWaitTimeout error sent to Sentry 👇

# in some controller where LockWaitTimeouts are a problem

rescue_from ActiveRecord::LockWaitTimeout, with: :capture_with_locks_snapshot

def capture_with_locks_snapshot(exception)
  Sentry.configure_scope do |scope|
    scope.set_context("Locks", LocksSnapshot.to_h)
  end

  raise exception
end

Studying the output in Sentry, it didn’t take long to spot a pattern.

Whenever a cluster of LockWaitTimeouts happened, there was the same very long-running transaction in each LocksSnapshot. The start time of that transaction was always soon before the first LockWaitTimeout. It always held locks on rows in the same four tables. And it always locked a very large number of rows.

It seemed suspicious 😅. But a) there were always several other locks in every LocksSnapshot, so this long-running transaction could have been a coincidence, and b) it wasn't clear where this transaction originated or what it was doing, because whenever LocksSnapshot.to_h was called, the transaction was no longer executing a query.

2. Checking what each transaction was doing previously

To find out where the suspicious transaction originated, the detail from the information_schema.innodb_trx table wasn't enough, because it only includes the query currently being executed by each transaction. For the suspicious transaction, the trx_query column was consistently blank by the time LocksSnapshot.to_h was called.

That's very normal, though it makes it harder to debug transactions. In general, a long-running transaction isn't necessarily slow because of a slow query - any slow operation within the transaction can be responsible - and once a transaction holds a lock, that lock isn't released until the transaction commits or rolls back.
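In other words, a pattern like this holds its row lock for the whole block, even though no query is running for most of that time (purely illustrative; SlowExternalService stands in for any slow non-database work) 👇

recipe = Recipe.find(1)

Recipe.transaction do
  recipe.touch                     # a quick UPDATE that acquires a row lock
  SlowExternalService.call(recipe) # slow non-database work; the lock is still
                                   # held until the transaction commits
end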

To find out where the suspicious transaction originated, I wanted to see the queries it had already executed. This is doable with the performance_schema.events_statements_history table, if the relevant instrumentation (for storing query history) is enabled⁵.
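If that instrumentation isn’t already on, enabling the relevant consumer looks roughly like this (a sketch only; the exact setup depends on your MySQL configuration, and the blog linked in footnote 5 covers it properly) 👇

# Requires a user with UPDATE privileges on the performance_schema tables
ActiveRecord::Base.connection.execute(<<~SQL)
  UPDATE performance_schema.setup_consumers
  SET ENABLED = 'YES'
  WHERE NAME = 'events_statements_history'
SQL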

Here’s how to fetch those recent queries 👇

def get_thread_queries_for_transaction(transaction_id)
  return if transactions_by_id[transaction_id].blank?

  transactions_by_id[transaction_id].filter_map do |transaction_row|
    transaction_row["thread_query"]
  end.last(3)
end

def transactions_by_id
  @_transactions_by_id ||= build_transactions_by_id
end

def build_transactions_by_id
  transactions.group_by do |transaction_row|
    transaction_row["trx_id"]
  end
end

def transactions
  @_transactions ||= fetch_transactions
end

def fetch_transactions
  ActiveRecord::Base.connection.exec_query(
    <<~SQL.squish
      SELECT trx_id, trx_started, trx_query, trx_rows_locked, esh.sql_text as thread_query
      FROM information_schema.innodb_trx trx
      JOIN information_schema.processlist ps ON trx.trx_mysql_thread_id = ps.id
      LEFT JOIN performance_schema.threads th ON th.processlist_id = trx.trx_mysql_thread_id
      LEFT JOIN performance_schema.events_statements_history esh ON esh.thread_id = th.thread_id
      WHERE #{transaction_id_condition}
      AND ps.USER != 'SYSTEM_USER'
      ORDER BY esh.EVENT_ID
    SQL
  ).to_a
end

def transaction_id_condition
  ActiveRecord::Base.sanitize_sql("trx_id IN (#{transaction_ids.compact.join(',')})")
end

We join the relevant events_statements_history entries to each innodb_trx entry (via the performance_schema.threads table), then select the sql_text values from events_statements_history. Each row in the response includes a transaction id and the text for a query recently executed by the transaction (or more accurately by the thread where the transaction is running).

This is potentially very useful context to include in a LocksSnapshot. It could help us understand what the transactions were doing before a LockWaitTimeout happened, and therefore where the transactions originated.

We didn’t end up adding this to our LocksSnapshot, partly because it adds some complexity, partly because the necessary instrumentation has an impact on the overall performance of the database, and partly because I got lucky...

Eventually, after checking the errors in Sentry for several days, I found one snapshot where the suspicious transaction did include a query that was still being executed. It was from a LockWaitTimeout right at the beginning of a cluster of errors, though I still don't know why other similar clusters didn't include a snapshot with a currently executing query 🤷‍♂️.

Anyway, this told me the controller and action where the transaction started, so I knew — or had a decent idea⁶ — where to start looking for the source of this regular, long-running transaction that appeared to be blocking lots of others.

3. Logging lock contentions before Lock Wait Timeouts happen

Unfortunately, some initial poking around didn’t throw up any obvious culprit. I found some relatively complex transactions that were sometimes slow and I made some optimisations, but that didn’t affect the rate of LockWaitTimeouts. And I still wasn't sure the “suspicious transaction” was actually the problem. Maybe it was a red herring.

To know for sure which transaction causes another to wait for a lock, we need to check while the transaction is waiting.

We therefore set about adding a process that regularly (every 30 seconds) checks for current lock contentions. The process is totally separate from our main application, but connects to the same database so it can query the information_schema.innodb_lock_waits, information_schema.innodb_trx and information_schema.innodb_locks tables. In combination, these tables give a snapshot of transactions waiting for locks and exactly which transactions are preventing those locks being granted.

The LockContentionsSnapshot looks like this 👇

class LockContentionsSnapshot
  WAIT_AGE_THRESHOLD_SECONDS = 10

  def initialize(database_client = DatabaseClient.new)
    @database_client = database_client
    @lock_waits = fetch_lock_waits
    @formatted_rows = format_rows
  end

  def to_a
    formatted_rows
  end

  private

  attr_reader :database_client, :lock_waits, :formatted_rows

  def format_rows
    rows = lock_waits.to_a
    filter_queries(rows)
  end

  def filter_queries(rows)
    rows.each do |row|
      row[:r_trx_query] = FilteredQuery.new(row[:r_trx_query]).to_s
      row[:b_trx_query] = FilteredQuery.new(row[:b_trx_query]).to_s
    end

    rows
  end

  def columns
    @_columns ||= lock_waits.fields
  end

  def fetch_lock_waits
    database_client.query(lock_waits_sql, symbolize_keys: true)
  end

  def lock_waits_sql
    <<~SQL.squish
      SELECT
        TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_age_secs,
        rl.lock_table AS rl_table,
        rl.lock_index AS rl_lock_index,
        rl.lock_type AS rl_lock_type,
        rl.lock_mode AS rl_lock_mode,
        TIMESTAMPDIFF(SECOND, r.trx_started, NOW()) AS r_trx_age_secs,
        r.trx_query AS r_trx_query,
        bl.lock_table AS bl_table,
        bl.lock_index AS bl_lock_index,
        bl.lock_type AS bl_lock_type,
        bl.lock_mode AS bl_lock_mode,
        TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) AS b_trx_age_secs,
        b.trx_query AS b_trx_query,
        b.trx_rows_locked AS b_trx_rows_locked
      FROM information_schema.innodb_lock_waits w
      INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
      INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
      INNER JOIN information_schema.innodb_locks bl ON bl.lock_id = w.blocking_lock_id
      INNER JOIN information_schema.innodb_locks rl ON rl.lock_id = w.requested_lock_id
      WHERE r.trx_wait_started <= '#{WAIT_AGE_THRESHOLD_SECONDS.seconds.ago}'
      ORDER BY r.trx_wait_started;
    SQL
  end
end

Note that we only include locks that a transaction has been waiting on for a minimum amount of time. Effectively, we only include waits that are very close to triggering a Lock Wait Timeout.
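The process wrapping the snapshot can be as simple as a loop like this (an illustrative sketch; the real thing also needs error handling and somewhere more useful than stdout to send the output) 👇

require "json"
require "logger"

logger = Logger.new($stdout)

# Check for near-timeout lock contentions every 30 seconds and log any we find
loop do
  contentions = LockContentionsSnapshot.new.to_a
  logger.info({ lock_contentions: contentions }.to_json) if contentions.any?
  sleep 30
end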

Once we got this running, it was clear pretty quickly — by cross-referencing transaction ids — that the “suspicious transaction” was indeed the culprit, confirming that it was definitely worth digging further into where the transaction originated, and why it was blocking so many other transactions…

…And?

And eventually we found it.

It was the transaction around an ActiveRecord callback only triggered when admins updated a certain kind of record in a certain way 😅. The callback updated all records that met criteria from a subquery. Why that update_all was so slow⁷ - and why the transaction stayed open way beyond our request timeout limit - is another story.
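To make the shape of the problem concrete, here’s a purely illustrative sketch of that kind of pattern (none of these model or column names are the real ones) 👇

class AdminSetting < ApplicationRecord
  # after_save callbacks run inside the same transaction as the save itself
  after_save :hide_affected_recipes

  private

  def hide_affected_recipes
    # An update_all driven by a subquery. If MySQL evaluates the subquery per
    # row (a DEPENDENT SUBQUERY), this can lock a huge number of rows and keep
    # the surrounding transaction open for a long time.
    Recipe.where(id: RecipeFlag.where(flagged: true).select(:recipe_id))
          .update_all(visible: false)
  end
end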

But there’s no way I would have found the problem (and fixed it) without the pieces of the puzzle from the LocksSnapshot and LockContentionsSnapshot shared above.

I knew the controller name and action where the blocking transaction had (on at least one occasion) originated. I knew which tables the transaction locked rows in. I knew it locked a large number of rows. These are the clues that got me looking for large batch updates (like update_all) as part of transactions that involved all of the relevant tables.

Getting to the bottom of the problem wasn’t a smooth process. I learnt plenty along the way, and hope sharing some of the techniques and ideas that helped will make the process a bit smoother for others facing similar issues 🤞. Good luck.

[1]: This post is only directly about MySQL, and in particular the InnoDB storage engine.

[2]: By default, the current statement is rolled back (not the entire transaction). To have the entire transaction roll back, MySQL can be started with the --innodb-rollback-on-timeout option.

[3]: We can’t use the information_schema.innodb_locks table, because it only includes locks that have been requested but not yet granted, plus the locks that are blocking those requests. Also note that the SHOW ENGINE INNODB STATUS statement only includes a summary of locks if innodb_status_output_locks is enabled, and that option is only supported from MySQL 5.7. SHOW ENGINE INNODB STATUS also requires a connection to the database as a user with at least the PROCESS privilege.
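If you need to turn that output on (and your database user has the privileges to do so), it’s a one-liner 👇

ActiveRecord::Base.connection.execute("SET GLOBAL innodb_status_output_locks = ON")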

[4]: This functionality is now also provided in Rails itself by ActiveRecord::QueryLogs.

[5]: There’s more information on enabling it in this blog, which inspired the following get_thread_queries_for_transaction method.

[6]: Unfortunately, it turned out that the comments we add to sql queries include the controller name but not its namespace. This complicated the search and sent me looking in the wrong place initially.

[7]: Fundamentally, the problem was that the query used a DEPENDENT SUBQUERY.
