I never like finding why a deadlock happens, all too often the solution is embedded in the application code where some error in logic has caused multiple sessions to do things in different orders. The text of the error message provided by Oracle implies “heap all blame at the developers’ door, they are the guilty ones and the solution lies with them”. And often, but not always that is so.
Each night, after the customer’s outlets close for the day they each transmit a single binary file to the central system. The central system unpicks this file into the many record types within the file and sends them off to various destinations in the business. However things are not quite that simple. The overarching rule in this process is that the file must contain validated data or it is rejected as a whole. So the processing of each file is wrapped in a single transaction. No rocket science here: write to tables (lots of them) check some business rules and then commit or rollback. The time to process a single file basically depends on its length; big files could take tens of seconds but small ones just fractions of a second. To speed processing the application processes the incoming data four files at a time using four separate threads.
The deadlocks always occurred during an update on a single table, the one that stored stock item, store and a couple of transaction dates.
OUTLET_ID NOT NULL NUMBER
STOCKITEM_ID NOT NULL NUMBER
The primary key is on outlet_id & stockitem_id. There is also an index on stockitem_id. Both outlet_id and stockitem_id have foreign keys.
Strangely, the deadlocks always seemed to come from outlets belonging to one of two brands. These brands usually made large daily data files and had rigorously maintained stock policies; nothing can be stocked unless it is set up at the centre first.
Trace file from the deadlock showed this:
Resource Name process session holds waits process session holds waits
TX-00050021-00014589 11 31 X 13 16 S
TX-00060017-00014503 13 16 X 11 31 S
session 31: DID 0001-000B-00000002 session 16: DID 0001-000D-00000002
session 16: DID 0001-000D-00000002 session 31: DID 0001-000B-00000002
Rows waited on:
Session 16: no row
Session 31: no row
Now the interesting things here are the ‘S’ in the last column and the no row values in the ‘Rows waited on’ bit. To me this actually indicates that Oracle can not mark a transaction in the data block header. This is either because the number of transactions has exceeded MAXTRANS (unlikely as we have only four process threads) or the block is full and there is no space the mark any additional transactions. What the deadlock graph does not show is a problem with the application code.
So can we come up with a reasonable model to support the case that the deadlocks are caused by contention for datablocks? That’s in the next part!