Inconsistent DB behaviour on DB failover in an HA Corda Setup


Hi All,

I ran a few tests to check DB failover between two parties, PartyA and PartyB, using the IOU CorDapp. The setup runs on a RHEL 7.6 system with Corda Enterprise 4.1.

PartyA is in an HA hot-cold setup, pointing to a clustered Postgres 9.6 vault DB that is set up in a primary/secondary configuration.

I fired continuous RPC requests for IOUFlow from PartyA to PartyB and vice versa. While the requests were running, the primary DB of the PartyA node was brought down (process killed abruptly).

I observed the following behavior for 100 requests:

- After killing the primary DB, there was a switch-over: the secondary DB became the primary, and new requests were recorded in this DB's vault. (expected behavior)
- However, the number of records written to the vaults of PartyA and PartyB differed by 1 in some cases.
I compared the record counts of PartyA (iou_states + node_checkpoint) with those of PartyB (iou_states + node_checkpoint), and in some cases they were off by 1 record.

Sometimes the record count differed in the iou_states table of PartyA and PartyB.
Sometimes the record count differed in the node_checkpoint table of PartyA and PartyB (because the node could not commit when the primary DB went down?). These checkpoints remain as they are and do not get rescheduled later; even after a node restart, they just stay there.
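
A rough sketch of this kind of count comparison, in case anyone wants to reproduce it. It assumes a Python DB-API 2.0 connection to each party's database (for Postgres that could be one opened with psycopg2) and uses the table names exactly as written above. The helper names and the in-memory SQLite demo are hypothetical and only make the sketch runnable on its own.

```python
import sqlite3

# Table names as written in this post; adjust them to the actual schema.
TABLES = ("iou_states", "node_checkpoint")


def table_counts(conn, tables=TABLES):
    """Return {table: row count} for any DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts


def count_diff(counts_a, counts_b):
    """Return {table: (count_a, count_b)} for tables whose counts differ."""
    return {
        table: (counts_a[table], counts_b.get(table))
        for table in counts_a
        if counts_a[table] != counts_b.get(table)
    }


def make_demo_db(n_states, n_checkpoints):
    """Hypothetical stand-in for a party's DB: in-memory SQLite with dummy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE iou_states (id INTEGER)")
    conn.execute("CREATE TABLE node_checkpoint (id INTEGER)")
    conn.executemany("INSERT INTO iou_states VALUES (?)",
                     [(i,) for i in range(n_states)])
    conn.executemany("INSERT INTO node_checkpoint VALUES (?)",
                     [(i,) for i in range(n_checkpoints)])
    return conn


if __name__ == "__main__":
    party_a = make_demo_db(n_states=100, n_checkpoints=1)
    party_b = make_demo_db(n_states=100, n_checkpoints=0)
    # Only tables with mismatching counts are reported.
    print(count_diff(table_counts(party_a), table_counts(party_b)))
```

Taking the counts only after all requests have finished avoids comparing numbers while flows are still in flight.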

- What can cause the difference in records? Is this normal? I was expecting the transactions recorded in the vaults of both parties to be the same after the switch-over.

- Why do the node_checkpoint records not get picked up after some time?