Raft Notary


Sean
 

Hi,
  I'd like to know what limitations are known for the embedded Copycat Raft server inside the notary, per https://docs.corda.net/docs/corda-os/4.5/running-a-notary.html
  Copycat has since been superseded by the Atomix server (https://atomix.io/docs/latest/getting-started/). Do the previously mentioned limitations still apply to Atomix?
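  For context, the Raft notary configuration described on that page is, as far as I can tell, along these lines (the service name, hosts and ports below are just placeholders):

      notary {
          validating = false
          serviceLegalName = "O=Raft Notary Service, L=Zurich, C=CH"
          raft {
              nodeAddress = "localhost:10008"
              clusterAddresses = ["localhost:10008", "localhost:10012", "localhost:10016"]
          }
      }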

  Thanks.

\Sean


Thomas Schroeter
 

Hi Sean,

We found data synchronization issues with Copycat Raft when a notary cluster member rejoins the cluster after an offline period. I can't comment on the latest Copycat implementation, since I haven't tested it.

Thanks,
Thomas


Sean
 

Thanks, Thomas, for the info.
To try to reproduce the synchronization issues in the demo, I stopped one of the notary workers and let the other two continue to notarize transactions. After a few minutes, I restarted the stopped worker. It looks like only committed transactions got synced up, while the request log and committed states did NOT sync: they only held the latest transactions, and the previous ones were lost.
Repeating the exercise showed the same behavior: committed transactions synced up fully, while the request log and committed states again only held the latest transactions and lost the previous ones.
Is that the behavior you experienced, or did you see other sync issues?
I wonder why it behaves this way, with one table fully synced and the other two keeping only the latest transactions and losing the previous ones. Is that a Copycat bug, or something related to our integration?

Any insight you can provide is appreciated.

Thanks.

\Sean


Thomas Schroeter
 

Hi Sean,

I focused on committed states and tested adding new workers to the cluster. I experienced sync issues similar to the ones you describe and chalked them up to a library bug.

Thanks,
Thomas


Sean
 

Hi Thomas,
With some minor tweaks to the commit command and the snapshot install, I am able to reach sync for this use case:
all 3 workers (W0, W1, W2) up -> 1 tx -> check all in sync -> stop W2 -> 1 more tx -> start W2 -> W2 got in sync

Here by "in sync", I mean all three notary tables for all three workers have the same transaction data.

I also tested the use case of adding a new worker to the cluster:
2 workers (W0, W1) up -> 1 tx -> start W2 -> Is W2 in sync?
The answer depends on clusterAddresses. If clusterAddresses include all three workers, W2 will be in sync.
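Concretely, what seems to matter is that W2 is started with a raft configuration that already lists all three workers, roughly like this (hosts and ports are placeholders):

      raft {
          nodeAddress = "localhost:10016"    // W2's own address
          clusterAddresses = ["localhost:10008", "localhost:10012", "localhost:10016"]
      }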

So it looks like for a static cluster we can start and stop workers and the cluster will get back in sync, as long as N/2 + 1 workers are up at any given time (for a three-worker cluster, that means at least two). Of course, more tests are needed to confirm that.

But if we just add a new worker to an existing cluster that knows nothing about it, the new worker does not catch up on any of the cluster's previous transactions. That is probably because the cluster has already cleaned the commit log due to FULL compaction.

To anticipate the dynamic addition of workers in the future, it seems the cluster should not clean the commit log at all. What would be the implications of keeping a permanent log in terms of storage and performance?

In some deployment scenarios, the cluster size is known up front by design, so we can use the static cluster configuration to ensure consistency and still enable log compaction. That, again, is pending more tests.

Those are just my very preliminary observations. I am sure I have missed something important. So please advise.

Thanks.

\Sean