Testing Resilience of CockroachDB - Replacing 1 node.

 TL;DR: I'm using a multi-node cluster to see what happens if I remove and add nodes.

Spoiler: it is Resilient, but within limits.

The Replication-Factor determines how many nodes you can lose without impacting availability. And there is a more or less long period of vulnerability after losing a node: some ranges will run without their "3rd replica" for a while, which makes the system vulnerable until all the under-replicated ranges are fixed.


Background:

A distributed database uses replication of data to protect against failure. In the case of CockroachDB, the default replication factor is 3, and it uses the Raft protocol to elect a leader per range (normally also the "leaseholder") and to determine whether a large enough majority of replicas is available to continue: the quorum. Cockroach also features a more or less "automatic repair" to cover situations where a node disappears.
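For reference, that replication factor of 3 can be verified from the sql shell; a minimal sketch, assuming an insecure local cluster (on older versions the keyword is FOR instead of FROM):

    # Show the default zone configuration; num_replicas should read 3.
    cockroach sql --insecure -e "SHOW ZONE CONFIGURATION FROM RANGE default;"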


Testing: 3 nodes, an haproxy, and a running insert.

Nodes are called roach1 ... roach3, and additional nodes will be named accordingly. The start-scripts for the nodes are identical to the ones in earlier tests (link). Because I now only have 3 nodes, the ranges of all objects will be replicated over all three nodes.
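For completeness, a sketch of such a start-script; the paths, ports and addresses here are assumptions, the real scripts are in the earlier episode:

    # Start one node; repeat per node with its own store, ports and name.
    cockroach start --insecure \
      --store=/data/roach1 \
      --listen-addr=localhost:26257 --http-addr=localhost:8080 \
      --join=localhost:26257,localhost:26258,localhost:26259 \
      --background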

I've put the haproxy software in front of it (example here), telling it to try 9 possible addresses to connect to. This allows me to bring nodes up and down in my cluster.
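Roughly, the setup was along these lines (a sketch; cockroach can generate a starting haproxy.cfg, which I then extend so it lists all 9 addresses I may ever use, one "server ..." line per candidate node):

    # Generate a haproxy.cfg from the running cluster, edit it, then start haproxy.
    cockroach gen haproxy --insecure --host=localhost:26257
    haproxy -f haproxy.cfg &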

I have two tables: t and u. Table t will be used to "insert into while testing". Table u is loaded with 100,000 records, because I wanted some "sizeable" table in my database, especially since I expect data to be moved over nodes.
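The exact DDL is not important (the definitions below are a sketch, my assumption); the point is a small table t to insert into, and a table u with roughly 100,000 rows:

    cockroach sql --insecure -e "
      CREATE TABLE t (id INT PRIMARY KEY DEFAULT unique_rowid(),
                      ts TIMESTAMP DEFAULT now(),
                      payload STRING);
      CREATE TABLE u (id INT PRIMARY KEY, payload STRING);
      -- load u with 100,000 records of some sizeable payload
      INSERT INTO u SELECT i, repeat('x', 100) FROM generate_series(1, 100000) AS g(i);
    "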

Notice that the ranges (replicas) for tables t and u are spread over all three nodes.
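That spread can be checked from SQL as well (the output columns differ a bit between versions):

    cockroach sql --insecure -e "SHOW RANGES FROM TABLE t;"
    cockroach sql --insecure -e "SHOW RANGES FROM TABLE u;"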

I then run an insert-script for 1 record per second, to see if I can spot hiccups or outages. The script reports the time since the previous insert, which is normally 1 second plus some overhead, typically around 1075 ms. I use this to spot any hiccups or errors.
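The loop is roughly the sketch below; the haproxy address is an assumption, and the overhead presumably comes from starting a new sql client for every insert:

    prev=$(date +%s%3N)            # milliseconds since epoch (GNU date)
    i=0
    while true; do
      i=$((i+1))
      cockroach sql --insecure --host=localhost:26257 \
        -e "INSERT INTO t (payload) VALUES ('row $i');" \
        || echo "insert $i FAILED"
      now=$(date +%s%3N)
      echo "insert $i, interval since previous: $((now - prev)) ms"
      prev=$now
      sleep 1
    done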


First Test: kill 1 node: 
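The "kill" itself is just a hard stop of the node; a sketch, assuming the containers from the earlier setup (a plain process-kill amounts to the same thing):

    docker kill roach1
    # or, for a plain-process setup:
    # kill -9 $(pgrep -f 'store=/data/roach1')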

The effect is: nothing. Inserts just keep going. If I am out of luck and one of my connections happens to be to the killed node, I will get an error on that session. But that is expected.

Notice that the interval between inserts 39 and 40 goes up to 6816 ms, but it immediately resumes in the range of about 1050 ms. Note that I was lucky: my test-loop wasn't connected to the failed node at this point, hence no error at all.

Also, the haproxy will quickly find that roach1 is "unavailable" and won't try to connect to it anymore.

The overall cluster-health will indicate some "orange": 

There is 1 suspect/dead node in a cluster of 3 nodes. And thus the "ranges" are under-replicated. With only two nodes remaining, we miss 1 out of 3 replicas for every range.

From the SQL client, we can see something similar:

The "gossip_nodes" reports 1 node no longer alive, and its ranges and leases have been taken up by the remaining nodes. node_id=1 no longer holds any leases, they have been taken up by the other remaining nodes. 

It is worth noting: the ranges for tables t and u are still "spread" over nodes 1, 2 and 3.

But for the "user", the insert-loop, there was barely a glitch, and I suspect that a more professional configuration of network can get much better (smaller) timings for that hiccup still.

As expected: a 3 node cluster is resilient to failure of 1 node.

As an aside: the warning about the deprecated ranges-table is something I chose to ignore. The information in that table was too valuable, too insightful, not to use. I hope to come back to that in later episodes.
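For the curious, this is the kind of query I used on that deprecated table; treat the column list as a sketch, it varies per version:

    cockroach sql --insecure -e "
      SELECT range_id, table_name, replicas, lease_holder
      FROM crdb_internal.ranges
      WHERE table_name IN ('t', 'u');
    "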


Second test, kill 2nd node:

I left the first node down, and killed the 2nd node. Everything stops.

The remaining node, roach3, node_id=3, cannot find quorum (a majority of votes) for any of the objects it needs, and refuses to do anything. Some sessions may hang, others might fail with a timeout error, and some even get to display a server-originated error. The exact error probably depends on the stage they were in when the kill occurred.

But the good news is: when I re-start one of the down nodes, roach1 or roach2, it joins the still-running node roach3. With two of three nodes running, the cluster, or rather the "raft groups", obtain quorum for all ranges, and business resumes as usual. On a number of occasions with two-nodes-down, my insert-script had only been "hung" and continued with no further problems; in other cases it had looped trying to re-connect, and in some rare cases it had crashed and needed a re-start.

In summary: 

As expected, a 3-node cluster with 2 nodes down is... Down.

But as soon as one of the missing nodes comes back, the cluster resumes operations and the database is back in business, even if only 2 nodes are available.


Third test: One node Down, Repair back to 3-nodes.

Next I will bring down 1 node of a 3-node cluster, and I will start a 4th node to join the cluster. When the cluster is back at 3 nodes, I expect it to be resilient again.

So with the first node, roach1, down, and a new node, roach4, joined, I had a 3-node cluster again. And I tried to kill roach2, assuming that the two remaining nodes would survive:

Not quite. First, the insert-job seemed to hang for several minutes, then it started throwing errors at a high rate, something about lost quorum. And it seemed as if it could query, but not insert, because it threw errors on the insert but kept on reporting the last interval.

I quickly re-started roach2, to get back to 3-nodes again:

That got me back in business: inserts resumed, and all looked normal. I had been too quick with my "add-node and assume resilient" action. The web-gui still reported some under-replicated ranges.
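A quick command-line check I could have used instead of the gui; a sketch, the host address is an assumption:

    # Per-node range health; look at the ranges_underreplicated
    # and ranges_unavailable columns of the output.
    cockroach node status --ranges --insecure --host=localhost:26257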

What I think happened: the new node, roach4, was in the cluster, but not all ranges had a copy on roach4 yet. Hence, the ranges that were only on roach2 and roach3 were running "under-replicated": they only had 2 replicas. When node roach2 got killed, they went down to 1 replica, hence no "quorum of 2 out of 3", and the database stopped/failed.

This is what under-replication looks like in the gui-dashboard:

We notice 3 "live" nodes, looking good. But there are under-replicated ranges. This means your database is (still) at risk. When I added roach4 or re-started roach2,  I should have checked the under-replicated in the gui-dashboard. Once back to 3 nodes, that number came down rapidly to.... zero. And now, with a healthy 3-node cluster, and no under-replicated ranges, I could again kill/start/kill 1 out of the 3 nodes with minimal impact: 

Let's try it:


I could now kill/restart roach2, or some other node, with minimal impact...

QED: I could also kill/restart roach3 (or roach4) with minimal impact.


Lessons: Resilient and Repairable.

You can Survive the loss of One Node.

And you can (Should!) repair that loss ASAP.

As long as the cluster is healthy, with no under-replicated ranges, any stop/kill/loss of ONE NODE is nearly harmless: the database will keep functioning, and client connections will only notice either the "loss of their node" when connected to the failing node, or possibly a small hiccup because some underlying object needs to re-elect a raft-leader or lease-holder.

So Far So Good: this system can survive a single node-failure.

Repair, as in restore of resilience, can be done by either bringing back the failed node or by adding a new node to get back to the count of 3-nodes.

Caveat: on repair, you have to "wait" for the re-distribution of all ranges over sufficient nodes to restore full quorum for all ranges, i.e. every range has to have its 3 replicas distributed over the 3 nodes.
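A rough way to do that "waiting" from a script is to poll the replication reports in system.replication_stats (refreshed periodically, so expect some lag); a sketch under those assumptions:

    while true; do
      n=$(cockroach sql --insecure --host=localhost:26257 --format=csv -e \
            "SELECT coalesce(sum(under_replicated_ranges), 0)::INT FROM system.replication_stats;" \
          | tail -n 1)
      echo "under-replicated ranges: $n"
      [ "$n" = "0" ] && break
      sleep 30
    done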


There is more... but that has to wait for a next episode.

I want to make sure I "understand" this part before moving on to "bigger failures" and more funny situations.

There is, for example, the possibility of replication-factor 5, which should give the ability to survive 2 node-failures. I know already that the "catalog" is stored with replication-factor 5 whenever possible, hence the catalog is already better protected than some of the user-objects.
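That would presumably look something like the statement below, per table (or via ALTER RANGE default for everything); untested here, material for a next episode:

    cockroach sql --insecure -e "ALTER TABLE u CONFIGURE ZONE USING num_replicas = 5;"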

It may also be interesting to look into reducing the vulnerable time after a node-failure.

Many more things to look into.

--------   End of Blogpost -- Some Questions and Remarks below --- -- 


Q: Although quorum is vital to prevent "split brain", I wonder if there is a way to provoke a single node out of 3 to become a "stand-alone cluster". This would be akin to the very old trick of breaking a RAID-mirror set off a 3-way RAID to take out 1 copy and use it as a backup.

---> Edit: Reply from Cockroach-Slack (Shaun) was along these lines: 

1) Set the replication-factor to 1, and decommission the other nodes; then you would end up with a single node. => Yes. I should RTFM and test that sometime (a rough sketch follows after point 2).

2) Recovering from some disaster where only 1 node remains: yes, can also be done, but there is potential for logical data loss or logical inconsistency, as some range-replicas may not have received all the latest data yet.
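An untested sketch of option 1 (note that ALTER RANGE default only covers user data; the system ranges have their own zone configurations):

    cockroach sql --insecure -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 1;"
    cockroach node decommission 2 3 --insecure --host=localhost:26257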

As Always, I am Grateful to Shaun @ CRDB for answering my funny questions.

//End of edit  <--- 


Remark on ranges and deprecated internal tables/views: in a follow-up episode I think I can show that the information in the "deprecated tables" ranges and ranges_no_leases is Very Valuable. I'd like to keep that available in SQL, rather than only in a SHOW-command. I would probably ask for an SQL-queryable relation, a many-to-many relation, between ranges and user-objects. This would help to determine which objects depend on which nodes, and thus determine the risk involved in losing a particular node or set of nodes.

--------- real end of blogpost -------- 
