Resilience of Master nodes in Yugabyte.
TL;DR: Removing nodes that run Master-processes from a yugabyte cluster. When does it break...? And if it breaks (e.g. no more quorum): How to Fix It?
Background and Setup: Masters and TServers.
Out of curiosity, I started experimenting with YugabyteDB, a distributed database built on PostgreSQL, and hence very Postgres-compatible (link, check, reference).
My setup consists of a cluster of nodes (containers) in a docker-network (yes, running on a laptop...).
From looking at the demos and using the latest yugabyte-provided container (link), I've built a cluster in docker of at most 7 nodes, looking like this in yugatool:
Spot the Masters: three of them, running on the first three nodes.
And spot the TServers, all 7 of them, running on .... all seven nodes.
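For reference, the startup looked roughly like this. This is a sketch from my notes: the network- and node-names are mine, and the yugabyted flags may differ per image version.

# create a docker network for the cluster
docker network create yb-net

# the first node starts a new cluster (foreground mode keeps the container alive)
docker run -d --name node2 --hostname node2 --net yb-net \
  yugabytedb/yugabyte:latest bin/yugabyted start --background=false

# every next node joins the first one (repeat for node3 ... node8)
# note: older images use --daemon=false instead of --background=false
docker run -d --name node3 --hostname node3 --net yb-net \
  yugabytedb/yugabyte:latest bin/yugabyted start --join=node2 --background=false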
The three nodes that are listed as Masters run the Master processes. Those processes have a key role: they keep the meta-data and coordinate activity. In the list of Master-processes, one was elected Leader (M-L, Master Leader), and two are Followers (M-F, Master Follower).
Then there is the list of nodes that act as TServers. The TServer processes are the actual workers: they service the user-requests, and they store and process the data.
Notice: in this cluster, the three nodes that run the Master-processes also run TServer processes: they also work to process SQL, and to store and retrieve data.
Double check (ask yugabyte?): I think this is a common setup for yugabyte: a limited number of nodes, usually the default RF=3, act both as Master and as TServer.
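To double-check who is Master and who is TServer, yb-admin can list both. A minimal sketch, assuming the default master RPC port of 7100 and the masters on nodes 2, 3 and 4:

# list the master processes and their Raft role (LEADER / FOLLOWER)
yb-admin -master_addresses node2:7100,node3:7100,node4:7100 list_all_masters

# list all tablet servers known to the cluster
yb-admin -master_addresses node2:7100,node3:7100,node4:7100 list_all_tablet_servers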
Monitoring via nodeX
There is one more node in my docker-network: nodeX, acting as "The Client". I needed a platform to monitor and to run tests from, and the easiest was to put another node inside docker. Hence I've created a nodeX in the same network to act as client- and monitoring-platform. It currently serves to loop a shell-script that connects to the cluster-database to "fill" Table t with 1 record per run, and sleep one second before re-connecting (code at bottom of blog). The loop-script output looks like this:
After every insert it lists the number of milliseconds since the previous record (and the time, and some connect-info), which gives me an indication of hiccups. In a stable situation there are about 1600ms between inserts: a sleep of 1 sec, plus about 0.6 sec of overhead for connecting, inserting and reporting. Slow? This currently runs on a 10yo macbook MBP-i7 ...
The additional nodeX also serves to occasionally run yb-admin and yugatool. I may turn nodeX into some "management-node" or automated-platform at some point (e.g. automatically start actions based on findings...)
Nodes can fail.
A failure is usually the disappearance of a process or a node from the cluster. I want to start very simple: I'll just use docker-kill to stop nodes and see what happens.
docker kill node2
The result in yugatool is:
My cluster is now running with only 2 Masters, and 6 TServers. The Raft protocol has elected the Master-process on node3 to become the new Master-Leader.
And the hiccup on inserting records was:
The hiccup is noticeable in the insert-loop: it shows 3 timeouts, presumably because the client-tool experienced a timeout trying to connect. But after 31763ms it manages to connect and insert again. The first 3 insert-attempts are still a bit slow, with 7383ms and 12902ms between inserts, but then "normal service resumes". I didn't have to Do Anything manually, the cluster restored the service to the client.
I realise this is a primitive test on a slow laptop. I would expect service-resume to be faster on real hardware, and generally less hiccup.
When I re-start node2, it re-joins the cluster with no visible effect on the insert-loop, and the cluster looks like this:
Node3 retains the Master-Leader process, and the master-process on node2 becomes a Master-Follower. There was no error/warning, and no measurable difference in the timing of the "insert loop".
More nodes can fail.
So far so good: I can (patiently) survive the temporary outage of 1 (Master, Leader) node. It would seem evident that removing 2 of the 3 master nodes would cause the cluster a more serious problem. Let's try...
Currently, nodes 2 and 4 are running the Master-Follower processes; let's remove those:
docker kill node2
docker kill node4
The yugatool-screen failed almost immediately with "could not connect to master leader" (as more or less expected, because the majority of masters was poufff-gone), and the insert loop seemed to hang:
After several minutes (more than 2) I gave it two Control-Cs, with no visible result, but I left it waiting. It finally came back with some timeout errors.
Killing a majority of Master-processes really did stop the database.
Repair: Did you try turning it on...?
The easiest repair would be to re-start the nodes that contain the master-processes. And that worked:
I re-started node4, patiently looked at my screen for about 2 minutes, nothing yet..
I then also restarted node2, and almost immediately, the insert-loop started running again:
So trying to re-start nodes 4 and 2 had repaired my cluster? But when I checked yugatool, node4 was still missing:
The cluster had a quorum (majority) of Masters from node2 and node3, but node4 was missing. One more attempt at re-start of node4 also fixed that.
But there is an unsolved mystery there. Note to Self: on next tests, I need to check more closely whether a start-attempt actually results in a running node (something like the sketch below). Another item for the todo- and to-check list.
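A sketch of that restart-plus-verify sequence; the yb-admin call assumes the default master port 7100 and would be run from nodeX:

# restart the killed containers
docker start node4
docker start node2

# verify the containers are actually up...
docker ps --filter name=node2 --filter name=node4

# ...and that a master process is really running inside them
docker exec node4 ps -ef | grep [y]b-master

# then, from nodeX, check the Raft view of the masters
yb-admin -master_addresses node2:7100,node3:7100,node4:7100 list_all_masters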
Result: a YugabyteDB cluster is Resilient, as Expected.
The first playful tests show: the database will function as long as a quorum (majority) of master-nodes in the cluster is up and running. As Expected.
Note that when a Master disappears from a cluster, the cluster becomes more vulnerable. In an RF=3 situation, you can only afford to lose 1 master-process: the quorum is floor(RF/2)+1 = 2, so two of the three Masters must remain reachable.
There is a process for adding/removing Masters to a cluster, documented here, but it currently involves "manual" (or scripted) action via yb-admin.
For any production-deployment, I would recommend having some "master on standby" and possibly automating some of the cluster-reconfig needed in case you really lose a node that is listed as master.
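The yb-admin commands involved look roughly like this. This is only a sketch, not the full documented procedure: the node names and the 7100 port are just an example, and the replacement master process must already be running before the ADD_SERVER.

# remove the dead master from the Raft config...
yb-admin -master_addresses node3:7100,node4:7100 change_master_config REMOVE_SERVER node2 7100

# ...and add a (freshly started) master process on a standby node
yb-admin -master_addresses node3:7100,node4:7100 change_master_config ADD_SERVER node8 7100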
Questions... Always a lot of Questions, and more...
Q: is Automated (or configured) Master-replacement possible?
When a master disappears from the cluster, why can some other node not start, or assume the role of, that master more or less straight away? In the yugabyte-provided containers, every container seems to start a master-process anyway, but only the first 3 started containers (nodes) assume the Master-role. The master processes running on the other nodes are just there for... what for, actually?
Possibly a quirk of the yugabyted utility?
Proof: a ps -ef from node8 shows a running master-process, check PID=18.
Hence, even on the last-started and last-joined node, node8, there is a process running that _could_ be a master for the cluster. However, on all nodes, the contents of yugabyted.conf show a masterlist with the addresses of nodes 2, 3, and 4. This may be "hardcoded" or persisted in various other locations as well.
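A quick way to check both from the docker host (a sketch; I don't know the exact location of yugabyted.conf inside the container, hence the find):

# is there a yb-master process on the last-joined node?
docker exec node8 ps -ef | grep [y]b-master

# and which masterlist does its yugabyted.conf contain?
docker exec node8 sh -c 'find / -name yugabyted.conf -exec grep -H master {} \; 2>/dev/null'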
I also know, from earlier playing/testing, that replicas of tablets are re-located to other tserver-nodes when a tserver is out for a period of time.
It seems to me this "re-location" could also happen with Master-processes: if a Master is not responding for a longer period of time (minutes?), can the remaining "leader", or the remaining raft-group, not pick a candidate from the other nodes and have it join the masterlist?
Q: I can (and do) spy on my servers via SQL. Check this screen running a query against the yb-provided function yb_servers():
This shows me the list of running nodes, or possibly just the list of nodes where ysqlsh can find a query-processor.
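For reference, such a query can be run from nodeX with the same connect-string that do_fill.sh uses (a sketch):

# list the servers as seen by the query layer
ysqlsh -X "postgresql://yugabyte@node5:5433,node6:5433,node7:5433?connect_timeout=2" \
  -c "select * from yb_servers();"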
If this information is available, would it be OK to ask for similar info from yb_masters() and yb_tservers(), and possibly expose some of the info from yb-admin or yugatool via SQL?
In a previous blog, I also took down a node from the master-list, and several nodes running TServer processes. At one point I took down All of the non-master nodes... and the cluster kept functioning! (caveat: if, and only if, you give the cluster time to re-balance in between). This already opens up a lot of possibilities for cluster-migration: for example adding + removing nodes to move to other hardware or another cloud provider.
But all of that has to wait a bit. As always, testing-around in an IT system generates more and more questions. Limited time, Need to Prioritise, Most-Useful things first.
First I should really RTFM a bit more, and I need to play around a Lot to explore and discover more.
-------- end of Blogpost --- some more questions below -----
Q: killing a Master-Leader or Master-Follower currently generates a hiccup. Would there be less hiccup if data was placed on tservers outside of the master-list nodes? (need to test with tablespaces to locate data...)
Q: would fewer tablets (a lower split-into count), or less spreading over nodes, make a table more robust against hiccups or node-failures?
Q: when a node goes down, which tables are affected? I would like to be able to SQL the table + tablet locations to see this!
Q: yb-admin and yugatool: can we have the lists of nodes, masters, and tservers ordered by host-name or IP? It helps to spot changes if the lists are Consistently Ordered by some property, preferably the host-IP or hostname.
more notes...
I couldn't resist...
https://github.com/yugabyte/yugabyte-db/issues/18951
Addendum: the script do_fill.sh:
Things to note:
- The connect-string: a multi-host URI, which gives primitive failover.
#!/usr/bin/sh
#
# do_fill.sh: fill the table t, and leave some info in filler-column
#
# set -v -x
# pick up the host name...
export hostnm=`hostname`
while true
do
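# multi-host connect string: libpq tries node5, node6, node7 in order (primitive failover)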
ysqlsh -X postgresql://yugabyte@node5:5433,node6:5433,node7:5433?connect_timeout=2 <<EOF
\set QUIET on
\timing off
\pset footer off
\t
with s as ( select nextval('t_seq') as id
, pgs.setting as hostadr
from ( select generate_series ( 1, 1 ) ) as sub
, pg_settings pgs
where pgs.name = 'listen_addresses'
)
insert into t
select
s.id as id
, case mod ( s.id+1, 10000 ) when 0 then 'Y' else 'N' end as active
, mod ( s.id, 10000 ) / 100 as amount
, now () as dt /* timestamp, ms */
, rpad ( fnNumberToWords ( s.id ), 198) as payload
, '{ "client" :"$hostnm",'
|| ' "host" :"' || s.hostadr || '"}' as filler
from s
;
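-- report: milliseconds since the previous insert (LAG over id), plus timestamp and filler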
with t_seconds as ( select id
, dt
, ( to_char ( dt, 'SSSS' ) )::bigint as secs
, 1000 * ( to_char ( dt, 'SSSS' ) )::bigint
+ to_char ( dt, 'MS' )::int as msecs
, substr ( filler, 1, 50 ) as filler
from t
)
select
msecs - LAG ( msecs, 1 ) OVER w as msec_diff
-- , id
, to_char ( dt, 'HH24:MI:SS.MS' ) as timestmp
, filler
from t_seconds
window w as ( order by id )
order by id desc
limit 1;
EOF
sleep 1
done
# end while loop