Resilience of Master nodes in Yugabyte.

TL;DR: Removing nodes that run Master-processes from a yugabyte cluster. When does it break...?  And if it breaks (e.g. no more quorum): How to Fix It? 


Background and Setup: Masters and TServers.

Out of curiosity, I started experimenting with Yugabyte, a distributed database built on PostgreSQL, and hence very Postgres-compatible (link, check, reference).

My setup consists of a cluster of nodes (containers) in a docker-network (yes, running on a laptop...)

From looking at the demos and using the latest yugabyte-provided container (link), I've built a cluster in docker consisting of at most 7 nodes, looking like this in yugatool:

Spot the Masters: three of them, running on the first three nodes. 

And spot the TServers, all 7 of them, running on .... all seven nodes.
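For completeness: the containers were started more or less like this (a sketch, not the exact commands; the container names and network name are from my setup, and the yugabyted flags are the ones from the standard docker quickstart, so verify against the current docs):

# create the docker network
docker network create yb-net

# first node: yugabyted starts both a master- and a tserver-process
docker run -d --name node2 --hostname node2 --net yb-net \
    yugabytedb/yugabyte:latest bin/yugabyted start --background=false

# every next node joins the first one; repeat for node3 ... node8
docker run -d --name node3 --hostname node3 --net yb-net \
    yugabytedb/yugabyte:latest bin/yugabyted start --background=false --join=node2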

The three nodes that are listed as Masters run the Master processes. Those processes have a key role: they keep the metadata and coordinate activity. In the list of Master processes, one was elected Leader (M-L, Master Leader), and two are Followers (M-F, Master Follower).

Then there is the list of nodes that act as TServers. The TServer processes are the actual workers: they service the user-requests, and they store and process the data.

Notice: in this cluster, the three nodes that run the Master-processes also run TServer processes: they also work to process SQL, and to store and retrieve data.

Double check (ask yugabyte?): I think this is a common setup for yugabyte: a limited number of nodes, usually the default of RF=3, act both as Master and as TServer.
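One way to double-check the roles from nodeX is via yb-admin, which can list both groups (the master addresses are the ones from my setup, with 7100 being the default master RPC port):

# list the master-processes and their Raft role (LEADER / FOLLOWER)
yb-admin -master_addresses node2:7100,node3:7100,node4:7100 list_all_masters

# list the tserver-processes known to the cluster
yb-admin -master_addresses node2:7100,node3:7100,node4:7100 list_all_tablet_servers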


Monitoring via nodeX

There is one more node in my docker-network: nodeX, acting as "The Client". I needed a platform to monitor and run tests from, and the easiest option was to put another node inside docker. Hence I created nodeX in the same network to act as client- and monitoring-platform. It currently serves to loop a shell-script that connects to the cluster-database to "fill" table t with 1 record per run, and sleeps one second before re-connecting (code at the bottom of this blog). The loop-script output looks like this:

After every insert it lists the number of milliseconds since the previous record (and the time, and some connect-info), which gives me an indication of hiccups. In a stable situation, there are about 1600ms between inserts: a sleep of 1 sec, plus about 0.6 sec of overhead for connecting, inserting and reporting. Slow? This currently runs on a 10yo macbook MBP-i7 ...

The additional nodeX also serves to occasionally run yb-admin and yugatool. I may turn nodeX into some "management-node" or automated-platform at some point (e.g. automatically start actions based on findings...)


Nodes can fail.

A failure is usually the disappearance of a process or a node from the cluster. I want to start very simple: I'll just use docker kill to stop nodes and see what happens.

docker kill node2

The result in yugatool is:

My cluster is now running with only 2 Masters and 6 TServers. The Raft protocol has elected the Master-process on node3 to become the new Master-Leader.

And the hiccup on inserting records was:

The hiccup is noticeable in the insert-loop: it shows 3 timeouts, presumably because the client-tool timed out trying to connect. But after 31763ms it manages to connect and insert again. The first 3 insert-attempts are still a bit slow, with 7383ms and 12902ms between inserts, but then "normal service resumes". I didn't have to Do Anything manually, the cluster restored the service to the client.

I realise this is a primitive test on a slow laptop. I would expect service to resume faster on real hardware, and generally with less of a hiccup.

When I re-start node2, it re-joins the cluster with no visible effect on the insert-loop, and the cluster looks like this:

Node3 retains the Master-Leader role, and the master-process on node2 becomes a Master-Follower. There was no error/warning and no measurable difference in the timing of the "insert loop".


More nodes can fail.

So far so good: I can (patiently) survive the temporary outage of 1 (Master-Leader) node. It would seem evident that removing 2 of the 3 master nodes would cause the cluster a more serious problem. Let's try...

Currently, nodes 2 and 4 are running the Master-Follower processes; let's remove those:

docker kill node2

docker kill node4

The yugatool-screen failed almost immediately with "could not connect to master leader" (as more or less expected, because the majority of masters was poufff-gone), and the insert loop seemed to hang:


After several minutes (more than 2) I gave it two Control-Cs, with no visible result, but I left it waiting. It finally came back with some timeout errors.

Killing a majority of Master-processes really did stop the database.


Repair: Did you try turning it on...?

The easiest repair would be to re-start the nodes that contain the master-processes. And that worked: 

I re-started node4, patiently looked at my screen for about 2 minutes: nothing yet..

I then also restarted node2, and almost immediately, the insert-loop started running again: 


So trying to re-start nodes 4 and 2 had repaired my cluster? But when I checked yugatool, node4 was still missing:


The cluster had a quorum (majority) of Masters from node2 and node3, but node4 was missing. One more attempt at re-starting node4 fixed that as well.

But there is an unsolved mystery there. Note to Self: on the next tests, I need to check more closely whether a start-attempt actually results in a running node. Another item for the todo- and to-check list.
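Something like this is what I plan to script for the next round (a sketch only, with the container- and host-names from my setup):

# (re-)start the container, and verify it is actually up
docker start node4
docker ps --filter name=node4 --format '{{.Names}}: {{.Status}}'

# verify the master-process is running inside the container
docker exec node4 ps -ef | grep [y]b-master

# and verify it re-appears in the masterlist
yb-admin -master_addresses node2:7100,node3:7100,node4:7100 list_all_masters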


Result: a yugabyte cluster is Resilient, as Expected.

The first playful tests show: the database will keep functioning as long as there is a quorum (majority) of master-nodes up and running in the cluster. As Expected.

Note that when a Master disappears from a cluster, the cluster becomes more vulnerable. In an RF-3 situation, you can only afford to lose 1 master-process. 

There is a process for adding and removing Masters in a cluster, documented here, but it currently involves "manual" (or scripted) action via yb-admin.
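From my reading of that page, the procedure boils down to yb-admin calls along these lines (a sketch only, with the hosts and ports from my setup; the full procedure, including starting a master-process on the new node first, is in the docs):

# remove the lost master from the Raft configuration
yb-admin -master_addresses node2:7100,node3:7100,node4:7100 \
    change_master_config REMOVE_SERVER node4 7100

# add a replacement master (assuming a yb-master process is already running there)
yb-admin -master_addresses node2:7100,node3:7100 \
    change_master_config ADD_SERVER node5 7100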

For any production-deployment, I would recommend having some "master on standby" and possibly automating some of the cluster-reconfig needed in case you really lose a node that is listed as master.


Questions... Always a lot of Questions, and more...

Q: is Automated (or configured) Master-replacement possible? 

When a master disappears from the cluster, why can some other node not start, or assume, the role of that master more or less straight away? In the yugabyte-provided containers, every container seems to start a master-process anyway, but only the first 3 started containers (nodes) assume the Master-role. The master processes running on the other nodes are just there for... what, actually?

Possibly a quirk of the yugabyted utility?

Proof: a ps -ef from node8 shows a running master-process, check PID=18

Hence, even on the last-started and last-joined node, node8, there is a process running that _could_ be a master for the cluster. However, on all nodes, the contents of yugabyted.conf show a masterlist with the addresses of nodes 2, 3, and 4. This may be "hardcoded" or persisted in various other locations as well.
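For reference, this is how I look at it from nodeX (the conf-file path is where I found it under the default base-dir, ~/var, inside this container; it may differ in other setups):

# the master-process runs even on node8, which is not in the masterlist
docker exec node8 ps -ef | grep [y]b-master

# the masterlist as persisted by yugabyted
docker exec node8 grep -i master /root/var/conf/yugabyted.conf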

I also know, from earlier playing/testing, that replicas of tablets are re-located to other tserver-nodes when a tserver is out for a period of time. 

It seems to me this "re-location" could also happen with Master-processes: if a Master is not responding for a longer period of time (minutes?), can the remaining "leader", or the remaining raft-group, not pick a candidate from the other nodes and have it join the masterlist?


Q: I can (and do) spy on my servers via SQL. Check this screen running a query against the yb provided function yb_servers():


This shows me the list of running nodes, or possibly just the list of nodes where ysqlsh can find a query-processor.
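For reference, the query is simply a select from that function; among other things it reports host, port and cloud/region/zone per server (the exact column-list may vary per version):

-- list the servers known to the query-layer
select * from yb_servers() order by host;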

If this information is available, would it be OK to ask for similar info from yb_masters() and yb_tservers(), and possibly expose some of the info from yb-admin or yugatool via SQL?

In a previous blog, I also took down a node from the master-list, and several nodes running TServer processes. At one point I took down All of the non-master nodes ... and the cluster kept functioning! (caveat: If, and only If, you give the cluster time to re-balance in between). This already opens up a lot of possibilities for cluster-migration: for example adding + removing nodes to move to other hardware or another cloud provider.

But all of that has to wait a bit. As always, testing-around in an IT system generates more and more questions. Limited time, Need to Prioritise, Most-Useful things first.

First I should really RTFM a bit more, and I need to play around a Lot to explore and discover more.


-------- end of Blogpost --- some more questions below ----- 

Q: killing a Master-Leader or Master-Follower currently generates a hiccup. Would there be less hiccup if data was placed on tservers outside of the master-list nodes? (need to test with tablespaces to locate data...)

Q: would fewer tablets (splits), or less spread-over-nodes, make a table more robust against hiccups or node-failures?

Q: when a node goes down, which tables are affected? I would like to be able to SQL the table+tablet-locations to see this!
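Part of this is already available via yb-admin, which can list the tablets of a table together with their leader-locations; something like this for my table t in the default yugabyte database (syntax from my reading of the docs, so verify):

# list the tablets of table t, with their leaders
yb-admin -master_addresses node2:7100,node3:7100,node4:7100 \
    list_tablets ysql.yugabyte t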

Q: yb-admin and yugatool: can we have the lists of nodes, masters and tservers ordered by host-name or IP? It helps to spot changes if the lists are Consistently Ordered by some property, preferably the host-IP or hostname.


more notes...

I couldn't resist..

https://github.com/yugabyte/yugabyte-db/issues/18951


Addendum: the script, do_fill.sh:

things to note: 

 - The connect-string: it lists multiple hosts (node5, node6, node7), which gives a primitive form of failover.


#!/usr/bin/sh
#
# do_fill.sh: fill the table t, and leave some info in filler-column
#
# set -v -x

# pick up the host name...
export hostnm=`hostname`

while true
do
  ysqlsh -X postgresql://yugabyte@node5:5433,node6:5433,node7:5433?connect_timeout=2 <<EOF
 
    \set QUIET on
    \timing off
    \pset footer off
    \t

    with s as ( select nextval('t_seq') as id
                     , pgs.setting      as hostadr
                  from ( select generate_series ( 1, 1 ) ) as sub
                     , pg_settings pgs
                  where pgs.name = 'listen_addresses'
              )
    insert into t
    select
      s.id                                     as id
    , case mod ( s.id+1, 10000 )  when 0 then 'Y' else 'N' end  as active
    , mod ( s.id, 10000 ) / 100                as amount
    , now ()                                   as dt   /* timestamp, ms  */
    , rpad ( fnNumberToWords ( s.id ), 198)    as payload
    ,     '{ "client" :"$hostnm",'
      ||  '  "host"   :"' || s.hostadr || '"}' as  filler
    from s
    ;

    with t_seconds as ( select id
               , dt
               , ( to_char ( dt, 'SSSS' ) )::bigint        as secs
               , 1000 * ( to_char ( dt, 'SSSS' ) )::bigint
                  + to_char ( dt, 'MS' )::int              as msecs
               , substr ( filler, 1, 50 )                  as filler
            from t
    )
    select
           msecs - LAG  ( msecs, 1 ) OVER w as msec_diff
    --   , id
         , to_char ( dt, 'HH24:MI:SS.MS' )  as timestmp
         , filler
    from t_seconds
    window w as ( order by id )
    order by id desc
    limit 1;

EOF

  sleep 1

done
# end while loop
	
------ end of do_fill.sh ----- 
