Distributed databases: how many shards, and where are they?

[+/- Draft! Beware: Work In Progress]

TL;DR: Trying to find out how to shard data on Yugabyte (link). I find a lot of "moving parts" in the YB-Universe, and try to explain and simplify.

For Deep-Down-Engineers and YB-ers: check the questions at the bottom.


Background

The Future of Databases is ... Serverless, possibly sharded. But Sharding is something for Very Large data sets. An average (serverless) database that comes from "the real world" doesn't need 1000s of shards... IMHO, it needs "ACID" and Relational Capabilities First.

Yugabyte does this, and is potentially "serverless", which should make the database more Resilient, more Scalable (on-demand scaling?) and overall Easier to Operate.

By experimenting with a 6-node database, I try to observe the sharding, and might try to draw some conclusions or "good practices" from what I see. My "cluster" runs in docker containers, hence K8s or other container-systems should also work.

After the first experiments, I am happy with my cluster, but I have also generated a lot of questions (for Yugabyte, for customers, for operators...).


My cluster and tooling-scripts

The destruction and re-creation of my cluster is by now totally automated. I'm creating a 6-node cluster using a slightly improved version of the script I built a while ago, see Earlier Blog (DM me if you want a copy, but it was/is mainly for demo- and educational purposes).

I have also generated a few primitive tools: 

A shell-script to check the health of my clusters and provide me with info on IPs and node-names: chk_clu.sh:

#!/bin/ksh

# chk_clu.sh: get some info on the yb-cluster, and loop that...
#
# todo: node-list should be variable...
# todo: masterlist still hardcoded, risky?
# todo: separate for YB-Masters and YB-Servers ? 
#

# do quick check first, minimal info, minimal sleep, just quick....
# I notably want to verify the IP address of each node.
for node in node1 node2 node3 node4 node5 node6 node7
do

  echo doing node $node  
  echo $node:
  docker exec $node  ps -ef  | grep database_host | cut -d= -f2  
  docker exec -it $node yugabyted status ; sleep 0 ; echo .

done

# now loop more slowly over nodes, if too much load: sleep longer
while true 
do

  for node in node1 node2 node3 node4 node5 node6 node7
  do

    echo $node:
    docker exec $node ps -ef  | grep database_host | cut -d= -f2  
    docker exec -it $node yugabyted status | grep atus; sleep 2 ; echo .

    echo .
    echo $node get_universe_config:
    docker exec -it $node yb-admin \
      --master_addresses node1:7100,node2:7100,node3:7100 \
      get_universe_config  | jq | grep clusterUu

    echo .
    echo $node list_all_masters:
    docker exec -it $node yb-admin \
      --master_addresses node1:7100,node2:7100,node3:7100 \
      list_all_masters

    echo .
    echo $node list_all_t-servers:
    docker exec -it $node yb-admin \
      --master_addresses node1:7100,node2:7100,node3:7100 \
      list_all_tablet_servers

    echo ----- $node done, next.. 

    sleep 2

  done

  echo ----- loop over nodes done, next.. 
  sleep 9

done 

# ------------------- end chk_clu.sh --------------

And a script to execute a command against all nodes (preferably for observation, not meant for Actions): do_all_clu.sh:

#!/bin/ksh

# do_all_clu.sh: loop over all nodes with a command...
#
# todo: HARDcoded nodenames TWICE: make a list.
#
# typical usage: measure disk-usage, or check tablets..
# ./do_all_clu.sh du -sh --inodes /root/var/data/yb-data/
# ./do_all_clu.sh yb-admin  -master_addresses node1:7100 list_tablets ysql.yugabyte t_1st 0
#

#  verify first, show command

echo .
echo do_all_clu: \[  $* \] ... 
echo .

# do it once, quick...
for node in node1 node2 node3 node4 node5 node6 node7
do

  echo doing node $node  
  docker exec -it $node $*

done


echo .
echo do_all_clu.sh: Ctrl-C or ...continue doing it slower forever...
echo . 

sleep 10

# now loop slowly over nodes
while true 
do

  echo .
  echo do_all: \[  $* \] ... 
  echo .

  for node in node1 node2 node3 node4 node5 node6 
  do

    # echo doing node $node  
    docker exec -it $node $*

    sleep 2 

  done

  echo ----- do_all_clu.sh: loop over nodes done, next.. 
  sleep 10

done 

# ----------------- end do_all_clu.sh -------------

Useless tidbit: those monitor-scripts keep running even if I destroy and rebuild my cluster from scratch. It took me a while to realise that hardcoded IPs in there were not a good idea, but in the file yugabyted.conf I keep finding hard-coded (saved?) IPs. That tells me there may be other similar files; I need to RTFM more?
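For the record, here is how I go looking for those saved IPs. The conf-file location is an assumption on my part, derived from where the data-directory lives in my containers; the find-command should report the real path on your setup:

# locate yugabyted.conf inside node1 ...
docker exec -it node1 find /root/var -name yugabyted.conf

# ... then grep it for anything that looks like a hard-coded IP (adjust the path if find says otherwise)
docker exec -it node1 grep -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" /root/var/conf/yugabyted.conf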

Equipped with these scripts, with some ideas from reading the documentation at the Yugabyte site (link), and with what I remembered from "sharding" systems long ago, I set off on some experiments. I was notably curious how the sharding would work out in practice (yes, I did RTFM), and how that would impact things like RPC-call volumes.

The first exploration will be Very Simple... I hope. 


What to look for? Simplicity, and avoiding Overhead!

For this Exploration, and for future experiments, I hesitate between "destructive testing", e.g. removing nodes and seeing how the system survives (maybe including geo-spread testing...), and "overhead-testing", e.g. putting load on the system and seeing how it holds up (again, maybe involving long-distance latencies and geo-separated requirements).

On reading, and on very early observations, I got questions about YB-Masters/TServers, about how-many-tablets (too many shards?), and about overhead on too-small tables. And as usual, more and more questions pop up as I read and discover.

Keeping in mind the blog-title, I will try to begin As Simple As Possible: with a near-empty cluster(-database) and 1 table.

Let's create a fresh cluster and have a look. My script creates a new docker-net and 6 containers. Here is a snippet:

docker run -d --network yb_net  \
  --hostname node5 --name node5 \
  -p15435:15433 -p5435:5433     \
  -p7005:7000 -p9005:9000       \
  yugabytedb/yugabyte           \
  yugabyted start --background=false --join node1.yb_net

sleep 15

.. etc

Repeat this six-fold with 6 node-names, and do some checking with "docker ps" and "yugabyted status".
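The very first node is slightly different: it has nothing to --join yet, so it gets started without that flag, roughly like below (sketched from memory, not a literal copy of the script; the 1:1 port-mapping follows the same pattern as the snippet above):

# create the docker network, then start the first node without --join
docker network create yb_net

docker run -d --network yb_net  \
  --hostname node1 --name node1 \
  -p15433:15433 -p5433:5433     \
  -p7000:7000 -p9000:9000       \
  yugabytedb/yugabyte           \
  yugabyted start --background=false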

It takes about 1 minute. When in a real hurry, I will experiment with reducing the sleep time, but for now I leave those sleep-waits in, on the advice of a friend, to avoid errors.

When the 6 containers are up and running, I can log on with ysqlsh, and I see... six nodes:


From the screenshot above, I conclude the following about the "empty" system:

 - Six nodes in yb_servers()

 - No User Tables in the database yet, just 1 view of mine to check on yb-properties.

 - I can query the pg-catalog (checking my own view in pg_class)

 - I query the yb-properties of the catalog-table pg_class, and I see it is supposedly assigned 1 (one, single) tablet in the yb-storage system. (The queries behind these checks are sketched below.)
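For reference, these are roughly the queries behind those bullets. I assume ysqlsh is on the PATH inside the container, like yugabyted and yb-admin; if not, use its full path under /home/yugabyte/bin:

# how many nodes does the SQL-layer see?
docker exec -it node1 ysqlsh -h node1 -c "select host, zone from yb_servers();"

# any user-objects yet? (should only show my one view)
docker exec -it node1 ysqlsh -h node1 -c "select relname, relkind from pg_class where relnamespace = 'public'::regnamespace;"

# the yb-properties of the catalog-table pg_class
docker exec -it node1 ysqlsh -h node1 -c "select * from yb_table_properties('pg_class'::regclass);"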

Curious, I also looked around inside the containers:

First, there is the yb-admin tool to list the Universe, Masters and TServers, which looks something like this (taken from the chk_clu.sh output):


We notice 3 Masters and 6 TServers, with a lot of information. I would Really like to be able to SQL-select this information, but that may take Yugabyte some time to implement?

I can also use docker exec to shell into a container. Here I go into node4, but the other containers look very similar in processes and disk-size:

All the containers seem to run a master, a tserver, a ui process, and then some. 
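For the record, a quick way to look at those processes without a screenshot (the grep-pattern is simply the process-names I see in my containers):

# list the yugabyte-related processes inside node4
docker exec node4 ps -ef | grep -E "yb-master|yb-tserver|yugabyted|postgres" | grep -v grep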

I'm also interested in the data-directories:



The "yb-data directory" is likely to contain some postgres-catalog info, some master info, and then the tablet-data. I'm going to keep an eye on these directories using the do_all_clu.sh: loop over the nodes in the cluster, and regularly do a du (disk usage) of these while I create a table.

Here is what the disk-usage (inodes and Mbytes) looks like on an empty database: six lines, in the order node1 ... node6, on a cluster that has been running idle for a few hours (I didn't create/drop any objects inside the database).

Inodes:


Megabytes:


This is partly how I discovered: nr-of-nodes = nr-of-tablets. It took me a while (and some help) to discover the yb_table_properties() function, and I really wish there were more of those. Ideally, the SQL-interface would also expose which tablets, and which nodes, the data of a table goes to. For the moment, I have to do that with the yb-admin tool.
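The yb-admin call I use for that looks like this; t_1st is just my test-table, and the keyspace-notation ysql.yugabyte is the same as in the script-comments above:

# list the tablets of my test-table, including which node holds the leader of each (0 = no limit)
docker exec -it node1 yb-admin \
  --master_addresses node1:7100,node2:7100,node3:7100 \
  list_tablets ysql.yugabyte t_1st 0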

What we know now: Creating a table (with no options specified) generates a table split into 6 tablets, with  supposedly 1 tablet placed on each node...

But do I always want 6 tablets? 
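Whether I do or not, the SPLIT-clause seems to be the way to control it. A sketch for the next experiments, with my own example-names, and again assuming ysqlsh is reachable as above:

# default: the cluster decides the tablet-count (6 tablets on my 6-node setup)
docker exec -it node1 ysqlsh -h node1 -c "create table t_default (id bigint primary key, payload text);"

# explicitly ask for a single tablet
docker exec -it node1 ysqlsh -h node1 -c "create table t_1st (id bigint primary key, payload text) split into 1 tablets;"

# ask the database what it actually did
docker exec -it node1 ysqlsh -h node1 -c "select * from yb_table_properties('t_1st'::regclass);"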

Also note: the number of MB and inodes will slowly increase due to ongoing (logging) activity. And on dropping objects, you don't always reclaim space right away; the background processes may do their cleanup later (more things to research, RTFM and test...). This means that comparing before/after values on activities like table-creation or table-filling needs to be done in _short_ timespans, and preferably on an often-refreshed cluster. Hence the scripts.

And because of those refreshes, please be careful about drawing conclusions from comparing IPs or UUIDs across the screenshots as well.


So Far So Good..

Writing it all up is proving time-consuming, and generates questions faster than I can write... but in the next instalment I want to compare:

First: creating a single table, split into 1 tablet. Check sizes and tablet-info. This should tell me approximately where the data goes, and I expect 3 tablet-copies (1 leader + 2 followers) because of RF=3?

Then: creating a single table with 6 tablets. Check sizes and tablet-info. Expecting 6 x 3 = 18 tablet-copies, spread more or less evenly over the nodes?

Also: verify, using tablet-uuids, what happens on creating a set of tables in a Co-Located database (a sketch of that setup follows below)...
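For that last one, this is the setup I have in mind. The COLOCATION-syntax is what I picked up from the docs and still needs verifying on my version:

# create a colocated database, put two small tables in it, then check their yb-properties
docker exec -it node1 ysqlsh -h node1 -c "create database db_co with colocation = true;"
docker exec -it node1 ysqlsh -h node1 -d db_co -c "create table t_a (id bigint primary key, val text);"
docker exec -it node1 ysqlsh -h node1 -d db_co -c "create table t_b (id bigint primary key, val text);"
docker exec -it node1 ysqlsh -h node1 -d db_co -c "select * from yb_table_properties('t_a'::regclass);"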


Impression so far: Nice! 

This system is Seriously Sharded and Resilient!

I like playing around with this thing.

But it is my impression that the default behaviour is to maximise the sharding for user-created tables: a new table gets 1 tablet on every node. I notice (or rather, suspect) that the pg-catalog is effectively co-located, because the catalog-tables report only 1 tablet (to be verified).


Caveat: I'm a Noob here.

I know that as a total noob, I need to be very careful with too-quick conclusions. I'm learning how this system works as I discover more and more.

Notably from the very first observations: Don't Jump To Conclusions!


----------------- end of blog  - keeping notes below, to be removed.. ?-------------


Questions, notably for YB-insiders:

- Please verify/Confirm/Deny: Given a create-table statement with no options, the number of tablets (shards) is by default equal to the number of nodes in the cluster, e.g. the row-count of yb_servers()? Except for "colocation", where the number of shards is effectively one, and tables and indexes appear by default in the same tablet.

- Please verify/Confirm/Deny: Given RF=3 (the default) and a cluster of 3 or more nodes: even if a table is defined as colocated or "split into 1 tablets", there will be 2 additional copies of the tablet, hence a total of 3 copies, where 3 is the RF. So an outage of 1 node will not affect availability? The info at port 9000/tablets seems to confirm this by listing 1 leader and 2 followers for each tablet. Check?


- Please verify/Confirm/Deny: Every Tablet is "protected" by a raft-group, and the group will elect a new leader when the current leader is detected as broken/offline?

Related to the previous question: do tablets on one and the same node all share the same raft-group and the same leader+followers, or does a node contain multiple raft-groups with different configurations?

This is important, notably on node-failure: does it mean 1 new election, or multiple new elections?

- (keep for later: what does "Tablet-data Tombstoned" signify?)

- (keep for later: is the data for pg_catalog and information_schema always co-located? If so, in the same tablet as user-created tables, or in a separate one?)

- Please Comment: Defining tables with a relatively high number of tablets will incur inter-node network traffic. Is there a way to determine, and possibly to control or optimise, the amount of calls between nodes?


- I just "hammered" a table with 1 tablet, RF=3, in a 6-node cluster, and I see the size of yb_data grow on 3 of the nodes (the ones with tablets for that table...) It dawns on me that some nodes should be YB-Master and run the master-processes, and other nodes might just be YB-Terver-modes. This is illustrated in:

https://docs.yugabyte.com/preview/architecture/concepts/universe/

What I begin to deduce: a universe may need more YB-TServers (storage + processing) than YB-Masters (which seem to do mainly coordination).

- From this follow some questions:

Can a node be YB-Master-only, or does it always need a YB-TServer on the node itself?

What is the preferred way to add a YB-Master or a YB-TServer to a universe?

What is the preferred way to remove nodes? (Or can I just kill them and wait?)


- IP addresses or node-names... can someone elaborate on the config-files?

- Parameters: Master and tserver processes have a long list of cmd-line options. How can I query those options, and are they possibly saved in some config-file somewhere? (too lazy to RTFM...)

- Later: region + Geo-Separation... (if there is real demand...) how to preserve Resilience and Response-time while meeting requirements for multi-region deployment and/or geo-separation of data.

- Later: what if the storage subsystem of a node is full? What kind of warnings or errors would ensue? And is there possibly any growth-trend data to see a disk-full scenario arriving?

- Datamodel: the list of yb-entities (to be Queried?)... universe, node, master (role: leader, follower, ...), tserver, namespace, table (pg: relation, can be table, index, mview...), tablet (leader or follower), raft-group...


- tablet internals..

You _knew_ I would ask… I see you jumping between ysqlsh and the web-interface: I would rather query all the data, something like "select tablename, tabletuuid, node, node_isleader from yb_tablets;" (3 records with RF=3).

Also a function or hidden column showing in which tablet a row is located…

select yb_tablet, id from table ; 



