Posts

Logging Performance Information in a Distributed Database

 TL;DR: We store the yb_active_session_history (ASH) data in a table, whereby every node inserts its own data. To see data from other nodes, we query the table. We tried to minimize inter-host communication and avoid additional mechanisms like FDW or dedicated RPCs. And we keep the data (history) for possible future use. The main disadvantages are that (1) there is "insert activity", (2) there is a possible delay due to the sampling interval, and (3) this concept depends on the correct working of most of the components on a node. Background: Keep History. When using Yugabyte (link), a distributed database, and trying to keep track of the workload history, e.g. the performance data, Postgres and Yugabyte make internal data available via views; notably, YB added the view yb_active_session_history. This data can be used to find bottlenecks and hotspots in your system. You can find the (first version of the) ASH documentation here: https://docs.yugabyte.com/preview/explor
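A minimal sketch of the idea, assuming a hypothetical history table ash_history and only a few of the ASH columns (column names are taken from the preview documentation and may differ per version):

```sql
-- Hypothetical history table; every node appends its own ASH samples here.
CREATE TABLE IF NOT EXISTS ash_history (
    sample_time          timestamptz,
    wait_event_component text,
    wait_event           text,
    query_id             bigint,
    node_host            text
);

-- Run periodically on each node: copy the locally sampled ASH data,
-- tagging it with this node's address so the origin stays visible.
INSERT INTO ash_history (sample_time, wait_event_component, wait_event, query_id, node_host)
SELECT sample_time, wait_event_component, wait_event, query_id,
       host(inet_server_addr())
FROM   yb_active_session_history
WHERE  sample_time > coalesce(
         (SELECT max(sample_time) FROM ash_history
          WHERE  node_host = host(inet_server_addr())),
         'epoch'::timestamptz);
```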

Monitoring a Distributed Database (notes)

[not published yet - need more work on "locally available" and "need to combine"] TL;DR: Keeping an eye on a database that runs on multiple servers (nodes, containers...), possibly in different locations, is challenging. An SQL database with logging data will help. I'll describe two possible mechanisms with some pros and cons. Background: The Distributed database, multiple nodes. Distributed, hence it runs on multiple servers (or containers, or whatever entity can provide the compute + memory + storage you need). The processing of calls and queries is also distributed. And any given SQL statement, even if it concerns just 1 record, is likely to touch several nodes: Query, Catalog, Storage, Consensus, Master, Replicas... And a query that uses joins and filters, and thus touches multiple tables and many records, is bound to need calls, RPCs, to several nodes. This is illustrated by the fact that the "explain-plans" of both Yugabyte and Cockroach are already more
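A sketch of the "need to combine" step, assuming a hypothetical per-node logging table ash_history(node_host, sample_time, wait_event, ...) that every node writes into:

```sql
-- Which wait events dominate, per node, over the last 10 minutes?
SELECT node_host,
       wait_event,
       count(*) AS samples
FROM   ash_history
WHERE  sample_time > now() - interval '10 minutes'
GROUP  BY node_host, wait_event
ORDER  BY samples DESC
LIMIT  20;
```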

Distributed - but when do we over-react

[DRAFT!] TL;DR: Distributing a Database helps with Sharding and Resilience. But do I really want my data spread out over 10s or 100s of "nodes"? Background. Been investigating Distributed Databases. Finding: Scale-out means Fragmentation of Data. Adding nodes to a cluster can help to improve both Resilience, by having at least 2 spare copies and a voting mechanism, and Scalability, by adding more workers, more CPU, more memory. But the downside that I see (or imagine) is that my data gets too fragmented, too spread out. Solution: Shared Disks? For Resilience, I would prefer to keep 3, 5, or maybe even 7 copies of my data. But not more. Too many copies mean too much overhead on writing and voting. For Sharding, I want the data spread, but not fragmented over separate storage components any more than necessary. For some "scale-out" situations, I may want more CPU and more Memory, but maybe not a further fragmentation of my data on (non-shared) disks. U
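In CockroachDB, the number of copies is a zone-configuration setting, so keeping 5 replicas instead of the default 3 is a single statement. A sketch (exact SHOW syntax differs between versions):

```sql
-- Inspect the current defaults; num_replicas is 3 out of the box.
-- (Older versions use FOR instead of FROM.)
SHOW ZONE CONFIGURATION FROM RANGE default;

-- Ask for 5 copies of everything; odd numbers keep the Raft voting clean,
-- and 5 replicas tolerate the loss of 2 nodes.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
```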

Testing CockroachDB, more on Resilience ...

 TL;DR: I'm simulating the failure of more than 1 node and find some weird behaviour. I'm learning how to operate a cluster. Repair is not always completely automatic? Spoiler: The cluster can remain Resilient, but you need to keep monitoring your cluster closely. There are some things I cannot quite explain, yet. Do I need to RTFM more? => Edit: Yes. See answer to question below. Background: Distributed Databases are Resilient. I'm experimenting with a distributed database, where every data item is replicated 3x (replication factor of 3) over a 3-node cluster. Data remains "available" as long as every object has a quorum, in this case 2 votes out of 3. From previous experiments (link), I already know that an outage of 1 node is not a problem, but an outage of 2 nodes out of 3 is fatal. In a previous experiment I showed how I could "fix" the loss of a single node by adding another, fourth, node to the cluster to bring the cluster back to 3 nodes a
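One way to watch the cluster's health from SQL after killing nodes is via the replication reports. A sketch, assuming CockroachDB's system.replication_stats report table (available since v20.1; column names may differ per version):

```sql
-- Unavailable ranges have lost their quorum (e.g. 2 of 3 replicas down);
-- under-replicated ranges still have a quorum but are vulnerable.
SELECT zone_id,
       sum(unavailable_ranges)      AS unavailable,
       sum(under_replicated_ranges) AS under_replicated
FROM   system.replication_stats
GROUP  BY zone_id;
```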

Testing Resilience of CockroachDB - Replacing 1 node.

 TL;DR: I'm using a multi-node cluster to see what happens if I remove and add nodes. Spoiler: it is Resilient, but within limits. The Replication-Factor determines how many nodes you can lose without impacting availability. And there is a more or less long period of vulnerability after losing a node: some ranges will run without their "3rd replica" for a while, which makes the system vulnerable until all the under-replicated ranges are fixed. Background: A distributed database uses Replication of data to protect against failure. In the case of cockroach, the default replication factor is 3, and it uses the raft protocol to determine who is the "leaseholder" and whether a large enough (majority) set of replicas is available to continue: the Quorum. Cockroach also features a more or less "automatic repair" to cover situations where a node disappears. Testing: 3 nodes, an ha-proxy, and a running insert. Nodes are called roach1 ... roach3, and additional nodes
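The vulnerable period can also be watched by looking at where each range keeps its replicas. A sketch, assuming a test table called kv (the exact output columns of SHOW RANGES vary between CockroachDB versions):

```sql
-- Which nodes hold the replicas of each range, and who is the leaseholder?
-- Ranges listing only 2 replicas are the under-replicated ones.
SELECT range_id, lease_holder, replicas
FROM   [SHOW RANGES FROM TABLE kv]
ORDER  BY range_id;
```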

Checking Cockroach: Distributed, Replicated, Sharded.

 TL;DR: Some simple testing of data in a table to see how it is "replicated" and "sharded". Spoiler: CockroachDB replicates from the start, but it only shards (splits) when necessary. Hence on default settings you need to add about 512 MB to a table before any sharding (splitting) occurs. It seems to work Fine! Background: Distributed Databases, Replication and Sharding. The two different aspects covered by a Distributed Database are 1) Replication and 2) Distribution/Sharding. Replication: Offering resilience against outage (storage, server, or network failure). Replication means there are multiple copies of the data in different storage areas, or even in different physical locations. Sharding: Offering scalability to more users and/or parallel processing of large sets. By having the data in more manageable chunks (shards, partitions), it can be handled by multiple processes in parallel. Distribution or Sharding (can) offer both higher multi-user capacity
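The 512 MB figure comes from the default maximum range size, and both the threshold and the actual splits can be inspected from SQL. A sketch, assuming a test table called t (SHOW syntax differs slightly per version):

```sql
-- The default zone config caps a range at 512 MiB (536870912 bytes)
-- before CockroachDB splits it; older versions use FOR instead of FROM.
SHOW ZONE CONFIGURATION FROM TABLE t;

-- Count how many ranges (shards) the table currently occupies.
SELECT count(*) AS ranges FROM [SHOW RANGES FROM TABLE t];
```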

Exploring Cockroach Distributed Database - 7 nodes, 9 nodes even, in containers.

TL;DR: I managed to create a 7-node, and even a 9-node, playground with a CockroachDB cluster running on my old MacBook. There is a Lot to Explore but it looks Very Interesting. Background: "Distributed Databases". For a good while now we have been looking at "serverless" databases: either some DBaaS running at a (cloud) provider, or a locally created cluster, which mostly runs in k8s or docker. Note: Nobody seems to realise that for Truly Performant database access, the fewer layers and the less chattiness, The Better. But then again, to obtain Resilience and Scale-Out you need multi-node deployment and some form of replication, redundancy, or Sharding. To be able to properly Guide Customers in Database Deployments, I want to know how these systems "Tick". I want to know what the strong and weak points of "distributed" are. For this reason, I'm experimenting with some of the systems out there. Setup, using containers in Docker. Note: My primitiv
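Once the containers are up, a quick way to confirm that all 7 (or 9) nodes actually joined is a query against the cluster's introspection tables. A sketch, assuming the crdb_internal.gossip_nodes virtual table (its columns are internal and can change between releases):

```sql
-- One row per node the cluster knows about; is_live shows who is currently up.
SELECT node_id, address, is_live
FROM   crdb_internal.gossip_nodes
ORDER  BY node_id;
```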