Distributed - but when do we over-react?

[DRAFT!] 

TL;DR: Distributing a Database helps with Sharding and Resilience. But do I really want my data spread out over 10s or 100s of "nodes"?


Background.

Been investigating Distributed Databases, notably CockroachDB (CRDB) and YugabyteDB (YB).


Finding: Scale out means Fragmentation of Data.

Adding nodes to a cluster can help to improve both:

Resilience, by having at least 2 spare copies and a voting mechanism, and

Scalability, by adding more workers, more CPU, more memory.
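
Adding a worker is indeed cheap. A minimal sketch, assuming an insecure CockroachDB cluster whose existing nodes listen on host1 and host2 (placeholder names):

    # Start one more node and have it join the existing cluster;
    # CPU and memory capacity grow immediately.
    cockroach start --insecure \
      --join=host1:26257,host2:26257 \
      --store=/mnt/disk3 \
      --listen-addr=host3:26257

Note that the new node also brings a new --store: yet another (non-shared) piece of storage for data to be re-balanced onto.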

But the downside that I see (or imagine) is that my data gets too fragmented, spread out over too many places.


Solution: Shared Disks?

For Resilience, I would prefer to keep 3, 5, or maybe even 7 copies of my data. But not more.  Too many copies mean too much overhead on writing and voting.
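
The number of copies is typically a configurable setting. A sketch for CockroachDB, assuming a locally running insecure cluster:

    # Keep 5 replicas of every range instead of the default 3.
    cockroach sql --insecure --host=localhost:26257 \
      --execute="ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;"

In YugabyteDB, the equivalent knob is the replication factor, e.g. the --replication_factor flag on the yb-master processes.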

For Sharding, I want the data spread out, but not fragmented over more separate storage components than necessary.
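
CockroachDB can show how a table is already split up. A sketch, using a hypothetical table called users:

    # List the ranges (shards) of table 'users' and where their
    # replicas live.
    cockroach sql --insecure --host=localhost:26257 \
      --execute="SHOW RANGES FROM TABLE users;"

Every extra range on an extra node is one more storage-partition to keep an eye on.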

For some "scale out" situations, I may want more CPU and more Memory, but maybe not a further fragmentation of my data on (non-shared) disks.

Under a sudden increase of traffic or load, I may want to scale up (quickly!) to 37 nodes, to add CPU and memory to my system. But I don't want to have to look after 37 storage-partitions, and I don't want to invest effort in re-distributing shards over 37 disks, especially not under load.

Alternatively, when scaling back down from 37 nodes to, say, 9 (3 regions with 3 zones each), I don't want the system to have to re-consolidate my data from 37 back to 9 storage-partitions.
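
What I would like scaling to look like is sketched below, assuming a Kubernetes deployment where the database nodes run as a StatefulSet (the name db-tserver is hypothetical):

    # Scale out to 37 nodes under load...
    kubectl scale statefulset db-tserver --replicas=37

    # ...and back down to 9 when the load subsides.
    kubectl scale statefulset db-tserver --replicas=9

With non-shared storage, each of the 37 pods drags its own persistent volume along, and data has to be re-distributed onto it and off it again; the kubectl commands are the easy part.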

Ideally, Scaling up is a matter of CPU and Memory, not of Storage; at least it should not involve moving data around to occupy new storage-partitions. And scaling down is actually quite difficult once you have spread data out over multiple non-shared storage-partitions.

Note: On scale-down, I can also imagine re-mounting the 37 storage-partitions on just 9 nodes, but that would be complicated in the databases I have observed so far.


In Short:

Distributed Databases use multiple nodes (instances, shards) to achieve both Resilience and Sharding. But I hesitate to spread my data out over too many pieces of storage.


Additional thoughts

Scale-down in both CRDB and YB seems to involve manually moving data back to the remaining nodes.
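
For reference, the commands involved, as far as I can tell (node IDs and addresses are placeholders):

    # CockroachDB: decommission node 4; its replicas are copied to the
    # remaining nodes before the node can be shut down.
    cockroach node decommission 4 --insecure --host=localhost:26257

    # YugabyteDB: blacklist a tablet server, then poll until the data
    # has been moved off it.
    yb-admin -master_addresses m1:7100,m2:7100,m3:7100 \
      change_blacklist ADD node37:9100
    yb-admin -master_addresses m1:7100,m2:7100,m3:7100 \
      get_load_move_completion

In both cases the data is physically copied to the surviving nodes before the departing node can go away: exactly the re-consolidation effort described above.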

