This post is aimed at people building a new application from scratch, without an existing userbase. I’m a junior developer, so please take everything I say with a heaping tablespoon of salt.
Intro
I started coding almost exactly three years ago. I’d been dabbling a bit with HTML, CSS and some extremely minimal JavaScript in the preceding months, but I didn’t seriously dive into learning programming and web development until October of 2021, when I enrolled in a coding bootcamp.
But you know what I was doing in December of 2020, while cursing CSS and trying desperately to center a div in my spare time? Reading about database sharding and horizontal scaling.
That’s right, folks. Before I even knew what a function was, I was worried about building a website for “web scale”. Because I, like so many aspiring developers and companies that came before me, was going to build an application that every single person on earth would eventually use.
There are dozens of us. Dozens!
It’s safe to say that I’m a bit of an overthinker. And a perfectionist. I don’t think that’s uncommon for software developers — a lot of us enter this field because we love thinking logically, designing systems, and coming up with efficient ways to solve difficult problems. So planning ahead, considering edge cases, and striving for the best possible system come pretty naturally. On the surface, that seems positive! But in the long run, we might actually have to un-train ourselves from this kind of behavior, lest we lose ourselves in a horrific labyrinth of our own “optimized” design.
My Journey into NoSQL
In spite of seeing all the “just use postgres” memes floating around the internet, I forged boldly ahead on my journey into the immense variety of NoSQL databases.
And I’m really glad that I did. I learned so much about databases, data modeling, and query optimization, as well as overall data storage and system design. I wouldn’t have learned about CDNs, object storage, caching, indexing, sharding, load-balancing, clustering, replication, separation of concerns, or a long list of other topics if I hadn’t been fiendishly pursuing the ultimate paradigm of Scalability.
The two NoSQL databases I’m most familiar with are Neo4j and MongoDB. I’ve also dabbled a bit with Redis and Cassandra, and read about a variety of other databases, including DGraph and CockroachDB, but I haven’t yet used them in a way that requires real familiarity.
I love both Neo4j and MongoDB. They’re incredibly cool, flexible tools, and Cypher in particular is a great query language — it’s so expressive that it makes going back to SQL a bit of a drag.
And it’s with these two databases that I came up with my Ultimate Infinitely Scalable design, in which my application data could theoretically grow to infinite proportions, given the requisite hardware, userbase, and patient developers willing to commit to a collective hallucination of Scale. In all honesty, it’s not really that complicated, and it might actually have a role to play in applications I build down the road. I’m glad I have it; it’s comforting to know that I can scale to the horizon and beyond.
But it was only once I had the design and began trying to implement it that I started questioning whether I actually needed it.
A Change of Heart
There were a few things that started to shift the gears in my head and make me think, “Maybe there’s an easier way…”
1. Neo4j, like most NoSQL databases, is schemaless.
That’s right: you can just dump data straight into the database. Aside from a handful of opt-in constraints, it’s entirely on the developer to define and enforce a schema layer on top of the database to ensure data integrity. That means one little uncaught typo in a query could quietly populate the database with mislabeled data, which could then start causing application crashes.
Good software design greatly minimizes the risk of something like this actually happening in production, but the possibility alone is cause for night sweats. And it’s a failure mode that simply doesn’t exist in a SQL database like Postgres or MySQL: tables have names, and columns have names and datatypes, so if your queries don’t get the names and types right, the data doesn’t get into the database. Simple.
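Here’s a minimal sketch of the difference using the official Python drivers. The connection details, the User label, and the users table are all made up for illustration:

```python
from neo4j import GraphDatabase
import psycopg2

# Neo4j: a typo'd label is not an error. There's no schema to violate,
# so ":Usr" silently becomes a brand-new label, and this node will never
# show up in queries that match on ":User".
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    session.run("CREATE (:Usr {name: $name})", name="Ada")  # succeeds quietly

# Postgres: the same kind of typo is a hard failure. The table "usrs"
# doesn't exist, so the bad write is rejected before any data is touched.
conn = psycopg2.connect("dbname=app user=app")
with conn.cursor() as cur:
    cur.execute("INSERT INTO usrs (name) VALUES (%s)", ("Ada",))
    # raises psycopg2.errors.UndefinedTable: relation "usrs" does not exist
```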
2. Data Duplication
When I was first starting out, I tried to design my entire application using MongoDB, relational queries and all. No, I wasn’t trying to do join operations, which are antithetical to MongoDB’s design — I was following the recommended data-duplication pattern. Which ended up being an absolute nightmare.
Brief overview: because a document database like Mongo has no real relationships (no foreign keys, no joins in the relational sense), you end up embedding copies of data from related documents, then running queries on those embedded copies. This is fine until you need to update the original document: every other document embedding that data then needs to be updated too. Which is awful, especially if a large number of documents carry the embedded data.
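A toy sketch of the pattern with pymongo, assuming hypothetical users and posts collections:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.app

# The duplication pattern: each post embeds a copy of its author's name,
# so reads never need anything resembling a join.
author_id = db.users.insert_one({"name": "Ada Lovelace"}).inserted_id
db.posts.insert_one({
    "title": "Just Use Postgres",
    "author": {"_id": author_id, "name": "Ada Lovelace"},
})

# The nightmare part: when the author's name changes, every embedded copy
# is now stale until you hunt it down and update it too.
db.users.update_one({"_id": author_id}, {"$set": {"name": "Ada King"}})
db.posts.update_many(
    {"author._id": author_id},
    {"$set": {"author.name": "Ada King"}},
)
```

Reads are wonderful here (one query, no joins); it’s the writes where the horror lives, because every collection that embeds the data is one more thing to keep in sync.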
I can’t remember exactly what kind of data duplication problems I was running into at the time, but it was enough for me to feel a sense of rising panic and existential horror. This instilled in me a key design principle — if data relationships are a first order priority, there must be some sort of core query database that can express relationships between entities and serve as a single source of truth. Mongo can still serve as a great, highly scalable document data store, but it’s not sufficient for densely interconnected data.
Which is how I came to use Neo4j — it has an incredibly powerful relational model (it’s a graph), and it has a really cool, flexible sharding model that allows for custom clustering and shard placement, sidestepping the issue of monotonically increasing shard keys and hot shards. (Doesn’t that sound easy and fun?!)
The thing is, sharding is always hard. And sharding in Neo4j also requires data duplication, which also makes updating data a bit of a nightmare. If I really needed to, I’d be willing to make this compromise, and I don’t think it would be that hard to manage, but once established, the system would be more brittle, less extensible, and harder to maintain.
3. Heeding Wisdom
I’m still a new developer. Full confession: I’ve yet to land my first full-time programming job — I’ve taught software development for a few years and have worked extensively on my own projects, but I haven’t yet gotten the full developer experience of shipping my code to production.
So, I read a lot and take in a lot of developer opinions. And some of them make really good points I can’t ignore. One of which is:
Preemptive optimizing is where good projects go to die.
I still go back and forth on this one, but I think the core point is really strong: don’t let the perfect be the enemy of the good; get something working, not something perfect.
Sure, one day down the road, my super fine-tuned, hyper-scalable design might come into play. But, right now, it’s just a hindrance preventing me from actually getting something off the ground and running.
4. Crunching the Numbers
The big idea behind horizontally sharding data is to let a single logical database hold more data than any one machine can. This is different from horizontal scaling via replication, which accelerates reads and makes a database more resilient, and which, to my understanding, is almost always a good idea. In replication, the full dataset still lives on a single machine; there are just several machines, each holding a complete copy, that can run queries and return data. In sharding, the dataset itself is spread across multiple machines, each holding only a slice.
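To make the distinction concrete, here’s the core idea of hash-based shard routing in a toy sketch (a real system would layer on consistent hashing, rebalancing, and per-shard replication; the names here are made up):

```python
import hashlib

NUM_SHARDS = 4  # four machines, each holding only a slice of the data

def shard_for(user_id: str) -> int:
    # Use a stable hash (not Python's per-process salted hash()) so the
    # same key always routes to the same shard, across restarts and hosts.
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Each record lives on exactly one shard; no machine holds everything.
# Replication would instead put a complete copy on every machine.
print(shard_for("user-42"))  # some value in 0..3
```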
But how much data do you actually need to be storing in order to warrant sharding?
Well, let’s crunch the numbers.
Let’s say you want an easy developer experience, so you decide to use a managed database offered by DigitalOcean.
At the highest end of the spectrum, you could get a single machine that can store 15 tebibytes of data on disk. That’s a ton! But let’s start smaller, and consider the smallest size offered for a managed Postgres database: 10 GiB (gibibytes) of disk storage, which costs $15/month for a single instance (no replication).
1 GiB is 1024 MiB; 1 MiB is 1024 KiB; 1 KiB is 1024 bytes. So 1 GiB is a bit over a billion bytes (1,073,741,824, to be exact), which means the smallest offered instance can store roughly 10 billion bytes worth of data.
Now let’s assume that, on average, records in your database have about 10 columns, and each column stores about 50 bytes worth of data (which is about equivalent to a string containing 50 characters). That means each record is around 500 bytes.
10 billion / 500 = 20 million.
That’s right: a database that costs $15 a month can store up to 20 million records.
But wait, for an extra $4/month, you could get 30 GiB of disk storage — now you can store 60 million records!
And that’s just the low end of the spectrum. If you did end up going for the ultra-chunky 15 TiB instance (around 16 trillion bytes, coming in at a whopping $5,000/month), you could store around 32 billion records.
Just for fun, here are a few other disk storage volumes and the approximate number of our theoretical records they could hold:
100 GiB: 200 million records.
500 GiB: 1 billion records.
1 TiB: 2 billion records.
I think you get the picture.
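If you want to sanity-check these back-of-the-envelope numbers yourself, the whole calculation fits in a few lines:

```python
GIB = 1024 ** 3        # bytes in one gibibyte
RECORD_SIZE = 10 * 50  # 10 columns at ~50 bytes each: ~500 bytes per record

for label, gib in [("10 GiB", 10), ("30 GiB", 30), ("100 GiB", 100),
                   ("500 GiB", 500), ("1 TiB", 1024), ("15 TiB", 15 * 1024)]:
    records = gib * GIB // RECORD_SIZE
    print(f"{label:>7}: ~{records:,} records")
```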
This is an imperfect example: it doesn’t account for things like index sizes, it only considers data stored on disk (not available RAM), it uses an arbitrary record size, and it ignores the fact that performance will probably degrade as the dataset grows (running a 15 TiB database might be a bit slow).
But if your Postgres instance is starting to slow down from sheer data volume, you’re hopefully running, or working for, a successful organization with the knowledge and resources to migrate to something more scalable.
Yes, shifting from one system to another is difficult and probably has a lot of pain points. But it’s a tragic waste of time and energy to design for a problem that might never come to pass, especially when such preemptive optimization might be a hindrance to success in the first place.
So, I don’t know, maybe consider… just using Postgres?
Conclusion
So, what, should we just never worry about scalability? Is NoSQL pointless? Should I not try and learn how to design scalable systems?
Not at all! As I mentioned before, learning about and designing for scalability has been instrumental to my growth as a developer. There is no one-size-fits-all design; different tools have different purposes. You very well might come across a situation where you need to shard your database, and in that situation, you’ll want to know how to do it well. Just don’t get bogged down in future potentialities that may never come to pass. For the time being, those are just illusions floating on your psychic wind.
Instead, focus on building something that is robust, extensible, maintainable, and easy for other people to contribute to. But, above all, build something that actually works.