A few weeks ago, we shipped cascading replication for PostgreSQL, MySQL and Redis on Cloud 66. Customers can now build replication chains: a primary streaming to a middle replica, which in turn streams to leaves. It reduces load on the primary, supports geographic distribution, and stops you from melting your network when you have a large fan-out of replicas all pulling WAL from the same machine.
PostgreSQL has supported cascading replication natively since version 9.2, which shipped over a decade ago. The interesting question is not why we built it. It is why it took us this long to start.
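If you have never set it up, the native mechanism is almost anticlimactic: a cascading standby is just a standby whose upstream happens to be another standby. On a modern Postgres it looks roughly like this, with hypothetical hostnames; the leaf clones from and streams from the middle of the chain, and the root primary never hears from it at all:

    # Build the leaf from the mid-tier replica instead of the primary.
    # -R writes standby.signal and a matching primary_conninfo for us.
    pg_basebackup -h mid-replica.internal -U replicator \
        -D /var/lib/postgresql/data -R

    # postgresql.auto.conf on the leaf now points at the middle of the
    # chain, not the root:
    #   primary_conninfo = 'host=mid-replica.internal user=replicator ...'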
I want to talk about why, because the answer is not the one engineers like to give. It is also, I think, the most interesting thing that has changed about building software in the last eighteen months.
The ticket
The primary reason we shipped this feature was a support ticket from a long-time customer. The shape of the request was simple: a single Postgres primary fanning out to a large number of replicas, and a network that buckled under the cost of streaming the same data that many times in parallel. The customer had already set up the cascading topology by hand on their staging environment, watched it work, and sent us the runbook. They asked us to ship the same thing in production.
A long time later, they sent a follow-up. The new version of the message was less polite. The gist was: it has been a long time, our fleet has grown, and a failover today would put us down for days because every replica would have to resync from scratch from the same source.
They were right to be annoyed. The feature they were asking for is one the database has supported for over a decade. The runbook fits on a single page. They had already proved it worked.
So why did it take so long?
What "hard" actually means
When you ask an engineer why a ticket sat in the backlog, you tend to hear one of two answers. Either "it is genuinely a hard problem and we have not had the time," or "we have been busy with higher-priority work."
Both are true sometimes. Neither was true here.
The truth, which engineers find harder to say out loud, is that the cascading replication ticket was not blocked by a hard problem. It was blocked by a tedious problem.
Cloud 66 has been managing Postgres replication for customers for a long time. Long enough that the orchestration code has lived through multiple major Postgres versions and more than one configuration format. That history is exactly the problem.
Our replication orchestrator was designed when "replication" meant one thing: a primary, with a flat set of replicas hanging off it. Every assumption in the system, and there are a lot of them in a system that has been growing for a decade, encoded that worldview. A server's "primary" was always the root primary, never the immediate upstream. A server was a master or a slave or standalone, mutually exclusive. The health check that watches your replicas was watching the root primary, not the server they were actually streaming from.
None of those assumptions are wrong in a flat world. All of them are wrong in a cascading world, where a server is a slave of one thing and a master of another at the same time. The existing data structure literally could not represent a server that belonged to two replication groups at once. The data model had cascading designed out of it.
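To make that concrete, here is a deliberately simplified sketch with hypothetical names, not our actual schema. The flat model stores replication as fields on the server; the cascading model stores it as edges, so the same server can sit on both sides of a link:

    # Flat world: one exclusive role, one "primary" that always means the
    # root. There is no way to say that db2 is both.
    FlatServer = Struct.new(:name, :role, :primary) # role: :master | :slave | :standalone

    # Cascading world: replication is a set of edges between servers.
    Server = Struct.new(:name)
    Link   = Struct.new(:upstream, :downstream)

    db1 = Server.new("db1") # root primary
    db2 = Server.new("db2") # slave of db1, master of db3
    db3 = Server.new("db3") # leaf

    links = [Link.new(db1, db2), Link.new(db2, db3)]

    # "Who do I stream from?" becomes a lookup over edges, not a field
    # that silently means "the root".
    def upstream_of(server, links)
      links.find { |l| l.downstream == server }&.upstream
    end

    upstream_of(db3, links).name # => "db2", not "db1"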
So the work was not "add cascading replication." The work was: untangle a decade of assumptions, replace the data model underneath them without breaking anything in production, and then add cascading.
This is what I am going to call the boring 80%.
The shape of the boring 80%
The boring 80% is the part of feature work that is not technically hard, exactly, but is exhausting in a way that breaks engineering momentum. It looks like this:
Mapping the blast radius. Before you change how a piece of data is read, you have to find every caller, and you have to know what each of them assumes. Not what the documentation says, what they actually assume. For our replication change, that meant 47 files reading topology through the helper layer and 16 places in the code that bake in the flat-world assumption in subtler ways.
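The subtle ones are what make this slow. A hypothetical example of the kind grep cannot catch, where the field is named honestly, the data in it is correct, and the call site quietly uses it as something else (take_base_backup and the field names are made up for the illustration):

    # replica.primary really does hold the root primary. The flat-world
    # assumption lives at the call site: in a chain, a resync should pull
    # from the immediate upstream, not from the root.
    def resync(replica)
      source = replica.primary                    # stored as: root primary
      take_base_backup(from: source, to: replica) # used as: immediate upstream
    end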
Dual-write migrations. You cannot rip out the old data structure on a Wednesday. You have to write the new structure, mirror writes to both for some period, backfill the old data into the new structure, switch reads, and only then stop writing the old one. Each of those steps is a separate deploy, with a separate risk profile, and a window where the system is running on two truths at once.
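A minimal sketch of that sequence, again with hypothetical names. Each phase change is its own deploy, and the backfill plus verification happens between dual_write and new_reads:

    class TopologyStore
      # :old_only -> :dual_write -> (backfill, verify) -> :new_reads -> :new_only
      def initialize(phase)
        @phase = phase
        @old = {} # server => root primary        (flat model)
        @new = {} # server => immediate upstream  (graph model)
      end

      def record(server, upstream:, root:)
        @old[server] = root     unless @phase == :new_only
        @new[server] = upstream unless @phase == :old_only
      end

      def upstream_for(server)
        # Reads stay on the old structure until the backfill is verified,
        # which is exactly the window where the system runs on two truths.
        %i[old_only dual_write].include?(@phase) ? @old[server] : @new[server]
      end
    end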
Untangling the misleading neighbours. Buried in any old codebase are fields whose names sound exactly like the thing you are migrating but which are completely unrelated. They were added years ago for a different subsystem and they share a prefix or a concept with your target. If you do not notice, you migrate them. If you do notice, you spend half a day proving they are not yours.
Independent review across multiple angles. Once the change is written, someone has to look at it from the angle of race conditions, and from the angle of schema migrations, and from the angle of error handling, and from the angle of dead code, and from the angle of test coverage. The same reviewer cannot do all of those passes well in one sitting. Single-pass review misses exactly the things that surface in production six months later.
None of this requires you to be clever. It requires you to be patient, to be careful, and to do work that has no narrative arc and no resume value. You will not write a blog post about it. You will not get promoted for it. If you do it well, nobody notices. If you do it badly, you break production at 3am.
This is the work that gets you stuck. Not the algorithm. Not the schema. The grind around them.
Why backlogs accumulate this kind of debt
Engineering teams systematically under-fund the boring 80%. There are a few reasons.
The first is estimation. When an engineer scopes "add cascading replication," they scope the algorithm. They do not scope the eight days of mapping callers, the three iterations of dual-write design, or the two days arguing about whether the new field should be nullable. Those things show up in the actual work and not in the estimate, so the work always overruns, so the work always feels like a failure even when it ships.
The second is incentives. Greenfield work has a story. "We built a new ingestion pipeline" is a story. "I spent four weeks proving that 47 callers do not depend on a behaviour they appear to depend on" is not a story. It is the absence of a story. Engineers who are good at this work tend to be invisible. Engineers who avoid it tend to look more productive.
The third is risk. The boring 80% touches load-bearing parts of the system. If you do it well, nothing changes from the customer's perspective. If you do it badly, an unrelated subsystem breaks in production six months later and someone has to spend a weekend tracing it back to your refactor. This asymmetry rewards inaction. The brave thing is to touch the foundation. The safe thing is to add another layer on top.
These dynamics are why every long-running codebase has a backlog of two-year-old tickets that are technically reasonable, well-scoped, and still completely stuck.
What changed
I am going to be careful here, because there is a lot of nonsense in the air about what AI assistants can and cannot do.
The honest version of the story, from a year of building features inside a very large evolved Rails codebase, is this: AI assistants are not as good at design as a senior engineer. They are about as good at writing code as an extremely enthusiastic junior engineer, with no taste. They are significantly better than any human I have worked with at the boring 80%.
When I started the cascading replication work, the first thing I did was hand the codebase to my AI and ask: if I change how "who is my primary?" is answered, what breaks? It came back with the 47 files and, more usefully, with the 16 places that bake in flat-world assumptions in ways grep would never find. A code path that stored "primary" but used it as "immediate upstream" is the kind of thing a string search misses and a model with the file open notices.
I could have done this myself. I have done this myself, on past projects. It would have taken a week. It took an afternoon.
During the multi-agent AI review, where independent reviewers each look at a change from a different angle without coordinating, one of them found a mutation path I had completely missed in the migration layer. The bug would have let production drift away from the new model, only for some customers, only on one specific operation. That is the kind of bug you find in a postmortem instead of a PR.
My AI was not clever about any of it. Not during the design, not during the build, not during the review. What it was, the whole way through, was thorough, in a way that a tired engineer at 4pm on a Friday is not.
This is what changed. The boring 80% used to be expensive in a way that scaled with the size of your codebase, and that expense came out of senior engineers' weeks. It now comes out of a model's afternoon. The work still has to be done. Someone still has to read the reports, make the calls, write the migration, run the validations. But the part of it that scales linearly with codebase size is no longer the bottleneck.
What this means for your backlog
If you run an engineering team with a long-lived codebase, you almost certainly have a list of tickets that sit somewhere in the same shape as the cascading replication ticket. They are not blocked by a hard problem. They are blocked by the cost of touching a foundation that has accumulated a decade of assumptions. Every time you scope one, you flinch.
I would re-audit that list.
A surprising number of those tickets are now a few weeks of work, and the few weeks is mostly your time spent steering and reviewing, not your time spent on the grind. Not because the underlying problems got easier. Because the boring part of the work stopped being yours alone.
The teams that figure this out first are going to ship features that their competitors have been promising for years. Not because they have better engineers, or smarter ideas, or more funding. Because they noticed, before everyone else, that the constraint moved.
Closing
To the customer who waited a long time on cascading replication: I am sorry. You were right, and we should have been faster, and the reason we were not is the reason I just spent two thousand words trying to explain. It is in production now. Build your tree. Stop melting your network.
To anyone else sitting on a ticket like that one, on either side of the support inbox: a lot of the things on your "too hard, too risky, too entangled" pile are no longer what they used to be. Look again.
The boring 80% is no longer the thing that kills your backlog. The thing that kills it now is not noticing.
