Every engineer has a war story about a migration that went sideways. Data loss. Unexpected downtime. A "quick cutover" that turned into a 48-hour incident. These stories are common because migrations are genuinely hard — and because most teams underestimate what "hard" means in this context.
I led the incremental migration of legacy Scala/Lift-based services and aging React codebases to a modern stack. The constraint was non-negotiable: zero merchant-facing disruptions. Not "minimal downtime." Zero. These were services that real businesses depended on daily, and any interruption had direct revenue implications for our merchants.
We pulled it off. No production incidents during the entire migration. And the patterns we established became the playbook other teams reused for subsequent modernization efforts across the org. Here's what I learned.
Why "just rewrite it" is almost always wrong
The instinct when you inherit a legacy system is to want a clean break. Tear it down, rebuild it right, ship the new thing. It's emotionally satisfying. It's also one of the most dangerous decisions you can make in production software.
Rewrites fail for a few predictable reasons. You underestimate the accumulated edge cases embedded in the old system. You lose institutional knowledge that was encoded in code rather than documentation. And you create a long period where you're maintaining two systems in parallel, which is expensive and error-prone.
Our Lift services had years of business logic baked in. Some of it was well-documented. A lot of it wasn't. The forms, the workflows, the error handling — all of it had been shaped by real merchant feedback over time. A clean rewrite would have meant re-discovering all of that the hard way, probably through production bugs.
So instead of rewriting, we migrated incrementally. One surface at a time. One service at a time. Always maintaining backward compatibility with whatever hadn't been migrated yet.
The strangler fig pattern, in practice
The strategy we used is sometimes called the strangler fig pattern — you grow the new system around the old one, gradually routing traffic and functionality to the modern stack, until the legacy system has nothing left to do and can be safely decommissioned.
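The core of the pattern is a façade that owns routing. A minimal sketch, with illustrative surface names and stand-in handlers rather than our actual services:

```typescript
// Strangler fig façade: each surface is routed to either the legacy
// or the modern implementation based on its migration state.
type MigrationState = "legacy" | "migrated";

// Illustrative migration table; in practice this lives in config or a
// flag service, not hard-coded.
const surfaceState: Record<string, MigrationState> = {
  "invoice-list": "migrated",
  "checkout-form": "legacy",
};

function handleLegacy(surface: string): string {
  return `legacy:${surface}`; // stand-in for a call into the Lift service
}

function handleModern(surface: string): string {
  return `modern:${surface}`; // stand-in for a call into the new stack
}

// The façade is the only entry point; callers never know which
// implementation served them, so surfaces can flip one at a time.
function route(surface: string): string {
  // Unknown surfaces default to legacy: safe until proven migrated.
  const state = surfaceState[surface] ?? "legacy";
  return state === "migrated" ? handleModern(surface) : handleLegacy(surface);
}
```

Because callers only ever see the façade, decommissioning the legacy system at the end is just deleting entries from the table.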
In theory, it's simple. In practice, it requires a level of discipline that's easy to underestimate.
The first challenge is defining migration boundaries, because legacy systems rarely have clean module seams. Lift's architecture, in particular, tightly couples view rendering with server-side logic in ways that make it hard to extract individual pieces. We spent real time upfront mapping the dependency graph and identifying which surfaces could be migrated independently without breaking upstream or downstream consumers.
The second challenge is maintaining backward compatibility during the transition. For every surface we migrated, the old and new versions had to coexist. That meant shared data contracts, careful API versioning, and a routing layer that could direct traffic to the right implementation based on migration state. It's unglamorous work, but it's the work that keeps the lights on.
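One way to keep old and new implementations on a shared contract is to version the wire format explicitly and normalize at the boundary. A sketch with hypothetical field names, not our real schema:

```typescript
// Shared data contract: legacy and modern services exchange the same
// record, but the format evolved during the migration.
type MerchantRecordV1 = { version: 1; name: string; active: "Y" | "N" };
type MerchantRecordV2 = { version: 2; name: string; active: boolean };
type MerchantRecord = MerchantRecordV1 | MerchantRecordV2;

// Normalize to the latest version at the boundary, so neither side
// needs to understand every historical format it might receive.
function normalize(record: MerchantRecord): MerchantRecordV2 {
  switch (record.version) {
    case 1:
      return { version: 2, name: record.name, active: record.active === "Y" };
    case 2:
      return record;
  }
}
```

Centralizing the upgrade in one function means adding a V3 later touches exactly one place.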
The third challenge — and the one most people skip — is having a rollback plan for every single migration step. We never shipped a migration without a tested path back to the previous state. Not because we expected to fail, but because the cost of not being able to roll back was unacceptable.
What zero incidents actually requires
Saying "zero production incidents" sounds like a headline. The reality behind it is less exciting: it's process, communication, and paranoia.
We built rollout playbooks for every migration phase. Not high-level documents — step-by-step runbooks that specified who does what, what signals to watch, what thresholds trigger a rollback, and how to communicate status to stakeholders. If something went wrong at 2 AM, the on-call engineer had a clear decision tree to follow.
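The rollback-threshold part of a runbook can be encoded directly, so the 2 AM decision is a lookup rather than a judgment call. A sketch with made-up threshold values:

```typescript
// Runbook decision logic as code: compare live signals against
// pre-agreed thresholds. Numbers here are illustrative.
interface RolloutSignals {
  errorRate: number;    // fraction of failed requests, 0..1
  p99LatencyMs: number; // tail latency of the new path
}

const THRESHOLDS = { maxErrorRate: 0.01, maxP99LatencyMs: 800 };

function decide(signals: RolloutSignals): "continue" | "rollback" {
  if (signals.errorRate > THRESHOLDS.maxErrorRate) return "rollback";
  if (signals.p99LatencyMs > THRESHOLDS.maxP99LatencyMs) return "rollback";
  return "continue";
}
```

Agreeing on the thresholds before the rollout is the point: nobody debates acceptable error rates during an incident.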
We used feature flags extensively to control migration rollout. New implementations were deployed but dark-launched behind flags, so we could enable them for internal users first, then a small percentage of merchants, then gradually ramp up. At every stage, we had dashboards tracking error rates, latency, and merchant-facing behavior for both the old and new paths.
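A percentage ramp works best when bucketing is deterministic, so a given merchant always lands on the same side of the flag at a given rollout percentage. A minimal sketch (the hash choice is arbitrary; any stable hash works):

```typescript
// Deterministic bucketing: hash the merchant id into 0..99 so the same
// merchant sees the same implementation until the ramp moves.
function bucket(merchantId: string): number {
  // Simple FNV-1a hash over the id's characters.
  let h = 2166136261;
  for (let i = 0; i < merchantId.length; i++) {
    h ^= merchantId.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % 100;
}

// Ramping from 1% to 100% is just raising rolloutPercent; merchants
// already on the new path stay on it.
function useNewPath(merchantId: string, rolloutPercent: number): boolean {
  return bucket(merchantId) < rolloutPercent;
}
```

The monotonicity matters for debugging: when the ramp goes from 10% to 25%, no merchant flips back to the old path, so a regression report maps cleanly to one implementation.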
And we invested heavily in parity testing — verifying that the new implementation produced identical outputs to the old one for the same inputs. This is tedious and not particularly fun, but it's what gives you confidence that the migration isn't silently changing behavior.
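The shape of a parity harness is simple: feed identical inputs to both implementations and record every divergence. A sketch with stand-in implementations (a trivial totaling function, not our real business logic):

```typescript
// Stand-ins for the two implementations under comparison.
function legacyTotal(amounts: number[]): number {
  return amounts.reduce((sum, a) => sum + a, 0);
}

function modernTotal(amounts: number[]): number {
  return amounts.reduce((sum, a) => sum + a, 0);
}

interface ParityResult {
  input: number[];
  legacy: number;
  modern: number;
  match: boolean;
}

// Run every input through both paths; mismatches are the output that
// matters — each one is a behavior change to explain before cutover.
function checkParity(inputs: number[][]): ParityResult[] {
  return inputs.map((input) => {
    const legacy = legacyTotal(input);
    const modern = modernTotal(input);
    return { input, legacy, modern, match: legacy === modern };
  });
}
```

In practice the most valuable inputs are replayed production traffic rather than hand-written cases, since that's where the undocumented edge cases live.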
The human side of migrations
Technical patterns matter, but the human side matters just as much. Legacy migrations touch a lot of teams. The engineers who built the original system have context you need. The product managers have opinions about which surfaces are highest-risk. The merchant support team knows where the pain points are.
I spent a lot of time in the early phases just talking to people. Understanding which parts of the system were fragile. Learning about the edge cases that never made it into documentation. Getting buy-in from teams whose code was about to change under them.
Migrations that ignore this step tend to create friction. Engineers feel like their work is being ripped out without respect. Product managers get nervous about stability. Support teams get blindsided by changes they weren't briefed on.
The best migrations aren't the ones with the cleverest technical solutions. They're the ones where every stakeholder feels informed, respected, and confident in the process.
Building migration patterns that outlast the migration
One of the things I'm most proud of from this project isn't the migration itself — it's that the patterns and playbooks we created were reused by other teams for their own modernization efforts.
When you're deep in a migration, it's tempting to just focus on getting through it. But if you take the time to document your approach — the decision frameworks, the rollout playbooks, the testing strategies — you create institutional knowledge that pays dividends long after your specific migration is done.
We documented everything: how we identified migration boundaries, how we structured feature flag rollouts, how we validated parity, how we communicated with stakeholders. When the next team needed to modernize their own legacy services, they didn't start from zero. They started from a playbook that had already been battle-tested in production.
That's the difference between doing a migration and building a migration capability. The first is a project. The second is an organizational asset.
Legacy modernization is slow, unglamorous, and absolutely critical. If you're in the middle of one and want to swap notes, I'm always up for that conversation — find me on LinkedIn.