Why we ship in production from day one

Staging environments are a tax most small teams cannot afford, and a comfort blanket that hides the bugs they were meant to catch. Here is when to skip them and when not to.

6 min read · Navdeep Singh
Tags: engineering, shipping, infrastructure, process

Every Sifotech product has been live in production from the day its first feature compiled. There is no staging URL. There is no "uat" branch. There is main, there is a preview deployment attached to each pull request, and there is the production deployment. That's it.

This is heresy in most engineering shops. So let me argue for it, and then immediately argue against it — because there are at least three situations where I would build a staging environment tomorrow, and I want to be honest about both sides.

The hidden cost of staging

A staging environment is not free. Most teams underestimate the ongoing tax it adds:

  • A second database project (or schema) with its own RLS, its own seed data, its own drift from prod.
  • A second set of API keys for every third party — payments, email, telephony, analytics.
  • A second domain or subdomain, a second SSL cert, a second set of webhook endpoints.
  • A second CSP, a second set of secrets, a second deployment pipeline.
  • A team habit of "it works on staging" — which is a meaningless statement, because staging is not production.

For a five-person team with a paid DevOps engineer, this is fine. For a small studio shipping six products in parallel, it is the difference between shipping and not shipping. Every minute spent reconciling staging-vs-prod drift is a minute not spent on the actual product.

The staging-shaped lie

The deeper problem with staging is more subtle than the maintenance cost. Staging environments give you false confidence. The shape of the data is wrong, the volume is wrong, the traffic patterns are wrong, and the real users are not there. So you ship something that "passed staging" and it falls over the first time a real customer hits an edge case nobody seeded.

We have seen this in every previous role. The bugs that hurt are never the ones that show up on staging. The bugs that hurt are the ones that depend on real data, real concurrency, real user behaviour. Staging hides them; it does not catch them.

What we do instead

The substitutes we use, in roughly the order they catch problems:

1. Preview deployments per PR

Every pull request gets its own URL with its own deployment. This is staging at the right granularity — per change, not per environment. The reviewer sees the actual diff running on the actual platform with the actual build pipeline. If it 500s on the preview, it will 500 on prod.

2. Branch databases

Branch databases (or, in cheaper setups, a SQL transaction-per-test pattern) let you exercise migrations without touching prod data. We treat the migration as the code-under-test, not the runtime behaviour. Migrations on critical data are one of the few parts of the stack that genuinely deserve a staging step, as covered under "When you absolutely need staging" below.
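
To make the transaction-per-test pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The invoices table, the currency column, and the function names are all hypothetical, and a real setup would point at a branch of the production database rather than an in-memory one. The idea is only that the migration runs, gets inspected, and is rolled back.

```python
import sqlite3

# Hypothetical migration under test: add a currency column to invoices.
MIGRATION = "ALTER TABLE invoices ADD COLUMN currency TEXT NOT NULL DEFAULT 'GBP'"


def column_names(conn: sqlite3.Connection, table: str) -> list[str]:
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def dry_run_migration(conn: sqlite3.Connection, migration: str) -> list[str]:
    """Apply the migration inside a transaction, inspect it, then roll back."""
    conn.execute("BEGIN")
    try:
        conn.execute(migration)
        return column_names(conn, "invoices")
    finally:
        conn.execute("ROLLBACK")  # nothing from the dry run is kept


# isolation_level=None puts sqlite3 in autocommit mode, so the explicit
# BEGIN / ROLLBACK above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, amount_pence INTEGER)")
conn.execute("INSERT INTO invoices (amount_pence) VALUES (1200)")

columns_during = dry_run_migration(conn, MIGRATION)  # includes "currency"
columns_after = column_names(conn, "invoices")       # back to the original schema
```

This relies on the database supporting transactional DDL, which SQLite and Postgres do; on engines that commit schema changes implicitly, a branch database is the safer route.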

3. Feature flags + small percentages

The riskiest releases ship to ourselves, then to 5% of users, then to 25%, then to all. Most "we need a staging environment" instincts are actually "we need a way to roll back fast." Feature flags solve that more cleanly than a parallel environment.
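
One common way to get a stable "5%, then 25%" rollout is to hash the flag name and user id into a bucket from 0 to 99. The sketch below assumes that approach; it is not a description of any particular flag service, and the flag name and staff id are made up. Because the bucket never changes, raising the percentage only ever adds users.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, percent: int,
               staff: frozenset[str] = frozenset()) -> bool:
    """Staff always see the flag; everyone else is admitted by bucket."""
    if user_id in staff:
        return True
    return bucket(flag, user_id) < percent


STAFF = frozenset({"staff-1"})  # hypothetical internal user id
users = [f"user-{n}" for n in range(1000)]
at_5 = {u for u in users if is_enabled("new-checkout", u, 5, STAFF)}
at_25 = {u for u in users if is_enabled("new-checkout", u, 25, STAFF)}
```

Rolling back is then a matter of setting the percentage to 0, which is the "roll back fast" property the paragraph above is really asking for.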

4. Synthetic prod checks

A small /api/health cron that hits the critical paths (login, the one paid action, the one webhook) every two minutes. If anything 500s, we know before a customer tells us. This is more useful than any staging test suite, because it runs against the real system.
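
A check like this can be very small. The sketch below uses only the Python standard library; the three paths and the example.test host are placeholders rather than real endpoints, and the fetch function is injectable so the logic can be exercised without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical critical paths; a real list would mirror the paths that cost money.
CRITICAL_PATHS = ["/api/health/login", "/api/health/checkout", "/api/health/webhook"]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET, or 0 if the request never completed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions
    except OSError:
        return 0  # DNS failure, refused connection, timeout


def failing_paths(base_url: str, paths: list[str],
                  fetch: Callable[[str], int] = http_status) -> list[str]:
    """Return every path whose check did not come back with a 2xx status."""
    return [path for path in paths if not 200 <= fetch(base_url + path) < 300]


def stub_fetch(url: str) -> int:
    """Stand-in for http_status, for illustration only."""
    return 500 if url.endswith("/checkout") else 200


broken = failing_paths("https://example.test", CRITICAL_PATHS, fetch=stub_fetch)
```

A scheduled job would call failing_paths with the default fetch and alert whenever the returned list is non-empty.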

5. Smoke tests on PR

Each repo has a ten-second Playwright run that opens the homepage, hits the primary CTA, and confirms it gets to a known state. That's it. We don't try to test everything. We try to test the path that, if broken, costs money.

When you absolutely need staging

I would build a staging environment, immediately, in any of these three cases. Pretending otherwise would be irresponsible.

Data migrations on critical data

The day we move customer financial records, accounting ledgers, NHS health entries or anything else where a botched migration is a regulatory incident — that day, we run the migration against a copy of production data first, and we sign off on the diff before flipping to prod. Branch databases make this affordable. We do this for BahiKhata's quarterly VAT export. We do not do it for "the marketing site has a new section."
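
"Signing off on the diff" can be as simple as comparing a snapshot of the table before and after the dry run. This is a minimal sketch of that comparison, assuming rows are loaded as dictionaries keyed by an id column; the column names and values are invented for illustration.

```python
from typing import Any

Row = dict[str, Any]


def diff_rows(before: list[Row], after: list[Row], key: str = "id") -> dict[str, list]:
    """Compare two snapshots of a table by primary key."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical snapshots taken before and after the migration dry run.
before = [{"id": 1, "vat_pence": 200}, {"id": 2, "vat_pence": 450}]
after = [{"id": 1, "vat_pence": 200}, {"id": 2, "vat_pence": 540}, {"id": 3, "vat_pence": 0}]
report = diff_rows(before, after)
```

Here the report lists row 3 as added and row 2 as changed, which is the kind of short summary a human can actually review before approving the production run.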

Third-party webhooks with side effects

Payment webhooks that issue refunds. Inbound SMS that triggers driver dispatch. Email parsers that auto-file expenses. If a misfire would touch money or move a human in the real world, route those through a staging webhook handler first. Most payment providers ship a CLI tool that lets you replay webhooks locally — use it before every payment-flow change.
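
One way to build a staging webhook handler is to give the real handler a dry-run mode that records what it would have done. The sketch below assumes that design; the event shape, field names, and refund callback are hypothetical and do not correspond to any specific payment provider.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WebhookHandler:
    """Handles payment events; in dry-run mode it records actions instead of running them."""
    issue_refund: Callable[[str, int], None]
    dry_run: bool = True
    planned: list[tuple[str, str, int]] = field(default_factory=list)

    def handle(self, event: dict) -> None:
        # Hypothetical event shape; real providers each define their own payload.
        if event.get("type") != "refund.requested":
            return
        charge_id, amount = event["charge_id"], event["amount_pence"]
        if self.dry_run:
            self.planned.append(("refund", charge_id, amount))
        else:
            self.issue_refund(charge_id, amount)


refunds_sent: list[tuple[str, int]] = []
handler = WebhookHandler(issue_refund=lambda c, a: refunds_sent.append((c, a)))
handler.handle({"type": "refund.requested", "charge_id": "ch_123", "amount_pence": 999})
handler.handle({"type": "invoice.paid", "charge_id": "ch_456", "amount_pence": 100})
```

After replaying recorded events through the dry-run handler, the planned list shows exactly which refunds a change would trigger, and no money has moved.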

Compliance certification audits

SOC 2, ISO 27001, Cyber Essentials Plus: auditors expect to see a documented separation between dev, staging, and prod. If you intend to certify, build the environment to match the control. We will hit this with ComplyOS as we move into enterprise contracts. We did not hit it on day one.

The honest middle path

The argument we are actually making is not "staging is bad." It is "the cost of staging is real, and most small teams pay it without auditing whether they got their money's worth." A small studio shipping multiple products cannot afford the parallel-environment tax. A team selling to the NHS cannot afford to skip it.

The right answer is to ask:

If this change went wrong in production, what is the worst that could happen?

If the answer is "the marketing copy is briefly wrong" — ship to prod, fix forward. If the answer is "a customer is double-billed" — build the staging step, route the webhook, sign off the diff, then ship. The mistake is treating both with the same gravity.

What ships in production from day one at Sifotech

For reference, here is the actual policy across the six products:

  • Marketing pages, content, copy, design changes → ship to prod on merge to main. The PR preview is the staging step.
  • New features behind a flag → ship the code to prod on merge, enable the flag for ourselves first, then expand.
  • Database migrations on non-critical tables → run forward, prepare a down-migration, ship.
  • Database migrations on financial or health tables → branch DB, dry-run, diff, sign off, ship.
  • Webhook handler changes → payment-provider CLI replay, manual verification, ship.
  • Auth / RLS changes → mandatory pair review (or a 24-hour cooling-off period before merge — yes, really), then ship.

This is the closest thing we have to a "deployment process." It fits on a sticky note. It has not let us down across six products and roughly 600 production deploys this year.
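
A policy this small can also live in the repo as data, so a CI check or PR template can look it up. The sketch below transcribes the list above into a dictionary; the key and step names are made up for illustration, and falling back to the strictest path for an unknown change class is an assumption, not part of the stated policy.

```python
# Change classes and steps transcribed from the policy list; names are illustrative.
POLICY: dict[str, list[str]] = {
    "content": ["pr_preview", "ship"],
    "flagged_feature": ["ship", "enable_for_staff", "expand"],
    "migration_noncritical": ["run_forward", "prepare_down_migration", "ship"],
    "migration_critical": ["branch_db", "dry_run", "diff", "sign_off", "ship"],
    "webhook_handler": ["cli_replay", "manual_verification", "ship"],
    "auth_or_rls": ["pair_review_or_24h_cooling_off", "ship"],
}


def required_steps(change_class: str) -> list[str]:
    """Look up the steps for a change; unknown classes get the strictest path."""
    # Assumption: when in doubt, treat the change like a critical migration.
    return POLICY.get(change_class, POLICY["migration_critical"])


steps_for_copy_change = required_steps("content")
steps_for_unknown = required_steps("something-new")
```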

If you are running a small team and you have inherited a staging environment, the question worth asking is not "should we keep it?" — it is "what specific class of bug is this catching that our PR previews and smoke tests are not?" If you cannot name the class of bug, you are paying the tax for nothing.