Six lessons from running our own multi-tenant SaaS

We build software for clients, but we also build and operate a fleet management SaaS ourselves. Here are six things you only learn when you have to keep the lights on yourself.

METZ CODE has been running its own multi-tenant fleet management platform since 2020. It serves multiple operators across the EU. We built it the way we'd want a vendor to build for us — and along the way we collected a list of mistakes that you can only make when you're both the engineer and the customer.

1. Multi-tenant from day one (or pay the migration tax later)

The temptation in the early days is to build a single-tenant app and "add multi-tenancy when we get there." Don't. The migration cost is enormous, and it stretches across every part of the system.

Even adding a tenant_id column to every table is non-trivial after the fact. You have to:

Backfill the column for every existing record.
Add it to every query (and remember it on every JOIN).
Update every API endpoint to scope by tenant.
Migrate existing customers into "tenants" without breaking their integrations.

If there's any chance you'll have more than one customer, design for tenants on day one. Tenant table, tenant_id everywhere, row-level security if your database supports it, and middleware that enforces scoping at the request boundary.
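As a minimal sketch of scoping at the request boundary and in the repository layer (illustrative only, not the platform's actual code): the names tenant_scope, require_tenant, VehicleRepo and the vehicles table are hypothetical, SQLite stands in for the real database, and a context manager stands in for the web framework's middleware.

```python
import sqlite3
from contextlib import contextmanager
from contextvars import ContextVar

# Holds the tenant for the current request; None means "no tenant resolved".
_current_tenant = ContextVar("current_tenant", default=None)


@contextmanager
def tenant_scope(tenant_id):
    """Stand-in for request middleware: bind a tenant for one request."""
    token = _current_tenant.set(tenant_id)
    try:
        yield
    finally:
        _current_tenant.reset(token)


def require_tenant():
    """Return the bound tenant, or fail loudly instead of running unscoped."""
    tenant_id = _current_tenant.get()
    if tenant_id is None:
        raise RuntimeError("no tenant in scope; refusing to touch tenant data")
    return tenant_id


class VehicleRepo:
    """All access goes through here, so no query can forget tenant_id."""

    def __init__(self, conn):
        self.conn = conn

    def add(self, plate):
        self.conn.execute(
            "INSERT INTO vehicles (tenant_id, plate) VALUES (?, ?)",
            (require_tenant(), plate),
        )

    def list_plates(self):
        rows = self.conn.execute(
            "SELECT plate FROM vehicles WHERE tenant_id = ? ORDER BY plate",
            (require_tenant(),),
        )
        return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (tenant_id TEXT NOT NULL, plate TEXT NOT NULL)")
repo = VehicleRepo(conn)

with tenant_scope("tenant-a"):
    repo.add("A-001")
with tenant_scope("tenant-b"):
    repo.add("B-001")
    visible_to_b = repo.list_plates()  # only tenant-b's rows
```

Calling repo.list_plates() outside any tenant_scope raises instead of returning every tenant's rows, which is the failure mode you want. In a database that supports row-level security, the same tenant value would typically also be handed to the database session so the policy can enforce it a second time.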

2. Tenant isolation is harder than it looks

"Multi-tenant" sounds simple: add a tenant_id and filter on it. In practice, the leaks come from places nobody looks at:

Background jobs. A job kicked off by tenant A reads from a queue without re-checking the tenant context, and writes to tenant B's data.
File uploads. File paths leak tenant identifiers, and a misconfigured S3 bucket exposes one tenant's data to another.
Caches. A cache key without the tenant_id returns tenant A's data to tenant B's request.
Logs and analytics. Aggregated logs surface specific tenant data to all tenant admins.
Search indexes. Elasticsearch with poor query construction returns documents across tenants.

The answer is defense in depth. Enforce tenant scoping at four layers: request middleware, ORM/repository, database (row-level security), and infrastructure (separate buckets, separate keys per tenant). One layer alone is not enough.
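The cache leak is the easiest one to show in code. The sketch below assumes a hypothetical TenantCache wrapper with an in-memory dict standing in for Redis or Memcached; the point is only that the tenant is a mandatory part of every key, so a caller cannot build an unscoped one.

```python
from typing import Any, Callable, Dict, Tuple


class TenantCache:
    """In-memory stand-in for a shared cache; every key includes the tenant."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], Any] = {}

    def get_or_load(self, tenant_id: str, key: str, loader: Callable[[], Any]) -> Any:
        # Reject empty tenant ids so "forgot to pass it" cannot share a slot.
        if not tenant_id:
            raise ValueError("tenant_id is required for every cache access")
        full_key = (tenant_id, key)
        if full_key not in self._store:
            self._store[full_key] = loader()
        return self._store[full_key]


cache = TenantCache()
# Same logical key, different tenants: two separate cache entries.
dash_a = cache.get_or_load("tenant-a", "dashboard", lambda: {"vehicles": 12})
dash_b = cache.get_or_load("tenant-b", "dashboard", lambda: {"vehicles": 3})
```

The same shape applies to background jobs: put the tenant_id in the job payload and re-establish the tenant context when the worker picks the job up, rather than trusting whatever context the worker happened to have.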

3. Per-tenant feature flags will multiply

The first tenant says "can we have a slightly different invoice format?" You add a config flag. The second tenant wants different mileage units. Another flag. By tenant 12, you have 50 flags, half of them undocumented, and everyone is afraid to delete any of them because nobody remembers which tenant uses what.

What we do now:

Every config flag is checked into source control with a description and a default.
Every flag has a "deprecated_at" date. If nothing's using it 6 months past that date, we remove it.
Per-tenant config goes through a migration process with a paper trail. No ad hoc SQL UPDATE in production.
Reports show which tenants are using which non-default flags, so we can plan deprecation.

Treat tenant configuration as code, not as data. Otherwise it becomes the dumbest part of your system, and the most expensive to change.
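One way the rules above could look in code, as a sketch under assumptions: the Flag dataclass, the three example flags, the tenant names, the dates and the roughly six-month grace period (183 days) are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Any, Optional


@dataclass(frozen=True)
class Flag:
    name: str
    description: str  # required, so no flag goes undocumented
    default: Any
    deprecated_at: Optional[date] = None


# The registry lives in source control; undeclared flags do not exist.
FLAGS = {
    flag.name: flag
    for flag in [
        Flag("invoice_format", "Invoice layout variant", "standard"),
        Flag("mileage_unit", "Unit for distance fields", "km"),
        Flag("legacy_export", "Old CSV export", False, deprecated_at=date(2023, 1, 1)),
    ]
}

# Per-tenant overrides; assumed to change only via a reviewed migration.
TENANT_OVERRIDES = {
    "tenant-a": {"mileage_unit": "mi"},
    "tenant-b": {"invoice_format": "compact", "mileage_unit": "km"},
}


def get_flag(tenant_id, name):
    flag = FLAGS[name]  # KeyError for flags that were never declared
    return TENANT_OVERRIDES.get(tenant_id, {}).get(name, flag.default)


def non_default_usage():
    """Map each flag to the tenants whose value differs from the default."""
    usage = {name: [] for name in FLAGS}
    for tenant_id, overrides in sorted(TENANT_OVERRIDES.items()):
        for name, value in overrides.items():
            if value != FLAGS[name].default:
                usage[name].append(tenant_id)
    return usage


def removable(today, grace=timedelta(days=183)):
    """Flags past deprecated_at plus the grace period that no tenant overrides."""
    usage = non_default_usage()
    return [
        name
        for name, flag in FLAGS.items()
        if flag.deprecated_at is not None
        and today - flag.deprecated_at > grace
        and not usage[name]
    ]
```

The non_default_usage report is what makes deletion safe: a flag with an empty tenant list and an expired deprecation date can be removed without guessing.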

4. Authentication boundaries everywhere

In a multi-tenant system, every request needs to answer two questions:

Who is the user?
Which tenant is this user acting as?

It's tempting to derive #2 from the URL or the subdomain. That works until a user has access to multiple tenants (employee at one company, contractor at another), or until a support engineer needs to impersonate a customer for debugging. Then you have a JWT that says "I'm Alice" and the system has to figure out which tenant Alice is acting for.

What we do: tenant context is explicit in every authenticated request (JWT claim, header, or query param) and validated against the user's actual permissions for that tenant on every request. Yes, every request. Caching the permission lookup is fine; skipping the check is how data leaks happen.
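A sketch of that check, with assumptions stated up front: the claims dict is taken to come from a JWT whose signature has already been verified, "sub" is the standard JWT subject claim, and MEMBERSHIPS, RequestContext and resolve_context are hypothetical names standing in for a real permissions store and framework hook.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class TenantAccessDenied(Exception):
    """The user may not act as the tenant named in the request."""


# Hypothetical membership data: user id -> tenants they may act as.
MEMBERSHIPS = {
    "alice": {"acme-logistics", "nordic-freight"},
    "bob": {"acme-logistics"},
}


@dataclass(frozen=True)
class RequestContext:
    user_id: str
    tenant_id: str


def resolve_context(claims: Dict[str, str], requested_tenant: Optional[str]) -> RequestContext:
    """Answer both questions (who, and acting as which tenant) or refuse."""
    user_id = claims["sub"]
    if not requested_tenant:
        # Never guess a tenant from the URL or fall back to "the first one".
        raise TenantAccessDenied("request did not name a tenant")
    if requested_tenant not in MEMBERSHIPS.get(user_id, set()):
        raise TenantAccessDenied(f"{user_id} may not act as {requested_tenant}")
    return RequestContext(user_id=user_id, tenant_id=requested_tenant)


ctx = resolve_context({"sub": "alice"}, "nordic-freight")
```

Support impersonation fits the same shape: it becomes an explicit, audited membership entry rather than a bypass of the check.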

5. Observability that distinguishes tenants

The first time a tenant says "the system is slow," you'll regret not having per-tenant metrics. You can't debug tenant-specific issues with global dashboards. You need:

Logs tagged with tenant_id (and a way to view per-tenant log streams).
Metrics dimensioned by tenant (P99 latency for tenant A vs the global P99).
Tracing that survives across services and includes the tenant context.
Alerts that fire when one tenant degrades (not just when global SLOs slip).

Most observability stacks make this hard because cardinality is expensive. The trade-off is real, but the alternative — one tenant melts down, you can't tell — is worse. Pay for the cardinality.
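To see why a global dashboard hides a single tenant melting down, here is a small self-contained sketch. LatencyRecorder, the tenant names, the sample numbers and the 500 ms threshold are invented for illustration; a real setup would use the metrics backend's own labels and percentile functions, and the percentile here is a simple nearest-rank calculation.

```python
from collections import defaultdict


class LatencyRecorder:
    """Keeps latency samples per tenant so percentiles can be sliced either way."""

    def __init__(self):
        self._by_tenant = defaultdict(list)

    def observe(self, tenant_id, millis):
        self._by_tenant[tenant_id].append(millis)

    @staticmethod
    def _percentile(samples, pct):
        # Nearest-rank percentile using integer ceiling division.
        if not samples:
            raise ValueError("no samples recorded")
        ordered = sorted(samples)
        rank = max(1, (pct * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    def p99(self, tenant_id=None):
        """P99 for one tenant, or across all tenants when tenant_id is None."""
        if tenant_id is None:
            samples = [v for values in self._by_tenant.values() for v in values]
        else:
            samples = self._by_tenant.get(tenant_id, [])
        return self._percentile(samples, 99)

    def degraded_tenants(self, threshold_ms):
        """Tenants whose own P99 is above the threshold, for per-tenant alerts."""
        return sorted(t for t in self._by_tenant if self.p99(t) > threshold_ms)


rec = LatencyRecorder()
for _ in range(995):
    rec.observe("tenant-a", 20)    # large, healthy tenant
for _ in range(5):
    rec.observe("tenant-b", 2000)  # small tenant having a very bad day

global_p99 = rec.p99()                 # dominated by tenant-a's traffic
tenant_b_p99 = rec.p99("tenant-b")     # the number tenant-b actually feels
alerts = rec.degraded_tenants(500)
```

With these made-up numbers the global P99 stays at 20 ms while tenant-b sits at 2000 ms, so only the per-tenant alert fires.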

6. Onboarding is the product

For B2B SaaS, the onboarding flow isn't a one-time thing your sales team does. It's the product, because:

Every new user goes through it.
Every existing user goes through bits of it whenever they start using a new feature.
Most churn happens in the first 30 days.
Every onboarding step that fails silently is a customer you lose.

We rebuilt our fleet onboarding three times. The current version: every step has a defined "completed" state, every failure has a recovery path, and the support team can see exactly which step a customer got stuck on without asking. The result is a 60% reduction in support tickets in the first 30 days.
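The three properties in that description (a defined completed state, a recovery path per failure, and visibility for support) can be modeled roughly as follows. The step names, recovery hints and the Onboarding class are hypothetical examples, not the platform's actual onboarding flow.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class StepState(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


# Hypothetical steps, in order, each with a recovery hint for failures.
RECOVERY = {
    "create_account": "resend the verification email",
    "import_vehicles": "re-upload using the CSV template",
    "invite_drivers": "check the invited email addresses",
    "connect_telematics": "re-enter the device API key",
}
STEPS = list(RECOVERY)


@dataclass
class Onboarding:
    tenant_id: str
    states: Dict[str, StepState] = field(
        default_factory=lambda: {step: StepState.PENDING for step in STEPS}
    )
    errors: Dict[str, str] = field(default_factory=dict)

    def complete(self, step: str) -> None:
        if step not in self.states:
            raise KeyError(step)
        self.states[step] = StepState.COMPLETED
        self.errors.pop(step, None)

    def fail(self, step: str, reason: str) -> None:
        # Failures are recorded, never swallowed.
        if step not in self.states:
            raise KeyError(step)
        self.states[step] = StepState.FAILED
        self.errors[step] = reason

    def stuck_on(self) -> Optional[str]:
        """First step that is not completed, or None when onboarding is done."""
        for step in STEPS:
            if self.states[step] is not StepState.COMPLETED:
                return step
        return None

    def support_view(self) -> str:
        """What a support engineer sees without having to ask the customer."""
        step = self.stuck_on()
        if step is None:
            return "onboarding complete"
        if self.states[step] is StepState.FAILED:
            return f"{step}: failed ({self.errors[step]}); recovery: {RECOVERY[step]}"
        return f"{step}: pending"


flow = Onboarding("tenant-a")
flow.complete("create_account")
flow.fail("import_vehicles", "CSV missing 'plate' column")
summary = flow.support_view()
```

Because every step has an explicit state, the same data can feed instrumentation, for example counting how many tenants are stuck on each step.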

If you're building a SaaS, treat onboarding as a feature with the same weight as your core product. Spec it. Test it. Instrument it. Iterate on it monthly.

What you can take from this

Multi-tenant SaaS is harder than it looks, but it's not magic. The mistakes are knowable in advance. If you're building one, the cheapest things you can do are:

Decide on multi-tenant on day one.
Invest in tenant isolation across all four layers: request, ORM, database, and infrastructure.
Treat configuration as code.
Build per-tenant observability before you have many tenants.
Spend at least 20% of your engineering time on onboarding.

If you're considering one and want a second pair of eyes on the architecture, that's exactly what our Tech Strategy engagements look like. Tell us about your project.