Staging passes. CI is green. A teammate adds a region column to the orders table, backfills 25,000 rows, and opens a PR. Everything looks clean.
The migration introduces a 92x query regression that nobody will catch until production.
Except this time, Scry caught it first.
The Migration
Picture a typical e-commerce app. The orders table has ~25,000 rows, a handful of indexes, and a steady workload of filters and joins. A teammate adds a region column for a new geo-filtering feature:
-- Add the region column
ALTER TABLE orders ADD COLUMN region VARCHAR(50);
-- Backfill existing orders with region data
UPDATE orders SET region = CASE
  WHEN (id % 5) = 0 THEN 'west'
  WHEN (id % 5) = 1 THEN 'east'
  WHEN (id % 5) = 2 THEN 'central'
  WHEN (id % 5) = 3 THEN 'south'
  ELSE 'north'
END;
-- NOTE: No index added!
In staging, with 100 rows, the filter query returns in under a millisecond. CI is green. The PR gets approved.
Why Staging Alone Missed It
Staging has two structural blind spots:
Wrong data volume. 100 rows vs. 25,000 (or 25 million). PostgreSQL’s query planner makes different choices at different scales – on 100 rows a sequential scan is genuinely faster than an index lookup, so nothing looks wrong. On 25,000 rows, the same sequential scan is a performance cliff.
Wrong query patterns. Test suites run a handful of known queries. Production runs hundreds of distinct patterns with real concurrency and real data skew. The interaction between a new column and existing queries only surfaces when you replay actual traffic.
Staging tests correctness, not performance at scale. You need both. (See 100 Migrations Later for the longer argument.)
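The volume effect is easy to demonstrate outside the database. Here is a toy Python sketch (illustrative only – not PostgreSQL's planner): a sequential scan touches every row, so its work grows with table size, while an index-style lookup goes straight to the matching rows. The row shape mirrors the backfill above (`id % 5 == 0` → `'west'`).

```python
from collections import defaultdict

def make_rows(n):
    # Mirror the backfill: id % 5 == 0 -> 'west', 1 -> 'east', ...
    regions = ["west", "east", "central", "south", "north"]
    return [{"id": i, "region": regions[i % 5]} for i in range(1, n + 1)]

def seq_scan(rows, region):
    # Touches every row, like "Seq Scan on orders"
    return [r for r in rows if r["region"] == region]

def build_index(rows):
    # One-time cost, then lookups jump straight to the matches
    idx = defaultdict(list)
    for r in rows:
        idx[r["region"]].append(r)
    return idx

rows = make_rows(25000)
idx = build_index(rows)
assert seq_scan(rows, "west") == idx["west"]  # same 5,000 rows either way
# The scan inspected all 25,000 rows; the index touched only the 5,000 matches.
```

At 100 rows both paths are effectively instant, which is exactly why staging stays quiet.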
But this migration was also running through Scry.
What Scry Found
Here’s what happens when this migration runs through Scry’s pipeline:
- scry-proxy is already capturing production queries transparently – no application changes needed.
- The migration is applied to a shadow database, a CDC-replicated copy of production that maintains real data volume and distribution.
- Scry replays the captured query workload against the shadow, comparing latency before and after the migration.
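The comparison at the heart of the replay step reduces to per-pattern latency ratios. A simplified Python sketch (not Scry's actual report format or thresholds; the latencies below are illustrative numbers consistent with the scenario's 92x regression):

```python
def find_regressions(before_ms, after_ms, threshold=2.0):
    """Flag query patterns whose median latency grew by at least `threshold`x.

    before_ms / after_ms map a query pattern to its median latency in ms.
    """
    regressions = {}
    for pattern, base in before_ms.items():
        ratio = after_ms[pattern] / base
        if ratio >= threshold:
            regressions[pattern] = ratio
    return regressions

# Illustrative latencies: the region filter jumps from 2 ms to 184 ms (92x),
# while an unrelated primary-key lookup is unchanged.
before = {"SELECT * FROM orders WHERE region = $1": 2.0,
          "SELECT * FROM orders WHERE id = $1": 0.1}
after  = {"SELECT * FROM orders WHERE region = $1": 184.0,
          "SELECT * FROM orders WHERE id = $1": 0.1}
print(find_regressions(before, after))
# -> {'SELECT * FROM orders WHERE region = $1': 92.0}
```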
The replay report flags one pattern – the region filter – as a 92x latency regression. EXPLAIN ANALYZE confirms the root cause:
-- Without index: Seq Scan (184ms)
Seq Scan on orders (cost=0.00..1250.00 rows=5000 width=120)
(actual time=0.045..184.23 rows=5000 loops=1)
Filter: (region = 'west'::text)
Rows Removed by Filter: 20000
Planning Time: 0.089 ms
Execution Time: 184.67 ms
The Fix
Two indexes:
CREATE INDEX CONCURRENTLY idx_orders_region ON orders(region);
-- For queries that filter by region AND status, a composite index is even better:
CREATE INDEX CONCURRENTLY idx_orders_region_status ON orders(region, status);
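You can watch the plan flip in miniature with SQLite, which ships with Python (a stand-in for Postgres here: SQLite has no CONCURRENTLY, and its EXPLAIN QUERY PLAN text differs from EXPLAIN ANALYZE, but the seq-scan-to-index-scan transition is the same idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, status TEXT)")
regions = ["west", "east", "central", "south", "north"]
conn.executemany("INSERT INTO orders (id, region) VALUES (?, ?)",
                 [(i, regions[i % 5]) for i in range(1, 25001)])

q = "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE region = 'west'"
plan_before = conn.execute(q).fetchall()[0][3]   # full-table scan

conn.execute("CREATE INDEX idx_orders_region ON orders(region)")
plan_after = conn.execute(q).fetchall()[0][3]    # search via idx_orders_region

print(plan_before)
print(plan_after)
```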
After applying the fix to the shadow and re-running the replay, every pattern is back to baseline:
-- With index: Index Scan (3ms)
Index Scan using idx_orders_region on orders
(cost=0.29..125.40 rows=5000 width=120)
(actual time=0.032..2.89 rows=5000 loops=1)
Index Cond: (region = 'west'::text)
Planning Time: 0.112 ms
Execution Time: 3.14 ms
184ms to 3ms. Regression eliminated before it ever touched production – no pages, no customer impact, no incident channel. In CI, the whole cycle – apply migration, replay traffic, detect regression, apply fix, re-validate – fits in a single command:
scry ci test-migration prod-db/ci-main -- alembic upgrade head
Exit code 0: safe to ship. Exit code 5: regressions detected, pipeline fails before production.
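If your pipeline wraps the command rather than calling it directly, the gate logic is just a branch on those documented exit codes. A minimal Python sketch – the stub stands in for invoking the scry ci command via subprocess:

```python
# Exit codes documented above: 0 = safe to ship, 5 = regressions detected.
SAFE, REGRESSIONS = 0, 5

def run_scry_ci() -> int:
    # Stub standing in for something like:
    #   subprocess.run(["scry", "ci", "test-migration", ...]).returncode
    return REGRESSIONS  # pretend the replay found a regression

code = run_scry_ci()
verdict = "ship" if code == SAFE else "block merge"
print(verdict)  # -> block merge
```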
A 92x regression on a core query path would have meant pages firing, customers timing out, and an engineer reverse-engineering what changed. Instead, it was a three-line fix before the PR merged.
You can replay this exact scenario locally in under two minutes:
scry demo
Run the full end-to-end test yourself
We’re looking for design partners – teams shipping migrations against production PostgreSQL who want to close the gap between staging and prod. Request early access.