tl;dr - Metabase v1.61.2's transform engine does not run ANALYZE on its output tables between a CTAS-then-rename swap and the next dependent transform's execution. With non-trivial downstream queries against the freshly-swapped table, this consistently produces query plans that exceed MB_DB_QUERY_TIMEOUT_MINUTES. We demonstrated this by inserting an explicit 5-minute delay between Transform A and Transform B (forcing autoanalyze to populate stats), which made Transform B complete in 4 minutes instead of timing out at 60+. The same failure now reproduces at the next junction (Transform B → Transform C) where autoanalyze fires 12 minutes too late. Additionally, when these long queries are running, Metabase's JDBC connection is dropped client-side ("connection to client lost" in Postgres logs) before the 60-minute statement timeout, creating zombie-state runs in the Metabase UI.
Hi all,
I've spent the last two weeks chasing a transform-timeout bug that I think reflects a gap in how Metabase Transforms interact with Postgres planner statistics. I want to share what I found in case anyone else is hitting the same thing, and to ask the team whether this is on the roadmap.
Environment
-
Metabase Pro Cloud, metabase-enterprise v1.61.2-X01
-
Add-ons: Advanced Transforms, Metabase AI Managed
-
Target database: AWS RDS PostgreSQL 16.8 (aarch64/Graviton)
-
~10 transforms in a dependency chain, all writing to a dedicated
analyticsschema -
All transforms tagged to a single daily job at 08:00 GMT
-
MB_DB_QUERY_TIMEOUT_MINUTES=60(set on our instance per Metabase support's earlier guidance)
Symptom
A daily scheduled transform job has been failing for ~3 weeks. The pattern:
-
The first downstream consumer of a freshly-swapped upstream transform runs for exactly 60 minutes and is cancelled by
statement_timeout. -
Every subsequent downstream transform in the chain — same SQL, same data — finishes in seconds.
-
Manually triggering the same transform a few minutes later always succeeds in 1–5 minutes.
Fixing one "stuck" transform (by rebuilding/renaming the underlying table) just moved the failure to the next transform in the chain. Each rebuild made a different transform the new canary. The bug follows the position in the DAG, not the table.
For simplicity, let's call the upstream root Transform A, its first downstream consumer Transform B, and the next-junction consumer (that reads from B) Transform C.
Root cause: no ANALYZE between CTAS-rename and the next dependent transform
Here's a typical failing 08:00 GMT scheduled run from our RDS Postgres logs:
08:09:50 ALTER TABLE transform_a RENAME TO <temp> ← upstream swap completes
08:10:10 CREATE TABLE <temp> AS <Transform B SQL> ← downstream consumer starts
08:11:00 automatic vacuum of transform_a
08:11:24 automatic analyze of transform_a ← stats finally populated, 74 sec too late
09:11:01 ERROR: canceling statement due to user request
STATEMENT: CREATE TABLE <temp> AS <Transform B SQL>
← exactly 60:51 elapsed → MB_DB_QUERY_TIMEOUT_MINUTES
09:11:04 Transform C starts ← runs in 1m 40s. Stats on transform_a are warm now.
09:13:26 next transform — 23 sec
09:13:53→09:14:14 next 4 transforms, all in seconds
The Transforms engine does CREATE TABLE temp_X AS SELECT … then ALTER TABLE prod RENAME TO old; ALTER TABLE temp_X RENAME TO prod. The new physical relation has zero rows in pg_statistic until autovacuum's worker happens to visit it. There's no mechanism in Postgres that auto-ANALYZEs after CTAS — that's expected Postgres behavior — but it means the first downstream transform that reads the freshly-swapped table plans against missing stats and frequently picks a catastrophic plan (nested loop where hash join is needed, sequential scan where index scan is needed, etc.).
Manual reruns succeed because by the time a human clicks Run, autoanalyze has long since populated stats.
The diagnostic that confirmed this
I inserted a tiny "buffer" transform between Transform A and its first dependent, body:
WITH delay AS MATERIALIZED (
SELECT pg_sleep(300) AS slept
)
SELECT
count(*) AS row_count,
max(date_refreshed) AS max_date_refreshed,
CURRENT_TIMESTAMP AS date_refreshed
FROM analytics.transform_a
CROSS JOIN delay;
The 5-minute sleep gives autoanalyze a window to populate stats before any real downstream consumer reads transform_a. Result on the next scheduled run:
21:47:36 transform_a swap
21:47:48 buffer starts (pg_sleep 300)
21:49:48 automatic analyze of transform_a ← stats populated during the sleep
21:52:51 buffer completes (5m 3s)
21:52:56 Transform B starts
21:57:19 Transform B COMPLETES in 4m 23s ← was timing out at 60 min before
Same SQL, same data, same statement_timeout, same scheduled job. The only change was giving Postgres a few minutes to autoanalyze the freshly-swapped table.
The cascade then moved to the next junction. Transform C reads from the freshly-swapped Transform B. Autoanalyze on Transform B fired 12 minutes after its swap — too late. Transform C started 1m 49s after the Transform B swap, locked in a bad plan, and ran for 33 minutes before the JDBC connection dropped (more on that below). So every CTAS-then-rename junction in the DAG appears to need its own buffer or its own explicit ANALYZE.
Questions for the Metabase team
-
Is a post-build
ANALYZEstep on the roadmap for Transforms? dbt'spost-hookrunsANALYZEfor exactly this reason on warehouses like Postgres. I couldn't find a Transforms equivalent in 1.60.x. -
Could the Cloud team expose a per-transform post-build SQL field (or a global "run
ANALYZEafter each Postgres transform swap" toggle)? -
Failing that, is there a way for the engine to wait for autovacuum/autoanalyze on the freshly-swapped relation before proceeding?
In the meantime, the pg_sleep buffer is working as a stopgap, but it requires one extra transform per junction in the DAG, which doesn't scale gracefully past 3–4 levels of dependency.
Two related things I noticed while debugging
Sharing in case they're useful to the staff diagnosing this or to other Cloud customers.
1. "connection to client lost" on long transforms. When a transform runs longer than ~10 minutes, our RDS Postgres logs show FATAL: connection to client lost on the Metabase JDBC connection — well before MB_DB_QUERY_TIMEOUT_MINUTES fires. This looks like Jetty's AsyncContext timeout closing the socket from the Metabase side. We're on Cloud and can't set MB_JETTY_ASYNC_RESPONSE_TIMEOUT ourselves — could the Cloud team bump this on our instance to match our 60-min query timeout? When the JDBC socket dies but Postgres doesn't immediately detect the dead client, the backend can keep running for hours: we had one zombie transform that ran in RDS for 2.5 days before being killed manually.
2. Duplicate upstream runs across job tags. When two job tags both transitively depend on the same upstream transform, that upstream gets rebuilt twice — sometimes in parallel. We observed two CREATE TABLE transform_a AS … backends running concurrently, with the second one's ALTER TABLE … RENAME waiting 19 minutes for AccessExclusiveLock. The docs do say "Metabase will run the dependency first, even if that transform isn't tagged in the job," but they don't mention how concurrent jobs interact. Aligning all our dependent transforms to a single tag was a partial workaround, but it would be nice to have explicit cross-tag deduplication in the engine.
Has anyone else seen this?
Especially curious whether anyone on Postgres is running Advanced Transforms with deep dependency chains (3+ levels) and what their post-CTAS strategy looks like. I'd also love to hear from anyone on Snowflake, BigQuery, or Redshift about whether the same pattern reproduces there — those warehouses have very different planner-stats behavior, so they may or may not exhibit this.
Thanks —