Hi,
I’m running Metabase in Docker with MySQL as the application database.
I recently tried upgrading from v0.43.2 straight to v0.56.0.1, but after startup, the service never becomes healthy.
What I did:
- Stopped my v0.43.2 container (running fine).
- Pulled the v0.56.0.1 image and started it using the same MySQL DB connection.
- Watched the logs during startup.
What happens:
The migration log shows success for migrations/001_update_migrations.yaml, but afterwards, /api/health continuously returns 503. The log just repeats the health check errors every ~15 seconds.
Logs:
2025-08-11 02:27:06,387 INFO liquibase.changelog :: ChangeSet migrations/001_update_migrations.yaml::v50.2024-01-10T03:27:31::noahmoss ran successfully in 152ms
2025-08-11 02:27:12,461 ERROR middleware.log :: HEAD /api/health 503 0ms (0 DB calls) {:metabase-user-id nil}
2025-08-11 02:27:27,498 ERROR middleware.log :: HEAD /api/health 503 0ms (0 DB calls) {:metabase-user-id nil}
...
Environment:
Other notes:
Has anyone run into this before? Should I try upgrading via intermediate releases (e.g., v0.50.x) or is this upgrade path supported?
Thanks!
Are you sure the migrations finished? There's a particularly large one in there that can take a while to finish. The step after v50.2024-01-10T03:27:31 creates data_permissions and migrates data from the previous permissions table, which I believe can get quite large if you have a lot of Metabase objects.
Usually there's an accompanying log message if you try to hit Metabase while startup is still running; I don't know if that applies to the health check or not. Make sure the container doesn't try to auto-heal itself (i.e. terminate & restart) during the migration, otherwise it may never finish.
If you want to peek at what the migration is doing, log into the metabase database and look at the databasechangelog table; the most recent entry by orderexecuted marked EXECUTED is what's been done so far. The 'vXX' in the id field is the Metabase version the rule was written for. If things are still running, you should be able to see them with the usual DB monitoring tools (SHOW FULL PROCESSLIST for MySQL, etc.).
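Something along these lines should show the most recent changesets (a sketch using the standard Liquibase column names; adjust to taste):

-- Last few changesets recorded by Liquibase, newest first
SELECT id, author, dateexecuted, orderexecuted, exectype
FROM databasechangelog
ORDER BY orderexecuted DESC
LIMIT 5;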
I checked MySQL using SHOW PROCESSLIST and found a query running for a long time:
-- Insert 'no' permissions for any table and group combinations that weren't covered by the previous query
INSERT INTO data_permissions (group_id, perm_type, db_id, schema_name, table_id, perm_value)
SELECT
    pg.id AS group_id,
    'perms/download-results' AS perm_type,
    mt.db_id,
    mt.schema AS schema_name,
    mt.id AS table_id,
    'no' AS perm_value
FROM permissions_group pg
CROSS JOIN metabase_table mt
WHERE NOT EXISTS (
    SELECT 1
    FROM data_permissions dp
    WHERE dp.group_id = pg.id
      AND dp.db_id = mt.db_id
      AND (dp.table_id = mt.id OR dp.table_id IS NULL)
      AND dp.perm_type = 'perms/download-results'
)
AND pg.name != 'Administrators'
It seems this is part of a migration step to populate missing perms/download-results entries in data_permissions.
Notes:
- In my case, the permissions_group and metabase_table tables are fairly large, so this cross join produces a lot of rows. Row counts in my environment (see the sanity-check query after this list):
  - permissions_group = 67 rows
  - metabase_table = 27,729 rows
  - CROSS JOIN = ~1.85 million combinations
- This might explain why the upgrade hangs: the migration query could be running for a very long time before allowing the application to finish starting.
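For reference, this is roughly how I arrived at that combination count (a quick sketch against the same application-database tables, nothing more):

-- Rough sanity check of the cross-join size: 67 * 27,729 ≈ 1.85 million
SELECT (SELECT COUNT(*) FROM permissions_group) *
       (SELECT COUNT(*) FROM metabase_table) AS cross_join_rows;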
Is this expected for large deployments, and is there any recommended way to speed this up? For example, running this manually with indexes in place before upgrading?
1.8 million rows is nothing, unless you're running your app database on a WiFi router.
Plus the WHERE NOT EXISTS triggers a semi-join optimization so you aren't getting all those rows materialized anyway ... assuming you aren't running an old-as-dirt MySQL. 5.7 isn't supported by Metabase (or anybody) anymore and will explode later on if you're still running that dinosaur. 8.x should be fine.
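If you want to confirm what the optimizer is actually doing with it, you can EXPLAIN just the SELECT half of the statement (a sketch against the same tables; FORMAT=TREE needs MySQL 8.0.16 or newer):

-- Plan only the SELECT part of the migration statement
EXPLAIN FORMAT=TREE
SELECT pg.id, mt.id
FROM permissions_group pg
CROSS JOIN metabase_table mt
WHERE NOT EXISTS (
    SELECT 1
    FROM data_permissions dp
    WHERE dp.group_id = pg.id
      AND dp.db_id = mt.db_id
      AND (dp.table_id = mt.id OR dp.table_id IS NULL)
      AND dp.perm_type = 'perms/download-results'
)
AND pg.name != 'Administrators';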
I ran this on my local PC to simulate the production upgrade. I'm currently using MySQL 8.0 running in Docker on Windows, on an i9 machine. I've let the query run for over an hour, but it still hasn't finished. Could this be because MySQL is running inside Docker and isn't getting full use of the machine's resources?
In the production environment, it runs on a managed database with 4 vCPUs and 16 GB RAM.
If you have resource limits on the container, it's certainly a possibility. Databases want lots of memory for cache and fast storage. A dedicated server (or a VM with more resources than your workstation) is going to perform better. That said, there's a LOT of migration work to do with a version jump that large, and it's going to take a while. I would plan for an extended downtime.
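If you want to see what the containerized MySQL is actually working with, standard introspection like this will tell you (nothing Metabase-specific, just a starting point):

-- How much memory InnoDB has for caching (the default is 128M, which is tiny for a migration like this)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- General engine status, including buffer pool activity and pending I/O
SHOW ENGINE INNODB STATUS;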
And of course, have a backup of the database in the event something goes wrong.
We explicitly ask people to upgrade major version by major version (using the latest minor release of each major version) when moving Metabase up from very old versions. That will also give you more information about which migration is going wrong.