DatabaseChangeLog.validate fails on restart of new deployment

This week, we've started experiencing an issue in our CI builds where the Metabase v1.42.4 (7c3ce2d release-x.42.x) database migration fails on a completely new deployment after a restart. It comes up fine the first time, but on the next restart it fails because a bunch of change set checksums don't match (see below). This is a brand new, empty database that hasn't even had the initial user added. We're using the official Docker image (metabase/metabase-enterprise:v1.42.4) against AWS RDS Aurora PostgreSQL Serverless (10.14). Aside from occasionally bumping the Metabase version, we've been running this configuration successfully for about a year.

2022-05-05 03:05:51,430 INFO db.setup :: Successfully verified PostgreSQL 10.14 application database connection. ✅
2022-05-05 03:05:51,431 INFO db.setup :: Running Database Migrations...
2022-05-05 03:05:51,601 INFO db.setup :: Setting up Liquibase...
2022-05-05 03:05:51,721 INFO db.setup :: Liquibase is ready.
2022-05-05 03:05:51,721 INFO db.liquibase :: Checking if Database has unrun migrations...
2022-05-05 03:05:57,593 ERROR metabase.core :: Metabase Initialization FAILED
liquibase.exception.ValidationFailedException: Validation Failed:
26 change sets check sum
migrations/000_migrations.yaml::1::agilliland was: 8:70d9a1021171b879cc55058f4d094a9c but is now: 8:7182ca8e82947c24fa827d31f78b19aa
migrations/000_migrations.yaml::2::agilliland was: 8:ca241a2c777802abaedd43d8944927f3 but is now: 8:bdcf1238e2ccb4fbe66d7f9e1d9b9529
migrations/000_migrations.yaml::7::cammsaul was: 8:4f23ddbddd447461588462fa8bb443dd but is now: 8:c57c69fd78d804beb77d261066521f7f
migrations/000_migrations.yaml::13::agilliland was: 8:8cb1df98d24b215132d5e0175bc7009f but is now: 8:c2c65930bad8d3e9bab3bb6ae562fe0c
migrations/000_migrations.yaml::14::agilliland was: 8:22397aede818dcd23b443810127b665f but is now: 8:320d2ca8ead3f31309674b2b7f54f395
migrations/000_migrations.yaml::20::agilliland was: 8:1b6c23d69b9330b56dcb4bb7188e8d5e but is now: 8:0afa34e8b528b83aa19b4142984f8095
migrations/000_migrations.yaml::21::agilliland was: 8:be31f33e7fa7961b902c3ee4a56e19a7 but is now: 8:fb2cd308b17ab81b502d057ecde4fc1b
migrations/000_migrations.yaml::25::agilliland was: 8:ea4cd56cdc21ddf73495101a2a1cb3d4 but is now: 8:61f25563911117df72f5621d78f10089
migrations/000_migrations.yaml::27::agilliland was: 8:4778d6724e114803d2d805f494008fb1 but is now: 8:001855139df2d5dac4eb954e5abe6486
migrations/000_migrations.yaml::32::camsaul was: 8:28afcfc9b7f1bd1b08caa9d527355c48 but is now: 8:40830260b92cedad8da273afd5eca678
migrations/000_migrations.yaml::32::agilliland was: 8:32fb3ffb1ac9b619248c3de8e3d09b4b but is now: 8:483c6c6c8e0a8d056f7b9112d0b0125c
migrations/000_migrations.yaml::38::camsaul was: 8:c9251b99ab1122ef441117d5d9852d29 but is now: 8:43604ab55179b50306eb39353e760b46
migrations/000_migrations.yaml::43::camsaul was: 8:3cf48c3e52ea315379a8ed538c9f2624 but is now: 8:165e9384e46d6f9c0330784955363f70
migrations/000_migrations.yaml::47::camsaul was: 8:717ed16b2f279bd4fdf5c4aa2b5f5857 but is now: 8:1d2474e49a27db344c250872df58a6ed
migrations/000_migrations.yaml::48::camsaul was: 8:b3b418a4bbcf558e40f5d32effd2c7da but is now: 8:720ce9d4b9e6f0917aea035e9dc5d95d
migrations/000_migrations.yaml::58::senior was: 8:9e6d48acacceb109e0bd27e4e44a8cb4 but is now: 8:3554219ca39e0fd682d0fba57531e917
migrations/000_migrations.yaml::67::attekei was: 8:a03608d97e19b8a716989e918efad0a6 but is now: 8:59dfc37744fc362e0e312488fbc9a69b
migrations/000_migrations.yaml::78::camsaul was: 8:8ea54a54fa04fe657b0b8072125981af but is now: 8:1977d7278269cdd0dc4f941f9e82f548
migrations/000_migrations.yaml::79::camsaul was: 8:37825e0d48bd4ed5cf1a068de30edc60 but is now: 8:3f31cb67f9cdf7754ca95cade22d87a2
migrations/000_migrations.yaml::85::camsaul was: 8:624bb71d09dc2bef7c16c025f57edff0 but is now: 8:9b4c9878a5018452dd63eb6d7c17f415
migrations/000_migrations.yaml::89::camsaul was: 8:2ee410aba3f9bcd144ea362494a19a58 but is now: 8:94d5c406e3ec44e2bc85abe96f6fd91c
migrations/000_migrations.yaml::95::senior was: 8:e7cd8168533c58c865dacf320e819218 but is now: 8:9824808283004e803003b938399a4cf0
migrations/000_migrations.yaml::100::camsaul was: 8:28b6b684c539d4668110b920985ee363 but is now: 8:948014f13b6198b50e3b7a066fae2ae0
migrations/000_migrations.yaml::167::walterl, camsaul was: 8:4ed8c3a2d7d00700cc564095ddbb290f but is now: 8:4c11dc8c5e829b5263c198fe7879f161
migrations/000_migrations.yaml::171::camsaul was: 8:57f4fb33fe80309af71dc8c0c8d90362 but is now: 8:0798080c0796e6ab3e791bff007118b8
migrations/000_migrations.yaml::291::camsaul was: 8:f1b9110e87c7cb627c29fe6fb243b675 but is now: 8:b3b15e2ad791618e3ab1300a5d4f072f
at liquibase.changelog.DatabaseChangeLog.validate(DatabaseChangeLog.java:276)
at liquibase.Liquibase.listUnrunChangeSets(Liquibase.java:1287)
at liquibase.Liquibase.listUnrunChangeSets(Liquibase.java:1274)

Hi @trevorlc

Which version was the last one that worked?

Just to make sure I understand: there's only a single instance connected to that application database; you start the instance, it populates the database, and Metabase shows the setup screen; then you restart the instance and it fails with wrong checksums?

It really sounds like the database is being contaminated by a different version.
Or the table databasechangelog is somehow not being saved correctly, so when the instance restarts, it tries to apply the migration changes again.

Try saving the table databasechangelog after the initial startup, then shut down the instance; the table should be identical. If you start the instance again, the table should still be identical.
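
For example, something along these lines (just a sketch against the standard Liquibase table; the snapshot table name is an example, not anything Metabase creates):

-- Right after the first successful startup: snapshot the migration history
CREATE TABLE databasechangelog_snapshot AS
SELECT * FROM databasechangelog;

-- After the restart: list any changesets whose stored checksum changed or vanished
SELECT s.id, s.author, s.md5sum AS checksum_before, c.md5sum AS checksum_after
FROM databasechangelog_snapshot s
LEFT JOIN databasechangelog c
  ON c.id = s.id AND c.author = s.author AND c.filename = s.filename
WHERE c.md5sum IS DISTINCT FROM s.md5sum;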

If this were a common problem, every setup would fail after a restart. I cannot reproduce it; this is what I'm seeing after restarting the instance:

...
2022-05-05 11:33:17,607 INFO db.setup :: Verifying postgres Database Connection ...
2022-05-05 11:33:17,717 INFO db.setup :: Successfully verified PostgreSQL 12.4 (Debian 12.4-1.pgdg100+1) application database connection. ✅
2022-05-05 11:33:17,718 INFO db.setup :: Running Database Migrations...
2022-05-05 11:33:17,726 INFO db.setup :: Setting up Liquibase...
2022-05-05 11:33:17,803 INFO db.setup :: Liquibase is ready.
2022-05-05 11:33:17,804 INFO db.liquibase :: Checking if Database has unrun migrations...
2022-05-05 11:33:19,865 INFO db.liquibase :: Database has unrun migrations. Waiting for migration lock to be cleared...
2022-05-05 11:33:19,884 INFO db.liquibase :: Migration lock is cleared. Running migrations...
2022-05-05 11:33:19,924 INFO db.setup :: Database Migrations Current ...  ✅
2022-05-05 11:33:19,948 INFO db.data-migrations :: Running all necessary data migrations, this may take a minute.
2022-05-05 11:33:19,968 INFO db.data-migrations :: Finished running data migrations.
Database setup took 2.4 s
...

Hi @flamber,

Thanks for the quick reply!

We used 1.41.4 for quite a long time without seeing this issue. We switched to 1.42.4 about 3 weeks ago.

This is all completely automated, so I can't say if there's a setup screen. We start a single instance in an ECS task and it initializes fine (we get to Metabase Initialization COMPLETE). Then about 5 minutes later, we stop the task and start a new one, and it fails. The first task gets all the way to Metabase Shutdown COMPLETE before the second starts. (The restart happens as a side effect of the deployment strategy into a new AWS environment: the first task starts when the ECS service is created, and the second when we explicitly deploy a task update. It's redundant, and we could fix it, but it doesn't seem like something that should break this.) Because Metabase fails to initialize and respond to health checks, ECS keeps restarting the task, which fails with the same error.

Here are the first 2 rows of databasechangelog. Are the md5sum values correct?

id | author | filename | dateexecuted | orderexecuted | exectype | md5sum | description | comments | tag | liquibase | contexts | labels | deployment_id
1 | agilliland | migrations/000_migrations.yaml | 2022-05-05 07:08:18.667802 | 1 | EXECUTED | 8:70d9a1021171b879cc55058f4d094a9c | createTable tableName=core_organization; createTable tableName=core_user; createTable tableName=core_userorgperm; addUniqueConstraint constraintName=idx_unique_user_id_organization_id, tableName=core_userorgperm; createIndex indexName=idx_userorgp... | | NULL | 4.8.0 | NULL | NULL | 1734497820
2 | agilliland | migrations/000_migrations.yaml | 2022-05-05 07:08:18.709916 | 2 | EXECUTED | 8:ca241a2c777802abaedd43d8944927f3 | createTable tableName=core_session | | NULL | 4.8.0 | NULL | NULL | 1734497820
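
For reference, the rows above came from a query along the lines of:

SELECT * FROM databasechangelog ORDER BY orderexecuted LIMIT 2;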

Update: I've confirmed the issue still happens with Postgres 12.4 (non-serverless). We're trying Metabase 1.43.0 next.

Thanks,
Trevor

@trevorlc When I create a new setup on Postgres, the checksum of changeset 1 is:
8:7182ca8e82947c24fa827d31f78b19aa

There must be something in your setup that is doing something different; otherwise, the many thousands of Metabase installations would break whenever they were rebooted.
We are using ECS and RDS Aurora on all Metabase Cloud instances too.
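
For anyone comparing against their own instance, the stored value can be read straight from the application database (standard Liquibase table and columns; id and author are text columns):

SELECT md5sum
FROM databasechangelog
WHERE filename = 'migrations/000_migrations.yaml'
  AND id = '1'
  AND author = 'agilliland';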

After testing in both a new environment and an existing failing one, upgrading to 1.43.0 seems to have fixed it:

Metabase v1.42.4 (7c3ce2d release-x.42.x)
2022-05-05 10:19:57,432 INFO metabase.core :: Starting Metabase version v1.42.4 (7c3ce2d release-x.42.x) ...
2022-05-05 10:20:05,069 ERROR metabase.core :: Metabase Initialization FAILED
Metabase v1.42.4 (7c3ce2d release-x.42.x)
2022-05-05 10:21:33,086 INFO metabase.core :: Starting Metabase version v1.42.4 (7c3ce2d release-x.42.x) ...
2022-05-05 10:21:40,762 ERROR metabase.core :: Metabase Initialization FAILED
Metabase v1.43.0 (ee686fc release-x.43.x)
2022-05-05 13:13:59,804 INFO metabase.core :: Starting Metabase version v1.43.0 (ee686fc release-x.43.x) ...
2022-05-05 13:14:08,479 INFO db.liquibase :: Database has unrun migrations. Waiting for migration lock to be cleared...
2022-05-05 13:14:08,802 INFO db.liquibase :: Migration lock is cleared. Running migrations...
2022-05-05 13:14:11,835 INFO metabase.core :: Metabase Initialization COMPLETE
Metabase v1.43.0 (ee686fc release-x.43.x)
2022-05-05 13:22:33,428 INFO metabase.core :: Starting Metabase version v1.43.0 (ee686fc release-x.43.x) ...
2022-05-05 13:22:42,122 INFO db.liquibase :: Database has unrun migrations. Waiting for migration lock to be cleared...
2022-05-05 13:22:42,361 INFO db.liquibase :: Migration lock is cleared. Running migrations...
2022-05-05 13:22:44,415 INFO metabase.core :: Metabase Initialization COMPLETE

@trevorlc There are a lot of changes in 43, so it's less strict about previous checksums: validation now just checks whether the changeset has been executed.
But I cannot reproduce your problem on 42, and we would have seen many reports if this were happening for everyone.
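
If anyone is stuck on 42 in the meantime, a common generic Liquibase workaround (not Metabase-specific advice; back up the application database first) is to clear the stored checksums so Liquibase recomputes them on the next startup instead of failing validation:

-- Already-executed changesets with a NULL md5sum get their checksum
-- recalculated and stored on the next run, rather than being re-applied
UPDATE databasechangelog SET md5sum = NULL;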

Thanks, that's good to know about the relaxed change set validation. It sounds like we don't have a root cause, but maybe the issue is moot going forward.