H2 connection timeout on launch

We’ve been happy users of Metabase for a while now. Yesterday, the Docker container exited (we’re not sure why), and when we try to start the container again we get this error:

10-07 13:08:20 INFO metabase.core :: Starting Metabase in STANDALONE mode
10-07 13:08:20 INFO metabase.core :: Launching Embedded Jetty Webserver with config:
 {:port 3000, :host "3a68e9e9bf23"}

10-07 13:08:20 INFO metabase.core :: Starting Metabase version v0.19.3 (796613e release-0.19.3) ...
10-07 13:08:20 INFO metabase.core :: System timezone is 'GMT' ...
10-07 13:08:22 DEBUG metabase.driver :: Registered driver :bigquery 🚚
10-07 13:08:22 DEBUG metabase.driver :: Registered driver :crate 🚚
10-07 13:08:23 DEBUG metabase.driver :: Registered driver :druid 🚚
10-07 13:08:23 DEBUG metabase.driver :: Registered driver :h2 🚚
10-07 13:08:23 DEBUG metabase.driver :: Registered driver :mongo 🚚
10-07 13:08:23 DEBUG metabase.driver :: Registered driver :mysql 🚚
10-07 13:08:23 DEBUG metabase.driver :: Registered driver :postgres 🚚
10-07 13:08:23 DEBUG metabase.driver :: Registered driver :redshift 🚚
10-07 13:08:23 DEBUG metabase.driver :: Registered driver :sqlite 🚚
10-07 13:08:23 DEBUG metabase.driver :: Registered driver :sqlserver 🚚
10-07 13:08:24 INFO metabase.db :: Verifying h2 Database Connection ...
10-07 13:08:29 ERROR metabase.driver :: Failed to connect to database: Timed out after 5000 milliseconds.
java.lang.AssertionError: Assert failed: Unable to connect to Metabase h2 DB.
(binding [*allow-potentailly-unsafe-connections* true] (require (quote metabase.driver)) ((resolve (quote metabase.driver/can-connect-with-details?)) engine details))
	at metabase.db$verify_db_connection.invokeStatic(db.clj:223)
	at metabase.db$verify_db_connection.invoke(db.clj:218)
	at metabase.db$setup_db.invokeStatic(db.clj:237)
	at metabase.db$setup_db.doInvoke(db.clj:229)
	at clojure.lang.RestFn.invoke(RestFn.java:421)
	at metabase.core$init_BANG_.invokeStatic(core.clj:142)
	at metabase.core$init_BANG_.invoke(core.clj:126)
	at metabase.core$start_normally.invokeStatic(core.clj:226)
	at metabase.core$start_normally.invoke(core.clj:220)
	at metabase.core$_main.invokeStatic(core.clj:260)
	at metabase.core$_main.doInvoke(core.clj:255)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at metabase.core.main(Unknown Source)
10-07 13:08:29 ERROR metabase.core :: Metabase Initialization FAILED:  Assert failed: Unable to connect to Metabase h2 DB.
(binding [*allow-potentailly-unsafe-connections* true] (require (quote metabase.driver)) ((resolve (quote metabase.driver/can-connect-with-details?)) engine details))
10-07 13:08:29 INFO metabase.core :: Metabase Shutting Down ...
10-07 13:08:29 INFO metabase.core :: Metabase Shutdown COMPLETE

We’re running on a t2.micro AWS instance, and the metabase.db.h2.db file is 70 MB. The DB lock file does get created; the most recent one contains:

#FileLock
#Fri Oct 07 13:15:33 GMT 2016
hostName=9f4e7848a1ef
id=1579f48d552e285c179f13cd7622a962af1261a96bc
method=file
server=172.17.0.3\:40948

I copied the DB file to my iMac and was able to start the container there with that file, so I’m guessing the connection check just isn’t finishing in time on the t2.micro? Is there a way to increase that 5-second timeout?
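
For anyone else debugging this, one way to check whether the file itself is slow to open might be to point H2’s own shell at it directly. This is just a diagnostic sketch on my part: it assumes the H2 classes bundled inside metabase.jar can be invoked like this, that blank credentials work for a Metabase-created H2 file, and that the URL drops the .h2.db suffix. Run it only while the container is stopped, since the shell takes H2’s file lock itself:

# open the application DB directly with the bundled H2 shell
java -cp metabase.jar org.h2.tools.Shell \
  -url "jdbc:h2:file:/path/to/metabase.db" \
  -user "" -password ""

If it takes more than a few seconds to reach a SQL prompt on the t2.micro, then it’s probably that 5-second verification timeout that’s biting us.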

So, I ran the java -jar metabase.jar migrate release-locks command and now startup is working. I guess manually removing the lock file is frowned upon?

Unclear.

Conceptually, they’re two separate things. The release-locks command is an escape valve for weirdness in database migrations for our application database. The lock file is H2’s lock file for its own transactions, etc. It’s possible that there were migration transactions that hadn’t been flushed to disk and got clobbered by removing the lock file.
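
If you want to see what migrate release-locks actually touches, here’s a rough sketch. It assumes our migrations use Liquibase’s standard DATABASECHANGELOGLOCK table and that the H2 shell bundled in metabase.jar can open the file; again, run it only while Metabase is stopped, since the shell takes H2’s file lock too:

# inspect the Liquibase migration lock (not the same thing as H2's .lock.db file)
java -cp metabase.jar org.h2.tools.Shell \
  -url "jdbc:h2:file:/path/to/metabase.db" \
  -user "" -password "" \
  -sql "SELECT * FROM DATABASECHANGELOGLOCK"

release-locks roughly amounts to flipping LOCKED back to FALSE in that table, which is why it’s a separate concern from deleting the .lock.db file H2 writes next to the database.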

Do you have logs for when the problem first started appearing?

Unfortunately I lost them as part of my first efforts to get the container started again.

Hmm. If it happens again, please chime in. We see intermittent issues with our migration framework (http://www.liquibase.org/) but can’t seem to isolate the root cause.

Which version did this occur with? Was there an upgrade involved or did the server die?

The original container exit happened for an unknown reason, and the logs I uploaded are what happened consistently after pulling a new Docker image, so I’m guessing there was an upgrade involved there. The reason I pulled a new image, though, is that starting the container again wasn’t working with the old image. I don’t know whether the error was the same.

Sorry I don’t have any more info. I’ll be more vigilant about keeping logs if/when the container dies again.

I hit this timeout issue again and solved it by running migrate release-locks again. It looks like the Docker container just quit during a DB sync. The server itself stayed up, and all the other containers on the machine appear to have kept running. Here’s a copy of the container logs from when it quit:

10-31 02:50:02 INFO sync-database.introspect :: Introspection completed on PostgreSQL database 'DB_NAME' (2 s)
10-31 02:50:02 INFO metabase.sync-database :: Finished syncing PostgreSQL database 'DB_NAME'. (2 s)
10-31 03:02:38 ERROR metabase.metabot :: Error connecting websocket: Read timed out
10-31 03:17:37 ERROR metabase.metabot :: Error connecting websocket: Read timed out
10-31 03:17:39 ERROR metabase.metabot :: Error connecting websocket: Read timed out
10-31 03:44:38 ERROR metabase.metabot :: Error connecting websocket: Read timed out
10-31 03:50:19 INFO metabase.sync-database :: Syncing PostgreSQL database 'DB_NAME'...
10-31 03:50:26 INFO sync-database.introspect :: Introspecting schema on PostgreSQL database 'DB_NAME' ...

Then nothing at all in the logs until I tried to restart the container.

There’s not a whole lot going on there. It’s just sitting around.

Interesting that it bombed on syncing, as that’s a pretty cheap/quick operation.
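
If it dies like that again, it might also be worth checking whether the kernel killed the container for memory rather than Metabase exiting on its own, since the log just stops with no shutdown message. A quick sketch (substitute your container name; an exit code of 137 usually means the process was SIGKILLed):

docker inspect -f '{{.State.ExitCode}} {{.State.OOMKilled}}' metabase
dmesg | grep -iE 'killed process|out of memory'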

Following up on this, probably for the last time. After several more container exits and failures to restart, we finally migrated this server to a machine with 2 GB of RAM (it only had 1 GB previously). Since then we haven’t had any other issues.

Thanks for the update!

We’ll start recommending 2 GB and try to find places to slim down.
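
In the meantime, if anyone is stuck on a 1 GB box, capping the JVM heap so it fits alongside everything else may help. This is a sketch only, not something we’ve verified on a t2.micro: the -Xmx value is a guess, and I haven’t checked whether the 0.19.x Docker image passes JAVA_OPTS through to the JVM (newer images do):

# running the jar directly
java -Xmx768m -jar metabase.jar

# or, if the image honors JAVA_OPTS
docker run -d -p 3000:3000 -e "JAVA_OPTS=-Xmx768m" metabase/metabase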