Reliability problems in .43 and .44

Hey guys,

We've been happy with Metabase, using it for around 5 years in total now!

Metabase was very slow after our .43 upgrade, and it has become completely unusable with .44. I know it's probably as simple as tuning some pools, but we cannot figure this out! We have 15 semi-active users and 15 dashboards around the office. We're hosted on Digital Ocean and are using a massive managed database to host both our data and Metabase's application schema.

  • We run Metabase behind haproxy to terminate TLS; the timeouts are listed below.
  • .43 seems to have taken a far more asynchronous programming model, and this has led to problems stacking up rather than load shedding.
  • Running jstack on the frozen server just shows a ton of threads waiting on pools.
  • CPU on the box is idling
  • Metabase is not using its full Xmx value of 3g.
  • JDK17
  • The MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE pool seems to exhaust all of its connections. We set it to an absurd value of 128, but eventually every connection in the pool is in use again and the app crawls to a halt (a rough way to confirm this on the MySQL side is sketched just below).
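
For reference, this is only a sketch of how we'd check that from the database side; it assumes the appdb connects as a user literally named metabase, so adjust the filter to whatever your managed database uses:

-- Count Metabase's appdb connections by what they are doing (MySQL 8.0).
-- Assumption: the appdb user is named 'metabase'; change to match your setup.
SELECT command, state, count(*) AS connections
FROM information_schema.processlist
WHERE user = 'metabase'
GROUP BY command, state
ORDER BY connections DESC;

If most of them show up as Sleep while the app reports an exhausted pool, the connections are checked out but sitting idle; if they sit in Query, the appdb itself is the bottleneck.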

/etc/default/metabase:

MB_ANON_TRACKING_ENABLED=false
MB_ASYNC_QUERY_THREAD_POOL_SIZE=16
MB_JDBC_DATA_WAREHOUSE_MAX_CONNECTION_POOL_SIZE=16
MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE=128
MB_JETTY_ASYNC_RESPONSE_TIMEOUT=600000

/etc/haproxy/haproxy.cfg [excerpt]

  timeout connect 5000
  timeout server 610000
  timeout tunnel 610000
frontend www-https
  bind :443 ssl crt /etc/ssl/private/metabase.xyz.com.pem alpn h2,http/1.1
  http-request add-header X-Forwarded-Proto https
  default_backend metabase-backend

backend metabase-backend
  option httpchk OPTIONS /metabase.core/initialized? HTTP/1.0
  server metabase-workhorse 127.0.0.1:3000 #check disabled for right now

Most of the threads are locked up in this state:

"qtp1019098496-137" #137 prio=5 os_prio=0 cpu=1.48ms elapsed=26.68s tid=0x00007f70880ecf30 nid=0x55a0 in Object.wait() [0x00007f704c765000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@17.0.3/Native Method)

  • waiting on
    at com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable(BasicResourcePool.java:1503)
    at com.mchange.v2.resourcepool.BasicResourcePool.prelimCheckoutResource(BasicResourcePool.java:644)
  • locked <0x000000074760a0e8> (a com.mchange.v2.resourcepool.BasicResourcePool)
    at com.mchange.v2.resourcepool.BasicResourcePool.checkoutResource(BasicResourcePool.java:554)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutAndMarkConnectionInUse(C3P0PooledConnectionPool.java:758)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutPooledConnection(C3P0PooledConnectionPool.java:685)
    at com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource.getConnection(AbstractPoolBackedDataSource.java:140)
    at metabase.db.setup$reify__35566.getConnection(setup.clj:168)
    at clojure.java.jdbc$get_connection.invokeStatic(jdbc.clj:372)
    at clojure.java.jdbc$get_connection.invoke(jdbc.clj:274)
    at clojure.java.jdbc$db_query_with_resultset_STAR_.invokeStatic(jdbc.clj:1111)
    at clojure.java.jdbc$db_query_with_resultset_STAR_.invoke(jdbc.clj:1093)
    at clojure.java.jdbc$query.invokeStatic(jdbc.clj:1182)
    at clojure.java.jdbc$query.invoke(jdbc.clj:1144)
    at clojure.java.jdbc$query.invokeStatic(jdbc.clj:1160)
    at clojure.java.jdbc$query.invoke(jdbc.clj:1144)
    at metabase.server.middleware.session$current_user_info_for_session.invokeStatic(session.clj:245)

Is there a setting that would help with load shedding? Just setting a 5s max wait on the datasource would let the server return errors to users and help us diagnose the problem.

{
  "browser-info": {
    "language": "en-US",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "vendor": "Apple Computer, Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "17.0.3+7",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "17.0.3",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "17.0.3+7",
    "os.name": "Linux",
    "os.version": "4.15.0-189-generic",
    "user.language": "en",
    "user.timezone": "America/Chicago"
  },
  "metabase-info": {
    "databases": [
      "mysql",
      "mongo"
    ],
    "hosting-env": "unknown",
    "application-database": "mysql",
    "application-database-details": {
      "database": {
        "name": "MySQL",
        "version": "8.0.28"
      },
      "jdbc-driver": {
        "name": "MariaDB Connector/J",
        "version": "2.7.5"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2022-08-04",
      "tag": "v0.44.0",
      "branch": "release-x.44.x",
      "hash": "d3700f5"
    },
    "settings": {
      "report-timezone": "US/Central"
    }
  }
}

Hi @exabrial
I have a feeling that you're hitting this frustrating issue:
https://jira.mariadb.org/browse/CONJ-983

Your environment variables seem a little off (besides the 128).
https://www.metabase.com/docs/latest/operations-guide/environment-variables.html#mb_async_query_thread_pool_size

We're also a bit more conservative with Java, so we're currently staying on Java 11 because of known problems with some drivers on Java 17, but it should be fine with the two you're using.

Flamber, thanks for the reply; we really appreciate it.

We were actually trying to force the server to error out earlier, which is why I added that setting. I've since removed it, but we still had the slow crawl-to-lockup problem.

I have no doubt we're experiencing that MariaDB driver bug, but it's being exacerbated by other things. I think once we hit enough pool waits, we slowly lock up.

One of our guys got frustrated with the server being down and started digging on the SQL side, and he found something else: Metabase had about 85 of these queries waiting in the database:

SELECT
	min(`view_log`.`user_id`),
	`view_log`.`model`,
	`view_log`.`model_id`,
	count(*) AS `cnt`,
	max(`timestamp`) AS `max_ts`
FROM
	`view_log`
	LEFT JOIN `dashboard_bookmark` `bm` ON (`model` = 'dashboard'
			AND `bm`.`user_id` = 33
			AND `model_id` = `bm`.`dashboard_id`)
WHERE ((`model` in('dashboard', 'table'))
	AND `bm`.`id` IS NULL)
GROUP BY
	`model`,
	`model_id`
ORDER BY
	`max_ts` DESC,
	`model` DESC
LIMIT 8

This query isn't very efficient; it could probably use some optimization in addition to some indexes.
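
Just to sketch the kind of index we have in mind, based only on the columns that appear in the query above (the index name is made up, and whether MySQL's planner actually picks it up is a separate question):

-- Hypothetical index matching the WHERE / GROUP BY / ORDER BY columns above.
CREATE INDEX idx_view_log_model_model_id_timestamp
	ON view_log (`model`, `model_id`, `timestamp`);

Something similar on dashboard_bookmark (user_id, dashboard_id) might help the anti-join, if a suitable index doesn't already exist there.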

This was further compounded by the fact that our view_log table had no less than 42 million rows in it... Like I said, we <3 Metabase and we've been using it for years. We went ahead and dropped/recreated that table, and that got rid of that particular issue (for now).

The guy who found this is going to suggest a rewrite of the query and some indexes that should help out.

I do think Metabase could probably use a cleanup process on that particular table, though, as I doubt the data going back to 2018-2019 is useful anymore.

Again, thanks for the reply. We'll try to contribute a fix to the view_log query and see if that'll help out some other people!

@exabrial Okay, so I would definitely recommend cleaning out view_log and likely query_execution as well. Perhaps not by truncating, like you did, but by backing up and then deleting all entries older than a year(?). I know you've already dealt with it on your end; just leaving a note for others who might read this.
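
To make that concrete, something along these lines is what I mean (only a sketch; back up first, the one-year window and the batch size are arbitrary, and the same idea applies to query_execution):

-- Delete view_log rows older than a year, in small batches.
DELETE FROM view_log
WHERE `timestamp` < DATE_SUB(NOW(), INTERVAL 1 YEAR)
LIMIT 10000;

Repeating that until it affects zero rows avoids holding one huge transaction open on a table with tens of millions of rows.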

We're still working on adding more indexes to support larger setups, but every index adds several hundred MB of extra storage. We saw that in v40/41, where people running their Postgres appdb on a small AWS instance would suddenly see severe performance penalties until they upgraded to a larger instance.
We haven't found a perfect solution that works gracefully for everybody yet.

Just for reference: https://github.com/metabase/metabase/issues/4155

@exabrial Ping https://github.com/metabase/metabase/pull/24643

@exabrial If you cannot wait for 44.1, then get the latest build, which includes the driver update:
https://github.com/metabase/metabase/actions/runs/2806431500

By the way, the health endpoint is /api/health.

Incredible. Thank you, sir!