Losing connection to data warehouse after a few weeks

ni.paris · February 5, 2020, 2:34am

Hi There

We’re deploying Metabase on AWS using Elastic Beanstalk (EBS). Every few weeks or months, Metabase loses its connection to our data warehouse (Redshift).

Symptoms of the problem: no query can be run against the Redshift database from Metabase, while the same queries behave normally from a standard SQL client.

The “fix” is trivial: just reboot the EC2 instance.

However, that bug is a major reliability problem, as we use Metabase Premium embedding to show dashboards in client-facing applications, and right now the “fix” requires human intervention to notice AND fix the problem.

Ideally, that “connection drop” issue would get fixed.

In the meantime, is there some sort of health check API endpoint that could be used to set up self-healing?
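
A rough sketch of what such a check might look like, under several assumptions. Metabase does serve `GET /api/health`, but since the interface keeps working while the Redshift connection is down, that endpoint would probably keep reporting healthy. The sketch below instead runs a trivial query through `POST /api/dataset` against the affected database. The base URL, the database id, and the session id (assumed to come from a prior `POST /api/session` login) are placeholders, and the exact shape of a failed response is an assumption that should be checked against the running Metabase version.

```python
import json
import urllib.request


def query_succeeded(payload: dict) -> bool:
    """Treat the probe as healthy only if the query completed without an error.

    Assumption: the response carries a "status" field set to "completed" on
    success and an "error" field on failure.
    """
    return payload.get("status") == "completed" and not payload.get("error")


def probe_warehouse(base_url: str, session_id: str, database_id: int,
                    timeout: float = 30.0) -> bool:
    """Run `SELECT 1` against one database through Metabase.

    Returns False on any network, HTTP, or JSON decoding failure.
    """
    body = json.dumps({
        "database": database_id,  # placeholder: id of the Redshift database in Metabase
        "type": "native",
        "native": {"query": "SELECT 1"},
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/dataset",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Metabase-Session": session_id,
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return query_succeeded(json.load(response))
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # JSONDecodeError is a ValueError subclass.
        return False
```

A scheduled job could call `probe_warehouse` every few minutes and reboot the instance only after several consecutive failures, so that a single slow query does not trigger a restart.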

Extra Information

We’ve been using Metabase for about 6 months; the problem has occurred 3 times
Right now we’re on the latest Metabase version, and we typically update to new Metabase versions quickly
Metabase running on AWS EBS (using the Metabase-provided template), AWS RDS Postgres backend for Metabase (set up by EBS), Redshift data warehouse
We’re using a t3.large instance. CPU usage is low when the problem occurs (5-10%)
The whole interface behaves normally when the problem occurs
The problem has never happened during usage peaks - this time it occurred at night, when there’s virtually no usage.

Happy to report other information that may help

Nicolas

flamber · February 7, 2020, 12:01am

Hi @ni.paris
Which version of Metabase? Please post “Diagnostic Info” from Admin > Troubleshooting
It seems like there might be issues with the upstream Redshift driver again. That has happened a few times in the past year. I’m not saying that there aren’t Metabase-specific issues, but without logs it can be difficult to pinpoint the root cause.
https://github.com/metabase/metabase/issues/11441#issuecomment-576439730

ni.paris · February 7, 2020, 4:39am

Thanks @flamber

We also see the issue you link to (database locks not being released by Metabase). But to us, it appears to be a separate issue. The locks are very annoying, but right now we consider them less critical than the connection loss issue I’ve reported.

It is quite possible that the bug is tied to the Redshift driver, as we also have some tables on a Postgres DB, and we have never observed issues with that backend (no locks, no disconnects).

Here’s the info requested in your message. Let me know if you need more.

Metabase version: 0.34.1
Going to upgrade to 0.34.2 today

Diagnostic Info:

{
  "browser-info": {
    "language": "en-GB",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.5+10",
    "java.vendor": "AdoptOpenJDK",
    "java.vendor.url": "https://adoptopenjdk.net/",
    "java.version": "11.0.5",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.5+10",
    "os.name": "Linux",
    "os.version": "4.14.152-98.182.amzn1.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "redshift",
      "postgres"
    ],
    "hosting-env": "elastic-beanstalk",
    "application-database": "postgres",
    "run-mode": "prod",
    "version": {
      "date": "2020-01-13",
      "tag": "v0.34.1",
      "branch": "release-0.34.x",
      "hash": "265695c"
    },
    "settings": {
      "report-timezone": "Asia/Kuala_Lumpur"
    }
  }
}

flamber · February 7, 2020, 11:24am

@ni.paris
When streaming is added (probably 0.35.0), then the locks should not be a problem anymore.
As for the connection dropping: some connection checking has been added in 0.34.2, so that might help. But since you’re not having issues with Postgres, only Redshift, that makes me think there might be another problem with the driver. The driver might be updated in 0.34.3, but that still needs testing.

ni.paris · February 20, 2020, 7:40am

FYI: It just happened again. That’s 13 days between two occurrences.

flamber · February 20, 2020, 2:24pm

@ni.paris Thanks for the update.
And you’re now on 0.34.2?
Can you post the error from the log when this happens? Admin > Troubleshooting > Logs