Losing connection to data warehouse after a few weeks

Hi There

We’re deploying Metabase on AWS using Elastic Beanstalk (EBS). Every few weeks or months, Metabase loses its connection to our data warehouse (Redshift).

Symptoms of the problem: it becomes impossible to run any query on the Redshift database from Metabase, while the same queries behave normally from a standard SQL client.

The “fix” is trivial: just reboot the EC2 instance.

However, this bug is a major reliability problem, as we use Metabase Premium Embedding to show dashboards in client-facing applications, and right now the “fix” requires human intervention to both notice AND fix the problem.

Ideally, that “connection drop” issue would get fixed.

Is there some sort of health check API endpoint that could be used to set up self-healing?

Extra Information

  1. We’ve been using Metabase for about 6 months, and the problem has occurred 3 times
  2. Right now we’re on the latest metabase version, and we typically update to new metabase versions quickly
  3. Metabase running on AWS EBS (using metabase provided template), AWS RDS postgres backend for metabase (set up by EBS), Redshift data warehouse
  4. We’re using a t3.large instance. CPU usage is low when the problem occurs (5-10%)
  5. The whole interface behaves normally when the problem occurs
  6. The problem has never happened during usage peaks. This time it occurred at night, when there’s virtually no usage.
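
On the self-healing question: as far as I know, Metabase exposes `GET /api/health`, but since the interface keeps behaving normally during the outage (point 5), a liveness check alone would probably not catch this failure. Below is a minimal sketch of a probe that instead runs a trivial query against the warehouse through Metabase. It assumes the `POST /api/dataset` endpoint, the `X-Metabase-Session` header (with a session id obtained beforehand from `POST /api/session`), and a `"status": "completed"` field in a successful response. The function names, the `SELECT 1` query, and the database id are hypothetical.

```python
import json
import urllib.request


def http_post_json(url, payload, headers=None, timeout=30):
    """POST a JSON payload and return the decoded JSON response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def warehouse_is_healthy(base_url, session_id, database_id, post=http_post_json):
    """Return True if a trivial native query against the warehouse completes.

    `database_id` is the Metabase id of the Redshift database (hypothetical
    value supplied by the caller). `post` is injectable so the check can be
    exercised without a live Metabase instance.
    """
    payload = {
        "database": database_id,
        "type": "native",
        "native": {"query": "SELECT 1"},
    }
    try:
        result = post(
            f"{base_url}/api/dataset",
            payload,
            headers={"X-Metabase-Session": session_id},
        )
    except (OSError, ValueError):
        # Network errors, HTTP errors, timeouts, and non-JSON bodies
        # are all treated as "unhealthy".
        return False
    # Assumption: a successful query response reports status "completed".
    return isinstance(result, dict) and result.get("status") == "completed"
```

A scheduled job could call `warehouse_is_healthy(...)` every few minutes and only alert or restart the instance after several consecutive failures, so that a single slow query does not trigger a reboot.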

Happy to report other information that may help

Nicolas

Hi @ni.paris
Which version of Metabase? Please post “Diagnostic Info” from Admin > Troubleshooting
It seems like there might be issues with the upstream Redshift driver again. That has happened a few times in the past year. I’m not saying that there aren’t Metabase-specific issues, but without logs it can be difficult to pinpoint the root cause.
https://github.com/metabase/metabase/issues/11441#issuecomment-576439730

Thanks @flamber

We also see the issue you linked to (database locks not being released by Metabase). But to us, it appears to be a separate issue. The locks are very annoying, but right now we consider them less critical than the connection loss issue I’ve reported.

It is quite possible that the bug is tied to the Redshift driver, as we also have some tables in a Postgres DB and have never observed issues with that backend (no locks, no disconnects).

Here’s the info requested in your message. Let me know if you need more.

Metabase version: 0.34.1
Going to upgrade to 0.34.2 today

Diagnostic Info:

{
  "browser-info": {
    "language": "en-GB",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.5+10",
    "java.vendor": "AdoptOpenJDK",
    "java.vendor.url": "https://adoptopenjdk.net/",
    "java.version": "11.0.5",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.5+10",
    "os.name": "Linux",
    "os.version": "4.14.152-98.182.amzn1.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "redshift",
      "postgres"
    ],
    "hosting-env": "elastic-beanstalk",
    "application-database": "postgres",
    "run-mode": "prod",
    "version": {
      "date": "2020-01-13",
      "tag": "v0.34.1",
      "branch": "release-0.34.x",
      "hash": "265695c"
    },
    "settings": {
      "report-timezone": "Asia/Kuala_Lumpur"
    }
  }
}

@ni.paris
When streaming is added (probably 0.35.0), then the locks should not be a problem anymore.
As for the connection dropping: some connection checking has been added in 0.34.2, so that might help. But since you’re only having issues with Redshift and not with Postgres, I suspect there might be another problem with the driver. The driver might be updated in 0.34.3, but that still needs testing.

FYI: It just happened again. That’s 13 days between two occurrences.

@ni.paris Thanks for the update.
And you’re now on 0.34.2?
Can you post the error from the log, when this happens? Admin > Troubleshooting > Logs