Liveness probe continually failing

We run a Metabase instance on Kubernetes, deployed via Terraform. This is how the liveness_probe is configured:

liveness_probe {
  exec {
    command = [
      "wget",
      "localhost:3000/api/health",
      "-q",
      "-O",
      "-"
    ]
  }
  initial_delay_seconds = 300
  period_seconds        = 10
  failure_threshold     = 30
  timeout_seconds       = 10
}

And the resources are set like this:

resources {
  limits = {
    memory = "12.0Gi"
    cpu    = "2.0"
  }
  requests = {
    memory = "12.0Gi"
    cpu    = "2.0"
  }
}

However, every 30 minutes to an hour, a random failure occurs:

Warning Unhealthy pod/metabase-deployment-dfawdada-9wagaw Liveness probe failed:

This error is not very descriptive and we can't figure out exactly what is causing this crash. Do the liveness_probe settings need to be adjusted?

Hi @timsmith12

Post "Diagnostic Info" from Admin > Troubleshooting.

Have you checked the Metabase logs or resource monitors around the time of the probe failure?
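For example, something like this would show the application logs and current resource usage around the time of a failure (assuming you have kubectl access to the cluster and metrics-server is installed; the pod name is a placeholder for your actual pod):

kubectl logs <metabase-pod-name> --since=1h
kubectl top pod <metabase-pod-name>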

Why are you using an exec probe instead of an httpGet probe?

{
  "browser-info": {
    "language": "en-US",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.16+8",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.16",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.16+8",
    "os.name": "Linux",
    "os.version": "5.4.204-113.362.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "postgres",
      "h2"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "12.11"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.4.1"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2022-08-16",
      "tag": "v0.44.1",
      "branch": "release-x.44.x",
      "hash": "112f5aa"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

And yes, the Metabase pod we have deployed only gives this error:

Warning Unhealthy pod/metabase-deployment-dfawdada-9wagaw Liveness probe failed:

Kind of strange that it ends with a colon ":"; it's as if the rest of the error message never got appended, which makes it a lot more difficult to trace what is happening.

@timsmith12 That log is from Kubernetes, not Metabase. Try checking the Metabase log or resource usage around the time of the problem.

Latest release is 0.44.6

Perhaps you're seeing this issue:
https://github.com/metabase/metabase/issues/26266 - upvote by clicking :+1: on the first post

Appreciate the help, flamber! Another question: how can I get the Metabase log once it crashes? The pod seems to start fresh, and I can't view the error once the liveness probe kills it.

@timsmith12 So you are not recording any logs from any pods (unless the applications themselves write to a log somewhere "physical")? That sounds like a Kubernetes configuration you're missing, not really something specific to Metabase.
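That said, Kubernetes keeps the log of the previously terminated container for a while, and probe failures show up as pod events, so something like this might help right after a restart (the pod name is a placeholder):

kubectl logs <metabase-pod-name> --previous
kubectl describe pod <metabase-pod-name>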

But change your probe to httpGet instead of exec, and set up more logging for Kubernetes.
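A minimal sketch of what that could look like with the Terraform Kubernetes provider, reusing the path, port, and timing values from your current probe (adjust to your setup and provider version):

liveness_probe {
  http_get {
    # Same health endpoint the wget command was hitting
    path = "/api/health"
    port = 3000
  }
  initial_delay_seconds = 300
  period_seconds        = 10
  failure_threshold     = 30
  timeout_seconds       = 10
}

A likely reason the event ends with a bare colon is that exec probe events include the command's output, and wget -q suppresses that output; an httpGet probe's failure event generally includes the HTTP status code or connection error instead, which should make future failures easier to diagnose.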

If Metabase is working, but a probe is restarting the pod, then disable or adjust the probe.