Metabase docker - inconsistent crashing

I've been having an odd issue with Metabase (up to date, running with a Postgres back-end inside a Docker container in AWS) where, when refreshing certain dashboards/queries that otherwise work fine, it will inconsistently throw a bunch of errors, shut itself down, and restart. The queries are reasonably complex but not excessive, and the database instance the Postgres back-end runs on is chunky enough that it doesn't appear to break a sweat running them, so I doubt that's the issue.
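For context, the setup is roughly along these lines (a sketch only, using the official image; the host name, credentials and port mapping here are placeholders, and the real deployment is on AWS rather than a bare docker run command):

 # Assumed deployment shape: official Metabase image with a Postgres
 # application database (host and credentials below are placeholders).
 docker run -d --name metabase -p 3000:3000 \
   -e MB_DB_TYPE=postgres \
   -e MB_DB_HOST=my-postgres-host.example.com \
   -e MB_DB_PORT=5432 \
   -e MB_DB_DBNAME=metabase \
   -e MB_DB_USER=metabase \
   -e MB_DB_PASS=changeme \
   metabase/metabase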

To end users this presents as an attempted refresh, a long delay, 502/503 HTTP errors, and then Metabase displaying its "starting up" page. From the logs I can see a shutdown and a new container launch, but I'm struggling to make more sense of them than that. If it would be helpful I can provide the full logs in a PM or similar, but out of an abundance of caution I'd prefer not to post them publicly; I'll include the snippets that seem relevant below.

Primarily I'm seeing two errors. The first:

 {:status :failed,
   :class clojure.lang.ExceptionInfo,
   :error "Error reducing result rows",
   :stacktrace
   ["--> query_processor.context.default$default_reducef$fn__38196.invoke(default.clj:59)"
    "query_processor.context.default$default_reducef.invokeStatic(default.clj:56)"
    "query_processor.context.default$default_reducef.invoke(default.clj:48)"
    "query_processor.context$reducef.invokeStatic(context.clj:69)"
    "query_processor.context$reducef.invoke(context.clj:62)"

And the second:

 :context :dashboard,
 :error "Broken pipe",
 :row_count 0,
 :running_time 0,
 :data {:rows [], :cols []}}
2022-06-01 07:16:48,000 INFO metabase.core :: Metabase Shutting Down ...

The start of the stacktrace for this one (which appears above it in the log) looks like this:

 :status :failed,
 :class java.io.IOException,
 :stacktrace
 ["java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)"
  "java.base/sun.nio.ch.SocketDispatcher.writev(Unknown Source)"
  "java.base/sun.nio.ch.IOUtil.write(Unknown Source)"
  "java.base/sun.nio.ch.IOUtil.write(Unknown Source)"
  "java.base/sun.nio.ch.SocketChannelImpl.write(Unknown Source)"
  "java.base/java.nio.channels.SocketChannel.write(Unknown Source)"
  "org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:273)"
  "org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)"
  "org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)"
  "org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)"
  "org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:831)"
  "org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)"
...

It's possible they're actually both part of the same error, since they seem to occur one after the other. They happen across a whole bunch of different card IDs, and I'm not sure whether they're all on the same dashboard (it's a fairly large dashboard, but it usually loads fine and the queries typically return in ~5-8 s).

I'm also seeing a correlation with "connection reset by peer" errors, but I'm almost certain that's just impatient end users refreshing the page, so it's likely a red herring.

I've pretty much exhausted the resources I can find here and on the Metabase GitHub. Suggestions mostly circle around Metabase either running out of connections or needing more memory. I've doubled the RAM available to it (2 GB as of now), but that doesn't seem to have resolved the issue. I could easily go higher, or improve e.g. the CPU allocation, but I'm loath to start blindly throwing resources at the problem unless I'm sure that's what's causing it. If you can confirm it's only a specs issue and I just need to be less stingy with resource allocation, that'd be great.
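(For anyone reading along: a minimal sketch of how the memory side can be pinned down on the official image, assuming a plain Docker host; the values are only illustrative and on AWS the limits really live in the task/service definition. The Metabase docs suggest capping the JVM heap via JAVA_OPTS so it stays below the container limit.)

 # Sketch: 2 GB container limit with the JVM heap capped below it so the
 # heap can't outgrow the container allocation (values illustrative).
 docker run -d --name metabase -p 3000:3000 \
   --memory=2g \
   -e JAVA_OPTS="-Xmx1536m" \
   metabase/metabase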

Apologies about the wall of text.

Additional info: I think these errors may both just be parts of the full stacktrace for the parent error:

 ERROR middleware.catch-exceptions :: Error processing query: null

But I could be wrong about that.

@ametauser You may be running out of memory somewhere, in which case Docker sends a SIGHUP to shut down the container, which is why you aren't getting OOM errors directly in the container logs.
But it could also be some "smart" container management shutting the container down because it's only supposed to run single functions. We've seen this on one of Google's GCP services (I forget which one).

You're also cutting too much from the logs to fully understand what's going on, but it looks like Metabase is being shut down by the Docker host (or OS), so check the logs there.
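A minimal sketch of what that check can look like, assuming direct access to the Docker host (the container name is a placeholder; if the container is managed by something like ECS, the equivalent details are on the stopped task):

 # Was the container OOM-killed, and what was the exit code? (137 = SIGKILL)
 docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' metabase

 # Watch for die/oom events from the Docker daemon
 docker events --filter container=metabase --filter event=die --filter event=oom

 # Kernel OOM-killer messages on the host itself
 dmesg -T | grep -i -E 'out of memory|oom'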


I think this is correct. I've dug into it and it looks like the container was capping out on CPU rather than memory and then shutting down. I've increased the CPU resources allocated to the container, so fingers crossed that works.
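In case it's useful to anyone else, this is roughly the kind of check and change involved (a sketch with a placeholder container name; on AWS the CPU allocation is really adjusted in the task or instance configuration rather than through Docker directly):

 # Live per-container CPU/memory usage; sustained CPU at the limit is the tell
 docker stats --no-stream metabase

 # Raise the CPU limit on a plain Docker host
 docker update --cpus=2 metabase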

Thanks for your help. I'll try to remember to confirm here in a week or so whether it worked, just in case anyone else googles the same error.


I can confirm, for anyone else who ends up here from googling the same issue, that upping the resources available to the container worked perfectly, and it made everything else much snappier too. I guess that's obvious in hindsight, but there you go. Thanks again for the help.
