Nightly Issues on ELB: Connection Refused

I haven’t been able to pin down a pattern, but it seems that every night or every few nights our Metabase ELB instance goes down. The only consistent log message we get is:

2019/11/13 14:32:23 [error] 27526#0: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.101.80, server: , request: "GET /api/health HTTP/1.1", upstream: "http://10.0.101.80:3000/api/health", host: "10.0.101.80"
2019/11/13 14:32:24 [error] 27526#0: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.101.80, server: , request: "GET /api/health HTTP/1.1", upstream: "http://10.0.101.80:3000/api/health", host: "10.0.101.80"
2019/11/18 14:46:47 [warn] 20205#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/18 14:46:47 [warn] 20221#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/18 14:46:47 [warn] 20231#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/18 14:46:56 [warn] 20583#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/18 14:46:56 [warn] 20593#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/18 14:47:00 [error] 20597#0: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.101.80, server: , request: "GET /api/health HTTP/1.1", upstream: "http://10.0.101.80:3000/api/health", host: "10.0.101.80"
2019/11/18 14:47:10 [warn] 21012#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/18 14:47:10 [warn] 21028#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/19 17:35:40 [error] 21030#0: *17549 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.101.80, server: , request: "GET /api/health HTTP/1.1", upstream: "http://10.0.101.80:3000/api/health", host: "10.0.101.80"
2019/11/19 17:35:43 [warn] 17038#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/19 17:35:43 [warn] 17048#0: duplicate MIME type "text/html" in /etc/nginx/sites-enabled/elasticbeanstalk-nginx-docker-proxy.conf:11
2019/11/19 17:35:55 [error] 17051#0: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.101.80, server: , request: "GET /api/health HTTP/1.1", upstream: "http://10.0.101.80:3000/api/health", host: "10.0.101.80"
2019/11/19 17:35:58 [error] 17051#0: *3 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.101.80, server: , request: "GET /api/health HTTP/1.1", upstream: "http://10.0.101.80:3000/api/health", host: "10.0.101.80"

That seems to coincide with the container being killed:

2019-11-18T14:46:43.512375478Z image tag sha256:ae319f6b110453e01a0c29867da3c8ab39479ba07f743905f94909c37a54b601 (name=aws_beanstalk/current-app:latest)
2019-11-18T14:46:43.578190263Z image tag sha256:e50f04477cd2aea76b6070de084af9b7286deca5a02c5c956a2ded8ff9a2f029 (name=metabase/metabase:v0.33.5)
2019-11-18T14:46:45.618070265Z image pull metabase/metabase:v0.33.5 (name=metabase/metabase)
2019-11-18T14:46:45.787403197Z image tag sha256:ae319f6b110453e01a0c29867da3c8ab39479ba07f743905f94909c37a54b601 (name=aws_beanstalk/staging-app:latest)
2019-11-18T14:46:49.525549625Z container create 1d6c84bce613cee6e1fff0a2b50028e292f817fec690c52be43ab3dd21dee7c8 (image=ae319f6b1104, name=blissful_lalande)
2019-11-18T14:46:49.609703768Z network connect b9eb01249dd5e58c474cbafada7fb8fbe2d95a62648c1d9d55f637c5f72981f5 (container=1d6c84bce613cee6e1fff0a2b50028e292f817fec690c52be43ab3dd21dee7c8, name=bridge, type=bridge)
2019-11-18T14:46:49.932051478Z container start 1d6c84bce613cee6e1fff0a2b50028e292f817fec690c52be43ab3dd21dee7c8 (image=ae319f6b1104, name=blissful_lalande)
2019-11-18T14:46:57.250072590Z container kill bb476aad178a24857f18e356f3a96ecffaf4afcc0fae147bfae59a1f45a8f22c (image=ae319f6b1104, name=vibrant_mcnulty, signal=15)
2019-11-18T14:47:07.261511087Z container kill bb476aad178a24857f18e356f3a96ecffaf4afcc0fae147bfae59a1f45a8f22c (image=ae319f6b1104, name=vibrant_mcnulty, signal=9)
2019-11-18T14:47:07.544527672Z container die bb476aad178a24857f18e356f3a96ecffaf4afcc0fae147bfae59a1f45a8f22c (exitCode=137, image=ae319f6b1104, name=vibrant_mcnulty)
2019-11-18T14:47:07.657777844Z network disconnect b9eb01249dd5e58c474cbafada7fb8fbe2d95a62648c1d9d55f637c5f72981f5 (container=bb476aad178a24857f18e356f3a96ecffaf4afcc0fae147bfae59a1f45a8f22c, name=bridge, type=bridge)
2019-11-18T14:47:07.824365960Z container stop bb476aad178a24857f18e356f3a96ecffaf4afcc0fae147bfae59a1f45a8f22c (image=ae319f6b1104, name=vibrant_mcnulty)
2019-11-18T14:47:08.607013252Z container destroy bb476aad178a24857f18e356f3a96ecffaf4afcc0fae147bfae59a1f45a8f22c (image=ae319f6b1104, name=vibrant_mcnulty)
2019-11-18T14:47:08.743086466Z image tag sha256:ae319f6b110453e01a0c29867da3c8ab39479ba07f743905f94909c37a54b601 (name=aws_beanstalk/current-app:latest)
2019-11-18T14:47:08.851273977Z image untag sha256:ae319f6b110453e01a0c29867da3c8ab39479ba07f743905f94909c37a54b601 (name=sha256:ae319f6b110453e01a0c29867da3c8ab39479ba07f743905f94909c37a54b601)
2019-11-18T14:47:10.004762653Z image tag sha256:ae319f6b110453e01a0c29867da3c8ab39479ba07f743905f94909c37a54b601 (name=aws_beanstalk/current-app:latest)
2019-11-18T14:47:10.169365976Z image tag sha256:e50f04477cd2aea76b6070de084af9b7286deca5a02c5c956a2ded8ff9a2f029 (name=metabase/metabase:v0.33.5)

I also modified the CloudWatch scripts to track memory usage, and it never seems to go above 30% of the EC2 instance’s memory.
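
For context, the memory tracking is just the standard Amazon Linux CloudWatch monitoring scripts run from cron; a rough sketch of that cron entry is below (the install path and 5-minute interval are placeholders, not necessarily what we use):

# Sketch: AWS CloudWatch monitoring scripts cron entry on Amazon Linux 1.
# Pushes memory metrics to CloudWatch every 5 minutes; path and interval are placeholders.
*/5 * * * * ~/aws-scripts-mon/mon-put-instance-data.pl --mem-util --mem-used --mem-avail --from-cron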

Any help would be greatly appreciated!

Hi @matthew
Which version of Metabase? Post the Diagnostic Info from Admin > Troubleshooting.
What do you see in the Metabase logs just before the crash?

We’ve had the issue consistently throughout the past 3 minor versions (maybe even before then, but I don’t remember when it started).

Here’s the diagnostic info:

{
  "browser-info": {
    "language": "en-US",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.4+11",
    "java.vendor": "AdoptOpenJDK",
    "java.vendor.url": "https://adoptopenjdk.net/",
    "java.version": "11.0.4",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.4+11",
    "os.name": "Linux",
    "os.version": "4.14.123-86.109.amzn1.x86_64",
    "user.language": "en",
    "user.timezone": "US/Pacific"
  },
  "metabase-info": {
    "databases": [
      "postgres",
      "h2"
    ],
    "hosting-env": "elastic-beanstalk",
    "application-database": "postgres",
    "run-mode": "prod",
    "version": {
      "date": "2019-11-07",
      "tag": "v0.33.5",
      "branch": "release-0.33.x",
      "hash": "bf436b4"
    },
    "settings": {
      "report-timezone": "US/Pacific"
    }
  }
}

The logs in the admin console only go back to the most recent restart.

There are a lot of ELB logs, but from what I can see of the entries right before the crash, there isn’t much useful information.

It looks like a few tables don’t have metadata synced because of a permission issue:

11-19 09:00:12 ERROR sql-jdbc.execute :: Failed to set timezone:
 PSQLException:
 Message: ERROR: permission denied for schema hdb_views
  Position: 1406
 SQLState: 42501
 Error Code: 0

PSQLException:
 Message: ERROR: permission denied for schema hdb_views
  Position: 1406
 SQLState: 42501
 Error Code: 0
11-19 09:00:12 ERROR sql-jdbc.execute :: nil
11-19 09:00:12 ERROR sql-jdbc.execute :: Failed to set timezone:
 PSQLException:
 Message: ERROR: permission denied for schema hdb_views
  Position: 2582
 SQLState: 42501
 Error Code: 0

But I think that comes up every time a sync occurs and isn’t related to Metabase going down. In fact, I have logs from after Metabase becomes unreachable, which is the confusing part.

@matthew
Okay, number 1: Upgrade to 0.33.5.1, since because of a build problem I have no idea exactly which build the 0.33.5 tag on Docker contains.
I think you need to set up logging that is preserved across restarts, since otherwise it’s going to be difficult to know what is crashing (one option is sketched below).
It’s interesting that it crashes during the night, since I’d guess there’s little or no activity then. The only thing I know is being run at night is the scan, which can be somewhat heavy (depending on your database size).
You might want to try to manually run the scan and see if that triggers a crash - Admin > Databases > (db) > Re-scan field values now
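
For example, one option (a minimal sketch, assuming the AWS CLI is configured and using a placeholder environment name) is to turn on Elastic Beanstalk’s built-in streaming of instance logs to CloudWatch Logs, so the nginx and Docker logs survive restarts:

# Sketch: enable Elastic Beanstalk instance log streaming to CloudWatch Logs.
# The environment name and 14-day retention are placeholders.
aws elasticbeanstalk update-environment \
  --environment-name my-metabase-env \
  --option-settings Namespace=aws:elasticbeanstalk:cloudwatch:logs,OptionName=StreamLogs,Value=true \
                    Namespace=aws:elasticbeanstalk:cloudwatch:logs,OptionName=RetentionInDays,Value=14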

So I tried to deploy both 0.33.5.1 and 0.33.6 (seeing as it just came out last night :slight_smile: ), but doing so causes a separate issue (a migration failing due to a permission issue?). I’ve reverted to whatever 0.33.5 is for the moment. To keep this thread on topic, I opened a separate issue about the upgrade error (https://github.com/metabase/metabase/issues/11357).

In the meantime, I ran the scans manually and at first everything seemed fine. However, ~5 minutes later the CPU spiked and now I’m seeing the same error. Here’s a snippet from /var/log/eb-docker/containers/eb-current-app/eb-a1101def37a9-stdouterr.log:

08-23 08:00:07 INFO sync.util :: FINISHED: Sync metadata for postgres Database 3 'dev' (3 s)
08-23 08:00:07 DEBUG sync.util :: Sync operations in flight: {:analyze #{3}}
08-23 08:00:07 INFO sync.util :: STARTING: Analyze data for postgres Database 3 'dev'
08-23 08:00:07 DEBUG sync.util :: STARTING: step 'fingerprint-fields' for postgres Database 3 'dev'
08-23 08:00:07 ERROR sql-jdbc.execute :: Failed to set timezone:
 PSQLException:
 Message: ERROR: permission denied for schema hdb_views
  Position: 1406
 SQLState: 42501
 Error Code: 0

<... more errors like above ...>

08-23 08:00:08 ERROR sql-jdbc.execute :: nil
08-23 08:00:08 DEBUG sync.util :: FINISHED: step 'fingerprint-fields' for postgres Database 3 'dev' (1 s)
08-23 08:00:08 DEBUG sync.util :: STARTING: step 'classify-fields' for postgres Database 3 'dev'
08-23 08:00:08 INFO sync.analyze :: classify-fields Analyzed [********************······························] 😐   41% Table 100 'public.patient_custom_demographics_view'
08-23 08:00:08 INFO sync.analyze :: classify-fields Analyzed [*************************·························] 😬   51% Table 112 'hdb_catalog.event_log'
08-23 08:00:09 INFO sync.analyze :: classify-fields Analyzed [******************************····················] 😌   62% Table 78 'public.workflow_instance'
08-23 08:00:09 DEBUG sync.util :: FINISHED: step 'classify-fields' for postgres Database 3 'dev' (96 ms)
08-23 08:00:09 DEBUG sync.util :: STARTING: step 'classify-tables' for postgres Database 3 'dev'
08-23 08:00:09 INFO sync.analyze :: classify-tables Analyzed [***********************************···············] 😋   72% Table 135 'public.checkin_data'
08-23 08:00:09 INFO sync.analyze :: classify-tables Analyzed [*****************************************·········] 😊   82% Table 80 'public.workflow_blueprint_step'
08-23 08:00:09 INFO sync.analyze :: classify-tables Analyzed [**********************************************····] 😍   93% Table 82 'public.workflow_blueprint'
08-23 08:00:09 DEBUG sync.util :: FINISHED: step 'classify-tables' for postgres Database 3 'dev' (100 ms)
08-23 08:00:09 DEBUG sync.util ::
#################################################################
# Completed analyze on dev
# Start: 2019-08-23T15:00:07.668Z
# End: 2019-08-23T15:00:09.116Z
# Duration: 1 s
# ---------------------------------------------------------------
# Completed step 'fingerprint-fields'
# Start: 2019-08-23T15:00:07.668Z
# End: 2019-08-23T15:00:08.920Z
# Duration: 1 s
# Fingerprint updates attempted 519, updated 0, no data found 519, failed 0
# ---------------------------------------------------------------
# Completed step 'classify-fields'
# Start: 2019-08-23T15:00:08.920Z
# End: 2019-08-23T15:00:09.016Z
# Duration: 96 ms
# Total number of fields classified 0, 0 failed
# ---------------------------------------------------------------
# Completed step 'classify-tables'
# Start: 2019-08-23T15:00:09.016Z
# End: 2019-08-23T15:00:09.116Z
# Duration: 100 ms
# Total number of tables classified 71, 0 updated
#################################################################

08-23 08:00:09 INFO sync.util :: FINISHED: Analyze data for postgres Database 3 'dev' (1 s)

Maybe that’s thrashing? But it completes successfully and that’s the end of that log.

The ELB logs do persist, and it seems the errors are all about particular tables failing to be indexed. Can I configure Metabase to ignore certain schemas? I can try removing those from the scan and see if that helps.

@matthew Until this PR is merged, you’ll see the schema errors whenever there aren’t sufficient permissions:
https://github.com/metabase/metabase/pull/10892

Thanks @flamber! Do you know if hiding tables will prevent them from being scanned?

EDIT: never mind, I just tried it and hiding does not prevent a scan. I’m updating permissions now and I’ll report back on whether this prevents the instance from crashing.
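
The permission update is along these lines (a rough sketch only — the metabase role name is a placeholder and I’ve shown just the one schema that was erroring):

# Sketch: grant the Metabase connection role read access to the schema that was
# failing with "permission denied for schema hdb_views" (role name is a placeholder).
psql -c 'GRANT USAGE ON SCHEMA hdb_views TO metabase;'
psql -c 'GRANT SELECT ON ALL TABLES IN SCHEMA hdb_views TO metabase;'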

After upgrading and the permission changes, it looks like the issue is a good ol’ fashioned JVM heap OutOfMemoryError, so I’m going to bump the heap size up. I think that message was just getting buried by all of the schema errors, which made grepping through the logs untenable.
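
Concretely, bumping the heap just means passing a larger -Xmx to the JVM. A minimal sketch with the EB CLI (the 2g value is only an example, and this assumes the Metabase Docker image passes JAVA_OPTS through to java):

# Sketch: raise the Metabase JVM heap via an Elastic Beanstalk environment property.
# Assumes the Metabase image honors JAVA_OPTS; size -Xmx to the instance's memory.
eb setenv JAVA_OPTS="-Xmx2g"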

Thanks again for your help @flamber!

@matthew It seems like hidden tables are still scanned.
https://github.com/metabase/metabase/issues/5500 - upvote by clicking :+1: on the first post
Also note https://github.com/metabase/metabase/issues/6445
The only workaround I can think of is to limit the database user so it cannot see those tables.
If you continue having problems with this issue, then reply back - otherwise just create a new topic for other problems :slight_smile: