I have MB up and running using this guide: https://metabase.com/docs/latest/operations-guide/running-metabase-on-elastic-beanstalk.html#runing-metabase-on-aws-elastic-beanstalk
The beanstalk instance constantly fluctuates between Severe and OK. Here’s an example alert from AWS:
“Message: Environment health has transitioned from Ok to Severe. ELB health is failing or not available for all instances.”
I had it running on t2.small, but updated to t2.medium this morning in an attempt to alleviate these issues. I’ve downloaded the logs from AWS, but honestly I’m not sure what to dig into to identify what is causing this.
Are there any debugging/support guides that can assist in maintaining/running our own instance of MB?
Since writing, I’ve created and added an EC2 key pair to my Beanstalk instance (which seems to have generated a new EC2 instance under the hood), yet I can’t seem to SSH in using ssh -i path/to/my.pem .compute-1.amazonaws.com – following this guide: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html
Not sure if that’s necessary, but figured it would be useful to poke around things that way.
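For anyone else hitting the same wall: Elastic Beanstalk instances on Amazon Linux normally expect the `ec2-user` login name, and the key file has to have restrictive permissions. A rough sketch – the DNS name below is a placeholder, not a real host; substitute your instance’s Public IPv4 DNS from the EC2 console:

```shell
# Key must not be readable by others, or ssh will refuse to use it
chmod 400 path/to/my.pem

# Placeholder hostname – replace with your instance's actual public DNS.
# Amazon Linux AMIs use the "ec2-user" account by default.
ssh -i path/to/my.pem ec2-user@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
```

Also check that the instance’s security group allows inbound TCP on port 22 from your IP, otherwise the connection will just time out.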
Hi, have you found what’s causing the instability? I’m dealing with the same issue.
I upgraded our instance to medium – which seemed entirely unnecessary, but that seems to have made the problem go away. It still seems like a Metabase issue that it isn’t stable on smaller instances.
@burcturkoglu and @jsharpemr,
For future reference, it would be good to know which Metabase version you are both experiencing this with.
Also – hopefully to help both of you out – try upgrading your AWS deployments to the latest available version if you aren’t already on it (as of writing, that should be v0.30.3). Based on when you reported, I’m guessing you’re running either 0.29.x or an early 0.30.x version. At least one really nasty bug affecting performance (see https://github.com/metabase/metabase/issues/8312#issuecomment-417326691) was fixed in v0.30.2.
I’m currently on 0.30.0 – updating to 0.30.3 as I type this. Currently on a t2.small instance.
I read through that issue description, and we don’t seem to match its conditions. We don’t have a dashboard with a lot of cards, and our issue isn’t (so far) reproducible – our application seemingly falls over all by itself. E.g., in the middle of the night I get multiple warnings from AWS about the application missing health checks. Or maybe it’s simply a bird landing on the data center.
That said, I understand the 0.30.3 release could contain some more general performance improvements, and I’ll keep an eye on things.
Even though I’m using the latest version (0.30.4), I’m still having the same issue.
Is this still happening constantly?
Is this happening to anyone else?
Two years later, this issue still seems unsolved. Here are the issues from my health check:
- 100.0 % of the requests are failing with HTTP 5xx.
- ELB processes are not healthy on all instances.
- ELB health is failing or not available for all instances.
Have you found a solution? @jsharpemr
Could you advise @flamber?
Hi @Dan1, please post your diagnostic info. Also, do you use a Load Balancer? Have you configured the health check API with it? Thanks.
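For anyone reading along: Metabase exposes a health endpoint that load balancers can poll. A quick sanity check from the instance itself, assuming Metabase is listening on its default port 3000:

```shell
# Poll Metabase's health endpoint; it responds with a JSON status
# ({"status":"ok"}) once the application has finished starting up.
curl -s http://localhost:3000/api/health
```

If this hangs or returns an error while the ELB is reporting the instance unhealthy, the problem is in the Metabase process itself (memory, slow startup) rather than in the load balancer configuration.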
Hi @Luiggi, I'm new to Metabase and programming in general. Could you tell me where and how to get the "diagnostic info" you mentioned?
Here's the info for Load Balancer:
Load balancer type: application
Store logs: disabled
The only difference between my setup and the tutorial is the instance type: t3.small is recommended, but I currently use t2.micro. I also changed the max instances to 1, following the instructions.
@Dan1, you need to make your load balancer check the health of your instance on port 3000. Otherwise the load balancer will probe the container at a port that isn't exposing any service, flag the container as unhealthy, kill it, remove it from the target group, and restart the service.
Please check the configuration of this step in the guide.
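As a neutral first step, it's worth simply inspecting what the target group's health check is currently pointed at before changing anything. A sketch with the AWS CLI – `my-target-group` is a placeholder name for your environment's target group:

```shell
# Hypothetical target group name; find yours in the EC2 console under
# "Target Groups", or via `aws elbv2 describe-target-groups` with no filter.
aws elbv2 describe-target-groups \
  --names my-target-group \
  --query 'TargetGroups[0].{Port:HealthCheckPort,Path:HealthCheckPath,Protocol:HealthCheckProtocol}'
```

Comparing the reported port and path against what the instance actually serves usually makes the mismatch (if any) obvious.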
Thank you @Luiggi. Which "guide" are you referring to? If the Metabase one, it says to use port 80. Should I just change it from 80 to 3000? I'm not sure if I'm following this correctly, as I'm a beginner.
By the way, the only place the Metabase guide mentions a “container" is here:
The only change you need to do here is to reduce the number of Instances from 4 (the default number) to 1, as we still haven’t created a centralized database where Metabase will save all of its configurations and will be using only the embedded H2 database which lives inside the Metabase container and is not recommended for production workloads as there will be no way to backup and maintain that database.
and I did change the max instances from 4 to 1, as shown in the screenshot in my previous post.
Also, regarding "killing the container" – I couldn't find any such option in Beanstalk...
Hi @Dan1, sorry for not answering earlier. My mistake – the port change applies to another type of deployment (ECS clusters, not relevant here), because Elastic Beanstalk provides a custom NGINX on the instance that redirects Metabase's port to port 80. So there's nothing to change there.
I just went through the guide again and don't have any stability issues. Can you please check whether the load balancer is in the right Availability Zones? Or can you pull any logs from the instance or from CloudWatch so we can see what's going on? This is really the first time we've seen someone having issues with an Elastic Beanstalk deployment.
Hi @Dan1, updating this just FYI: I left an instance running for many hours and didn't have any issues at all. Can you please send us some logs so I can look into the details?
Thank you so much! I'll try it out!
Hi @Luiggi, I am experiencing the same issue that @Dan1 was experiencing. I run a number of Metabase applications in AWS Elastic Beanstalk and have observed this issue frequently. Do you know of any new updates on this issue? @Dan1, were you able to find a resolution?
What logs in AWS or Metabase would have the best chance of pointing to a root cause?
Thanks in advance for any help or guidance.
@sjlogan, you should check the instance logs, or start sending all log data to CloudWatch, in order to see what the issue is. Again, I left an instance running for many hours after setting it up and couldn't reproduce what you're seeing.
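As a concrete starting point for pulling those logs, the EB CLI can fetch them in one command – a sketch, assuming the EB CLI is installed and `eb init` has already been run for the environment:

```shell
# Fetch the most recent log lines from every instance in the environment
eb logs

# Optionally enable streaming of instance logs to CloudWatch Logs,
# so log history survives instance replacement
eb logs --cloudwatch-logs enable
```

Streaming to CloudWatch is particularly useful here, since the symptom is the instance being cycled: logs stored only on the instance disappear when Beanstalk replaces it.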
Hi @Luiggi. Thanks for the response – I will dig further into the logs. For context, we seem to see this issue when Metabase runs queries against huge data sets in Snowflake, so it appears to have something to do with too much load on the Metabase environment. That may not point directly to a root cause, but I thought the context might be helpful to share.
Thanks for the info. My recommendation would be to scale the deployment up or out, since it seems you're running out of resources. By the way, we're getting reports of users seeing CPU spikes on versions 39+, so we're investigating and will have a patch soon.