Can we scale Metabase on ECS horizontally?

So we've got Metabase deployed on Amazon ECS, hooked up to a Postgres application database on RDS, serving reports from our Redshift warehouse. The service uses the EC2 launch type with 1 task currently running.

We noticed after the latest update (v0.41.1) that memory consumption on the Metabase ECS service skyrocketed to 300%. We're looking to scale Metabase regardless, as the company is growing, and we were wondering whether scaling horizontally (more tasks) is possible with Metabase.

Can we add tasks to our ECS service, or is it recommended to scale vertically and increase the EC2 instance size instead?
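For context, what we have in mind for horizontal scaling is just bumping the service's desired task count, roughly like the sketch below (cluster and service names are placeholders, and it assumes every task points at the same RDS application database so they share state):

# Rough sketch, not our actual setup: scale the Metabase service out to 2 tasks.
# "analytics-cluster" and "metabase" are placeholder names, and this assumes all
# tasks use the same MB_DB_* settings pointing at the shared RDS application DB.
aws ecs update-service \
  --cluster analytics-cluster \
  --service metabase \
  --desired-count 2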

Hi @philMarius
I would recommend that you read this:
https://www.metabase.com/learn/administration/metabase-at-scale.html

Which version did you upgrade from? There shouldn't be a memory bump like that, so I'm very interested in figuring out what is going on here.

Post "Diagnostic Info" from Admin > Troubleshooting
And did you try to correlate the memory consumption with the logs? That could help narrow down what is causing the high memory usage.
My best guess would be the new static visualizations used in Subscriptions/Pulses, but that's hard to tell without really digging through everything.
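If your tasks ship their logs to CloudWatch (awslogs log driver), a rough way to line the Metabase log up against the memory graph would be something like this (the log group name is a placeholder):

# Sketch, assuming the task definition uses the awslogs log driver and a log
# group named /ecs/metabase (placeholder). Requires AWS CLI v2.
aws logs tail /ecs/metabase --since 2h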

Hi @flamber

Thanks for your comment. And thanks for the link, that was a good read!

We upgraded from v0.40.2 to v0.41.1. Also, I scrolled back further in the timeline and it turns out it was at 300% before we upgraded too! This is sounding more like a deployment issue now.

One thing our principal developer mentioned was that it's sharing a cluster with a number of other processes that are taking up most of the memory.

Attached is the diagnostic info from the admin page:

{
  "browser-info": {
    "language": "en-GB",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.114 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.12+7",
    "java.vendor": "Eclipse Foundation",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.12",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.12+7",
    "os.name": "Linux",
    "os.version": "4.14.101-91.76.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "googleanalytics",
      "postgres",
      "mysql",
      "redshift"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "10.17"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.2.23"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2021-10-21",
      "tag": "v0.41.1",
      "branch": "release-x.41.x",
      "hash": "76aa4a5"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

@philMarius We spent quite a few hours writing that article last year; several of us were involved.

Okay, it's difficult to tell where the 300% comes from without a reference point.
Also, how much memory is actually being used?

@flamber

Yeah we've added that page to our notes and we're going to go through it and review our current setup.

I've had a deeper look at the ECS deployment, and I reckon it's down to how many services Metabase is sharing the EC2 instance with. It's sharing a 4GB t3.medium ECS instance with 7 other services, all of which eat up a variable amount of memory.

So the 300% memory may not be the primary issue here; instead, moving Metabase to its own ECS instance should help speed it up. Thoughts on this? I raised the 300% initially because Metabase was running slowly; apologies, I should have clarified that earlier.
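For what it's worth, the rough idea is that once Metabase is on its own instance we'd give the container an explicit memory reservation and hard limit in the task definition, so nothing else can squeeze it. A sketch with placeholder names and values (not our real task definition):

# Sketch only; placeholder names and values, not our real task definition.
# "memoryReservation" is the soft amount ECS reserves for placement, while
# "memory" is the hard limit the container cannot exceed.
cat > metabase-containers.json <<'JSON'
[
  {
    "name": "metabase",
    "image": "metabase/metabase:v0.41.1",
    "essential": true,
    "memoryReservation": 2048,
    "memory": 3072,
    "portMappings": [{ "containerPort": 3000, "hostPort": 0 }]
  }
]
JSON

aws ecs register-task-definition \
  --family metabase \
  --container-definitions file://metabase-containers.json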

@philMarius It's definitely easier to monitor if each heavy service is running in its own container.

It's difficult to give specifics on resources for Metabase, since it depends on your usage (active users, which functionalities are used in Metabase, how many queries are being run concurrently, etc).

But Metabase runs best with 2GB+ to itself; it can run with less, but that all depends on usage.
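If you give it dedicated memory, you can also cap the JVM heap explicitly through the JAVA_OPTS environment variable (a sketch; adjust the value to whatever you actually allocate, and I believe the official metabase/metabase image passes JAVA_OPTS straight through to the JVM):

# Sketch: cap the Metabase JVM heap at ~2GB via JAVA_OPTS.
# In an ECS container definition this would go under "environment" as
#   { "name": "JAVA_OPTS", "value": "-Xmx2g" }
# Quick local equivalent for testing:
docker run -d -p 3000:3000 \
  -e JAVA_OPTS="-Xmx2g" \
  metabase/metabase:v0.41.1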

By the way, upgrade to 0.41.2, released yesterday, which fixes some performance issues.
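On ECS that's usually just a matter of pointing the task definition at the new image tag and rolling the service onto the new revision, roughly (placeholder names again):

# Sketch with placeholder names: after registering a new task-definition
# revision that uses metabase/metabase:v0.41.2, roll the service onto it.
aws ecs update-service \
  --cluster analytics-cluster \
  --service metabase \
  --task-definition metabase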


@flamber

Yeah, thanks for your comments; you've really helped us understand the best way to deploy Metabase in production. We're going to go away and look at deploying Metabase on its own instance with dedicated memory, instead of having it compete with other services. As a scaling company, we know we're going to need to keep a better eye on usage.

And yeah we're planning on upgrading soon!

I had a similar issue with Metabase last year at my previous workplace. We had a bunch of data sources connected (it got up to something like 60 :eyes:) in order to "facilitate" data access for end-users.

So when we saw the memory issue after a long and much-needed upgrade, we dived into the Metabase logs and followed that article @flamber mentioned (thanks!!). Duplicated and unneeded data sources were deleted, and some of the rest were reconfigured with lower data-freshness and more limited access. This already gave us a boost: if I'm not mistaken, we were at 4GB+ of RAM and we got it down by about 1.5GB.

After that, it was a question of optimizing the data inside BigQuery and concentrating most of the data that had to be available into a single dataset, and a single data source for Metabase. Once that was done we were using around 2GB of RAM, with no more lag problems!
