High CPU Usage and Cache

amand0702 · February 7, 2024, 8:34am

@Luiggi At the time of the CPU spike, we see middleware.cache entries in our logs (screenshot attached). A cached query usually takes milliseconds to run, but during the spike it takes 3-5 minutes. We have to restart the Metabase service to get it working again. Metabase is the only process running on this server.

We also noticed this unusual error during the same period:

Luiggi · February 7, 2024, 1:12pm

You have other problems there: exports take 3 minutes, and getting the collection tree takes 20 seconds.

amand0702 · February 12, 2024, 6:22am

Has anyone tested this CPU utilization issue with Metabase 0.48.5? We are still facing it. Is it possible to downgrade Metabase to an older version (we were using 0.45.4.3)?

Siddilicious · February 12, 2024, 2:05pm

I have found a solution to our problem. Some time ago, there was an error with custom columns that could only be fixed by inserting a "+0" into the calculations. If you remove the "+0" from the calculation, it runs quickly again and the CPU is no longer overloaded.

Luiggi · February 14, 2024, 1:04am

DO NOT go to that version. At the very least, upgrade to 47.12.

amand0702 · February 14, 2024, 6:20am

Can you elaborate on this fix? Where do you make this change, and is there any reference to the previous problem?

CZvacko · February 19, 2024, 7:01am

I am also facing Metabase instability.
With 0.48.3 it was a high CPU issue, and the symptoms were the same as adinamarca describes (the UI is unresponsive and slow, then it goes down).

Then I updated to version 0.48.5 and the problem still occurs with the same symptoms, but the CPU is no longer loaded to 100%; there is only high memory usage (a memory leak, perhaps?).

With both versions, Metabase stays alive for about 10 days and then crashes (usually when memory usage reaches 5.1 GB).
Today I will try updating to v0.48.6, but I don't see any related fixes in the changelog.

Luiggi · February 19, 2024, 11:41pm

How are you running Metabase? Are you setting the Xms and Xmx variables (the JVM heap size options)? Please check "How to run Metabase in production".
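For reference, Xms and Xmx are the JVM's initial and maximum heap size options, passed as `-Xms` and `-Xmx` on the `java` command line. Below is a minimal Python sketch of a launcher that sets them explicitly. The `1g`/`2g` sizes, the `metabase.jar` path, and the `build_command` and `launch` helpers are illustrative assumptions, not recommended values.

```python
import subprocess


def build_command(jar_path="metabase.jar", xms="1g", xmx="2g"):
    """Build a java command line with explicit heap bounds.

    -Xms sets the initial heap size and -Xmx the maximum heap size.
    The default sizes here are placeholders, not recommendations.
    """
    return ["java", f"-Xms{xms}", f"-Xmx{xmx}", "-jar", jar_path]


def launch(jar_path="metabase.jar", xms="1g", xmx="2g"):
    """Start the JVM as a child process and return the Popen handle."""
    return subprocess.Popen(build_command(jar_path, xms, xmx))
```

With an explicit `-Xmx`, the heap ceiling is known in advance, so memory growth can be compared against a fixed limit instead of whatever default the JVM picks for the machine.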

CZvacko · February 20, 2024, 7:26am

I have been using Metabase since version 0.36.0.
It uses a single MariaDB database, and I have not set the Xms and Xmx variables. It's the same environment I ran 0.47.x in, where no issues were detected. There has been no increase in the number of users of my Metabase. All the problems started with v0.48.x.

Sean111 · February 22, 2024, 8:43am

Does v0.48.6 work effectively?

CZvacko · February 22, 2024, 9:13am

My server rebooted due to Windows updates on Tuesday (Feb 20), so now I have to wait another 8 days to see what happens. I've also started collecting CPU/RAM metrics from my server, so I'll have some record, similar to @rpataro.

CZvacko · March 4, 2024, 6:31am

It has now been 13 days and Metabase is still alive, so the issue seems to be gone.

Sean111 · March 4, 2024, 9:28am

Nice

Pedro1 · March 12, 2024, 4:31pm

Was there a resolution to this in the end? We have just upgraded from 0.41.4 to 0.48.6. Our servers have been running Metabase fine for years, and now all of a sudden we have exactly the same issue as reported above.

Luiggi · March 12, 2024, 4:52pm

This issue should have been fixed by now. What are you seeing?

Pedro1 · March 12, 2024, 5:08pm

We upgraded last week, and it ran fine for a week until the CPU randomly spiked and Metabase itself became unresponsive (the server did not). I checked the logs and there were lots of "java.lang.OutOfMemoryError" entries, but we are running it on a server that meets the spec recommended in the docs (2 GiB of memory). Here are the spikes in CPU utilisation:

Luiggi · March 12, 2024, 5:11pm

What's the instance doing when it runs out of memory (OOM)? Or when the CPU spikes?
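One way to answer this is to look at what was logged just before each "java.lang.OutOfMemoryError" line. The sketch below is a rough starting point: it assumes a plain-text log read line by line, and the `lines_before_oom` helper and the default of 5 context lines are made up for illustration.

```python
from collections import deque

# The error string reported earlier in this thread.
OOM_MARKER = "java.lang.OutOfMemoryError"


def lines_before_oom(lines, context=5):
    """Return (preceding_lines, oom_line) for every OOM line in `lines`.

    `preceding_lines` holds up to `context` lines logged just before
    the error, oldest first.
    """
    recent = deque(maxlen=context)  # rolling window of the latest lines
    hits = []
    for line in lines:
        if OOM_MARKER in line:
            hits.append((list(recent), line))
        recent.append(line)
    return hits
```

To use it on a real file, pass an open file object, for example `lines_before_oom(open("metabase.log", encoding="utf-8"))`, where `metabase.log` is a placeholder path.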

Pedro1 · March 12, 2024, 5:24pm

Sorry, what do you mean by instance? Metabase, the server, or the database?

Luiggi · March 12, 2024, 5:34pm

Metabase/server

Pedro1 · March 12, 2024, 5:44pm

The server has CPU spikes as shown in the image, and outbound network traffic matches the CPU utilisation spikes. Metabase itself becomes slow and laggy before becoming completely unresponsive and returning non-200 responses on the /api/health endpoint, so it shows as unhealthy. The Metabase logs simply show lots of Java heap memory errors, followed by the unhealthy API responses. While "unhealthy", it continues to try to sync with the database.

As rpataro previously mentioned, restarting the server does seem to fix it for 12-24 hours before the CPU spikes start again.
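For anyone who wants to record when the instance flips to unhealthy, here is a minimal sketch of a check against the /api/health endpoint mentioned above. It treats any non-200 response, connection failure, or timeout as unhealthy. The base URL, the 5-second timeout, and the `check_health` helper are assumptions for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(status_code):
    """Only an HTTP 200 counts as healthy."""
    return status_code == 200


def check_health(base_url, timeout=5, opener=urllib.request.urlopen):
    """Return True if GET <base_url>/api/health answers with HTTP 200.

    urlopen raises HTTPError for 4xx/5xx responses and URLError or
    OSError for connection failures and timeouts; all of these are
    reported as unhealthy.
    """
    url = base_url.rstrip("/") + "/api/health"
    try:
        with opener(url, timeout=timeout) as response:
            return is_healthy(response.status)
    except (urllib.error.URLError, OSError):
        return False
```

Calling `check_health("http://localhost:3000")` on a schedule and logging the result with a timestamp would show how long after a restart the instance degrades; the host and port here are placeholders for your own setup.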