High CPU Usage and Cache

@Luiggi At the time of the CPU spike, we see middleware.cache in our logs (attached screenshot). Usually a query with caching takes milliseconds to run, but during the spike it takes 3-5 minutes. We have to restart the Metabase service for it to work again. Metabase is the only process running on this server.

We also noticed this unusual error during the same period:

You have other problems there: exports take 3 minutes, and getting the collection tree takes 20 seconds.

Has anyone tested this CPU utilization issue with Metabase 0.48.5? We are still facing it. Is it possible to downgrade Metabase to an older version (we were using 0.45.4.3)?

I have found a solution to our problem. Some time ago, custom columns raised an error that could only be fixed by inserting a "+0" into the calculations. If you remove the "+0" from the calculation again, the calculation runs quickly and the CPU is no longer overloaded.

DO NOT go to that version; at least upgrade to 47.12.

Can you elaborate on this fix? Where do you make this change, and is there any reference to the previous problem?

I am also facing Metabase instability.
With 0.48.3 it was a high CPU issue, with the same symptoms adinamarca describes (the UI becomes unresponsive and slow, then goes down).

Then I updated to version 0.48.5 and the problem still occurs with the same symptoms, but the CPU is no longer loaded to 100%; only memory usage is high (possibly a memory leak?).

With both versions, Metabase stays alive for about 10 days, then crashes (usually when memory usage reaches 5.1 GB).
Today I will try updating to v0.48.6, but I don't see any related fixes in the changelog.

How are you running Metabase? Are you setting the XMS and XMX variables? Please check "How to run Metabase in production".
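For readers unfamiliar with the heap settings asked about here: `-Xms` (initial heap) and `-Xmx` (maximum heap) are standard JVM options. The sketch below only shows how such a command line could be assembled; the helper name `metabase_command`, the jar path, and the 1g/2g sizes are illustrative assumptions, not recommended values from this thread.

```python
import shlex


def metabase_command(jar_path="metabase.jar", xms="1g", xmx="2g"):
    """Build a java command with explicit initial (-Xms) and maximum (-Xmx) heap.

    The default sizes and jar path are placeholders (assumptions); choose
    values that fit the memory actually available on your server.
    """
    return ["java", f"-Xms{xms}", f"-Xmx{xmx}", "-jar", jar_path]


if __name__ == "__main__":
    # Prints: java -Xms1g -Xmx2g -jar metabase.jar
    print(shlex.join(metabase_command()))
```

Without an explicit `-Xmx`, the JVM picks a maximum heap based on the machine's memory, which may be smaller or larger than what the instance needs.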

I have been using Metabase since version 0.36.0.
It uses a single MariaDB, and I have not set the XMS and XMX variables. It's the same environment as with 0.47.x, where no issues were detected, and there has been no increase in the number of users on my Metabase. All problems started with v0.48.x.

Does v0.48.6 work effectively?

My server rebooted due to a Windows update on Tuesday (Feb 20), so now I have to wait another 8 days to see what happens. I've also started collecting CPU/RAM metrics from my server, so I'll have some record, similar to @rpataro.
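For anyone who wants a similar record, here is a minimal sketch of a CPU/RAM logger that appends one sample per call to a CSV file. It is not what the posters above used. The function names, the CSV columns, and the use of the third-party `psutil` package are assumptions; the sampler is pluggable so another source of metrics can be swapped in.

```python
import csv
import time
from pathlib import Path


def psutil_sampler():
    """Return (cpu_percent, mem_percent) for the whole machine via psutil."""
    import psutil  # third-party dependency (assumption): pip install psutil

    return psutil.cpu_percent(interval=1), psutil.virtual_memory().percent


def record_sample(csv_path, sampler=psutil_sampler, now=time.time):
    """Append one timestamped CPU/RAM sample to csv_path and return the row."""
    path = Path(csv_path)
    is_new = not path.exists()
    cpu, mem = sampler()
    row = [int(now()), cpu, mem]
    with path.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if is_new:
            # Write the header once, when the file is first created.
            writer.writerow(["unix_time", "cpu_percent", "mem_percent"])
        writer.writerow(row)
    return row
```

Calling `record_sample("metrics.csv")` from a scheduled task every minute or so would build up a history that can be compared against the time of the next crash.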

It's now been 13 days and Metabase is still alive; the issue seems to be gone. :grinning:

Nice :grinning:

Was there a resolution to this in the end? We have just upgraded from 0.41.4 to 0.48.6. Our servers have been fine for years running Metabase, and now all of a sudden we have exactly the same issue as reported above.

This issue should have been fixed by now. What are you seeing?

So we upgraded last week, and it ran fine for 1 week until the CPU randomly spiked and Metabase itself, not the server, became unresponsive. I checked the logs and there were lots of "java.lang.OutOfMemoryError" entries, but we are running it on a server that meets the spec recommended in the docs (2GiB of memory). Here are the spikes in CPU utilisation:
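To line up the OutOfMemoryError entries with the CPU spikes, it can help to count them per hour. The sketch below assumes that log lines begin with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match your logging configuration. Lines without a timestamp, such as stack trace lines, are attributed to the most recent timestamp seen. The function name `count_oom_by_hour` is made up for this example.

```python
import re
from collections import Counter

OOM_MARKER = "java.lang.OutOfMemoryError"
# Assumed timestamp prefix: "YYYY-MM-DD HH" captures the date and the hour.
HOUR_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2})")


def count_oom_by_hour(lines):
    """Count lines mentioning OutOfMemoryError, grouped by log hour."""
    counts = Counter()
    current_hour = "unknown"  # used until the first timestamped line appears
    for line in lines:
        match = HOUR_RE.match(line)
        if match:
            current_hour = match.group(1)
        if OOM_MARKER in line:
            counts[current_hour] += 1
    return counts


if __name__ == "__main__":
    # "metabase.log" is a placeholder path; point it at your own log file.
    with open("metabase.log", encoding="utf-8", errors="replace") as fh:
        for hour, n in sorted(count_oom_by_hour(fh).items()):
            print(hour, n)
```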

What's the instance doing when you run out of memory (OOM)? Or when the CPU spikes?

Sorry, what do you mean by instance? Metabase, server or database?

Metabase/server

So the server has CPU spikes as shown in the image, and outbound network traffic matches the CPU utilisation spikes. Metabase itself becomes slow and laggy before becoming completely unresponsive and returning non-200 responses on the /api/health endpoint, so it shows as unhealthy. The Metabase logs simply show lots of Java heap memory errors, followed by the unhealthy API responses. Whilst "unhealthy", it continues to try to sync with the database.
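A small probe along these lines could be used to timestamp exactly when /api/health stops returning 200. This is a sketch using only the Python standard library; the function names, the three result labels, and the base URL `http://localhost:3000` are assumptions and should be adapted to your deployment.

```python
import urllib.error
import urllib.request


def classify(status_code):
    """Treat HTTP 200 as healthy and any other status as unhealthy."""
    return "healthy" if status_code == 200 else "unhealthy"


def probe(base_url, timeout=5.0):
    """Request <base_url>/api/health and return a coarse health label."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: no answer at all.
        return "unreachable"


if __name__ == "__main__":
    # Assumed address; replace with the URL of your own instance.
    print(probe("http://localhost:3000"))
```

Running it on a schedule and logging the result next to the CPU metrics would show whether the unhealthy responses start before or after the heap errors.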

As rpataro previously mentioned, restarting the server does seem to fix it for 12-24 hours, before the CPU spikes return.