High CPU Usage and Cache

Hello guys. We are currently facing high CPU usage problems using ECS. More details about our instance below.

Usage is the same as it has been historically, with the same number of active users, but recently the CPU usage has been hitting about 100% several times a day. I suspect that Metabase's cache feature is causing this. In Troubleshooting/Logs we repeatedly see this message during the CPU spikes: INFO metabase.query-processor.middleware.cache.impl Results are too large to cache. :tired_face:. We have turned off caching for saved questions for now to check whether the instance becomes more stable.

Do you have any suggestions for what we can do or check in our ECS or Metabase configs?
Thanks.

Detailed Info:

  • Metabase Version: v0.48.1
  • SQL Engine: Athena
  • Metabase Features:
    -- Cache active for saved questions (Minimum query duration: 30s, TTL Multiplier: 720, Max Cache Entry Size not set)
    -- X-ray features: disabled
  • Connection Features:
    -- Rerun queries for simple explorations: disabled
    -- Choose when syncs and scans happen: enabled
    -- Scanning for filter values: regularly, on a schedule
    -- Periodically refingerprint tables: enabled
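
For reference, a rough sketch of what the cache settings above imply. It assumes that Metabase derives a cached result's time-to-live as the query's average execution time multiplied by the TTL multiplier, as the caching documentation describes; the function name is made up for illustration.

```python
def cache_ttl_seconds(avg_execution_seconds: float, ttl_multiplier: float) -> float:
    """Assumed rule: TTL = average execution time * TTL multiplier."""
    return avg_execution_seconds * ttl_multiplier


# A query that just meets the 30s minimum duration, with a multiplier of 720.
ttl = cache_ttl_seconds(30, 720)
print(f"{ttl:.0f} seconds = {ttl / 3600:.1f} hours")
```

Under that assumption, even the fastest cacheable query would stay cached for about six hours, and slower queries proportionally longer.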

We’re fixing this as we speak

Nice. Do you expect to release the fix in v0.48.3? Is there a GitHub issue you could share with me?

Just to update our situation: we doubled CPU / Memory in ECS to 4 vCPU | 8 GB. As a consequence, our peak now seems to be around 82% rather than 100% as before.
However, we are still seeing the "Results are too large to cache" message very frequently.

please increase the cache size

we're fixing the CPU problem in v0.48.3

By setting the MAX CACHE ENTRY SIZE parameter?
Can you suggest a value in kilobytes that I should set?

put 204800, that's 200MB per question
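
The arithmetic behind that number, as a small sketch: the setting is expressed in kilobytes, so 200 MB becomes 200 × 1024 KB. The environment variable name printed at the end is an assumption on my part (the thread only mentions the Max Cache Entry Size field in the admin settings), so check the Metabase docs for your version before relying on it.

```python
KB_PER_MB = 1024


def mb_to_kb(megabytes: int) -> int:
    """Convert megabytes to kilobytes (1 MB = 1024 KB)."""
    return megabytes * KB_PER_MB


max_entry_kb = mb_to_kb(200)
print(max_entry_kb)

# Assumed environment-variable equivalent of the admin setting; verify the name
# against the Metabase environment variable docs before using it.
print(f"MB_QUERY_CACHING_MAX_KB={max_entry_kb}")
```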

thanks.
I see you've released version 0.48.3. Does it include a fix for the high CPU usage?

Yes, that version should fix the cpu consumption

Perfect, thanks.
After we upgrade and test, I'll let you know if it is stable for us.

Did 0.48.3 resolve the CPU issue for others? It did not for us. Since we upgraded to 0.48.x we have had unexplained CPU spikes that are only resolved by stopping and restarting the service. I've attached a graph of CPU usage for the last week; the upgrade happened on 1/13/2024.

Can you give us the logs when the spikes happen?

Sure, we can provide them to you. Do you want them from when the spike first occurs, or throughout the entire event? The events seem to last anywhere from 30 minutes to 2 hours.

just before the spike and 15 minutes inside the spike, I want to see the exact operation that triggers it
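
One way to cut such a window out of a large log file, as a minimal sketch. It assumes every log entry starts with a timestamp like `2024-01-18 16:07:00,759` (the format of the log line quoted in this thread) and that lines without a timestamp, such as stack traces, belong to the entry before them. The 10-minute lead-in is an arbitrary choice for "just before", and the function names are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator, Optional

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # length of "2024-01-18 16:07:00,759"


def parse_timestamp(line: str) -> Optional[datetime]:
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def lines_around_spike(
    lines: Iterable[str],
    spike_start: datetime,
    before: timedelta = timedelta(minutes=10),  # assumed meaning of "just before"
    inside: timedelta = timedelta(minutes=15),
) -> Iterator[str]:
    """Yield log lines from `before` the spike until `inside` minutes into it."""
    window_start = spike_start - before
    window_end = spike_start + inside
    keep = False
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None:
            keep = window_start <= ts <= window_end
        # Lines without a timestamp inherit the decision of the previous entry.
        if keep:
            yield line
```

To use it, open the log file, pass the file object as `lines` together with the time the spike started on the CPU graph, and write the yielded lines to a new file.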

Sorry for the delay. We had put in an automated process to restart the service every 6 hours to avoid the CPU spike, and we have had to disable that. We will try to get the log data to you as soon as possible.

As an aside, we did see this message in the logs, which says we should report it:

2024-01-18 16:07:00,759 WARN malli.fn :: Invalid input - Please report this as an issue on Github: ["invalid type"] {:type :metabase.util.malli.fn/invalid-input, :error {:schema :map, :value nil, :errors ({:path , :in , :schema :map, :value nil, :type :malli.core/invalid-type})}, :humanized [invalid type], :schema :map, :value nil, :fn-name query-hash}

We have a log file and an image of the CPU usage at the time of the event, but there does not appear to be a way to upload a zip file here. If you have a place where we can upload the file, let me know; otherwise we can put it somewhere you can download it.

share it on any service you can, thanks

here is a link:

It contains a file with the log output and an image of the CPU usage when the event started.
I hope that is what you were looking for.

I analyzed your logs with my project (GitHub - paoliniluis/metatask-timeline-viewer: A simple log extraction script that generates a static to see the tasks being run), and this is what I see:

If you check the image, there are syncs that overlap
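
To make "overlap" concrete, here is a minimal sketch of how overlapping tasks can be found once each task's start and end time is known. This is not how the timeline viewer works internally, and the task names and times in the usage example are invented for illustration, not taken from the shared log.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple


class Task(NamedTuple):
    name: str
    start: datetime
    end: datetime


def overlapping_pairs(tasks: List[Task]) -> List[Tuple[str, str]]:
    """Return the names of every pair of tasks whose time ranges overlap."""
    ordered = sorted(tasks, key=lambda t: t.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # sorted by start, so no later task can overlap `first`
            pairs.append((first.name, second.name))
    return pairs


# Hypothetical sync tasks: the second starts while the first is still running.
tasks = [
    Task("sync-db-1", datetime(2024, 1, 18, 16, 0), datetime(2024, 1, 18, 16, 30)),
    Task("sync-db-2", datetime(2024, 1, 18, 16, 10), datetime(2024, 1, 18, 16, 20)),
    Task("sync-db-3", datetime(2024, 1, 18, 16, 30), datetime(2024, 1, 18, 16, 40)),
]
print(overlapping_pairs(tasks))
```

Tasks that merely touch (one ends exactly when the next begins) are not counted as overlapping here.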

How many CPUs do you have assigned to the server?