High CPU Usage and Cache

lvgiacomin · January 8, 2024, 5:15pm

Hello guys. We are currently facing high CPU usage problems using ECS. More details about our instance below.

Usage is the same as it has been historically, with the same number of active users, but only recently we've been seeing CPU usage hit about 100% several times a day. I suspect that Metabase's cache feature is causing this. In Troubleshooting/Logs, we see this message multiple times during the CPU usage spikes: INFO metabase.query-processor.middleware.cache.impl Results are too large to cache. We have turned off caching for saved questions for now to check whether the instance becomes more stable.

Do you have any suggestion of what we can do or check in our ECS or Metabase configs?
Thanks.

Detailed Info:

Metabase Version: v0.48.1
SQL Engine: Athena
Metabase Features:
-- Cache active for saved questions (Minimum query duration: 30s, TTL Multiplier: 720, Max Cache Entry Size not set)
-- X-ray features: disabled
Connection Features:
-- Rerun queries for simple explorations: disabled
-- Choose when syncs and scans happen: enabled
-- Scanning for filter values: regularly, on a schedule
-- Periodically refingerprint tables: enabled

Luiggi · January 8, 2024, 5:31pm

We’re fixing this as we speak

lvgiacomin · January 8, 2024, 6:20pm

Nice, expecting to launch it in v0.48.3? Do you have any issue on GitHub you could share with me?

Luiggi · January 9, 2024, 3:15am

github.com/metabase/metabase

CPU consumption is very high after upgrading to v0.48.1

opened 07:49AM - 28 Dec 23 UTC

Jerrylovescoding

Type:Bug Priority:P1 .Performance Querying/Processor .Backend .Regression

### Describe the bug

### Type: Metabase (Running on Docker)

### Version: … v0.48.1

### Bug Description:

After upgrading from v0.47.8 to v0.48.1, the CPU consumption of server becomes higher, at the same time, some cards and dashboards load much more slowly than ever before. It never happened before.

For example:

![image](https://github.com/metabase/metabase/assets/25549512/645d7b3c-2846-472f-b7f2-6bb9e1ed4309)

The main process that consumes CPU:

![image](https://github.com/metabase/metabase/assets/25549512/74991430-6cb3-4410-bbdd-9a1363e434ac)
![image](https://github.com/metabase/metabase/assets/25549512/3f4727bd-8d97-4d9f-a0c9-52559409da04)
![image](https://github.com/metabase/metabase/assets/25549512/fe385d6e-30af-4f8b-88d7-0f9e4049ac6f)

Btw, I also use metabase-clickhouse-driver (v1.3.1) plugin for my Clickhouse db.

### Expected behavior

Hope to improve the loading speed of dashboard.

### Logs

_No response_

### Information about your Metabase installation

```JSON
- Metabase hosting env: Docker
- Metabase version: v0.48.1
- Metabase internal database: PostgresSQL
- The OS that Metabase running on: CentOS 7
- My Database: PostgreSQL, MySQL, StarRocks
```

### Severity

The card and dashboard load very very slowly.

### Additional context

_No response_

lvgiacomin · January 11, 2024, 1:42pm

Just to update on our situation: we doubled CPU and memory in ECS to 4 vCPU | 8 GB. As a result, our peak now seems to be around 82% rather than 100%.
However, we are seeing the "Results are too large to cache" message a lot.
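To put a number on "a lot", one option is to count how often the message appears per hour and compare that with the CPU graph. The snippet below is only a sketch: it assumes each log line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp (the format of the log line quoted later in this thread), and the sample lines and the function name are made up for illustration.

```python
from collections import Counter

MESSAGE = "Results are too large to cache"

def count_per_hour(lines):
    """Count occurrences of MESSAGE, grouped by the 'YYYY-MM-DD HH' prefix of each line."""
    counts = Counter()
    for line in lines:
        if MESSAGE in line:
            # Assumption: the first 13 characters are 'YYYY-MM-DD HH'.
            counts[line[:13]] += 1
    return counts

# Hypothetical sample lines, not taken from a real log.
sample_log = [
    "2024-01-11 13:05:12,101 INFO metabase.query-processor.middleware.cache.impl :: Results are too large to cache.",
    "2024-01-11 13:40:03,552 INFO metabase.query-processor.middleware.cache.impl :: Results are too large to cache.",
    "2024-01-11 13:41:00,000 INFO some.other.namespace :: unrelated message",
    "2024-01-11 14:02:47,009 INFO metabase.query-processor.middleware.cache.impl :: Results are too large to cache.",
]

counts = count_per_hour(sample_log)
for hour, n in sorted(counts.items()):
    print(hour, n)
```

For a real log, pass the open file object instead of `sample_log`.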

Luiggi · January 11, 2024, 1:57pm

please increase the cache size

we're fixing the CPU problem in v0.48.3

lvgiacomin · January 11, 2024, 2:14pm

By setting the MAX CACHE ENTRY SIZE parameter?
Do you have a suggestion for how many kilobytes I should set?

Luiggi · January 11, 2024, 5:11pm

put 204800, that's 200MB per question
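For reference, the arithmetic behind that number: the setting is expressed in kilobytes, and 204800 / 1024 = 200 MB. The helper names below are made up for illustration. On ECS the same value can likely be set through the `MB_QUERY_CACHING_MAX_KB` environment variable, but treat that name as an assumption and check it against the environment variable docs for your Metabase version.

```python
KB_PER_MB = 1024

def kb_to_mb(kilobytes: int) -> float:
    """Convert a size in kilobytes to megabytes."""
    return kilobytes / KB_PER_MB

def mb_to_kb(megabytes: int) -> int:
    """Convert a size in megabytes to the kilobyte value the setting expects."""
    return megabytes * KB_PER_MB

print(kb_to_mb(204800))  # 200.0
print(mb_to_kb(200))     # 204800
```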

lvgiacomin · January 12, 2024, 12:13pm

thanks.
I see you've released version 0.48.3. Does it include a fix for the high CPU usage?

Luiggi · January 12, 2024, 12:16pm

Yes, that version should fix the cpu consumption

lvgiacomin · January 12, 2024, 12:18pm

Perfect, thanks.
After we upgrade and test, I'll let you know if it got stable for us.

rpataro · January 16, 2024, 2:59pm

Did 0.48.3 resolve the CPU issue for others? It did not for us. Since we upgraded to 0.48.x we have had unexplained CPU spikes that are only resolved by stopping and restarting the service. I've attached a graph of CPU usage for the last week; the upgrade happened on 1/13/2024.

Luiggi · January 16, 2024, 3:57pm

Can you give us the logs when the spikes happen?

rpataro · January 16, 2024, 4:14pm

Sure, we can provide them to you. Do you want them from when the spike first occurs, or throughout the entire event? The events seem to last anywhere from 30 minutes to 2 hours.

Luiggi · January 16, 2024, 4:31pm

just before the spike and 15 minutes inside the spike, I want to see the exact operation that triggers it
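A rough sketch of how that window could be cut out of a larger log file. It assumes each entry starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, treats "just before" as 15 minutes (my assumption, adjust as needed), and keeps continuation lines such as stack traces with the entry they follow. The function name and sample lines are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log entry.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")

def lines_in_window(lines, spike_start,
                    before=timedelta(minutes=15),
                    after=timedelta(minutes=15)):
    """Return the lines logged between spike_start - before and spike_start + after."""
    lo, hi = spike_start - before, spike_start + after
    keep = False
    selected = []
    for line in lines:
        match = TS_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            keep = lo <= ts <= hi
        # Lines without a timestamp inherit the decision of the previous entry.
        if keep:
            selected.append(line)
    return selected

# Hypothetical sample lines, not taken from a real log.
sample = [
    "2024-01-18 15:40:00,000 INFO a :: too early",
    "2024-01-18 15:50:00,000 INFO b :: before spike",
    "    continuation line",
    "2024-01-18 16:10:00,000 WARN c :: inside spike",
    "2024-01-18 16:30:00,000 INFO d :: too late",
]
window = lines_in_window(sample, datetime(2024, 1, 18, 16, 0, 0))
print("\n".join(window))
```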

rpataro · January 18, 2024, 7:43pm

Sorry for the delay. We had put in an automated process to restart the service every 6 hours to avoid the CPU spike, and have had to disable that. We will try to get the log data to you as soon as possible.

As an aside, we did see this message in the logs, which says it should be reported:

2024-01-18 16:07:00,759 WARN malli.fn :: Invalid input - Please report this as an issue on Github: ["invalid type"] {:type :metabase.util.malli.fn/invalid-input, :error {:schema :map, :value nil, :errors ({:path , :in , :schema :map, :value nil, :type :malli.core/invalid-type})}, :humanized [invalid type], :schema :map, :value nil, :fn-name query-hash}

rpataro · January 18, 2024, 11:22pm

We have a log file and an image of the CPU usage at the time of the event, but there does not appear to be a way to upload a zip file here. If you have a place where we can upload the file, let me know; otherwise we can put it somewhere you can download it.

Luiggi · January 18, 2024, 11:57pm

share it on any service you can, thanks

rpataro · January 19, 2024, 3:39am

here is a link:

it contains a file with the log output and an image of CPU when the event started.
hope that is what you were looking for.

Luiggi · January 19, 2024, 1:58pm

I used my project here: "GitHub - paoliniluis/metatask-timeline-viewer: A simple log extraction script that generates a static to see the tasks being run" to analyze your logs, and this is what I see

If you check the image, there are syncs that overlap
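To check for overlaps without the timeline image, the idea can be sketched as a plain interval-overlap test. This is not how the timeline viewer works internally; it is a minimal illustration that assumes you have already extracted a start and end time for each sync from the logs. The `Task` type, the function name and the sample times are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple

class Task(NamedTuple):
    name: str
    start: datetime
    end: datetime

def overlapping_pairs(tasks):
    """Return (earlier, later) name pairs for tasks whose time ranges overlap."""
    ordered = sorted(tasks, key=lambda t: t.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                # Sorted by start time, so no later task can overlap `first`.
                break
            pairs.append((first.name, second.name))
    return pairs

# Hypothetical sync windows, not taken from the shared log.
tasks = [
    Task("sync db 1", datetime(2024, 1, 18, 16, 0), datetime(2024, 1, 18, 16, 40)),
    Task("sync db 2", datetime(2024, 1, 18, 16, 10), datetime(2024, 1, 18, 16, 30)),
    Task("sync db 3", datetime(2024, 1, 18, 16, 45), datetime(2024, 1, 18, 17, 0)),
]
pairs = overlapping_pairs(tasks)
print(pairs)
```

Any pair this reports is a candidate for moving to a different slot in the sync schedule.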

How many CPUs do you have assigned to the server?