Instance webapp response time slow, lots of "Query cancelled on user's request" errors

Hello @flamber, I hope you're doing well.

I'm opening this topic because, over the last few days, I've noticed that one of my Metabase instances has become very slow (mainly in terms of webapp response time): HTTP requests went from taking 50-200 ms to around 3000 ms. The instance's CPU and RAM usage also go up significantly, even when usage is low (during lunch break, for instance).

This is the case on only one of my 3 instances, all running version 0.43.4. The problem started before the upgrade (on 0.43.3) and persists afterwards.

When looking at the logs, I see a lot of exceptions mentioning "ERROR: abort query" or "ERROR: Query (98713593) cancelled on user's request" (I can provide examples if necessary).

When investigating further, it seems that Metabase is launching queries on behalf of users in a loop and then cancelling them, which can overload the instance. I could not determine what triggers this behaviour, though.

I was wondering if you had some ideas on what could be causing those requests and errors, and if anything else could be related to the instance's slowdown.

Here's the diagnostic info:
{
  "browser-info": {
    "language": "fr-FR",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.15+10",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.15",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.15+10",
    "os.name": "Linux",
    "os.version": "4.14.275-207.503.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "redshift"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "13.4"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.3.2"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2022-06-27",
      "tag": "v0.43.4",
      "branch": "release-x.43.x",
      "hash": "61cc28e"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

Thanks a lot for your help.

Hi @clement.l
When did the problems start?
I don't know what the difference is between your instances, but there must be some difference.
Sounds like reverse proxying issues: https://www.metabase.com/docs/latest/troubleshooting-guide/proxies.html

Thanks for your response.

The problem started at 8am UTC on the 28th of June.

The instances are exact replicas, but one of them (the one with the issues) is used a lot more and contains a lot more users, questions, etc.

I can look into reverse proxy issues, as I haven't checked that yet, and the instances are hosted behind different proxies.

That said, I don't really see how the reverse proxy could be linked to higher CPU usage... Maybe I have two separate issues at hand.

Just to be sure I'm not mistaken, when I see those lines in the logs:

[5d4bca9f-b2f0-4ff0-89e7-77f882221914] 2022-06-30T17:27:30+02:00 DEBUG metabase.server.middleware.log GET /api/activity/popular_items null 1.5 mins (10 DB calls) App DB connections: 2/30 Jetty threads: 6/50 (17 idle, 0 queued) (121 total active threads) Queries in flight: 1 (0 queued)

[5d4bca9f-b2f0-4ff0-89e7-77f882221914] 2022-06-30T17:27:33+02:00 DEBUG metabase.server.middleware.log GET /api/activity/recent_views 200 58.3 s (8 DB calls) App DB connections: 1/30 Jetty threads: 4/50 (19 idle, 0 queued) (121 total active threads) Queries in flight: 1 (0 queued)
[5d4bca9f-b2f0-4ff0-89e7-77f882221914] 2022-06-30T17:27:34+02:00 DEBUG metabase.server.middleware.log GET /api/activity/popular_items 200 58.7 s (10 DB calls) App DB connections: 0/30 Jetty threads: 3/50 (20 idle, 0 queued) (121 total active threads) Queries in flight: 1 (0 queued)
[5d4bca9f-b2f0-4ff0-89e7-77f882221914] 2022-06-30T17:27:39+02:00 INFO metabase.query-processor.middleware.cache Query took 14.1 s to run; minimum for cache eligibility is 5.0 s

In particular, for "GET /api/activity/recent_views 200 58.3 s (8 DB calls)" and "GET /api/activity/popular_items 200 58.7 s": does this mean that the Metabase webapp took 58 seconds to process those requests, so the reverse proxy can't be at fault? Or does the elapsed time include the time spent receiving the request?

@clement.l So what happened on the 28th?
Post the output by running this query on the Metabase application database:

with tbl as (
  select table_schema, table_name
  from information_schema.tables
  where table_name not like 'pg_%'
    and table_schema != 'information_schema'
    and table_schema in ('public') -- change 'public' if the Metabase appdb is in another schema
)
select
  table_schema
  , table_name
  , pg_total_relation_size('"'||table_schema||'"."'||table_name||'"') size_total
  , pg_relation_size('"'||table_schema||'"."'||table_name||'"') size_tbl
  , (xpath('/row/c/text()', query_to_xml(format('select count(*) as c from %I.%I', table_schema, table_name), false, true, '')))[1]::text::bigint count_rows
from tbl
order by 5 desc nulls last;

@flamber

Hey!

Here's the output:

table_schema  table_name  size_total (bytes)  size_tbl (bytes)  count_rows
public query_execution 233529344 90947584 671785
public view_log 94896128 66117632 651857
public task_history 15450112 9453568 100017
public query 79699968 58212352 59071
public metabase_field 3588096 1957888 6320
public activity 1826816 1343488 5570
public revision 8486912 5488640 5117
public query_cache 19832832 1843200 1845
public metabase_fieldvalues 712704 360448 1000
public report_dashboardcard 1007616 868352 981
public report_card 3293184 2318336 901
public login_history 499712 196608 489
public databasechangelog 368640 245760 438
public metabase_table 737280 81920 403
public core_session 180224 81920 395
public collection 114688 16384 112
public permissions_group_membership 90112 8192 112
public report_dashboard 229376 90112 88
public pulse_card 57344 8192 64
public core_user 98304 16384 56
public permissions 122880 8192 52
public pulse 65536 8192 38
public setting 253952 8192 35
public pulse_channel 65536 8192 31
public pulse_channel_recipient 24576 8192 18
public bookmark_ordering 106496 8192 13
public data_migrations 40960 8192 13
public permissions_group 57344 8192 10
public qrtz_cron_triggers 65536 8192 9
public qrtz_job_details 65536 8192 9
public qrtz_triggers 270336 16384 9
public dashboard_bookmark 73728 8192 8
public timeline_event 65536 8192 8
public collection_bookmark 73728 8192 4
public timeline 49152 8192 4
public qrtz_locks 24576 8192 2
public permissions_revision 32768 8192 2
public qrtz_scheduler_state 57344 8192 1
public native_query_snippet 81920 8192 1
public databasechangeloglock 57344 8192 1
public metabase_database 32768 8192 1
public card_bookmark 73728 8192 1
public computation_job 16384 0 0
public qrtz_simprop_triggers 16384 0 0
public metric 40960 0 0
public dashboard_favorite 32768 0 0
public dashboardcard_series 24576 0 0
public metric_important_field 32768 0 0
public label 32768 0 0
public moderation_review 24576 0 0
public collection_permission_graph_revision 16384 0 0
public segment 40960 0 0
public qrtz_paused_trigger_grps 8192 0 0
public qrtz_blob_triggers 16384 0 0
public qrtz_calendars 16384 0 0
public dimension 32768 0 0
public computation_job_result 16384 0 0
public secret 16384 0 0
public report_cardfavorite 32768 0 0
public dependency 40960 0 0
public card_label 32768 0 0
public group_table_access_policy 32768 0 0
public application_permissions_revision 16384 0 0
public qrtz_simple_triggers 32768 8192 0
public qrtz_fired_triggers 163840 8192 0

After a bit of investigation, it seems that those errors were present before the 28th but started skyrocketing, which probably caused the slowdown. Any idea what can cause those "ERROR: Query cancelled on user's request" errors?

Here's the daily count of this error, with a noticeable spike on the 28th:

2022-06-01 94
2022-06-02 149
2022-06-03 93
2022-06-07 224
2022-06-08 85
2022-06-09 206
2022-06-10 136
2022-06-11 4
2022-06-13 113
2022-06-14 349
2022-06-15 142
2022-06-16 137
2022-06-17 56
2022-06-18 2
2022-06-20 72
2022-06-21 189
2022-06-23 10
2022-06-24 145
2022-06-26 5
2022-06-27 67
2022-06-28 3999
2022-06-29 11358
2022-06-30 10991
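
(For reference, a daily count like this can be pulled from the Metabase application database with something along these lines, assuming the cancellation message is recorded in query_execution.error:)

-- hedged sketch: count cancelled queries per day in the Metabase appdb
select started_at::date as day, count(*) as errors
from query_execution
where error like '%cancelled on user''s request%'
group by 1
order by 1;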

@clement.l I don't know how large your Metabase app or Postgres application database instance is, but the application database itself is not huge, so it should not be the problem. You can try truncating your query_execution and view_log tables (make a backup first).
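
As a rough sketch, assuming the tables live in the default public schema (the backup table names here are just examples):

-- keep a copy of the data before truncating (adjust names/schema as needed)
create table query_execution_backup as select * from query_execution;
create table view_log_backup as select * from view_log;

-- then clear the two largest tables
truncate table query_execution;
truncate table view_log;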

It's very rare that computers just start changing behavior without something being changed.

You'll see "Error: Query cancelled on user's request" when someone loads a dashboard but closes it before the queries have finished, since Metabase then sends a cancel request to the database to free up resources.
It can also be caused by an incorrect timeout on your data source database, though. You'll have to check the logs of everything involved.
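
Since your data source is Redshift (per the diagnostic info), one way to cross-check from the warehouse side is to count aborted/cancelled queries per day in the stl_query system table, for example (note that STL tables only keep a few days of history):

-- hedged sketch: aborted/cancelled queries per day, as seen by Redshift
select trunc(starttime) as query_day, count(*) as aborted_queries
from stl_query
where aborted = 1
group by 1
order by 1;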

@clement.l Were you able to figure this out? I'm running into the same issue now and haven't been able to nail down the problem.

Hi @SM_J

In my case, multiple users were stuck in an endless loop that was making them call and then cancel questions dozens of times per minute. That caused a global instance slowdown.
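
For anyone else hitting this, here's a rough sketch of how the affected users can be spotted in the Metabase application database (assuming the cancellation message lands in query_execution.error; executor_id references core_user.id in the standard schema):

-- hedged sketch: which users generated the most cancelled queries in the last day
select u.email, count(*) as cancelled_queries
from query_execution qe
join core_user u on u.id = qe.executor_id
where qe.error like '%cancelled on user''s request%'
  and qe.started_at > now() - interval '1 day'
group by u.email
order by cancelled_queries desc;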

The fix I found was to redeploy the instance from scratch and then deactivate and reactivate those users, which stopped the request/cancel loop.

Best of luck!

@clement.l Thanks for the prompt reply. I'll give it a go!