Instance webapp response time slow, lots of "Query cancelled on user's request" errors

Hello @flamber, I hope you're doing well.

I'm opening this topic because, over the last few days, I've noticed that one of my Metabase instances has become very slow (mainly in terms of webapp response time): HTTP requests went from taking 50-200 ms to around 3000 ms. The instance's CPU and RAM usage also go up significantly, even when usage is low (during lunch break, for instance).

This is the case on only one of my 3 instances, all running version 0.43.4. The problem started before the upgrade (on 0.43.3) and persists afterwards.

When looking at the logs, I see a lot of exceptions mentioning "ERROR: abort query" or "ERROR: Query (98713593) cancelled on user's request" (I can provide examples if necessary).

When investigating further, it seems that Metabase is launching queries on behalf of users in a loop and then cancelling them, which can overload the instance. I could not determine what triggers this behaviour, though.

I was wondering if you had some ideas on what could be causing those requests and errors, and if anything else could be related to the instance's slowdown.

Here's the diagnostic info:
{
  "browser-info": {
    "language": "fr-FR",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.15+10",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.15",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.15+10",
    "os.name": "Linux",
    "os.version": "4.14.275-207.503.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "redshift"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "13.4"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.3.2"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2022-06-27",
      "tag": "v0.43.4",
      "branch": "release-x.43.x",
      "hash": "61cc28e"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

Thanks a lot for your help.

Hi @clement.l
When did the problems start?
I don't know what the difference is between your instances, but there must be some difference.
Sounds like reverse proxying issues: https://www.metabase.com/docs/latest/troubleshooting-guide/proxies.html

Thanks for your response.

The problem started at 8am UTC on the 28th of June.

The instances are exact replicas, but one of them (the one with the issues) is used a lot more and contains a lot more users, questions, etc.

I can look into reverse proxy issues, as I haven't checked that yet, and the instances are hosted behind different proxies.

That said, I don't really see how the reverse proxy could be linked to higher CPU usage... Maybe I have two separate issues at hand.

Just to be sure I'm not mistaken, when I see those lines in the logs:

[5d4bca9f-b2f0-4ff0-89e7-77f882221914] 2022-06-30T17:27:30+02:00 DEBUG metabase.server.middleware.log GET /api/activity/popular_items null 1.5 mins (10 DB calls) App DB connections: 2/30 Jetty threads: 6/50 (17 idle, 0 queued) (121 total active threads) Queries in flight: 1 (0 queued)

[5d4bca9f-b2f0-4ff0-89e7-77f882221914] 2022-06-30T17:27:33+02:00 DEBUG metabase.server.middleware.log GET /api/activity/recent_views 200 58.3 s (8 DB calls) App DB connections: 1/30 Jetty threads: 4/50 (19 idle, 0 queued) (121 total active threads) Queries in flight: 1 (0 queued)
[5d4bca9f-b2f0-4ff0-89e7-77f882221914] 2022-06-30T17:27:34+02:00 DEBUG metabase.server.middleware.log GET /api/activity/popular_items 200 58.7 s (10 DB calls) App DB connections: 0/30 Jetty threads: 3/50 (20 idle, 0 queued) (121 total active threads) Queries in flight: 1 (0 queued)
[5d4bca9f-b2f0-4ff0-89e7-77f882221914] 2022-06-30T17:27:39+02:00 INFO metabase.query-processor.middleware.cache Query took 14.1 s to run; minimum for cache eligibility is 5.0 s

In particular, for "GET /api/activity/recent_views 200 58.3 s (8 DB calls)" and "GET /api/activity/popular_items 200 58.7 s": does this mean that the Metabase webapp took 58 seconds to process those requests, so the reverse proxy can't be at fault? Or does the elapsed time include the time spent receiving the request?

@clement.l So what happened on the 28th?
Post the output by running this query on the Metabase application database:

with tbl as (
  select table_schema, table_name
  from information_schema.tables
  where table_name not like 'pg_%'
    and table_schema != 'information_schema'
    and table_schema in ('public') -- change 'public' if the Metabase appdb is in another schema
)
select
  table_schema
  , table_name
  , pg_total_relation_size('"'||table_schema||'"."'||table_name||'"') size_total
  , pg_relation_size('"'||table_schema||'"."'||table_name||'"') size_tbl
  , (xpath('/row/c/text()', query_to_xml(format('select count(*) as c from %I.%I', table_schema, table_name), false, true, '')))[1]::text::bigint count_rows
from tbl
order by 5 desc nulls last;

@flamber

Hey!

Here's the output:

table_schema  table_name  size_total (bytes)  size_tbl (bytes)  count_rows
public query_execution 233529344 90947584 671785
public view_log 94896128 66117632 651857
public task_history 15450112 9453568 100017
public query 79699968 58212352 59071
public metabase_field 3588096 1957888 6320
public activity 1826816 1343488 5570
public revision 8486912 5488640 5117
public query_cache 19832832 1843200 1845
public metabase_fieldvalues 712704 360448 1000
public report_dashboardcard 1007616 868352 981
public report_card 3293184 2318336 901
public login_history 499712 196608 489
public databasechangelog 368640 245760 438
public metabase_table 737280 81920 403
public core_session 180224 81920 395
public collection 114688 16384 112
public permissions_group_membership 90112 8192 112
public report_dashboard 229376 90112 88
public pulse_card 57344 8192 64
public core_user 98304 16384 56
public permissions 122880 8192 52
public pulse 65536 8192 38
public setting 253952 8192 35
public pulse_channel 65536 8192 31
public pulse_channel_recipient 24576 8192 18
public bookmark_ordering 106496 8192 13
public data_migrations 40960 8192 13
public permissions_group 57344 8192 10
public qrtz_cron_triggers 65536 8192 9
public qrtz_job_details 65536 8192 9
public qrtz_triggers 270336 16384 9
public dashboard_bookmark 73728 8192 8
public timeline_event 65536 8192 8
public collection_bookmark 73728 8192 4
public timeline 49152 8192 4
public qrtz_locks 24576 8192 2
public permissions_revision 32768 8192 2
public qrtz_scheduler_state 57344 8192 1
public native_query_snippet 81920 8192 1
public databasechangeloglock 57344 8192 1
public metabase_database 32768 8192 1
public card_bookmark 73728 8192 1
public computation_job 16384 0 0
public qrtz_simprop_triggers 16384 0 0
public metric 40960 0 0
public dashboard_favorite 32768 0 0
public dashboardcard_series 24576 0 0
public metric_important_field 32768 0 0
public label 32768 0 0
public moderation_review 24576 0 0
public collection_permission_graph_revision 16384 0 0
public segment 40960 0 0
public qrtz_paused_trigger_grps 8192 0 0
public qrtz_blob_triggers 16384 0 0
public qrtz_calendars 16384 0 0
public dimension 32768 0 0
public computation_job_result 16384 0 0
public secret 16384 0 0
public report_cardfavorite 32768 0 0
public dependency 40960 0 0
public card_label 32768 0 0
public group_table_access_policy 32768 0 0
public application_permissions_revision 16384 0 0
public qrtz_simple_triggers 32768 8192 0
public qrtz_fired_triggers 163840 8192 0

After a bit of investigation, it seems that those errors were present before the 28th but started skyrocketing, which probably caused the slowdown. Any idea what can cause those "ERROR: Query cancelled on user's request" errors?

Here's the daily count of this error, with a noticeable spike on the 28th:

2022-06-01 94
2022-06-02 149
2022-06-03 93
2022-06-07 224
2022-06-08 85
2022-06-09 206
2022-06-10 136
2022-06-11 4
2022-06-13 113
2022-06-14 349
2022-06-15 142
2022-06-16 137
2022-06-17 56
2022-06-18 2
2022-06-20 72
2022-06-21 189
2022-06-23 10
2022-06-24 145
2022-06-26 5
2022-06-27 67
2022-06-28 3999
2022-06-29 11358
2022-06-30 10991
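
(For reference, a daily count like this can be pulled from the Metabase application database with something along these lines, assuming the cancellation message is recorded in query_execution.error:)

-- hedged sketch: count cancelled queries per day in the Metabase appdb
select started_at::date as day, count(*) as errors
from query_execution
where error like '%cancelled on user''s request%'
group by 1
order by 1;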

@clement.l I don't know how large your Metabase app or Postgres application database instance is, but the application database itself is not huge, so it should not be the problem. You can try truncating your query_execution and view_log tables (make a backup first).
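
As a rough sketch, assuming the tables live in the default public schema (the backup table names here are just examples):

-- keep a copy of the data before truncating (adjust names/schema as needed)
create table query_execution_backup as select * from query_execution;
create table view_log_backup as select * from view_log;

-- then clear the two largest tables
truncate table query_execution;
truncate table view_log;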

It's very rare that computers just start changing behavior without something being changed.

You'll see "Error: Query cancelled on user's request" when someone loads a dashboard but closes it before the queries have finished, since Metabase then sends a cancel request to the database to free up resources.
It can also be caused by an incorrect timeout on your data source database, though. You'll have to check the logs of everything involved.
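
Since your data source is Redshift (per the diagnostic info), one way to cross-check from the warehouse side is to count aborted/cancelled queries per day in the stl_query system table, for example (note that STL tables only keep a few days of history):

-- hedged sketch: aborted/cancelled queries per day, as seen by Redshift
select trunc(starttime) as query_day, count(*) as aborted_queries
from stl_query
where aborted = 1
group by 1
order by 1;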

@clement.l Were you able to figure this out? I'm running into the same issue now and haven't been able to nail down the problem.

Hi @SM_J

In my case, multiple users were stuck in an endless loop that was making them call and then cancel questions dozens of times per minute. That caused a global instance slowdown.
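
For anyone else hitting this, here's a rough sketch of how the affected users can be spotted in the Metabase application database (assuming the cancellation message lands in query_execution.error; executor_id references core_user.id in the standard schema):

-- hedged sketch: which users generated the most cancelled queries in the last day
select u.email, count(*) as cancelled_queries
from query_execution qe
join core_user u on u.id = qe.executor_id
where qe.error like '%cancelled on user''s request%'
  and qe.started_at > now() - interval '1 day'
group by u.email
order by cancelled_queries desc;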

The fix I found was to redeploy the instance from scratch and then deactivate and reactivate those users, which stopped the request/cancel loop.

Best of luck!

@clement.l Thanks for the prompt reply. I'll give it a go!