Metabase environment performance constantly degrading

We have been getting performance degrading messages multiple times in a day for our metabase environment..

Metabase Version - 0.41.2 , Built on 2021-11-09

AWS Setup - Elastic Beanstalk, Instance types: t3.medium ,Minimum of 1 instance and maximum of 4 instances can be added in case of performance issues

Command timeout: 600

Load balancer timeout , nginx proxy are set to 600 seconds.

Load Balancer Health check path - /api/health

AWS notification messages are

  1. Environment health has transitioned from Info to Degraded. 14.3 % of the requests are failing with HTTP 5xx
  2. Environment health has transitioned from Ok to Degraded. No data received from 1 out of 2 instances.

EBS Monitoring Stats

Diagonistic report -
{
"browser-info": {
"language": "en-AU",
"platform": "MacIntel",
"userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36",
"vendor": "Google Inc."
},
"system-info": {
"file.encoding": "UTF-8",
"java.runtime.name": "OpenJDK Runtime Environment",
"java.runtime.version": "11.0.13+8",
"java.vendor": "Eclipse Adoptium",
"java.vendor.url": "https://adoptium.net/",
"java.version": "11.0.13",
"java.vm.name": "OpenJDK 64-Bit Server VM",
"java.vm.version": "11.0.13+8",
"os.name": "Linux",
"os.version": "4.14.252-195.483.amzn2.x86_64",
"user.language": "en",
"user.timezone": "GMT"
},
"metabase-info": {
"databases": [
"redshift",
"postgres"
],
"hosting-env": "unknown",
"application-database": "postgres",
"application-database-details": {
"database": {
"name": "PostgreSQL",
"version": "12.7"
},
"jdbc-driver": {
"name": "PostgreSQL JDBC Driver",
"version": "42.2.23"
}
},
"run-mode": "prod",
"version": {
"date": "2021-11-09",
"tag": "v0.41.2",
"branch": "release-x.41.x",
"hash": "ad599fd"
},
"settings": {
"report-timezone": "Australia/Sydney"
}
}
}

Logs -
[03908025-1ca8-4b0c-becd-fe3c4601667f] 2021-12-09T10:21:59+11:00 ERROR metabase.query-processor.middleware.catch-exceptions Error processing query: null
{:database_id 2,
:started_at #t "2021-12-08T23:21:42.667448Z[GMT]",
:via
[{:status :failed,
:class clojure.lang.ExceptionInfo,
:error "Error executing query",
:stacktrace
["--> driver.sql_jdbc.execute$execute_reducible_query$fn__52436.invoke(execute.clj:504)"
"driver.sql_jdbc.execute$execute_reducible_query.invokeStatic(execute.clj:501)"
"driver.sql_jdbc.execute$execute_reducible_query.invoke(execute.clj:487)"
"driver.sql_jdbc.execute$execute_reducible_query.invokeStatic(execute.clj:496)"
"driver.sql_jdbc.execute$execute_reducible_query.invoke(execute.clj:487)"
"driver.sql_jdbc$fn__85079.invokeStatic(sql_jdbc.clj:54)"
"driver.sql_jdbc$fn__85079.invoke(sql_jdbc.clj:52)"
"query_processor.context$executef.invokeStatic(context.clj:59)"
"query_processor.context$executef.invoke(context.clj:48)"
"query_processor.context.default$default_runf.invokeStatic(default.clj:68)"
"query_processor.context.default$default_runf.invoke(default.clj:66)"
"query_processor.context$runf.invokeStatic(context.clj:45)"
"query_processor.context$runf.invoke(context.clj:39)"
"query_processor.reducible$pivot.invokeStatic(reducible.clj:34)"
"query_processor.reducible$pivot.invoke(reducible.clj:31)"
"query_processor.middleware.mbql_to_native$mbql__GT_native$fn__49089.invoke(mbql_to_native.clj:25)"
"query_processor.middleware.check_features$check_features$fn__49575.invoke(check_features.clj:39)"
"query_processor.middleware.limit$limit$fn__47462.invoke(limit.clj:37)"
"query_processor.middleware.cache$run_query_with_cache.invokeStatic(cache.clj:177)"
"query_processor.middleware.cache$run_query_with_cache.invoke(cache.clj:168)"
"query_processor.middleware.cache$maybe_return_cached_results$fn__49949.invoke(cache.clj:203)"
"query_processor.middleware.optimize_temporal_filters$optimize_temporal_filters$fn__50942.invoke(optimize_temporal_filters.clj:204)"
"query_processor.middleware.validate_temporal_bucketing$validate_temporal_bucketing$fn__50986.invoke(validate_temporal_bucketing.clj:50)"
"query_processor.middleware.auto_parse_filter_values$auto_parse_filter_values$fn__49146.invoke(auto_parse_filter_values.clj:43)"
"query_processor.middleware.wrap_value_literals$wrap_value_literals$fn__39248.invoke(wrap_value_literals.clj:161)"
"query_processor.middleware.annotate$add_column_info$fn__43679.invoke(annotate.clj:615)"
"query_processor.middleware.permissions$check_query_permissions$fn__45617.invoke(permissions.clj:108)"
"query_processor.middleware.pre_alias_aggregations$pre_alias_aggregations$fn__50126.invoke(pre_alias_aggregations.clj:40)"
"query_processor.middleware.cumulative_aggregations$handle_cumulative_aggregations$fn__46015.invoke(cumulative_aggregations.clj:60)"
"query_processor.middleware.visualization_settings$update_viz_settings$fn__45953.invoke(visualization_settings.clj:63)"
"query_processor.middleware.resolve_joined_fields$resolve_joined_fields$fn__47060.invoke(resolve_joined_fields.clj:102)"
"query_processor.middleware.resolve_joins$resolve_joins$fn__50712.invoke(resolve_joins.clj:171)"
"query_processor.middleware.add_implicit_joins$add_implicit_joins$fn__50248.invoke(add_implicit_joins.clj:190)"
"query_processor.middleware.large_int_id$convert_id_to_string$fn__47079.invoke(large_int_id.clj:59)"
"query_processor.middleware.format_rows$format_rows$fn__50300.invoke(format_rows.clj:74)"
"query_processor.middleware.add_default_temporal_unit$add_default_temporal_unit$fn__46451.invoke(add_default_temporal_unit.clj:23)"
"query_processor.middleware.desugar$desugar$fn__45926.invoke(desugar.clj:21)"
"query_processor.middleware.binning$update_binning_strategy$fn__39148.invoke(binning.clj:229)"
"query_processor.middleware.resolve_fields$resolve_fields$fn__45260.invoke(resolve_fields.clj:34)"
"query_processor.middleware.add_dimension_projections$add_remapping$fn__49501.invoke(add_dimension_projections.clj:314)"
"query_processor.middleware.add_implicit_clauses$add_implicit_clauses$fn__49798.invoke(add_implicit_clauses.clj:147)"
"query_processor.middleware.upgrade_field_literals$upgrade_field_literals$fn__47447.invoke(upgrade_field_literals.clj:40)"
"query_processor.middleware.add_source_metadata$add_source_metadata_for_source_queries$fn__46828.invoke(add_source_metadata.clj:123)"
"query_processor.middleware.reconcile_breakout_and_order_by_bucketing$reconcile_breakout_and_order_by_bucketing$fn__50048.invoke(reconcile_breakout_and_order_by_bucketing.clj:100)"
"query_processor.middleware.auto_bucket_datetimes$auto_bucket_datetimes$fn__48398.invoke(auto_bucket_datetimes.clj:147)"
"query_processor.middleware.resolve_source_table$resolve_source_tables$fn__45241.invoke(resolve_source_table.clj:45)"
"query_processor.middleware.parameters$substitute_parameters$fn__48052.invoke(parameters.clj:111)"
"query_processor.middleware.resolve_referenced$resolve_referenced_card_resources$fn__45314.invoke(resolve_referenced.clj:79)"
"query_processor.middleware.expand_macros$expand_macros$fn__51370.invoke(expand_macros.clj:184)"
"query_processor.middleware.add_timezone_info$add_timezone_info$fn__47831.invoke(add_timezone_info.clj:15)"
"query_processor.middleware.splice_params_in_response$splice_params_in_response$fn__50315.invoke(splice_params_in_response.clj:32)"
"query_processor.middleware.resolve_database_and_driver$resolve_database_and_driver$fn__49809$fn__49813.invoke(resolve_database_and_driver.clj:31)"
"driver$do_with_driver.invokeStatic(driver.clj:60)"
"driver$do_with_driver.invoke(driver.clj:56)"
"query_processor.middleware.resolve_database_and_driver$resolve_database_and_driver$fn__49809.invoke(resolve_database_and_driver.clj:25)"
"query_processor.middleware.fetch_source_query$resolve_card_id_source_tables$fn__45864.invoke(fetch_source_query.clj:274)"
"query_processor.middleware.store$initialize_store$fn__46052$fn__46053.invoke(store.clj:11)"
"query_processor.store$do_with_store.invokeStatic(store.clj:44)"
"query_processor.store$do_with_store.invoke(store.clj:38)"
"query_processor.middleware.store$initialize_store$fn__46052.invoke(store.clj:10)"
"query_processor.middleware.validate$validate_query$fn__50098.invoke(validate.clj:10)"
"query_processor.middleware.normalize_query$normalize$fn__50105.invoke(normalize_query.clj:22)"
"query_processor.middleware.add_rows_truncated$add_rows_truncated$fn__47777.invoke(add_rows_truncated.clj:35)"
"query_processor.middleware.results_metadata$record_and_return_metadata_BANG_$fn__49075.invoke(results_metadata.clj:147)"
"query_processor.middleware.constraints$add_default_userland_constraints$fn__47795.invoke(constraints.clj:42)"
"query_processor.middleware.process_userland_query$process_userland_query$fn__50084.invoke(process_userland_query.clj:146)"
"query_processor.middleware.catch_exceptions$catch_exceptions$fn__50395.invoke(catch_exceptions.clj:169)"
"query_processor.reducible$async_qp$qp_STAR___42499$thunk__42500.invoke(reducible.clj:103)"
"query_processor.reducible$async_qp$qp_STAR___42499$fn__42502.invoke(reducible.clj:108)"],
:error_type :invalid-query,
:ex-data
{:sql
"-- /* partner: "metabase", {"dashboard_id":null,"chart_id":614,"optional_user_id":34,"optional_account_id":"fa0a25e5-f5f7-4a57-9e0e-b2be9ffdb643","filter_values":{}} / Metabase:: userID: 34 queryType: MBQL queryHash: 9ea7ed2b53917b602cbeadd3b5f0baa415704504e65fcfc055e0bf3bbbdca7d6\nSELECT date_trunc('quarter', CAST("source"."closedate" AS timestamp)) AS "closedate", sum("source"."sum") AS "sum" FROM (SELECT CAST("hubspot"."deals_properties"."closedate" AS date) AS "closedate", sum("hubspot"."deals_properties"."hs_arr") AS "sum" FROM "hubspot"."deals_properties" WHERE ("hubspot"."deals_properties"."closedate" > ? AND "hubspot"."deals_properties"."pipeline" = ? AND "hubspot"."deals_properties"."dealstage" = ?) GROUP BY CAST("hubspot"."deals_properties"."closedate" AS date) ORDER BY CAST("hubspot"."deals_properties"."closedate" AS date) ASC) "source" GROUP BY date_trunc('quarter', CAST("source"."closedate" AS timestamp)) ORDER BY date_trunc('quarter', CAST("source"."closedate" AS timestamp)) ASC",
:params ("2020-01-01" "2109039" "2109048"),
:type :invalid-query}}],
:state "57014",
:error_type :invalid-query,
:json_query
{:constraints {:max-results 10000, :max-results-bare-rows 2000},
:type :query,
:middleware {:js-int-to-string? true, :ignore-cached-results? false, :process-viz-settings? false},
:database 2,
:query
{:source-query
{:source-table 2728,
:expressions {:LTV [:
[:field 20870 nil] 10]},
:filter
[:and [:> [:field 20847 nil] "2020-01-01"] [:= [:field 21027 nil] "2109039"] [:= [:field 21025 nil] "2109048"]],
:aggregation [[:sum [:field 21004 nil]]],
:breakout [[:field 20847 {:temporal-unit :day}]]},
:aggregation [[:sum [:field "sum" {:base-type :type/Float}]]],
:breakout [[:field "closedate" {:base-type :type/Text, :temporal-unit :quarter}]]},
:parameters [],
:async? true,
:cache-ttl 28},
:native
{:query
"SELECT date_trunc('quarter', CAST("source"."closedate" AS timestamp)) AS "closedate", sum("source"."sum") AS "sum" FROM (SELECT CAST("hubspot"."deals_properties"."closedate" AS date) AS "closedate", sum("hubspot"."deals_properties"."hs_arr") AS "sum" FROM "hubspot"."deals_properties" WHERE ("hubspot"."deals_properties"."closedate" > ? AND "hubspot"."deals_properties"."pipeline" = ? AND "hubspot"."deals_properties"."dealstage" = ?) GROUP BY CAST("hubspot"."deals_properties"."closedate" AS date) ORDER BY CAST("hubspot"."deals_properties"."closedate" AS date) ASC) "source" GROUP BY date_trunc('quarter', CAST("source"."closedate" AS timestamp)) ORDER BY date_trunc('quarter', CAST("source"."closedate" AS timestamp)) ASC",
:params ("2020-01-01" "2109039" "2109048")},
:status :failed,
:class com.amazon.redshift.util.RedshiftException,
:stacktrace
["com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2579)"
"com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2253)"
"com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1880)"
"com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1872)"
"com.amazon.redshift.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)"
"com.amazon.redshift.jdbc.RedshiftStatementImpl.executeInternal(RedshiftStatementImpl.java:514)"
"com.amazon.redshift.jdbc.RedshiftStatementImpl.execute(RedshiftStatementImpl.java:435)"
"com.amazon.redshift.jdbc.RedshiftPreparedStatement.executeWithFlags(RedshiftPreparedStatement.java:200)"
"com.amazon.redshift.jdbc.RedshiftPreparedStatement.executeQuery(RedshiftPreparedStatement.java:115)"
"com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:431)"
"--> driver.sql_jdbc.execute$fn__52354.invokeStatic(execute.clj:364)"
"driver.sql_jdbc.execute$fn__52354.invoke(execute.clj:362)"
"driver.sql_jdbc.execute$execute_statement_or_prepared_statement_BANG_.invokeStatic(execute.clj:377)"
"driver.sql_jdbc.execute$execute_statement_or_prepared_statement_BANG_.invoke(execute.clj:373)"
"driver.sql_jdbc.execute$execute_reducible_query$fn__52436.invoke(execute.clj:502)"
"driver.sql_jdbc.execute$execute_reducible_query.invokeStatic(execute.clj:501)"
"driver.sql_jdbc.execute$execute_reducible_query.invoke(execute.clj:487)"
"driver.sql_jdbc.execute$execute_reducible_query.invokeStatic(execute.clj:496)"
"driver.sql_jdbc.execute$execute_reducible_query.invoke(execute.clj:487)"
"driver.sql_jdbc$fn__85079.invokeStatic(sql_jdbc.clj:54)"
"driver.sql_jdbc$fn__85079.invoke(sql_jdbc.clj:52)"
"query_processor.context$executef.invokeStatic(context.clj:59)"
"query_processor.context$executef.invoke(context.clj:48)"
"query_processor.context.default$default_runf.invokeStatic(default.clj:68)"
"query_processor.context.default$default_runf.invoke(default.clj:66)"
"query_processor.context$runf.invokeStatic(context.clj:45)"
"query_processor.context$runf.invoke(context.clj:39)"
"query_processor.reducible$pivot.invokeStatic(reducible.clj:34)"
"query_processor.reducible$pivot.invoke(reducible.clj:31)"
"query_processor.middleware.mbql_to_native$mbql__GT_native$fn__49089.invoke(mbql_to_native.clj:25)"
"query_processor.middleware.check_features$check_features$fn__49575.invoke(check_features.clj:39)"
"query_processor.middleware.limit$limit$fn__47462.invoke(limit.clj:37)"
"query_processor.middleware.cache$run_query_with_cache.invokeStatic(cache.clj:177)"
"query_processor.middleware.cache$run_query_with_cache.invoke(cache.clj:168)"
"query_processor.middleware.cache$maybe_return_cached_results$fn__49949.invoke(cache.clj:203)"
"query_processor.middleware.optimize_temporal_filters$optimize_temporal_filters$fn__50942.invoke(optimize_temporal_filters.clj:204)"
"query_processor.middleware.validate_temporal_bucketing$validate_temporal_bucketing$fn__50986.invoke(validate_temporal_bucketing.clj:50)"
"query_processor.middleware.auto_parse_filter_values$auto_parse_filter_values$fn__49146.invoke(auto_parse_filter_values.clj:43)"
"query_processor.middleware.wrap_value_literals$wrap_value_literals$fn__39248.invoke(wrap_value_literals.clj:161)"

Is there any recommendation or suggestion that we can use to improve the performnace. Is there any configuration changes required or missing that we can add that will help us to reduce performce related issues.

Thankyou

Thanks for the detailed info. First of all, update your Metabase deployment to 41.3.1 as having the latest and greatest is the only way in which you can get the latest bugfixes.

Are you running the latest Elastic Beanstalk version or the previous one?

The logs you posted are for an error on a question, which doesn't bring the application down, so I'm guessing that what's happening here is that Elastic Beanstalk triggers a scale out event and the ALB tries to check for the instance health but it's too soon as the application hasn't started (Elastic Beanstalk needs to start a VM, spin up a container and then the container needs to be ready).

regards,
Luis

Thankyou Luis for your help.
We will update the version and monitor it.

Just one thing. I tried to access the query mentioned in the log aand there were no issue and AWS didnt throw any error. Any idea what could be wrong here or if you could help me in understand the error mentioned on the logs.

Thankyou

AWS won't throw any errors as the error belongs to Metabase and not the infrastructure. The source of the error seems to be an invalid query