Slow card load after upgrade to 0.42.1

Hello,

We attempted to upgrade our Metabase instance from 0.40.5 to 0.42.1.

After the upgrade, some of the reports take a long time (1 minute) to load - they are stuck on the "Loading..." screen.

Other findings:

  1. When I duplicate the same report, it loads instantly.

  2. When I call the API GET /api/card/:id, it is slow as well - it takes 1 minute

  3. I looked in the logs in Admin > Troubleshooting > Logs and haven't found any error specific to that report (only that a required parameter is missing)

We are out of ideas on how to debug this, so I wondered if anybody else has encountered the same behavior as we have.

Thanks a lot,
Matus


Here is a detail of the response time when calling the API:

Hi @matusp
Post "Diagnostic Info" from Admin > Troubleshooting.
Something tells me that your application database somehow did not get the index:
https://github.com/metabase/metabase/issues/18759

Hi @flamber, here it is:

{
  "browser-info": {
    "language": "en-US",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.14.1+1",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.14.1",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.14.1+1",
    "os.name": "Linux",
    "os.version": "5.4.95-42.163.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "UTC"
  },
  "metabase-info": {
    "databases": [
      "postgres",
      "snowflake",
      "mysql"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "12.7"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.2.23"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2022-02-17",
      "tag": "v0.42.1",
      "branch": "release-x.42.x",
      "hash": "629f4de"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

@matusp Check the Metabase application database - look up the exec_type of databasechangelog where id=381

Make sure that you actually have an index for query_execution.card_id called idx_query_execution_card_id
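
One way to check, as a sketch against the Postgres application database (the pg_indexes system view lists every index on a table):

-- Sketch: confirm the card_id index exists on query_execution
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'query_execution'
  AND indexname = 'idx_query_execution_card_id';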

Otherwise you might need to look in your Postgres database query logs and run an EXPLAIN ANALYZE on the queries being executed when accessing /api/card/1783


@flamber

  1. When I run:
    select exectype from databasechangelog where id='381';
    I get this response: EXECUTED

  2. We do have the idx_query_execution_card_id index on query_execution

  3. We will look into this, but wanted to update you on the first two items

@matusp While you're looking into queries, can you provide some numbers about your setup - specifically the row counts of your tables:

with tbl as (select table_schema, table_name from information_schema.tables where table_name not like 'pg_%' and table_schema != 'information_schema' and table_schema in ('public')) --change 'public' if Metabase appdb is in another schema
select
  table_schema
  , table_name
  , pg_total_relation_size('"'||table_schema||'"."'||table_name||'"') size_total
  , pg_relation_size('"'||table_schema||'"."'||table_name||'"') size_tbl
  , (xpath('/row/c/text()', query_to_xml(format('select count(*) as c from %I.%I', table_schema, table_name), false, true, '')))[1]::text::bigint count_rows
from tbl
order by 5 desc nulls last;

@flamber Here it is:

@flamber Here are the queries that we identified as those that were triggered, and their EXPLAIN ANALYZE.

metabase=> EXPLAIN ANALYZE SELECT * FROM "report_card" WHERE "id" = 1783 LIMIT 1;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..8.30 rows=1 width=1533) (actual time=0.019..0.020 rows=1 loops=1)
   ->  Index Scan using pk_report_card on report_card  (cost=0.28..8.30 rows=1 width=1533) (actual time=0.018..0.018 rows=1 loops=1)
         Index Cond: (id = 1783)
 Planning Time: 0.685 ms
 Execution Time: 0.067 ms
(5 rows)

metabase=> EXPLAIN ANALYZE SELECT count(*) AS "count" FROM "report_dashboardcard" WHERE "card_id" = 1783 LIMIT 1
metabase-> ;
                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13.64..13.65 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1)
   ->  Aggregate  (cost=13.64..13.65 rows=1 width=8) (actual time=0.021..0.022 rows=1 loops=1)
         ->  Index Only Scan using idx_dashboardcard_card_id on report_dashboardcard  (cost=0.28..13.63 rows=4 width=0) (actual time=0.016..0.019 rows=4 loops=1)
               Index Cond: (card_id = 1783)
               Heap Fetches: 4
 Planning Time: 0.169 ms
 Execution Time: 0.055 ms
(7 rows)

metabase=> EXPLAIN ANALYZE SELECT avg("running_time") FROM "query_execution" WHERE ("running_time" IS NOT NULL AND "cache_hit" <> TRUE AND "card_id" = 1783);
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1305989.21..1305989.22 rows=1 width=32) (actual time=73862.329..73862.702 rows=1 loops=1)
   ->  Gather  (cost=1305988.99..1305989.20 rows=2 width=32) (actual time=73860.805..73862.682 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=1304988.99..1304989.00 rows=1 width=32) (actual time=73856.824..73856.825 rows=1 loops=3)
               ->  Parallel Seq Scan on query_execution  (cost=0.00..1303647.98 rows=536404 width=4) (actual time=17600.196..73777.251 rows=323347 loops=3)
                     Filter: ((running_time IS NOT NULL) AND (NOT cache_hit) AND (card_id = 1783))
                     Rows Removed by Filter: 16932989
 Planning Time: 0.327 ms
 Execution Time: 73862.743 ms
(10 rows)

metabase=> EXPLAIN ANALYZE SELECT max("started_at") FROM "query_execution" WHERE ("running_time" IS NOT NULL AND "cache_hit" <> TRUE AND "card_id" = 1783);
                                                                                    QUERY PLAN                                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=2.61..2.62 rows=1 width=8) (actual time=0.060..0.061 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.56..2.61 rows=1 width=8) (actual time=0.057..0.058 rows=1 loops=1)
           ->  Index Scan Backward using idx_query_execution_started_at on query_execution  (cost=0.56..2637165.10 rows=1287369 width=8) (actual time=0.057..0.057 rows=1 loops=1)
                 Index Cond: (started_at IS NOT NULL)
                 Filter: ((running_time IS NOT NULL) AND (NOT cache_hit) AND (card_id = 1783))
                 Rows Removed by Filter: 149
 Planning Time: 0.216 ms
 Execution Time: 0.092 ms

Indexes on query_execution:

metabase=> SELECT
metabase->     tablename,
metabase->     indexname,
metabase->     indexdef
metabase-> FROM
metabase->     pg_indexes
metabase->     where tablename = 'query_execution';
    tablename    |                 indexname                 |                                                    indexdef                                                     
-----------------+-------------------------------------------+-----------------------------------------------------------------------------------------------------------------
 query_execution | pk_query_execution                        | CREATE UNIQUE INDEX pk_query_execution ON public.query_execution USING btree (id)
 query_execution | idx_query_execution_query_hash_started_at | CREATE INDEX idx_query_execution_query_hash_started_at ON public.query_execution USING btree (started_at)
 query_execution | idx_query_execution_started_at            | CREATE INDEX idx_query_execution_started_at ON public.query_execution USING btree (started_at)
 query_execution | idx_query_execution_card_id_started_at    | CREATE INDEX idx_query_execution_card_id_started_at ON public.query_execution USING btree (card_id, started_at)
 query_execution | idx_query_execution_card_id               | CREATE INDEX idx_query_execution_card_id ON public.query_execution USING btree (card_id)

@matusp What's up with your Postgres? It says that there are 596 rows in query_execution, but your explain says there are 17 million rows.

Could it be that it has never been optimized (VACUUM)?
Or perhaps it's running on an instance that doesn't have enough memory to keep the indexes in memory, so it's actually swapping when looking through the indexes?
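
A minimal maintenance pass, as a sketch (run against the Metabase application database; ANALYZE refreshes the planner statistics that the row counts above come from):

-- Sketch: reclaim dead tuples and refresh planner statistics
-- for the query_execution table in the Metabase app database.
VACUUM (VERBOSE, ANALYZE) query_execution;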

If you don't need the historical execution data, you can truncate query_execution or just remove anything older than a month.
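
For example, a cleanup along these lines (a sketch, assuming started_at is the execution timestamp, as in the indexes above; back up the application database first):

-- Sketch: keep only the last month of execution history.
DELETE FROM query_execution
WHERE started_at < now() - interval '1 month';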
Alternatively, add another index:

CREATE INDEX idx_query_execution_card_id_running_time ON public.query_execution USING btree (card_id, running_time)

@flamber Thank you so much for your help.

  1. In which version was the SELECT avg("running_time") FROM "query_execution" WHERE .. query added, so that it runs every time a report is opened? It seems this wasn't present in 0.40.5 and was added in a later version.

  2. We tried to add the index, but it did not help. The query is ignoring all indexes.

    metabase=> EXPLAIN ANALYZE SELECT avg("running_time") FROM "query_execution" WHERE ("running_time" IS NOT NULL AND "cache_hit" <> TRUE AND "card_id" = 1783);
                                                               QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------------------
     Finalize Aggregate  (cost=1364121.92..1364121.93 rows=1 width=32) (actual time=145712.603..145731.305 rows=1 loops=1)
       ->  Gather  (cost=1364121.70..1364121.91 rows=2 width=32) (actual time=145691.014..145729.658 rows=3 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Partial Aggregate  (cost=1363121.70..1363121.71 rows=1 width=32) (actual time=145676.617..145676.637 rows=1 loops=3)
                   ->  Parallel Seq Scan on query_execution  (cost=0.00..1361746.61 rows=550034 width=4) (actual time=2.655..145245.645 rows=340415 loops=3)
                         Filter: ((running_time IS NOT NULL) AND (NOT cache_hit) AND (card_id = 1783))
                         Rows Removed by Filter: 17735437
     Planning Time: 3.171 ms
     Execution Time: 145733.558 ms
    (10 rows)

@matusp
In 0.41.0, with fixes in 0.41.2, as I wrote in my first comment.

You didn't answer most of my questions in the previous comment.

@flamber I only have limited information, since a lot of the DevOps work is out of my scope.

But we were able to fix this by adding this index:
CREATE INDEX idx_query_execution_card_id_running_time_cache_hit ON public.query_execution USING btree (card_id, running_time, cache_hit)

And we also deleted everything older than 3 months from the query_execution table.

Thank you for your help!