Autocomplete_suggestions request takes more than 10 minutes to complete

Hello.

First of all, thank you very much for providing such an awesome tool - we absolutely love it at our company.

We recently implemented partitioning in our Postgres database, which Metabase queries. It has been implemented on 3 tables with around 2000 partitions each, for a total of 6000 to 7000 partitions. I think it may have caused Metabase to slow down on this request:

GET /api/databases/:dbid/autocomplete_suggestions

Looking at the Metabase access logs, some requests take more than 10 minutes to complete.

We have also installed the citus extension for its columnar access method and converted some tables to columnar storage. I really don't think that's the problem, but it's still worth mentioning since it's a big change to the database structure.

We are running Metabase 0.39.3 and Postgres 12.7.

Any ideas of what may be happening?

Thank you very much!


Hi @heits
There's no /api/databases/:dbid endpoint in Metabase's API, so that looks like another change that you have made.

Have a look at the query log on Postgres to see what is happening, then prepend EXPLAIN ANALYZE to the slow query to see more details. Perhaps you can rewrite the request or query, or optimize your database in a different way.
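
For example, something along these lines (illustrative only, and it assumes you can change the server configuration as a superuser) would make Postgres log every statement that runs longer than one second, which you could then prepend with EXPLAIN ANALYZE:

-- Illustrative only: log statements slower than 1000 ms on the Postgres server,
-- then reload the configuration so the setting takes effect.
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();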

There's a big difference between your data warehouse, which contains your data, and the Metabase application database.
You should not try to change the structure of the Metabase application database.

Hi @flamber, thanks for the fast reply.

Sorry, I typed it by hand and made a mistake. The log is copied below, and I edited my question to be more precise.

Nothing interesting is going on in Postgres. I cannot add EXPLAIN ANALYZE because I can't trace these slow API calls to a specific Postgres query.

We have separate databases for the warehouse and for Metabase's application db. However, absolutely nothing changed on Metabase's application db, so I assume the problem is with the warehouse and the number of partitions, but that's a complete guess.

Access log:

2021-06-21 19:33:04,298 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 56.2 s (6 DB calls) App DB connections: 9/15 Jetty threads: 48/50 (0 idle, 25 queued) (208 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:37:14,186 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 5.7 mins (6 DB calls) App DB connections: 5/15 Jetty threads: 50/50 (0 idle, 139 queued) (214 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:37:38,766 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 6.2 mins (6 DB calls) App DB connections: 7/15 Jetty threads: 50/50 (0 idle, 153 queued) (214 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:37:45,153 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 6.2 mins (6 DB calls) App DB connections: 4/15 Jetty threads: 50/50 (0 idle, 155 queued) (216 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:38:24,540 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 6.7 mins (6 DB calls) App DB connections: 5/15 Jetty threads: 50/50 (0 idle, 178 queued) (218 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:38:37,175 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 6.9 mins (6 DB calls) App DB connections: 5/15 Jetty threads: 50/50 (0 idle, 186 queued) (218 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:40:41,666 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 7.6 mins (6 DB calls) App DB connections: 4/15 Jetty threads: 50/50 (0 idle, 271 queued) (222 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:40:54,317 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 9.4 mins (6 DB calls) App DB connections: 5/15 Jetty threads: 50/50 (0 idle, 281 queued) (222 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:41:51,749 INFO middleware.exceptions :: Request canceled before finishing.
2021-06-21 19:41:51,751 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions null 11.5 mins (6 DB calls) App DB connections: 4/15 Jetty threads: 50/50 (0 idle, 305 queued) (224 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:42:09,664 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 10.6 mins (6 DB calls) App DB connections: 4/15 Jetty threads: 49/50 (0 idle, 315 queued) (224 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:42:35,078 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 11.0 mins (6 DB calls) App DB connections: 4/15 Jetty threads: 49/50 (0 idle, 325 queued) (225 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:43:10,566 INFO middleware.exceptions :: Request canceled before finishing.
2021-06-21 19:43:33,017 INFO middleware.exceptions :: Request canceled before finishing.
2021-06-21 19:43:33,096 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions null 13.0 mins (6 DB calls) App DB connections: 9/15 Jetty threads: 49/50 (0 idle, 326 queued) (228 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:43:58,248 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 8.8 mins (6 DB calls) App DB connections: 9/15 Jetty threads: 48/50 (0 idle, 333 queued) (228 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:44:11,491 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions 200 12.5 mins (6 DB calls) App DB connections: 9/15 Jetty threads: 48/50 (0 idle, 339 queued) (228 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:45:18,904 INFO middleware.exceptions :: Request canceled before finishing.
2021-06-21 19:45:18,905 INFO middleware.exceptions :: Request canceled before finishing.
2021-06-21 19:45:18,907 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions null 14.9 mins (6 DB calls) App DB connections: 9/15 Jetty threads: 49/50 (0 idle, 325 queued) (232 total active threads) Queries in flight: 0 (0 queued)
2021-06-21 19:45:30,511 INFO middleware.exceptions :: Request canceled before finishing.
2021-06-21 19:45:51,808 INFO middleware.exceptions :: Request canceled before finishing.
2021-06-21 19:46:02,394 INFO middleware.exceptions :: Request canceled before finishing.
2021-06-21 19:46:02,396 DEBUG middleware.log :: GET /api/database/5/autocomplete_suggestions null 15.5 mins (6 DB calls) App DB connections: 7/15 Jetty threads: 48/50 (0 idle, 345 queued) (232 total active threads) Queries in flight: 0 (0 queued)

Thank you very much!

@heits Okay, let's get some information.

  1. Post "Diagnostic Info" from Admin > Troubleshooting
  2. How are you running Metabase - JAR, Docker, Kubernetes, etc. - with a load-balancer?
  3. How many databases are configured in Metabase > Admin > Databases? And how many schemas, tables and columns does each of those databases have? You can run this on the application database:
SELECT "metabase_database"."id" AS "database_id", "metabase_database"."engine" AS "database_engine", count(distinct "metabase_table"."schema") AS "count_schemas", "metabase_table"."active" AS "table_active", count(distinct "metabase_table"."id") AS "count_tables", "metabase_field"."active" AS "field_active", count(distinct "metabase_field"."id") AS "count_columns"
FROM "metabase_database"
LEFT JOIN "metabase_table" ON "metabase_database"."id" = "metabase_table"."db_id"
LEFT JOIN "metabase_field" ON "metabase_table"."id" = "metabase_field"."table_id"
GROUP BY "metabase_database"."id", "metabase_database"."engine", "metabase_table"."active", "metabase_field"."active"
  4. Did the problem start after you made a change on your warehouse, or was it perhaps after an upgrade of Metabase? Tracking down when the problem started is very important for figuring out possible root causes.
  5. The autocomplete_suggestions API does not make calls to your warehouse, only to the application database, and the requests should be below 200ms - even on big instances. A rough sketch of the lookup follows below the link.
    https://github.com/metabase/metabase/blob/master/src/metabase/api/database.clj#L310
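
To be clear about what that endpoint does: it only reads the synced metadata in the application database. Conceptually it is a prefix search over metabase_table and metabase_field - the following is just an illustration of the shape of the lookup (with 'ord' standing in for whatever the user has typed), not the exact query Metabase runs:

-- Illustration only: a prefix search over the synced metadata for database 5.
SELECT t.name
FROM metabase_table t
WHERE t.db_id = 5
  AND t.active = true
  AND lower(t.name) LIKE 'ord%';

SELECT f.name, t.name AS table_name
FROM metabase_field f
JOIN metabase_table t ON t.id = f.table_id
WHERE t.db_id = 5
  AND f.active = true
  AND lower(f.name) LIKE 'ord%';

So the response time and payload size scale with how many metadata rows match the prefix, not with anything in the warehouse itself.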

Hey, @flamber.

  1. Here it goes:
{
  "browser-info": {
    "language": "en-US",
    "platform": "Linux x86_64",
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "Java(TM) SE Runtime Environment",
    "java.runtime.version": "1.8.0_191-b12",
    "java.vendor": "Oracle Corporation",
    "java.vendor.url": "http://java.oracle.com/",
    "java.version": "1.8.0_191",
    "java.vm.name": "Java HotSpot(TM) 64-Bit Server VM",
    "java.vm.version": "25.191-b12",
    "os.name": "Linux",
    "os.version": "4.4.0-1111-aws",
    "user.language": "en",
    "user.timezone": "Etc/UTC"
  },
  "metabase-info": {
    "databases": [
      "postgres"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "12.5"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.2.18"
      }
    },
    "run-mode": "prod",
    "version": {
      "tag": "v0.39.3",
      "date": "2021-05-27",
      "branch": "release-x.39.x",
      "hash": "598a112"
    },
    "settings": {
      "report-timezone": null
    }
  }
}
  2. We are running the Metabase JAR on AWS, on a t3.medium EC2 instance. It's a single node and it sits behind an Application Load Balancer.

  3. There's only one warehouse database, besides the Metabase application database itself, which we also map so that we can query some Metabase details such as average query duration. The output of the query is below:

 database_id | database_engine | count_schemas | table_active | count_tables | field_active | count_columns 
-------------+-----------------+---------------+--------------+--------------+--------------+---------------
           5 | postgres        |             4 | f            |         4965 | f            |        130095
           5 | postgres        |             7 | f            |        19636 | t            |        576995
           5 | postgres        |             1 | f            |            4 |              |             0
           5 | postgres        |             5 | t            |           25 | f            |           208
           5 | postgres        |             8 | t            |         6245 | t            |        172874
           5 | postgres        |             1 | t            |            1 |              |             0
           7 | postgres        |             1 | f            |            4 | t            |            12
           7 | postgres        |             1 | t            |            3 | f            |             4
           7 | postgres        |             1 | t            |           52 | t            |           439
           7 | postgres        |             1 | t            |            5 |              |             0
  4. One week we added a huge number of partitions to our warehouse as well as some columnar tables (citus extension), and the next week we upgraded Metabase to 0.39.3. I'm not sure which of those started the issue.

  5. Ok, so when it happens again I will post here the output of pg_stat_activity for the application database (something like the query below).
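
For reference, I plan to capture it with a query along these lines ('metabase' is just a placeholder for our actual appdb name):

-- Snapshot of what is currently running on the application database.
SELECT pid,
       state,
       now() - query_start AS running_for,
       left(query, 120)    AS query
FROM pg_stat_activity
WHERE datname = 'metabase'
  AND state <> 'idle'
ORDER BY running_for DESC NULLS LAST;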

Thank you once again!

@heits Okay, so I think I can see the real problem. It's not often that I see someone with 880k columns in a database, where it looks like 600k of those are not active anymore.

If you open Native query > select the Metabase appdb > and start typing something like meta - it shows the suggestions quite fast, right?

It looks like there's no limit on the number of suggestions that can be returned.
Though, when I test on one of my instances with 190k columns, the request finishes within 5 seconds - but I only have ~80k columns that start with a "t".

Something makes me think that either your Postgres appdb doesn't have enough resources, or some indexes are missing, which would cause slow temporary tables.
Though the 10-15 minute delay seems quite excessive.

Can you check the Network tab in your browser's developer tools when it makes a slow suggestion request - what is the size of the response? Mine was "only" 210.29 KB transferred (3.68 MB unzipped) for ~80k columns in about 5 seconds.

@flamber I ran a somewhat crude test: I typed SELECT * FROM <letter>, replacing <letter> with each letter a-z. The requests are actually pretty slow, but not minutes. The returned payloads are also pretty huge, especially when you add them all up.

You mentioned:

where it looks like 600k of those are not active anymore.

What does "not active anymore" mean? Is there a way to simply erase inactive columns/tables?

Also, if you think this might be solved by indexing, would you mind providing me with the CREATE INDEX statements?

Edit: I checked the appdb's CPU usage and network transmit metrics, and both are very high during the incidents. The problem may actually be related to the number of returned rows, in which case indexing would not help.

Edit 2: I saved the response for the b prefix with curl and found that the gzipped payload is 293 kB but the raw payload is 5.7 MB. That's huge haha.

Thank you.

@flamber I think there's more information I can provide. We actually drop and recreate thousands of partitions once a day, which may be confusing Metabase. I ran SELECT count(*) FROM information_schema.columns in our warehouse database and it returned 175k columns, far from the 880k reported by metabase_field. Is it enough to simply ask Metabase to re-sync the schema, or must I also re-scan field values? Or will neither work?

Also, I should mention that I see a lot of lines like the one below in the log:

2021-06-22 17:02:50,299 WARN sync.describe-table :: Don't know how to map column type '_text' to a Field base_type, falling back to :type/*.

Thanks

@heits Okay.

I need to understand how you ended up with 15-minute requests - do you know what the input was here?
That's the only interesting part.

What you are showing with 5-50 seconds is exactly what I would expect - with the current bug (no limits).

Metabase has indexed 880k columns for this database over time. You currently have 172,874 active columns.
Every time Metabase syncs and sees a new table/column, it's added to the metadata; when a table/column is no longer in the database, Metabase sets it to active=false instead of deleting it.

You can manually delete those tables and columns from the metadata if you are sure you don't want them and they are not linked to questions, which might otherwise break. If you had any questions based on these active=false tables/columns, they wouldn't run anyway.

Make sure you have a backup before you remove them!
You would look in the tables metabase_table and metabase_field - see the sketch below.
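
As a rough sketch - and only after you have a backup - the cleanup could look something like this. Depending on your Metabase version there may be dependent rows (cached field values, etc.) that need to be removed first or that cascade automatically, so test it on a copy:

-- Rough sketch only: remove metadata that Metabase has marked inactive.
BEGIN;
-- Drop fields that are inactive or that belong to inactive tables.
DELETE FROM metabase_field
WHERE active = false
   OR table_id IN (SELECT id FROM metabase_table WHERE active = false);
-- Then drop the inactive tables themselves.
DELETE FROM metabase_table
WHERE active = false;
COMMIT;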

As for _text - do you have any columns with that type? That's not a column type I would expect to see; it should normally just be text.
Check your warehouse information_schema, since it might hint at why it's returning that type to Metabase.

Not sure what made it take 15 minutes. Is there any way to make Metabase log request query parameters?

Even 50 seconds seems like too much, because it eats up a whole CPU core for that long.

All those inactive columns are pretty strange. Maybe I will try to manually delete the old entries later.

I ran SELECT DISTINCT data_type FROM information_schema.columns in the warehouse db and no types with a _ prefix were returned. That's very strange.

@heits

No, the URL parameters are stripped somewhere for security/privacy. You would have to enable trace logging, but you would drown in output because Jetty is so noisy. Since you have a load balancer, I would guess you can find the information there.

50 seconds makes sense if the appdb is a little short on resources, some other queries were running at the same time, and considering that you have 10 times as many columns as I have on my test instance.
That doesn't make it right - it's a bug, but I already know the reason behind that.
I'm much more interested in the reason for the 15-minute problem, since that would make it a higher-priority bug.

Since I don't know which columns you're seeing the issue on, it's hard to tell. Check metabase_field.database_type - something like the query below should show which tables those columns belong to.
Perhaps you're using a View or something like that, which can distort the actual types that Metabase sees.
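
Something along these lines (illustrative only) should show where those types were picked up. As an aside, Postgres catalogs report array columns with an underscore-prefixed internal type name - text[] shows up as _text - so array columns in the warehouse would be one possible explanation:

-- Illustration only: find synced columns whose reported type starts with '_'.
-- The backslash escapes the underscore, which is otherwise a LIKE wildcard.
SELECT t."schema",
       t.name AS table_name,
       f.name AS column_name,
       f.database_type
FROM metabase_field f
JOIN metabase_table t ON t.id = f.table_id
WHERE f.database_type LIKE '\_%'
ORDER BY 1, 2, 3;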

Does it make sense to add a limit, say 1,000 autocomplete results, to the autocomplete_suggestions endpoint?

Check this output:

metabase=> select distinct database_type from metabase_field;
 database_type 
---------------
 jsonb
 int4
 bool
 timestamp
 regclass
 text
 int2
 time
 geometry
 float8
 numeric
 date
 bpchar
 citext
 _float8
 bytea
 serial
 _int8
 oid
 _text
 int8
 timestamptz
 _bool
 name
 varchar
(25 rows)

These types with a _ prefix do not show up on the Postgres side.

Also, I deleted the inactive tables and columns and the server is running much more smoothly. Still, autocompleting with prefix=p returns 2.8 MB of raw data, which is way too much.

@heits It's difficult when you're mixing completely different issues into one topic.

That is what Metabase receives from the database when it does a sync query. But if you pick one of those _-prefixed columns, check it in the Postgres warehouse, and provide the DDL of that table, then perhaps we can figure out why you are seeing this and create a fix for it.


I feel like I'm repeating myself. Yes, I know it's returning a lot of data in the autocomplete - as I already said, the query does not have any limit on it, so it will return everything starting with that prefix.
I will create an issue about that, but it's a minor problem compared to requests that take 15 minutes.

Sorry, I didn't mean to mix anything up - the _text warning is related to the metabase_field table, which is the one slowing down the queries and returning many rows. I am not familiar with Metabase's implementation details, so I thought it was a good idea to share it.

Maybe when Metabase is under load and many users run huge autocomplete_suggestions requests, we end up with queries taking 15 minutes. The log is all the information I have, as I immediately kill the JVM and restart it when this happens, because the whole application becomes unresponsive.

I think limiting the results should be enough.

Please let me know if there's any information I can provide to help with debugging.

For now, it seems like deleting the inactive tables and fields has made things better. If 15-minute requests happen again, I will come back here.

Thank you very much.

@heits It would be great if you checked your ALB logs, since they should hold full URLs - as far as I know.

I think I can see the problem - but I don't know how many concurrent users you have.
You are running out of Jetty threads - the default max is 50 - and I would guess that with 40 people actively using Metabase, it only takes a couple of users opening new tabs to hit the limit.
Also make sure you are using HTTP/2, which you definitely should be:
https://www.metabase.com/learn/data-diet/analytics/metabase-at-scale.html

@heits And there's now an issue open for limiting the results:
https://github.com/metabase/metabase/issues/16736 - upvote by clicking :+1: on the first post

@flamber thank you very much, just upvoted it!