How does Metabase create autocomplete suggestions?

Hello everyone

I have a field filter in a column which has more than 100 distinct values, so Metabase created a search box for it. I have the following questions:

  1. How does Metabase fetch values for autocomplete suggestions? Does it run a query against the database every time the input in the search box changes?
  2. What type of query does Metabase generate? (is it an exact match query or LIKE ‘%variable%’ query?) Is it possible for the user to change the type of query?
  3. Are there any suggestions or best practices on tuning the database for such kind of filters? (I think it’s good to have an index which is good for text search on the filtered column)

Hi @MusicMaker1337

  1. Metabase analysis your database, when it’s added as a data source. This is part of the sync+scan process and can be configured:
    https://www.metabase.com/docs/latest/administration-guide/01-managing-databases.html#database-sync-and-analysis
  2. Currently Metabase does a LIKE 'search%' and it’s not configurable. There’s many problems with doing a LIKE '%search%', since it would break indexes, which could be very heavy on larger data sets.
  3. The query is done against the Metabase application database - not data source - so the query are kept in a way that should not be breaking the indexes.

This is not the same as a regular filter, which allows you to search the data source and specify how it searches.

Hi @flamber
Thank you so much for the reply

I didn’t quite understand the part about caching the values of the fields. Is it correct to say that Metabase performs a SELECT DISTINCT on the fields and puts the result into its own database? Is it also correct to say that if I disable caching the values the field filters will not work?

@MusicMaker1337 You’re welcome :wink:

Yes to the SELECT DISTINCT (with a limit of 5k values if I remember correctly).
And kinda yes to the second, disabling scan (and clearing any values) will not disable the Field Filter, but you won’t have any autocomplete or dropdown, just a text input.

When Metabase does analysis of your data, it’s called sync+scan, instead of caching, since Metabase also has a caching function used for the results (not related to your other questions):
https://www.metabase.com/docs/latest/administration-guide/14-caching.html

I’m trying to answer your questions without confusing too much, so I might have over simplified some of it. Haven’t seen a question like this before I think.

@flamber thanks, now it’s clear for me :slightly_smiling_face:

@flamber since you mentioned the 5k limit, I have another question: what happens if a column has more than 5k distinct values? Will some values be “lost” and not shown as suggestions in the search box?

@MusicMaker1337 I’m not sure. I’m going to have to see if @sbelak could answer that:
When doing scan, do we only save the first 5k distinct values for autocomplete or do we do some pagination of all the data (or is that only a thing that was being discussed at some point, and I’m getting confused about reality)?

@flamber @sbelak
The 5k limit is mentioned in this file. As far as I understood, Metabase first runs the query against its internal database and then limits the result to 5k in order not to explode. The question is whether the internal database is initially populated by all the distinct values. Please correct me if I’m wrong.

Let me also just clarify the reason why I’m asking all these questions, since it may help to solve my use-case in some other way. I have a column with 8k distinct values and I want to be able to filter by it. A simple text input widget would be ok for me (even without autocomplete suggestions), but the problem is that I want this filter to be multi-select, and at this moment only 1 value can be passed to the text input widget.

@MusicMaker1337 You can give multiple inputs (even with “Plain text input”) by using comma. But it’s not as intuitive as setting the “Filtering on this field” to Search (Admin > Data Model > db > table > column :gear:)
The Field Type needs to be one on this list:
https://www.metabase.com/docs/latest/users-guide/13-sql-parameters.html#creating-sql-question-filters-using-field-filter-variables

@flamber thank you for the tip! You’re right, that’s more intuitive to use the search box, but only if the search box doesn’t drop all the suggestions that exceed the 5k limit. If the search box drops all the values above the 5k limit, there may be a situation when a user types in some values that are present in the data but not suggested - so the user thinks these values are not present (and that’s counter-intuitive).

I think Metabase could pull in all the distinct values of category fields with high cardinality during the sync, but only show some of them in the suggestions (when the user starts to type something, the number of returned elements will not be so big anyway)

@flamber I have yet another question: does the 5k limit apply only to ‘category’ field type? If I choose for example ‘entity key’ field type, will the 5k limit still be applied?

@MusicMaker1337 I don’t know. I would guess it’s the same. Let’s see what Simon says, when he has time to look at this.

@sbelak @flamber Any chance you’ll have a look at this?

It’s the same regardless of the type. We have a few limits like this in place to not accidentally blow up people’s DBs. The downside of having users with a few 100s of rows to a few trillion rows is that it’s hard to find sane limits that work for all. Some of these could be made more dynamic (use heuristics instead of hardcoded values), so it’s worth opening an issue. At least that way we have an overview of problems people are hitting.

There’s a feature request for allowing override of (most/all?) hardcoded constants:
https://github.com/metabase/metabase/issues/4564 - upvote by clicking :+1: on the first post

A possible solution could be instead of querying the Metabase internal application database for suggestions to query the data source itself using a query somewhat like SELECT column FROM table WHERE column LIKE searchString% LIMIT 100

The workflow could be as follows:

  1. The user starts typing something in the search box
  2. When the search box contents don’t change for some time (e.g. 1 second) Metabase runs a query against the data source using the syntax above
  3. The suggestions box is populated with 100 results. Everyone happy, database not blown up, search box can draw suggestions from a table with a million rows (if the search column is properly indexed).

@MusicMaker1337 I’m pretty sure that would be a no, since some have thousands of users and that would put even higher cost on them if they’re using BigQuery or such pay-as-you-go.

Quite. And if the column is not indexed, that’s a full table scan which can get quite expensive.

@flamber

" Currently Metabase does a LIKE 'search%' and it’s not configurable. There’s many problems with doing a LIKE '%search%' , since it would break indexes, which could be very heavy on larger data sets."

I really need this feature what is the difference if user using

select * from tableA where col1 like "%search_input%"

where col1 is not an indexed column this would be the same risk anyway. So why not let us have an option to use it.

@eric.halim You can do exactly what you’re asking - with regular Text filter - depending on which database you’re querying with, something like one of these:

'%'||{{search_input}}||'%'
or
CONCAT('%',{{search_input}},'%')

There’s a request for allowing “contains” vs “begins with”:
https://github.com/metabase/metabase/issues/8219 - upvote by clicking :+1: on the first post