Limit for storing field values in `metabase_fieldvalues`

I have a table with 4 million records. One of the columns has 3700 distinct values (text). The field values for this column don't get cached when running "re-scan field values". As a result, when I want to filter that column, it takes too long to search for a value.
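As a sanity check, the column's cardinality can be confirmed directly in the source database (table and column names are placeholders):

SELECT COUNT(DISTINCT column_name) FROM table_name;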

I looked in the metabase_fieldvalues table and, contrary to what I expected, there is no row for that specific column. I concluded there must be a limit on the number of distinct values, beyond which a column's values are not saved in metabase_fieldvalues.

Is this correct? If so, what is the limit, and can I change it? If not, what can I do?
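For reference, whether values were cached can be verified by querying the Metabase application database directly. A sketch, assuming a Postgres app DB; 'my_table' and 'my_column' are placeholders:

-- Returns one row per matching field; NULLs on the right-hand side
-- mean no cached values exist for that field.
SELECT f.id AS field_id, fv.updated_at, fv."values"
FROM metabase_field f
JOIN metabase_table t ON t.id = f.table_id
LEFT JOIN metabase_fieldvalues fv ON fv.field_id = f.id
WHERE t.name = 'my_table'
  AND f.name = 'my_column';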

What setting do you have in your data model for that field?

Hi @King_Edward
When there are more than 300 distinct values, no list is shown, but it’s not quite an autocomplete either:
https://github.com/metabase/metabase/issues/7603 - upvote by clicking :+1: on the first post

@flamber I was aware of the 300 limit, but my question is different. The 300 limit applies when you click the filter button on that column and see all the values. My question is: after clicking the filter button on that column, I type the first few characters (e.g. ‘bank’), and I expect to see a few values that start with ‘bank’. But this takes a long time for my table. If the distinct values were cached (in the metabase_fieldvalues table, I assume), this should take a fraction of a second.

Are you saying if we have more than 300 distinct values for a column, the table metabase_fieldvalues does not get updated with that data?

@AndrewMBaines Search box.

@King_Edward
When there are more than 300 distinct values, they are not stored in metabase_fieldvalues, since there could be 50k distinct values, and storing those in the application database probably wouldn’t be efficient.

The search does a lookup in your database, not the Metabase application database, so that’s probably the reason for the latency.
This is also the reason the autocomplete uses “begins with” rather than “contains”, since a “contains” search could lock up large tables.
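To illustrate, the search sends the source database a query roughly of this shape (a sketch, not the exact SQL Metabase generates; names and the LIMIT are illustrative):

-- "Begins with": case-insensitive prefix match on the filtered column.
SELECT DISTINCT column_name
FROM table_name
WHERE LOWER(column_name) LIKE 'bank%'
LIMIT 100;

Without an index on LOWER(column_name), the database has to scan the whole column on every search, which would explain the latency on a 4-million-row table.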

Not sure if we could handle this differently, but let me ping @sbelak

How we currently handle this is not ideal, but the gist is that it’s a balancing act. Getting all the values is a reasonably expensive query (a full scan if there is no index), which limits how often we can do it … which is at odds with the freshness you very much need for decent UX. We have some ideas around incremental updates and probabilistic updates (based on inferred change frequency), but no timeline yet.

@sbelak I agree that getting all the values can be expensive. One solution could be to skip storing values during the automatic scan, but store them when a scan is initiated manually for a specific column from the data model.

Without cached values, every time users want to filter the data, the whole table is scanned, which is way more expensive.

The workaround for now is to create pattern indexes for these columns (i.e. columns that have more than 300 distinct values in huge tables):

CREATE INDEX index_name ON table_name (LOWER(column_name) text_pattern_ops);

Since LOWER() returns text, the expression index above needs text_pattern_ops. If you index the raw column instead, choose the operator class matching its data type:
varchar_pattern_ops for varchar,
bpchar_pattern_ops for char, or
text_pattern_ops for text

Note: The above index syntax is for Postgres. Other SQL dialects should have something similar.
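To confirm the index is picked up, you can EXPLAIN a prefix search of the same shape (Postgres again; placeholder names):

-- The plan should show an index or bitmap scan instead of a Seq Scan.
EXPLAIN
SELECT DISTINCT column_name
FROM table_name
WHERE LOWER(column_name) LIKE 'bank%';

Note that the pattern_ops classes only help anchored patterns like 'bank%'; a pattern with a leading wildcard ('%bank%') still forces a full scan.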
