Disable scanning, or allow changing the LIMIT value used in scanning

masakazuwatanabe · May 27, 2021, 10:50am

Hello.
I have a feature request, prompted by a problem I ran into while using Athena.

I would like to be able to stop the following from running automatically:

Scan table schema
Scan field values (this one already has a setting to disable it)
Fingerprinting

If that is not possible, is there a way to lower the LIMIT value to something small, such as 100?

With Athena, you are charged for the amount of data read by each query.
Separately, you are also charged for the S3 requests (GET and others) that Athena makes when it reads files from S3.

The root cause is in my own database:
it was huge and stored one record per file.
Because of that, the following operations each touched a very large number of files:

Scan table schema (LIMIT 10000?)
Scan field values (LIMIT 5000?)
Fingerprinting (LIMIT 10000?)

These scans ran repeatedly without my noticing.
Athena read a huge number of S3 files, which drove up the S3 request charges. A bill that was usually a few hundred dollars grew to tens of thousands of dollars.
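
To make the billing mechanics above concrete, here is a rough back-of-the-envelope sketch in Python. Nothing in it comes from Metabase or from the Athena driver: the function name, the two price constants (assumed list prices that vary by region and over time) and the simplifications are all assumptions for illustration. It assumes one S3 GET per file, that a query reads only as many files as it needs to satisfy its LIMIT, and it ignores any per-query minimum charge. A real engine may read more than this, so treat the result as a lower bound.

```python
from dataclasses import dataclass

# Assumed prices, NOT taken from the thread; check current AWS pricing.
ATHENA_USD_PER_TB_SCANNED = 5.00
S3_USD_PER_1000_GET_REQUESTS = 0.0004


@dataclass
class ScanEstimate:
    files_read: int
    athena_usd: float
    s3_get_usd: float

    @property
    def total_usd(self) -> float:
        return self.athena_usd + self.s3_get_usd


def estimate_scan(limit: int, records_per_file: int, avg_file_bytes: int) -> ScanEstimate:
    """Hypothetical lower-bound cost of one `SELECT ... LIMIT n` style scan."""
    # Ceiling division: files needed to collect `limit` records.
    files_read = -(-limit // records_per_file)
    bytes_scanned = files_read * avg_file_bytes
    # Treats 1 TB as 10**12 bytes, which is a simplification.
    athena_usd = bytes_scanned / 1e12 * ATHENA_USD_PER_TB_SCANNED
    s3_get_usd = files_read / 1000 * S3_USD_PER_1000_GET_REQUESTS
    return ScanEstimate(files_read, athena_usd, s3_get_usd)


if __name__ == "__main__":
    # One record per file, as in the database described above;
    # the 1 KiB average file size is a made-up figure.
    for limit in (100, 5000, 10000):
        est = estimate_scan(limit, records_per_file=1, avg_file_bytes=1024)
        print(limit, est.files_read, est.total_usd)
```

With one record per file, the number of files read grows in step with the LIMIT value, so the per-scan estimate has to be multiplied by the number of tables and by how often the scans run to approximate a monthly figure.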

I know the problem lies in my database structure and in how Athena works, not in Metabase. Still, I think being able to change the LIMIT values would be useful for any huge database, not just Athena.

flamber · May 27, 2021, 10:58am

Hi @masakazuwatanabe
Post "Diagnostic Info" from Admin > Troubleshooting, and which version of the Athena driver you're using.

You can disable field scan:
https://www.metabase.com/docs/latest/administration-guide/01-managing-databases.html#database-sync-and-analysis

But sync cannot be disabled. Fingerprinting is done during the first sync, so that Metabase has a better understanding of each field and how it can be used (binning, etc.).

masakazuwatanabe · May 27, 2021, 11:50am

Here is my environment (I'm running Metabase on Docker):

{
  "browser-info": {
    "language": "ja",
    "platform": "Linux x86_64",
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.11+9",
    "java.vendor": "AdoptOpenJDK",
    "java.vendor.url": "https://adoptopenjdk.net/",
    "java.version": "11.0.11",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.11+9",
    "os.name": "Linux",
    "os.version": "4.18.0-240.22.1.el8_3.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "athena"
    ],
    "hosting-env": "unknown",
    "application-database": "h2",
    "application-database-details": {
      "database": {
        "name": "H2",
        "version": "1.4.197 (2018-03-18)"
      },
      "jdbc-driver": {
        "name": "H2 JDBC Driver",
        "version": "1.4.197 (2018-03-18)"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2021-04-27",
      "tag": "v0.39.1",
      "branch": "release-x.39.x",
      "hash": "6beba48"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

Athena Driver Version: v1.2.0

masakazuwatanabe · May 27, 2021, 11:58am

There were 50 million rows in my Athena database.
So even with "LIMIT 10000", only a small part of the data is sampled.

In terms of Athena's billing, however, "LIMIT 10000" can cost a very large amount of money.

In a case like this, LIMIT 10000 is not necessary, and I felt that 100 rows would be sufficient.
It depends on the case, but I thought that if the "LIMIT" value were configurable, it could cover a variety of situations, so I posted this request.

flamber · May 27, 2021, 12:23pm

@masakazuwatanabe You would have to create your own build of Metabase. Those values are hardcoded, based on an estimate of what would work for 95%+ of installations. Metabase tries to "Just Work" without having a million configuration toggles.

Fingerprinting only 100 values might be enough for your data if the values are mostly the same. Each database works differently.

You can probably create a custom driver that handles fingerprinting differently, which means you wouldn't have to create a custom build of Metabase.
Have a look at the MongoDB driver, which has a different sync/scan/fingerprinting strategy:
https://github.com/metabase/metabase/blob/master/modules/drivers/mongo/src/metabase/driver/mongo.clj#L85