Data is synced when it shouldn't be

Hi,

This might be me misunderstanding how data syncing/fingerprinting works, but Metabase seems to sync data more often than we have configured.

We have a database on Snowflake and a number of questions on a dashboard. One of the questions on the dashboard is "Games played today", which shows exactly what the title says: the game names and frequencies with today's timestamp.

We have set up Metabase to sync the data daily at 5 AM. Because of that, we don't actually expect any of today's data to show up: our customers (mainly schools and child daycare centres) don't play before that time, so we are essentially a day behind on data.

However, during the day we can see the table being populated with data from the current day, even though the sync runs in the morning before anyone has played.

I don't see anything off in the logs either, so we're scratching our heads trying to figure out where that data comes from (or rather, how it gets into Metabase, since we do send it to Snowflake periodically).

I did see that there are/were related issues from other users that were recently fixed, but something is still happening and we're not sure what.

We're on the cloud version, here is our diagnostic info:

{
  "browser-info": {
    "language": "en-US",
    "platform": "Win32",
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.20+8",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.20",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.20+8",
    "os.name": "Linux",
    "os.version": "5.10.184-175.731.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "snowflake",
      "mysql"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "14.7"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.5.1"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2023-08-15",
      "tag": "v1.46.8",
      "branch": "release-x.46.x",
      "hash": "27a9ec6"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

And the database settings (screenshot in Dutch):

If people are viewing the questions, then data will be pulled from Snowflake.
The syncs aren't loading the actual data (with the exception of cached models).
The 5 AM sync only fetches the database schema.
The other scheduled job is just for the filter values.

Thanks for the reply; that makes sense then.

I was under the impression that data would not be pulled automatically, since there is also an explicit button for that at the bottom of the screen, Download all/full results.

Is there a way to turn off the data pull when a question is viewed, so I can manually download something whenever I feel the need?

Also, is data only pulled when I actually open a question? If I go into a dashboard with multiple questions, does it pull data for all of the questions, or only when I explicitly open a specific question?

So for example, if I have 20 questions that work on the same database, would it pull that data 20 times when I open the dashboard? The specific dataset we want to check has all the information we need for the 20 questions, so we would only need to refresh it once for all of them.

Yes, it queries for all the questions; that's by design.
If you create a model and build all your questions on that model, you can cache the model's data and set an update interval of 24 hours.
The downside is that performance can be a bit slow with very large datasets, as Metabase doesn't do anything clever with indexes on the model's cache.

Assuming it's the egress charges that are the concern, you could run an ETL job into a more local Postgres or MySQL database.
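
If it helps, here is a rough sketch of what such a job could look like in Python, using the snowflake-connector-python and psycopg2 packages. The table and column names (GAMES_PLAYED, game_name, played_at) and all connection details are placeholders for illustration; the idea is to run it once a day (e.g. from cron) and point Metabase at the Postgres copy instead of Snowflake.

```python
# Sketch: copy one Snowflake table into a local Postgres reporting database.
# Table/column names and connection details below are made-up placeholders.
import psycopg2
import psycopg2.extras
import snowflake.connector

BATCH_SIZE = 5_000

def copy_games_played():
    sf = snowflake.connector.connect(
        account="your_account", user="etl_user", password="...",
        warehouse="ETL_WH", database="GAMES_DB", schema="PUBLIC",
    )
    pg = psycopg2.connect("host=localhost dbname=reporting user=etl password=...")
    try:
        sf_cur = sf.cursor()
        pg_cur = pg.cursor()

        # Rebuild the target table so every nightly run starts from a clean copy.
        pg_cur.execute("DROP TABLE IF EXISTS games_played")
        pg_cur.execute(
            "CREATE TABLE games_played (game_name text, played_at timestamptz)"
        )

        # Stream rows out of Snowflake in batches to keep memory usage flat.
        sf_cur.execute("SELECT game_name, played_at FROM GAMES_PLAYED")
        while True:
            rows = sf_cur.fetchmany(BATCH_SIZE)
            if not rows:
                break
            psycopg2.extras.execute_values(
                pg_cur,
                "INSERT INTO games_played (game_name, played_at) VALUES %s",
                rows,
            )

        pg.commit()
    finally:
        sf.close()
        pg.close()

if __name__ == "__main__":
    copy_games_played()
```

That way all 20 questions hit the local copy, and Snowflake only gets queried once a day by the ETL job.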

Thanks a lot for the clarification. We just moved from a local database to Snowflake, so I will have to look into what would make sense for us (moving to Snowflake was also done for some other reasons).

Plenty to think about, but thanks again for the info!

I've not worked with Snowflake, but from what I can gather, it's great until you receive the bill! The hidden cost of cloud databases is always the cost of getting the data out.

Depending on your company's preferences, both MS SQL and Postgres are solid general-purpose databases with columnar options that will make your queries fly along: MS SQL has columnstore indexes built in, and Postgres can get similar columnar storage through extensions. MS SQL also has in-memory tables, but that gets expensive due to the server requirements.
It all has to be weighed against the admin savings of a managed cloud database.
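
For completeness, this is roughly what the columnstore route looks like if the local copy ends up in MS SQL Server: a clustered columnstore index on the copied table. The table name and connection string are again placeholders, and it assumes the pyodbc package plus a SQL Server ODBC driver.

```python
# Sketch: add a clustered columnstore index to a local SQL Server copy
# of the games table, so wide analytical scans stay fast.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=localhost;DATABASE=reporting;"
    "UID=etl;PWD=...;TrustServerCertificate=yes"
)
cur = conn.cursor()
# A columnstore index stores the table column-by-column, which is what
# makes aggregations like "games played today" cheap to scan.
cur.execute("CREATE CLUSTERED COLUMNSTORE INDEX cci_games_played ON games_played")
conn.commit()
conn.close()
```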

We knew there would be costs beforehand, but it was hard to estimate exactly how much. We've already lowered the costs, but of course we want to reduce them as much as possible.

There were some logistical choices around a cloud-based solution, one of which "forced" us to use Snowflake, so we decided to move all our data there so we would have a single point of entry, for various reasons.

We will definitely look at other options if the costs outweigh the benefits, so thanks for the advice, it is appreciated!