Nested histograms return only a partial

Kenneth · January 13, 2022, 7:30am

Platform: Bigquery

Overview:
Previously, to resolve issues with nested aggregations not properly displaying in histogram format, We worked around this with a temporary question that we then aggregated on as per:

Attempting this again recently resulted in a histogram with only three bars, despite having data outside the presented range:

Note that values are definitely present above 37.5

Running the generated SQL output in bigquery, we observed duplicated groupby columns with single counts, rather than the N columns we originally expected:

Currently, we've worked around this by preaggregating the data, but would like to know if this has already been observed. A quick check shows other binning issues, but none exactly like this.

Diagnostics:

{
  "browser-info": {
    "language": "en-GB",
    "platform": "Win32",
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.13+8",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.13",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.13+8",
    "os.name": "Linux",
    "os.version": "4.14.219-164.354.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "bigquery-cloud-sdk",
      "bigquery"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "11.9"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.2.23"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2021-12-16",
      "tag": "v1.41.5",
      "branch": "release-x.41.x",
      "hash": "fbfffc6"
    },
    "settings": {
      "report-timezone": null
    }
  }
}```

flamber · January 13, 2022, 9:26am

Hi @Kenneth
I don't think I understand what the problem is, since I don't understand your data.
If you showed a full screenshot of your histogram, there would likely be a black triangle in upper-right corner, which would tell you that your results are limited to 10,000 rows.

Kenneth · January 14, 2022, 4:12am

Hi @flamber

I'll go a bit further into the details then, and if there's any part that you feel was difficult to grok, just point it out and I'll elaborate.

Essentially, each user has a certain score (float) and user_id (str), and we wish to compute the average of each user, then and then bin saidaverage in a histogram. This should be done via a sequential aggregation in the GUI as follows:

As per the previous issue, this doesn't work and the issue is still open. We thus applied the workaround instead, creating an interim table in which average scores for each user were computed.

We then proceeded to count users, grouping by the score value, which should then provide a histogram of the values across the 1-100 range. However, we see the below histogram instead, alongside the warning you mentioned

To determine the root cause of this, I executed at the generated SQL query on bigquery (some formatting done for readability)

SELECT ((floor(((`source`.`avg` - 0.0) / 12.5)) * 12.5) + 0.0) AS `avg`, 
    count(distinct `source`.`user_id`) AS `count`
FROM (
    SELECT `dwh_aktivo_mart.dailyagg_wide`.`user_id` AS `user_id`, avg(`dwh_aktivo_mart.dailyagg_wide`.`score`) AS `avg` FROM `dwh_aktivo_mart.dailyagg_wide`
    GROUP BY `user_id`
    ORDER BY `user_id` ASC
) `source`
 GROUP BY `source`.`avg` ORDER BY `source`.`avg` ASC

Here I noticed that something seemed to be going wrong with the binning component of the algorithm, as it was creating multiple entries for what should be a single group

Furthermore, because the number of such entries was extremely large, it resulted in the warning above.

This seems to stem from a mistargeted group variable. Where it should instead be grouping by the floored average (binning variable), it is instead grouping by the the average value in the user_average table


SELECT ((floor(((`source`.`avg` - 0.0) / 12.5)) * 12.5) + 0.0) AS bin,
    count(distinct `source`.`user_id`) AS `count`
FROM (
    SELECT `dwh_aktivo_mart.dailyagg_wide`.`user_id` AS `user_id`, avg(`dwh_aktivo_mart.dailyagg_wide`.`score`) AS `avg` FROM `dwh_aktivo_mart.dailyagg_wide`
    GROUP BY `user_id`
    ORDER BY `user_id` ASC
) `source`
 GROUP BY ((floor(((`source`.`avg` - 0.0) / 12.5)) * 12.5) + 0.0)

Changing the grouping variable then produces the expected result

Please feel free to request for more information if required.

flamber · January 14, 2022, 8:44am

@Kenneth Some fingerprinting information, which is used for binning and bucketing dates, is currently lost in nested queries. There are several issues open about that. You could probably get around that by creating a View on your database as the interim table.

Kenneth · January 14, 2022, 8:51am

Yeah, we subsequently worked around that with a temporary view at the database level, but I'd posted this here in case the bug was different from the existing ones. I hadn't seen one that dug into the SQL itself yet.