Consistent quoting of text values in CSV output

tmaddison · October 29, 2020, 4:41am

Is it possible to change the delimiter Metabase uses for CSV output?

I have a data set that includes commas in the data. The Metabase CSV export uses commas as the delimiter, which means when I attempt to import the data into other systems it doesn’t work - the commas embedded in the data create issues.

Normally if I were doing this in Access or Excel I would just change the delimiter in the export - use | or ; or something not used in my data - but I can’t find any way to do that.

Am I missing it?

Thanks!

flamber · October 29, 2020, 9:30am

Hi @tmaddison
It is currently not possible to change delimiter:
https://github.com/metabase/metabase/issues/10256 - upvote by clicking on the first post
Metabase follows the standard, so if your data contains commas, then the value is enclosed in quotes. It’s the same way as most programs do, including Excel etc.

tmaddison · October 29, 2020, 1:02pm

I would have thought that as well, however that is not what is happening (as of yesterday evening, at least…)

Here’s a sample of a few records from the CSV output of my data file:

25045171,CASTILLO, MARIA,51899.11,18105.13,52667.02,4,520.00,247.91,castillo-maria,ELIGIBILITY TECHNICIAN III,70772.15,2018
25045170,LOZA, PATRICIA,51899.12,20568.04,52419.12,4,520.00,0.00,loza-patricia,ELIGIBILITY TECHNICIAN III,72987.16,2018
25045169,CAMARENA, OFELIA,51899.09,16215.03,53752.81,4,520.00,1333.72,camarena-ofelia,ELIGIBILITY TECHNICIAN III,69967.84,2018

I exported about 10 different files - all based on identical queries from the same source table (just using different selection criteria) and oddly enough if I look at the CSV files I see most of them did not enclose text values in quotes but SOME did.

I don’t know what triggers the quotes, I would think it would be the source field data type but that did not change - the exact same source table was used, all I did was select a different range of ID’s to create output files that were smaller and more manageable.

Hmmm… will retry the exports again and see if they now come out with quotes around the text, or see if there’s anything else strange I can see going on.

Thanks!

tmaddison · October 29, 2020, 1:09pm

Update. Re-tried the export process with a few different queries, no quotes enclosing text values in any of the exported CSV files.

flamber · October 29, 2020, 1:27pm

@tmaddison
Post “Diagnostic Info” from Admin > Troubleshooting.
Values will be quoted, when they contain “breaking” characters. That would be newline, comma, quote and probably a couple more.

tmaddison · October 29, 2020, 2:20pm

Thanks. Unfortunately I’m not an admin, just a user, so I’ll have to figure out who that is to get them to post the troubleshooting info.

The original data doesn’t have any of the “breaking” characters you describe - the original data is just “in a field in the database” in Metabase. I would not expect anyone would store newlines or quotes in their data normally, and the fact that commas are in that data is the problem, really…

I’m a newbie to Metabase but very familiar with Microsoft Access (I’ve designed import/export specs to deal with all kinds of legacy data file formats in the past) and I believe that when you set up an export to CSV in Access, it decides whether a field is text (and therefore needs quotes) based on the data field being defined as a “text” data type of some sort, not based on examination of the content.

If Metabase works differently that would seem a bit chancy rather than having a specific method to follow that adheres to the DB admin’s original design, wouldn’t it?

flamber · October 29, 2020, 2:30pm

@tmaddison I think anything that Access and Excel has been doing is very incorrect, but it’s hard for them to fix things, since they would then break legacy stuff.

CSV is called “Comma-Separated Values” for a reason - Microsoft didn’t like standards back in the early years, but have since come around big time - but they still cannot change the past without breaking legacy, so they won’t do that.
https://en.wikipedia.org/wiki/Comma-separated_values

tmaddison · October 29, 2020, 2:40pm

Sure, I can appreciate the peculiarities of Access and how they define their workings, but I’ve got to say that having an export process that “breaks” based on the content of the data in the field is inherently wrong.

Enforcing consistent data entry - particularly when dealing with data from other systems - is almost impossible. Certainly if you design the system from the ground up you can control that entry to some degree, but what if you’re dealing with old data that happens to have commas, quote marks, tildes, pipes, ampersands, or whatever as part of that data set?

If I define a field as a “text” (or perhaps “general”) data type, it should accept entry of ANY valid printable ASCII character. And any export process should then simply accept the fact that the data may contain characters that are “reserved” in some way, and put such data types in quotes.

I would view the fact that Metabase does not do this - and apparently therefore can’t create a proper CSV export from data tables that may contain those reserved characters in the data - as a flaw, not a benefit or advantage of some sort…

Sorry, just me…

Not sure what the heck I do now. I have 2.8 million records in a table, some of which have commas, some don’t. I need to export that data to CSV for use in an external system, but there appears to be no way to do it in away that conforms to a CSV format that will work for any normal import system.

Is there a way to write some SQL that would manage this export in a way that allows for a different dellimiter?

By the way, I’ve tried the .xlsx export as a way around this - figuring I’d go to xlsx and then let Excel convert to CSV (it handles export of text values in the expected way) but every time I try exporting to xlsx the Metabase system just gives me an Application Error and then won’t let me log back in for a while…

Dead in the water…?

flamber · October 29, 2020, 2:58pm

@tmaddison I cannot reproduce on latest 0.37.0, so if you can provide steps to reproduce, then that would be much appreciated.

I tend to use JSON, when I need a format that can be somewhat trusted - and then convert to other formats from there.

But Metabase will not export more than 1 million rows.

tmaddison · October 29, 2020, 3:18pm

Sure. Just as an aside before I start, I did find a work-around. If I limit the data set to somewhere around 100,000 records or less I can get it to export XLSX without erroring so I’ve been doing that, which is a bit of a slog given I have 2.8 million records to do…

I had not considered JSON, have never dealt with it and read that it was basically a comma-delimited format as well…

To reproduce is somewhat simple…

I have a data table with 16 fields. Some are numeric, some are text.

A clip of a few records (as exported by Metabase to CSV) is below. I chose a section of the file that had some records (the first four) that did not have commas in the employee names, and the second section (the last four rows) that does.

There have also been commas identified in other text fields as well. No difficulties have been identified in the numeric values.

This table has about 20 million records in it overall. I’d love to just download the whole thing and work on it locally but…

To get the data set I need now, I create a filter that selects a set of the data. The overall set I need is 2.8 million records (all records from a specific year - 2018.)

To keep the result smaller than 1 million rows, I did a summary/count by Jurisdiction ID and use that information to select a set of Jurisdiction ID’s that result in less than 1 million records output.

What I’ve found is that if I get anywhere close to 1 million (but still under) I get an “Application Error” that then locks me out for a while, so through trial and error I’ve managed to figure out if I keep the resulting data set to 250,000 records or less it will let me export without failing.

As an aside, in that troubleshooting I thought maybe it had something to do with total file size - not just row count - so I removed all but the four fields I need for this project - but that did nothing - same Application Error if I attempt to export more than about 250,000 records (to CSV.)

This means I need to re-run the same filter 11 times - resetting the begin and ending Jurisdiction ID’s each time to select a different portion of the data, with file outputs of less than 250,000 records each.

Once I have the proper result on screen, I just click the download button, choose CSV, wait for the filename chooser to popup, accept the default name, and let it download.

Thanks very much for looking at this!

ID,Name,Base,Benefits,Gross,Jurisdiction ID,Notes,Other,Overtime,Rank,Slug,Status,Title,Total,U Al,Year
28343760,Susan D Woodhouse,65017.17,27677.39,65017.17,3,0.00,0.00,susan-d-woodhouse,Administrative Svcs Asst III,92694.56,2018
28343761,Kathryn A Burroughs,74667.38,21113.80,74933.71,3,266.33,0.00,kathryn-a-burroughs,Administrative Svcs Asst II-Pr,96047.51,2018
28343762,Alexandra M.M. Madsen,83692.49,38368.78,83692.49,3,0.00,0.00,alexandra-mm-madsen,EHS Division Manager-Project,122061.27,2018
28343763,Caylin R Patterson,46518.14,20517.55,52839.87,3,6321.73,0.00,caylin-r-patterson,EHS Division Manager-Project,73357.42,2018
25044665,QUINONES, EDGAR,79906.28,35518.69,140000.52,4,10471.97,49622.27,quinones-edgar,SHERIFF’S SERGEANT,175519.21,2018
25043809,ESCOBAR, BECKY,10385.86,9123.61,10385.86,4,0.00,0.00,escobar-becky,OFFICE TECHNICIAN,19509.47,2018
25043810,MENDOZA, ANGELICA,44553.07,16548.74,45073.07,4,520.00,0.00,mendoza-angelica,OFFICE SUPERVISOR I,61621.81,2018
25043811,OTERO, RAECHEL,46044.18,21326.64,46044.18,4,0.00,0.00,otero-raechel,OFFICE SUPERVISOR I,67370.82,2018

flamber · October 29, 2020, 3:40pm

@tmaddison

Since Excel doesn’t support more than 1 million records, then it doesn’t make much sense if Metabase would export more.

Whenever you get someone to sneak out “Diagnostic Info”, then please post. There has been a bunch of fixes to exports (both CSV and XLSX) in 0.36.8.2.
But with the information you’re mentioning, it sounds like you’re querying BigQuery, which has a known limitation of 10MB per “page”, but this was addressed as well, so Metabase is now paginating.

Also, it would be extremely useful if the database admin could make a SQL dump, which you of course would replace real names/values with something different. Because the SQL dump also contains the actual database column types, which could be the trigger in all of this.

Let me try to see if I can use the sample you’ve supplied to create something that can trigger the problem.

flamber · October 30, 2020, 12:24pm

@tmaddison I have tried multiple things, but it correctly adds quotes around values that contains comma - like the standard expects.

tmaddison · October 30, 2020, 1:05pm

Thanks for the help. I’ve heard back from our admin, no copy of the log file yet but he did tell me that in this db the three fields I’m selecting that normally contain text - “name”, “slug”, and “title” are set up as text, varchar(50), and text data type respectively.

I’ve also run some further tests on the export with oddly inconsistent results. Or, at least I’m not able to tell what the consistency is…

Each record in our data is related to a “jurisdiction ID”.

I can’t download the entire data set because it is 2.8 million records. To create smaller downloadable data sets I’ve been filtering by Jurisdiction ID to create smaller subsets for download.

Some of the data contains names without commas in them - format “firstname MI lastname”. Some of the data contains names with commas - format “lastname, firstname”

The sample “problem” jurisdiction is identified in our records as jurisdiction ID 4 (there are many more, that’s just what I’m using as a sample…)

There is a sample “not problem” jurisdiction as well, ID 3.

My original download was to pull data on a large set - ID’s 0 through 100, which resulted in about 400K records.

That worked, however the text fields were not output using quotes. This meant records without the comma had a different number of fields than the records with the comma.

A sample of this file showing a few records from Jurisdiction ID 3 (no commas) and 4 (commas).

ID,Name,Base,Benefits,Gross,Jurisdiction ID,Notes,Other,Overtime,Rank,Slug,Status,Title,Total,U Al,Year
28343762,Alexandra M.M. Madsen,83692.49,38368.78,83692.49,3,0.00,0.00,alexandra-mm-madsen,EHS Division Manager-Project,122061.27,2018
28343763,Caylin R Patterson,46518.14,20517.55,52839.87,3,6321.73,0.00,caylin-r-patterson,EHS Division Manager-Project,73357.42,2018
25044665,QUINONES, EDGAR,79906.28,35518.69,140000.52,4,10471.97,49622.27,quinones-edgar,SHERIFF’S SERGEANT,175519.21,2018
25043809,ESCOBAR, BECKY,10385.86,9123.61,10385.86,4,0.00,0.00,escobar-becky,OFFICE TECHNICIAN,19509.47,2018

On further testing, if I download Jurisdiction 4 on it’s own, the CSV does DOES use quotes.

id,name,base,benefits,gross,jurisdiction_id,other,overtime,status,title,total,ual,year
25045434,“ESPINO, ORLANDO GUADALUPE”,64776.85,26805.93,78088.73,4,4311.24,9000.64,DA INVESTIGATOR II,104894.66,2018
25045433,“CHAVARIN, JUAN”,66849.77,34153.39,70153.22,4,3303.45,0.00,CORRECTIONAL LIEUTENANT,104306.61,2018

Strangely enough, the header row is different between the two as well.

If I download a sample that includes JUST 3 and 4, I get this:

id,name,base,benefits,gross,jurisdiction_id,other,overtime,status,title,total,ual,year
28332001,Kelly L Dunham,28086.72,0.00,0.00,3,-28086.72,0.00,Animal Svcs Officer,0.00,2018
28332000,Tim J Dumandan Iii,67222.10,42208.41,76183.72,3,8961.62,0.00,Animal Svcs Officer,118392.13,2018
25045434,“ESPINO, ORLANDO GUADALUPE”,64776.85,26805.93,78088.73,4,4311.24,9000.64,DA INVESTIGATOR II,104894.66,2018
25045433,“CHAVARIN, JUAN”,66849.77,34153.39,70153.22,4,3303.45,0.00,CORRECTIONAL LIEUTENANT,104306.61,2018

Which has a MIX of quote usage and non-usage, in the same fields.

Seems to me like the Metabase system is using some kind of evaluation of the content of the field to determine the need for quotes - not simply based on field type - and somehow is “guessing wrong” when the data set is larger for some reason.

Meanwhile, it looks like I’m going to be able to pull this data from a backup that our Admin has so this is not such a priority for me any more - if it’s not something you’ve had come up as a problem for anyone else I don’t know if you need to continue working on it, but it would be interesting to know why the inconsistency.

Also fyi our admin told me he thinks the reason I can’t export a full 1M records without “Application Error” is because he thinks the Heroku system has a 30 second timeout and on the size of data set I’m trying to pull it takes longer than that to finish the query. I don’t know if that timeout is adjustable, but there are users working with larger data sets it might be good to do something there if possible.

Thanks again!

Todd

flamber · October 30, 2020, 1:26pm

@tmaddison

There really isn’t much that can be done about Heroku timeouts - besides not using Heroku:
https://github.com/metabase/metabase/issues/11463

There’s a known issue with XLSX downloads, when there’s more than 1 million records:
https://github.com/metabase/metabase/issues/13585

I would recommend that you get your admin to post here with “Diagnostic Info” and possibly logs when exports fails.

Again, I cannot reproduce on 0.37.0.2 - even when downloading from a table with 1.3 million rows.

Like I said originally - yes, Metabase does a value evaluation and adds quotes around values with comma/etc. That is not an inconsistency - that follows the standard.
If you want to create a feature request for having it always apply quotes to text-values, then please do:
https://github.com/metabase/metabase/issues/new/choose

tmaddison · October 30, 2020, 1:41pm

Thanks. Greatly appreciate your help.

I’ll submit the feature request, seems like that consistency would be good to be able to enforce in situations that require it, along with the ability to choose other delimiters (which would have made this moot…)

Understand the Excel limit, I think my issue is more about the timeout, the system can’t produce an Excel file of more than about 100,000 records or so within 30 seconds.

Thanks again!

tmaddison · October 30, 2020, 1:49pm

Just fyi, when I went to the GitHub to submit the feature request I found this. One user reports the same issues with timeouts as well.

flamber · October 30, 2020, 1:55pm

@tmaddison Please make sure you are using Metabase 0.37.0.2 and not running on Heroku, then we can continue debugging. Like you can see, I have closed the issue, since I could not reproduce.
There has been so many changes since that issue was created.

tmaddison · October 30, 2020, 1:59pm

Thanks! Will do. If I run into further issues I’ll let you know. I’ve submitted the feature request on GitHub, don’t know if I’m the only one in the world who wants it, though!

flamber · October 30, 2020, 2:00pm

For reference:
https://github.com/metabase/metabase/issues/13640 - upvote by clicking on the first post

tmaddison · November 3, 2020, 5:40pm

Update - our admin updated Metabase to the latest version (apparently it was not…) and that seems to have fixed the quote issue when exporting text fields to CSV.

Thanks!

Todd