MySQL crash around the time we started running Metabase queries

Hello!

We've had great feedback from our customers on our beta product, which incorporates Metabase. However, there was a MySQL crash on 3/20, right around the time we started using Metabase (the day after, actually).

Here are the relevant versions:

Metabase:

You're on version v0.33.2
Built on 2019-09-05

MySql:

innodb_version: 5.7.25

The mysql log:

https://sicks3bucket.s3.amazonaws.com/sick_an_mysql.err

Here's the metabase log:

https://sicks3bucket.s3.amazonaws.com/metabase_log.txt

Only two queries were run:

Query 1:

select date, system_name, device_name, valid_read_rate from 
(SELECT  left(from_unixtime(c.capture_time/1000),10) as date,  a.name as 'system_name', b.name as 'device_name', c.read_count, c.valid_object_count, 
if(valid_object_count = 0,'00.00%',concat(format(read_count/valid_object_count*100,2),'%')) as valid_read_rate, read_count/valid_object_count*100 as valid_read_rate_unformatted
FROM  sick_il_services.system_config a
join sick_il_services.system_device b
on a.id = b.system_id
join sick_il_aap.daily_device_read_rate c
on b.id = c.device_id
where 1=1
[[AND c.capture_time >= unix_timestamp(concat({{start_date}}, ' 00:00:00.000')) * 1000 ]]
[[AND c.capture_time <= unix_timestamp(concat({{end_date}}, ' 23:59:59.999')) * 1000]]
[[AND a.name = '{{system_name}}']]
group by c.capture_time,a.name, b.name
order by c.capture_time, a.name, b.name) as sub
where sub.valid_read_rate_unformatted < {{expected_read_rate_percent}}

Query 2:

SELECT from_unixtime(capture_time/1000), group_concat(distinct `daily_system_read_rate`.`system_id`) AS `System Ids`,
`daily_system_read_rate`.`condition_level` AS `condition_level`, 
`daily_system_read_rate`.`condition_name` AS `condition_name`, 
sum(`daily_system_read_rate`.`read_count`) AS `read_count`, 
sum(`daily_system_read_rate`.`valid_object_count`) AS `valid_object_count`, 
((sum(`daily_system_read_rate`.`read_count`) * 100) / CASE WHEN sum(`daily_system_read_rate`.`valid_object_count`) = 0 THEN NULL ELSE sum(`daily_system_read_rate`.`valid_object_count`) END) AS `read_rate`
FROM `daily_system_read_rate`
Where condition_id != -1
AND `daily_system_read_rate`.`capture_time` >= unix_timestamp(concat({{start_date}}, ' 00:00:00.000')) * 1000
[[AND `daily_system_read_rate`.`capture_time` <= unix_timestamp(concat({{end_date}}, ' 23:59:59.999')) * 1000]]
[[AND {{system_id}}]]
[[AND {{condition_name}}]]
GROUP BY capture_time, `daily_system_read_rate`.`condition_level`, `daily_system_read_rate`.`condition_name`
ORDER BY capture_time, `daily_system_read_rate`.`condition_level` ASC, `daily_system_read_rate`.`condition_name` ASC

The error message from the MySql log where it crashed is as follows:

2021-03-20T11:50:22.345980-07:00 5196223 [Note] Aborted connection 5196223 to db: 'sick_il_services' user: 'sick_ap_reporting' host: 'localhost' (Got timeout reading communication packets)
22:45:44 UTC - mysqld got exception 0xc00000fd ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=49152
read_buffer_size=2048000
max_used_connections=75
max_threads=200
thread_count=40
connection_count=40
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1221685 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...

Nothing after that until the restart on 3/25...

The error code exactly matches this SO post, where the error was resolved by removing the sub-query.

(I can't post the link, but here's the message)

I'm trying to get the 10 best matches of a fulltext search, using the fulltext score.

The following query works perfectly on my 10M-row MyISAM table:

SELECT
    index,
    MATCH(villes) AGAINST('montreuil') as score
FROM
    (
        SELECT index, villes
        FROM table
        WHERE MATCH(villes) AGAINST('montreuil' IN BOOLEAN MODE)
    ) t

However, I get an error when trying to retrieve the scores in order:

SELECT index
FROM (

       *** PREVIOUS QUERY ***

) t2
ORDER BY t2.score

The error is "Lost connection to MySQL server", shown in MySQL Workbench after 4 seconds.

So far I've tried:

  • using phpmyadmin => same error
  • removing the "order by" => same error
  • extending the mysql buffer memory => same error
  • limiting the number of the subquery results to 10 (using "limit 10") => weird error "cannot find a fulltext index"

The MySQL log is unclear to me:

10:10:58 UTC - mysqld got exception 0xc00000fd ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=67108864
read_buffer_size=8388608
max_used_connections=3
max_threads=151
thread_count=2
connection_count=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2541349 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x21ee0a9b580

Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
2019-04-16T10:11:12.359438Z 0 [Warning] 'NO_AUTO_CREATE_USER' sql mode was not set.

EDIT

SHOW CREATE TABLE :

'CREATE TABLE `table` (
  `index` int(11) NOT NULL,
  `villes` text,
  PRIMARY KEY (`index`),
  FULLTEXT KEY `villes` (`villes`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8'

EXPLAIN :

type: fulltext
possible keys: villes
key: villes
ref: const
rows: 1
filtered: 100.00
extra: using where, using filesort

And here's the fix:

I couldn't find a direct solution to this problem. I read some posts explaining that sorting a fulltext subquery might cause memory issues.

However, a workaround was to create a temporary table with the result of the subquery and then sort that, which works perfectly.
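A sketch of that workaround, using the table and column names from the quoted post (untested; just my reading of the described fix):

```sql
-- Untested sketch of the described workaround: materialize the fulltext
-- subquery into a temporary table first, then sort the temp table separately.
CREATE TEMPORARY TABLE tmp_scores AS
    SELECT `index`,
           MATCH(villes) AGAINST('montreuil') AS score
    FROM `table`
    WHERE MATCH(villes) AGAINST('montreuil' IN BOOLEAN MODE);

SELECT `index`
FROM tmp_scores
ORDER BY score DESC
LIMIT 10;

DROP TEMPORARY TABLE tmp_scores;
```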

Is this simply a matter of restructuring the query to remove the subquery? I put it in for ease of use, but if it's going to crash MySQL, it's obviously useless. Is this a Metabase problem or a MySQL/subquery problem?

Thanks!

Hi! Try the following:

  1. update Metabase to the latest version (the one you have is really old)
  2. try MySQL 8, or at least the most recent release of the 5.7 branch, to rule out any known issues

regards,

Hi Luiggi,

We may do this, but it would be a fairly substantial effort.

For example, do you know if there have been any issues with past versions of Metabase and sub-queries?

Do the log crashes I posted remind you of anything you've run across?

Hi @markd

MyISAM is a pain when it comes to optimization. I'm currently running a real-time system with millions of records, and keeping the system fast and optimized is a pain. And the table-locking drives me insane.

You're likely seeing issues with an overloaded MySQL when Metabase does its sync/scan.
There have been so many changes to the sync/scan process in Metabase since 0.33.2 that upgrading would likely help.

Also, limiting the amount of resources (connections/CPU/etc) that Metabase can consume from MySQL would probably be advisable.

Change the scan so it doesn't run every night, but perhaps just weekly (or not at all, if you don't need it):
https://www.metabase.com/docs/latest/administration-guide/01-managing-databases.html#database-sync-and-analysis

I'm not even going to start recommending MariaDB, not-Windows, not-MyISAM, etc, since that's a huge project by itself.

@flamber - thanks for the excellent feedback.

I'm assuming the re-scanning process is meant to optimize Metabase reports. So, if no or limited scanning is done, does that mean the sync/scan would only run when you kicked off a query? Or just that the query would run slowly? Would a re-scan run on a limited basis actually increase the risk of a system crash, since it's playing catch-up?

Also, if the cause is a re-scan, as you suspect, would restructuring the query to remove the subquery help? Or is that just a red herring, and would following the other suggestions (reduce connections, reduce scan frequency) be the more likely fix?

Sorry for all the questions - I'm just trying to figure out the next steps.

We also may consider an upgrade to metabase. Better now than after the release.

@markd
With reference to the documentation I linked to, Metabase does 3 things: Sync, Fingerprint and Scan.

  • Sync is required, so Metabase knows which schemas/tables/columns are in your database - otherwise Metabase would not be able to present them in the UI.
    This is a lightweight process and is done every hour by default, but can be changed to daily.
  • Fingerprint is done to categorize each column and determine which functionality it has. This can be a heavy process and cannot be disabled, but fingerprinting is rarely redone (unless manual re-fingerprinting is enabled).
    Fingerprinting is done on first sync too, which might be part of the reason for your problems.
  • Scan is optional. It is done to include values in dropdown lists for easier usage. It can be disabled, but that would mean there would not be any dropdown lists.

All of this has nothing to do with the queries in your questions - it's all background processes.

But please just upgrade. The version you're using has multiple known security flaws, which have been fixed in later versions (the same goes for the MySQL version).

No matter what, a client (in this case Metabase) should not be able to crash a server (MySQL); if it can, that's a bug in the server (or its configuration).

@flamber Thanks again for the illuminating feedback.

Another question (of course). Is it possible to create a temporary table in a Metabase query? Essentially, stringing two queries together: one to create the table, the next to read it. I'm trying to factor out the sub-query, and from my limited knowledge that seems to be the only way to do it.

I don't think you can create tables in Metabase. @flamber let me know if I'm wrong.

@markd Metabase is a reporting tool and should, in that sense, only have read access, but it is possible to run write commands and multi-statement queries, though with very limited feedback.
You will get various errors, since Metabase is expecting a proper return, but it still does its thing - it's not officially supported and might be removed at any point.
https://github.com/metabase/metabase/issues/4050
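If you do experiment with it, a multi-statement question could look something like this sketch - the temp table name here is made up, and Metabase may still report errors even when the server executes the statements:

```sql
-- Hypothetical sketch only: unsupported multi-statement usage.
-- The final SELECT is what Metabase attempts to display.
CREATE TEMPORARY TABLE tmp_rates AS
    SELECT device_id,
           read_count / valid_object_count * 100 AS read_rate
    FROM sick_il_aap.daily_device_read_rate;

SELECT * FROM tmp_rates ORDER BY read_rate;
```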

I'm not sure what you're getting at with the sub-query - I don't understand what it has to do with this.
The MySQL server crashed - that's a bug, which is likely fixed if you update to the latest 5.7 release.
If you can consistently reproduce the crash (simply by running one query from any client) on the latest MySQL release, then you should enable debugging mode and report the problem to MySQL.

@flamber - you're right, of course. I have run the query many times and it hasn't crashed MySQL since.

The only reason I'm getting rid of the subquery is that the error code for the crash in the MySQL log matched directly to an SO post where the crash was caused by a MySQL subquery. So I'd just rather eliminate that variable.

Fortunately, it turned out that by using a HAVING clause, I was able to do so.
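The rough idea, sketched against Query 1 above (not the exact production query, and it assumes MySQL's extension that lets HAVING reference select aliases):

```sql
-- Sketch only, based on Query 1 above: filtering on the computed rate
-- via HAVING removes the need for the wrapping sub-select.
SELECT left(from_unixtime(c.capture_time/1000), 10) AS date,
       a.name AS system_name,
       b.name AS device_name,
       c.read_count / c.valid_object_count * 100 AS valid_read_rate_unformatted
FROM sick_il_services.system_config a
JOIN sick_il_services.system_device b ON a.id = b.system_id
JOIN sick_il_aap.daily_device_read_rate c ON b.id = c.device_id
GROUP BY c.capture_time, a.name, b.name
HAVING valid_read_rate_unformatted < {{expected_read_rate_percent}}
ORDER BY c.capture_time, a.name, b.name
```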

We will definitely be upgrading Metabase in the next release, and very possibly MySQL. Management doesn't particularly want to replace MySQL for this release, since the current version has been highly tested, including performance testing, and they don't want to introduce the possibility of unanticipated errors just before release. It's not optimal, but it's the kind of real-world trade-off being considered.

Thanks again for all your help! I do have another question relating to visualization of the updated query that I'll post separately.

@markd I understand what management is thinking, but it's wrong. When it's an old version of software with known security flaws (doesn't matter which software it is), and there has been a crash in the test environment, then it's not ready for production.

I know there are a lot of other factors in play - and timelines - so I can understand the decision to stay with what you've got, but unless the project has been almost two years in the making, I don't understand the reason for the old versions.
It seems like management is placing security below many other things - a decision I would not agree with.

As for MySQL crashing, and updating it - think of this scenario:
Someone with an ancient Internet Explorer 6 visits your brand-new webserver project, which you have battle-tested, but for some reason IE6 sends a request that makes your webserver crash.
Is the problem IE6 or the webserver? It's a problem with the webserver - no discussion about that.

As for the sub-query - I think that's just a smoke-screen and not the actual trigger of the problem. But you should run your testing environments with a lot more logging (debug, perhaps), so you would have had a chance to figure out more precisely what caused the crash.

By the way, thank you for not mixing multiple very different topics in the same forum topic. This topic is difficult enough by itself.

@flamber - true enough. There are a number of factors at play here, and one thing I didn't mention is that this is not a project that runs on the web - it's distributed, installed, and run in controlled environments that have no or very limited access to the internet. We specify which browsers they can use, etc. The other consideration is that this crash was only discovered two days ago, so we're still digesting the implications.

I also suspect the subquery isn't the cause, but it was the first thing that came into focus, so I was pretty much put in a position where I was obligated to get rid of it.

I wonder how I could replicate the crash. It may have something to do with the initial scans/fingerprints you described. Maybe we can get the QA team to do some work on it. One factor is that the project had been running for a few months (Dec 15) before Metabase was used, so there was a lot of data/metadata to ingest initially. Just a thought.

Anyway, thank you for your feedback, and I also saw you responded to the separate topic on the updated query. I'm really excited about Metabase, and we're just starting to scratch the surface on learning about it. We just have to understand what caused that crash. I will keep you posted.