Questions longer than 10 minutes are being canceled

We have long queries (MongoDB aggregations) that usually take 12-15 minutes to load, but recently we've had problems loading them; sometimes the question simply stops. After reading through the logs I found this:

Exception in thread "async-dispatch-429" java.lang.IllegalStateException: AsyncContext completed and/or Request lifecycle recycled
	at org.eclipse.jetty.server.AsyncContextState.state(AsyncContextState.java:54)
	at org.eclipse.jetty.server.AsyncContextState.complete(AsyncContextState.java:74)
	at ring.util.servlet$make_output_stream$fn__46932.invoke(servlet.clj:89)
	at ring.util.servlet.proxy$java.io.FilterOutputStream$ff19274a.close(Unknown Source)
	at java.base/sun.nio.cs.StreamEncoder.implClose(Unknown Source)
	at java.base/sun.nio.cs.StreamEncoder.close(Unknown Source)
	at java.base/java.io.OutputStreamWriter.close(Unknown Source)
	at java.base/java.io.BufferedWriter.close(Unknown Source)
	at metabase.async.api_response$write_chan_vals_to_writer_BANG_$fn__67375$state_machine__6775__auto____67380$fn__67382.invoke(api_response.clj:87)
	at metabase.async.api_response$write_chan_vals_to_writer_BANG_$fn__67375$state_machine__6775__auto____67380.invoke(api_response.clj:79)
	at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:973)
	at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:972)
	at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:977)
	at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:975)
	at clojure.core.async.impl.ioc_macros$take_BANG_$fn__6793.invoke(ioc_macros.clj:986)
	at clojure.core.async.impl.channels.ManyToManyChannel$fn__1679$fn__1680.invoke(channels.clj:95)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
11-25 20:14:27 DEBUG middleware.log :: POST /api/card/801/query 200 [ASYNC: canceled] 10.0 mins (7 DB calls) Jetty threads: 2/50 (4 idle, 0 queued) (119 total active threads) Queries in flight: 10

And this happens to every question that takes more than 10 minutes to load. What seems strange to me is that if I run the aggregations directly on the DB (using Robo 3T), the queries run normally, sometimes even faster.

We already have the cache enabled, but that hasn't helped much, and our Metabase version is 0.33.6.

Also, what does "Request lifecycle recycled" mean?

Sometimes we also get this error:

11-25 20:35:43 WARN middleware.process-userland-query :: Query failure {:status :failed,
 :class com.mongodb.MongoInterruptedException,
 :error "Interrupted acquiring a permit to retrieve an item from the pool ",
 :stacktrace
 ("com.mongodb.internal.connection.ConcurrentPool.acquirePermit(ConcurrentPool.java:203)"
  "com.mongodb.internal.connection.ConcurrentPool.get(ConcurrentPool.java:140)"
  "com.mongodb.internal.connection.ConcurrentPool.get(ConcurrentPool.java:123)"
  "com.mongodb.internal.connection.PowerOfTwoBufferPool.getBuffer(PowerOfTwoBufferPool.java:78)"
  "com.mongodb.internal.connection.SocketStream.read(SocketStream.java:105)"
  "com.mongodb.internal.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:578)"
  "com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:441)"
  "com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:295)"
  "com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:255)"
  "com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99)"
  "com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:444)"
  "com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72)"
  "com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:200)"
  "com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269)"
  "com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)"
  "com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123)"
  "com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:242)"
  "com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:233)"
  "com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:136)"
  "com.mongodb.operation.AggregateOperationImpl$1.call(AggregateOperationImpl.java:193)"
  "com.mongodb.operation.AggregateOperationImpl$1.call(AggregateOperationImpl.java:189)"
  "com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:462)"
  "com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:406)"
  "com.mongodb.operation.AggregateOperationImpl.execute(AggregateOperationImpl.java:189)"
  "com.mongodb.operation.AggregateOperation.execute(AggregateOperation.java:295)"
  "com.mongodb.operation.AggregateOperation.execute(AggregateOperation.java:41)"
  "com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:179)"
  "com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:161)"
  "com.mongodb.DBCollection.aggregate(DBCollection.java:1502)"
  "com.mongodb.DBCollection.aggregate(DBCollection.java:1462)"
  "com.mongodb.DBCollection.aggregate(DBCollection.java:1447)"
  "monger.collection$aggregate.invokeStatic(collection.clj:551)"
  "monger.collection$aggregate.doInvoke(collection.clj:537)"
  "clojure.lang.RestFn.invoke(RestFn.java:573)"
  "--> driver.mongo.query_processor$execute_query.invokeStatic(query_processor.clj:744)"
  "driver.mongo.query_processor$execute_query.invoke(query_processor.clj:736)"
  "driver.mongo$eval2311$fn__2312.invoke(mongo.clj:185)"
  "query_processor$fn__44449$execute_query__44454$fn__44455.invoke(query_processor.clj:71)"
  "query_processor$fn__44449$execute_query__44454.invoke(query_processor.clj:65)"
  "query_processor.middleware.mbql_to_native$mbql__GT_native$fn__20887.invoke(mbql_to_native.clj:38)"
  "query_processor.middleware.annotate$result_rows_maps__GT_vectors$fn__36504.invoke(annotate.clj:541)"
  "query_processor.middleware.annotate$add_column_info$fn__36410.invoke(annotate.clj:485)"
  "query_processor.middleware.cumulative_aggregations$handle_cumulative_aggregations$fn__37445.invoke(cumulative_aggregations.clj:57)"
  "query_processor.middleware.resolve_joins$resolve_joins$fn__41557.invoke(resolve_joins.clj:184)"
  "query_processor.middleware.limit$limit$fn__38522.invoke(limit.clj:19)"
  "query_processor.middleware.results_metadata$record_and_return_metadata_BANG_$fn__44316.invoke(results_metadata.clj:87)"
  "query_processor.middleware.format_rows$format_rows$fn__38510.invoke(format_rows.clj:26)"
  "query_processor.middleware.add_dimension_projections$add_remapping$fn__34938.invoke(add_dimension_projections.clj:232)"
  "query_processor.middleware.add_source_metadata$add_source_metadata_for_source_queries$fn__35638.invoke(add_source_metadata.clj:107)"
  "query_processor.middleware.resolve_source_table$resolve_source_tables$fn__41607.invoke(resolve_source_table.clj:46)"
  "query_processor.middleware.add_row_count_and_status$add_row_count_and_status$fn__35498.invoke(add_row_count_and_status.clj:16)"
  "driver.mongo$eval2143$fn__2144$fn__2146$f__857__auto____2148.invoke(mongo.clj:64)"
  "driver.mongo.util$_with_mongo_connection$fn__848.invoke(util.clj:205)"
  "util.ssh$with_ssh_tunnel_STAR_.invokeStatic(ssh.clj:106)"
  "util.ssh$with_ssh_tunnel_STAR_.invoke(ssh.clj:91)"
  "driver.mongo.util$_with_mongo_connection.invokeStatic(util.clj:199)"
  "driver.mongo.util$_with_mongo_connection.invoke(util.clj:194)"
  "driver.mongo$eval2143$fn__2144$fn__2146.invoke(mongo.clj:63)"
  "query_processor.middleware.driver_specific$process_query_in_context$fn__37521.invoke(driver_specific.clj:12)"
  "query_processor.middleware.resolve_driver$resolve_driver$fn__41221.invoke(resolve_driver.clj:22)"
  "query_processor.middleware.bind_effective_timezone$bind_effective_timezone$fn__36835$fn__36836.invoke(bind_effective_timezone.clj:9)"
  "util.date$call_with_effective_timezone.invokeStatic(date.clj:88)"
  "util.date$call_with_effective_timezone.invoke(date.clj:77)"
  "query_processor.middleware.bind_effective_timezone$bind_effective_timezone$fn__36835.invoke(bind_effective_timezone.clj:8)"
  "query_processor.middleware.store$initialize_store$fn__44341$fn__44342.invoke(store.clj:11)"
  "query_processor.store$do_with_store.invokeStatic(store.clj:46)"
  "query_processor.store$do_with_store.invoke(store.clj:40)"
  "query_processor.middleware.store$initialize_store$fn__44341.invoke(store.clj:10)"
  "query_processor.middleware.async$async__GT_sync$fn__20798.invoke(async.clj:23)"
  "query_processor.middleware.async_wait$runnable$fn__36561.invoke(async_wait.clj:89)"

Hi @mateusluzifb

Which version of MongoDB?
Have you defined any Connection String options in Admin > Databases > (mongo)?
Or have you tried setting connectTimeout=1200000 (1000 ms * 60 * 20, i.e. 20 minutes)?

There shouldn’t be any timeouts. I know there’s a timeout in the Druid driver, but I cannot see any in the MongoDB driver.

The only issue that references the first error is #11391, but that’s specific to Postgres.

Doing a bit of searching, I found things like this, which links to their JIRA with “works as designed” (whatever that means):
https://stackoverflow.com/questions/40962094/interrupted-acquiring-a-permit-to-retrieve-an-item-from-the-pool-in-mongdb

Hi @flamber

Mongo version: 3.6.5 (WiredTiger)

What seems strange to me is that the queries that take more than 10 minutes run normally using Robo 3T; I only get this problem when running them in Metabase. That makes me think it’s a driver thing, and the Stack Overflow link you attached reinforces that, don’t you think?

@mateusluzifb
I don’t know what Robo 3T is, but I guess it’s a client. Metabase is also just a client, but it uses different drivers for the connection (unless Robo 3T is also built in Clojure), so maybe it has something to do with the drivers, or Robo 3T sets some configuration that extends the timeout.

@camsaul Is there some hidden timeout on the MongoDB driver that kills the connection after 10 minutes exactly? It seems there’s one in the Druid driver, so that’s why I’m asking.

@camsaul I wasn’t able to find the 10 min timeout, can you help me?

Also, our current workaround is Mongo’s map-reduce; we’re using it to minimize the query load time.

@mateusluzifb I’m hoping Cam will pop in with an answer, but everyone is working hard on the upcoming 0.34 release.

Have you checked if it could be a timeout on your server?
https://docs.mongodb.com/manual/faq/diagnostics/#does-tcp-keepalive-time-affect-mongodb-deployments
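If you want to check the keepalive on your server, a minimal sketch on Linux (the 120-second value is just the one that FAQ suggests) would be:

# show the current TCP keepalive time in seconds
sysctl net.ipv4.tcp_keepalive_time
# lower it temporarily to the roughly 120 seconds the MongoDB FAQ recommends
sudo sysctl -w net.ipv4.tcp_keepalive_time=120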

Also, it seems like it’s not connectTimeout, but connectTimeoutMS and socketTimeoutMS (milliseconds):
https://docs.mongodb.com/manual/reference/connection-string/#timeout-options
So have you tried adding this to the Connection String in Admin > Databases > (mongodb)?
connectTimeoutMS=10000&socketTimeoutMS=1200000
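For reference, those options just get appended to the connection string, so you could also sanity-check them outside Metabase with the mongo shell, roughly like this (host, user and database names are just placeholders):

# hypothetical connection string with the timeout options, passed to the mongo shell
mongo "mongodb://metabase_user:secret@mongo.example.com:27017/analytics?connectTimeoutMS=10000&socketTimeoutMS=1200000"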

I am getting a similar error with Oracle. Increasing the Oracle ReadTimeout by setting -Doracle.jdbc.ReadTimeout=900000 does not affect the 10 minute timeout.
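In case it helps anyone trying the same thing, a JVM system property like that is normally passed when starting the JAR, roughly like this (the jar file name is just an example):

# pass the Oracle read timeout as a JVM option when starting Metabase
java -Doracle.jdbc.ReadTimeout=900000 -jar metabase.jar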

@k_p There shouldn’t be any timeouts on Oracle, as far as I know. It seems like ReadTimeout is meant to prevent errors when the database connection is down, but that would be a different problem from the one in this topic.
There are timeouts on BigQuery and Druid currently: https://github.com/metabase/metabase/issues/12003

I was able to work around the 10 minute query timeout by setting this environment variable:

# set timeout to 15 minutes (15 * 60 * 1000)
export MB_JETTY_ASYNC_RESPONSE_TIMEOUT=900000

This defaults to 10 minutes in the Jetty startup.

I am extending this value by a lot to handle some of our long-running queries - but there may be further consequences I’m not aware of.

Speculation: I’m assuming this has accidentally affected query timeouts, since the “async Jetty thread” that handles the AJAX query request from the browser is timed out after 10 minutes, which causes all resources to be cleaned up - i.e. the connection to the DB is prematurely closed by Jetty.
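In case it helps anyone running Metabase in Docker rather than from the JAR, the same variable can be passed to the container, roughly like this (the port mapping is just an example):

# bump the async response timeout to 15 minutes in a Docker deployment
docker run -d -p 3000:3000 -e MB_JETTY_ASYNC_RESPONSE_TIMEOUT=900000 metabase/metabase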

@k_p I don’t know which version you’re using, but a lot of work has been put into the async Jetty handling for the upcoming 0.35.0, and there have been several changes since it was introduced too.

We are using v0.34.2 at the moment (the latest) and I am currently migrating long-running queries from another tool into Metabase.

@k_p The latest version is 0.34.3