Metadata API endpoint not working for one database

Hi all,

Since upgrading to v0.46.1 (from v0.42.3) we have been unable to reach the /api/database/N/metadata endpoint for just one of our databases, our main Redshift data warehouse. I am not sure what is going on, but it times out before returning anything, whilst the same endpoint for another database (N=4) works just fine and returns JSON.

N=2 (Redshift) returns a 504 Gateway Timeout, and the Metabase logs show:

[a167b5ad-7695-411c-8047-91d5774cdc4b] 2023-05-04T23:25:46+01:00 ERROR metabase.server.middleware.log GET /api/database/2/metadata 500 5.0 mins (1,672 DB calls) 
{:via
 [{:type java.io.IOException,
   :message "java.util.concurrent.TimeoutException: Idle timeout expired: 200000/200000 ms",
   :at [org.eclipse.jetty.util.SharedBlockingCallback$Blocker block "SharedBlockingCallback.java" 221]}
  {:type java.util.concurrent.TimeoutException,
   :message "Idle timeout expired: 200000/200000 ms",
   :at [org.eclipse.jetty.io.IdleTimeout checkIdleTimeout "IdleTimeout.java" 170]}],
 :trace
 [[org.eclipse.jetty.io.IdleTimeout checkIdleTimeout "IdleTimeout.java" 170]
  [org.eclipse.jetty.io.IdleTimeout idleCheck "IdleTimeout.java" 112]
  [java.util.concurrent.Executors$RunnableAdapter call nil -1]
  [java.util.concurrent.FutureTask run nil -1]
  [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask run nil -1]
  [java.util.concurrent.ThreadPoolExecutor runWorker nil -1]
  [java.util.concurrent.ThreadPoolExecutor$Worker run nil -1]
  [java.lang.Thread run nil -1]],
 :cause "Idle timeout expired: 200000/200000 ms",
 :message "java.util.concurrent.TimeoutException: Idle timeout expired: 200000/200000 ms"}

The main difference seems to be the number of tables the endpoint is fetching metadata for. The one that is failing is our main data warehouse, whilst the others are smaller databases.

Does anyone have any idea what might be causing this? Could there be a permission issue on the Metabase Postgres DB? Or some upper limit on the amount of data that can be returned? This endpoint worked fine before the migration...

There are some further error logs that begin with:

[a167b5ad-7695-411c-8047-91d5774cdc4b] 2023-05-04T23:03:33+01:00 ERROR metabase.server Unexpected Exception in API request handler
org.eclipse.jetty.io.EofException: Closed

and

[a167b5ad-7695-411c-8047-91d5774cdc4b] 2023-05-04T23:03:33+01:00 ERROR metabase.server Unexpected exception in endpoint
org.eclipse.jetty.io.EofException: Closed

Checking our postgres database, there are no locked queries.
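For completeness, this is roughly the check I ran against the Metabase application database. It's only a sketch: the connection handling is left out, and any DB-API connection (e.g. psycopg2) should work. It lists sessions in pg_stat_activity that are currently waiting on a lock:

```python
# SQL to find sessions waiting on a heavyweight lock (Postgres >= 9.6,
# where pg_stat_activity exposes wait_event_type / wait_event).
BLOCKED_QUERIES_SQL = """
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
"""


def find_blocked_queries(conn):
    """Run the lock-wait check over any DB-API connection to the app DB."""
    with conn.cursor() as cur:
        cur.execute(BLOCKED_QUERIES_SQL)
        return cur.fetchall()
```

In our case this returned no rows, which is why I say there are no locked queries.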

I would rather not roll back if I can avoid it, but without this endpoint working we cannot sync our docs to Metabase, which is a pain!

Thanks!

Is this endpoint getting called when you run something? Can you help us debug by telling us when it gets hit? Also: do you see anything in your Postgres that might be causing a slowdown?

It gets called in a CI/CD process, typically via the dbt-metabase Python package. However, I have also been calling it directly in my browser (already authenticated) to debug, and I hit the same issue.
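For anyone else debugging this, here is roughly how I time the call outside the browser. The host and session token are placeholders for our setup; X-Metabase-Session is Metabase's standard session auth header, and the session token comes from POST /api/session with your credentials:

```python
import json
import time
import urllib.request


def metadata_url(base: str, db_id: int) -> str:
    """Build the metadata endpoint URL for a given database id."""
    return f"{base}/api/database/{db_id}/metadata"


def timed_metadata_call(base: str, session_id: str, db_id: int = 2):
    """Hit the endpoint with a generous client-side timeout and report timing."""
    req = urllib.request.Request(
        metadata_url(base, db_id),
        headers={"X-Metabase-Session": session_id},  # Metabase session auth
    )
    start = time.monotonic()
    # Client timeout set well past any proxy/load-balancer limit.
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - start
    return resp.status, elapsed, len(body.get("tables", []))


# Example (requires a running Metabase and a session token):
# status, secs, n_tables = timed_metadata_call("http://localhost:3000", token)
```

This takes the browser and any intermediate proxy out of the picture when measuring how long the endpoint itself takes.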

I have been using the Metabase troubleshooting logs / K8s logs to get the output above. Is anything being called aside from the PG backend? Do you know if the way the data is fetched has changed between versions?

I also tried running the same API call on our development metabase application to our dev Redshift DWH and hit the exact same timeout issue.

This instance doesn't share a PG backend with production, by the way.

Just a quick update on this.

I set up a brand new instance of Metabase locally, using Docker, and connected it to the same data warehouse. Calling http://localhost:3000/api/database/2/metadata returned data, slowly. I then let Metabase finish syncing all the table names and columns in the background and retried the API call, which took around 50 seconds to respond. I am guessing it didn't time out because I am running it locally.

Given this issue wasn't present on the same database before the upgrade, this feels like a performance regression introduced between versions, or perhaps a new timeout rule.

Does anyone have any insight into changes of this nature?

thanks

So you're saying that you call that endpoint first to sync metadata, and then the second time it takes a huge amount of time? How many databases/fields do you have? How much data did you sync? I'm guessing this is a performance issue on the endpoint.

When I was testing it locally, I tried calling the endpoint before the synchronisation had finished, and then again straight after. All that told me was that the endpoint only really becomes usable once the sync finishes.

The main thing is that prior to upgrading to the new version of Metabase we could call the endpoint without problems, despite there being no change in the number of tables ("1,672 DB calls" in the log). Now it appears too slow and we hit a timeout on our K8s deployment of Metabase. It does work against a local instance of Metabase, but takes around 50 seconds.

So I agree with you that this must be a performance issue on the endpoint. If it is querying the application database, maybe an index is missing or something simple like that?

How many tables and fields do you have?

752 tables and 9,219 columns are exposed to Metabase.

After a bit more testing, I am pretty sure this is related to the Elastic Load Balancer idle timeout on our K8s (EC2) deployment being set at 60 seconds. I will try increasing it.
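If it helps anyone else: assuming the ELB is provisioned by a Kubernetes Service of type LoadBalancer (as in our setup), the idle timeout can be raised with a Service annotation along these lines. The Service name, ports, and the 600-second value here are illustrative, not our exact manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: metabase
  annotations:
    # Raise the ELB idle timeout (seconds) above the endpoint's response time.
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "600"
spec:
  type: LoadBalancer
  selector:
    app: metabase
  ports:
    - port: 80
      targetPort: 3000
```

The key point is that the idle timeout has to exceed the ~50 seconds the metadata endpoint now takes, otherwise the ELB cuts the connection and you see the 504.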