Frequent GC in metabase application

I see frequent GC in metabase logs.
Also, we see around 22k queries from our app on a peak usage day, what are best practise around managing these workloads.
Database is Redshift and metabase backend is postgress. Have two instances of metabase running behind a round robin balancer. 24G is allocated as memory to app.
As we see because of GC allocation failures because of which we have to re-start our app multiple times a day. Any advices are welcome.

Hi @hks21
Which version of Metabase?
How are you running Metabase - JAR, Docker, …?
Since you’re have high load, I’m guessing you’re Java flags/parameters for tuning - which?
When you have to restart Metabase, how much memory is it using?
Can you post a GC error?

Hi @flamber,
We are running a jar on EC2 instance(aws).
When allocation failure occurs heap usage is almost 24 G.
JVM options -
-Xmx24g -Xms24g -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:-ResizePLAB -XX:+PerfDisableSharedMem -XX:+UseCompressedOops -XX:+UseStringDeduplication -XX:+ParallelRefProcEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
Few question -

  1. Is query cache in memory or stored in DB, I could not find the DB where it is stored. Since, our cache MINIMUM QUERY DURATION is 5 I think all the queries are getting cached, which I want to increase.
  2. Also, is it better to have smaller JVM backed apps like 10G and 4 application behind load balancer or like now where we have 24G and two app instances behind load balancer.
  3. Does number of in flight queries also has an impact, we have increased it to 200 per instance which is due to our heavy work load.
  4. Does the huge number of downloads contribute to frequent GC or queries with heavy output. We face both like downloads and heavy queries in this case what could be best way to run metabase.

GC log for one of the GC triggered -
07-24 12:01:48 DEBUG middleware.log :: GET /api/database 200 8 ms (4 DB calls) Jetty threads: 6/256 (16 idle, 0 queued) (138 total active threads) Queries in flight: 57
82426.375: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 83886080 bytes, new threshold 15 (max 15)

  • age 1: 72873272 bytes, 72873272 total
    , 0.3732569 secs]
    [Parallel Time: 97.3 ms, GC Workers: 4]
    [GC Worker Start (ms): Min: 82426375.5, Avg: 82426375.5, Max: 82426375.5, Diff: 0.0]
    [Ext Root Scanning (ms): Min: 1.9, Avg: 2.3, Max: 2.9, Diff: 0.9, Sum: 9.2]
    [Update RS (ms): Min: 16.3, Avg: 16.3, Max: 16.4, Diff: 0.2, Sum: 65.3]
    [Processed Buffers: Min: 59, Avg: 68.8, Max: 84, Diff: 25, Sum: 275]
    [Scan RS (ms): Min: 0.2, Avg: 0.6, Max: 1.0, Diff: 0.8, Sum: 2.4]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 77.9, Avg: 78.0, Max: 78.2, Diff: 0.3, Sum: 312.0]
    [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [GC Worker Total (ms): Min: 97.3, Avg: 97.3, Max: 97.3, Diff: 0.0, Sum: 389.1]
    [GC Worker End (ms): Min: 82426472.8, Avg: 82426472.8, Max: 82426472.8, Diff: 0.0][Code Root Fixup: 0.0 ms][Code Root Purge: 0.0 ms]
    [String Dedup Fixup: 273.8 ms, GC Workers: 4]
    [Queue Fixup (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Table Fixup (ms): Min: 273.7, Avg: 273.7, Max: 273.8, Diff: 0.1, Sum: 1095.0]
    [Clear CT: 0.2 ms]
    [Other: 1.9 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 1.3 ms]
    [Ref Enq: 0.1 ms]
    [Redirty Cards: 0.0 ms]
    [Humongous Register: 0.1 ms]
    [Humongous Reclaim: 0.0 ms]
    [Free CSet: 0.1 ms]
    [Eden: 1104.0M(1104.0M)->0.0B(1056.0M) Survivors: 112.0M->160.0M Heap: 23.3G(24.0G)->22.3G(24.0G)]
    [Times: user=1.49 sys=0.00, real=0.37 secs]
    82427.332: [GC pause (G1 Evacuation Pause) (young)
    Desired survivor size 83886080 bytes, new threshold 2 (max 15)
  • age 1: 74210736 bytes, 74210736 total
  • age 2: 72818296 bytes, 147029032 total
    , 0.4098389 secs]
    [Parallel Time: 133.4 ms, GC Workers: 4]
    [GC Worker Start (ms): Min: 82427332.2, Avg: 82427332.2, Max: 82427332.2, Diff: 0.0]
    [Ext Root Scanning (ms): Min: 1.9, Avg: 2.3, Max: 2.9, Diff: 1.0, Sum: 9.2]
    [Update RS (ms): Min: 2.9, Avg: 3.4, Max: 3.6, Diff: 0.7, Sum: 13.4]
    [Processed Buffers: Min: 17, Avg: 23.2, Max: 30, Diff: 13, Sum: 93]
    [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 127.3, Avg: 127.6, Max: 128.0, Diff: 0.7, Sum: 510.5]
    [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [GC Worker Total (ms): Min: 133.3, Avg: 133.3, Max: 133.3, Diff: 0.0, Sum: 533.3]
    [GC Worker End (ms): Min: 82427465.5, Avg: 82427465.5, Max: 82427465.5, Diff: 0.0][Code Root Fixup: 0.0 ms][Code Root Purge: 0.0 ms]
    [String Dedup Fixup: 274.1 ms, GC Workers: 4]
    [Queue Fixup (ms): Min: 0.1, Avg: 0.4, Max: 0.8, Diff: 0.7, Sum: 1.4]
    [Table Fixup (ms): Min: 273.2, Avg: 273.7, Max: 273.9, Diff: 0.7, Sum: 1094.6]
    [Clear CT: 0.1 ms]
    [Other: 2.1 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 1.3 ms]
    [Ref Enq: 0.1 ms]
    [Redirty Cards: 0.3 ms]
    [Humongous Register: 0.1 ms]
    [Humongous Reclaim: 0.0 ms]
    [Free CSet: 0.1 ms]
    [Eden: 1056.0M(1056.0M)->0.0B(1056.0M) Survivors: 160.0M->160.0M Heap: 23.3G(24.0G)->22.4G(24.0G)]
    [Times: user=1.63 sys=0.01, real=0.41 secs]
    82427.818: [GC concurrent-string-deduplication, 9013.1K->1422.2K(7590.9K), avg 72.4%, 0.0760109 secs]
    82428.289: [GC pause (G1 Evacuation Pause) (young)
    Desired survivor size 83886080 bytes, new threshold 2 (max 15)
  • age 1: 71172144 bytes, 71172144 total
  • age 2: 74163528 bytes, 145335672 total
    , 0.4084746 secs]
    [Parallel Time: 131.4 ms, GC Workers: 4]
    [GC Worker Start (ms): Min: 82428289.1, Avg: 82428289.1, Max: 82428289.2, Diff: 0.0]
    [Ext Root Scanning (ms): Min: 2.0, Avg: 2.4, Max: 3.1, Diff: 1.1, Sum: 9.7]
    [Update RS (ms): Min: 16.6, Avg: 16.9, Max: 17.1, Diff: 0.5, Sum: 67.8]
    [Processed Buffers: Min: 50, Avg: 71.0, Max: 93, Diff: 43, Sum: 284]
    [Scan RS (ms): Min: 0.9, Avg: 1.2, Max: 1.4, Diff: 0.5, Sum: 4.8]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 110.6, Avg: 110.7, Max: 110.9, Diff: 0.3, Sum: 442.9]
    [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [GC Worker Total (ms): Min: 131.3, Avg: 131.3, Max: 131.4, Diff: 0.0, Sum: 525.3]
    [GC Worker End (ms): Min: 82428420.5, Avg: 82428420.5, Max: 82428420.5, Diff: 0.0][Code Root Fixup: 0.0 ms][Code Root Purge: 0.0 ms]
    [String Dedup Fixup: 274.8 ms, GC Workers: 4]
    [Queue Fixup (ms): Min: 0.0, Avg: 0.4, Max: 0.8, Diff: 0.8, Sum: 1.5]
    [Table Fixup (ms): Min: 273.9, Avg: 274.3, Max: 274.6, Diff: 0.8, Sum: 1097.1]
    [Clear CT: 0.2 ms]
    [Other: 2.1 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 1.3 ms]
    [Ref Enq: 0.1 ms]
    [Redirty Cards: 0.2 ms]
    [Humongous Register: 0.1 ms]
    [Humongous Reclaim: 0.0 ms]
    [Free CSet: 0.1 ms]
    [Eden: 1056.0M(1056.0M)->0.0B(1056.0M) Survivors: 160.0M->160.0M Heap: 23.4G(24.0G)->22.5G(24.0G)]
    [Times: user=1.63 sys=0.00, real=0.41 secs]
    82428.779: [GC concurrent-string-deduplication, 9315.0K->1422.8K(7892.2K), avg 72.4%, 0.0814471 secs]
    82429.233: [GC pause (G1 Evacuation Pause) (young)
    Desired survivor size 83886080 bytes, new threshold 2 (max 15)
  • age 1: 71196384 bytes, 71196384 total
  • age 2: 71125200 bytes, 142321584 total
    , 0.4184778 secs]
    [Parallel Time: 140.8 ms, GC Workers: 4]
    [GC Worker Start (ms): Min: 82429233.5, Avg: 82429233.5, Max: 82429233.6, Diff: 0.0]
    [Ext Root Scanning (ms): Min: 1.9, Avg: 2.3, Max: 2.9, Diff: 1.0, Sum: 9.2]
    [Update RS (ms): Min: 27.8, Avg: 27.9, Max: 28.0, Diff: 0.2, Sum: 111.5]
    [Processed Buffers: Min: 71, Avg: 80.0, Max: 87, Diff: 16, Sum: 320]
    [Scan RS (ms): Min: 0.3, Avg: 0.6, Max: 0.8, Diff: 0.5, Sum: 2.5]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 109.7, Avg: 109.9, Max: 110.0, Diff: 0.3, Sum: 439.5]
    [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [GC Worker Total (ms): Min: 140.7, Avg: 140.7, Max: 140.7, Diff: 0.0, Sum: 562.8]
    [GC Worker End (ms): Min: 82429374.2, Avg: 82429374.2, Max: 82429374.2, Diff: 0.0][Code Root Fixup: 0.0 ms][Code Root Purge: 0.0 ms]
    [String Dedup Fixup: 275.5 ms, GC Workers: 4]
    [Queue Fixup (ms): Min: 0.1, Avg: 0.3, Max: 0.6, Diff: 0.5, Sum: 1.4]
    [Table Fixup (ms): Min: 274.8, Avg: 275.0, Max: 275.3, Diff: 0.4, Sum: 1100.2]
    [Clear CT: 0.1 ms]
    [Other: 2.0 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 1.3 ms]
    [Ref Enq: 0.1 ms]
    [Redirty Cards: 0.2 ms]
    [Humongous Register: 0.1 ms]
    [Humongous Reclaim: 0.0 ms]
    [Free CSet: 0.1 ms]
    [Eden: 1056.0M(1056.0M)->0.0B(1056.0M) Survivors: 160.0M->160.0M Heap: 23.5G(24.0G)->22.6G(24.0G)]
    [Times: user=1.66 sys=0.00, real=0.42 secs]
    82429.725: [GC concurrent-string-deduplication, 8293.7K->1730.8K(6562.9K), avg 72.4%, 0.0729526 secs]
    82430.296: [GC pause (G1 Evacuation Pause) (young)
    Desired survivor size 83886080 bytes, new threshold 2 (max 15)
  • age 1: 70923512 bytes, 70923512 total
  • age 2: 71162752 bytes, 142086264 total
    , 0.4095816 secs]
    [Parallel Time: 130.5 ms, GC Workers: 4]
    [GC Worker Start (ms): Min: 82430296.0, Avg: 82430296.0, Max: 82430296.1, Diff: 0.0]
    [Ext Root Scanning (ms): Min: 2.0, Avg: 2.3, Max: 2.9, Diff: 0.9, Sum: 9.3]
    [Update RS (ms): Min: 17.9, Avg: 18.4, Max: 18.9, Diff: 0.9, Sum: 73.6]
    [Processed Buffers: Min: 65, Avg: 72.5, Max: 94, Diff: 29, Sum: 290]
    [Scan RS (ms): Min: 0.3, Avg: 0.7, Max: 0.8, Diff: 0.5, Sum: 2.6]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 108.8, Avg: 109.0, Max: 109.1, Diff: 0.3, Sum: 436.0]
    [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Termination Attempts: Min: 1, Avg: 2.2, Max: 3, Diff: 2, Sum: 9]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [GC Worker Total (ms): Min: 130.4, Avg: 130.4, Max: 130.4, Diff: 0.0, Sum: 521.6]
    [GC Worker End (ms): Min: 82430426.4, Avg: 82430426.4, Max: 82430426.4, Diff: 0.0][Code Root Fixup: 0.0 ms][Code Root Purge: 0.0 ms]
    [String Dedup Fixup: 276.8 ms, GC Workers: 4]
    [Queue Fixup (ms): Min: 0.0, Avg: 0.3, Max: 0.6, Diff: 0.5, Sum: 1.4]
    [Table Fixup (ms): Min: 276.1, Avg: 276.4, Max: 276.7, Diff: 0.6, Sum: 1105.6]
    [Clear CT: 0.2 ms]
    [Other: 2.1 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 1.3 ms]
    [Ref Enq: 0.1 ms]
    [Redirty Cards: 0.2 ms]
    [Humongous Register: 0.1 ms]
    [Humongous Reclaim: 0.0 ms]
    [Free CSet: 0.1 ms]
    [Eden: 1056.0M(1056.0M)->0.0B(1056.0M) Survivors: 160.0M->160.0M Heap: 23.6G(24.0G)->22.7G(24.0G)]
    [Times: user=1.63 sys=0.00, real=0.41 secs]
    82430.778: [GC concurrent-string-deduplication, 8243.1K->1468.0K(6775.1K), avg 72.4%, 0.0720738 secs]
    82431.464: [GC pause (G1 Evacuation Pause) (young)
    Desired survivor size 83886080 bytes, new threshold 2 (max 15)
  • age 1: 71045560 bytes, 71045560 total
  • age 2: 70888672 bytes, 141934232 total
    (to-space exhausted), 0.8409257 secs]
    [Parallel Time: 329.1 ms, GC Workers: 4]
    [GC Worker Start (ms): Min: 82431464.1, Avg: 82431464.1, Max: 82431464.1, Diff: 0.0]
    [Ext Root Scanning (ms): Min: 2.0, Avg: 2.5, Max: 3.4, Diff: 1.4, Sum: 9.8]
    [Update RS (ms): Min: 26.1, Avg: 26.6, Max: 26.8, Diff: 0.7, Sum: 106.5]
    [Processed Buffers: Min: 62, Avg: 77.0, Max: 86, Diff: 24, Sum: 308]
    [Scan RS (ms): Min: 0.3, Avg: 0.5, Max: 0.7, Diff: 0.4, Sum: 2.1]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 299.2, Avg: 299.4, Max: 299.5, Diff: 0.3, Sum: 1197.5]
    [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [GC Worker Total (ms): Min: 329.0, Avg: 329.0, Max: 329.0, Diff: 0.0, Sum: 1316.1]
    [GC Worker End (ms): Min: 82431793.1, Avg: 82431793.1, Max: 82431793.1, Diff: 0.0][Code Root Fixup: 0.0 ms][Code Root Purge: 0.0 ms]
    [String Dedup Fixup: 334.7 ms, GC Workers: 4]
    [Queue Fixup (ms): Min: 0.3, Avg: 0.6, Max: 1.1, Diff: 0.8, Sum: 2.2]
    [Table Fixup (ms): Min: 333.6, Avg: 334.1, Max: 334.3, Diff: 0.8, Sum: 1336.3]
    [Clear CT: 0.2 ms]
    [Other: 176.9 ms]
    [Evacuation Failure: 174.6 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 1.3 ms]
    [Ref Enq: 0.1 ms]
    [Redirty Cards: 0.5 ms]
    [Humongous Register: 0.1 ms]
    [Humongous Reclaim: 0.1 ms]
    [Free CSet: 0.1 ms]
    [Eden: 1056.0M(1056.0M)->0.0B(1120.0M) Survivors: 160.0M->96.0M Heap: 23.7G(24.0G)->23.8G(24.0G)]
    [Times: user=2.71 sys=0.01, real=0.84 secs]

Hi @hks21

Have you tried changing the number of threads for parallel / concurrent GC phases? You might try setting XX:ParallelGCThreads=n using this formula:
n = 8+(logical processors-8)(5/8)
And then also -XX:ConcGCThreads=m using this forumla:
m = n / 4

Also, what version of the JVM are you running and what family EC2 instance?

Hi @octavian,
I will give it a try.
A question on what is the logic behind the params which you suggested?

JVM version -
$ java -version
java version “1.8.0_201”
Java™ SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot™ 64-Bit Server VM (build 25.201-b09, mixed mode)

EC2 instance - m5.2xlarge

@hks21
Which version of Metabase?
You’re probably going to get better answers from @octavian, since that’s one of the core developers, but just to answer some of your questions:
1.) The query cache is stored in the metadata table query_cache. I’m guessing you’re using Postgres for your metadata?
4.) Yes, downloads makes the memory footprint grow and it seems like it’s not releasing it (that’s my experience). There’s a couple of issues open about large downloads, example issue 5187

@flamber Thanks for your response.
Metabase version is v0.32.9 that is the latest version.
Yes, we are using Postgres for our metadata.
For query_cache its strange this table does not shows up in MetabaseBackend list of tables in Metabase UI. I will check in db though.

@hks21

These sort of issues are hard to debug, adding those options just makes the results for predictable even in case of failures in my experience (in this case eliminating lack of resources that could lead to unsuccessful cycles). I’m trying to eliminate the GC cycle as a possible red herring for this issue, as you might be hitting an application OOM (like @flamber said) masked as a GC allocation failure.

Lines like this in the GC output: [Eden: 1056.0M(1056.0M)->0.0B(1120.0M) Survivors: 160.0M->96.0M Heap: 23.7G(24.0G)->23.8G(24.0G)] make me think the GC has very little to actually clean up, and the app is hanging on to memory.

Another thing you might try is opening up a JConsole/VisualVM/JProfiler and connecting to your JVM and doing a heap dump next time the heap is near 24G to see what’s filling it up.

@octavian
Today app crashed and did create a dump(hprof) however thats huge like 32GB file. I am not sure with this size dump eclipse MAT will be able to load this file or not.
I will try to see what going on. Will update on that, meanwhile will try to change jvm settings you suggested.

@hks21

Ah yeah sorry I was thinking to just look at the heap in those profilers, but wrote heap dump. Indeed that’d be too big to really look at without specialized tooling. Wanted to see what classes are being held on to, if you can check? It probably doesn’t have to be nearly full, either. If you are reaching 24G then it’s probably filling with junk early on also, so anytime should work.

hi @octavian/@flamber ,
For heavy workloads like us where on average we see 13-14k queries/day, in my view the caching params ‘minimum query duration’ should be on higher side so that queries which takes time to fetch are cached rather than quicker queries. Your views on this…

@hks21
It depends. I’m running a couple of installations with only 1-3k queries/day, but they’re heavy queries, so I have a cache for 10 minutes, since real-time stats are not that important.
So if real-time or something close to that is needed, then you should not use cache or set it very low.
EDIT: Note that cache is not used for downloads, so it won’t help if downloads are the root cause of your memory issue.