We just started playing with and using Metabase. We design our data warehouse using Spark, Cassandra, and Kafka.
Metabase is an excellent tool for building our BI dashboards, but to do so we need to pull the data into Metabase, and we believe the best way to do this is to move it directly from Kafka events.
But we are not sure if we should build a Kafka connector inside Metabase (using Clojure) or call the Metabase API layer from Kafka directly (as a plugin in Kafka).
In general we'd recommend connecting Metabase to something that can hit data at rest. In your case, I assume you're pushing an event stream through Kafka into Cassandra and then running Spark jobs against the data in Cassandra. Is that right?
If that's the case, I'd say the simplest solution would be to point Metabase at Cassandra directly. If you're willing and able to help us there, we'd love the help.
Regarding hitting Kafka directly, it would require a pretty serious design review, as currently we don't have any notion of streams in Metabase. Open to exploring it with you! Would you mind elaborating on your reasoning, either here or in a GitHub issue? Sounds very exciting, and a different angle than the one we were on =)
I have thought about a direct connection to Cassandra, but with 200+ TB of data and 120B documents, the idea did not feel right. Also, relying on CQL would give up a lot of the advantages!
Based on that image, I'd say that the best thing to point Metabase at would be the "initial aggregated data tables".
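To make the idea concrete, here's a minimal, self-contained sketch of the kind of hourly rollup an "initial aggregated data table" might hold. In practice this aggregation would be a Spark job writing to Cassandra; the event fields (`ts`, `type`, `value`) and the function name are hypothetical stand-ins for whatever the real Kafka events carry:

```python
from collections import defaultdict
from datetime import datetime, timezone

def rollup_hourly(events):
    """Aggregate raw events into per-hour, per-type count/total rows.

    Each (hour, type) row is what Metabase would query, instead of
    scanning 120B raw documents.
    """
    buckets = defaultdict(lambda: {"count": 0, "total": 0.0})
    for e in events:
        # Truncate the event timestamp (epoch seconds) to the hour.
        hour = datetime.fromtimestamp(e["ts"], tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        key = (hour.isoformat(), e["type"])
        buckets[key]["count"] += 1
        buckets[key]["total"] += e["value"]
    return dict(buckets)

rows = rollup_hourly([
    {"ts": 0,    "type": "click", "value": 1.0},
    {"ts": 60,   "type": "click", "value": 2.0},
    {"ts": 3600, "type": "view",  "value": 5.0},
])
# rows now has one row per (hour, type), e.g. two "click" events
# collapsed into a single hourly row with count=2, total=3.0.
```

A table shaped like this (hour, type, count, total) stays small regardless of raw event volume, which is what makes it a good target for a BI tool.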
In any event, we're actively interested in exploring what a truly real-time/streaming driver might look like and would love to work with you to figure out how best to do that. We also think a CQL/Cassandra driver might be useful to you and others, and if you're able and willing to write one, we'd love to include it with Metabase =)
More than happy to take the conversation offline if I can be useful in working through which of the intermediate tables might be best to use with Metabase, in case you're not comfortable sharing details in a public forum.