Open Source ETL projects

There are a number of Open Source ETL projects out there.

If you have experience with any of the below, please let us know how you’ve found using it:

Luigi - https://github.com/spotify/luigi
DuckETL - http://dataducketl.com/
Pentaho - http://community.pentaho.com/projects/data-integration/
Jaspersoft - http://community.jaspersoft.com/project/jaspersoft-etl
Talend - https://www.talend.com/resource/free-etl.html

This great article, which I believe has been brought up in other parts of Discuss, talks about Luigi.

I recently started using Luigi to build our internal ETL/BI from the ground up.

It required a lot more code than I expected to actually get things up and running. Half of that is just due to the messy nature of the data (different time formats, Unix vs. Windows CSVs, etc.), but the other half was because things change often in the early stages of prototyping. Because of that, clearing out previous runs of Redshift loads or changing schemas has not been easy. I started writing some custom code to auto-migrate and whatnot, but ditched it, as it seems unnecessary once I get the schema locked down.
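To give a flavor of the cleanup code involved, here is a minimal sketch using only the Python standard library. The format list and the function names (`parse_timestamp`, `read_rows`) are made up for illustration and are not part of Luigi; a real pipeline would list whatever formats actually show up in its own sources.

```python
import csv
import io
from datetime import datetime

# Hypothetical set of timestamp formats seen across source files.
TIME_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%m/%d/%Y %H:%M",
)


def parse_timestamp(raw):
    """Try each known format in turn; raise if none of them match."""
    value = raw.strip()
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {raw!r}")


def read_rows(text):
    """Parse CSV text that uses either Unix (\\n) or Windows (\\r\\n) line endings."""
    # newline="" leaves line endings untranslated so the csv module handles them.
    return list(csv.reader(io.StringIO(text, newline="")))
```

Failing loudly on an unknown format is a deliberate choice here: a bad timestamp is easier to fix at parse time than after it has been loaded into the warehouse.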

All in all, had a good experience with Luigi. It’s really easy to add new data sources, but chaining everything together was a little clunky. I briefly tried Airbnb’s Airflow, but it wasn’t as easy to add a custom data source.

Hey Sameer - I used Pentaho a few years ago to build out an entire reporting system (including a data warehouse) for my company at the time, which needed client-specific dashboards in addition to internal ones. I started from scratch, and the ramp-up wasn’t too bad.

I’ve seen a bunch of them used by my clients. I suspect that for most clients here, non-open-source solutions would actually be more cost-effective. The economics have really shifted in the past few years, with many new entrants (Fivetran, Stitch Data, Alooma, etc.).

But back to open source. There are two issues here that I see:

  1. maintaining the logic as schemas change
  2. maintaining the process (schedule, reporting, error logging, etc)

(1) is obviously unavoidable unless someone manages it for you. (2), however, increasingly has a solution. For example, I work with Keboola (full disclosure: I partner with Fivetran, Keboola, and a few other ETL vendors) to essentially freeze modules of Python code in Keboola’s platform. This lets the client avoid type (2) problems while focusing all effort on changes to (1).
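To make the type (2) work concrete, here is a rough sketch of the kind of process plumbing (error logging, retries) that you end up maintaining yourself when rolling your own. It uses only the Python standard library; `run_step` and its parameters are hypothetical names for illustration, and this is not how Keboola or any other vendor implements it.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")


def run_step(name, func, retries=2, delay_seconds=0.0):
    """Run one ETL step, logging each attempt and retrying on failure.

    The step is attempted up to retries + 1 times; the last failure is re-raised.
    """
    for attempt in range(1, retries + 2):
        log.info("step %s: attempt %d", name, attempt)
        try:
            result = func()
        except Exception:
            # Logs the traceback so failures show up in the run report.
            log.exception("step %s: failed", name)
            if attempt > retries:
                raise
            time.sleep(delay_seconds)
        else:
            log.info("step %s: done", name)
            return result
```

Scheduling would still sit on top of this (cron, or whatever the platform provides), which is part of why handing off this layer can be attractive.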

Founder,
InnerJoin Analytics community