Event Collection Overview

(courtesy of @agilliland)


So you want to do event analytics, awesome! This document covers some basic things you should know. It has three sections:

  • General strategies for implementing event analytics. Describes three high-level options to consider.

  • The anatomy of an event analytics pipeline and how it works. This will help you make better decisions throughout your journey.

  • Practical advice. How to stay out of trouble and benefit from the mistakes of the many people who have tried this stuff before you.

Event Analytics Strategies

Which option a team goes for will depend a lot on its needs, but most folks tend to work their way through these options progressively as their product and business scale.

  • Use an end-to-end SaaS analytics provider. You’ll do this when you can’t afford the cost of managing any of your event analytics pipeline yourself or you simply don’t want to.

    • Benefits

      • completely hands free and no infra required

      • fastest option to get something in place

      • solutions are often well optimized for event analytics workloads

    • Drawbacks

      • proprietary solution w/ no control over capabilities

      • data is isolated and can’t be joined with any other data from your application

      • often gets quite expensive at higher volumes

      • in most cases you won’t own your data or be able to get it back completely if you decide you want to do more with it

    • Example Solutions

      • Google Analytics

      • Mixpanel

      • Keen IO

  • Custom pipeline made from managed services. This is the middle path. Get back some critical elements of control in exchange for a bit more work.

    • Benefits

      • you own the data all the way through the pipeline, so you can use it any way you want.

      • joining your various data sources together will be possible and will enable analyses not possible otherwise.

      • data privacy and security are more firmly in your control

    • Drawbacks

      • you’re taking on the responsibility for collecting and storing your data, so you need to be ready to own the solution.

      • typically you’ll need to create and maintain some code to handle extracting and transforming your collected data for use.

    • Example Solutions

      • AWS Mobile Events + S3 + Redshift + BI

      • AWS Kinesis + S3 + DWH + BI

      • Kafka + S3 + DWH + BI

      • Kafka + Storm + BI (real-time)

      • Segment.io + S3 or Redshift + BI

  • Fully custom pipeline. This is an anything-goes pipeline that you build yourself from the ground up, and it will often contain highly customized components.

    • Benefits

      • full control over every step in the pipeline ensures you have the ability to solve anything you need

      • you can optimize for your own specific needs where appropriate rather than relying on more generalized solutions

      • all of your data can be moved around and consolidated as needed, so any analysis is possible

    • Drawbacks

      • this is the most costly option and will certainly require dedicated headcount to execute on.

      • it takes time to build out a full data pipeline. this often takes place over several months or years.

    • Example Solutions

      • you’re on your own here. that’s what you wanted, right?

Anatomy of an Analytics Pipeline

Data pipelines come in many shapes and sizes but for most cases the steps below represent a solid framework for discussing the pieces in the pipeline.

  • Generation - this is where and how the data is created. could be coming from a mobile app, a user in their browser, a backend server, whatever. in most cases you are either writing this code yourself or you are using a 3rd party SDK like GA, Segment.io, Mixpanel, etc.

  • Collection - once data is generated it typically needs to be sent somewhere and stored for downstream usage. in most cases this is an HTTP-based API of some sort. it’s important to keep in mind that data is often collected once but intended to be used multiple times, so designing for a pub/sub model makes sense here to allow multiple subscribers to the same data.

  • Processing - after collecting data it typically goes through one or more transformations to prepare it for consumption. this step is often highly custom and depends greatly on how data generation and collection take place as well as what’s available for data warehousing. this is your ETL and it’s effectively the glue in your pipeline.

  • Warehousing - refers to how data is stored and made available for analysis. this can be as simple as a CSV file or as complex as a 2k+ node Hadoop cluster. the principal factors in picking your warehouse are data volume and analytical needs.

  • Analysis - the set of tools and processes which pull data from a warehouse and format it for active consumption, usually by humans. this is where your BI tools and custom dashboards tend to focus. if you have an ML discipline then you would have modeling software here as well.
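To make the pub/sub point in the Collection step concrete, here is a minimal sketch in Python. It is only an illustration: `EventBus`, `make_event`, the `"events"` topic, and the event fields are all hypothetical names, and the in-memory bus is a toy stand-in for whatever collector you actually run (an HTTP endpoint in front of a stream, for example). The thing to notice is that one published event reaches two independent subscribers, a raw archive and a running count.

```python
import json
import time
from collections import defaultdict


class EventBus:
    """Toy in-memory pub/sub collector (hypothetical, for illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Every subscriber gets its own shallow copy of the same event.
        for handler in self._subscribers[topic]:
            handler(dict(event))


def make_event(name, user_id, **properties):
    """Generation step: build one event record (field names are assumptions)."""
    return {"event": name, "user_id": user_id, "ts": time.time(),
            "properties": properties}


bus = EventBus()

# Subscriber 1: keep the raw events as JSON lines for later processing.
raw_archive = []
bus.subscribe("events", lambda e: raw_archive.append(json.dumps(e)))

# Subscriber 2: maintain a simple count per event name.
counts = defaultdict(int)


def count_event(e):
    counts[e["event"]] += 1


bus.subscribe("events", count_event)

# Collection step: the same data is collected once and used twice.
bus.publish("events", make_event("signup", "u1", plan="free"))
bus.publish("events", make_event("page_view", "u1", path="/home"))
bus.publish("events", make_event("page_view", "u2", path="/pricing"))

print(dict(counts))      # {'signup': 1, 'page_view': 2}
print(len(raw_archive))  # 3
```

Adding a third consumer later (say, a loader for your warehouse) is just another `subscribe` call, which is the property you want from the real collection layer too.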

Practical Advice

General Thoughts

  • ride each solution as far as you can before moving up to a more sophisticated approach.

  • don’t underestimate the work involved in running your own pipeline. a data pipeline is not a system you set up once and then leave to run without any attention. it will require care and feeding on a regular basis.

  • general degree of difficulty/time cost for parts of your pipeline:

    • Generation = EASY

    • Collection = MODERATE -> EASY

    • Processing = HARD -> MODERATE

    • Warehousing = HARD -> MODERATE

    • Analysis = MODERATE

When going with an end-to-end SaaS service

  • typically you’ll have to do a little coding to implement a 3rd party SDK and integrate the data generation code into your application. depending on what you want to track this can be very easy, like GA on a website, or it can be a bit more involved.

  • customers/clients/users will be sending data into a 3rd party so there are at least some legal and privacy implications to consider. in most cases this isn’t a big issue, but it’s worth thinking about.

  • you’ll access your data through the proprietary set of tools created by your provider and that can be a mixed bag. in most cases you’ll end up running into walls regarding what you can do and you’ll just have to live with them because you have no way to change anything.

When you are planning to start up your own pipeline

  • to execute on this you’ll have to run some infrastructure, even if it’s managed services such as on AWS, so make sure you’ve got some basic techops capabilities.

  • it’s tempting to think you’ll need a data engineer to execute on this analytics strategy, but you don’t if you keep things simple. there is no reason to jump the gun on hiring data engineers.

  • focus on creating a stable and robust data collection pattern to start with because literally everything is downstream from there. once you have something solid in place for data collection you can build lots on top of it. top choices here are AWS Kinesis and Kafka.

  • your DWH choice will mostly be dependent on data volume. for small volumes stick with the simplest tools like open source SQL databases. when you outgrow that you’ll be looking at an analytics DWH, of which there are lots of choices: AWS Redshift, Vertica, ParAccel, and Teradata are among the big names. There are also newer tools like Spark, Presto, Impala, and Druid.

  • everyone wants their data pipeline to be real-time (meaning data is ready for analysis immediately) but it’s certainly more work to execute on, so decide early on if it’s worth it for you. doing this often requires specialized tools and a different pipeline configuration, so take that into account.
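For the small-volume case mentioned above, a plain SQL database really can serve as the warehouse. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a convenient stand-in for whichever open source SQL database you pick. The `events` table, its columns, and the sample rows are hypothetical and made up for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Warehousing: one narrow table of events (schema is an assumption).
conn.execute(
    """
    CREATE TABLE events (
        event   TEXT NOT NULL,
        user_id TEXT NOT NULL,
        ts      TEXT NOT NULL  -- ISO 8601 timestamp stored as text
    )
    """
)

sample_rows = [
    ("signup", "u1", "2016-01-01T10:00:00"),
    ("page_view", "u1", "2016-01-01T10:01:00"),
    ("page_view", "u2", "2016-01-02T09:30:00"),
    ("page_view", "u2", "2016-01-02T09:45:00"),
    ("page_view", "u1", "2016-01-02T11:00:00"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", sample_rows)

# Analysis: daily active users, taking the date from the timestamp prefix.
daily_active = conn.execute(
    """
    SELECT substr(ts, 1, 10) AS day, COUNT(DISTINCT user_id) AS active_users
    FROM events
    GROUP BY day
    ORDER BY day
    """
).fetchall()

print(daily_active)  # [('2016-01-01', 1), ('2016-01-02', 2)]
```

The query itself is ordinary SQL, so when volume forces a move to an analytics DWH, most of what changes is the loading and the engine underneath rather than the way you ask questions.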

Sources of Inspiration


http://mattmazur.com/2015/12/12/analytics-event-naming-conventions/ has some good guidelines on naming events.

GA, unlike Mixpanel, actually does provide raw access to data in their BigQuery database (as long as you buy the 360 product).