By Andrew Backes, Software Engineer (Data pipeline)
We thought it would be really cool if we could run an exact copy of our entire data pipeline locally. This would open up a whole slew of possibilities. The one we were most excited about: testing system integration, but on our laptops. During the last few weeks of June, this is what we (the data team at ShareThis) worked on.
So what does this mean, exactly? Right now our data pipeline runs on a cluster of about 30 nodes. It interacts with a Cassandra cluster, an Aerospike cluster, two Kafka clusters, and a Graphite/Seyren node. The goal was to get all of this to somehow run on a single laptop. We also wanted the pipeline to actually ingest data and have the correct thing come out the other end.
The first step was to containerize our applications. The tricky part was that our applications were not written with this capability in mind. Luckily, though, we had used dependency injection and the adapter pattern during development, which turned out to help quite a bit in getting things to wire together correctly. Docker Compose also helped in that regard. About a week after we got all of this working, the advanced networking features of Docker 1.7 were announced. Those new features would have made our job a lot easier.
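To give a sense of what the wiring looks like, here is a minimal docker-compose.yml sketch in the pre-1.7 style, using container links rather than the newer networking features. The service names, images, and ports are illustrative placeholders, not our actual configuration:

    # docker-compose.yml (v1 format, before Docker 1.7 networking)
    # Names and images below are illustrative placeholders.
    ingest:
      build: ./ingest            # one of the internal application containers
      links:                     # make backing services reachable by hostname
        - cassandra
        - kafka8
        - graphite
    cassandra:
      image: cassandra:2.1       # single node standing in for the production cluster
      ports:
        - "9042:9042"            # CQL native protocol
    kafka8:
      image: spotify/kafka       # bundles ZooKeeper and Kafka 0.8 in one container
      ports:
        - "9092:9092"
    graphite:
      image: graphiteapp/graphite-statsd
      ports:
        - "2003:2003"            # carbon plaintext listener
        - "8080:80"              # graphite-web UI

Links put each service's name into the app container's /etc/hosts, so an adapter only needs a hostname and port from configuration; that is where the dependency injection paid off.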
In the end we used eight Docker containers (each representing an Amazon VPC group) for our internal applications, plus a single container each for Cassandra, Aerospike, Kafka 7, Kafka 8, and Graphite. Since we were running this on a Mac, we also had to pump up the amount of memory and disk space available to the boot2docker VM. It was a joyous moment when we put some data in front of the pipeline, tracked it in Graphite, and saw the fully processed version pop out the other end.
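For anyone trying the same thing, the boot2docker defaults are far too small for five datastores plus eight application containers. A rough sketch of bumping the VM limits (the sizes here are only examples; the disk size can only be set when the VM is created, hence the delete/init):

    # recreate the boot2docker VM with more memory and a bigger disk image
    boot2docker delete
    boot2docker init --memory=8192 --disksize=60000   # values in MB; use what your laptop can spare
    boot2docker up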