At ShareThis, we don't like creating extra work for ourselves. Managing terabytes and petabytes of data in real time is already hard enough. As such, we aggressively look for ways to work faster and more efficiently. This means evaluating different technologies and automating as much as possible. It also involves rapid prototyping and getting things up and running with as little code as possible.
Recently, I wanted to test out Azkaban to schedule job flows quickly and easily. However, I did not want to spend a lot of time bringing the system up and down while testing, and I wanted to keep track of all the changes so that I could build and destroy the system at will. This sounded like a great time to pull out my Docker hat and use docker-compose. It also let me brush up on bash and curl.
The first step was to write a Dockerfile that sets up a single-node application, giving me a first pass at using the system:
FROM java:7

# Unpack the Azkaban solo server into the image root
COPY azkaban-solo-server-2.5.0.tar.gz /azkaban-solo-server-2.5.0.tar.gz
RUN tar -xf /azkaban-solo-server-2.5.0.tar.gz

# zip is needed later to package flow directories for upload
RUN apt-get update && apt-get install -y zip

# Bake in the flows, a bootstrap script, and a jq binary for JSON parsing
ADD flows /azkaban-solo-2.5.0/flows
ADD run.sh /
ADD jq /usr/bin/jq

CMD /azkaban-solo-2.5.0/bin/azkaban-solo-start.sh
Then docker-compose could bring the whole thing up, so I wrote a dc-solo.yml file:
azkaban:
  build: solo/.
  ports:
    - "8081:8081"
Now I can:
$> docker-compose -f dc-solo.yml build
$> docker-compose -f dc-solo.yml up
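If the build worked, the UI should answer on the mapped port. A quick sanity check (this assumes the solo server is serving HTTPS on 8081, which is why -k shows up in all my curl calls; adjust if yours serves plain HTTP):

$> curl -k https://localhost:8081/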
Azkaban is running! I played with the UI and realized that if this was going to work, we would need flows saved in GitHub and auto-loaded. This led me to the second part of the process: figuring out how to get flows from GitHub into Azkaban. To do this, I started hacking at its APIs using curl (I would never advise this for a finished project, but for iterating quickly it works). I also got to know a nice JSON tool: jq.
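One assumption to make explicit before the snippets: they all expect a $PROD variable holding the Azkaban endpoint, something like:

PROD=https://localhost:8081   # adjust host/port to wherever your Azkaban web server lives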
First, let’s write a function that saves a session token to a variable called FCRED:
#
# getSession simply creates a session with default credentials,
# retrying until the server is up.
#
getSession () {
  CRED=$(curl -k -X POST --data "action=login&username=azkaban&password=azkaban" $PROD)
  while [ $? -ne 0 ]; do
    sleep 1
    CRED=$(curl -k -X POST --data "action=login&username=azkaban&password=azkaban" $PROD)
  done
  # Pull the session id out of the JSON response and strip the quotes.
  FCRED=$(echo $CRED | jq '."session.id"' | sed s/\"//g)
}
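Calling it is as simple as this (the session id below is made up):

getSession
echo $FCRED   # e.g. 35dd9ba3-..., ready to pass to the other API calls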
Then, let’s create a project in Azkaban:
#
# createProject creates a project in Azkaban.
#
# $1 The name of the project.
# $2 The description of the project.
#
createProject () {
  RESP=$(curl -k -X POST --data "session.id=$FCRED&name=${1}&description=$2" $PROD/manager?action=create)
}
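For example (the project name and description here are illustrative):

createProject wordcount "Counts words: the hello world of flows"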
You can upload a zip file with your flow to your project:
uploadZip () {
  RESP=$(curl -k -H "Content-Type: multipart/mixed" -X POST --form "session.id=$1" --form "ajax=upload" --form "file=@$2;type=application/zip" --form "project=$3" $PROD/manager)
  PROJECTID=$(echo $RESP | jq '.projectId' | sed s/\"//g)
}
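On success, Azkaban's response includes the numeric project id, which ends up in PROJECTID. Usage looks like this (names illustrative):

uploadZip $FCRED wordcount.zip wordcount
echo $PROJECTID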
You can even schedule it:
schedule () {
  RESP=$(curl -k "localhost:8081/schedule?ajax=scheduleFlow&session.id=$1&projectName=$2&flow=$3&projectId=$4&scheduleTime=$5&scheduleDate=$6&is_recurring=on&period=$7")
  echo "scheduling: $?"
  echo $RESP
  echo $RESP | jq '.'
}
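Here's an illustrative call; the scheduleTime/scheduleDate/period formats follow Azkaban's scheduleFlow parameters as I understand them ("hh,mm,a,TZ", "MM/dd/yyyy", and a period like "1d"), so double-check against your version's docs:

schedule $FCRED wordcount count $PROJECTID "12,00,pm,PDT" "07/22/2014" 1d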
I ended up just iterating through the directory tree, zipping the directories that had Azkaban job files and uploading them.
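For context, the tree I'm walking looks something like this (names illustrative), where each .job file is a plain properties file:

flows/
  wordcount/
    description.txt
    start.job
    count.job

# count.job -- a minimal Azkaban job definition
type=command
command=bash ./count.sh
dependencies=start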
#
# uploadFlow will zip up the contents of each project directory and upload the flow.
#
# $1 The name of the project, which corresponds to a directory.
#
uploadFlow () {
  proj=$1
  rm -f $proj.zip
  zip $proj.zip $proj/*
  uploadZip $FCRED $proj.zip $proj
}
#
# Main Script
#
getSession
for dir in $(ls -d */); do
  proj=${dir%%/}
  desc=$(cat ${dir}description.txt)
  createProject $proj "$desc"
  uploadFlow $proj
done
This is not the end. These Azkaban flows are proving themselves useful in an MVP fashion, and I've started expanding the docker-compose recipe so that we are backed by Amazon's RDS. Do you see how I'm saving myself work by not implementing the DB? I love the cloud!! Here's my multi-node docker-compose (my working dc-full.yml for staging in a local environment; to use RDS, I replace the mysql docker image with the real network endpoint):
mysql:
  image: mysql
  environment:
    - MYSQL_ROOT_PASSWORD=root
    - MYSQL_DATABASE=azkaban
    - MYSQL_USER=azkaban
    - MYSQL_PASSWORD=azkaban
  volumes:
    - /mnt/mysql/azkaban:/var/lib/mysql
executor:
  build: exec/.
  links:
    - mysql
  ports:
    - "12321"
web:
  build: web/.
  links:
    - mysql
    - executor
  ports:
    - "8081:8081"
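One more piece for the web and executor images: their azkaban.properties has to point at the linked mysql container. Here's a sketch of the relevant fragment (key names per Azkaban's MySQL configuration; values match the compose file above):

database.type=mysql
mysql.port=3306
mysql.host=mysql
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban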
That’s it for now. We’ll see if this proves out at ShareThis and then continue to iterate on it. One day, it might run all of our automated pipelines. If you like iterating quickly and hate processes that clog up dev time, then please join us!