Breaking the Tension Between Operations and Developers
Historically, there has been a natural tension between operations (ops) and developers. The mission of ops is to keep systems stable and running. For most systems, if nothing changes, the system will tend to behave as it did in the past. Developers, however, are typically incentivized to grow usage of the system, which means, whether good or bad, more features. New features result in rapid changes to the system, which introduce some level of instability and risk. These opposing incentives create an immense amount of tension between the two teams. Often the result is that operations removes developers’ access to the machines, and with it the freedom to do anything operations related. This puts an end to new ideas like microservices within the organization, because new services always require new infrastructure, and new infrastructure requires deep collaboration with another team.
This tension isn’t good. It creates a division within your engineering teams, and progress stalls because the two teams refuse to help each other. This hurts not only productivity but ultimately the happiness of your employees, and they’ll start to leave. If I’ve learned anything during my tenure as an engineering leader, it’s this: if you want to keep your employees happy, you have to make sure they’re productive. Happy also means getting along with other engineering teams and unifying the culture rather than creating a divisive one.
“With Great Power Comes Great Responsibility”
As a team, we are all responsible for the software that we put into production, not just ops or developers. This means we have to put an incentive structure in place that gives both ops and developers a common goal. DevOps means different things to different people, but to us at ShareThis it means there is a high overlap between the functional areas of operations, development and QA. So in order for developers to gain the power to make changes in production, they have to be responsible for their actions. This is why we believe it’s critical that developers have an on-call rotation should anything go awry. Developers should experience what it’s like to be alerted when their code causes unexpected problems. This spreads the burden of solving problems at off hours across the team, rather than placing it solely on the shoulders of operations or a single team.
SLA and SLA Allowance
One of the most important activities is to define an SLA for services. This defines the expectations of stability in a system for customers, developers and ops. But the goal should never be 100% uptime, because that could mean you’re moving too slowly and not taking enough risks. A startup is all about taking calculated risks and moving fast. To be more quantitative about risk, we use SLA allowances. We calculate our risk tolerance by subtracting the SLA from 1. So if you have a 99.9% uptime SLA, the SLA allowance is 1 – 0.999 = 0.001. This means the site can have about 43 minutes of downtime in a 30-day month before you’ve used up the agreed-upon allowance. With this in place, developers can deploy as frequently as desired as long as they don’t hit their SLA allowance. Once they do, no new features can be deployed and the team should focus on stability instead. This puts an emphasis on automated testing and monitoring to verify that applications won’t cause issues in production. It also strikes a healthy balance between speed and risk tolerance, which can be agreed upon beforehand. Throughout the office we’re building dashboards that quickly show any anomalies, so we can revert changes before they cause SLA violations.
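The SLA allowance arithmetic above can be sketched in a few lines of Python. This is only an illustration, not our production tooling: the function names, the 30-day measurement window, and the idea of tracking downtime in minutes are all assumptions made for the example.

```python
MINUTES_PER_DAY = 24 * 60


def sla_allowance(sla: float) -> float:
    """Fraction of time the service may be down, e.g. 0.999 -> about 0.001."""
    if not 0.0 <= sla <= 1.0:
        raise ValueError("sla must be a fraction between 0 and 1")
    return 1.0 - sla


def downtime_budget_minutes(sla: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the window (30 days is an assumed default)."""
    return sla_allowance(sla) * period_days * MINUTES_PER_DAY


def can_deploy_features(sla: float, downtime_used_minutes: float,
                        period_days: int = 30) -> bool:
    """Feature deploys are allowed only while the allowance has not been hit."""
    return downtime_used_minutes < downtime_budget_minutes(sla, period_days)


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.999)
    print(f"Downtime budget: {budget:.1f} minutes per 30 days")
    print("Deploy with 10 minutes used:", can_deploy_features(0.999, 10))
    print("Deploy with 50 minutes used:", can_deploy_features(0.999, 50))
```

With a 99.9% SLA the budget comes out to roughly 43.2 minutes per 30 days, so a team that has used 10 minutes can keep shipping features, while one that has used 50 minutes should switch its focus to stability.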
Since we’ve been working in a “DevOps” culture, we’ve seen a dramatic increase in collaboration between teams and also an increase in productivity. Total ownership of applications, from development to production, has spread throughout the entire team, and we’re now able to stand up servers and infrastructure within seconds rather than weeks. For our customers this means new features and service stability. For our internal stakeholders this means predictable timelines. Most importantly, for our team, these changes provide a sense of ownership and productivity.
If this type of work environment is something you’d be interested in, please visit our opportunities page: