1,000 Jenkins Jobs: A DevOps Journey to Nirvana

shaked-askayo

As the head of DevOps, my frustration was approaching Will Smith slapping Chris Rock levels. Here’s what happened.

My life heading a DevOps team should, in theory, have been a dream. We were all in high demand and paid incredibly well for simply being better at Googling stuff than most people. And the best part was that no one really knew we existed... at least until things didn’t work.

And therein lay the rub. Nothing ever works all the time. So our Slack on-call channel was like a booby-trapped warzone, literally exploding with repetitive messages such as:

  • “Metrics aren’t reporting to…”
  • “We need a new server”
  • “My deployment is stuck”
  • “Where are my logs”
  • “I need admin access to…”
  • “How much did we spend on EC2”
  • “VPN is not working…”
  • “Why did my pipeline fail”
  • “Can I get a firewall rule for…”
  • “What does this error mean”
  • “Help me onboard”
  • “Need to update this manifest file in the repo”
  • “My dog is stuck in a tree” (ok, we didn’t get that one, but I wanted to make sure you are still reading this)

Sound familiar? If you are a DevOps engineer (or SRE, Platform Engineer and the like), then I’m guessing that, unfortunately, it does. But fear not, this tale eventually does get better.

Days of toil and zero innovation

As a DevOps practitioner, my primary job is to drive organizational innovation and efficiency. So of course, I had big plans for improving infrastructure, replacing legacy systems with more modern ones, designing sophisticated alerting to know what was happening in our systems at any given time and more. 

Instead, my team and I found ourselves drowning in a sea of “help me”. Everyone - from R&D to Finance and HR - wanted a piece of us to provision cloud resources, trigger and track complex workflows, generate cost reports, onboard new employees and more. And of course, we always needed to investigate and grant the appropriate permission levels for each request. 

This left us with zero bandwidth as our days, nights and weekends were endlessly overloaded with context switching, repetitive requests and “super-urgent” tasks.

But what happened to “you build it, you run it”?

“You build it, you run it.” This DevOps approach, attributed to Amazon CTO Werner Vogels, describes how Amazon improved the quality of its services and the speed with which they were released by erasing the separation between developers and operations. It should also have eased the workload on ops teams, allowing them to focus on innovation.

However, the DevOps reality for many companies is far different, as evidenced by countless articles and Reddit threads. I’ve personally spoken to dozens of DevOps engineers in different organizations and the story is the same. While developers are keen on coding and creating the next best thing, they are much less interested in learning and managing all the underlying infrastructure their applications run on.

Did you try automating it all?!

If that’s what you are wondering, then yes: like any self-respecting DevOps engineers or SREs, we created automations for practically everything we could. But we found that even end-users like the R&D team, who often had domain expertise in the tools we were using (and we used many, such as Airflow, Argo, GitHub Actions, Helm Charts, Jenkins, Pulumi, Terraform, and more), were often unable to easily navigate their way around our automations.

Although these are fairly common tools, the unique way they were set up in our organization made it nearly impossible for end-users to figure out whether the automated workflows we had created were actually the right ones for what they needed.

Additionally, developers were usually not granted the requisite permissions to directly access many cloud resources, so even if they knew exactly what to do, they still could not do it without getting permission from our team.

Bottom line: we were right back where we started, facing an endless stream of requests to help trigger the right automated workflows.

Slackbotting

All this grief led me to create a slackbot with around one thousand hard-coded workflows that end-users could choose from. I first made sure to give each flow a very clear and descriptive name so end-users would be able to figure out what it was meant for.

I then connected the slackbot to all our tools and their associated workflows, so end-users could use a simple command to list every workflow available.

To illustrate how the slackbot was used, say one particular ECS service had a memory leak. The responsible developer could now trigger a workflow to query AWS and find out the full name of the service (developers usually would not know that sort of thing). They could then pass that full name, which was required, to a different workflow that would restart the specific service. (Of course, we would “rarely” restart services to patch memory leaks, and only did this when we didn’t have time for a proper root cause analysis.)
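The two-step flow in that example might look roughly like the sketch below. This is not the original slackbot code, just a minimal illustration. The function names and the idea of passing in the client are assumptions made for this example. The `ecs` argument is expected to behave like a boto3 ECS client, whose `list_services` and `update_service` calls are the only AWS APIs used.

```python
from typing import Any, List


def find_service_names(ecs: Any, cluster: str, fragment: str) -> List[str]:
    """Workflow 1 (hypothetical name): look up full ECS service names.

    Returns every service in `cluster` whose name contains `fragment`,
    so the developer does not need to know the exact name up front.
    """
    matches: List[str] = []
    kwargs = {"cluster": cluster}
    while True:
        page = ecs.list_services(**kwargs)
        for arn in page.get("serviceArns", []):
            # The service name is the last path segment of the ARN.
            name = arn.rsplit("/", 1)[-1]
            if fragment.lower() in name.lower():
                matches.append(name)
        token = page.get("nextToken")
        if not token:
            return matches
        kwargs["nextToken"] = token  # keep paging until AWS has no more results


def restart_service(ecs: Any, cluster: str, service: str) -> None:
    """Workflow 2 (hypothetical name): restart one service by its full name.

    Forcing a new deployment replaces the running tasks without changing
    the task definition, which is the "quick patch" described above.
    """
    ecs.update_service(cluster=cluster, service=service, forceNewDeployment=True)
```

With boto3 installed and credentials configured, the bot would create the client once with `boto3.client("ecs")` and pass it to both functions. The developer only ever sees the two workflow names.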

All in all, the slackbot both addressed the lack of domain expertise and acted as a proxy with guardrails for the developers, allowing them to do what they needed without over-permissioning. Most importantly, it helped reduce my team’s toil by 70% while eliminating the long delays end-users typically experienced.
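The “proxy with guardrails” idea can be sketched as follows, assuming a simple per-workflow allow-list of teams. All names here (`Workflow`, `NotPermitted`, `trigger`, the team labels) are hypothetical and chosen for illustration. The real bot may have checked permissions quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class Workflow:
    run: Callable[[str], str]      # executes with the bot's own credentials
    allowed_teams: FrozenSet[str]  # teams permitted to trigger this workflow


class NotPermitted(Exception):
    """Raised when a user's team is not on the workflow's allow-list."""


def trigger(workflows: Dict[str, Workflow], name: str, user_team: str, arg: str) -> str:
    """Run `name` on behalf of a user, but only if their team is allowed.

    The user never receives cloud permissions directly. The bot holds them
    and decides, per workflow, who may ask it to act.
    """
    workflow = workflows[name]
    if user_team not in workflow.allowed_teams:
        raise NotPermitted(f"{user_team} may not run {name}")
    return workflow.run(arg)
```

The design choice is that permissions attach to a narrow, named action rather than to a cloud role, so granting a team the ability to restart one kind of service does not hand over anything else.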

However, there were many drawbacks to that slackbot.

Chatbots are kinda… robotic (and hard to maintain)

The slackbot I created was not exactly user-friendly. It forced users to learn and choose from a static list of workflows and use pre-determined words or slash commands. It certainly could not handle anything not already included in its rule-based, canned interactions. These “out of scope” requests would leave end-users empty-handed, until of course they came knocking on our DevOps door.
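A toy version of that rigidity, assuming a bot that matches incoming messages against a fixed table of command names. The command names and replies are made up for this sketch.

```python
from typing import Callable, Dict

# Hypothetical static registry: every supported request must be listed here.
WORKFLOWS: Dict[str, Callable[[], str]] = {
    "list-ecs-services": lambda: "listing ECS services",
    "show-ec2-spend": lambda: "fetching EC2 cost report",
}


def handle(message: str) -> str:
    """Dispatch a chat message by exact command name, and nothing else."""
    command = message.strip().lower()
    if command == "list-workflows":
        return "\n".join(sorted(WORKFLOWS))
    handler = WORKFLOWS.get(command)
    if handler is None:
        # Anything phrased in plain English falls through to here.
        return "Unknown command. Type list-workflows to see what I can do."
    return handler()
```

Here `handle("show-ec2-spend")` works, but `handle("How much did we spend on EC2")` gets the canned fallback even though a matching workflow exists, which is exactly the kind of request that ended up back in our on-call channel.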

But what was far worse was the maintenance. I tried to enforce a standard programming language for each workflow, but with so many to create and many DevOps cooks in the kitchen, this proved to be impossible. If one workflow broke, figuring out all the dependencies and how to fix it took way too much sweat, blood and tears. If I wanted to add a brand-new workflow, it also required a very significant effort.

Conversational AI and DevOps Nirvana

My personal experience (along with first-hand reports from many DevOps engineers I’ve spoken to) drove me to explore the use of conversational AI for solving DevOps toil and, ultimately, to launch Kubiya.

At Kubiya, we have brought to life an AI-driven virtual assistant that lives in Slack, MS Teams and the CLI. Nicknamed Kubi, this virtual assistant provides end-users with secure access to pretty much anything (DevOps-related) that they ask for (and in plain English, no less). This allows DevOps teams to innovate all day and party all night, without having to spend insufferable amounts of time on slackbot maintenance or, worse, manually handling all the repetitive requests they used to get.

Get free access today and discover how Kubiya can literally change your DevOps life.

This article was originally published in DZone.