AI for DevOps - A Practical View

amit-eyal-govrin

Generative AI describes machine learning algorithms that can create new content from minimal human input. The field has advanced rapidly in the past few years, with projects such as the text authorship tool ChatGPT and the realistic image creator DALL-E 2 attracting mainstream attention.

Generative AI isn't just for content creators, though. It's also poised to transform technical work in the software engineering and DevOps fields. GitHub Copilot, the controversial "AI pair programmer," is already prompting reconsideration of how code is written, but collaborative AI's potential remains relatively unexplored in the DevOps arena.

In this article, we'll look towards a future where generative AI empowers DevOps teams to eliminate tedious repetition, strengthen their automation, and condense complex workflows into simple conversational actions. But before all that, let's dive into the DevOps issues that generative AI can improve.

What's Wrong with DevOps?

DevOps is far from being a solved problem. While adoption of DevOps mentalities is growing rapidly year-over-year, the process itself remains dependent on a large number of tools, a limited talent pool, and repetitive tasks that are only partially automated.

DevOps engineers can spend too much time on menial work, such as approving deployments, checking the status of environments, and scaffolding basic config files. Although unavoidable, these jobs are chores that don't directly contribute to the final product. They're also great candidates for generative AI to handle. ChatGPT, Copilot, and OpenAI Codex (the model upon which Copilot is built) could all alleviate some of the stress:

  • They can populate common config files and templates so engineers don't have to.
  • They help team members gain new skills by suggesting contextually relevant snippets. This provides assistance when it's needed, lessening the learning curve during upskilling.
  • They reduce the time taken to scaffold new assets and make those assets more consistent, which improves maintainability.

However, existing systems are limited by their narrow focus on content generation. DevOps assistants are more powerful if they also offer intent- and action-based experiences to trigger workflow steps and apply state changes. Imagine the experience if you merged Copilot's code authorship with a bi-directional conversational interface:

  • You could ask the assistant to start processes on-demand, then be prompted to supply inputs at the time they're required.
  • Developers would have self-service access to potentially sensitive tasks, such as requesting a deployment to production. AI would safely perform the action on their behalf, minimizing the risk of errors and establishing a safety barrier between the developer and the infrastructure. The AI assistant could also request a review from relevant team members before it commits to the procedure, to ensure everyone’s kept informed of platform changes.
  • AI could alert you in real time as monitoring metrics change. You'd receive a message with a choice of immediate actions when a deployment fails, a security breach is detected, or performance deviates from the baseline.
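To make the last point concrete, here is a minimal sketch of one way an assistant might decide that a metric has deviated from its baseline. It uses a simple z-score check; the function name, the sample latencies, and the threshold of three standard deviations are all illustrative assumptions, not a description of how any particular product works.

```python
from statistics import mean, stdev
from typing import List


def deviates_from_baseline(baseline: List[float], latest: float,
                           threshold: float = 3.0) -> bool:
    """Return True when `latest` sits more than `threshold` standard
    deviations away from the mean of the baseline samples.

    `baseline` needs at least two samples, otherwise stdev() raises.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all counts as a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical response times in milliseconds.
recent_latencies = [100.0, 102.0, 98.0, 101.0, 99.0]
print(deviates_from_baseline(recent_latencies, 140.0))
print(deviates_from_baseline(recent_latencies, 101.0))
```

A real assistant would attach the suggested actions to the alert; the check above only covers the decision to raise it.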

Importantly, these capabilities aren't replacing humans, nor are they fundamentally changing their role. This form of AI augments engineering abilities by handling the mundane and consistently enforcing safety mechanisms. It frees up DevOps teams to complete more meaningful work in less time.

The Future of DevOps with Generative AI

There's huge potential for generative AI to redefine how DevOps works. Here are three specific areas where it is likely to have the biggest impact.

1. Automatic Failure Detection, with Suggested Remedies

Failures are a constant problem for developers and operators alike. They're unpredictable interruptions that force an immediate context switch to prioritize a fix. This hinders productivity, slows down release schedules, and causes frustration when remedial work doesn't go to plan.

AI agents can detect faults and investigate their cause. They can combine their analysis with generative capabilities and knowledge of past failures to suggest immediate actions, within the context where the alert's displayed.

Consider a simple Kubernetes example: the assistant notices that production is down, realizes the Pod has been evicted due to resource constraints, and provides action buttons to restart the Pod, scale the cluster, or terminate other disused resources. The team can resolve the incident with a single click, instead of spending several minutes manually troubleshooting.
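The Kubernetes scenario above can be sketched in a few lines. Kubernetes reports an evicted Pod with a status phase of "Failed" and a reason of "Evicted"; everything else here, including the `PodStatus` class, the `suggest_actions` function, and the action labels, is a hypothetical illustration rather than a real API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PodStatus:
    """The subset of a Pod's .status fields this sketch cares about."""
    name: str
    phase: str        # e.g. "Running", "Pending", "Failed"
    reason: str = ""  # "Evicted" when the kubelet has evicted the Pod


def suggest_actions(pod: PodStatus) -> list[str]:
    """Map a Pod's status to the action buttons an assistant could offer."""
    if pod.phase == "Failed" and pod.reason == "Evicted":
        # Eviction points at resource pressure, so offer capacity-related fixes.
        return ["restart pod", "scale cluster", "terminate unused resources"]
    if pod.phase == "Failed":
        return ["restart pod", "show logs"]
    return []  # Nothing to suggest for a Pod that has not failed.


evicted = PodStatus(name="web-1", phase="Failed", reason="Evicted")
print(suggest_actions(evicted))
```

In practice the status would come from the cluster's API and the suggestions from a model with knowledge of past incidents; the fixed rules above only stand in for that step.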

2. On-Demand Code/Config Generation and Deployment

Generative AI's ability to author code provides incredible value. Layering in conversational intents makes it more accessible and convenient. You can ask an AI agent to set up a new project, config file, or Terraform state definition by writing a brief message into a chat interface. The agent can prompt you to supply values for any template placeholders, then notify appropriate stakeholders that the content's ready for review.

After approval's been obtained, AI can inform the original developer, launch the project into a live environment, and provide a link to view the deployment and start iterating upon it. This condenses several distinct sequences into one self-service action for developers. Ops teams don't need to manually provision the project's resources ahead of time, allowing them to stay focused on their own tasks.
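As a rough illustration of the placeholder step described above, the sketch below uses Python's standard `string.Template` to find out which values a config template still needs before it can be rendered. The template contents and the helper names (`placeholders`, `missing_values`, `render`) are assumptions made for this example.

```python
from string import Template
from typing import Dict, List

# Hypothetical config template; the field names are illustrative only.
CONFIG_TEMPLATE = Template(
    "service: $service_name\n"
    "region: $region\n"
    "replicas: $replicas\n"
)


def placeholders(template: Template) -> List[str]:
    """List the placeholder names in the template, in order of appearance."""
    names: List[str] = []
    for match in template.pattern.finditer(template.template):
        name = match.group("named") or match.group("braced")
        if name and name not in names:
            names.append(name)
    return names


def missing_values(template: Template, values: Dict[str, object]) -> List[str]:
    """Names the agent would still have to ask the user for."""
    return [name for name in placeholders(template) if name not in values]


def render(template: Template, values: Dict[str, object]) -> str:
    """Fill in the template; raises KeyError if a value is missing."""
    return template.substitute(values)


supplied = {"service_name": "api"}
print(missing_values(CONFIG_TEMPLATE, supplied))  # what to prompt for next
supplied.update(region="eu-west-1", replicas=2)
print(render(CONFIG_TEMPLATE, supplied))
```

The rendered text is what would then be sent to reviewers; the approval and deployment steps are outside the scope of this sketch.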

3. Prompt-Driven On-Demand Workflow Management

The next generation of AI agents goes beyond simple text and photo creation to support fully automated prompt-driven workflows. Bi-directional AI lets you start processes using natural language to interact with your cloud or other resources. AI doesn’t need to be told which platform you’re using, or the specific steps it should run.

At Kubiya.ai, for example, we are already taking full advantage of this and now offer our customers the option to create any DevOps workflow with simple English prompts. You can type, for example, "trigger a Lambda function" and Kubiya's generative AI will build out the workflow in seconds, which you can then fine-tune in a drag-and-drop builder or in our YAML-based domain-specific language.

Using a simple English prompt, Kubiya builds a workflow in 10 seconds flat (feel free to time it).

These virtual agents’ language models are trained against the vocabularies of your cloud services. When you ask for a cluster to be restarted, the agent interprets your words using its domain knowledge. For example, it knows that your “production” cluster runs on AWS, so it must retrieve the cluster’s details and then make the correct API calls to restart it, such as ecs.UpdateService. Your words are directly translated into fully functioning workflows.
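To give a sense of what the final API call might look like, here is a minimal sketch in Python. In boto3, the AWS SDK for Python, the ECS UpdateService operation is exposed as `update_service`, and passing `forceNewDeployment=True` replaces the service's running tasks. The environment lookup table, the cluster and service names, and the `DryRunEcs` stand-in are assumptions for illustration; this is not Kubiya's implementation.

```python
from typing import Any, Dict, List

# Hypothetical domain knowledge: which ECS cluster and service
# sit behind each named environment.
ENVIRONMENTS: Dict[str, Dict[str, str]] = {
    "production": {"cluster": "prod-cluster", "service": "web"},
}


class DryRunEcs:
    """Stand-in for boto3.client("ecs") that records calls instead of making them."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def update_service(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"service": {"serviceName": kwargs["service"]}}


def restart_environment(ecs_client: Any, environment: str) -> Dict[str, Any]:
    """Restart the service behind `environment` by forcing a new deployment."""
    target = ENVIRONMENTS[environment]
    return ecs_client.update_service(
        cluster=target["cluster"],
        service=target["service"],
        forceNewDeployment=True,
    )


client = DryRunEcs()
restart_environment(client, "production")
print(client.calls)
```

Against a real account you would pass `boto3.client("ecs")` instead of `DryRunEcs()`. Injecting the client keeps the sketch runnable without AWS credentials.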

Furthermore, the bi-directional aspect means the AI agent becomes even more capable over time. Once you’ve started running your workflows, the agent trains against them too, allowing it to suggest similar processes for future scenarios as well as describe what each workflow actually does. 

This approach lets devs do more, without involving ops teams. The AI agent mediates between humans and infrastructure platforms, allowing anyone to initiate workflows consistently and without compromising security. As part of the workflow, the agent can prompt for input at relevant points, such as requesting you to select a cloud account, datacenter region, machine type, and pricing tier when you ask it to “add a new virtual machine.”
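The prompting behaviour in the virtual machine example can be sketched as a small loop that asks only for the inputs it doesn't already have. The parameter names and options below are hypothetical, and `ask` is a placeholder for whatever chat interface delivers the question to the user.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical inputs needed to "add a new virtual machine".
VM_PARAMETERS: Dict[str, List[str]] = {
    "cloud_account": ["dev", "prod"],
    "region": ["us-east-1", "eu-west-1"],
    "machine_type": ["small", "large"],
    "pricing_tier": ["on-demand", "spot"],
}


def collect_inputs(
    ask: Callable[[str, List[str]], str],
    known: Optional[Dict[str, str]] = None,
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Ask for each missing or invalid parameter until every one is valid."""
    answers = dict(known or {})
    for name, options in VM_PARAMETERS.items():
        attempts = 0
        while answers.get(name) not in options:
            if attempts == max_attempts:
                raise ValueError(f"no valid value supplied for {name}")
            answers[name] = ask(name, options)
            attempts += 1
    return answers


def pick_first(name: str, options: List[str]) -> str:
    """Stand-in for a human reply: always chooses the first option offered."""
    return options[0]


# The region was already mentioned in the request, so it is not asked for again.
print(collect_inputs(pick_first, known={"region": "eu-west-1"}))
```

Validating each answer against a fixed list of options is one simple way to keep the workflow consistent regardless of who starts it.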

The Takeaway: Generative AI Safely Accelerates Your Work

DevOps use cases for generative AI accelerate primary tasks, while increasing accessibility, security, and reliability. They empower developers to focus on moving forwards with new functionality, instead of repeatedly running familiar processes and having to wait for results.

Agents that are intelligent enough to sustain a conversation act like another member of your team. They support developers who could be unfamiliar with certain tools, while ensuring that the organization's security and compliance policies are fully adhered to. These safeguards protect the codebase and give developers the confidence that they can initiate any workflow. Reducing the number of interactions with the DevOps team enhances efficiency, tightening the feedback loop.

Generative AI isn't a static experience either. It gets better over time as it analyzes interactions to more accurately establish user intent. If recommendations aren't suitable the first time you type a query, you can expect them to be improved as you and others repeat the request and take different courses of action.

AI agents support missing human knowledge too. They let developers start processes even when they're unfamiliar with some of the steps, tools, or terms involved. AI can fill the gaps in questions such as "Which instances have failed?" to work out that you're referring to the Kubernetes Pods in your production cluster. These capabilities let AI effectively supplement human abilities, rendering it a source of supportive hints for the team.

ROI is Critical With Generative AI 

Organizations that make regular use of AI are likely to have the best results because their agents will become more adept at anticipating their requirements. However, it's also important not to overreach as you add AI to your workflows. The most successful adoptions will be focused on solving a genuine business need. Assess your processes to identify where bottlenecks exist, such as between dev and ops teams, then target those repetitive use cases with AI.

The solution you select should help you reach your KPIs, such as closing more issues or resolving incidents faster. Otherwise, the AI agent will be underused and could even hinder your natural operating procedures.

Summary

Generative AI is one of today’s fastest-maturing technologies. ChatGPT has attained a degree of virality as more researchers, consumers, and organizations begin exploring its capabilities. DALL-E 2 has delivered similarly spectacular results, while GitHub Copilot was used by over 1.2 million developers during its first 12 months.

All three technologies demonstrate clear revolutionary potential, but it's the mixed and highly complex workflows of DevOps that could benefit the most in the long term. DevOps combines the creation of new assets, such as code and configs, with sequential processes like deployment approvals and review requests. 

Contrary to some outsider projections, generative AI for DevOps will go beyond mere templating of common file snippets to offer full workflow automation. Using simple conversational phrases, you'll be able to instruct your agent to take specific actions on your behalf, from provisioning new cloud resources to checking performance in production. The agent will provide a real-time bi-directional feedback loop that improves collaboration, boosts productivity, and reduces the everyday pressures faced by devs. 

This article was originally published on DZone.
