
Datadog aims for ‘best in show’ on cloud efficiency

Cloud monitoring and security platform company Datadog has detailed updates designed to make the operational control end of cloud computing more efficient and cost-effective. Datadog Kubernetes Autoscaling is a management tool that automatically scales Kubernetes environments based on real-time and historical utilisation metrics. In an enterprise technology landscape where ‘traditional’ Application Performance Management (APM) has been superseded by observability as a carte blanche term for cloud control, can Datadog differentiate itself and win best in show?

First and foremost, Datadog insists that its Kubernetes autoscaling advancement makes it the first observability vendor to let users make changes to their Kubernetes environment directly from its platform; previously, that kind of tweaking and control had to happen closer to the backend control plane and provisioning layer used for the cloud environment itself.

Idle containers at 83%

Because Kubernetes deployments are very often overprovisioned (to prevent infrastructure capacity issues from impacting end users), a large amount of cloud compute goes to waste. The company’s State of Cloud Costs 2024 report claims to have found that 83% of container costs are associated with idle resources. Whether the true figure is exactly 83% or somewhat lower matters less than the point: Kubernetes gets overprovisioned and needs finer-grained control to optimise infrastructure performance, which is clearly what Datadog is attempting to deliver.
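That idle-resource figure comes down to simple arithmetic: the share of requested (and therefore billed) capacity that a container never actually uses. A minimal sketch, using invented workload numbers rather than anything from Datadog’s report:

```python
# Illustrative only: estimates the share of a container's CPU spend tied
# to idle resources, i.e. capacity that is requested (and billed) but
# never used. The workload numbers below are invented for the example.

def idle_cost_share(requested_cores: float, used_cores: float) -> float:
    """Fraction of a container's CPU spend attributable to idle capacity."""
    if requested_cores <= 0:
        raise ValueError("requested_cores must be positive")
    idle = max(requested_cores - used_cores, 0.0)
    return idle / requested_cores

# A pod requesting 2 cores but averaging 0.34 cores of real usage leaves
# 83% of its requested capacity, and therefore its cost, sitting idle.
share = idle_cost_share(requested_cores=2.0, used_cores=0.34)
print(f"{share:.0%} of requested CPU is idle")  # -> 83% of requested CPU is idle
```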

But optimising infrastructure is only part of the story: IT teams still need to be able to ensure applications remain performant, with enough resources to scale.

Right-sized resources

Yrieix Garnier, VP of product at Datadog, says that Datadog Kubernetes Autoscaling continuously monitors and automatically rightsizes Kubernetes resources. Businesses like to talk about return on investment (ROI), but that conversation usually revolves around capital investments, products or people. This, instead, pushes the ROI conversation down into the IT stack to examine the return on container assets.

“Customers are able to identify workloads and clusters with a high number of idle resources, implement a one-time fix through intelligent automation or enable Datadog to automatically scale the workload on an ongoing basis,” explains Garnier. “Containers are a leading area of wasted spend because so many costs are associated with idle resources, but organizations also can’t risk degrading performance or not having enough resources to scale. The key for businesses is to find a balance between control and automation where they can automate actions when they are ready.”

He suggests that Datadog Kubernetes Autoscaling provides this balance by connecting automated Kubernetes rightsizing with real-time cost and performance data.

Democratised optimisation?

Although many of us (technical and non-technical) might quite quickly agree that optimisation technologies need a push towards automation, the notion of optimisation democratisation might sound more difficult to grasp. But that’s what Datadog is advocating: the organisation says it is making this possible so that teams can make use of (okay, they said “leverage”, you know it) a unified user interface that displays Kubernetes resource utilisation and cost metrics, making it easier for any team member to understand and scale resources.

Teams can also unify monitoring and resource management, says Garnier. Datadog’s platform gives organisations full visibility into how rightsizing impacts their workload and cluster performance, backed by high-resolution trailing container metrics, so teams can take action based on this context. Datadog Kubernetes Autoscaling is now in beta.

Datadog Live Debugger 

In line with its Kubernetes autoscaling advancement, Datadog also announced the launch of Live Debugger, a tool that enables developers to step through code in production environments and find the exact root cause of coding errors. Live Debugger requires no downtime and lets developers work directly in production instead of spending hours of trial and error reproducing production issues in development environments.

“Debugging can be a slow and inefficient process which requires extensive manual data collection and the ability to reproduce bugs in perfectly reconstructed conditions. These constraints negatively impact developer productivity and, ultimately, the end user experience,” said Hugo Kaczmarek, director of product at Datadog. “We are taking the guesswork out of debugging, minimising the friction experienced by developers and creating a tool that inherently supports rapid issue resolution while maintaining the highest standards of code quality and security.”

Breakpoint headaches

The slow and inefficient processes that Kaczmarek refers to often stem from the fact that traditional debugging techniques require developers to manually set up breakpoints throughout their code base. Developers also have to comb through unfamiliar code and documentation written by others as they try to reproduce the exact same production issue, under the exact same conditions, in their local environment.

“Live Debugger greatly simplifies this process by aggregating the necessary information from the live production environment and integrating it directly into the user’s Integrated Development Environment (IDE),” said Kaczmarek. “The product accelerates root-cause analysis with AI-generated exception summary and one-click test creation that accurately reproduces all bug conditions based on production data. Using Live Debugger not only improves the developer experience, it also dramatically reduces the time it takes to resolve issues, freeing up engineers to spend more time delivering business value.”

Live Debugger features an exception replay function, so developers can step through the execution flow of their code and see local variable values that were captured live when the exception was thrown, without needing to re-run the code. It also provides visualisations and context: Datadog delivers the observability context needed to troubleshoot issues, an AI-powered summary of the code’s ‘executional context’, a starting hypothesis for the root cause of the issue and visualisations of data flows between services, showing where the interaction between them occurred in the code.
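The mechanism underpinning exception replay, capturing local variable values at the moment an exception is thrown without breakpoints or re-execution, can be illustrated in plain Python: an exception object retains its traceback frames, and each frame carries a snapshot of its locals. This is the general technique, not Datadog’s implementation:

```python
# Illustrative sketch: walk an exception's traceback and snapshot the
# local variables of each frame, mimicking the idea behind exception
# replay. Not Datadog's implementation.

def capture_locals(exc: BaseException) -> list[dict]:
    """Return per-frame snapshots (function, line, locals) from a traceback."""
    snapshots = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        snapshots.append({
            "function": frame.f_code.co_name,
            "line": tb.tb_lineno,
            "locals": dict(frame.f_locals),  # copy; frames are mutable
        })
        tb = tb.tb_next
    return snapshots

def divide(a, b):
    ratio = a / b  # raises ZeroDivisionError when b == 0
    return ratio

try:
    divide(10, 0)
except ZeroDivisionError as exc:
    innermost = capture_locals(exc)[-1]  # the frame where the error occurred
    print(innermost["function"], innermost["locals"])
    # -> divide {'a': 10, 'b': 0}
```

A production tool does this out-of-process with instrumentation rather than a try/except, but the captured data, the function, line and live variable values at the moment of failure, is the same kind of snapshot a developer would then step through in their IDE.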

Good dog – belly rub?

Do all the tools and services provided via the Datadog platform herald a new epiphany for cloud efficiency, an end to overprovisioning, a new way to circumvent the lack of legacy code documentation artefacts and annotations… and a new clarity of vision to make the ‘always more complex than it should be’ typical Kubernetes deployment as simple as rolling over for a nice belly rub (apologies for the dog-related analogy, it was impossible to avoid)?

Yes, somewhat, but arguably not to the point of perfection, so we may still have a few more visits to the vet as we move forward. Doggy chew, anyone?