Introducing “Automated Diagnostics”
The challenge for many companies is that the deeper data needed to make accurate diagnoses is locked away in production environments, and requires specialists to extract because of their knowledge, skills, and access privileges.
To answer the questions of “What went wrong?” and “Who can fix it?” a first responder has to summon on average at least 3 other engineers to pull information to which only they have proper access.
The bottom line is too much time and too many people hours are spent repeatedly gathering diagnostic data. Automating this repetition will speed up MTTR by at least 15 minutes and reduce costs and interruptions by at least 50%.
According to Gartner, “by 2023, 40% of all enterprise workloads will be deployed in cloud infrastructure and platform services, which is an increase from 20% in 2020.” This quote further drives home the reality that cloud adoption continues to be a top priority for enterprises looking to further digitize their services and backend infrastructure.
Cloud services give you unprecedented scale, agility, and speed of innovation, but teams face increasing complexity and ever-growing dependencies across systems, processes, and their organizations. This complex situation threatens to put the customer and employee experience — not to mention revenue — at risk.
As organizations migrate to the cloud and deploy cloud-native architectures, the increased complexity can cause more (expensive) incidents. Many organizations run in complex cloud architectures containing several interconnected services — many existing ephemerally — that are deployed across different availability zones and accounts. When incidents happen, it can take a long time to resolve them without understanding the root cause or who has the proper access privileges and subject matter expertise. This means lots of escalations and developers being pulled away from high-value work.
Incidents can get expensive — really expensive. A major retailer can lose upwards of $200K per minute in revenue every minute the site is down. Incidents also incur productivity costs, as engineers are working on fixing the problem instead of building new features and focusing on innovation. A poor customer experience because of or during an incident can further cost an organization dollars in the form of brand reputation. And when you add up all of those factors, the cost of an incident is much higher than you may have accounted for.
Resiliency is essential to ensuring that your customers enjoy their digital experiences with little to no interruption. The uncomfortable reality? Things will inevitably break and services will go down. It happens to all of us. What really matters is how fast you can recover and get your services back in the green, in addition to ensuring similar incidents like that don’t occur again in the future. Ensuring you have full visibility across your hybrid infrastructure and making sure you can detect and diagnose issues quickly is essential to continuity of your business and all your services.
Resilience doesn’t just happen, it’s a shared responsibility. Customers have to set up their infrastructure, operations, and people in a way that helps them endure and quickly respond to incidents. Defining clear ownership and accountability by having teams build and own their services is an essential part ensuring that you can have focused, real-time incident response.
PagerDuty empowers teams with end-to-end incident response and advanced automation capabilities that quickly and accurately orchestrate the right response, every time. Process Automation helps teams to quickly diagnose and resolve incidents by significantly reducing the number of escalations and MTTR so engineering teams can focus on continuous improvement and innovation.
Too many humans, too little time
Modern cloud architectures for AWS customers are composites of some 250 AWS services and 25,000 SaaS workflows available in the market, combined with in-house developed software and other legacy systems.
When incidents occur in these complex cloud environments, access to full cloud stack expertise is often needed to determine probable root cause, rule out other possibilities among dependencies, and check for false positives. This may require a first responder to escalate to several expert engineers to gather these diagnostics to determine who the ultimate resolver should be.
First line responders often lack know-how and access to gather diagnostic content in AWS environments. Many first line responders are generalists, and lack technical knowledge of what investigations are needed to diagnose specific issues in services. First responders also lack superuser access to be able to execute technical investigations due to security policies.
This means first responders typically must escalate to multiple experts to get the data they need to triage an incident, consuming more staff time to resolve the incident and interrupting more team members. For serious outages this needlessly extends the length of time it takes to resolve an incident, takes engineers away from high-value work, and increases the overall cost of an outage. Automation can play a key role in not only resolving incidents faster, but in arming first responders with the diagnostic data they need to resolve incidents on their own, thus safeguarding valuable engineering time.
Automated Diagnostics for AWS
With Automated Diagnostics for AWS, incident responders can quickly triage incidents themselves, reducing the need to escalate for help, speeding up resolution for customers, and increasing operational efficiency. Automated Diagnostics for AWS in PagerDuty provides frequently used, pre-built diagnostics job templates for commonly used services, including Amazon EC2, AWS Lambda, Amazon ECS, Amazon RDS, and more. Customers can easily configure these template jobs to work in their specific environments and extend the diagnostics steps in a workflow. Automated Diagnostics for AWS also allows customers to quickly design their own diagnostic jobs for AWS, and corrective automation for mitigation and remediation that can be invoked by responders within PagerDuty Incident Response, or triggered by PagerDuty Event Intelligence.
Customer Service teams and stakeholders are coordinated with real-time status information to deliver a better customer support experience. Automation helps internal teams operate more efficiently by shaving 25 minutes off MTTR, reducing the number of people required to resolve an incident and decreasing the number of escalations by 40%, saving time and money while improving the customer experience.
Automated Diagnostics for AWS:
- Empowers first responders with the power to triage, mitigate, and resolve incidents, improving MTTR across the board.
- Reduces escalations to engineers by using pre-built job templates and plugin integrations to critical AWS tools and services
- Enables teams to continuously improve the efficiency of incident response within their AWS environments giving time back to engineers
If you don’t already have a PagerDuty account, sign up for a free Guided Trial today