Process hangover: L1 responders aren’t able to resolve issues
NOCs used to be the command center for technology issues. They functioned like a brain, sending out signals to relevant appendages. Issue with networking? Route to networking. Issue with security? Route to security. The NOC’s central function was to involve the correct SME to resolve an issue. This meant digging through spreadsheets (and sometimes physical contact books!) to figure out who was responsible for what.
When everything was on premise and in person, this made sense. There were fewer services, and incidents could be neatly separated by departments. If the database was having an issue, you could call up the database on-call responder. The responder (who would likely be in office or close enough to respond in person) could then go to the datacenter and take a look.
Now, in the remote work, cloud era, where organizations have hundreds or thousands of services maintained by dozens or even hundreds of teams spread across the globe, the rolodex method has outlived its purpose. It’s next to impossible to maintain accurate spreadsheets to know which teams are responsible for which services. And, as the organization changes, records grow stale quickly. Services can move between teams. Teams change as people move between them, or leave/join the company. Now, an L1 responder has to work too hard to identify the right person in an efficient and timely manner.
Organizations need a way to remove these manual steps to find the right person and route incidents directly to SMEs who can jump in to respond to any issues. This can happen in a variety of ways. For some organizations, a DevOps service ownership model is the right path forward. Those who write the code are assigned to respond and fix the service during an incident. The alert is routed directly to the on-call person on the development team that supports the service, and the SME takes it from there.
For other organizations, it might make sense to have a hybrid approach where L1 responders serve as the first line of defense before escalating to distributed, on-cal teams for their services. L1 responders shouldn’t be a routing center that connects the issue with another team. Instead, they should be empowered to resolve an incident themselves. You can set up your L1 responders to be more effective by enabling them with the ability to both troubleshoot and selectively resolve incidents. Access to automation and resources like runbooks can empower L1 responders to help accelerate the diagnosis and remediation process, oftentimes without needing to disrupt the subject matter experts that are in charge of X service via an escalation. By putting automation in the hands of L1 responders, organizations can avoid unnecessary escalations and empower L1s to resolve issues faster.
Process hangover: Major incidents aren’t called or are called too late
We’ve heard it before: time is money. And when NOCs were the primary method of ensuring incidents were responded to, they had an additional responsibility. An NOC needed to ensure that resources were well managed. This meant no unnecessary personnel responding to problems. NOCs often took the blame if they called a major incident too soon and interrupted people for a minute problem. These disruptions took SMEs away from their work innovating. So it was crucial for NOC responders to only call major incidents when it was clear there was a much bigger issue at play.
But now, time isn’t money, uptime is money. The cost of a major incident that’s flown under the radar is larger than the cost of tagging in some extra help. Imagine you’re an online retailer and your shopping cart function is down. Every minute your customers can’t add items to their cart, you’re losing hundreds of thousands of dollars. Plus, customer expectations have increased over the last few years. Customers expect that their app, tool, platform, streaming service, etc. works without interruption. And it erodes customer trust when it doesn’t. In fact, according to PWC, 1 in 3 customers would stop doing business with a brand they loved after one bad experience.
Organizations need to call major incidents sooner to mitigate customer impact. Yes, this may mean waking someone unnecessarily once in a while. But, that’s far less likely with service ownership. SMEs responsible for a service have a better understanding of when to call a major incident than an L1 responder would. So there are fewer false alarms.
Process hangover: Come-and-go war rooms
NOCs often serve as the communication hub for a major incident. This helps responders working to resolve an issue keep on task. Back when many companies had everything (and everyone) on-premise, there was a war room. People came there and the NOC coordinator kept everyone up to date. Now, with distributed teams and systems, physical war rooms are a thing of the past. Many companies instead have virtual war rooms with a video conferencing bridge or chat channel that remains open during an incident.
Other stakeholders may want to treat this war room like a physical one, dropping in as they please. But, in this virtual world, this means that these stakeholders are asking the incident responders questions. This delays the resolution. Companies with come-and-go virtual war rooms may experience more miscommunications and frustration. Responders feel frustrated by interruptions and stakeholders feel frustrated with the lack of communication.
One way to mitigate this is to close the war room to non-participants. If someone isn’t a part of the incident response team, they don’t need access to the response team’s virtual war room. Instead, what they need is an internal liaison. This is a designated communicator from the incident response team.
The internal communication liaison consolidates incident information and relays it to relevant stakeholders. To make this easier, communication liaisons can use status update notification templates. These templates dictate how to craft communications for a specific audience. They ensure that stakeholders receive any information necessary to make decisions. And no responders have to stop working on the incident at hand to share updates.
Hangovers aren’t fun, but they always end
NOCs are a tried and true way of managing incidents for many organizations. But NOC methods become out of date when moving into this era of digital transformation. Seamless communication and rapid response are key to preserving customer trust. Looking forward, teams will involve SMEs immediately and call major incidents sooner rather than later. They’ll also communicate with key stakeholders throughout an incident while setting boundaries.
And often teams need a digital operations platform to help support this transition. PagerDuty allows teams to bring major incident best practices to their organization, resolving critical incidents faster and preventing future occurrences. Try us for free for 14 days.