Incident Response

With cyberattacks and physical service disruptions becoming more common, it’s no longer a matter of if but when a major incident will affect your organization. It’s a nightmare to think about, but planning your incident response ahead of time is how you keep your business up and running and your customers satisfied.

The quality of a cloud or colocation provider’s major incident response process can make or break both your compliance and your industry reputation, so it’s important to understand their road to resolution. Today’s blog outlines everything you need to know about major incident response and how to identify a reliable partner for your business continuity needs.

Unfortunately, we often see businesses wait until they are in the midst of an outage or incident to start looking for incident response resources. If that is the case for you, please contact one of our experts for tailored advice on how to proceed.

What Constitutes a Major Incident?

While the definition of a major incident varies from provider to provider, at LightEdge we define a major incident as any service outage that affects two or more customers. Ask for this definition when shopping for a cloud or colocation provider: a provider may boast very few major incidents while using a threshold that is wildly higher than its competitors’ (think 15+ customers affected). If you don’t ask this question early on, you might be frustrated with your uptime down the line.
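To see why the threshold matters, here is a short Python sketch that counts major incidents in the same outage history under two definitions: LightEdge’s two-or-more-customers rule and the 15+ customer threshold mentioned above. The outage figures and the `count_major_incidents` helper are hypothetical, invented purely for illustration, and do not reflect any real provider’s record.

```python
# Hypothetical outage log: number of customers affected by each outage.
# These figures are invented for illustration only.
OUTAGES = [1, 3, 2, 1, 18, 4, 1, 2, 16, 1]


def count_major_incidents(outages: list[int], threshold: int) -> int:
    """Count outages that affected at least `threshold` customers."""
    return sum(1 for affected in outages if affected >= threshold)


if __name__ == "__main__":
    # Same outage history, two different definitions of "major incident".
    for threshold in (2, 15):
        count = count_major_incidents(OUTAGES, threshold)
        print(f"threshold {threshold}+: {count} major incidents")
```

The outages are identical in both runs; only the definition changes. With these made-up numbers, the two-customer rule counts 6 major incidents while the 15-customer rule counts 2, so the same track record looks three times better on paper.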

Consistency Is Key for Reliable Incident Response

It’s no secret that we live in a 24-hour world, and the expectation is that you and your customers can access your network at any time of day or night. The same expectation applies to your provider’s incident response team. Whether a major incident occurs at two in the afternoon or two in the morning, the incident response team on every shift should follow the same process.

This is also critical because it allows for easy handoff and knowledge transfer if resolution is still in progress when a shift changes. The incoming team should be able to pick up seamlessly where the previous team left off, without losing precious time getting up to speed.

What Processes Should I Watch For?

Now that we’ve identified that the process should be consistent, what should actually be included in a provider’s incident response plan? Day or night, here are the steps that need to happen to make sure your IT environment is put to rights and you’re back to business as usual.

1. Identify the Impact

When the alarm bells start to ring, the provider’s first step is to quickly identify the extent and impact of the outage (impacted environments, customers, etc.). It’s critical that a provider has a system, such as inventory management or a building management system (BMS), that lets them see the impact of an outage at scale. This is also the step where your provider determines whether this is a major incident or a single customer experiencing an outage, a determination that triggers all of the following steps as they work toward resolution.
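As a rough illustration of this step, the Python sketch below assumes a simple inventory that maps each infrastructure component to the customers who depend on it. It collects every customer affected by a set of failed components and applies LightEdge’s two-or-more-customers definition of a major incident. The component names, customer IDs, `INVENTORY` structure, and `assess_impact` helper are all hypothetical; a real inventory management system or BMS would expose this data through its own interfaces.

```python
from dataclasses import dataclass

# Hypothetical inventory: which customers depend on which component.
# Names are invented; real data would come from an inventory system or BMS.
INVENTORY: dict[str, set[str]] = {
    "pdu-a1": {"customer-101", "customer-102"},
    "switch-core-2": {"customer-102", "customer-205", "customer-310"},
    "crac-3": {"customer-410"},
}

# LightEdge's stated definition: an outage affecting two or more customers.
MAJOR_INCIDENT_THRESHOLD = 2


@dataclass
class Impact:
    customers: set[str]  # every customer touched by the failed components
    is_major: bool       # True if the outage meets the major incident threshold


def assess_impact(
    failed_components: list[str],
    inventory: dict[str, set[str]] = INVENTORY,
    threshold: int = MAJOR_INCIDENT_THRESHOLD,
) -> Impact:
    """Collect affected customers and classify the outage."""
    affected: set[str] = set()
    for component in failed_components:
        # Unknown components contribute no customers.
        affected |= inventory.get(component, set())
    return Impact(customers=affected, is_major=len(affected) >= threshold)


if __name__ == "__main__":
    print(assess_impact(["switch-core-2"]))
    print(assess_impact(["crac-3"]))
```

For example, a failure of `switch-core-2` reaches three customers and is classified as a major incident, while a failure of `crac-3` affects a single customer and would be handled as an individual outage. Using a set means a customer who depends on several failed components is counted only once.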

2. Open Communication with Affected Customers

Once your provider identifies which customers are affected, it should communicate with them frequently and clearly about the incident. Updates should cover the nature of the incident as well as the steps the provider is taking to resolve the outage.

3. All-Hands on Deck

This is the part that customers seldom see, but it is critical to resolving an incident quickly. The provider should hold a meeting where all the facts are laid out and every team member leaves with actionable steps toward resolution. This is also where your provider’s internal culture comes into play. These meetings should not be a place where team members feel blamed for an incident; they exist to lay out all the facts and develop the fastest path to resolution.

4. Resolution

After the all-hands meeting, resolution needs to become everyone’s number one priority, and the response team can’t afford to get lax when resolution seems near. Resolution typically happens in two phases: suspected and confirmed. Once resolution is suspected, it needs to be confirmed through testing and additional verification before the case can be officially closed.

5. Root Cause Analysis

This is the final, and arguably the most important, step in the incident response and resolution process. A root cause analysis establishes a timeline of events from cause to resolution, assesses the impact on customers, and recommends process improvements. This is also when preventative action items are assigned to owners to reduce the likelihood of the same incident happening again.

Moving Forward from a Major Incident

Another key indicator of a high-performing cloud or colocation provider is how they adjust their incident response plan after recovering from their most recent incident. A stagnant response plan is a huge red flag when you’re shopping for a provider or considering a move away from your current hosting partner. If they’re not talking about how they’ve adjusted their processes in response to recent incidents, they’re not a partner you want to trust with your mission-critical IT.

At their core, these adjustments aren’t just about responding to incidents as they happen; they’re also about preventing similar incidents from occurring in the future. A provider may implement routine checks, preventative safeguards, additional redundancy, or other solutions tailored to the specific issues that caused the outage.

LightEdge’s Expert Incident Response Teams are Ready to Help You Emerge Stronger

We know every element of your business’s IT hinges on the dependability of your technology to deliver what you need, when you need it. LightEdge gives you a team of local experts who deliver fully integrated data protection, disaster recovery services, and workplace recovery facilities to ensure your business is always fully covered and operational.

Whether you’re considering options for business continuity or are already in crisis, we’d love to set up a time to talk with you about the challenges you’re facing and what your goals are for uptime and incident management in the future. Our team of experts is waiting to answer your questions and get you on the road to complete business continuity.

 


Brian Gibson
With fifteen years of experience working in data centers, Brian Gibson takes a customer-centric approach to data security, infrastructure, and operations management. In his role as Director of Customer Care, his focus is making sure LightEdge clients can enjoy the reliability, security and scalability they’ve come to expect when leveraging LightEdge services.
For fun, Brian is an avid outdoor enthusiast and enjoys hunting, fishing, and spending time with his family.