Your plan for the next system outage may be built on wishful thinking


Canadian CIOs have spent a decade building programs to stop data breaches. They did well. Now the call that can define a career is a completely different one.

Systems are down.

Six hours later, the customer-facing app is still dark, and the team trying to recover it can’t tell anyone when it will be back. Curtis Simpson, chief strategy officer at Gambit Security, told technology leaders at this year’s CIO Association of Canada Peer Forum that the job has been rewritten and most resilience programs are built for the old one.

“You generally don’t get penalized for losing data,” Simpson said.

“Today, businesses, enterprises, organizations and governments are penalized for outages (and) loss of availability.”

When something knocks an organization’s systems offline, whether it’s a cyber attack, a botched software update, a data center fire, or a failing cloud provider, the tech team’s job is to get everything back up.

This work is called recovery, and how quickly it happens determines how much damage the business absorbs.

Splunk’s The Hidden Costs of Downtime 2026 study found that unplanned outages cost Global 2000 companies an average of $300 million each year, with the group’s overall impact rising roughly 50% over two years.

Cloud outages are no longer rare events. Amazon Web Services (AWS), Microsoft Azure and other major providers have all had recent failures that left customers offline for hours.

The October 2025 AWS outage took approximately 2,000 organizations offline for close to 15 hours, including Lloyds Banking Group, Coinbase, Snapchat and the London Stock Exchange.

Simpson said CIOs used to treat a major regional AWS outage as rare enough that it didn’t need to factor into their planning.

“Those days are gone,” he said.

Many of the same cloud providers serve Canadian organizations, and how long any of them would be down in a similar event depends on how well their recovery plans hold up.

What Simpson described is the part of resilience that has to be worked out early, before everyone starts refreshing the crisis dashboard.

The four questions every technology leader should be able to answer

Most executives can’t answer all four with measured numbers, and that’s where the recovery plan breaks down long before the next disruption tests it.

The first question is about what the tech team is trying to restore. A customer-facing application, such as online banking, sits on top of many pieces of infrastructure, including servers, databases, and network connections. Most teams report recovery for each part separately.

The customer interacts with the application, and any broken part of that underlying infrastructure leaves them locked out.

“Nobody cares whether a specific system, asset or host is recoverable,” Simpson said. “That doesn’t matter.”
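One way to see the gap between per-part reporting and what the customer experiences is a rough sketch like the one below. Every component name, restore time and dependency in it is hypothetical and was not described at the forum. It assumes each part can only start restoring once the parts it depends on are back, and that independent parts restore in parallel.

```python
# Hypothetical components of a customer-facing app. Each entry maps to
# (hours to restore once its dependencies are up, list of dependencies).
# All names and numbers are illustrative assumptions.
COMPONENTS = {
    "network": (1.0, []),
    "database": (3.0, ["network"]),
    "app_servers": (2.0, ["network", "database"]),
    "online_banking": (0.5, ["app_servers"]),
}


def recovery_hours(name, components=COMPONENTS):
    """Hours until `name` is usable, counting everything it depends on."""
    own_hours, deps = components[name]
    # A component cannot start restoring until its slowest dependency is back.
    wait = max((recovery_hours(d, components) for d in deps), default=0.0)
    return wait + own_hours


def slowest_single_component(components=COMPONENTS):
    """Longest restore time reported for any one part on its own."""
    return max(hours for hours, _ in components.values())


if __name__ == "__main__":
    print("Slowest part in isolation:", slowest_single_component(), "hours")
    print("Application back for customers:", recovery_hours("online_banking"), "hours")
```

In this made-up case no single part takes more than three hours, so each would pass a four-hour target when reported separately, yet the application the customer uses is not back for 6.5 hours.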

How quickly the application can be brought back is the next question. Most organizations have a target written down somewhere, whether it’s an hour, a day or some other time frame that the business has previously agreed upon.

Simpson estimates that 95% of organizations are not testing end-to-end recovery on a consistent basis. The result is recovery targets that read well on paper but are never tested against how a real outage unfolds. The recovery objective is what the board will want to hear during the outage. Customers, for their part, will care about how long they were locked out, and that can decide whether they come back at all.

The third question is how much it would cost the company to go offline. When CIOs report risk to the board, they answer two questions: how likely is a disruption, and how much would it cost. The first usually has data behind it. The second has often been a guess, based on rough estimates of lost transactions or customers who leave.

“I’ve mostly been measuring and managing likelihood and telling stories about impact,” said Simpson, who previously served as Global CISO at Sysco and Armis.

Without a real cost number, the board can’t decide whether the recovery plan needs a $2 million investment or a $50 million one.
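A minimal sketch of how likelihood and impact combine into a number a board can weigh against an investment is shown below. The formula is a common simplification (expected loss equals frequency times duration times cost per hour), not a method Simpson presented, and every figure in it is a hypothetical placeholder.

```python
def expected_annual_loss(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly downtime cost: likelihood multiplied by impact."""
    return outages_per_year * hours_per_outage * cost_per_hour


def annual_savings(outages_per_year, cost_per_hour, hours_today, hours_with_plan):
    """Yearly loss avoided if a recovery investment shortens each outage."""
    before = expected_annual_loss(outages_per_year, hours_today, cost_per_hour)
    after = expected_annual_loss(outages_per_year, hours_with_plan, cost_per_hour)
    return before - after


if __name__ == "__main__":
    # Hypothetical inputs: one major outage every two years, 15 hours long
    # today, costing $200,000 per hour, cut to 3 hours by a better plan.
    print(expected_annual_loss(0.5, 15, 200_000))  # expected loss today
    print(annual_savings(0.5, 200_000, 15, 3))     # loss avoided per year
```

With measured inputs in place of these placeholders, the avoided loss per year gives the board something concrete to compare against the price of the recovery program.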

Finally, how much downtime can the business sustain before the real damage is done? Simpson said executives often wait for the business to give them that number, and waiting is the wrong call.

Technology leaders should go in with a number already worked out by continuity planners, risk managers or finance teams and ask the business to confirm or push back.

“The reality is somebody knows, somebody has knowledge of it,” Simpson said.

A recovery plan can pass the audit and fail the outage

Recovery testing is supposed to work like a school fire drill.

You don’t wait until smoke fills the hallway to find out whether the exits are blocked, whether the alarm goes off, or whether half the class thinks the meeting point is by the soccer field while the rest wander aimlessly.

The same applies in technology.

Many organizations test the parts separately: the backup works, the secondary system works, and the connection to the backup data center works. Everyone gets a passing grade and the dashboard looks perfect.

Then a real disruption happens and all those parts have to work together at the same time, in real time.

That’s the part Simpson said many organizations still don’t grasp. A recovery does not end when a server comes back or a backup is restored. It is finished when the customer can use the application again.

Testing fails in part because legacy systems didn’t disappear when the cloud arrived.

Many companies in banking, manufacturing and inventory operations still use AS/400s, a class of IBM business computers dating back to the late 1980s, with modern cloud applications sitting on top.

“We didn’t move to the cloud. We added the cloud,” Simpson said. “Many are still using AS/400s, mainframes, etc., interacting with middleware platforms that are interacting with the cloud.”

Joseph Ruck, head of field architecture at Gambit Security, calls the disconnect between what companies think their recovery looks like on paper and what it would look like in a real outage “the petri dish paradox.”

Companies must prove to auditors that they can recover. These validations are usually a meeting between people who want to say that the plan works.

Customers of two US credit unions found out the hard way what an untested recovery plan looks like.

The June 2024 ransomware attack on California-based Patelco Credit Union, with about 530,000 members and roughly $9 billion in assets, took most banking services offline for more than two weeks. Members could not access their money.

Patelco’s own reports show more than $39 million in quarterly losses related to the incident, most of it from covering overdrafts during the outage. A $7.2 million class action settlement is now awaiting court approval.

Ruck said the same pattern appears in every outage he’s worked: organizations whose self-assessment of their readiness ends up being “absolutely certain, but absolutely wrong.”

VyStar Credit Union’s 2022 outage is the second case. The Florida credit union was upgrading its core banking software, a three-day project that turned into weeks of customers being locked out of basic services, with some features unavailable for more than six months.

The US Consumer Financial Protection Bureau fined VyStar $1.5 million at the end of 2024 over what it called a failed implementation. Ruck described it as a self-inflicted outage.

“They were their own disruptors,” Ruck said.

Both cases point to the same problem. A recovery plan can exist on paper and still fail when customers need the system.

This is the call no CIO wants: six hours into an outage that won’t end, with customers locked out and no honest recovery time that anyone on the bridge can provide.

“We, the board, the executives, we don’t care what caused the outage,” Simpson said. “It could be a cyber attack, it could be an AI-based outage, it could be an infrastructure failure. Nobody cares.”

For many Canadian organizations, the next disruption is a matter of when, not if. The more difficult question is whether the recovery plan on file has been honestly tested or only shown to an audience.

Final shots

  • The recovery time the team commits to is only useful if it has been tested in real-world conditions.
  • Recovery needs the CIO and CISO to work from the same plan before an outage turns into another after-action review.
  • Component tests can make an audit look tidy. Full recovery testing shows whether customers can actually get back in.

Digital Journal is the national media partner for the CIO Association of Canada.


