The Reality of Cyber Resilience and Restoring Critical Services

Most organisations accept that cyber security cannot be built around the assumption that every attack will be stopped at the perimeter. Controls remain essential, but the operating environment has changed. Systems are more connected than ever, supply chains are more digital, and a disruption in one part of an estate can rapidly affect services, people and decisions elsewhere.

That has made cyber resilience a more useful concept than cyber protection alone. The overarching question is no longer whether an organisation can prevent an incident, but whether it can continue delivering vital services when parts of its technology estate are unavailable, untrusted or operating under severe constraint.

The difficulty is that resilience is often discussed in broad and somewhat agreeable terms, but implemented through disconnected plans. Security teams own incident response. IT owns disaster recovery. Business continuity teams own fallback arrangements. Senior leaders retain decision-making authority. Suppliers operate critical parts of the environment. Each area may be competent in isolation, but cyber disruption does not respect those organisational boundaries.

A prepared, resilient organisation brings all those pieces together long before an incident happens.

Service, not technology

A common mistake is to begin resilience planning with the technology estate: the systems in scope, the backup schedule, the recovery runbooks and the security tooling. While all of these aspects matter, in most circumstances they are not the starting point.

The starting point is the organisation’s critical services and functions – what must continue, even in a compromised and degraded form? What would create unacceptable operational, financial, safety, regulatory or public consequences if it stopped? Which activities can pause temporarily, and which cannot?

For a public-sector body, that may mean maintaining access to a public-facing service, protecting sensitive case information or ensuring a critical decision-making process can continue. For an organisation supporting defence or critical national infrastructure, it may mean sustaining a mission-essential operational function, preserving secure communications or continuing the flow of information between dependent teams and suppliers.

This, then, needs to go beyond a generic business impact assessment. The objective is to define a realistic minimum viable operation: the smallest credible version of the service that can continue during a serious cyber incident.

That means being specific. Which people need to work? What information do they need? Which systems must be available, and which can be replaced with a controlled manual process? What communications channels remain trusted? How long can the organisation operate in this reduced state before the impact becomes unacceptable?

Until all those critical questions have clear answers, recovery priorities will remain vague, and in an incident that translates to delay.

Dependencies are where plans unravel

An organisation may believe it understands its most important systems, but resilience depends on a wider set of relationships than an application list can show.

A critical service may rely on identity services, network connectivity, cloud platforms, third-party support, specialist hardware, secure communications, endpoint devices, payment systems, data feeds and a small group of people with particular access or knowledge. The service may also rely on assumptions that are rarely tested, such as the availability of Microsoft 365, remote access, privileged administration accounts or the ability to contact a supplier through normal channels.

These dependencies are important because they define what can be restored and how quickly. A backup may exist, for example, but restoring data is of little use if the environment it will return to cannot be trusted. A fallback process may be documented, but it may depend on a team being able to access a shared drive, use corporate email or reach a supplier whose own services are affected.

This is why dependency mapping needs to be treated as an operational exercise rather than a technical inventory task. It should follow the delivery of a critical service from end to end and identify the points at which it could fail under cyber pressure.

That includes dependencies outside the organisation’s direct control. Contractual accountability does not make a supplier dependency disappear. A third party may be responsible for its own security, but the operational consequences of its failure will still be felt by the organisation relying on it.

Instead of only checking whether a supplier has security controls, organisations need to know exactly how they will keep running when that supplier goes dark, is compromised, or fails them when it matters most.
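The end-to-end mapping described in this section can be illustrated with a minimal sketch. It assumes the dependencies of a critical service have already been gathered into a simple map. Every service and dependency name below is hypothetical and chosen only for illustration. The sketch follows each dependency through to its own dependencies, then shows which of them a service and its documented fallback have in common.

```python
from collections import deque

# Hypothetical dependency map: each key relies directly on the listed items.
DEPENDS_ON = {
    "case-management service": ["identity service", "network connectivity", "case database"],
    "case database": ["cloud platform", "backup store"],
    "identity service": ["cloud platform"],
    "fallback manual process": ["shared drive", "corporate email"],
    "shared drive": ["identity service", "network connectivity"],
    "corporate email": ["identity service", "supplier support desk"],
}


def all_dependencies(item, depends_on):
    """Return every direct and indirect dependency of `item`."""
    seen = set()
    queue = deque(depends_on.get(item, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already followed; also guards against circular entries
        seen.add(dep)
        queue.extend(depends_on.get(dep, []))
    return seen


def shared_failure_points(service, fallback, depends_on):
    """Dependencies that both the service and its fallback rely on."""
    return all_dependencies(service, depends_on) & all_dependencies(fallback, depends_on)


common = shared_failure_points(
    "case-management service", "fallback manual process", DEPENDS_ON
)
print(sorted(common))
```

In this invented example the fallback process shares the identity service, the network and the cloud platform with the service it is meant to replace, which is the kind of hidden coupling a purely technical inventory tends to miss.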

Assume the environment is untrustworthy

Traditional disaster recovery planning often assumes a clean technical failure: a service becomes unavailable, a component is replaced, data is restored and operations return to normal. Cyber incidents, however, are different. The environment may still be compromised, identities may no longer be trustworthy, and the attacker’s level of access may not yet be fully understood.

That, in turn, significantly changes the nature, process and, potentially, duration of recovery.

The aim is not simply to restore systems quickly but to restore critical services safely, into an environment that can be trusted. This requires clear decisions about what can be recovered, what needs to be rebuilt, what evidence must be retained and what conditions have to be met before users return to normal ways of working.

That means considering much more than backup coverage. Organisations need to understand whether privileged access can be recovered securely, whether recovery artefacts are protected and accessible, whether clean infrastructure can be established, and whether logging and monitoring will be in place from the point that services are brought back online.

It also means recognising that a recovery environment is a capability in its own right. It cannot be designed for the first time during a major incident. The relevant technology, documentation, contacts, credentials, communications routes and decision authorities need to be prepared in advance.

A recovery plan stored in a compromised collaboration platform is not a recovery capability. Neither is a set of technical runbooks that depend on the same identity service, network connectivity and supplier support that have been disrupted.

Incident response, business continuity and disaster recovery

Organisations quite often have all the right plans in place, but those plans are not joined up.

Incident response focuses on understanding, containing and eradicating the threat. Disaster recovery focuses on restoring systems and data. Business continuity focuses on sustaining operations while normal ways of working are disrupted. Cyber resilience depends on all three, alongside clear leadership and risk decisions.

This distinction matters because a strong incident response process does not automatically mean the organisation can continue operating. Equally, a technically successful recovery does not mean services can safely resume. There may be many unresolved questions around data integrity, access rights, third-party dependencies, customer communications and the suitability of temporary workarounds.

The most effective resilience programmes establish a shared operating model across these disciplines. They define how recovery decisions will be made, what information leaders need to make them, and how technical and operational priorities will be balanced when both cannot be addressed at once.

That is especially important when there are difficult trade-offs – which, more often than not, is the reality. Should a service return quickly with reduced functionality, or remain unavailable until it can be restored in full? Can a manual workaround be used without creating unacceptable information security, safety or compliance risks? Which business function takes priority when two services depend on the same constrained technology or specialist resource?

Those decisions should not be improvised in the middle of a crisis. It is, of course, unlikely that an organisation can predict every potential scenario, but it can establish its principles, thresholds and decision rights in advance.

Resilience decisions before disruption

The technical aspects of cyber resilience are significant, but many failures during a major incident are caused by uncertainty, and the delay it creates, rather than by technology.

Leaders may be unclear about who can authorise a controlled reduction in service. Teams may not know whether they are permitted to use an alternative communications channel. Security teams may be focused on containment while operational leaders are under pressure to restore services before the environment is ready. A supplier may be waiting for an instruction that no one has been formally authorised to give.

These are not edge cases; they are the predictable consequences of operating without clear resilience governance.

A mature approach identifies the decisions that will be required during disruption and deals with them ahead of time. This includes recovery sequencing, authority to invoke continuity measures, criteria for returning systems to service, communications responsibilities, supplier escalation routes and the circumstances in which risk can be accepted temporarily.

It also requires senior leadership to engage with cyber resilience as an operational issue. The CISO cannot own business continuity alone, just as the business cannot assume that technology teams will resolve every consequence of a major cyber incident. Resilience is shared because the impact is shared.
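Recovery sequencing is one of the few decisions in that list that can be worked out mechanically once the prerequisites are agreed. The sketch below is illustrative only. The item names and their prerequisites are hypothetical, and it uses Python's standard `graphlib` module to produce one valid order in which nothing is restored before the things it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical example: each item lists what must be restored and trusted first.
RESTORE_AFTER = {
    "clean network segment": [],
    "identity service": ["clean network segment"],
    "logging and monitoring": ["clean network segment"],
    "case database": ["identity service", "logging and monitoring"],
    "case-management service": ["case database", "identity service"],
}


def recovery_sequence(restore_after):
    """Return one order in which every item follows its prerequisites.

    Raises graphlib.CycleError if the prerequisites are circular.
    """
    return list(TopologicalSorter(restore_after).static_order())


sequence = recovery_sequence(RESTORE_AFTER)
for step, item in enumerate(sequence, start=1):
    print(step, item)
```

A circular prerequisite raises an error rather than producing an order, which is itself a useful finding to surface before an incident. The sketch does not settle which of two independent items should come first; that remains a business priority decision for the people with the authority to make it.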

Exercises should test the organisation

Many organisations have plans that look credible when reviewed in isolation. However, the real test is whether they work when people are under pressure and information is incomplete.

Meaningful exercises are key, but they must go beyond a simple discussion about whether people know whom to call.

A useful cyber resilience exercise tests the actual operating conditions the organisation may face. What happens when corporate email is unavailable? How do senior leaders make decisions when the usual reporting systems cannot be trusted? Can priority teams access the information and communications they need? Can a supplier be engaged through an alternative route? How long can a critical service continue using its fallback process before quality, safety or public confidence is affected?

The strongest exercises involve the people who would genuinely be responsible during an incident: operational leaders, technical teams, communications teams, procurement, legal advisers and relevant suppliers. They should expose friction, uncertainty and dependencies rather than allowing participants to work around them through assumptions.

An exercise that identifies a gap is not a failure; it is evidence that the organisation has found an issue in a controlled setting rather than during a live, uncontrolled incident.

Measuring resilience through capability

Cyber resilience is difficult to improve when it is measured through broad assurances alone. A policy may exist. A plan may have been approved. A recovery time objective may be recorded. Yet none of that proves the organisation can sustain and restore its critical services under realistic conditions. The more useful measures are practical.

Can the organisation identify the services that matter most and the dependencies they rely on? Has it defined how those services will operate in a degraded state? Has it tested whether the relevant people can access recovery information without the normal environment? Has it demonstrated that it can restore priority systems into a trusted state? Has it exercised decision-making, communications and supplier coordination under pressure?

These measures are far more demanding than a compliance checklist, but they create a clearer picture of where the organisation is genuinely prepared and where it remains exposed.

They also help direct the investment of time and resource. Resilience does not require every system to be restored immediately or every process to be duplicated. It requires deliberate choices about what matters most, what level of disruption can be tolerated and where additional capability will make the greatest difference.

Cyber resilience through practical choices

Cyber resilience can sound like a broad ambition, particularly when it is framed as an enterprise-wide responsibility. The reality is that it is built through a series of practical, upfront choices and by testing whether the organisation can function under the pressure of a cyber attack.

As covered here, those choices need to be made far in advance of any disruption, while there is time to challenge assumptions and resolve competing priorities.

The organisations that handle cyber incidents best are the ones that have already worked through potential scenarios and identified what matters most, what they can operate with and without, and how they will restore trust when normal ways of working are no longer available.

That is the difference between having a recovery plan and having real, effective cyber resilience.

