Kolton is the co-founder and CEO of Gremlin, the chaos engineering company that helps the world build a more reliable Internet.
Outages at RBS, TSB, and Visa left millions of people unable to deposit their paychecks, pay their bills, take out new loans, and more. In response, the House of Commons Treasury Select Committee (TSC) launched an investigation into the UK financial industry and found that "the current level of IT outages in financial services is unacceptable". The Bank of England (BoE), the Prudential Regulation Authority (PRA), and the Financial Conduct Authority (FCA) then decided to act and set a standard for operational resilience.
While regulations often feel burdensome and detached from reality, these are sensible steps that any company in any industry can take to improve the resilience of its software systems.
The BoE standard is divided into these five steps:
- Identify key business services based on those that end users rely on most.
- Set a tolerance level for each service: the amount of downtime during an incident that is acceptable, given the utility the service provides.
- Test whether the company is able to stay within this acceptable period in real scenarios.
- Involve management in reporting and approving these thresholds and tests.
- Take measures to improve reliability against the various scenarios, where possible.
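Steps two and three lend themselves to a small illustration. The Python sketch below is only a hypothetical example, not part of the BoE standard: the `BusinessService` class, the `breaches` function, the service name, the 30-minute tolerance, and the downtime figures are all assumptions made up for illustration. It records a tolerance for one service and flags the test scenarios in which observed downtime exceeded it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessService:
    """A business service and its downtime tolerance (hypothetical model)."""
    name: str
    tolerance_minutes: float  # maximum acceptable downtime per incident


def breaches(service, observed_downtimes):
    """Return the scenarios whose observed downtime exceeded the tolerance."""
    return {
        scenario: minutes
        for scenario, minutes in observed_downtimes.items()
        if minutes > service.tolerance_minutes
    }


# Illustrative values only: a payments service with a 30-minute tolerance
# and downtime measured in two made-up test scenarios.
payments = BusinessService("card-payments", tolerance_minutes=30)
results = {"database failover": 12.0, "datacenter loss": 95.0}
failed = breaches(payments, results)
```

Here `failed` contains only the "datacenter loss" scenario, which is the kind of result that would be reported to management (step four) and targeted for improvement (step five).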
This process matches established best practice for designing resilient systems. Let's break down each of these steps and discuss how chaos engineering can help.
Identify critical business services
The Operational Resilience Framework recommends focusing on the services that serve external customers. Internal applications matter for productivity, but this customer focus is a good way to choose a starting point for reliability efforts. Although it is ultimately up to each company to weigh the criticality of the services it offers, the recommended priorities are those that customers need to make payments, access payments, invest, or insure themselves against risk.