As I pointed out in my previous post, application downtime is a mission-critical issue for enterprises, and downtime costs companies billions of dollars every year. Many things contribute to application downtime, but I've attempted here to pick the ten most important reasons, taking into account – unscientifically – both frequency and impact. Here's my top 10 list:
- Heterogeneous environments
IT landscapes are increasingly heterogeneous and complex. New technologies and applications – including “shadow IT” applications – are constantly being introduced to run alongside legacy applications. Meanwhile, older environments don’t get phased out at the same pace – there are environments out there with critical applications running on hardware that has been out of support for 20 years! So we’re left with multiple hardware platforms, operating systems, databases, SaaS and on-premises applications, IaaS and data center infrastructures – a corporate IT environment that’s extremely complex and difficult to navigate.
- Multiple single points of failure
Every part of the IT environment has to work together smoothly, but the sheer number of layers and components involved works against stable, reliable application performance. Here’s a simplified graphic for typical enterprise applications:
Each layer in turn comprises multiple components. For example, a server includes a CPU, memory, disks, network interface cards, network cables, power supplies, etc. If any of these malfunction or fail, the dependent application will become unreliable or unavailable. While High Availability (HA) architectures with redundant hardware are designed to protect against infrastructure failures, they cannot address problems caused by other layers. Further, HA environments have to be set up and maintained, which is time consuming and requires scheduled downtime. I have even experienced system outages caused by the HA setup itself, due to inadequate implementation.
- Multiple application interfaces
When the application landscape is complex, it is common to see many point-to-point interfaces set up for data exchange across applications, from FTP file drops to one or more middleware solutions. It doesn’t take a rocket scientist to equate more interfaces with a higher possibility of system failure when just a single application fails.
- Inadequate monitoring
Typically, most IT environments have some form of system monitoring in place. However, the setup is usually not comprehensive enough to cover all applications and dependent infrastructure layers. Undetected issues or failures, no matter how small, can snowball and manifest later on, resulting in lowered efficiency and higher costs for your company.
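Even basic coverage of an application endpoint catches failures before users do. The sketch below is a minimal availability probe in Python using only the standard library; the URL and the 200-only success criterion are illustrative assumptions, not a recommendation for any particular monitoring product.

```python
import urllib.request
import urllib.error

def check_endpoint(url, timeout=5):
    """Probe one HTTP endpoint; return (ok, detail) for alerting.

    A real monitoring setup would cover every application and
    infrastructure layer, not just a single URL.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Treat only HTTP 200 as healthy (an illustrative choice).
            return resp.status == 200, "HTTP %d" % resp.status
    except (urllib.error.URLError, OSError) as exc:
        # Network errors, timeouts, and refused connections all
        # count as failures worth alerting on.
        return False, str(exc)
```

A scheduler (cron, a monitoring agent) would call this periodically and raise an alert on repeated failures rather than on a single blip.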
- Resource bottlenecks
When applications run, they sometimes exhaust system resources like CPU, memory/swap, file system and database space. This can occur due to unexpected system load, a runaway process, memory leaks, inadequate system sizing or continuous growth. Undetected resource issues can result in system outages.
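Catching resource exhaustion early is mostly a matter of checking simple thresholds before they are breached. Here is a minimal file-system-space sketch in Python; the 80%/90% warning and critical thresholds are arbitrary examples, and in practice you would check CPU, memory, and database space the same way.

```python
import shutil

def disk_usage_pct(path="/"):
    """Percentage of the file system at `path` currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check_disk(path="/", warn=80.0, crit=90.0):
    """Map usage to a severity level (thresholds are illustrative)."""
    pct = disk_usage_pct(path)
    if pct >= crit:
        return "CRITICAL", pct
    if pct >= warn:
        return "WARNING", pct
    return "OK", pct
```

Trending these numbers over time also reveals continuous growth and slow leaks, not just sudden runaway processes.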
- Team silos
Support organizations are typically structured around technical competencies.
The network, storage, server, database and application administration teams may report to different managers or even be in different departments. It’s not uncommon for application administration itself to be broken into further silos in a mixed environment – different teams for SAP, Oracle, Salesforce, etc. Further, some teams may be outsourced, and sometimes there are multiple outsourcing partners. Coordinating activities such as correlating alert information, troubleshooting and identifying root causes becomes cumbersome and challenging.
- Job failures
A lot of business processing happens in batch mode. When a critical job fails, corrective procedures are sometimes not well defined. Also, it’s difficult to assess the impact because the key information needed – type of failed job, business impact, retry options, etc. – may be missing, outdated or difficult to locate.
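The missing piece is usually a runbook that travels with the job. A minimal sketch of that idea in Python follows; the job name, fields, and escalation contact are all hypothetical, and the point is simply that capturing impact and retry information up front makes the corrective procedure obvious when a failure happens.

```python
from dataclasses import dataclass

@dataclass
class JobRunbook:
    """Minimal runbook record for one batch job (fields illustrative)."""
    name: str
    business_impact: str  # who/what is affected if this job fails
    safe_to_rerun: bool   # can the job simply be restarted?
    escalation: str       # whom to contact if a rerun is not safe

# Hypothetical registry; in practice this lives in a CMDB or scheduler.
RUNBOOKS = {
    "nightly-invoice-load": JobRunbook(
        name="nightly-invoice-load",
        business_impact="Invoices not posted; billing delayed one day",
        safe_to_rerun=True,
        escalation="billing operations on-call",
    ),
}

def on_job_failure(job_name):
    """Return the corrective action, or an escalation fallback."""
    rb = RUNBOOKS.get(job_name)
    if rb is None:
        return "Unknown job: escalate to the IT operations duty manager"
    return "Rerun the job" if rb.safe_to_rerun else "Escalate to " + rb.escalation
```

Keeping this registry current is the hard part; stale runbooks are exactly the "missing, outdated or difficult to locate" information described above.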
- Network issues
You simply can’t guarantee service levels when the Internet is used for data transfer. Even with private WAN links, occasional network issues are common due to geographical spread. When these failures occur, the applications at either end of the data exchange may not be tolerant of failures, and manual intervention is required to analyze and restart the failed processes.
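Much of that manual intervention can be avoided if the transfer code tolerates transient failures itself. A common pattern is retry with exponential backoff and jitter, sketched below in Python; the attempt counts and delays are illustrative, and the caller supplies whatever flaky operation (an FTP transfer, an API call) needs protecting.

```python
import random
import time

def retry(func, attempts=4, base_delay=0.5, max_delay=8.0):
    """Call `func` until it succeeds, backing off between attempts.

    Retries only OSError (which covers most socket/network errors);
    the last failure is re-raised so callers still see hard outages.
    """
    for attempt in range(attempts):
        try:
            return func()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            # Exponential backoff, capped, with jitter to avoid
            # many clients retrying in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
```

Retries help only with transient faults; persistent link failures still need monitoring and human attention, which is why the two sections belong together.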
- Password expiry or locked accounts
In complex environments, several accounts are used to ensure communication between systems. Managing the passwords and authorizations for these accounts, especially with SOX-related requirements, is challenging to say the least. Sometimes, when someone leaves the company, terminating their account can trigger an application failure. Though not very common, these failures can have a serious impact on application availability and be very difficult to troubleshoot.
- Employee attrition
Complex IT landscapes require knowledge distributed across team members. When an employee leaves – or an outsourcing partner re-assigns someone to another client – it can leave a big skill set gap in the team. Some skills aren’t easy to replace, and the new person still ends up with a learning curve to come up to speed on the environment and the team.
Increasingly, enterprises are paying more attention to IT operations management to reduce inefficiencies and divert spending from failures to growth initiatives. In a perfect world, a single monitoring tool would cover the entire IT landscape, with the intelligence to detect and alert the IT operations teams ahead of time – or even better, predict and prevent problems altogether. I don’t see us ever attaining this holy grail, but I am encouraged by the promise of a new generation of learning-based IT Operations Analytics software – software that will ingest monitoring data from infrastructure, applications or the multitude of monitoring tools and apply data science techniques to predict failure and prescribe, or in some cases take, corrective action.
Maybe one day the word “downtime” will belong on the ash heap of history.