Availability Management Tutorial

3.1 Availability Management

Welcome to learning unit 3 on the availability management process. Let us begin with the agenda on the next slide.

3.2 Availability Management

In this learning unit we will discuss the purpose, objectives, scope, activities, key concepts, triggers, inputs and outputs, challenges, risks, critical success factors (CSFs) and key performance indicators (KPIs) of the availability management process. Let us begin with the purpose and objectives on the next slide.

3.3 Availability Management - Purpose and Objectives

The purpose of the availability management process is to ensure that the level of availability delivered in all IT services meets the agreed availability needs and/or service level targets in a cost-effective and timely manner. Availability management is concerned with meeting both the current and future availability needs of the business.

Availability management defines, analyses, plans, measures and improves all aspects of the availability of IT services, ensuring that all IT infrastructure, processes, tools, roles etc. are appropriate for the agreed availability service level targets. It provides a point of focus and management for all availability-related issues, relating to both services and resources, ensuring that availability targets in all areas are measured and achieved.

The objectives of availability management are to:
• Produce and maintain an appropriate and up-to-date availability plan that reflects the current and future needs of the business
• Provide advice and guidance to all other areas of the business and IT on all availability-related issues
• Ensure that service availability achievements meet all their agreed targets by managing services and resources-related availability performance
• Assist with the diagnosis and resolution of availability-related incidents and problems
• Assess the impact of all changes on the availability plan and the availability of all services and resources
• Ensure that proactive measures to improve the availability of services are implemented wherever it is cost-justifiable to do so.

Availability management should ensure the agreed level of availability is provided. The measurement and monitoring of IT availability is a key activity to ensure availability levels are being met consistently. Availability management should look to continually optimize and proactively improve the availability of the IT infrastructure, the services and the supporting organization, in order to provide cost-effective availability improvements that can deliver business and customer benefits. In the next slide we will look at the scope of availability management.

3.4 Availability Management - Scope

The scope of the availability management process covers the design, implementation, measurement, management and improvement of IT service and component availability. Availability management commences as soon as the availability requirements for an IT service are clear enough to be articulated. It is an ongoing process, finishing only when the IT service is decommissioned or retired.

The availability management process includes two key elements: reactive activities and proactive activities. Reactive activities involve the monitoring, measuring, analysis and management of all events, incidents and problems involving unavailability. These activities are principally performed as part of the operational roles. Proactive activities involve the proactive planning, design and improvement of availability. These activities are principally performed as part of the design and planning roles. These activities will be discussed again in later slides.

The availability management process should include:
• Monitoring of all aspects of availability, reliability and maintainability of IT services and the supporting components, with appropriate events, alarms and escalation, with automated scripts for recovery
• Maintaining a set of methods, techniques and calculations for all availability measurements, metrics and reporting
• Actively participating in risk assessment and management activities
• Collecting measurements and analysing and producing regular and ad hoc reports on service and component availability
• Understanding the agreed current and future demands of the business for IT services and their availability
• Influencing the design of services and components to align with business availability needs
• Producing an availability plan that enables the service provider to continue to provide and improve services in line with availability targets defined in service level agreements (SLAs), and to plan and forecast future availability levels required, as defined in service level requirements (SLRs)
• Maintaining a schedule of tests for all resilience and fail-over components and mechanisms
• Assisting with the identification and resolution of any incidents and problems associated with service or component unavailability
• Proactively improving service or component availability wherever it is cost-justifiable and meets the needs of the business.

The availability management process does not include business continuity management (BCM) and the resumption of business processing after a major disaster. The support of BCM is included within IT service continuity management (ITSCM). However, availability management does provide key inputs to ITSCM, and the two processes have a close relationship, particularly in the assessment and management of risks and in the implementation of risk reduction and resilience measures. Let us now look at the value of availability management to the business in the next slide.

3.5 Availability Management - Value to the Business

The availability management process ensures that the availability of systems and services matches the evolving agreed needs of the business. The role of IT within the business is now pivotal. The availability and reliability of IT services can directly influence customer satisfaction and the reputation of the business. This is why availability management is essential in ensuring IT delivers the levels of service availability required by the business to satisfy its business objectives and deliver the quality of service demanded by its customers.

In today’s competitive marketplace, customer satisfaction with the services provided is paramount. Customer loyalty can no longer be relied on, and dissatisfaction with the availability and reliability of an IT service can be a key factor in customers taking their business to a competitor. Availability management can also improve the ability of the business to follow an environmentally responsible strategy by using green technologies and techniques. In the next slide we will talk about the policies of availability management.

3.6 Availability Management - Policies

As a matter of policy, the availability management process, just like capacity management, must be involved in all stages of the service lifecycle, from strategy and design, through transition and operation to improvement. The appropriate availability and resilience should be designed into services and components from the initial design stages. This will ensure not only that the availability of any new or changed service meets its expected targets, but also that all existing services and components continue to meet all of their targets. This is the basis of stable service provision.

The service provider organization should establish policies defining when and how availability management must be engaged throughout each lifecycle stage. Policies should also be established regarding the criteria to be used to define availability and unavailability of a service or component and how each will be measured.

An effective availability management process, consisting of both the reactive and proactive activities, can ‘make a big difference’ and will be recognized as such by the business, if the deployment of availability management within an IT organization has a strong emphasis on the needs of the business and customers. To reinforce this emphasis, there are several guiding principles that should underpin the availability management process and its focus:
• Service availability is at the core of customer satisfaction and business success: there is a direct correlation in most organizations between service availability and customer and user satisfaction, where poor service performance is defined as being unavailable.
• Recognizing that when services fail, it is still possible to achieve business, customer and user satisfaction and recognition: the way a service provider reacts in a failure situation has a major influence on customer and user perception and expectation.
• Improving availability can only begin after understanding how the IT services support the operation of the business.
• Service availability is only as good as the weakest link in the chain: it can be greatly increased by the elimination of single points of failure or an unreliable or weak component.
• Availability is not just a reactive process. The more proactive the process, the better service availability will be. Availability management should not purely react to service and component failure. The more often events and failures are predicted, pre-empted and prevented, the higher the level of service availability.
• It is cheaper to design the right level of service availability into a service from the start, rather than try to ‘bolt it on’ subsequently. Adding resilience into a service or component afterwards is invariably more expensive than designing it in from the start. Also, once a service gets a bad name for unreliability, it becomes very difficult to change the image. Resilience is also a key consideration of ITSCM, and this should be considered at the same time.

Let’s now look at the key concepts of availability management in the next slide.

3.7 Availability Management - Basic Concepts 1 of 3

Availability is the ability of a service, component or configuration item (CI) to perform its agreed function when required. It is often measured and reported as a percentage. Note that downtime should only be included in the following calculation when it occurs within the agreed service time (AST). However, total downtime should also be recorded and reported. The availability percentage is therefore:

Availability (%) = (agreed service time - downtime) / agreed service time × 100

Next is reliability. Reliability is a measure of how long a service, component or CI can perform its agreed function without interruption. The reliability of the service can be improved by increasing the reliability of individual components or by increasing the resilience of the service to individual component failure (that is, increasing the component redundancy, for example by using load-balancing techniques). It is often measured and reported as the mean time between service incidents (MTBSI) or the mean time between failures (MTBF):

Reliability (MTBSI in hours) = available time in hours / number of breaks

Reliability (MTBF in hours) = (available time in hours - total downtime in hours) / number of breaks

Let’s look at maintainability next.
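The availability and reliability formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures (a 24x7 service measured over 20 weeks, with 3 breaks and 12 hours of downtime inside the agreed service time) are assumptions made for the example, not values from the tutorial.

```python
def availability_pct(agreed_service_time_h, downtime_h):
    """Availability (%) = (AST - downtime) / AST * 100.

    Only downtime that falls inside the agreed service time should be passed in.
    """
    if agreed_service_time_h <= 0:
        raise ValueError("agreed service time must be positive")
    return (agreed_service_time_h - downtime_h) / agreed_service_time_h * 100


def mtbsi_hours(available_time_h, breaks):
    """Mean time between service incidents = available time / number of breaks."""
    if breaks <= 0:
        raise ValueError("number of breaks must be positive")
    return available_time_h / breaks


def mtbf_hours(available_time_h, total_downtime_h, breaks):
    """Mean time between failures = (available time - total downtime) / number of breaks."""
    if breaks <= 0:
        raise ValueError("number of breaks must be positive")
    return (available_time_h - total_downtime_h) / breaks


if __name__ == "__main__":
    # Hypothetical figures: 20 weeks x 7 days x 24 hours = 3360 hours of agreed service time.
    ast_h = 20 * 7 * 24
    downtime_h = 12
    breaks = 3
    print(f"Availability: {availability_pct(ast_h, downtime_h):.2f}%")
    print(f"MTBSI: {mtbsi_hours(ast_h, breaks)} hours")
    print(f"MTBF:  {mtbf_hours(ast_h, downtime_h, breaks)} hours")
```

With these assumed figures the MTBSI is 1120 hours and the MTBF is 1116 hours; the difference is the average downtime per break.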

3.8 Availability Management - Basic Concepts 2 of 3

Maintainability is a measure of how quickly and effectively a service, component or CI can be restored to normal working after a failure. It is measured and reported as the mean time to restore service (MTRS) and should be calculated using the following formula:

Maintainability (MTRS in hours) = total downtime in hours / number of service breaks

Next is serviceability. Serviceability is the ability of a third-party supplier to meet the terms of its contract. This contract will include agreed levels of availability, reliability and/or maintainability for a supporting service or component.

Lastly, the vital business function. The term vital business function (VBF) is used to reflect the part of a business process that is critical to the success of the business. An IT service may also support a number of business functions that are less critical. For example, for an automated teller machine (ATM) or cash dispenser service, the VBF would be the dispensing of cash. However, the ability to obtain a statement from an ATM may not be considered vital. This distinction is important and should influence availability design and associated costs. Generally, the more vital the business function, the greater the level of resilience and availability that needs to be incorporated into the design of the supporting IT services. For all services, whether VBFs or not, the availability requirements should be determined by the business and not by IT. The initial availability targets are often set at too high a level, and this leads to either over-priced services or an iterative discussion between the service provider and the business to agree an appropriate compromise between the service availability and the cost of the service and its supporting technology.
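As a small worked example of the maintainability formula, the sketch below computes MTRS from a list of outage durations. The outage durations and the 3360-hour measurement period are hypothetical values chosen for illustration. The last lines also check a relationship that follows directly from the formulas for MTBSI, MTBF and MTRS: MTBSI = MTBF + MTRS.

```python
def mtrs_hours(outage_durations_h):
    """Mean time to restore service = total downtime / number of service breaks."""
    if not outage_durations_h:
        raise ValueError("at least one service break is required")
    return sum(outage_durations_h) / len(outage_durations_h)


if __name__ == "__main__":
    # Hypothetical outages (in hours) recorded during a 3360-hour measurement period.
    outages_h = [2.0, 4.0, 6.0]
    available_time_h = 3360

    mtrs = mtrs_hours(outages_h)
    mtbf = (available_time_h - sum(outages_h)) / len(outages_h)
    mtbsi = available_time_h / len(outages_h)

    print(f"MTRS:  {mtrs} hours")
    print(f"MTBF:  {mtbf} hours")
    print(f"MTBSI: {mtbsi} hours")
    # Consistency check that follows from the three formulas.
    assert abs(mtbsi - (mtbf + mtrs)) < 1e-9
```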

3.9 Availability Management - Basic Concepts 3 of 3

The availability management process has two key elements:
• Reactive activities: the reactive aspect of availability management involves the monitoring, measuring, analysis and management of all events, incidents and problems involving unavailability. These activities are principally involved within operational roles.
• Proactive activities: the proactive activities of availability management involve the proactive planning, design and improvement of availability. These activities are principally involved within design and planning roles.

We will learn about these activities in detail in the coming slides.

3.10 Availability Management - Reactive Activities

The reactive aspect of availability management involves work to ensure that current operational services and components deliver the agreed levels of availability and to respond appropriately when they do not. The reactive activities include:
• Monitoring, measuring, analysing, reporting and reviewing service and component availability
• Investigating all service and component unavailability and instigating remedial action. This includes looking at events, incidents and problems involving unavailability.

These activities are primarily conducted within the service operation stage of the service lifecycle and are linked into the monitoring and control activities and incident management processes. Let us understand the proactive activities in the next slide.

3.11 Availability Management - Proactive Activities

The proactive activities of availability management involve the work necessary to ensure that new or changed services can and will deliver the agreed levels of availability and that appropriate measurements are in place to support this work. They include producing recommendations, plans and documents on design guidelines and criteria for new and changed services, and the continual improvement of service and reduction of risk in existing services wherever it can be cost-justified. These are key aspects to be considered within service design activities.

Proactive activities include:
• Planning and designing new or changed services
• Determining the VBFs, in conjunction with the business and ITSCM
• Determining the availability requirements from the business for a new or enhanced IT service and formulating the availability and recovery design criteria for the supporting IT components
• Defining the targets for availability, reliability and maintainability for the IT infrastructure components that underpin the IT service to enable these to be documented and agreed within SLAs, OLAs and contracts
• Performing risk assessment and management activities to ensure the prevention of and/or recovery from service and component unavailability
• Designing the IT services to meet the availability and recovery design criteria and associated agreed service levels
• Establishing measures and reporting of availability, reliability and maintainability that reflect the business, user and IT support organization perspectives.

Risk assessment and management is a proactive activity, so let us understand exactly what it does. Its main objective is to determine the impact arising from IT service and component failure in conjunction with ITSCM, to review the availability design criteria where appropriate so as to provide additional resilience that prevents or minimizes impact to the business, and then to implement cost-justifiable countermeasures, including risk reduction and recovery mechanisms.

The other proactive activities involve:
• Reviewing all new and changed services and testing all availability and resilience mechanisms
• Continual reviewing and improvement
• Producing and maintaining an availability plan that prioritizes and plans IT availability improvements.

In the next slide we will look at the techniques used in availability management.

3.12 Availability Management - Techniques

The main objective of availability management is to ensure that the agreed availability is provided to the customer or the end users. For many reasons, however, this assurance may be affected. Several techniques are used to understand the reasons for a disruption of availability. Let us begin with component failure impact analysis.

Component Failure Impact Analysis
Component failure impact analysis (CFIA) can be used to predict and evaluate the impact on IT service arising from component failures within the technology. The output from a CFIA can be used to identify where additional resilience should be considered to prevent or minimize the impact of component failure on the business operation and users.

Single Point of Failure Analysis
A single point of failure (SPoF) is any component within the IT infrastructure that has no backup or fail-over capability, and has the potential to cause disruption to the business, customers or users when it fails. It is important that no unrecognized SPoFs exist within the IT infrastructure design or the actual technology, and that they are avoided wherever possible.

Fault Tree Analysis
Fault tree analysis (FTA) is a technique that can be used to determine the chain of events that causes a disruption to IT services. FTA, in conjunction with calculation methods, can offer detailed models of availability. This can be used to assess the availability improvement that can be achieved by individual technology component design options. Using FTA:
• Information can be provided that can be used for availability calculations
• Operations can be performed on the resulting fault tree; these operations correspond with design options
• The desired level of detail in the analysis can be chosen.

Moving on, let us look at the triggers of availability management in the next slide.
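To make the techniques above more concrete, here is a minimal Python sketch. It is not a full CFIA or FTA: it uses a simplified CFIA-style grid to flag single points of failure, and a basic series/parallel availability calculation of the kind that can accompany a fault tree. The service names, component names and availability figures are hypothetical, and the calculation assumes that component failures are independent.

```python
from math import prod

# Hypothetical CFIA-style grid: each service lists the component groups it
# depends on. Components inside one group are assumed to be interchangeable
# (any one of them can carry the load); every group is needed for the service.
SERVICES = {
    "email":   [["mail-1", "mail-2"], ["core-switch"], ["db-1"]],
    "payroll": [["app-1"], ["core-switch"], ["db-1", "db-2"]],
}

# Hypothetical component availabilities, as fractions of agreed service time.
AVAILABILITY = {
    "mail-1": 0.99, "mail-2": 0.99, "app-1": 0.995,
    "core-switch": 0.999, "db-1": 0.99, "db-2": 0.99,
}


def single_points_of_failure(services):
    """Map each component that has no alternative in its group to the services it can break."""
    spofs = {}
    for service, groups in services.items():
        for group in groups:
            if len(group) == 1:
                spofs.setdefault(group[0], []).append(service)
    return spofs


def group_availability(availabilities):
    """Redundant group: unavailable only if every member is down (independence assumed)."""
    return 1 - prod(1 - a for a in availabilities)


def service_availability(groups, availability):
    """Groups in series: the service needs every group to be available."""
    return prod(group_availability([availability[c] for c in group]) for group in groups)


if __name__ == "__main__":
    for component, affected in single_points_of_failure(SERVICES).items():
        print(f"SPoF: {component} affects {affected}")
    for name, groups in SERVICES.items():
        print(f"{name}: {service_availability(groups, AVAILABILITY):.4%}")
```

In this sketch a service can never be more available than its least available group, which is the "weakest link" principle in numeric form. Adding a second component to a single-member group is the kind of design option whose benefit such a calculation can estimate.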

3.13 Availability Management - Triggers

As we discussed for the previous process, a process needs a trigger to initiate it. Until a process is triggered, it will not carry out its defined activities. Many events may trigger availability management activities. These include:
• New or changed business needs or new or changed services
• New or changed targets within agreements, such as SLRs, SLAs, OLAs or contracts
• Service or component breaches, availability events and alerts, including threshold events and exception reports
• Periodic activities such as reviewing, revising or reporting
• Review of availability management forecasts, reports and plans
• Review and revision of business and IT plans and strategies
• Review and revision of designs and strategies
• Recognition or notification of a change of risk or impact of a business process or VBF, an IT service or component
• Request from service level management (SLM) for assistance with availability targets and explanation of achievements.

So far, we have discussed the activities and triggers of availability management. In the next slide let us look at the inputs and outputs of this process.

3.14 Availability Management - Inputs and Outputs

A number of sources of information are relevant to the availability management process. Some of these are as follows:
• Business information: from the organization’s business strategy, plans and financial plans, and information on its current and future requirements, including the availability requirements for new or enhanced IT services
• Business impact information: from business impact analyses (BIAs) and assessments of VBFs underpinned by IT services
• Reports and registers: previous risk assessment reports and a risk register
• Service information: from the service portfolio and the service catalogue
• Service information: from the SLM process, with details of the services from the service portfolio and the service catalogue, service level targets within SLAs and SLRs, and possibly from the monitoring of SLAs, service reviews and breaches of the SLAs
• Financial information: from financial management for IT services, the cost of service provision, and the cost of resources and components
• Change and release information: from the change management process with a change schedule, the release schedule from release and deployment management, and a need to assess all changes for their impact on service availability
• Service asset and configuration management: containing information on the relationships between the business, the services, the supporting services and the technology
• Service targets: from SLAs, SLRs, OLAs and contracts
• Component information: on the availability, reliability and maintainability requirements for the technology components that underpin IT services
• Technology information: from the configuration management system (CMS) on the topology and the relationships between the components, and the assessment of the capabilities of new technology
• Past performance: from previous measurements, achievements and reports and the availability management information system (AMIS)
• Unavailability and failure information: from incidents and problems
• Planning information: from other processes, such as the capacity plan from capacity management.

Next, let’s discuss the outputs. The outputs of availability management are:
• The availability management information system (AMIS)
• The availability plan for the proactive improvement of IT services and technology
• Availability and recovery design criteria and proposed service targets for new or changed services
• Service availability, reliability and maintainability reports of achievements against targets, including input for all service reports
• Component availability, reliability and maintainability reports of achievements against targets
• Revised risk assessment reviews and reports and an updated risk register
• Monitoring, management and reporting requirements for IT services and components to ensure that deviations in availability, reliability and maintainability are detected, actioned, recorded and reported
• An availability management test schedule for testing all availability, resilience and recovery mechanisms
• The planned and preventative maintenance schedules
• The projected service outage (PSO), in conjunction with change management and release and deployment management
• Details of the proactive availability techniques and measures that will be deployed to provide additional resilience to prevent or minimize the impact of component failures on IT service availability
• Improvement actions for inclusion within the service improvement plan (SIP).

Let us proceed to the next slide on the interfaces of availability management.

3.15 Availability Management - Interfaces

The key interfaces that availability management has with other processes are:
• Service level management: this process relies on availability management to determine and validate availability targets and to investigate and resolve service and component breaches.
• Incident and problem management: these processes are assisted by availability management in the resolution and subsequent justification and correction of availability incidents and problems.
• Capacity management: this process provides appropriate capacity to support resilience and overall service availability. It also uses information from demand management about patterns of business activity and user profiles to understand business demand for IT services, and provides this information to availability management for business-aligned availability planning.
• Change management: this process leads to the creation of the PSO with contributions from availability management. When changes are proposed to a service, availability management must assess the change for availability-related issues, including any potential impact on the achievement of availability service levels.
• IT service continuity management (ITSCM): availability management works collaboratively with this process on the assessment of business impact and risk and the provision of resilience, fail-over and recovery mechanisms. Availability management focuses on normal business operation and ITSCM focuses on the extraordinary interruption of service.
• Information security management (ISM): if the data becomes unavailable, the service becomes unavailable. ISM defines the security measures and policies that must be included in the service design for availability and design for recovery.
• Access management: availability management provides the methods for appropriately granting and revoking access to services as needed.

Now that we have an understanding of the interfaces, in the next slide we will learn about the CSFs and KPIs.

3.16 Availability Management - CSFs and KPIs

Each organization should identify appropriate CSFs based on its objectives for the process. Each sample CSF is followed by a small number of typical KPIs that support the CSF. These KPIs should not be adopted without careful consideration. Each organization should develop KPIs that are appropriate for its level of maturity, its CSFs and its particular circumstances. Achievement against KPIs should be monitored and used to identify opportunities for improvement, which should be logged in the CSI register for evaluation and possible implementation.

Let us consider the first CSF: “Manage availability and reliability of IT service”. The corresponding KPIs would be:
• Percentage reduction in the unavailability of services and components
• Percentage increase in the reliability of services and components
• Effective review and follow-up of all SLA, OLA and underpinning contract breaches relating to availability and reliability
• Percentage improvement in overall end-to-end availability of service
• Percentage reduction in the number and impact of service breaks
• Improvement in the MTBF
• Improvement in the MTBSI
• Reduction in the MTRS.

Let’s consider the next CSF: “Satisfy business needs for access to IT services”. The corresponding KPIs for this CSF would be:
• Percentage reduction in the unavailability of services
• Percentage reduction of the cost of business overtime due to unavailable IT
• Percentage reduction in critical time failures, for example where specific business peak and priority availability needs are planned for
• Percentage improvement in business and users satisfied with service (by customer satisfaction survey results).

Let us take one more CSF for some more clarity: “Availability of IT infrastructure and applications, as documented in SLAs, provided at optimum costs”. The corresponding KPIs would be:
• Percentage reduction in the cost of unavailability
• Percentage improvement in the service delivery costs
• Timely completion of regular risk assessment and system review
• Timely completion of regular cost-benefit analysis established for infrastructure CFIA
• Percentage reduction in failures of third-party performance on MTRS/MTBF against contract targets
• Reduced time taken to complete (or update) a risk assessment
• Reduced time taken to review system resilience
• Reduced time taken to complete an availability plan
• Timely production of management reports
• Percentage reduction in the incidence of operational reviews uncovering security and reliability exposures in application designs.

In the next slide we will discuss information management.
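Many of the KPIs above are expressed as a percentage reduction or improvement between two reporting periods. As a rough sketch of how such a figure could be derived, the helper functions below compare a previous and a current measurement. The function names and sample figures are hypothetical and only illustrate the arithmetic.

```python
def pct_reduction(previous, current):
    """Percentage reduction from the previous to the current period (positive = lower now)."""
    if previous <= 0:
        raise ValueError("previous value must be positive")
    return (previous - current) * 100 / previous


def pct_improvement(previous, current):
    """Percentage increase from the previous to the current period (positive = higher now)."""
    if previous <= 0:
        raise ValueError("previous value must be positive")
    return (current - previous) * 100 / previous


if __name__ == "__main__":
    # Hypothetical figures for two consecutive reporting periods.
    print(f"Unavailability: {pct_reduction(40, 30)}% reduction (40 h -> 30 h)")
    print(f"MTRS:           {pct_reduction(4, 3)}% reduction (4 h -> 3 h)")
    print(f"MTBF:           {pct_improvement(1000, 1100)}% improvement (1000 h -> 1100 h)")
```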

3.17 Availability Management - Information Management

The availability management process should maintain an AMIS (pronounce as A-M-I-S)that contains all of the measurements and information required to complete the availability management process and provide the appropriate information to the business on the level of IT service provided. This information, covering services, components and supporting services, provides the basis for regular adhoc and exception availability reporting and the identification of trends within the data for the instigation of improvement activities. These activities and the information contained within the AMIS provide the basis for developing the content of the availability plan. In order to provide structure and focus to a wide range of initiatives that may need to be undertaken to improve availability, an availability plan should be formulated and maintained. The availability plan should have aims, objectives and deliverables and should consider the wider issues of people, processes, tools and techniques as well as having a technology focus. In the initial stages it may be aligned with an implementation plan for availability management, but the two are different and should not be confused. As the availability management process matures, the plan should evolve to cover the following: • Actual levels of availability versus agreed levels of availability for key IT services. Availability measurements should always be business and customer-focused and report availability as experienced by the business and users. • Activities being progressed to address shortfalls in availability for existing IT services. Where investment decisions are required, options with associated costs and benefits should be included. • Details of changing availability requirements for existing IT services. The plan should document the options available to meet these changed requirements. Where investment decisions are required, the associated costs of each option should be included. 
• Details of the availability requirements for forthcoming new IT services. The plan should document the options available to meet these new requirements. Where investment decisions are required, the associated costs of each option should be included. • A forward-looking schedule for the planned SFA assignments. • Regular reviews of SFA assignments should be completed to ensure that the availability of technology is being proactively improved in conjunction with the SIP. • A technology futures section to provide an indication of the potential benefits and exploitation opportunities that exist for planned technology upgrades. Anticipated availability benefits should be detailed, where possible based on business-focused measures, in conjunction with capacity management. The effort required to realize these benefits where possible should also be quantified. During the production of the availability plan, it is recommended that liaising with all functional, technical and process areas is undertaken. The availability plan should cover a period of one to two years, with a more detailed view and information for the first six months. The plan should be reviewed regularly, with minor revisions every quarter and major revisions every half year. Where the technology is only subject to a low level of change, this may be extended as appropriate. It is recommended that the availability plan is considered complementary to the capacity plan and financial plan, and that publication is aligned with the capacity and business budgeting cycle. If a demand is foreseen for high levels of availability that cannot be met due to the constraints of the existing IT infrastructure or budget, then exception reports may be required for the attention of both senior IT and business management. In order to facilitate the production of the availability plan, availability management may wish to consider having its own database repository. 
The AMIS can be used to record and store the data and information required to support key activities such as report generation, statistical analysis, and availability forecasting and planning. It should be the main repository for IT availability metrics, measurements, targets and documents, including the availability plan, availability measurements, achievement reports, SFA assignment reports, design criteria, action plans and testing schedules. Let us proceed to look at the challenges and risks of availability management in the next slide.

3.18 Availability Management - Challenges and Risks

Availability management faces many challenges, but the main one is meeting and managing the expectations of customers, the business and senior management. These stakeholders frequently expect that all services will be available on a 24-hour, 365-day basis, not just during their agreed service hours, and that any service that fails will be recovered within minutes. That is only the case when the appropriate level of investment and design has been applied to the service, and such investment should only be made where the business impact justifies it. This message needs to be publicized to all customers and areas of the business, so that when services do fail they have the right expectations of recovery. It also means that availability management must have access to information of the right quality on the current business need for IT services and on the business's plans for the future, which is itself a challenge for many availability management processes.

Another challenge is consolidating all of the availability data into an integrated set of information (the AMIS) that can be analysed in a consistent manner to provide details on the availability of all services and components. This is particularly difficult because the information from different technologies is often provided by different tools in differing formats.

Yet another challenge is convincing the business and senior management of the investment needed in proactive availability measures. The need for investment is always recognized once failures have occurred, but by then it is too late. Persuading businesses and customers to invest in resilience against failures that may or may not happen is difficult.
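To make the data-integration challenge described above more concrete, here is a minimal Python sketch of normalizing outage data from two tools into one consistent record format. Both input formats, all field names (`host`, `start`, `end`, `device`, `down_secs`) and the function names are assumptions made up for illustration; real monitoring tools will differ.

```python
from datetime import datetime


def from_monitoring_tool(row: dict) -> dict:
    """Hypothetical tool A: one row per outage, with ISO 8601 start/end timestamps."""
    start = datetime.fromisoformat(row["start"])
    end = datetime.fromisoformat(row["end"])
    return {
        "component": row["host"].lower(),
        "downtime_minutes": (end - start).total_seconds() / 60,
    }


def from_network_tool(row: dict) -> dict:
    """Hypothetical tool B: one row per device, with downtime in seconds."""
    return {
        "component": row["device"].lower(),
        "downtime_minutes": row["down_secs"] / 60,
    }


def consolidate(records) -> dict:
    """Sum downtime per component across all normalized records."""
    totals = {}
    for rec in records:
        name = rec["component"]
        totals[name] = totals.get(name, 0.0) + rec["downtime_minutes"]
    return totals


if __name__ == "__main__":
    # Illustrative rows only.
    normalized = [
        from_monitoring_tool(
            {"host": "DB01", "start": "2024-01-01T10:00:00", "end": "2024-01-01T10:30:00"}
        ),
        from_network_tool({"device": "db01", "down_secs": 600}),
    ]
    print(consolidate(normalized))
```

The design choice here is to convert each source into a shared shape (component name plus downtime in minutes) as early as possible, so that every later analysis works on one consistent data set regardless of which tool produced the raw data.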
Availability management should work closely with ITSCM, information security management and capacity management in producing the justifications necessary to secure the appropriate investment. This investment is particularly important when considering the service and component backup and recovery tools, technology and processes required to meet the agreed needs.

Some of the major risks associated with availability management include:
• A lack of commitment from the business to the availability management process
• A lack of commitment from the business and a lack of appropriate information on future plans and strategies
• A lack of senior management commitment, or a lack of resources and/or budget for the availability management process
• Labour-intensive reporting processes
• Processes that focus too much on the technology and not enough on the services and the needs of the business
• An AMIS that is maintained in isolation and is not shared or consistent with other process areas, especially ITSCM, information security management and capacity management.

With this, we come to the end of learning unit 3. Let us recap in the next slide.

3.19 Service Design - Availability Management Summary

In this learning unit, we discussed the purpose, objectives, scope, value, concepts, activities, interfaces, challenges and risks, inputs and outputs, CSFs and KPIs, and information management of the availability management process. Let us now proceed to learn about IT service continuity management in the next learning unit. Before that, complete the quiz questions in the next section.
