System Management Tutorial

6.1 System Management

Hello and welcome to the System Management module of the CompTIA Cloud Plus course offered by Simplilearn. In this module, we will discuss best practices for the management process in a cloud production environment. Let us look at the objectives of this module in the next slide.

6.2 Objectives

By the end of this module, you will be able to:

  • Discuss the policies and procedures for a cloud environment
  • Explain performance terminologies and concepts
  • Discuss tests to be conducted before deploying the cloud service

In the next slide, we will discuss change management.

6.3 Change Management

Planning for future growth and the ability to adjust compute resources on demand are the major benefits of a virtualized environment. There are parameters to be considered for network and IP planning. The situation changes whenever a change is made to the existing infrastructure. Change management is an integral part of system management. Despite this, change management is often ignored in cloud services, since most cloud services focus on rapid deployment. Change management must look into technology, strategic, and development changes and their impacts. In the next slide, we will discuss the steps of the change management process.

6.4 Change Management Process

To understand change management better, let us go through the steps involved. The first step is to plan the change. The change should be planned once the requirement for it has been determined. It should be planned in terms of schedule and necessary resources, such as testing environment, time, personnel, and budget. The next step is to test and validate the change. Testing starts with validation of the proposed change in a summary laboratory. The summary laboratory validation covers all the cases that need to be incorporated. The goal of this test is to assess the feasibility of the change and its cost in terms of effort and resources. In the next slide, we will continue with the steps of the change management process.

6.5 Request for Change

The next step is to document the request for change (RFC). All the procedures for preparation, installation, verification, and back-out must be documented in detail. Back-out refers to the process of restoring the environment to its previous state if the change fails. The impact and risk analysis of the change must also be recorded. For instance, the worst-case impact, in which both the change and the back-out procedure fail, must be analyzed and documented. Other information related to the change must also be included in the documentation, such as prerequisites for the change, proposed schedule, required resources, engineering and design documentation, and physical diagrams. The next step is to create the request for change. Information is recorded when the change request is initiated and as the RFC progresses through its lifecycle. Some information is recorded directly on the RFC form, while other change details may be recorded in documents referenced from the RFC, for instance engineering documents and impact assessment reports. To encourage compliance with the process, it is better to keep the RFC form simple when change management is first introduced. Some of the points mentioned in the RFC are the items to be changed, the reason for the change, and the back-out or remedial plan. In the next slide, we will continue discussing the steps of the change management process.
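As a rough illustration of the simple RFC form described above, the following Python sketch models a form with the three points named in the narration (items to be changed, reason, and back-out plan). The class name, field names, and the `missing_fields` helper are hypothetical and are not taken from any particular change management tool.

```python
from dataclasses import dataclass, field


@dataclass
class RequestForChange:
    """Hypothetical minimal RFC form; all field names are illustrative."""
    rfc_id: str
    items_to_change: list[str]
    reason: str
    back_out_plan: str
    # Detailed material lives in other documents referenced from the RFC.
    linked_documents: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of mandatory fields that are still empty."""
        required = {
            "items_to_change": self.items_to_change,
            "reason": self.reason,
            "back_out_plan": self.back_out_plan,
        }
        return [name for name, value in required.items() if not value]


rfc = RequestForChange(
    rfc_id="RFC-0001",
    items_to_change=["web-server-01"],
    reason="Apply security patch",
    back_out_plan="",  # not yet written, so the form is incomplete
)
print(rfc.missing_fields())  # ['back_out_plan']
```

Keeping the mandatory fields this few reflects the advice to start with a simple form; engineering documents and impact reports are only referenced, not embedded.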

6.6 Review Process

Let us move on to the next step, which is performing a technical review and sign-off. Review of the technical content of the change is not part of the change management assessment of the RFC. Consequently, each RFC must undergo a technical review before it is submitted for CAB review. The CAB, or Change Advisory Board, is a team that supports the change management team by approving requested changes and checking whether the change can be incorporated successfully. The technical review checks the correctness of the following: the potential impact and side effects on other infrastructure and services; the worst-case impact (both the change and the back-out procedure fail); change completion, documentation, and test procedures; and preparation, implementation, verification, and back-out procedures. The review is performed by technical personnel who are well aware of the technical resources and current trends, such as architects and service managers. The authorization that follows the technical review should be a formal sign-off, recorded in the change log. The next step is to review the RFC. Here the reviewer checks the RFC to determine whether it is impractical or a repeat of a previous RFC. If either is true, the RFC is rejected; otherwise, it is approved for the next level. Next, we will discuss assessing and authorizing the RFC.

6.7 Assessing and Authorizing RFC

The next step is to assess and evaluate the RFC. The RFC is assessed and evaluated with respect to points such as risk, effects on current operations, and customer impact. The next step is to authorize the RFC. Every change requires formal authorization from a change authority. The authorizer may be a role, a person, or a group of people, depending on the size of the organization and the volume of changes. The level of authorization required depends on the type, size, or risk of the change. In the following slide, we will focus on planning and implementing changes.
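To make the idea of risk-based authorization levels concrete, here is a minimal Python sketch. The risk labels, the threshold of five affected services, and the authority names are all assumptions chosen for illustration; a real organization would define its own.

```python
def required_authority(risk: str, affected_services: int) -> str:
    """Map a change's risk and size to a hypothetical authorization level."""
    # Large or high-risk changes go to the full board (assumed policy).
    if risk == "high" or affected_services > 5:
        return "change advisory board"
    if risk == "medium":
        return "change manager"
    # Small, low-risk changes can be signed off locally (assumed policy).
    return "team lead"


print(required_authority("low", 1))     # team lead
print(required_authority("medium", 2))  # change manager
print(required_authority("low", 12))    # change advisory board
```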

6.8 Planning and Implementing Changes

The next step is to plan the updates. Multiple changes can be designed, tested, and released together. This is possible only when the business, the service provider, and its customers can handle the changes involved and there is no risk of interference between the changes. Here, the team keeps track of all the areas affected, whether business or technical, and plans the updates accordingly. The next step is to implement the change. Authorized RFCs should be passed to the relevant technical groups, which build the changes. Best practice is to use a formal work order process or system to track the changes. It is the responsibility of change management to ensure that the changes are implemented as scheduled. In the following slide, we will focus on the implementation process.

6.9 Implementation Process

If the implemented change is successful, the next step is to perform a post-implementation review. A change is always made in order to improve something. However, if the implemented change fails, the back-out operation must be performed. In the back-out operation, everything done during the change is rolled back to return the environment to its original state, that is, the state before the change was implemented. In the post-implementation review, once the change is complete, the results should be reported to those responsible for managing changes for further evaluation. For stakeholder approval, it should be presented as a completed change. The last step is to close the change: the change is closed and documented in the configuration database. Every step of the change process and every status change of the RFC is documented in the configuration database. In the next slide, we will continue to understand the policies and procedures for a cloud environment.
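The change management steps covered so far can be pictured as a small state machine. The sketch below is a simplified illustration: it condenses several of the narrated steps into nine states, and it assumes that a backed-out change moves straight to closure. The state names and the `advance` helper are hypothetical.

```python
from enum import Enum


class RfcState(Enum):
    PLANNED = "planned"
    TESTED = "tested"
    DOCUMENTED = "documented"
    TECHNICALLY_REVIEWED = "technically reviewed"
    AUTHORIZED = "authorized"
    IMPLEMENTED = "implemented"
    BACKED_OUT = "backed out"
    REVIEWED = "post-implementation reviewed"
    CLOSED = "closed"


# Allowed transitions; a failed implementation goes to BACKED_OUT.
TRANSITIONS = {
    RfcState.PLANNED: {RfcState.TESTED},
    RfcState.TESTED: {RfcState.DOCUMENTED},
    RfcState.DOCUMENTED: {RfcState.TECHNICALLY_REVIEWED},
    RfcState.TECHNICALLY_REVIEWED: {RfcState.AUTHORIZED},
    RfcState.AUTHORIZED: {RfcState.IMPLEMENTED},
    RfcState.IMPLEMENTED: {RfcState.REVIEWED, RfcState.BACKED_OUT},
    RfcState.BACKED_OUT: {RfcState.CLOSED},  # assumption: close after back-out
    RfcState.REVIEWED: {RfcState.CLOSED},
    RfcState.CLOSED: set(),
}


def advance(history: list[RfcState], new_state: RfcState) -> None:
    """Append new_state to the RFC history if the transition is allowed."""
    current = history[-1]
    if new_state not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {new_state.value}")
    history.append(new_state)


# Walk a failed change through back-out to closure.
history = [RfcState.PLANNED]
for state in (RfcState.TESTED, RfcState.DOCUMENTED,
              RfcState.TECHNICALLY_REVIEWED, RfcState.AUTHORIZED,
              RfcState.IMPLEMENTED, RfcState.BACKED_OUT, RfcState.CLOSED):
    advance(history, state)
print([s.value for s in history])
```

The `history` list plays the role of the status record that, per the narration, is kept for every step and status change of the RFC.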

6.10 Configuration Management

Configuration management focuses on establishing and maintaining consistency of performance over a lifecycle. In cloud computing, this refers to implementing a governance framework and addressing regulatory and legal concerns. The aim is to standardize and document configurations so that future setups adhere to the documented guidelines. Cloud computing presents the availability of resources as unlimited. However, to comply with privacy laws, it is essential to know exactly where the data resides, because some laws state that critical information, such as financial information, must not leave the country's borders. Configuration management deals with ways to maintain consistency on these types of issues. To keep governance policy intact, many industries rely on SAS 70. SAS 70 is an auditing standard that enables an independent auditor to evaluate and issue an opinion on a service organization's controls. The audit report (the service auditor's report) contains a description of the controls placed in operation, the auditor's opinion, and a description of the auditor's tests of operating effectiveness. In the following slide, we will discuss the configuration management system in detail.

6.11 Configuration Management System

Configuration management and change management work together, which means that the performance of configuration management depends on change management. At the very start of the process implementation, configuration management is responsible for defining and documenting which assets of the IT environment should be managed as configuration items (CIs). For each CI, it must be possible to identify the instance of that CI in the environment. A CI should have a consistent naming convention and a unique identifier that distinguishes it from other CIs. Changes to a CI should be controlled through a change management process. All the attributes of a CI should be recorded in a configuration management database (CMDB). A CMDB is the authority for tracking all attributes of a CI. An environment may have multiple CMDBs maintained under disparate authorities, and all CMDBs should be tied together as part of a larger configuration management system (CMS). One of the key attributes that every CI must contain is ownership. By defining an owner for each CI, organizations achieve asset accountability. This accountability imposes responsibility for keeping all attributes current, inventorying, financial reporting, safeguarding, and other controls necessary for optimal maintenance, use, and disposal of the CI. The defined owner of each asset should be a key stakeholder in any CAB that deals with a change affecting the configuration of that CI, which gives the owner configuration control. The attributes, statuses, and relationships of any or all CIs should be audited periodically and should be verifiable at any requested time. In this way, the approval process and configuration control are achieved. In the next slide, we will discuss capacity management.
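The CI rules above (unique identifier, mandatory owner, attributes recorded in a CMDB, periodic audit) can be sketched in a few lines of Python. This is a toy in-memory model, not a real CMDB product; the class names, the `audit` method, and the sample identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigurationItem:
    ci_id: str   # unique identifier
    name: str    # follows a consistent naming convention
    owner: str   # accountable owner
    attributes: dict[str, str] = field(default_factory=dict)


class Cmdb:
    """Toy in-memory CMDB keyed by CI identifier."""

    def __init__(self) -> None:
        self._items: dict[str, ConfigurationItem] = {}

    def register(self, ci: ConfigurationItem) -> None:
        """Record a CI, enforcing unique identifiers and ownership."""
        if ci.ci_id in self._items:
            raise ValueError(f"duplicate CI identifier: {ci.ci_id}")
        if not ci.owner:
            raise ValueError("every CI needs an owner")
        self._items[ci.ci_id] = ci

    def audit(self, observed: dict[str, dict[str, str]]) -> list[str]:
        """Return IDs of CIs whose recorded attributes differ from observed ones."""
        return [
            ci_id for ci_id, ci in self._items.items()
            if observed.get(ci_id) != ci.attributes
        ]


cmdb = Cmdb()
cmdb.register(ConfigurationItem("CI-001", "srv-web-01", "alice", {"os": "linux"}))
cmdb.register(ConfigurationItem("CI-002", "srv-db-01", "bob", {"os": "linux"}))
drift = cmdb.audit({"CI-001": {"os": "linux"}, "CI-002": {"os": "windows"}})
print(drift)  # ['CI-002']
```

A CI reported by the audit would normally be corrected through an RFC rather than edited directly, which is where configuration and change management meet.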

6.12 Capacity Management

Capacity management is a process used to examine the kinds of systems that are in place, to measure their performance, and to determine usage patterns that enable the organization's capacity planner to predict demand. Capacity management ensures that information technology processing and storage capacity are adequate to the evolving requirements of the organization as a whole, in a timely and cost-justifiable manner.

6.13 Benefits of Capacity Management

In this slide, let us look at the benefits of an effective and efficient capacity management process.

The capacity management process ensures that IT resources are planned and scheduled to match the current and future needs of the business. Within this process, the capacity plan outlines the IT resources and funding (with cost justification) required to support the business. The capacity planner prepares the capacity plan, which contains information about the required resources and the strategy for utilizing them.

The capacity management process also reduces capacity-related incidents by pre-empting performance issues, and it implements corrective actions for capacity-related events. Further, it provides methods to tune and optimize the performance of IT services and configuration items. Configuration items are the components required for the production environment, such as software, documents, servers, and data-center records.

In addition, the capacity management process provides a structure for planning upgrades and enhancements. It helps in estimating future requirements through trend analysis of current configuration item utilization and by modeling changes in IT services. It ensures that upgrades are planned, budgeted, and implemented before the service level agreements, or SLAs, are breached in terms of availability or performance. It also provides financial benefits by avoiding 'panic' buying, in which an organization incurs unnecessary expense on hurried purchases of hardware or software.

Once the baselines have been established, documented, and contractually agreed upon, the goal of service operations is to do whatever is needed to maintain those baseline states. This maintenance requires a proper tool set as well as procedures to regularly and consistently monitor and measure the baseline and to understand the pattern of varying measurements over time, known as trending.

Additionally, the capacity management process identifies capacity requirements on the basis of business plans, business requirements, SLAs, MOUs (memoranda of understanding), and risk assessments. These requirements are then consulted in the development and negotiation of SLAs and MOUs. In the next slide, we will discuss the life cycle management process.
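Trend analysis of utilization, as mentioned above, can be as simple as fitting a straight line to past measurements and extrapolating. The sketch below uses an ordinary least-squares fit; the function names and the sample utilization figures are hypothetical, and real capacity planning would usually account for seasonality and growth that is not linear.

```python
from typing import Optional


def linear_trend(values: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values: list[float], limit: float) -> Optional[float]:
    """Periods after the last sample until the trend reaches limit, or None."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None  # utilization is flat or falling; no crossing predicted
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical monthly storage utilization in percent.
utilization = [50.0, 55.0, 60.0, 65.0]
print(periods_until(utilization, 80.0))  # 3.0
```

With these sample figures the trend grows five points per month, so an 80 percent limit is reached three months after the last measurement, which is the lead time available to plan and budget an upgrade.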

6.14 Life Cycle Management Process

Life cycle management is the process an organization uses to assist in the management, coordination, control, delivery, and support of its configuration items from requirement to retirement. In order to build supportable technical solutions that consistently deliver their intended value, documentation must be maintained at every step of the life cycle. Documentation of the business requirements for any proposed IT service additions or changes should be the first step in the life cycle, followed by documentation of the proposed technical design, continuing into implementation planning documents and support documentation, and coming full circle through documented service improvement plans. In the next slide, we will discuss the phases of the life cycle in detail.

6.15 Phases of Life Cycle

ITIL proposes five phases for the life cycle of service management, namely service strategy, service design, service transition, service operation, and continual service improvement. Each phase has inputs and outputs attached. Continual improvements are assessed through documented changes and implementations based on feedback from each of the life cycle phases. These improvements enable the organization to execute each of its service offerings as efficiently and effectively as possible, and ensure that each of those services provides value to the users. Maintenance windows in an IT environment should be scheduled at periods of least potential disruption to the customer, and the customer should be involved in the maintenance scheduling process. The customer knows their patterns of business activity better than the system administrators do. All technology upgrades and patches should use these maintenance windows whenever possible, and the timing of their implementation should always be reviewed by the CAB as part of the standard change management process. In the following slide, we will discuss performance terminologies and concepts.

6.16 Performance Terminologies and Concepts

We will begin with input/output operations per second (IOPS), a measurement of the performance of storage devices. It indicates how many read and write operations a storage device can complete each second. The speed that a client system actually sees also depends on the bandwidth of the network connecting it to the storage device.

Next is metadata performance, which describes how quickly files and directories can be created and removed and their statuses checked, along with other metadata functions. The increasing number of directories and files is making this aspect of storage performance more important. There are applications that create deep and wide directory structures, as well as applications that produce millions of files in a single directory. The number of files and directories also increases as the number of cores increases, putting pressure on the metadata performance of storage solutions.

Next is file system performance, which is analyzed to help an administrator decide which file system to use in which situation. Consider a scenario where virtual machine files have to be stored. Because these files are large and are accessed over the network following SOA principles, the VMFS or NFS file system is preferred. To recall, we discussed the VMFS and NFS file systems in Module 3, Infrastructure.

The next performance concept is caching. This is a strategy of keeping a copy of frequently used processes or data close at hand, which improves performance. Consider a scenario where the user has a total of 10 virtual machines (VM1, VM2, VM3, and so on up to VM10) registered with a service provider, but frequently invokes only VM1. In that case, the responding server keeps VM1's information in its cache and stores the virtual machine in a location near where the user normally logs in. This improves the performance of the service from the user's point of view. We will look into other performance terminologies and concepts in the next slide.
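The caching scenario above can be illustrated with a small least-recently-used (LRU) cache. LRU is only one possible eviction policy and is chosen here as an assumption; the narration does not name a specific one. The class name, the `load` callback, and the VM names in the usage lines are illustrative.

```python
from collections import OrderedDict


class LruCache:
    """Keep the most recently used entries, up to a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        """Return the cached value for key, calling load(key) on a miss."""
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        self.misses += 1
        value = load(key)  # slow path, for example fetching from remote storage
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value


cache = LruCache(capacity=2)
for vm in ["VM1", "VM2", "VM1", "VM3", "VM1"]:
    cache.get(vm, load=lambda name: f"image of {name}")
print(cache.hits, cache.misses)  # 2 3
```

Because VM1 is requested repeatedly it stays in the cache, while the rarely used VM2 is the entry that gets evicted when VM3 arrives.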

6.17 Performance Terminologies and Concepts (contd.)

The next performance concept is load balancing. This is a technology used to distribute service requests across resources. Load balancing is an optimization technique that can increase throughput, lower latency, and reduce response time. Next is throughput, which is the number of transactions per second that an application can handle. Load testing helps us identify the throughput of a product or service. The next performance concept is latency, which is the delay between a request being sent to a remote computer and that computer starting to process it. For instance, if user A invokes a web service or accesses a web page, there is a delay while the request travels to the server, apart from the processing time required. This delay is referred to as latency, and it is the reason many organizations prefer services that have low latency. Next is response time. It is sometimes measured at the server's end as the time a system takes to process a request after receiving it. From the user's point of view, response time runs from the moment a request is sent to the moment a response is received, so it includes latency as well as processing time. Let us consider a scenario where a user has an API (application programming interface) and wants to find out how long the API takes to execute once it is invoked. In this scenario, the user is in fact measuring the response time. In the next slide, we will continue with the discussion on performance terminologies and concepts.
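Response time and throughput can both be measured with a simple timing loop. The sketch below is a minimal illustration, not a load testing tool: `fake_api` is a hypothetical stand-in for a real API call, and the calls are made one after another rather than concurrently.

```python
import time


def measure(call, requests: int) -> tuple[float, float]:
    """Return (average response time in seconds, throughput in calls per second)."""
    durations = []
    start = time.perf_counter()
    for _ in range(requests):
        t0 = time.perf_counter()
        call()
        durations.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return sum(durations) / requests, requests / elapsed


def fake_api() -> None:
    """Hypothetical API call; the sleep stands in for latency plus processing."""
    time.sleep(0.01)


avg_response, throughput = measure(fake_api, requests=20)
print(f"avg response {avg_response * 1000:.1f} ms, throughput {throughput:.0f}/s")
```

With sequential calls, throughput is roughly the inverse of the average response time; a real load test would issue requests in parallel to find the point where throughput stops rising.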

6.18 Performance Terminologies and Concepts (contd.)

The next performance concept is scalability. It measures how the system responds when additional hardware is added. The challenge is to ensure that the design has no server affinity, so that the load balancer can distribute the load across the servers. Scalability can be measured with load balancing tools. The performance counters can be monitored to check whether the actual request load is balanced or shared across servers. Scalability also matters when we consider reading and writing files. Reading files in distributed mode is easy, but writing to files in distributed mode is challenging. Next is hop count. This refers to the number of intermediate devices between the sender system, which initiated the request, and the receiving server, which processes the request and provides the appropriate response. A higher hop count leads to higher latency. Next is bandwidth. Bandwidth is the measurement of the available or consumed data communication resources on a network. The performance of all networks depends on the available bandwidth. Next are jumbo frames. Jumbo frames are Ethernet frames with more than 1500 bytes of payload. These frames can carry up to 9000 bytes of payload, although there may be some deviation depending on the vendor and the environment in which they are deployed. In the next slide, we will continue focusing on performance terminologies and concepts.
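A quick calculation shows why jumbo frames matter for large transfers: fewer frames are needed to move the same amount of data. The sketch below uses the 1500-byte and 9000-byte payload sizes from the narration; the transfer size is a hypothetical example, and protocol headers carried inside the payload are ignored.

```python
import math


def frames_needed(data_bytes: int, payload_bytes: int) -> int:
    """Number of Ethernet frames needed to carry data_bytes of payload."""
    return math.ceil(data_bytes / payload_bytes)


transfer = 9_000_000  # hypothetical transfer of 9,000,000 bytes
standard = frames_needed(transfer, 1500)
jumbo = frames_needed(transfer, 9000)
print(standard, jumbo)  # 6000 1000
```

Each frame carries fixed per-frame overhead and costs the endpoints some processing, so sending one sixth as many frames reduces both.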

6.19 Performance Terminologies and Concepts (contd.)

Next is QoS, or Quality of Service. QoS is a set of technologies that can identify the type of data in data packets and divide those packets into specific traffic classes that can be prioritized according to defined service levels. QoS technologies enable administrators to measure network bandwidth, detect changing network conditions, and prioritize network traffic accordingly, so that the service requirements of a workload or an application are met. Multipathing is the practice of defining and controlling redundant physical paths to I/O devices, so that when an active path to a device becomes unavailable, the multipathing configuration can automatically switch to an alternate path in order to maintain service availability. Scaling is the ability of a system or network to manage a growing workload in a proficient manner. All cloud environments should be scalable, as one of the chief tenets of cloud computing is elasticity, or the ability to adapt quickly to a growing workload. To scale vertically means to add resources to a single node, thereby making that node capable of handling more load by itself. To scale horizontally, more nodes are added to a configuration instead of increasing the resources of any one node. Diagonal scaling increases resources for individual nodes and also adds more nodes to the system, achieving the best configuration for a quickly growing, elastic solution. So far, we have learnt about various performance terminologies and concepts. In the next slide, we will look at optimizing physical host performance.
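The three scaling directions can be compared with a toy model in which a cluster is a list of per-node capacities. The function names and the capacity numbers are hypothetical; the point is only to show which part of the configuration each approach changes.

```python
def scale_vertically(nodes: list[int], index: int, extra: int) -> list[int]:
    """Add capacity to one existing node."""
    scaled = list(nodes)
    scaled[index] += extra
    return scaled


def scale_horizontally(nodes: list[int], new_nodes: int, size: int) -> list[int]:
    """Add new nodes of the given size without touching existing ones."""
    return list(nodes) + [size] * new_nodes


def scale_diagonally(nodes: list[int], index: int, extra: int,
                     new_nodes: int, size: int) -> list[int]:
    """Grow one node and add nodes in the same step."""
    return scale_horizontally(scale_vertically(nodes, index, extra), new_nodes, size)


cluster = [8, 8]  # hypothetical capacity units (for example, GB of RAM) per node
print(scale_vertically(cluster, 0, 8))        # [16, 8]
print(scale_horizontally(cluster, 1, 8))      # [8, 8, 8]
print(scale_diagonally(cluster, 0, 8, 1, 8))  # [16, 8, 8]
```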

6.20 Optimizing Physical Host Performance

There are a number of best practices for configuring each of the compute resources within a cloud environment. A baseline and documentation have to be created using appropriate tools to ensure that no errors are introduced during the optimization process.

Hypervisors provide device drivers that are installed within the guest operating system as part of the hypervisor's guest tools. A balloon driver is one part of these installed tools and can be observed inside the guest. The balloon driver communicates with the hypervisor to reclaim memory inside the guest when that memory is no longer valuable to the guest operating system. If the host runs low on memory, it grows the balloon driver to reclaim memory from the guest.

Disk performance can be tuned through several configuration options. Media type can affect performance, and administrators can choose between traditional rotational media and chip-based solid state drives. Disk tuning is the activity of analyzing what type of I/O traffic is taking place across the defined disk resources and moving it to the most appropriate set of resources. Virtualization management platforms enable storage to be moved to other disk resources within their control without interrupting current operations. Disk latency is a counter that gives administrators the best indication of when a resource is degraded by a disk bottleneck and needs action taken. Swap space is the disk space allocated to service memory requests when the physical memory capacity limit has been reached.

While designing systems, administrators need to analyze input and output (I/O) needs from the top down, determining which resources are needed to achieve the required performance levels. This is called I/O tuning. I/O throttling does not eliminate disk I/O as a bottleneck for performance, but it can alleviate performance problems for specific virtual machines based on a priority assigned by the administrator. I/O throttling defines limits on the disk resources assigned to virtual machines, to ensure that they are not constrained in performance or availability when the environment has more demand for disk resources than is available.

CPU time is the amount of time a process or thread spends executing on a processor core. For multiple threads, the CPU time of the threads is additive. While high CPU wait time can be alleviated in some situations by adding processors, these additions sometimes hurt performance as well. Let us discuss the occurrence of failures in a cloud environment in the next slide.
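One common way to enforce a per-VM I/O limit of the kind described above is a token bucket. The narration does not say how any particular hypervisor implements throttling, so treat the following as a generic sketch: the `IopsLimiter` class, the VM names, and the limits are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class IopsLimiter:
    """Token bucket: a VM may issue at most `limit` I/O operations per second."""

    def __init__(self, limit: float) -> None:
        self.limit = limit
        self.tokens = limit  # start with a full bucket
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Return True if an I/O issued at time `now` (seconds) may proceed."""
        # Refill in proportion to elapsed time, capped at the limit.
        self.tokens = min(self.limit, self.tokens + (now - self.last) * self.limit)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical priorities: the database VM gets a higher limit than the batch VM.
limits = {"vm-db": IopsLimiter(100), "vm-batch": IopsLimiter(10)}
burst = [limits["vm-batch"].allow(now=0.0) for _ in range(15)]
print(sum(burst))  # 10
```

Of the 15 operations the batch VM issues at the same instant, only 10 proceed; the rest would have to wait, which leaves disk capacity for the higher-priority VM.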

6.21 Occurrence of Failures in Cloud Environment

The impact of configuration changes on the virtual environment depends on the type of hypervisor used for performing the processing operations. A number of failures can occur within a cloud environment, and the system must be configured to tolerate those failures and provide availability in line with the organization's SLA. Disk failures can happen for a variety of reasons, and disks fail more frequently than the other compute resources because they are the only compute resources that have mechanical components. HBA failures, while not as common as physical disk failures, are to be expected, and storage solutions need to be designed with them in mind. HBAs have the option of being multipathed, which prevents the loss of availability in the event of a failure. Network interface cards, or NICs, can fail in a similar fashion to other printed circuit board components such as motherboards, controller cards, and memory chips. Memory failures, while not as common as disk failures, can be just as disruptive. Good system design in cloud environments takes RAM failure into account as a risk and ensures that there is always some RAM available to run mission-critical systems in case of memory failure on one of the hosts. CPUs, or processors, fail for one of three main reasons: they are broken during installation, they are damaged by voltage spikes, or they are damaged by overheating caused by failed or ineffective fans. A damaged processor either takes the host completely offline or degrades its performance, depending on the damage and, in some models, the availability of a standby or alternative processor. We will discuss a scenario on optimizing physical host performance in the next slide.

6.22 Scenario on Optimizing Physical Host Performance

The administrator observes a large number of page writes to disk on the physical server. Which of the following would you recommend to decrease the dependency on secondary memory?

  • Increase the size of physical memory
  • Increase the internet speed
  • Switch off the server and start it after an hour

Let us see if you have got the right answer.

6.23 Scenario on Optimizing Physical Host Performance (contd.)

This situation is caused by a shortage of physical memory or RAM. Therefore, the best possible solution is to increase the size of the physical memory. In the next slide, we will focus on things to test before deploying cloud service.

6.24 Things to Test before Deploying Cloud Service

There are different types of testing that need to be done before deploying cloud services. Let us discuss the types one by one.

Availability Testing: This is done to check whether the cloud services are available to the customer anytime and anywhere. This testing helps an organization assure its customers that all the points covered in its service level agreements will be delivered. The report of this testing is maintained by the pre-sales team to make customers aware of the uptime of the services.

Security Testing: This ensures that there is no provision for unauthorized access to the data, whether with respect to data partitioning or to the encryption methods. In addition, this test assures the organization that the data uploaded by users will be highly secure, because the service uses public key cryptography or another preferred, company-approved security mechanism.

Performance Testing: This is done to check the flexibility and scalability of the cloud services to be offered. Flexibility refers to the working of the customer self-provisioning feature within the control panel. Self-provisioning is a feature of the cloud service that gives the customer full authority to provision or de-provision hardware resources instantly, with minimal management effort or service provider interaction. Self-provisioning should cause no downtime when the customer provisions or de-provisions resources. Scalability refers to hardware scale-up, scale-down, or scheduled maintenance at the service provider's end without introducing downtime for customers. Application performance testing is used to test an application's performance and verify that the application is able to meet the organization's service level agreements. After an application or application server is moved to the cloud, it still needs to be tested at regular intervals. Service performance testing is used to test a service's performance to ensure its availability, quick response, and turnaround time.

Interoperability Testing: This is performed to check the cross-compatibility of the cloud services across multiple platforms. The customer may use a Windows, Linux, Android, or other platform. Irrespective of the platform used by the customer, this testing assures the service provider that the services offered will work on the customer's system without any hassles. In the next slide, we will continue to learn about the things to be tested before deploying the cloud service.
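The result of availability testing is usually compared against the uptime figure promised in the SLA. The sketch below shows that arithmetic; the function names, the 30 minutes of downtime, and the SLA targets are hypothetical examples.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_sla(total_minutes: float, downtime_minutes: float,
              sla_percent: float) -> bool:
    """True if measured availability is at least the SLA target."""
    return availability_percent(total_minutes, downtime_minutes) >= sla_percent


month = 30 * 24 * 60  # 43,200 minutes in a 30-day month
print(round(availability_percent(month, 30), 3))  # 99.931
print(meets_sla(month, 30, 99.9))                 # True
print(meets_sla(month, 30, 99.95))                # False
```

The same half hour of downtime satisfies a 99.9 percent target but breaches a 99.95 percent one, which is why the measured figure matters when the SLA is drafted.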

6.25 Things to Test before Deploying Cloud Service (contd.)

Disaster Recovery Testing: This is done to check whether the cloud services remain available to the customer anytime and anywhere, even when a system failure occurs at the service provider's end. It mainly concerns the recovery mechanism available at the service provider's end in case of an incident, and the kinds of incidents that can and cannot be recovered from. These results help a service provider write a good, practical SLA and then make commitments to the customer based on it.

Multi-tenancy Testing: This is done to ensure that multiple clients are thoroughly validated and are always logically isolated from each other. This type of testing checks whether the multi-tenancy feature has been achieved successfully at the service provider's end. It also helps the service provider understand the number of users that can be accommodated in its environment.

Apart from these, we need to test replication, latency, bandwidth, load balancing, application servers, storage, and application delivery. In the next slide, we will continue to learn about the things to be tested before deploying the cloud service.
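A basic multi-tenancy check writes a marker for each tenant and then verifies that no tenant can read another tenant's marker. The sketch below runs that check against a toy in-memory store; `TenantStore`, `isolation_holds`, and the tenant names are hypothetical, and a real test would exercise the actual service through its API.

```python
class TenantStore:
    """Toy data store that scopes every read to the caller's tenant."""

    def __init__(self) -> None:
        self._rows = []  # list of (tenant_id, value) pairs

    def put(self, tenant_id: str, value: str) -> None:
        self._rows.append((tenant_id, value))

    def list_for(self, tenant_id: str) -> list:
        """Return only the values that belong to tenant_id."""
        return [value for owner, value in self._rows if owner == tenant_id]


def isolation_holds(store, tenants: list) -> bool:
    """Write a marker per tenant, then check no tenant sees another's marker."""
    for tenant in tenants:
        store.put(tenant, f"marker-{tenant}")
    return all(
        f"marker-{other}" not in store.list_for(tenant)
        for tenant in tenants
        for other in tenants
        if other != tenant
    )


print(isolation_holds(TenantStore(), ["tenant-a", "tenant-b"]))  # True
```

If the store's read path forgot the tenant filter, the same check would return False, which is the failure this type of testing is meant to catch.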

6.26 Things to Test before Deploying Cloud Service (contd.)

Penetration testing is the process of evaluating network security with a simulated attack on the network from both external and internal attackers. A penetration test involves an active analysis of the network by a testing firm that looks for potential vulnerabilities due to hardware and software flaws, improperly configured systems, or a combination of different factors. The test is performed by a person who acts like a potential attacker, and it involves the exploitation of specific security vulnerabilities. Once the test is complete, any issues that have been identified by the test are presented to the organization. A vulnerability assessment is the process used to identify and quantify the vulnerabilities in a network environment. It is a detailed evaluation of the network, indicating any weaknesses and providing appropriate mitigation procedures to help eliminate or reduce the level of the security risk. Separation of duties is the process of segregating specific duties and dividing the tasks and privileges required for a specific security process among multiple administrators. Let us move on to the quiz questions to check your understanding of the topics covered in this module.

6.28 Summary

Here is a quick recap of what was covered in the module:

  • Change management must look into technology, strategic, and development changes and their impacts.
  • Configuration management establishes and maintains consistency of performance over a lifecycle.
  • Capacity management ensures that information technology processing and storage capacity are adequate to the evolving requirements of the organization as a whole, in a timely and cost-justifiable manner.
  • The various tests to be performed in the cloud are availability testing, security testing, performance testing, interoperability testing, disaster recovery testing, and multi-tenancy testing.

6.29 Thank You

In the next module, we will study business continuity in the cloud.
