white paper performance storms sept 2012

WHITE PAPER

Managing Performance Storms in the Cloud

by Jagan Jagannathan, Founder and CTO, Xangati

www.xangati.com

Managing Performance Storms in the Cloud © 2012 Xangati, Inc. All rights reserved.


As enterprises, service providers, healthcare organizations, government agencies and educational institutions migrate their data centers to virtual and cloud infrastructures, management solutions have failed to provide the information needed for this dynamic, volatile environment. Critical resources and applications are now shared and subject to spontaneous storms that affect the performance of applications and end users. This white paper explores the common storms affecting virtualization and cloud environments and the management capabilities required to capture and resolve them.

Defining Performance Storms in the Cloud

Performance storms are created by the unintended toxic interactions among cross-silo resources in the converged data center. A storm entangles multiple objects – VMs, hosts, end users, applications, etc. – even if they are unrelated. The entanglement often has a dramatically adverse effect on their performance. Some of the most common performance storms include:

• Storage storms – typically occur when applications unknowingly and excessively share a datastore, which causes storage performance to deteriorate, often dramatically and spontaneously.

• Memory storms – usually occur when multiple VMs try to share an insufficient amount of memory or, in other cases, when one VM is ‘hogging’ memory and not leaving enough for the others, even with ballooning in place.

• CPU storms – typically occur when there aren’t enough CPU cycles or virtual CPUs to go around as processing resources are shared, leaving some VMs with more and some with less.


• Network storms – usually occur when too many VMs are attempting to communicate at the same time on a specific interface, or when a few VMs are ‘hogging’ a specific interface with traffic, limiting the ability of other VMs to send or receive data.

Solutions Built for a Pre-Cloud World Can’t Deal With Performance Storms

With existing performance management solutions, cloud performance storms can take several hours to several days to identify and resolve, according to a recent IT survey we conducted with ZK Research. Why does it take that long? There are two important reasons. First, existing solutions have, at best, a fidelity of multiple minutes, which is fundamentally incompatible with performance storms that may start and finish within that interval. Second, existing solutions focus on silo-specific metrics that only help generate alerts. Unfortunately, alerts only identify the effects of storms; they leave the all-important and often daunting ‘cause analysis’ to administrators to figure out on their own.
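The fidelity problem can be illustrated with a minimal sketch in Python. The latency values, the 20-second storm and the 30 ms alert threshold are assumptions chosen for illustration; they are not figures from the survey or from any product.

```python
from statistics import mean

# Hypothetical per-second datastore latency (ms) over one 5-minute window:
# a steady 5 ms baseline with a 20-second storm at 100 ms.
BASELINE_MS = 5.0
STORM_MS = 100.0
WINDOW_S = 300
STORM_START_S, STORM_LEN_S = 120, 20

samples = [
    STORM_MS if STORM_START_S <= t < STORM_START_S + STORM_LEN_S else BASELINE_MS
    for t in range(WINDOW_S)
]

ALERT_THRESHOLD_MS = 30.0  # assumed threshold, for illustration only

five_minute_average = mean(samples)
seconds_over_threshold = sum(1 for s in samples if s > ALERT_THRESHOLD_MS)

print(f"5-minute average: {five_minute_average:.2f} ms")
print(f"seconds above threshold: {seconds_over_threshold}")
print(f"average crosses threshold: {five_minute_average > ALERT_THRESHOLD_MS}")
```

In this made-up window the storm pushes latency to 20 times the baseline for 20 seconds, yet the five-minute average stays near 11 ms and never crosses the threshold. A tool that only sees the average would report nothing unusual.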

Capturing the Causes of Performance Storms

Even in the best-run cloud infrastructures, performance storms are part of the new reality, and you must be able to accurately identify, track and resolve these disruptive and spontaneous occurrences in a timely and effective manner. To get to the cause of the problem, you need:

1. Insight into second-by-second interactions;
2. Visibility into both consumption and interactional object behaviors; and
3. Integration with capacity management.

#1 – Second-by-Second Insight Into Interactions

Because the cloud is constantly in flux, it is critical to be able to see interactions on a second-by-second basis in order to capture everything that is occurring within the environment. Equally important is tracking these second-by-second interactions at scale. Given the cloud’s nature, this live, continuous and highly scalable insight is essential to accurately identify the performance storm and can only be achieved through an in-memory architecture.

An in-memory architecture allows the system to track what is happening at a precise moment, rather than averaging data over a five- or ten-minute period. The architecture makes it possible to see the multitude of simultaneous, fine-grained interactions that are responsible for surging metrics. In effect, it provides the critical context and understanding needed to identify the patterns that characterize storms. How else do you find the source of a datastore latency storm unless you know which VMs are actually using that datastore at that exact moment in time?
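The datastore question above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming per-second I/O records are already held in memory; the `IoRecord` type, the VM and datastore names and the IOPS figures are all hypothetical and do not describe any particular product's data model.

```python
from collections import defaultdict
from typing import NamedTuple


class IoRecord(NamedTuple):
    second: int      # timestamp, in whole seconds
    vm: str          # hypothetical VM name
    datastore: str   # hypothetical datastore name
    iops: int        # I/O operations issued during that second


def vms_on_datastore(records, datastore, second):
    """Return (vm, iops) pairs active on `datastore` at `second`, busiest first."""
    totals = defaultdict(int)
    for r in records:
        if r.datastore == datastore and r.second == second:
            totals[r.vm] += r.iops
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


records = [
    IoRecord(100, "vm-web", "ds-1", 200),
    IoRecord(100, "vm-backup", "ds-1", 4500),
    IoRecord(100, "vm-db", "ds-2", 900),
    IoRecord(101, "vm-web", "ds-1", 210),
]

print(vms_on_datastore(records, "ds-1", 100))
# [('vm-backup', 4500), ('vm-web', 200)]
```

With per-second records, the query for second 100 points straight at the busiest VM on the datastore. With data averaged over several minutes, that one-second picture would no longer exist to query.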

#2 – Visibility into Both Consumption and Interactional Object Behaviors

To see what is causing a performance storm, you need visibility not only into how objects are consuming cloud resources but also, and much more critically for determining the cause of the problem, into how objects are interacting with others within the infrastructure. Consumptive silo-specific alerts (using a combination of system-learned and best-practice thresholds) point to the effects of performance storms, such as an impacted application or VM, while interactional cross-silo alerts give details that help accurately identify and resolve the source of the problem.

To deliver these interactional alerts, and to reveal the toxic interactions that may be occurring between different objects, you must have a cross-silo view of the infrastructure that cuts across network, server and storage, as well as applications and end users.

Furthermore, this view needs to scale so that you can easily view the distant and proximate areas of impact for a given storm, as well as the source of contention and the resources affected. Only by seeing the cross-silo interactions can you accurately identify the patterns of interactions that are causing the storm.
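One way to picture the difference between consumptive and interactional information is the following Python sketch. It assumes a hypothetical cross-silo inventory (`topology`) that records which host, datastore and network interface each VM uses; the names and the simple "shared by every impacted VM" rule are illustrative assumptions, not a description of how any product correlates alerts.

```python
from collections import Counter

# Hypothetical cross-silo inventory: which shared resources each VM touches.
topology = {
    "vm-a": {"host": "host-1", "datastore": "ds-1", "nic": "host-1/vmnic0"},
    "vm-b": {"host": "host-2", "datastore": "ds-1", "nic": "host-2/vmnic0"},
    "vm-c": {"host": "host-2", "datastore": "ds-2", "nic": "host-2/vmnic0"},
    "vm-d": {"host": "host-3", "datastore": "ds-1", "nic": "host-3/vmnic1"},
}


def shared_resources(impacted_vms, topology):
    """Return the (silo, resource) pairs used by every impacted VM."""
    counts = Counter(
        (silo, resource)
        for vm in impacted_vms
        for silo, resource in topology[vm].items()
    )
    return sorted(pair for pair, n in counts.items() if n == len(impacted_vms))


# Consumptive alerts fired for these VMs (the effects of the storm).
impacted = ["vm-a", "vm-b", "vm-d"]
print(shared_resources(impacted, topology))
# [('datastore', 'ds-1')]
```

The three consumptive alerts on their own say only that three VMs are slow. Looking across silos shows that the VMs sit on different hosts and interfaces but share one datastore, which is where the search for the cause would start.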


#3 – Integration with Capacity Management

The most common culprit for performance storms is under-provisioning of the cloud. Considering this, it seems logical that one would integrate performance and capacity management. Yet today’s virtualization management solutions do not, ignoring the intrinsic connection that exists between the two and dealing with capacity management as a completely separate and distinct entity.

Xangati uniquely believes that performance management must expressly inform capacity analytics; otherwise, you can’t identify the links between the occurrence and intensity of performance storms and capacity saturation alerts. This linkage leads to recommendations on how to solve the problems that cause storms, typically by either increasing resource capacity or by targeted resource load balancing.
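The choice between the two remedies can be sketched as a simple rule in Python. The `recommend` function, the host names and the 0.85 saturation threshold are hypothetical and chosen only for illustration; they do not represent Xangati's capacity analytics.

```python
def recommend(utilization_by_host, saturation=0.85):
    """Suggest a remedy from per-host utilization fractions (0.0 to 1.0)."""
    saturated = [h for h, u in utilization_by_host.items() if u >= saturation]
    if not saturated:
        return "no action"
    average = sum(utilization_by_host.values()) / len(utilization_by_host)
    if average >= saturation:
        # The pool as a whole is saturated, so moving load around cannot help.
        return "add capacity"
    # Some hosts are saturated while others still have headroom.
    return "rebalance load away from " + ", ".join(sorted(saturated))


print(recommend({"host-1": 0.95, "host-2": 0.40, "host-3": 0.35}))
# rebalance load away from host-1
print(recommend({"host-1": 0.95, "host-2": 0.90, "host-3": 0.88}))
# add capacity
```

In the first case one host is saturated while the pool has headroom, so targeted load balancing is the cheaper fix. In the second the whole pool is saturated, and only additional capacity would relieve it.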

To operate your cloud efficiently and effectively, you need the right tools to tackle the highly disruptive and hard-to-detect performance storms that are intrinsic to it. The three capabilities above allow you to succeed in this endeavor. To summarize, they are (1) the live and continuous insight that you need to instantly spot spontaneous and transient storms; (2) the cross-silo visibility into interactional metrics that you need to identify the causes of storms instead of just chasing their effects (that is, consumptive metric alerts); and (3) the linkage between performance and capacity management that you need to appropriately add or reallocate resources to avoid future storms.
