A Good Week for Public Cloud Users

It’s only Wednesday and two of the three big public cloud players have announced significant price drops in core services (computing, storage, etc.) along with some functionality improvements:

Google Cloud Platform Live – Blending IaaS and PaaS, Moore’s Law for the cloud

Today, at Google Cloud Platform Live we’re introducing the next set of improvements to Cloud Platform: lower and simpler pricing, cloud-based DevOps tooling, Managed Virtual Machines (VM) for App Engine, real-time Big Data analytics with Google BigQuery, and more.

AWS Price Reduction #42 – EC2, S3, RDS, ElastiCache, and Elastic MapReduce

Effective April 1, 2014 we are reducing prices for Amazon EC2, Amazon S3, the Amazon Relational Database Service, and Elastic MapReduce.

It would not surprise me if we heard some announcements from Microsoft Azure soon, given that the big three tend to track each other rather well when it comes to the apples-to-apples elements of their service menus. Besides the financial savings, Google’s new offerings – while still in limited beta – increase parity between these players’ service offerings. For example, Managed VMs are now available in some form across all three companies (which give a bit more flexibility than pure PaaS options). All now offer Windows-based VMs too.


The Panacea for ‘We Want No Downtime’ – A Brief Conceptual Framework for High Availability Planning

In Greek mythology, Panacea was a goddess of universal remedy. Panacea was said to have a potion with which she healed the sick. This brought about the concept of the panacea in medicine, a substance meant to cure all diseases[1]. The term is also used figuratively as something intended to completely solve a large, multi-faceted problem[2]. Unfortunately – when it comes to business computing applications – when someone says “We want no downtime” there is simply no panacea[3].

About the only thing you can be sure of when doing high availability planning is that there are a lot of tools to consider, a lot of decisions to make, and a lot of work to do[4]. This is why a good conceptual framework is important: it helps make sure the right things are considered and the appropriate decisions are made.

In this post I’ve attempted to outline the framework I use. The aim is to help – in any given situation and over the life of a business application and the business itself – figure out which approaches make sense and what path to take to get there. We’ll get to that framework in a moment, but let’s make sure we’re clear on one other important thing first.

There’s being proactive and then there’s luck …and then there’s being smart.

The “no downtime” request is perhaps somewhat akin to a patient telling a doctor “I want to be healthy” (which, I suppose, is typically driven by the desire to minimize downtime of a different sort). You can’t literally guarantee health any more than you can literally design for zero downtime. You can only control the inputs, manage which knobs you turn to minimize the likelihood of being unhealthy (or having an outage), and plan so that you are prepared for (or have options – or at least are willing to accept) the inevitable things you can’t prevent with 100% certainty. And you have to make some decisions along the way as to how much you’re willing to invest – time, energy, money, distraction.

Just as it’s possible to eat cheeseburgers your entire life and still live to see your 90s, it is possible to have no downtime with your web application without even making any investments in eliminating single points of failure. Single physical server hosting your web/app and database? No data backups? Experienced no downtime or data loss? Congratulations. Sometimes you just luck out. At the same time, that doesn’t make it a good strategy.
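To see why that is luck rather than strategy, a rough back-of-the-envelope calculation helps. This is a minimal sketch with made-up numbers (a hypothetical 99.5% availability per server and the assumption that two servers fail independently), not a claim about any real environment:

```python
# Illustrative only: assumes independent failures and made-up availability figures.
HOURS_PER_YEAR = 24 * 365

def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability level."""
    return (1 - availability) * HOURS_PER_YEAR

single_server = 0.995                          # one box hosting web/app and database
redundant_pair = 1 - (1 - single_server) ** 2  # two independent boxes, either can serve

print(f"Single server:  ~{expected_downtime_hours(single_server):.1f} hours/year of downtime")
print(f"Redundant pair: ~{expected_downtime_hours(redundant_pair):.2f} hours/year of downtime")
```

With these assumptions the single server averages roughly 44 hours of downtime a year versus well under an hour for the pair. And the second figure only holds if the failures really are independent; a shared database, power feed, or data center quietly turns a “redundant” pair back into a single point of failure.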

Minimize downtime by managing it

Having a business goal of managing downtime is a perfectly reasonable request, but as with most things involving technology, the requirement must be broken down and analyzed in a practical way before any action can be taken on it. The following conceptual framework is about the closest to a universal way I know of to break down the meaning of “We want no downtime” into something meaningful and useful, so that engineering and investment decisions can be made around it.

How to think about “managing downtime”

There are numerous facets to managing downtime – preventing it, minimizing its negative impact, handling it gracefully when it does occur, and having options for handling the really bad situations no one anticipated. So let’s break these facets down with some specificity:

1. Minimize downtime

…for all reasonable events

2. Speed up recovery time

…for all events that are unreasonable to protect against

3. Handle outages as gracefully as possible

…don’t leave users hanging (blind) even when the app becomes unavailable (e.g. continue to provide reduced functionality if possible or, when that’s not possible, provide a friendly outage message; see the sketch after this list)

…provide options for response (see next item)

4. Have enough depth in the architecture so that there are multiple options when the unforeseen occurs

…have data stored in multiple repositories that are as independent as possible

…have various data rollback points

…understand the architecture/platform and individual elements well enough that these options can be used if need be
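Here is the graceful degradation sketch promised in item 3. It is a minimal, hypothetical illustration: the function names, the cached data, and the wording of the messages are all invented for this example rather than taken from any real application.

```python
# Minimal sketch of item 3: degrade gracefully instead of leaving users hanging.
# All names and data here are hypothetical, invented for illustration.
import logging

def fetch_live_balance(account_id: str) -> str:
    """Stand-in for a call to the primary database; raises when that backend is down."""
    raise ConnectionError("primary database unreachable")  # simulate an outage

STALE_CACHE = {"acct-42": "$1,024.00 (as of last night)"}  # independent, read-only fallback

def account_balance_view(account_id: str) -> str:
    try:
        return fetch_live_balance(account_id)
    except ConnectionError:
        logging.warning("primary store down; degrading gracefully")
        if account_id in STALE_CACHE:
            # Reduced functionality: show clearly labeled stale data.
            return f"{STALE_CACHE[account_id]} (live data temporarily unavailable)"
        # No fallback data available: at least give a friendly outage message.
        return "We're having trouble right now. Your data is safe; please try again shortly."

print(account_balance_view("acct-42"))
```

The point is not the specific fallback; it is that the fallback path is designed (and tested) ahead of time instead of improvised during the outage.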

Important definitions

I threw out two seemingly straightforward terms above – reasonable and unreasonable – that can have very different definitions for different stakeholders (and at different points in time over the life of an application and organization). Defining these is paramount to getting this stuff right. I’d even go so far as to say that defining these well is the crux of getting high availability investments aligned with the business requirements.

What are “reasonable” events?

The definition of reasonable events:

  • What we can anticipate; or
  • What we can afford to protect against

What are “unreasonable” events?

The definition of unreasonable events:

  • What we can’t anticipate; or
  • What we can’t afford to prevent

The “or” between each of the above bullet points is important. We can’t always afford all the things we need or know we want. Thinking about “what-ifs” in the above context provides a conceptual framework which technologists and business sponsors can use to make informed decisions about how to proceed.

Once the above are defined, the particular situations/events that apply to a given business application can be discussed with clarity and decisions made about them.

The decisions made in the above categories drive the architecture and overall investment.
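One bit of arithmetic that helps ground those decisions is translating an availability target into the downtime it actually permits. The sketch below is pure arithmetic; the targets are examples, not recommendations for any particular business.

```python
# How much downtime per year each availability target actually allows.
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed_minutes = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} availability allows ~{allowed_minutes:,.1f} minutes of downtime per year")
```

Seeing that “three nines” permits nearly nine hours a year while “four nines” permits less than an hour tends to make the conversation about what is reasonable, and what is affordable, much more concrete.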

In other words

All of the above put another way:

  • We want to be able to sleep better at night
  • We want to prevent what we can
  • We want to manage what we can’t prevent as best as possible
  • We want to have options when the shit really hits the fan
  • We want to invest wisely
  • We want to be able to improve as tools mature, lessons are learned, business requirements change, and our resources increase

Getting there

As the saying goes, it’s not a question of if, only when.

That doesn’t mean we have to blindly spend money on every conceivable scenario. Nor does it mean we even can spend money on every possible scenario (i.e. unlimited resources are not a panacea, sorry). We can, however, get better as our business maturity demands it and as our resources permit it.

While every business and every situation is different, the analytical framework within which to make these decisions is simple enough. Every conceivable scenario can be incorporated into the framework above[5]. Combined with a strong understanding of the capabilities of your infrastructure and people, and of the resources available for investment, developing an architecture that supports your business application’s high availability requirements is completely doable.

Or you can just wing it on a single server and pray to your favorite Greek goddess[6].


  1. alas, does not exist as far as I know 

  2. Paraphrased from Wikipedia: http://en.wikipedia.org/wiki/Panacea 

  3. regardless of your spiritual beliefs and, incidentally, regardless of whether you outsource this problem or handle it all in-house 

  4. cloud offerings have increased the tools available and decreased the barriers to their use, but each service provider’s elements still cannot simply be adopted blindly if one hopes to achieve their organization’s particular business goals 

  5. I think; I’m not perfect, but apparently I don’t mind making bold claims. Ha! 

  6. to be clear: there’s nothing wrong with starting with a single server. Everybody has to start somewhere. Do make sure you have reliable data backups though. 

The 6.5 Stages of IT Monitoring Maturity: A Framework for Enterprise IT Stakeholders

Automated monitoring of any critical IT service goes through a fairly predictable cycle of refinement, regardless of the size of the organization and the complexity of the application. The problem is that monitoring – even once defined conceptually – is still a very large area, and the term itself is very generic.

I find it useful to understand the various key stages so that I can recognize where an organization is in its level of maturity, as well as to introduce some tangible milestones around where the stakeholders want to go.

In an attempt to create a common understanding I’ve pulled together this reference post. It includes the various types of monitoring, main coverage points, and typical implementation approaches and protocols involved. I also mention a few of the most commonly overlooked items in each stage.

IT Monitoring and Management – Automation

I recently tried to clear up some confusion about monitoring versus management tools for the IT services overseen by organizational IT departments. Today I want to dive a bit into automation.

Most organizations go through a cycle of incremental improvement when it comes to IT monitoring and management tools. Naturally, a significant element of this improvement comes through increased automation and ongoing optimization: in other words, improvements in what is monitored, how it is monitored, and how the elements supporting these services are managed.

IT Monitoring and Automation

Strictly speaking, automation is not necessary for monitoring. In practice, automation is heavily relied upon in monitoring.[1]

A lot of the checks that monitoring systems make are easily handled by computers, very uneventful 90%[2] of the time, and best performed on a consistent and regular basis.[3] A human layer – such as a help desk that doubles as a network operations center – then sits behind the automated monitoring. It is used to dig more intrusively into the events that automated monitoring tools detect, and also to investigate problems reported by users that haven’t set off any alarms in the automated tools.
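As a rough illustration of the kind of check those automated tools repeat every few minutes, here is a minimal sketch using only the Python standard library. The URL, timeout, and the decision to simply print an alert are placeholders, not a description of any particular monitoring product:

```python
# Minimal sketch of an automated "is the web app answering?" check.
import urllib.error
import urllib.request

def check_http(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (ok, detail) for a basic HTTP availability probe."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return (200 <= resp.status < 400), f"HTTP {resp.status}"
    except (urllib.error.URLError, OSError) as exc:
        return False, f"request failed: {exc}"

if __name__ == "__main__":
    ok, detail = check_http("https://example.com/")
    print("OK" if ok else "ALERT", detail)
    # A scheduler (cron, Nagios, etc.) would run this on an interval and only
    # escalate to the human layer when it fails.
```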

IT Management and Automation

As for management and automation, it really varies by organization. Some focus more on standardized processes, while others focus more on simplified processes and the creation of specialized management tools that hide the underlying complexity.

Management of IT assets often starts out 100% manual. That is, there are few, if any, controls or procedures in place, and most, if not all, tasks are performed directly on each device, server, or piece of software using whatever functionality is built in for doing so. Over time, procedures are standardized and documented, and automation is added. Automation here means implementing specialized tools that reduce the oft-repeated parts of commonly performed tasks, breaking those tasks down to their essential inputs[4]. This not only reduces costs, it also improves consistency[5] and average change turnaround time.
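As a hypothetical illustration of what reducing a repeated task to its essential inputs (and sanity checking them) might look like, here is a small sketch. The task, the function name, and the validation rules are all invented for this example:

```python
# Hypothetical sketch: a commonly repeated task reduced to its essential inputs,
# with those inputs sanity-checked before anything is touched.
import re

def create_site(hostname: str, owner_email: str, disk_quota_gb: int) -> None:
    # Validate the essential inputs first.
    if not re.fullmatch(r"[a-z0-9.-]+\.[a-z]{2,}", hostname):
        raise ValueError(f"invalid hostname: {hostname!r}")
    if "@" not in owner_email:
        raise ValueError(f"invalid owner email: {owner_email!r}")
    if not 1 <= disk_quota_gb <= 500:
        raise ValueError("disk quota must be between 1 and 500 GB")

    # Each step below stands in for the oft-repeated manual work: rendering a
    # config template, reloading the web server, registering DNS, logging the change.
    print(f"[dry run] provision vhost {hostname} ({disk_quota_gb} GB) for {owner_email}")

create_site("shop.example.com", "ops@example.com", 20)
```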


  1. Though there are many environments, especially smaller and less mature IT organizations, which have little if any automated monitoring. In these cases, they rely almost entirely on users reporting problems. 

  2. -ish 

  3. such as every few minutes 

  4. and sanity checking those inputs 

  5. critical for meeting service level commitments and managing risk 

Monitoring vs Managing IT Services/Assets


The monitoring and managing of critical IT services/assets is necessary in any modern organization. Not everyone, however, has the same thing in mind when they discuss monitoring and management improvement initiatives.

Leveraging IT services monitoring and management to meet business goals requires clarity about what purpose these functions serve. Only then is it even possible to discuss what is needed for the organization, what is desired by the various stakeholders, what is possible given current resources, and where things should go if and when further resources are allocated.

Let’s attempt to bring some clarity to these critical IT functions. Today I’m going to define monitoring and management and provide a few examples.

What gets monitored and managed?

IT services and assets that are typically monitored and managed include web and database servers, cloud and virtualization environments, network devices and connectivity, third-party providers[1], and storage platforms and networks.[2]

What is monitoring, really?

Monitoring IT services/assets is primarily a passive activity. Its aim is to know (or tell you) what is going on and – if looking at historical data – when. It typically involves not only real-time data, but also historical data. Monitoring is generally safe and non-intrusive; the exceptions are polling too heavily or triggering a software bug in the system being monitored. As more data is gathered and as the targets are refined, it can help you discern why something is going on, not just whether it is or when it is.

Monitoring is useful for detecting problems/events/changes, identifying patterns, planning changes[3], identifying correlated events, and isolating root causes. In more mature installations, it can become a tool for accountability and the refinement of service level metrics.

What is management, really?

Managing IT services/assets is inherently not a passive activity. Its aim is to make changes and fixes to meet business needs. These adjustments are made through the management functionality built into the devices and software being managed[4], either directly or through intermediate management tools.[5] In addition to the management functionality and tools used, management is about optimizing the workflows[6] for cost or responsiveness (to the business), and aligning the risks associated with making these changes and fixes with levels appropriate for the business.

Monitoring and Management Solution Examples

In its most basic form, monitoring consists of using an off-the-shelf solution such as Pingdom, WhatsUp Gold, or Nagios to see if your servers are online. A form of management would be something like deploying a new server with a Chef recipe, using the software update sub-system of Microsoft System Center Configuration Manager to keep the software on your organization’s computers up-to-date and security-patched, or even the workflow associated with manually deploying a new server.

Sometimes monitoring solutions also include some management capabilities, but they’re really two different functions. Both concepts go hand-in-hand, but serve different purposes.
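To make the distinction concrete, here is a toy sketch contrasting the two. It assumes a Linux host with df and systemctl available, and the function names are invented for illustration:

```python
# Toy contrast: monitoring observes, management changes state.
import subprocess

def monitor_disk_usage(path: str = "/") -> str:
    """Monitoring: read-only observation, safe to run as often as you like."""
    return subprocess.run(["df", "-h", path], capture_output=True, text=True).stdout

def manage_restart_service(name: str) -> None:
    """Management: actually changes state, so it carries change-control risk."""
    subprocess.run(["systemctl", "restart", name], check=True)

print(monitor_disk_usage())          # observing changes nothing on the system
# manage_restart_service("nginx")    # deliberately commented out: this would change the system
```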


Photo Credit: http://www.flickr.com/photos/glsims99/1404828286/


  1. e.g. credit card processors, SaaS vendors 

  2. Anything IT is responsible for, should have visibility into, and is likely to be asked to make changes to. 

  3. e.g. When is the server least utilized? How much room do we have to grow? 

  4. e.g. web management interfaces, command-line interfaces, SNMP, APIs 

  5. Most environments have a bit of both. 

  6. e.g. processes, procedures, policies 

Download Wrappers and Unwanted Software are pure evil

Scott Hanselman writes:

Call it Adware, Malware, Spyware, Crapware, it’s simply unwanted. Every non-technical relative I’ve ever talked to has toolbars they apparently can’t see, apps running in the background, browser home pages set to Russian Google clones, and they have no idea how it got that way.

Here’s how they get that way.

Finally someone posted about this. I could have (and should have) written this myself. It’s also a big reason why I supported Jumpshot on Kickstarter, which has since been acquired by Avast (and is now called avast GrimeFighter).

Microsoft, Past and Future

John Gruber writes:

“A computer on every desk and in every home” was incredible foresight for 1977. It carried Microsoft for 25 years of growth. But once that goal was achieved, I don’t think they knew where to go.

We can only presume that Satya Nadella was hired, in part, to help them figure that out. Only time will tell, but it’ll be fun to watch either way.