Automated monitoring of any critical IT service goes through a fairly predictable cycle of refinement, regardless of the size of the organization or the complexity of the application. The problem is that monitoring, even once defined conceptually, is still a very large area, and the term itself is very generic.
I find it useful to understand the various key stages so that I can recognize where an organization is in its level of maturity, as well as to introduce some tangible milestones around where the stakeholders want to go.
In an attempt to create a common understanding I’ve pulled together this reference post. It includes the various types of monitoring, main coverage points, and typical implementation approaches and protocols involved. I also mention a few of the most commonly overlooked items in each stage.
This article is not meant to be the be-all and end-all. My aim is that establishing guide posts improves discussions and moves all stakeholders forward. The point isn’t to create rules, but a framework upon which forward progress can be made.
This reference should be beneficial to technologists and those who work with technologists to support their businesses. It is slightly technical, but most of those details can be skimmed without a loss of benefit. Use this when planning monitoring improvements in your own organization. Don’t hesitate to shoot me any suggestions or thoughts. I’d love to hear how anyone uses this in their own planning or within their organization!
What’s important isn’t where you start or even are today, but where you end up. Generally, it makes business sense to move through the phases – downward on this list – as time, money, energy, and focus permit.
Before we get to the first Stage, I really should mention “Stage 0.” This is the stage of “we wait for people to call us and tell us something is down.” I don’t really consider this an implementation phase, but it’s important to mention since it is where most organizations start out. Some move quickly away from this, while others seem to stick with it despite the costs.[1]
Stage 1: Ping response checks
These are designed to answer the eternal query: “Is the server up?” Hopefully the response is “Yes, look, the server is up!” but if it’s not, you’ll be able to catch it more quickly than if you waited around to notice it yourself or, worse, for a customer/user to contact you to report the outage.
The key word here is server. Basic ping response checks are the bare minimum to be able to say you’re monitoring things. It’s hard to consider yourself a professional shop without having this.[2] Unless you hire an OCD technician who works around the clock and doesn’t mind repetitive strain injury, that is.[3]
- Servers (Physical or Virtualized)
- Network Paths/Routes
- Network Endpoints
- Ping (built into every operating system)
- Any off-the-shelf basic monitoring solution (software or service)
- Hops just beyond the “next hop” (e.g. the point _just_ past your ISP’s directly connected router)
- Auxiliary but still important/vital servers (email, authoritative DNS servers, recursive DNS servers)
- Misc critical network endpoints (e.g. WAN links, Internet connections)
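As a rough illustration of what a Stage 1 check boils down to, here is a minimal sketch that shells out to the operating system's own `ping`. The function names (`ping_command`, `is_host_up`) and the injectable `run` parameter are my own assumptions for the example, not part of any monitoring product. Exit-code conventions also differ a little between platforms, so treat the result as a first approximation.

```python
import platform
import subprocess


def ping_command(host, count=1):
    """Build a ping command line for the current operating system."""
    # Windows ping uses -n for the packet count; Unix-like systems use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_host_up(host, run=subprocess.run, timeout_s=5):
    """Return True if ping exits with status 0 within timeout_s seconds.

    `run` is injectable (hypothetical design choice) so the check can be
    exercised without sending real packets.
    """
    try:
        result = run(
            ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No answer in time, or no ping binary available: report "down".
        return False
    return result.returncode == 0
```

Calling `is_host_up("192.0.2.10")` from a scheduler every minute or so and alerting on a few consecutive failures is roughly what the basic off-the-shelf tools do for you at this stage.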
Stage 2: Service checks
This is designed to answer the imposing query: “Ah, the server is up, but is the service?” Hopefully the response here is “Yes, the web HTTP service is responding, so the web site is up!”[4]
This stage itself often goes through two sub-stages of refinement:
- Front-end (e.g. user facing services such as a web service)
- Back-end (e.g. a database service and any other services not directly accessed by users).
The service checks themselves are very basic. Their main focus here is on making sure a connection of the appropriate type is accepted on the TCP/UDP port associated with the service being monitored. If slightly smarter, the check may also look for an appropriate banner message or other response indicator (since sometimes ports connect, but the service behind them is non-responsive).
Service checks generally include some performance information (time to respond) as well, but it is rudimentary and you won’t have any control (yet) over what it’s really testing the performance of.
- Any TCP or UDP based service – e.g. HTTP, HTTPS, SMTP, DNS, SQL, etc.
- Any off-the-shelf basic monitoring solution (software or service) that does more than ping servers (i.e. needs to be able to connect to service ports to see if they answer)
- Auxiliary but still important/vital services (e.g. email, authoritative DNS servers, recursive DNS servers)
- Third-party or client-software used APIs
- Back-end services that are dependencies for front-end services (e.g. databases)
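The connect-and-look-for-a-banner idea described in this stage can be sketched with nothing more than the standard `socket` module. This is an illustrative sketch only: `check_tcp_service` and its parameters are names I made up for the example, and it covers TCP services that speak first (such as SMTP), not UDP or request/response protocols.

```python
import socket
import time


def check_tcp_service(host, port, expect_banner=None, timeout_s=3.0):
    """Return (ok, seconds, detail) for a TCP connect plus optional banner check."""
    start = time.monotonic()
    banner = b""
    try:
        # A successful connect proves something is listening on the port.
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            if expect_banner is not None:
                # Ports sometimes accept connections while the service
                # behind them is hung, so read the greeting as well.
                banner = sock.recv(256)
    except OSError as exc:
        return False, time.monotonic() - start, f"connect/read failed: {exc}"
    elapsed = time.monotonic() - start
    if expect_banner is not None and expect_banner not in banner:
        return False, elapsed, f"unexpected banner: {banner!r}"
    return True, elapsed, "ok"
```

For example, `check_tcp_service("mail.example.com", 25, expect_banner=b"220")` would pass only if the mail server both accepts the connection and greets with an SMTP 220 line. The elapsed time is the rudimentary performance number mentioned above.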
Stage 3: Interactive checks
Eventually the question from management shifts to: “Yes, the web site is up, but can users do stuff with it? Can they log in? Buy stuff?” It doesn’t take very long[5] for this question to come up. The end result is usually some frustration and embarrassment, followed by a period of overhauling the current monitoring solution.
- Any TCP or UDP based service – e.g. HTTP, HTTPS, SMTP, DNS, SQL, etc.
- Any off-the-shelf monitoring solution (software or service) that goes beyond “basic only” checks
- Back-end services that are dependencies of front-end services (e.g. specific databases/queries, third-party or inter-application APIs)
- Auxiliary but still important/vital services (e.g. email sending, email receiving, authoritative DNS server queries, recursive DNS server queries)
- Any sort of user interaction beyond the basics of logging in
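An interactive check is essentially a scripted user journey: perform a sequence of requests and confirm each response contains what a real user would expect to see. The sketch below is one hedged way to structure that with the standard library. The step format, the URLs in the usage note, and the names `make_session_fetch` and `run_transaction` are all assumptions made for illustration.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session_fetch(timeout_s=10):
    """Return a fetch(url, form) function that keeps cookies between calls."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )

    def fetch(url, form=None):
        # A form dict turns the request into a POST; None means GET.
        data = urllib.parse.urlencode(form).encode() if form else None
        with opener.open(url, data=data, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")

    return fetch


def run_transaction(steps, fetch):
    """Run (name, url, form, expected_text) steps in order.

    Stops at the first failing step and returns (ok, detail).
    """
    for name, url, form, expected in steps:
        try:
            body = fetch(url, form)
        except OSError as exc:  # urllib's URLError/HTTPError derive from OSError
            return False, f"{name}: request failed ({exc})"
        if expected not in body:
            return False, f"{name}: expected text {expected!r} not found"
    return True, "all steps passed"
```

A usage example might be a list of steps such as log in, add an item to the cart, then view the cart and look for "1 item", passed to `run_transaction(steps, make_session_fetch())`. The web server can stay up through all of this and the last step will still catch a cart that silently drops items.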
Stage 4: Server performance monitoring
Eventually when problems occur the question shifts to: “Why?” Or someone wants to do some capacity or upgrade planning. In these situations it helps to have deeper visibility – e.g. CPU use, disk I/O, and memory consumption – and to have a way of looking at the real-time and trending utilization. Having data to point at is the only way to build real business cases for investments.
- Disk I/O
- Swap space
- TCP connections
- …sometimes others… pretty much anything that can be pulled from any sub-system of the operating system or hardware (physical or virtualized)
- Some basic monitoring solutions
- Advanced off-the-shelf monitoring solutions
- OS specific
- In-house (scripts)
- Disk I/O
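For the in-house script route, a sample-and-compare loop is usually the starting point. The following is a minimal sketch using only the standard library, so it covers disk usage and load average rather than disk I/O or swap, which generally need OS-specific sources. The metric names and threshold values are hypothetical.

```python
import os
import shutil


def sample_host_metrics(path=os.sep):
    """Collect a few coarse host metrics available from the standard library."""
    usage = shutil.disk_usage(path)
    metrics = {
        "disk_used_pct": 100.0 * usage.used / usage.total if usage.total else 0.0
    }
    # Load average is not available on every platform (e.g. Windows).
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return metrics


def breaches(metrics, thresholds):
    """Return only the metrics whose value exceeds its configured threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    }
```

Storing each sample with a timestamp, rather than only alerting on `breaches(...)`, is what makes the trending and capacity-planning conversations possible later.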
Stage 5: Network performance/utilization monitoring
Sometimes services go down because of network issues. Network links may go down, but that should have already been covered in Stage 1 above. Here we are concerned about critical links[6] that may become heavily utilized unexpectedly.
If you are getting alerts, but all your servers and services appear fine, look at the network. Better yet, have your monitoring solution tell you it’s the network and not your application so you can get to work fixing it sooner. :-) This is also a good point to reevaluate whether some devices that should be monitored are not (e.g. random non-core switches) and to beef up network-related monitoring in general, namely logs and error counters for individual interfaces.
- Key WAN/LAN Hand-off points (other networks such as ISPs, switch trunks, and mission critical server hand-offs)
- Basic network device/link off-the-shelf monitoring solutions
- Any advanced off-the-shelf monitoring solutions
- SSH & Telnet
- Error counters on interfaces
- Logs, often containing leading indicators of potential problems …as well as lagging clues as to root causes of already known problems
- Layer 2 Switches
- Switch Ports
- Switch Trunk Links
- Unmanaged Switches[7]
- QoS policies
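Link utilization is normally derived from two readings of an interface's byte counter taken some seconds apart. The arithmetic is sketched below. How the readings are collected (for example from SNMP interface counters or a CLI scrape over SSH) is left out, and the function name and parameters are assumptions for the example.

```python
def link_utilization_pct(octets_before, octets_after, interval_s, link_bps,
                         counter_bits=32):
    """Estimate average link utilization (%) between two counter readings.

    Assumes the counter wrapped at most once during the interval, which is
    only safe if the polling interval is short relative to the link speed.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # Modulo arithmetic handles a single wrap of a fixed-width counter.
    delta_octets = (octets_after - octets_before) % (2 ** counter_bits)
    bits_transferred = delta_octets * 8
    return 100.0 * bits_transferred / (interval_s * link_bps)
```

As a worked example, 37,500,000 octets in 60 seconds on a 10 Mbit/s link is 300,000,000 bits against a capacity of 600,000,000 bits, so `link_utilization_pct(0, 37_500_000, 60, 10_000_000)` comes out at 50%.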
Stage 6: Application performance monitoring
Sometimes the problem is in your code (or someone else’s). Sometimes there isn’t a problem yet, but if you only had visibility you’d know that a particular, regularly made database query was accounting for 80% of the page load time for every visitor. That sort of thing. It’s a big deal, and this is the holy grail of monitoring for most folks. This stage can also include things like correlating logs to events.
- Your own apps
- Other people’s apps that you host/manage
- Language/platform specific application performance monitoring connectors/plug-ins/solutions
- Off-the-shelf multi-platform application performance monitoring solutions
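To make the "one query is 80% of the page load" kind of insight concrete, here is a toy sketch of what application performance tooling does under the hood: time named sections of a request and report each section's share of the total. Real APM products do far more (sampling, tracing across services, and so on). `RequestTimer` is a hypothetical class written for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestTimer:
    """Accumulate wall-clock time per named section of a request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so timings can be simulated
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the wrapped code raised.
            self.totals[name] += self._clock() - start

    def shares(self):
        """Return each section's percentage of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: 100.0 * t / total for name, t in self.totals.items()}
```

In a request handler you would wrap the database call in `with timer.section("db_query"):` and the template rendering in `with timer.section("render"):`, then log `timer.shares()` at the end of the request.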
Stage 6.5: Resilient and Reliable Monitoring
Somewhere amid the above stages the idea of the reliability of monitoring itself will become important.
One of the first problems that used to come up – when all monitoring was done from on-site/in-house by default (because it was the only choice) – is that the monitoring couldn’t really be trusted. That is, a user would call and say your very critical web server is down. You glance over at your fancy monitoring page/app and see the following:
- Web server: Green (good!)
- HTTP service: Green (good!)
- Internet link: Green (good!)
You conclude: must be a problem with the user. Only you’re wrong. Your monitoring is good, but its probing is not diverse enough. Its coverage is only sufficient to tell you how things look from wherever it is probing. That’s it. Unfortunately you have users in other locations and on all sorts of nefarious Internet connections.
Things are good from where the monitoring is being done, but not from where your users are. Your monitoring is also 100% reliant on everything being good on the network it happens to be on and at the location it is at. If it’s running on a single host or cluster of servers, that’s also a single point of failure.
This is when you consider expanding the monitoring platform to include geographic diversity and Internet connectivity diversity (since different providers may have problems reaching your services at any given point in time, even if your own Internet link is “up”). This is also where you start to realize how valuable and critical monitoring has become to the business, so having some redundancy in the monitoring platform itself may be a good thing as well.
- Outages within your Internet provider(s) other than on the link directly connected to you
- Outages elsewhere on the Internet
- Weird network issues (e.g. MTU issues) that only arise “out there”
- Losing all visibility into your IT services/assets when your standalone/single point of failure monitoring solution goes offline or is inaccessible for whatever reason
- External monitoring services
- PoP to PoP
- Key Customers
- Partners (monitoring swaps)
- Whatever You Are Monitoring
- Different offices/locations
- Different Internet providers
- Top Internet providers typically used by customers/users
- Important Router-to-Router VPN links
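Once the same check runs from several vantage points, the results need to be combined into one verdict. The sketch below shows one simple way to do that. The vantage point names and the three-way up/partial/down classification are assumptions for illustration, not a standard.

```python
def classify_from_vantage_points(results):
    """Combine per-location check results into a single status.

    `results` maps a vantage point name to True (check passed) or False.
    Returns (status, failing_locations).
    """
    if not results:
        return "unknown", []
    failing = sorted(name for name, ok in results.items() if not ok)
    if not failing:
        return "up", []
    if len(failing) == len(results):
        return "down", failing
    # Some locations can reach the service and others cannot: the case
    # that in-house-only monitoring never sees.
    return "partial", failing
```

For the all-green scenario above, `classify_from_vantage_points({"in-house": True, "external-east": False, "external-west": True})` returns `("partial", ["external-east"])`, which points at a path problem rather than a problem with the user.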
Photo credit: http://www.flickr.com/photos/mogwai_83/3022261893/
[1] Perhaps this framework will help them elevate things a bit.
[2] Frankly, these days the bar is higher than here, but everyone has to start somewhere and it’s better to do so here than put off having any monitoring of any type any longer.
[3] Even the most expensive monitoring solutions will be cheaper than finding someone like this, promise.
[4] And then everybody heads out for beer or a graveyard shift meal at Denny’s, after another successful maintenance window completed!
[5] E.g. the first partial outage where the web server stays up, but some strange sequence of mishaps results in items added to a shopping cart not showing up in the cart at all.
[6] Inter-connection / hand-off points between devices, servers, and organizations.
[7] Which probably don’t belong in any network where reliability is important anyway.