The primary focus of this post is the operational concerns of running a healthy UrbanCode Deploy infrastructure. The concepts are mostly generic, but I will try to add my thoughts on some relevant points of interest as we step through an example setup and analysis for a Linux server in a future post.
“But we already have monitoring today”… but do you?
The impetus for sharing is the simple fact that these concepts are not as universally understood as is often assumed. The unfortunate reality is that we fall down pretty quickly in a siloed organization with many overlapping roles, handoff points, and the general rise of service-ticket-driven operations. Because UrbanCode delivers development and operations tools, we often overlook the simple fact that while this is a well understood problem space, it is still often pretty immature in terms of execution. Ok, now that I am off my soapbox, let’s get back to business.
This is the nuts-and-bolts stuff that should just be part of any managed service offering, and if it does not already exist in your “devops” team competency, it is probably the highest-value thing to implement, first for your applications and second for your own devops infrastructure. Yes, even before you add your servers; if you have been around any number of orgs, you have seen your fair share of shabby infrastructure issues. If you have monitoring from day one, it really can help you identify and target issues sooner. A key part of our challenge in implementing new development, change, deployment, and release practices is building trust and confidence; where the tools are unreliable, one or two outages is enough to undermine confidence in a new set of tools.
You will need your IaaS/PaaS provider’s monitoring tools to do this level of monitoring. By basic monitoring I mean your ability to track the fundamental elements of your infrastructure: OS CPU usage, OS memory usage, application CPU usage, application memory usage, disk I/O and utilization, network I/O, and general network connectivity are what I consider the basics here.
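To make the first of those basics concrete, here is a minimal sketch of how CPU utilization can be derived on a Linux box from two readings of the aggregate `cpu` line in `/proc/stat` (the same arithmetic tools like `top` perform). The two sample strings below are made-up illustrative values, not real captures:

```python
# Sketch: computing CPU utilization from two /proc/stat samples (Linux).
# Field order follows proc(5): user nice system idle iowait irq softirq ...

def parse_cpu_line(line):
    """Return (idle_time, total_time) jiffies from an aggregate 'cpu' line."""
    fields = [int(v) for v in line.split()[1:]]
    idle = fields[3] + fields[4]          # idle + iowait
    return idle, sum(fields)

def cpu_percent(sample1, sample2):
    """Utilization over the interval between two samples of the 'cpu' line."""
    idle1, total1 = parse_cpu_line(sample1)
    idle2, total2 = parse_cpu_line(sample2)
    dt = total2 - total1
    if dt == 0:
        return 0.0
    return 100.0 * (1 - (idle2 - idle1) / dt)

# Two samples taken roughly a second apart (illustrative values only):
s1 = "cpu  4705 150 1120 16250 520 0 175 0 0 0"
s2 = "cpu  4800 150 1180 16330 525 0 180 0 0 0"
print(round(cpu_percent(s1, s2), 1))  # prints 65.3
```

In a real collector you would read `/proc/stat` on an interval instead of hard-coding samples; memory, disk, and network have analogous sources in `/proc/meminfo`, `/proc/diskstats`, and `/proc/net/dev`.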
The simplest scenarios can be enabled and configured in a matter of minutes to start capturing data; building some method for consuming, processing, and making sense of that data is really where I want to focus here.
Advanced monitoring is a much more interesting topic for most application designers, support teams, and operations teams, as it involves direct or indirect introspection into the application middleware, runtimes, and other pieces of the deployed applications. There are many tools available to do this for most middleware runtimes, databases, and web-fronting servers (load balancers, caching servers, gateways, firewalls). The idea here is to provide an added level of intelligence to the monitored systems to correlate, say, a web request to an application request to the resultant DB query and back out to the requestor.
This type of monitoring is great for deep problem identification and very targeted analysis of what is going on inside your application, but it generally requires being able to change application configurations to support this level of inspection, and the applications must support key concepts like end-to-end transaction identifiers to get this level of sophistication right. I have been playing around with some of these tools from various vendors and would like to come back to them, but that is for another post.
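The end-to-end transaction identifier idea can be sketched simply: generate an ID at the edge, attach it to every log line, and forward it on every downstream call so the web request, application call, and DB query all share one ID. The header name and function below are illustrative, not any particular APM vendor's API:

```python
# Sketch: propagating a correlation (transaction) ID so all log lines for
# one end-to-end request can be tied together. Names here are hypothetical.
import logging
import uuid

HEADER = "X-Correlation-ID"  # assumed header name, not a standard

def handle_request(headers):
    """Reuse the caller's transaction ID if present, else start a new one."""
    txn_id = headers.get(HEADER) or str(uuid.uuid4())
    log = logging.LoggerAdapter(logging.getLogger("app"), {"txn": txn_id})
    log.info("request received")
    # Pass the same ID onward so the next hop (app tier, DB layer) logs it too.
    downstream_headers = {HEADER: txn_id}
    return txn_id, downstream_headers

logging.basicConfig(format="%(txn)s %(message)s", level=logging.INFO)
txn, hdrs = handle_request({})        # edge: mints a new transaction ID
txn2, _ = handle_request(hdrs)        # downstream hop: keeps the same ID
assert txn == txn2
```

Real APM suites do this instrumentation for you (often at the bytecode or proxy level), but the applications still have to let the identifier flow through.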
Setting Thresholds for Notification
Once we have some level of monitoring in place, the next level of maturity is to put some threshold indicators into your monitoring tool. We can use our monitoring baselines to identify where the system normally operates in a steady state; extrapolating from that, we want to look again at CPU, memory, JVM memory usage, disk I/O, and network I/O. We can start conservatively and tune these to filter out any noise; in this case, noise comes from spikes of resource utilization, which we fully expect to happen during deployments. So we need to be intelligent about our warnings and notifications: for example, the system going to 100% CPU is not a high-priority warning on its own, but if the CPU stays at 100% utilization for 5-10 minutes we want to notify someone to take corrective action. The same concept can be put into practice for disk, network, and memory.
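That "sustained for 5-10 minutes, not a momentary spike" rule is easy to express as a sliding window over recent samples. A minimal sketch (the class name and numbers are mine, chosen for illustration):

```python
# Sketch: fire an alert only when a metric stays above a threshold for a
# sustained window, filtering out the short spikes we expect during deploys.
from collections import deque

class SustainedThresholdAlert:
    def __init__(self, threshold, window):
        self.threshold = threshold              # e.g. 95.0 (% CPU)
        self.window = window                    # samples that must all breach
        self.samples = deque(maxlen=window)     # rolling window of readings

    def observe(self, value):
        """Record one sample; return True only when the full window breaches."""
        self.samples.append(value)
        return (len(self.samples) == self.window
                and all(v >= self.threshold for v in self.samples))

# With one sample per minute, window=5 means "at 100% for 5 straight minutes".
alert = SustainedThresholdAlert(threshold=95.0, window=5)
readings = [100, 100, 60, 100, 100, 100, 100, 100]
fired = [alert.observe(r) for r in readings]
print(fired)  # the dip at minute 3 resets the window; only the last fires
```

The same object, with different thresholds, covers disk, network, and memory; most monitoring tools expose this as a "for N minutes" clause on the alert rule.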
In our internal performance testing we have set up some setup- and teardown-type steps to enable monitoring, run the tests, disable monitoring, and package up the results from the various machines along with the test results, giving us a good snapshot that we can inspect later to see how the machines are behaving. While fully contained monitoring is useful for this type of testing, a better approach is to implement a continuous monitoring solution that can provide a rolling view of the environment, helping both to set a baseline and to give you a clear view of the system’s normal steady state.
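The enable/run/disable/package pattern described above can be sketched as a small wrapper; everything here (the function name, the stand-in sampler, the output file) is illustrative rather than how our test harness is actually built:

```python
# Sketch of the enable / run / disable / package pattern: sample a metric in
# the background while a workload runs, then write the samples out as a
# snapshot to inspect later. The sampler stands in for a real agent.
import json
import threading
import time

def run_with_monitoring(workload, sampler, outfile, interval=0.05):
    samples, stop = [], threading.Event()

    def collect():
        while not stop.is_set():
            samples.append(sampler())
            stop.wait(interval)

    t = threading.Thread(target=collect)
    t.start()                         # enable monitoring
    try:
        workload()                    # run the tests
    finally:
        stop.set()                    # disable monitoring
        t.join()
    with open(outfile, "w") as f:     # package up the results
        json.dump(samples, f)
    return samples

samples = run_with_monitoring(
    workload=lambda: time.sleep(0.2),          # stand-in for the test run
    sampler=lambda: {"t": time.time()},        # stand-in metric sample
    outfile="snapshot.json",
)
print(len(samples) > 0)
```

A continuous-monitoring setup inverts this: the collector never stops, and the "snapshot" is just a time-bounded query against data that is always being gathered.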