Using IBM UrbanCode Deploy (UCD) to grow your organization's delivery capabilities is a great start, but success brings a follow-on problem. Once the deployment process is managed through UCD, production applications become dependent on the UCD server's availability and performance, and we have to account for and support those requirements. This article discusses the primary means of designing and architecting a scalable UCD deployment with high-availability fail-over and disaster recovery capabilities.
- High-Availability versus Disaster Recovery
- Server/Agent Communication
- Disaster Recovery
- Real World Deployments
- Knowing when to Scale
High-Availability versus Disaster Recovery
One key here is to distinguish the capabilities of high availability from those of disaster recovery. In this context, *high availability* refers to horizontal scaling across multiple servers and/or data centers, which both increases capacity and adds fault tolerance when one or more UCD servers are out of service or failing. A *disaster recovery* solution, by contrast, is designed to deal with a catastrophic failure or "smoking hole" scenario in which the entire server or cluster of servers is unavailable; it is normally part of the business continuity plan for organizations large and small.

Even with the overall improvement in data center reliability and the physical and virtual machine redundancy available today, a disaster recovery solution is still necessary. As more production-level applications are deployed through UrbanCode, the path to production for those applications dictates a new set of service-level agreements for the uptime and availability of the deployment solution supporting them.
A helpful starting point for this discussion is a high-level architectural picture of UCD to set some context. The key point is that the solution was purpose-built for handling large-scale deployments (thousands of deployment targets) and has been designed accordingly.
The major subsystems shown here are the configuration web tier, the workflow engine, server/agent communication, and the artifact and file management subsystem. We'll take a quick look at how each of these can be scaled up.
Configuration Web Tier
The configuration sub-system is the face of the server and what clients access to configure and trigger deployments. This tier of the application exposes a set of RESTful APIs that are shared by the web, UrbanCode Deploy client, and API users for interacting with the system.
Being a web-based solution, this subsystem can be scaled to handle more concurrent access and improve throughput by scaling horizontally across multiple clustered UCD servers with an HTTP load balancer to distribute traffic.
This approach requires that you activate clustering, and performance is highly dependent on the performance of the underlying shared database and filesystem.
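To make the load-balancing idea concrete, here is a minimal sketch of the rotation an HTTP load balancer performs across clustered UCD nodes. The hostnames and port are illustrative assumptions, not UCD defaults.

```python
from itertools import cycle

# Hypothetical clustered UCD web-tier nodes behind a load balancer.
# Names and port are illustrative placeholders only.
UCD_SERVERS = [
    "https://ucd-node1.example.com:8443",
    "https://ucd-node2.example.com:8443",
    "https://ucd-node3.example.com:8443",
]

def round_robin(servers):
    """Yield servers in rotation, as a simple round-robin balancer would."""
    return cycle(servers)

rotation = round_robin(UCD_SERVERS)
first_four = [next(rotation) for _ in range(4)]
print(first_four)  # node1, node2, node3, then back to node1
```

Real load balancers add health checks and session affinity on top of this rotation; the point here is simply that stateless HTTP requests can be spread evenly across cluster members.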
Workflow Engine
The workflow subsystem is responsible for orchestrating deployments across all the deployment targets; this includes distributing work and continuously processing updates from the agents as they perform the deployment steps.
As the number of managed agents and concurrent deployments increases, so does the amount of processing time and bookkeeping. A single UCD server can handle hundreds of these transactions concurrently, and it is possible to keep adding processing capability to that server, but at some point the load is more practically handled by clustering and sharing it across multiple UCD servers.
Scaling the workflow engine subsystem lets us spread the load across the UCD cluster to increase throughput. In this case we leverage JMS mesh technology to share messaging data across the servers. One caveat: we cannot simply point a load balancer at a shared URL as in the HTTP load balancer case. Because the agent-to-server connections are persistent, we recommend a round-robin DNS style solution instead.
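The round-robin DNS behavior can be sketched as follows: each agent resolves the shared name once, takes the first answer, and holds a persistent JMS connection to that node, so connections end up spread evenly across the cluster. The addresses and name below are hypothetical.

```python
# Simulated round-robin DNS for persistent agent JMS connections.
# Addresses and the DNS name are illustrative assumptions.
CLUSTER = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def resolve(name, query_count):
    """Simulate round-robin DNS: the record set rotates on each query."""
    offset = query_count % len(CLUSTER)
    return CLUSTER[offset:] + CLUSTER[:offset]

# 300 agents connect; each keeps a persistent connection to the first
# address it is handed, so it stays on that server until restart.
assignments = {}
for agent_id in range(300):
    first_answer = resolve("ucd-jms.example.com", agent_id)[0]
    assignments[first_answer] = assignments.get(first_answer, 0) + 1

print(assignments)  # roughly equal persistent-connection counts per server
```

Unlike HTTP balancing, the distribution happens only at connect time, which is why persistent-connection traffic needs this style of distribution rather than per-request balancing.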
Server/Agent Communication
Agents will far outnumber servers, due both to the distribution of agents and to the scale of persistent connections required. UCD's Agent Relay model provides local connection endpoints for agents; relays both minimize the connections held open on the server and remove the need for every agent to reach the UCD server directly. Having fewer agents connected directly to the server can also simplify your security rules, and is analogous to the jump servers many organizations already use to perform deployments to DMZ or other non-trusted environments.
This model has proven to scale successfully to thousands of agents and could theoretically scale well beyond that. It provides an efficient way to leverage shared persistent connections, mapping a single server thread to each Agent Relay versus a thread per agent.
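The thread-per-relay savings is easy to quantify. The sketch below compares direct connections against relayed ones; the 250 agents-per-relay figure is an illustrative assumption, not a documented UCD limit.

```python
def server_connections(agents, agents_per_relay=None):
    """Persistent connections (and threads) the UCD server must hold open.

    Without relays, every agent connects directly. With relays, the
    server holds one shared connection per relay. The agents-per-relay
    figure passed in is an assumption for illustration.
    """
    if agents_per_relay is None:
        return agents  # direct: one persistent connection per agent
    # ceiling division: one relay (hence one server connection) per group
    return -(-agents // agents_per_relay)

print(server_connections(5000))        # 5000 direct connections
print(server_connections(5000, 250))   # 20 relay connections
```

Dropping from thousands of server-side threads to a few dozen is what makes the relay topology the practical choice at scale.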
Scaling Artifact Repository and Distribution
The artifact subsystem is the final key area we'll cover here; it handles the versioning and storage of our deployable artifacts. As stated already, backing up and sharing this filesystem is crucial to scaling the solution in a clustered environment, and to the disaster recovery solution below. It is recommended that you give the UCD server(s) access to fast SAN storage, to minimize the latency of retrieving and serving plugins and artifacts to agents.
As of UCD 6.1, artifact caching on the Agent Relay helps alleviate some of this load. Note that this adds a new connectivity requirement for agents communicating with the Agent Relay: we are already using JMS (7916) and most likely the HTTP proxy port (20080), and the artifact cache adds a new port at the HTTP proxy port + 1 (20081). The cache provides a local copy of artifacts so the server can avoid repetitive file transfers where possible. During configuration you specify the Agent Relay cache size, so depending on your environment it can be a useful way to save bandwidth, especially when you deploy the same artifact versions to a number of machines. This is another good reason to use Agent Relays in your UCD architecture.
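The bandwidth savings from relay-side caching can be estimated with simple arithmetic. This sketch assumes a single relay and a cold cache on the first pull; the numbers are illustrative.

```python
def transfer_mb(artifact_mb, target_count, relay_cache=False):
    """MB pulled from the UCD server across the WAN for one deployment.

    With relay caching enabled (served on the relay's HTTP proxy
    port + 1), the server sends the artifact once to the relay; without
    it, once per target. One relay and a cold cache are assumed.
    """
    if relay_cache:
        return artifact_mb  # one pull into the cache, then served locally
    return artifact_mb * target_count

# Deploying a 500 MB artifact version to 40 machines behind one relay:
print(transfer_mb(500, 40))                    # 20000 MB without caching
print(transfer_mb(500, 40, relay_cache=True))  # 500 MB with caching
```

The savings grow linearly with the number of targets receiving the same artifact versions, which is exactly the repeated-deployment case the cache targets.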
Disaster Recovery
This is a special case where we assume that nothing from the original server has survived. (Set aside, for the moment, that you can minimize the need for this scenario by spanning your primary UCD cluster, database, and shared filesystem across two or more data centers.) When a DR event occurs, the list of what we need to bring UCD back online and functioning is fairly short:
- The Database
- The artifact repository filesystem
- Configuration directory of UCD installation
- A new server or cluster to run UCD
- Your UCD license(s)
- Your security rules in place for traffic ( HTTP/HTTPS, JMS, JDBC, licensing )
- A DNS switchover
In practice, you want at least nightly database backups, and keep transaction logs so you can replay forward to the point of failure. The filesystem should be synced or duplicated as close to real time as you can; most SAN devices already provide something like this. The new servers can be pre-baked VM copies of production or built ad hoc; this is where the backed-up server configuration directory comes in, so you already have the correct SSL keys for client communication and for the secure properties stored in the database.

One thing I have seen overlooked too many times is a license to run the tool: our best plans to bring up the server are in vain if it simply refuses to run deployments without a license, so be sure to keep a DR copy of your production licenses. Security rules can be a sticking point as well, so ensure that the required rules exist to get from your various endpoints to both the production server and the DR-site server, to avoid fighting that battle during an actual outage. And finally, plan some kind of name switchover using your organization's name-management solution.
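The checklist above lends itself to an automated preflight check, so gaps (like the oft-forgotten licenses) surface before an outage rather than during one. This is a toy sketch; the asset names simply mirror the checklist and nothing here queries a real system.

```python
# Hypothetical DR preflight check mirroring the checklist in this
# article; item names are illustrative labels, not UCD settings.
REQUIRED_DR_ASSETS = {
    "database_backup",
    "artifact_repository_copy",
    "server_conf_directory",
    "standby_servers",
    "licenses",
    "firewall_rules",
    "dns_switchover_plan",
}

def dr_gaps(available):
    """Return the checklist items still missing from a DR plan."""
    return sorted(REQUIRED_DR_ASSETS - set(available))

# A plan that forgot the licenses -- a commonly overlooked item.
plan = REQUIRED_DR_ASSETS - {"licenses"}
print(dr_gaps(plan))  # ['licenses']
```

Running a check like this on a schedule, against whatever inventory your organization keeps, turns the DR plan from a document into something continuously verified.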
Oh, and testing this is a good idea. It is recommended that you test in a disconnected network segment, so you can simulate a real DNS cutover and confirm that services do in fact reconnect without your having to update every agent and agent relay. If you intend to test on the same network, there are caveats. First, you won't really be able to test your global DNS cutover, so you'll have to mitigate that. Second, the server knows its URL (or at least what you told it its URL is), and depending on what actions you are testing it may respond with that URL and inadvertently redirect your tests to production. Check `Systems > Settings` and ensure the web URLs are correct for your testing environment.
Real World Deployments
First, let's start with a prototype of how a fully scaled-up UrbanCode Deploy architecture could look. One note on round-robin DNS versus a load balancer: it is currently possible to achieve the same effect with a load balancer, but doing so requires enabling round-robin distribution and persistent connections and not attempting any kind of SSL off-loading. For that reason, load balancing of JMS traffic is officially unsupported, but you should be able to accomplish round-robin DNS behavior effectively with your choice of load balancer.
Example deployment #1 is a good representation of how many deployments look.
Example deployment #2 is becoming the trend since the IBM acquisition, as more large-scale deployments roll out to enterprise customers. Note the addition of clustering, and even redundancy at the Agent Relay level.
Knowing When to Scale
It is perfectly reasonable to expect a single UCD server and a few Agent Relays to support hundreds of target servers with dozens of daily deployments, with little to no tuning required. Achieving enterprise scale, performance, and reliability, however, ultimately relies on implementing UCD clustering. The decision to cluster should be based on multiple factors: first a vision or plan, and second the current state.
Consider planning to ensure that you have the capacity to meet your deployment needs, along with the availability, performance, and reliability requirements in your business continuity plans (i.e., if UCD deploys our production web portal, should UCD be held to the same availability requirements?). Part of this is understanding expected usage today, expected growth over the next six months, and over the next year. I recommend making this at least a yearly reconciliation activity to ensure you are staying on target; most organizations do this anyway during yearly budgeting.
Second is reacting to a changing environment: this is a DevOps world, and your solution today could look dramatically different in three months if you start producing a range of new products or shift workloads across different technologies. On this dimension, standard application monitoring for UCD is crucial. From day one you can start with simple metrics: CPU utilization, RAM utilization, disk utilization, disk I/O, network I/O, database growth, and database CPU utilization. The load characteristics of your deployment depend on the deployment workflows you build and use in your organization, so your mileage may vary, but standard server high-water marks, combined with an understanding of the UCD subsystems discussed above, can be your roadmap to identifying the bottlenecks and scaling issues you may be facing.
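A high-water-mark check over those metrics can be as simple as the sketch below. The thresholds are illustrative assumptions to tune against your own baseline, not UCD-documented limits.

```python
# Illustrative high-water marks; tune to your own observed baseline.
THRESHOLDS = {
    "cpu_pct": 80,
    "ram_pct": 85,
    "disk_pct": 90,
    "db_cpu_pct": 75,
}

def breaches(sample):
    """Return the metrics in a sample that exceed their high-water mark."""
    return {metric: value for metric, value in sample.items()
            if metric in THRESHOLDS and value > THRESHOLDS[metric]}

# One polled sample from a hypothetical monitoring agent.
sample = {"cpu_pct": 91, "ram_pct": 60, "disk_pct": 88, "db_cpu_pct": 76}
print(breaches(sample))  # {'cpu_pct': 91, 'db_cpu_pct': 76}
```

Which subsystem to scale depends on which marks are breached: sustained CPU and RAM pressure points at the workflow engine and web tier, disk and database growth at the artifact and database layers discussed above.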
Understanding how to grow your UCD solution and keep your end users satisfied is key to a successful UCD deployment. Out-of-the-box clustering, connectivity to clustered databases, and building the solution with best practices and proven components combine to make UrbanCode Deploy a real heavyweight in enterprise-level deployment automation. Keeping the server running and performing is only part of the puzzle, but it is an important one that deserves a solid plan.