The Anti-converged System: 10 Steps to a Disaggregated Data Center
A disaggregated data center—a well-connected but intentionally decoupled IT system—is often more flexible and has fewer underutilized resources.
When the data center is viewed as a collection of resources, the workloads that run on those resources should be easy to move. Whether a workload runs in a container (e.g., Docker and Kubernetes), in a virtual machine or via a batch-processing framework, it should be responsive and largely independent of the hardware on which it runs. This allows the data center to migrate workloads and optimize resource utilization.
The network should be flexible enough to support workloads and hardware that are moving and changing. Typically, SDN and NFV are the go-to concepts in this area; as the environment shifts, the network should shift along with it, without the need for manual reconfiguration or human intervention.
The concept of "embarrassing parallelism" has found a natural home in the disaggregated data center. In parallel computing, an embarrassingly parallel workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case when the tasks have no dependencies on, and no need to communicate with, one another. In an environment where workloads are transportable and the network is flexible, service designers should build around the embarrassingly parallel aspects of their applications. The key question is: how can a workload be divided up and distributed to a pool of data center resources that can be scaled up and down as load increases and decreases?
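As a rough illustration of that question, the sketch below splits a job into independent chunks, hands them to a pool of workers and combines the partial results. It is a minimal sketch under stated assumptions: the function names (chunked, process_chunk, run_parallel) are hypothetical, the sum-of-squares job is a stand-in for real work, and a local thread pool stands in for a pool of data center resources.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split items into contiguous slices of near-equal size (at most n_chunks)."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Independent unit of work: it needs no data from any other chunk."""
    return sum(x * x for x in chunk)


def run_parallel(items, workers):
    """Fan the chunks out to a worker pool, then combine the partial results."""
    chunks = chunked(items, workers)
    # Assumption: a thread pool stands in for remote workers in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


total = run_parallel(list(range(1000)), workers=4)
```

Because no chunk depends on another, changing the workers argument changes only how the work is spread out, not the result, which is the property that lets such a service scale up and down with load.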
In any data center, failures happen. In a modern disaggregated data center, failures are to be expected. Just as a robust stand-alone application handles read/write failures, transient resource unavailability and unexpected shutdowns, services for the disaggregated data center must expect any and all resources to become temporarily unavailable, and be able to recover from and adapt to these changes in discrete resources.
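One common way to design for transient unavailability is to retry a failed call with an increasing delay. The sketch below is illustrative only: ResourceUnavailable, call_with_retry and FlakyStore are hypothetical names, and the flaky store simply simulates a resource that is offline for its first few calls.

```python
import time


class ResourceUnavailable(Exception):
    """Raised by a call when its backing resource is temporarily gone."""


def call_with_retry(operation, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Run operation, retrying with exponential backoff on transient failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except ResourceUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...


class FlakyStore:
    """Hypothetical resource that fails a fixed number of times, then recovers."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def read(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ResourceUnavailable("storage node offline")
        return "payload"


store = FlakyStore(failures=2)
result = call_with_retry(store.read)  # succeeds on the third call
```

A real service would typically add jitter and a cap on the delay, and would distinguish transient errors from permanent ones; those refinements are omitted here for brevity.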
The CAP Theorem and Eventual Consistency
Eric Brewer's CAP (consistency, availability and partition-tolerance) Theorem is taught in almost any good college-level distributed systems course, and should be battle-tested knowledge for any service architect. The theorem states that it is impossible for a distributed system to guarantee all three CAP properties at the same time. While some debate exists as to how hard and fast these rules are, they are parameters by which service architects live and die. Eventual consistency is one approach to balancing consistency and availability: replicas may temporarily disagree, but they converge once updates have propagated. It tends to be more broadly applicable than many would argue. Partition-tolerance can be tricky, and resource and network partitions can wreak havoc on many eventual consistency implementations. However, robust service design with CAP in mind is paramount to the existence of the disaggregated data center.
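To make eventual consistency concrete, the sketch below uses a grow-only counter, a well-known replicated data type that the article itself does not name; it is chosen here only as a small example. Each replica accepts writes while partitioned, and replicas converge to the same value once they exchange state. The class and replica names are hypothetical.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by per-slot maximum."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments seen from that replica

    def increment(self, amount=1):
        """Record a local write; always available, even during a partition."""
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        """Fold in another replica's state; safe to repeat and order-independent."""
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())


# Two replicas accept writes independently while partitioned...
a, b = GCounter("rack-a"), GCounter("rack-b")
a.increment(3)
b.increment(2)
diverged = (a.value(), b.value())  # the replicas disagree for now

# ...and converge once the partition heals and they exchange state.
a.merge(b)
b.merge(a)
```

The trade-off is visible in the diverged tuple: while partitioned, each replica stays available but reports only what it has seen.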
Look South Into All Subparts of the Rack
The modern disaggregated data center is composed of racks full of resources. It is imperative for service managers and designers to have the ability to programmatically look southbound into the rack: specifically, to enumerate, monitor and control all subparts of all components in that rack. This requires granular application programming interfaces (APIs) that expose this information. Ideally, a single, powerful southbound API should be made available, though in some cases, a variety of APIs are cobbled together (e.g., IPMI and SNMP). Beyond resources, placing sensors at rack level and component level also provides a valuable glimpse into what is going on in a rack in a disaggregated data center.
Look North of the Rack, Too
Granular southbound resource information is one thing, but without a northbound component to help put the pieces together, it can turn into a firehose of information with no context. Resources in a disaggregated data center do not exist in a vacuum; workloads, environmental characteristics, and cross-rack and cross-data center factors all come into play. A good way of looking north of the rack is to consider how to package and aggregate southbound information into a shape that is more easily consumable and actionable up the chain. However, let's not confuse this with monitoring and automation. Looking northbound really means determining what is needed to manage the resources and how to get that information there.
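As a simple picture of what packaging southbound information might look like, the sketch below rolls raw component readings up into one summary per rack. The record layout, field names and sample values are all assumptions made for illustration, not a format defined by any particular API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw southbound readings, one record per component per metric.
readings = [
    {"rack": "r1", "component": "cpu0", "metric": "temp_c", "value": 60.0},
    {"rack": "r1", "component": "cpu1", "metric": "temp_c", "value": 70.0},
    {"rack": "r1", "component": "psu0", "metric": "watts", "value": 400.0},
    {"rack": "r2", "component": "cpu0", "metric": "temp_c", "value": 55.0},
]


def summarize_by_rack(records):
    """Collapse per-component readings into min/max/mean per rack and metric."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        grouped[record["rack"]][record["metric"]].append(record["value"])
    return {
        rack: {
            metric: {"min": min(vals), "max": max(vals), "mean": mean(vals)}
            for metric, vals in metrics.items()
        }
        for rack, metrics in grouped.items()
    }


summary = summarize_by_rack(readings)
```

A consumer further up the chain can then ask a rack-level question, such as whether summary["r1"]["temp_c"]["max"] exceeds a threshold, without handling every component reading itself.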
Most data centers involve varying degrees of automated monitoring; having a technician walk through the aisles with a clipboard just doesn't scale. In a disaggregated data center, it is critical to monitor every aspect of every resource: sensorification (which is surprisingly inexpensive), device-specific data points (which we get for free via device and OS APIs) and broader environmental characteristics (e.g., building management system and sensor data). The more that is monitored, the better the picture that can be drawn of the overall state of the data center, from heat maps and resource utilization mapping to customer billing and failure postmortems. As more is monitored in a disaggregated data center, better, cheaper and more resilient services can be built.
Partial data center automation is not uncommon. Tools like Puppet, Chef and Ansible have removed much of the pain and manual labor from part of the equation, but there is always room to take more costly and error-prone human judgment out of the decision-making process. In the disaggregated data center, with the capabilities and principles defined in previous sections, it should be possible to automate everything from workload migration to environmental and building controls, based on operational insight and measurements superior to the traditional, but useful, PUE (power usage effectiveness) metric, which equals total facility power divided by IT equipment power.
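For reference, PUE is conventionally defined as total facility power divided by the power delivered to IT equipment, so a value of 1.0 would mean that every watt entering the facility reaches the IT load. The short sketch below works one example; the function name and the power figures are made up for illustration.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Hypothetical facility: 1,500 kW drawn in total, 1,000 kW of it by IT equipment.
example_pue = pue(total_facility_kw=1500.0, it_equipment_kw=1000.0)  # 1.5
overhead_kw = 1500.0 - 1000.0  # 500 kW goes to cooling, power distribution, etc.
```

PUE says nothing about how much useful work the IT equipment does with its share of the power, which is one reason to treat it as a starting point rather than the sole input to automation.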
Intelligent Metric-Driven Decision Making
In a modern disaggregated data center, it becomes possible to focus on metric-driven decisions. When you have well-built services that expect failure and can be easily migrated, running on hardware and in an environment that is heavily instrumented and easily orchestrated, it becomes possible to assert and drive decisions around metrics such as performance per watt per dollar. In the disaggregated data center, data comes from every aspect of the data center, and based on that data, intelligent decisions can be made about how resources are used and managed.
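A metric such as performance per watt per dollar can be turned directly into a placement decision. The sketch below ranks candidate resources by that score; the Candidate type, the node names and every number in it are hypothetical, and a real scheduler would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical measurements for one place a workload could run."""
    name: str
    ops_per_sec: float       # measured performance
    watts: float             # measured power draw under that load
    dollars_per_hour: float  # amortized cost of the resource

    @property
    def score(self):
        """Performance per watt per dollar; higher is better."""
        return self.ops_per_sec / self.watts / self.dollars_per_hour


candidates = [
    Candidate("node-a", ops_per_sec=1000.0, watts=200.0, dollars_per_hour=0.50),
    Candidate("node-b", ops_per_sec=1500.0, watts=400.0, dollars_per_hour=0.75),
    Candidate("node-c", ops_per_sec=800.0, watts=100.0, dollars_per_hour=1.00),
]

# Pick the candidate that delivers the most work per watt per dollar.
best = max(candidates, key=lambda c: c.score)
```

Here node-b is the fastest in absolute terms, yet node-a scores highest (10.0 against 5.0 and 8.0), which illustrates how a composite metric can point away from the choice that raw performance alone would suggest.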