Getting a grip
At some point, the abstractions of data and computation turn into the practicalities of software and hardware that require a lot of attention to work well. "Anybody who's been in a modern data center knows that there are hundreds of components, perhaps with hundreds of tuning parameters," said Steve White, senior manager for autonomic computing science, technology and standards at IBM Research, in Hawthorne, N.Y. "Getting them to work can take days; to work together can take weeks; to work really well can take a career."
Current work with IBM's DB2 database, White said, demonstrates some of the possibilities, and eWEEK Labs has also found substantial progress toward this goal in Oracle Corp.'s Database 10g. At the same time, White said, it's critical to respect the expectations of the people who are charged with managing a system. "The really important thing will be transparency," he said. "The system administrator has to be able to say, 'I know that won't work,' or perhaps, 'Hmm, I hadn't thought of that.'"
Jim Sangster, an enterprise management software engineer at Sun, in Menlo Park, Calif., agreed that there is a need to keep automation within culturally acceptable bounds: "Fully automated data centers?" Sangster asked. "That's not the angle we're moving forward with. That's not what customers particularly want, nor am I sure that it's achievable."
One example, Sangster said, can be found in customers' reactions to automation in management tools designed for critical incident response. "We're about to ship Sun Cluster Geographic Edition for disaster recovery across a continent or around the world," he said. "We have yet to have a customer who wants that fully automated. They want a red-button confirmation. Is it really a full site failure, or is there a connectivity failure? ... You want to know the nature of the failure and the business loss and data loss that might result. All sorts of issues need a person to make decisions."
IBM's White is nonetheless optimistic that complex tasks will become increasingly automated. "At some point, people will get tired of saying 'OK,'" and they'll let the automatic systems do more, he said. White further observed that this is part of a long-standing trend: "People used to spend a lot of time thinking about what sector would get used on a hard disk."
Similar thinking was evident in eWEEK Labs' conversations with Sun's Ratcliffe, who described the "predictive self-healing" direction that's already apparent in the company's Solaris 10 operating system. "We can automate the handling of CPU and memory problems; customers get meaningful error messages and notification of how the system has mitigated the problem and what they should do in the longer run," Ratcliffe said. "We can migrate tasks off a CPU or memory or I/O subsystem before it becomes a problem."
That's just the beginning, continued Sun Distinguished Engineer Ken Gross at the company's physical sciences research facility in San Diego. Sun's CSTH (Continuous System Telemetry Harness), launched in response to a European bank's call for help about two years ago, is growing in capability and broadening its availability across the full range of Sun machines. The software-based CSTH relies on numerous physical sensors in a hardware package. "A blade has over a dozen [sensors] that were originally put in the computers for gross protection, to shut it down before there's a fire or permanent damage to components," said Gross. "Sun was the first company to poll those sensors continuously and monitor them with a pattern recognition algorithm."
Gross said he expects that it will soon be commonplace, at least on well-instrumented hardware, to recognize anomalies and detect potential failures weeks in advance. Vital to the acceptance of these approaches, however, is minimization of false alarms that might shut down hardware without sufficient reason.
The mathematical weapon to that end, said Gross, is MSET (Multivariate State Estimation Technique), a system developed for nuclear power plants and other such safety-critical systems and now widely considered a proven approach with a rare combination of sensitivity, reliability and efficiency.
For enterprise site builders and managers who aren't prepared to offer their head office an MSET seminar, the question is how the transition to self-management can be made both credible and cost-effective. For one thing, said IBM's White, these systems will be much more auditable, a criterion that's top of mind in many executive offices. "Walk onto the floor of a data center and ask, 'What do you have, and how does it interact?' People don't know," White said. "The trend toward self-management will build a foundation of self-awareness. People will be able to ask the system, 'What are you, and what do you depend on?' The system will have made an electronic contract, and it will know what it's using."
Enterprise sites may find that the statutory mandates of auditability, at least as much as the ROI (return-on-investment) prospects of improved manageability, are the levers that will pry loose the resources they need to acquire and deploy these innovations as they become mainstream products.
White's group is trying to tame this complexity at both the low level of core technology and the high level of organizational process. "We want to raise the level of interaction from the knob-and-dial level," he said. His goal is to let a system operator specify the ends rather than the means: "to tell the database, 'Here's how fast I want you to be.' If that means you need a larger buffer pool, do that," White said.
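To make the "ends rather than means" idea concrete, here is a minimal sketch in Python. It is an illustration only, not IBM's method and not the DB2 API: the `SimulatedDatabase` class, its toy latency model and the `tune_to_goal` function are all hypothetical names invented for this example. The operator states a latency goal and a memory ceiling; the loop decides for itself how far to grow the buffer pool, and records each step so that an administrator can see what was done and why.

```python
from dataclasses import dataclass


@dataclass
class SimulatedDatabase:
    """Toy stand-in for a database; hypothetical, not DB2 or any real product."""
    buffer_pool_mb: int = 64
    working_set_mb: int = 1024

    def measure_latency_ms(self) -> float:
        # Assumed toy model: the more of the working set that fits in the
        # buffer pool, the fewer slow disk reads each query needs.
        hit_ratio = min(1.0, self.buffer_pool_mb / self.working_set_mb)
        return 2.0 + 50.0 * (1.0 - hit_ratio)


def tune_to_goal(db: SimulatedDatabase,
                 target_latency_ms: float,
                 max_buffer_pool_mb: int) -> list[str]:
    """Grow the buffer pool until the latency goal is met or the cap is hit.

    Returns a human-readable log of every action taken.
    """
    actions = []
    while db.measure_latency_ms() > target_latency_ms:
        if db.buffer_pool_mb >= max_buffer_pool_mb:
            actions.append("goal not met: buffer pool at limit")
            break
        new_size = min(db.buffer_pool_mb * 2, max_buffer_pool_mb)
        actions.append(
            f"grow buffer pool {db.buffer_pool_mb} -> {new_size} MB"
        )
        db.buffer_pool_mb = new_size
    return actions


if __name__ == "__main__":
    db = SimulatedDatabase()
    # The operator specifies the goal, not the buffer pool size.
    for line in tune_to_goal(db, target_latency_ms=10.0,
                             max_buffer_pool_mb=2048):
        print(line)
    print(f"final latency: {db.measure_latency_ms():.1f} ms")
```

The action log is the part that speaks to White's point about transparency: the system changes its own settings, but it leaves a trail an administrator can review and challenge.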