In the past week, various people have been telling me about their initiatives in both enterprise data mining and ultrawideband (UWB) wireless networks. Besides growing momentum, these IT domains share a common demand for a statistical approach to system design: data mining with its emphasis on inference and significance, UWB and other shared-spectrum techniques with their sensitivity to worst-case behavior of random events.
Statistics are themselves perhaps the worst-case example of a business or technical discipline that's both too rarely used and too often misused. As we build systems that are larger, that work in less controlled environments, and whose operation involves larger numbers of independent decision makers, it's vital for IT professionals to get smarter about our use of statistical methods, and about the inferences we make from the numbers we collect.
Telecommunication carriers have developed a huge body of statistical expertise on systems like the wired telephone network, although this may prove misleading if we try to apply that experience to the new world of always-on unwired access. The transactions of Web services will be, for the most part, less sensitive to split-second timing than voice communications (which have to satisfy our wetware's idea of what sounds natural). On the other hand, our models of when people want to connect, and for how long, have already failed to keep up with even the cumbersome mode of dial-up Internet access. We'll have a lot to learn about the new “normal” behavior when people can access the network at will, or when devices and software agents can access each other directly at any time.
Our need for broader statistical expertise is compounded by the tilt toward business intelligence and decision support, rather than mere bookkeeping, as the mission of new enterprise IT projects. Just as we would not willingly build a system that tried to do accounting with 8-bit integers, we owe it to business unit managers to warn them when they're asking for a system that will mislead them with impressive but ill-founded analyses.
It's a dangerously mixed blessing, moreover, that statistical tools are constantly getting easier to use. When anyone can find the best straight-line fit to a collection of data points, it's inevitable that more relationships will be “found” that actually have no statistical significance, and even more whose statistical significance says nothing about a cause-and-effect connection.
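To make that concrete, here is a minimal sketch, not a rigorous study. It fits a straight line to many pairs of series that are unrelated by construction and counts how often the fit looks “significant” anyway. The function names, the sample size of 20 points, the 2,000 trials, and the cutoff of 0.444 (the approximate two-sided 5 percent critical value of the correlation coefficient for 20 points) are all my own illustrative assumptions.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares straight line through the points; returns (slope, intercept, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Assumption: approximate two-sided 5% critical value of |r| for 20 points
# (18 degrees of freedom), valid for independent, normally distributed data.
R_CRITICAL = 0.444


def false_positive_fraction(trials=2000, points=20, seed=1):
    """Fraction of unrelated random pairs whose fitted line looks 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(points)]
        ys = [rng.gauss(0, 1) for _ in range(points)]  # independent of xs by construction
        _, _, r = fit_line(xs, ys)
        if abs(r) > R_CRITICAL:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    fraction = false_positive_fraction()
    print(f"'Significant' fits among unrelated pairs: {fraction:.1%}")
```

Under these assumptions, roughly one unrelated pair in twenty should clear the bar by chance alone, so an analyst who tries enough pairings is all but guaranteed to “find” something.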
The other major problem in statistics is everyday misuse of the expression, “the law of averages.” I'm constantly surprised to hear people use this phrase to suggest that because things have gone one way for longer than we would expect, they're soon bound to swing in the other direction to even things out. This is bad enough when people think that a coin that comes up “heads” four times in a row is somehow more likely to come up tails the next time; it's worse when they don't even consider the possibility that this coin might have heads on both sides, and that the information they've gathered so far about its behavior should be used to adjust their model of the process.
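The coin example can be sketched as a simple application of Bayes' rule. This is only an illustration: the function names are hypothetical, and the 1 percent prior chance that the coin is two-headed is a number I have assumed for the sake of argument.

```python
def posterior_two_headed(heads_in_a_row, prior):
    """Probability the coin is two-headed after seeing only heads (Bayes' rule)."""
    likelihood_two_headed = 1.0                 # a two-headed coin always shows heads
    likelihood_fair = 0.5 ** heads_in_a_row     # a fair coin does so by luck
    evidence = prior * likelihood_two_headed + (1 - prior) * likelihood_fair
    return prior * likelihood_two_headed / evidence


def prob_next_heads(heads_in_a_row, prior):
    """Chance that the next toss is heads, given the updated model of the coin."""
    p = posterior_two_headed(heads_in_a_row, prior)
    return p * 1.0 + (1 - p) * 0.5


if __name__ == "__main__":
    prior = 0.01  # assumed: 1 coin in 100 is two-headed
    for n in (0, 4, 10):
        print(n, round(posterior_two_headed(n, prior), 4), round(prob_next_heads(n, prior), 4))
```

With any nonzero prior, every additional head pushes the chance of another head above one half, never below it. The evidence argues for revising the model, not for betting on tails.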
I've previously suggested that biology, rather than physics or mathematics, may be the pure science that has the most to offer to our thinking about our future IT systems. If there's one thing that biology researchers can do, it's define a “null hypothesis” (a statement that what they hope to find is not, in fact, the case) and then set themselves to the task of proving that null hypothesis wrong. I'd sooner see a software company, for example, prove that a system is not insecure than demand that skeptics show them where its vulnerabilities lie.
Getting smart about risks, uncertainties, bursts of demand and burdens of inference from data is a professional challenge that deserves our determined response.