Thursday, December 9, 2010

The Coming Data Center Singularity: How Fabric Computing Must Evolve

The next generation in data center structure will be fabric-based computing, but the fabric will be two full steps beyond today’s primitive versions. First, the fabric will include network switching and protection capabilities embedded within. Second, the fabric will incorporate full energy management capabilities: electric power in, and heat out.
Ray Kurzweil describes the Singularity as that moment when the ongoing increase in information and related technologies provides so much information that the sheer magnitude of it overwhelms traditional human mental and physical capacity. Moore’s law predicts this ongoing doubling of the volume of available computing power, data storage, and network bandwidth, at constant cost. There will come a time when the volume of information suddenly present will overwhelm our capacity to comprehend it. In Dr. Kurzweil’s utopian vision, humanity will transcend biology and enter into a new mode of being (which has resonances with Pierre Teilhard de Chardin’s Noosphere).
Data centers will face a similar disruption, but rather sooner than Dr. Kurzweil’s 2045 prediction. Within the next ten years, data centers will be overwhelmed. Current design principles rely on distinct cabling systems for power and information. As processors, storage, and networks all increase capacity exponentially (at constant cost), the demands for power and the need for connectivity will create a rat’s nest of cabling, compounded by ever-increasing requirements for heat dissipation technology.
There will be occasional reductions in power consumption and physical cable density, but these will not avoid the ultimate catastrophe, only defer it for a year or two. Intel’s Nehalem chip technology is both denser and less power-hungry than its predecessor, but such improvements are infrequent. The overall trend is towards more connections, more electricity, more heat, and less space. These trends proceed exponentially, not linearly, and in an instant our data center capacity will run out.
Steady investment in incremental improvements to data center design will be overrun by this deluge of information, connectivity, and power density. Organizations will freeze in place as escalating volumes of data overwhelm traditional configurations of storage, processors, and network connections.
The only apparent solution to this singularity is a radical re-think of data center design. As power and network cabling are the symptoms of the problem, an organizational layout that eliminated these complexities would defer, if not completely bypass, the problem. By embedding connectivity, power, and heat (collectively called energy management) in the framework itself, vendors will deliver increasingly massive compute capabilities in horizontally-extensible units – be they blades, racks, or containers.

Wednesday, December 8, 2010

The Software Product Lifecycle

Traditional software development methodologies end when the product is turned over to operations. Once over that wall, the product is treated as a ‘black box’: Deployed according to a release schedule, instrumented and measured as an undifferentiated lump of code.
This radical transition from the development team’s tight focus on functional internals to the production team’s attention to operational externals can impair the service the product is intended to deliver. The separation between development and production establishes a clear, secure, auditable boundary around the organization’s set of IT services, but it also discourages the flow of information between those organizations.
The ITIL Release Management process can improve the handoff between development and production. To understand how, let’s examine the two processes and their interfaces in more detail.
Software development proceeds from a concept to a specification, then by a variety of routes to a body of code. While there are significant differences between Waterfall, RAD, RUP, Agile and its variants (Scrum, Extreme Programming, etc.), the end result is a set of modules that need to supplement or replace existing modules in production. Testing of various kinds assesses the fitness for duty of those modules. During the 1980s I ran the development process for the MVS operating system (predecessor of z/OS) at IBM in Poughkeepsie and participated in the enhancement of that process to drive quality higher and defect rates lower. The approach to code quality improvement echoes other well-defined quality improvement processes, especially Philip Crosby’s “Quality Is Free.” Each type of test assesses code quality along a different dimension.
Unit test verifies the correctness of code sequences, and involves running individual code segments in isolation with a limited set of inputs. This is usually done by the developer himself.
Function/component test verifies the correctness of a bounded set of modules against a comprehensive set of test cases, designed jointly with the code, that exercise both “edge conditions” and expected normal processing sequences. This step validates the correctness of the algorithms encoded in the modules: for instance, the calculated withholding tax for various wage levels should be arithmetically correct and conform to the relevant tax laws and regulations.
Function/component test relies on a set of test cases, which are programs that invoke the functions or components with pre-defined sets of input to validate expected module behavior, side effects, and outputs. As a general rule, for each program step delivered the development organization should define three or four equivalent program steps of test case code. These test cases should be part of the eventual release package, for use in subsequent test phases – including tests by the release, production, and maintenance teams.
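To make the withholding-tax example concrete, here is a minimal sketch of what such test cases might look like. The bracket table, the function name withholding_tax, and every figure are invented for illustration and correspond to no real tax code; the point is only that the cases deliberately sit on and around the bracket boundaries (the “edge conditions”) as well as in the middle of each range.

```python
from decimal import Decimal

# Hypothetical marginal brackets: (upper wage limit, rate).
# Invented for illustration; None means "no upper limit".
BRACKETS = [
    (Decimal("1000"), Decimal("0.00")),
    (Decimal("5000"), Decimal("0.10")),
    (None, Decimal("0.20")),
]


def withholding_tax(wage):
    """Return the tax withheld on a wage using the hypothetical brackets."""
    if wage < 0:
        raise ValueError("wage must not be negative")
    tax = Decimal("0")
    lower = Decimal("0")
    for upper, rate in BRACKETS:
        if upper is None or wage <= upper:
            # The wage ends inside this bracket.
            tax += (wage - lower) * rate
            break
        # The wage passes through the whole bracket.
        tax += (upper - lower) * rate
        lower = upper
    return tax.quantize(Decimal("0.01"))


# Test cases as data: (input wage, expected tax).
TEST_CASES = [
    (Decimal("0"), Decimal("0.00")),       # edge: lowest legal input
    (Decimal("1000"), Decimal("0.00")),    # edge: top of first bracket
    (Decimal("1001"), Decimal("0.10")),    # edge: just into second bracket
    (Decimal("3000"), Decimal("200.00")),  # normal: middle of second bracket
    (Decimal("5000"), Decimal("400.00")),  # edge: top of second bracket
    (Decimal("6000"), Decimal("600.00")),  # normal: third bracket
]


def run_test_cases():
    """Run every case and return the list of failures (empty means all passed)."""
    failures = []
    for wage, expected in TEST_CASES:
        actual = withholding_tax(wage)
        if actual != expected:
            failures.append((wage, expected, actual))
    return failures


if __name__ == "__main__":
    print(run_test_cases())
```

Because the cases are plain data, the same table can travel with the release package and be rerun by the release, production, and maintenance teams.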
(Note that we avoid the notion of “code coverage” as this is illusory. It is not possible to cover all paths in any reasonably complex group of modules. For example, consider a simple program that had five conditional branches and no loops. Complete coverage would require 32 sets of input, while minimal coverage would require 10. Moderately complex programs have a conditional branch every seven program steps, and a typical module may have hundreds of such steps: coverage of one such module with about 200 lines of executable code would require approximately 2**28 variations, or over 250 million separate test cases.
For a full discussion of the complexity of code verification, see “The Art of Software Testing,” by Glenford Myers. This book, originally published in 1979 and now available in a second edition with additional commentary on Internet testing, is the strongest work on the subject.)
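The arithmetic behind those figures can be checked in a few lines. This is a sketch under simplifying assumptions: every branch is two-way and independent of the others, there are no loops, and the “minimal coverage” figure of 10 is read as the two outcomes of each of the five branches. The function names are ours, not part of any tool.

```python
def full_path_count(branches):
    """Distinct paths through loop-free code with n independent two-way branches."""
    return 2 ** branches


def branch_outcome_count(branches):
    """Distinct branch outcomes to exercise: each branch taken and not taken."""
    return 2 * branches


def estimated_branches(executable_steps, steps_per_branch=7):
    """Rough number of conditional branches, at one branch per seven steps."""
    return executable_steps // steps_per_branch


if __name__ == "__main__":
    print(full_path_count(5))        # 32
    print(branch_outcome_count(5))   # 10
    branches = estimated_branches(200)
    print(branches)                  # 28
    print(full_path_count(branches)) # 268435456
```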
System test validates operational characteristics of the entire environment, but not the correctness of the algorithms. System test consists of a range of specific tests of the whole system, with all new modules incorporated on a stable base. These are:
Performance test: is the system able to attain the level of performance and response time the service requires? This test involves creating a production-like environment and running simulated levels of load to verify response time, throughput, and resource consumption. For instance, a particular web application may be specified to support 100,000 concurrent users submitting transactions at the rate of 200 per second over a period of one hour.
Load and stress test: How does the system behave when pressed past its design limits? This test also requires a production-like environment running a simulated workload, but rather than validating a target performance threshold, it validates the expected behavior beyond those thresholds. Does the system consume storage, processing power, or network bandwidth to the degree that other processes cannot run? What indicators does the system provide to alert operations that a failure is imminent, ideally so automation tools could avert a disaster (by throttling back the load, for instance)?
Installation test: Can the product be installed in a defined variety of production environments? This test requires a set of production environments that represent possible real-world target systems. The goal is to verify that the new system installs on these existing systems without error. For instance, what if the customer is two releases behind? Will the product install or does the customer first have to install the intermediate release? What if the customer has modified the configuration or provided typical add-on functionality? Does the product install cleanly? If the product is intended to support continuous operations, can it be installed non-disruptively? Can it be installed without forcing an outage?
Diagnostic test: When the system fails, does it provide sufficient diagnostic information to correctly identify the failing component? This requires a set of test cases that intentionally inject erroneous data to the system causing various components to fail. These tests may be run in a constrained environment, rather than in a full production-like one.
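Before building the simulated workload for the performance test described above, it helps to translate the specification into the numbers a load generator needs. The sketch below uses the figures from the web application example (100,000 concurrent users, 200 transactions per second, one hour). It assumes the load is spread evenly across users, which a real workload would not do; the function name is ours.

```python
def load_profile(concurrent_users, tx_per_second, duration_seconds):
    """Derive workload totals from a performance specification.

    Assumes transactions are spread evenly across all users.
    """
    return {
        "total_transactions": tx_per_second * duration_seconds,
        "seconds_between_tx_per_user": concurrent_users / tx_per_second,
        "tx_per_user": tx_per_second * duration_seconds / concurrent_users,
    }


if __name__ == "__main__":
    profile = load_profile(100_000, 200, 60 * 60)
    print(profile["total_transactions"])           # 720000
    print(profile["seconds_between_tx_per_user"])  # 500.0
    print(profile["tx_per_user"])                  # 7.2
```

On these assumptions each simulated user submits a transaction roughly every eight minutes, so most of the 100,000 sessions are idle at any instant; the test environment has to hold that many open sessions, not just sustain the transaction rate.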
The QA function may be part of the development organization (traditional) or a separate function, reporting to the CIO (desirable). After the successful completion of QA, the product package moves from development into production. The release management function verifies that the development and QA teams have successfully exited their various validation stages, and performs an integration test, sometimes called a pre-production test, to ensure that the new set of modules is compatible with the entire production environment. Release management schedules these upgrades and tests, and verifies that no changes overlap. This specific function, called “collision detection”, can be incorporated into a Configuration Management System (CMS), or Configuration Management Database (CMDB), as described in ITIL version 3 and 2 respectively.
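Collision detection can be sketched in a few lines. The version below assumes two scheduled changes collide when their time windows overlap and they touch at least one configuration item in common; that definition, the Change record, and all the names and dates are hypothetical, not taken from ITIL or from any particular CMDB product.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Change:
    """A scheduled change (hypothetical record layout)."""
    change_id: str
    items: frozenset   # configuration items the change touches
    start: datetime
    end: datetime


def collides(a, b):
    """True if the two changes overlap in time and share a configuration item."""
    overlap_in_time = a.start < b.end and b.start < a.end
    return overlap_in_time and bool(a.items & b.items)


def find_collisions(changes):
    """Return the IDs of every colliding pair of changes."""
    return [
        (a.change_id, b.change_id)
        for a, b in combinations(changes, 2)
        if collides(a, b)
    ]


# Illustrative schedule: C1 and C2 both touch "db-01" in overlapping windows.
SCHEDULE = [
    Change("C1", frozenset({"db-01", "app-01"}),
           datetime(2010, 12, 11, 1, 0), datetime(2010, 12, 11, 3, 0)),
    Change("C2", frozenset({"db-01"}),
           datetime(2010, 12, 11, 2, 0), datetime(2010, 12, 11, 4, 0)),
    Change("C3", frozenset({"web-01"}),
           datetime(2010, 12, 11, 2, 0), datetime(2010, 12, 11, 4, 0)),
]

if __name__ == "__main__":
    print(find_collisions(SCHEDULE))  # [('C1', 'C2')]
```

The pairwise comparison is quadratic in the number of changes, which is adequate for a weekly release calendar; a large change schedule would want the windows sorted or indexed first.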
Ideally this new pre-production environment is a replica of the current production one. When it is time to upgrade, operations transfers all workloads from the existing to the new configuration. Operations preserves the prior configuration should problems force the team to fall back to that earlier environment. So at any point in time, operations controls three levels of the production environment: the current running one, called “n”, the previous one, called “n-1”, and the next one undergoing pre-production testing, called “n+1”.
Additional requirements for availability and disaster recovery may force the operations team to also maintain a second copy of these environments – an “n-prime” copy of the “n” level system and possibly an “n-1-prime” copy of the previous good system, for fail-over and fall-back, respectively.
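The three levels and the moves between them can be modeled as a small state holder. This is a sketch, not a description of any real tool: the class and method names are ours, release labels are plain strings, the “prime” copies are left out, and the choice to return a failed release to the “n+1” slot for further testing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductionLevels:
    """Tracks the n-1, n, and n+1 environments (hypothetical model)."""
    previous: Optional[str]   # "n-1": last known good configuration
    current: str              # "n": running production
    candidate: Optional[str]  # "n+1": in pre-production testing

    def promote(self):
        """Cut over to n+1; the old n is preserved as the new n-1."""
        if self.candidate is None:
            raise RuntimeError("no n+1 environment to promote")
        self.previous, self.current, self.candidate = (
            self.current, self.candidate, None)

    def fall_back(self):
        """Return to n-1; the failed release goes back to the n+1 slot."""
        if self.previous is None:
            raise RuntimeError("no n-1 environment to fall back to")
        self.candidate, self.current, self.previous = (
            self.current, self.previous, None)


if __name__ == "__main__":
    levels = ProductionLevels(previous="r1", current="r2", candidate="r3")
    levels.promote()
    print(levels)  # previous='r2', current='r3', candidate=None
    levels.fall_back()
    print(levels)  # previous=None, current='r2', candidate='r3'
```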
When operations (or the users) detect an error in the production environment, the operations team packages the diagnostic information and a copy of the failing components into a maintenance image. This becomes the responsibility of a maintenance function, which may be a separate team within operations, within development, or simply selected members of the development team itself. Typically this maintenance team is called “level 3” support.
Using the package, the maintenance function may provide a work-around, provide a temporary fix, develop a replacement or upgrade, or specify a future enhancement to correct the defect. How quickly the maintenance team responds follows from the severity level and the relevant service level agreements governing the availability and functionality of the failing component. High severity problems require more rapid response, while lower severity issues may be deferred if a work-around can suffice for a time.
Note that the number of severity levels should be small: three or four at most. Also, the severity levels should be defined by the business owner of the function. Users should not report problems as having high severity simply to get their issues fixed more rapidly; the number of high severity problems should be a small percentage of the total volume of problems over a reasonable time period.
The successful resolution of problems, and the smooth integration of new functions into the production environment, requires a high degree of communication and coordination between development, QA, and production. Rather than passing modules over a wall, they should be transferred across a bridge: The operations team needs access to test cases and test results along with the new code itself, for diagnostic purposes; and the development team benefits from diagnostic information, reports on failure modes and problem history, and operational characteristics such as resource consumption and capacity planning profiles. The ITIL release management function, properly deployed, provides this bridging function.

Sunday, September 26, 2010

The Blanchard Bone and the Big Bang of Consciousness

Found in a cave in southwestern France, the Blanchard Bone is a curious artifact. It appears to be between 25,000 and 32,000 years old. It is only four inches long. It is engraved with about 69 small figures, arranged in a sequence shaped like a flattened figure eight. Archeologists tell us that the carving required twenty-four changes of point or stroke.

What is it? Looking closely at the carving, it seems that the 69 images represent the phases of the moon over two lunar months (image courtesy of Harvard University, Peabody Museum). It isn’t writing: That was still 20,000 to 27,000 years – over four hundred generations and two ice ages – in the future.

What would the night sky mean to our ancestors so long ago? The Sun was directly responsible for heat and light, and defined the rhythm of days. But the Moon moved so slowly, in comparison. What was it? What did our fathers and mothers think they were looking at, when the Moon rose and traveled across the sky, changing its shape from day to day, but always following the same pattern?

Yet the Moon’s travels had implications and meanings: The Sea responded to the Moon in its tides – or did the tides somehow pull the Moon along? How did that happen? What was going on between the Moon and the Sea?

The ancient artist/scientist/priest who carved this artifact carved what she saw – and more. The artifact was useful, for knowing when to plant, when the birds, the herds, or the fish were migrating, when it might be a good time to find a warm cave. The Moon measured fertility and gestation. When people speculated on this, they began to think – about what they saw, and what it meant.

Some wondered if the bone might be magically linked to the Moon and the Sea. Who among them could not be perplexed by the gymnastics, by the dance, of the Moon?

What would that inevitable nighttime procession inspire? How many nights could people look at the slow but predictable change, observe its correlations, and not be challenged to wonder? The first instances of human reasoning could have been inspired by this persistent phenomenon.

In “Hamlet’s Mill: An Essay Investigating the Origins of Human Knowledge and its Transmission through Myth,” Giorgio de Santillana and Hertha von Dechend propose that the myths, as Aristotle taught, are about the stars. The authors trace the myth of Hamlet back to Amlodhi, who owned a magical mill. Once it ground out peace, but it fell off its axle and on the beach ground out sand, and now it has fallen into the Sea where it grinds out salt. In their essay, the authors reveal that this myth is a story to capture and preserve the observation of the precession of the equinoxes. This is a 25,950-year-long cycle, during which the Earth’s North Pole traces a great circle through the heavens. Now the North Pole points to the Pole Star in Ursa Minor, but in 13,000 years it will point to Vega. Only a medium as persistent as a story could span the ages, capturing and preserving this observation.

When the Blanchard bone was formed, the sky was much as it will appear tonight. Between then and now we have passed through one Great Year. When the North Pole pointed to Vega last, our species was beginning to colonize the Western hemisphere, the ice age was capturing water and lowering the seas, and the Blanchard bone had been lost for ten thousand years.

Let us remember that ancient scientist/artist/priest, let us regard her qualities of observation, synthesis, and imagination with wonder: her discovery in the sky urged us to consciousness, communication, and endless wonders beyond.