Saturday, January 13, 2018

At Trend Micro's Sales Kick-Off in Vancouver on January 9, I participated in a panel on Industry Futures with Greg Young and Vikram Phatak (of NSS Labs), led by Sanjay Mehta. To close, I offered a commentary on the failure of some organizations' DevOps approach.

In the form of a rap. Here it is:


Raw unfiltered| code is the best
Don’t waste your resource| on docs or on test

Users deserve| pure code unalloyed 
DevOps will leave| these folks overjoyed

The user is tops| the user is king 
Custom crafted code, man| that’s the thing

Straight from the artisan| coder to you
No time to waste| on design or review

Change Review Boards| are for losers and fools
Spend on developers| not testers or tools

If the build’s nearly clean| then drop to the Web
Users will know your heart| rules your head

Farm-to-table code yeah man| that’s the deal 
Organic, all natural,| unfiltered, and real. 


Saturday, January 16, 2016

Daddy, What Does an Enterprise Architect do?

I am an Enterprise Architect. I help companies migrate data centers. That can mean moving from one city to another; it can mean upgrading an existing data center to newer hardware and different or more current versions of applications and middleware; or it can mean moving from in-house platforms to the cloud. The title Enterprise Architect (EA) can be daunting; the concept is relatively new to our discipline, and there isn’t much written describing how it fits conceptually into the various roles that make up normal IT operations. Over dinner, I described my job to a friend using this analogy:
Suppose you wanted to move from one house to another. Typically you would pack everything up, rent a truck or schedule a moving van, load it up, drive to the new house, unpack everything, get settled in, and maybe go back to the old place and clean it out. The Moving Architect would figure out the number of different types of boxes to buy, for books, for clothing and bedding, for the china, for any art. The "bill of materials" would list the furniture, the number and sizes of the boxes, and any other items needing special handling. The Moving Architect would suggest how big a truck to rent, how many movers to hire, how long the move would take, and how much it would cost.
Moving a data center is more complicated. Most organizations cannot tolerate an extended outage, so the challenge is more like moving from one house to another without disrupting the daily routines of the people living in the house, or of the people who help out at either house: the landscapers, the realtors who want to show the old house, and the trash collectors whose vehicles cannot be blocked.
The Enterprise Architect has to consider the family’s daily activities. When does the bus pick up the kids? On moving day, will the kids know to get on to the new bus to take them home to the new house? The EA needs to know which days are school days and which aren’t, and which days the kids might have after-school activities and how long they might take. To move non-disruptively, the EA will stock the fridge and the cupboard in advance, but some foods spoil over time, so that preparation step has to be timed to not waste resources.
Since moving furniture cannot happen instantaneously, the EA will have to fit out the new house with furniture, bedding, towels, and some clothing in advance. The EA has to make sure the utilities are on and the home is ready to occupy. And in preparation for the move, the EA has to lead the family in a dry run for the move, without interrupting their normal daily activities. The EA will provide documentation on how to use the new home features.
The Enterprise Architect has to understand the patterns of use of the IT resources across time, to create a safe, secure, recoverable plan to migrate work non-disruptively from one set of IT infrastructure to another. So the EA will ask questions of the workers that seem as trivial and pointless as asking the kids what they want for breakfast: if the answer is oatmeal, then the shopping list needs to be updated, the utilities have to be up and in place, the cookware that was used on the old gas range may have to be replaced to avoid damaging the new electric radiant heat ceramic stovetop, and the recipe may need to be updated to accommodate that new stove’s different heating and cooking times. Knowing how the IT resource is used in detail helps the EA guide the migration.

This analogy is structured specifically to evoke parallels to the Zachman Framework. What does an Enterprise Architect do? The EA generates an optimal isomorphic mapping of one instantiated Zachman Framework into another.

Wednesday, April 22, 2015

Begin Again, starring Mark Ruffalo and Keira Knightley, written and directed by John Carney



First, a confession: I didn’t know that Keira Knightley could sing. She can. And I didn’t know that Adam Levine could act. He can.
Begin Again starts with a seemingly random evening at an open-mike night in a Village dive. A British girl sings a pretty ballad, and a drunk is transfixed by the song. Then the magic starts to happen. We jump back to the start of the drunk’s day. He’s the co-founder of an independent music label, having a rough time. By the time he ends up at the dive, he’s been fired, punched, embarrassed in front of his 14-year-old daughter, and reminded that he can talk to God. “But what if God doesn’t answer?” He is close to ending his life. That guy is played by Mark Ruffalo. He hears her song and magic happens again: he sees what that pretty ballad could be with the right production behind it.
Then we jump back to the circumstances that brought the Brit to the dive. She’s the girlfriend and co-writer of a musician who has just broken through into the big time – record deal, fawning assistants from the label, US tour in the works, an album in progress. She gets deftly moved into the background as fame swallows up the singer. He has an affair, she leaves, and she ends up at the dump belonging to a friend, where magic happens. The friend suggests she get out of the place and come hear him sing at this dive. There, he talks her into singing one of the songs she’s written, “for anyone who’s ever been alone in New York.” You will remember that song. That girl is Keira Knightley. The unfaithful boyfriend is Adam Levine. I found that not only could he act, but he did such a good job that I really don’t like him – based on the character he portrayed!
The story is exquisitely crafted – honest, with full characters in every role. I wanted to know more about each individual who spoke; there were no incidental interactions. There were wonderful moments of homage to great films, too. One of my favorites follows Keira asking a seemingly innocent question of Mark, which causes him to freeze up and walk away. He steps outside, then stops as if to turn back and explain, then turns forward and walks away. This beautiful moment captures and extends the moment in Love Actually where Keira Knightley realizes that her new husband’s friend doesn’t hate her, but actually is deeply in love with her. He nearly runs out of his apartment, then stops, turns, and turns away: choreographically identical, emotionally powerful and honest – and artful.
In a few words the rapper Troublegum (played by CeeLo Green) gives us the theme of the movie: “When a man like that falls on hard times, people forget who he is. They don’t give him the respect he deserves.”   
The title recalls Robert Preston and Mary Tyler Moore’s Finnegan Begin Again, with echoes of the lines from Finnegans Wake: “Us, then. Finn, again! Take.” The movie is strong enough to carry both references.
The broader theme of the movie is Renewal. Mark’s character, with the help and support of his friends old and new, picks up the pieces and begins again, even stronger than he was before. Events transform him. His healing resurrects his family, his business, his friends, and the city of New York. It is one of the best movies I’ve ever seen. http://beginagainfilm.com/#/home

Sunday, December 22, 2013

How Fast Does Water Freeze?



Recently there have been some comments on Facebook and elsewhere explaining why hot water freezes more rapidly than cold water. Since hot water does not freeze more rapidly than cold water, these comments give us a chance to think about the nature of heat and the use of a scientific theory.

Heat and temperature are related but different phenomena. Heat is the total quantity of thermal energy in a lump of matter, so it depends on the size of the lump. Temperature describes how concentrated that energy is, roughly the average energy per particle, regardless of size. That is, a large cup of coffee may have the same temperature as a small cup of coffee, but the large cup holds more heat than the small one.

In my high school physics class, our teacher explained that the amount of heat generated by an engine was a constant, regardless of where the heat went. One of the students asked, “Then how come drag racers put wide tires on the back? Wouldn’t the heat be the same whether the tires were large or small?” The class was stumped for a moment. The answer is that since the total amount of heat is the same, distributing it through a broad tire raises the temperature far less than distributing it through a small tire: the bigger tire wouldn’t melt.
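
A quick back-of-the-envelope calculation makes the point. The figures below for the heat delivered, the specific heat of rubber, and the tire masses are illustrative assumptions, not drag-racing data; what matters is the relationship: temperature rise equals heat divided by mass times specific heat.

    # Temperature rise for the same heat dumped into tires of different mass.
    # Q, c, and the masses are assumed values used only for illustration.
    Q = 2.0e6          # heat absorbed by the tire, joules (assumed)
    c = 2000.0         # specific heat of rubber, J/(kg*K) (approximate)
    for mass_kg in (10.0, 40.0):   # narrow tire vs. wide tire (assumed masses)
        delta_t = Q / (mass_kg * c)
        print(f"{mass_kg:5.1f} kg tire: temperature rise ~{delta_t:6.1f} K")
    # The wide tire absorbs the same heat but warms only a quarter as much.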

The same argument applies to the freezing phenomenon. The classic conundrum claims that if you put a cup of hot water next to a cup of cold water in a freezer, they both finish freezing at the same time, which would mean the hot water cooled, and so froze, faster than the cold water.

This is an example of a flawed theory. A scientific theory offers an explanation of an observed phenomenon that can be disproved. In the case of the freezing water, the idea that hot water freezes more quickly can be disproved simply by placing the hot water in the freezer and timing how long it takes to freeze, then placing the cold water in the freezer and timing how long it takes to freeze. While the hot water is cooling, at some point it will reach the same temperature as the cold water. From that point on, the question is how long two identical amounts of water, at the same temperature, take to freeze; the water’s history doesn’t matter. When the hot water and the cold water are in the same freezer, the hot water warms up the cold water while the freezer removes heat from all its contents.
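
The argument can be sketched in a few lines of Python, assuming simple Newtonian cooling and ignoring the latent heat of freezing; the rate constant, the freezer temperature, and the starting temperatures are made-up values.

    # Newton's law of cooling: dT/dt = -k * (T - T_freezer).
    # K, T_FREEZER, and the starting temperatures are assumed values.
    K = 0.01            # cooling rate constant per second (assumed)
    T_FREEZER = -18.0   # freezer air temperature, degrees C (typical setting)

    def seconds_to_reach(start_c, target_c, dt=1.0):
        """Step the cooling curve forward until the water reaches target_c."""
        t, temp = 0.0, start_c
        while temp > target_c:
            temp += -K * (temp - T_FREEZER) * dt
            t += dt
        return t

    hot_total = seconds_to_reach(90.0, 0.0)            # hot cup, start to 0 C
    cold_total = seconds_to_reach(20.0, 0.0)           # cold cup, start to 0 C
    hot_after_catchup = seconds_to_reach(20.0, 0.0)    # hot cup once it reaches 20 C
    print(hot_total > cold_total)           # True: the hot cup takes longer overall
    print(hot_after_catchup == cold_total)  # True: from 20 C on, history is irrelevant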

A theory that cannot be disproved is not a scientific theory. The theory of evolution, like the theory of gravity, explains an observed set of phenomena and allows for predictions that can be either validated or disproved. Dismissing a guess because it is “only a theory” is fine. Dismissing a scientific theory as “a mere theory” is thoughtless.

Monday, June 17, 2013

Parables for Architects: Untangling Requirements

The business user comes to you and says, “I need you to transport something from Poland to New York.”
You ask, what is it?  
“It’s worth $30,000,000.”  
Can you tell me anything more about it?
“It is a form of carbon.”
Thanks for that. Are there any other constraints?
“I need it in New York within six months.”
So now you have enough information to begin … what? Nothing, in fact. If the client wants you to transport $30M of coal, you’ll need five cargo ships. If the client wants you to transport $30M of diamonds, you’ll need a bonded courier with a briefcase.
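Rough numbers show how far apart those two solutions are; the prices and the ship capacity below are ballpark assumptions for illustration, not quotes.
    # Ballpark comparison of the two readings of "transport $30M of carbon".
    # All prices and capacities are rough assumptions.
    budget_usd = 30_000_000
    coal_price_per_tonne = 60.0        # assumed market price, USD per tonne
    bulk_carrier_capacity = 100_000.0  # assumed tonnes per cargo ship
    coal_tonnes = budget_usd / coal_price_per_tonne
    ships = coal_tonnes / bulk_carrier_capacity
    print(f"Coal: {coal_tonnes:,.0f} tonnes, roughly {ships:.0f} cargo ships")
    diamond_price_per_gram = 25_000.0  # assumed wholesale price, USD per gram
    diamond_grams = budget_usd / diamond_price_per_gram
    print(f"Diamonds: about {diamond_grams:,.0f} grams, one courier with a briefcase")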
The business user describes what they feel are the most relevant aspects of the problem they want IT to help solve. The business user does not know what decisions IT actually has to make on the way to delivering the solution.
You, the IT architect, perform a complex role. You need to understand the comparative value of the pieces of your organization’s IT infrastructure. You need to understand the IT elements of the business problem the user has to solve. You need to translate the business requirement into the most fit-for-purpose technological solution. You need to flesh out the technical requirements, understand the user’s priorities, and nail down the missing requirements. User input is crucial, but it is generally not sufficient. Users should not have to understand technology trade-offs; IT architects must.
An IT architect who knows only one technological capability adds no value. Like a stopped clock, this IT-architect-in-utero has only one answer to every question. An effective IT architect must evaluate technology alternatives based on experience and fact, going beyond lore, bias, or taste.
If the business customer has decided to deploy only one IT technology, they will not need an IT architect; there is no problem to solve. The advantage of this simplification is that it saves time and cost. The only disadvantage is that not all business problems fit the same infrastructure. Misaligning the technology and the business gives the users a brittle and ineffective infrastructure. You, the IT architect, know that you don’t cut meat with scissors.
Even if the user does not ask, you, the IT architect, must develop a solution that meets the functional and non-functional requirements the user relies on. Users do not understand the difference between 99.9% availability and 99.9999% availability. They may ask for six-nines, but they may not realize that the cost of the solution may be an order of magnitude greater than the cost of the three-nines alternative. They may not understand the implications of rapid problem determination, or ease of migration, or resiliency, or fast failover. You, the IT architect, must use your mental checklist of systems and network management requirements to complete the user’s requirements before your job is done.
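The downtime arithmetic behind those availability figures is worth spelling out; this sketch uses a plain 365-day year and assumes nothing about the cost of either solution.
    # Allowed downtime per year for different availability targets.
    # Straight arithmetic on a 365-day year; no cost figures are assumed.
    HOURS_PER_YEAR = 365 * 24
    for label, availability in [("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("six nines", 0.999999)]:
        downtime_minutes = HOURS_PER_YEAR * (1 - availability) * 60
        print(f"{label:>12}: {downtime_minutes:8.1f} minutes of downtime per year")
The jump from three nines to six nines shrinks the allowance from roughly eight and three-quarter hours a year to about half a minute, which is why the price difference is so large.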
Some corporations deploy a configuration management system or CMDB to help track the infrastructure pieces they have; some even put together a service catalog listing groups of those services. A few have service catalogs that actually speak to users in user terms, but none offers a robust, automated alternative to you, a knowledgeable and experienced IT architect. You build that mental checklist over your professional career. When you take on a new project, you re-evaluate the checklist, removing obsolete capabilities and adding newly learned ones. That is why you chose to become an IT architect: the job is self-renewing, and it demands continuous awareness, clear thinking, reasoned analysis, and communication skills. No other industry in human history has the richness and continuously evolving knowledge base that IT offers. You are surfing the wave of the future. You, the IT architect, know that you don’t use five cargo ships to move a handful of diamonds.

Friday, June 7, 2013

Time to Prune those Processes

Anyone who has worked in IT for five years or more knows the truth of the statement: “If you automate a mess, you get an automated mess.” What can we do, and what should we do, about these productivity-sapping kludges? This discussion offers a few short-term actions and some useful longer-term approaches to get some of that productivity back.
One common automation shortcut is to transplant a paper-based process into an exact copy on web screens. The screens look just like the original paper forms, and the process reproduces the exact steps that people used.
The rationale for this relies on the “ease-of-use” and “proven design” fallacies. The business automated the process because it was inefficient, or overly burdensome, or got in the way of needed changes in other business activities. “Ease of use” in this case really means “no learning curve,” which is supposed to reassure the business area that none of their people will have to learn anything new or different. It presumes that the original system was easy to use. Often it was not.
“Proven design” in this case means “we don’t want to do any business process analysis.” The savings in design and analysis shorten the front end of the implementation, but because the design is never made explicit, those well-worn processes remain obscure. This is a problem because unless the implementation team accurately describes the process, the translation from hard copy to soft copy will create defects. How could that be? In manual processes, users learn to avoid many options they should not take. In computerized processes, the software must handle every possible decision a user could make. The implementation team does not understand the tribal lore the users follow, so the implementers must guess what to do in every undocumented case.
That is, the implementation team ends up doing business process analysis under the worst possible conditions:
  •  They do not have the skills to do the job
  •  They do not have the time to do the job
  •  They do not have the tools or procedures to do the job
  •  They do not have formal access to user experts to help with the job, and
  •  They are not evaluated on how well they do the job

Another way to state this is to observe that every system has an architecture. The only question is, does the implementation team know what it is before they begin? The default architecture is inconsistent and incomplete.
Rushing to get from the old (paper-based) process to the new (automated) process on time, the IT organization abandons its core competence, which is to make complex processes clear. Fixing a kludge like this means first identifying and removing gross process inefficiencies, and then streamlining inputs and outputs.
One easy way to find and fix inefficiencies is to measure the elapsed time that process steps take, and use that measurement to extract any activities that are not adding value to the underlying business activity. For instance, some organizations automate approval processes, replacing a physical form with an e-mail link to an approval web page. Look at how long each approver spends on each request over a few weeks. If you find that a person always approves a certain type of request, and does so within 30 seconds, you can conclude that they are not adding any value to the process. They probably got approval rights because they wanted to be aware of requests. The solution is to replace the “action” link with an “informational” link – and not hold up the process by waiting for that rubber stamp.
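A small script along these lines can surface the rubber stamps; the log format, the names, and the 30-second threshold are assumptions for illustration, not the schema of any particular workflow product.
    # Flag approvers who always approve and always do so almost instantly.
    # The log rows (approver, decision, seconds spent) are invented sample data.
    from collections import defaultdict

    approval_log = [
        ("alice", "approved", 12), ("alice", "approved", 9), ("alice", "approved", 15),
        ("bob", "approved", 260), ("bob", "rejected", 410), ("bob", "approved", 190),
    ]
    by_approver = defaultdict(list)
    for approver, decision, seconds in approval_log:
        by_approver[approver].append((decision, seconds))
    for approver, rows in by_approver.items():
        always_approves = all(d == "approved" for d, _ in rows)
        always_fast = all(s <= 30 for _, s in rows)
        if always_approves and always_fast:
            print(f"{approver}: rubber stamp, convert to an informational link")
        else:
            print(f"{approver}: keep as an approval step")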
Streamlining inputs and outputs can save lots of processing time and complexity. Some paper processes produced intermediate reports to measure a team’s workload. Automating these reports and – what is worse – evaluating a team based on that measurement, locks in the original inefficiencies. Every output should have a natural “consumer” who needs that information to do some other job. The other job may be in-line, meaning it waits for the output before it can start, or it may be cyclical, meaning that at some future time the other job starts using a collection of those outputs. Users regard the in-line type (sometimes called “straight-through processing”) as the real application. They may overlook the cyclical type because they do not usually do that part of the work, such as aggregating financial results, auditing, or quality assurance.
By giving the supervisor a measure of the activity, rather than a copy of the results, the process lets the supervisor focus on the business value of the activity rather than the sheer mass of the activity. This moves towards that oft-sought alignment between the IT organization and the business. If the business gets its job done with little extra activity, few procedural glitches, and optimal efficiency, the IT systems that support it are by definition aligned.
It is now spring 2013, a good time to consider pruning those processes.

Thursday, July 26, 2012

The Economic Failure of Public Cloud


The public cloud business will face severe economic challenges in 2014 and 2015, as the business model collapses. Three converging trends will rob the market of profits. First, the barrier to entry, that is, the cost of the technology that makes up public cloud, will continue to drop, following Moore’s Law. Second, the steady increase in personnel costs will attack margin performance. Finally, the commoditization of cloud services will inhibit brand loyalty. Cloud consumers will not want to become locked in to a specific cloud provider. Any attempt to distinguish one cloud from another weakens portability.

This will result in an economic model we are quite familiar with: airlines. As the larger, more mature airline companies sought better margin performance, they sold their planes and leased their fleets back from leasing companies. The airlines do not own the planes or the airports: they own information about customers, routes, demand, and costs. The largest cost airlines face is staff, and as staff longevity increases the cost of personnel steadily grows. So the airline business over the post-deregulation era consists of a regular cycle:

1. Mature airlines enter bankruptcy  
2. The industry consolidates 
3. A new generation of low-cost airlines arises 

All players experience a calm period of steady growth as lower aircraft cost, better fuel efficiency, debt relief from bankruptcy, and lower personnel costs from younger staff make the rejuvenated industry profitable for a while. 

Then the cycle starts again.

One significant difference between airlines and public cloud is the difference in the cost improvements each sector faces. Airlines improve costs in small amounts – a few percent in fuel efficiency, a few dollars more revenue from luggage, from food, increasingly extravagant loyalty programs, and so on. But technology costs have no lower boundary: Within ten years an individual consumer could buy a single computing platform with more storage and processing capacity than most current public cloud customers need.
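
The arithmetic behind that claim is simple if you grant one assumption: that capacity per dollar doubles roughly every two years. The doubling period is an assumption, and published estimates vary.

    # Capacity per dollar after ten years, assuming a doubling every two years.
    doubling_period_years = 2
    horizon_years = 10
    growth = 2 ** (horizon_years / doubling_period_years)
    print(f"Roughly {growth:.0f}x the capacity per dollar after {horizon_years} years")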

It will be as though the aircraft leasing companies could lease each passenger their own plane, bypassing the airlines entirely.

So early entrants must cope with collapsing prices just when their potential market moves to a lower-cost ownership model. Time-sharing met its market’s needs for a brief while – that moment when early demand for computing capacity far outstripped what a consumer could afford to own – then disappeared.

Public cloud computing will dissipate within five years.

Monday, February 13, 2012

On the Use, and Misuse, of Software Test Metrics


“You will manage what you measure” – Frederick W. Taylor
Testing verifies that a thing conforms to its requirements. A metric is a measurement, a quantitative valuation. So a test metric is a measurement that helps show how well a thing aligns with its requirements.

Consider a test you get in school. The goal of the test is to show that you understand a topic, by asking questions of you about the topic. Depending on the subject, the questions may be specific, fact-based (When did the USSR launch Sputnik?); they may be logic-based (Sputnik orbits the earth every 90 minutes, at an altitude of 250 km. How fast is it moving?); or they may be interpretative (Why did the Soviet Union launch the Sputnik satellite?)
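
The logic-based question has a tidy worked answer; the sketch below assumes a circular orbit and a mean Earth radius of 6,371 km.

    # One orbit every 90 minutes at an altitude of 250 km, circular orbit assumed.
    import math
    earth_radius_km = 6371.0
    altitude_km = 250.0
    period_minutes = 90.0
    circumference_km = 2 * math.pi * (earth_radius_km + altitude_km)
    speed_km_per_s = circumference_km / (period_minutes * 60)
    print(f"About {speed_km_per_s:.1f} km/s, or {speed_km_per_s * 3600:,.0f} km/h")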

Or they can be just evil: write an essay about Sputnik. Whoever provides the longest answer will pass the test.

Note that by asking similar questions we learn about the student's capabilities in different dimensions. So when a piece of software shows up, the purpose of testing should not be to find out what it does (a never-ending quest) but to find out if it does what it is supposed to do (conformance to requirements). The requirements may be about specific functions (Does the program correctly calculate the amount of interest on this loan?); about operational characteristics (Does the program support 10,000 concurrent users submitting transactions at an average rate of one every three minutes, while providing response times under 1.5 sec for 95 percent of those users as measured at the network port?); or about infrastructural characteristics (Does the program support any W3C-compliant browser?)
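
That operational requirement translates into a concrete load target; the sketch below simply restates the numbers from the question.

    # 10,000 concurrent users, each submitting one transaction every three minutes.
    users = 10_000
    seconds_between_transactions = 3 * 60
    transactions_per_second = users / seconds_between_transactions
    print(f"About {transactions_per_second:.0f} transactions per second to sustain,")
    print("with 95 percent of responses under 1.5 seconds at the network port.")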

These metrics follow from the program's intended use. Management may use other metrics to evaluate the staff: How many bugs did we find? Who found the most? How much time does it take, on average, to find a bug? How long does it take to fix one? Who created the most bugs?

The problem with these metrics is they generally misinform managers, and lead to perverse behaviors. If I am rated on the number of bugs I write, then I have a reason to write as little code as possible, and stay away from the hard stuff entirely. If I am rated on the number of bugs I find, then I am going to discourage innovations that would improve the quality of new products. So management must focus on those metrics that will meet the wider goal - produce high quality, low defect code, on time.

Software testing takes a lot of thinking: serious, hard, detailed, clear, patient, logical reasoning. Metrics are not testing; they are a side effect, and they can have unintended consequences if used unwisely. Taylor advised care when picking any metric. Often misquoted as “you can’t manage what you do not measure,” his line was meant as a warning. Lord Kelvin said “you cannot calculate what you do not measure,” but he was talking about chemistry, not management. Choose your metrics with care.

Friday, February 10, 2012

Beyond Risk Quantification

For too many years information security professionals have chased a mirage: the notion that risk can be quantified. It cannot. The core problem with risk quantification is the precision of the estimates.
Whenever you multiply two numbers, you need to understand the precision of those numbers, to properly state the precision of the result. That is usually described as the number of significant digits. When you count up your pocket change, you get an exact number, but when you size a crowd, you don't count each individual, you estimate the number of people.

Now suppose the crowd starts walking over a bridge. How would you derive the total stress on the structure? You might estimate the average weight of the people in the crowd, and multiply that by the estimated number of people on the bridge. So you estimate there are 2,000 people, and the average weight is 191 pounds (for men) and 164.3 pounds (for women), and pull out the calculator. (These numbers come from the US Centers for Disease Control, and refer to 2002 data for adult US citizens).

So let's estimate that half the people are men. That gives us 191,000 pounds, and for the women, another 164,300 pounds. So the total load is 355,300 pounds. Right?
No. Since the least precise estimate has one significant digit (2,000), the calculated result must be rounded off to 400,000 pounds.
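
The same arithmetic takes only a few lines; the significant-figure rounding helper is a generic formula written for this example, not taken from any particular library.

    # Reproduce the bridge-load estimate, then round it to the precision of the
    # least precise input: one significant figure, from the "2,000 people" estimate.
    import math

    def round_sig(x, sig):
        """Round x to sig significant figures."""
        return round(x, sig - int(math.floor(math.log10(abs(x)))) - 1)

    men = women = 1000                     # half of an estimated 2,000 people
    total_lbs = men * 191 + women * 164.3  # CDC 2002 average adult weights
    print(total_lbs)                # 355300.0, the naive answer
    print(round_sig(total_lbs, 1))  # 400000.0, honest to one significant figure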

In other words, you cannot invent precision, even when some of the numbers are more precise than others.

The problem gets even worse when the estimates are widely different in size. The odds of a very significant information security problem are vanishingly small, while the impact of a very significant information security problem can be inestimably huge. When you multiply two estimates of such low precision, and such widely different magnitudes, you have no significant digits: None at all. The mathematical result is indeterminate, unquantifiable.

Another way of saying this is that the margin of error exceeds the magnitude of the result.

What are the odds that an undersea earthquake would generate a tsunami of sufficient strength to knock out three nuclear power plants, causing (as of 2/5/12) 573 deaths? Attempting that calculation wastes time. (For more on that number, see http://bangordailynews.com/2012/02/05/news/world-news/573-deaths-certified-as-nuclear-crisis-related-in-japan/?ref=latest)

The correct approach is to ask, if sufficient force, regardless of origin, could cripple a nuclear power plant, how do I prepare for such an event?

In information security terms, the problem is compounded by two additional factors. First, information security attacks are not natural phenomena; they are often intentional, focused acts with planning behind them. And second, we do not yet understand whether the distribution of intentional acts of varying complexity (both in design and in execution) follows a bell curve, a power law, or some other distribution. This calls into question the value of analytical techniques – including Bayesian analysis.

The core issue is quite simple. If the value of the information is greater than the cost of getting it, the information is not secure. Properly valuing the information is a better starting place than attempting to calculate the likelihood of various attacks. 

Thursday, December 9, 2010

The Coming Data Center Singularity: How Fabric Computing Must Evolve


Summary:
The next generation in data center structure will be fabric-based computing, but the fabric will be two full steps beyond today’s primitive versions. First, the fabric will include network switching and protection capabilities embedded within. Second, the fabric will incorporate full energy management capabilities: electric power in, and heat out.
Hypothesis:
Ray Kurzweil describes the Singularity as the moment when the ongoing growth of information and related technologies overwhelms traditional human mental and physical capacity. Moore’s law predicts this ongoing doubling of available computing power, data storage, and network bandwidth at constant cost, so there will come a time when the volume of information suddenly present overwhelms our capacity to comprehend it. In Dr. Kurzweil’s utopian vision, humanity will then transcend biology and enter a new mode of being (which has resonances with Pierre Teilhard de Chardin’s Noosphere).
Data centers will face a similar disruption, but rather sooner than Dr. Kurzweil’s 2029 prediction. Within the next ten years, data centers will be overwhelmed. Current design principles rely on distinct cabling systems for power and information. As processors, storage, and networks all increase capacity exponentially (at constant cost) the demands for power and the need for connectivity will create a rat’s nest of cabling, compounded with ever-increasing requirements for heat dissipation technology.
There will be occasional reductions in power consumption and physical cable density, but these will not avoid the ultimate catastrophe, only defer it for a year or two. Intel’s Nehalem chip technology is both denser and less power-hungry than its predecessor, but such improvements are infrequent. The overall trend is towards more connections, more electricity, more heat, and less space. These trends proceed exponentially, not linearly, and in an instant our data center capacity will run out.
Steady investment in incremental improvements to data center design will be overrun by this deluge of information, connectivity, and power density. Organizations will freeze in place as escalating volumes of data overwhelm traditional configurations of storage, processors, and network connections.
The only apparent solution to this singularity is a radical re-think of data center design. As power and network cabling are the symptoms of the problem, an organizational layout that eliminated these complexities would defer, if not completely bypass, the problem. By embedding connectivity, power, and heat (collectively called energy management) in the framework itself, vendors will deliver increasingly massive compute capabilities in horizontally-extensible units – be they blades, racks, or containers.
Conclusion:
The next generation in data center structure will be fabric-based computing, but the fabric will be two full steps beyond today’s primitive versions. First, the fabric will include network switching and protection capabilities embedded within. Second, the fabric will incorporate full energy management capabilities: electric power in, and heat out. 

Wednesday, December 8, 2010

The Software Product Lifecycle

Traditional software development methodologies end when the product is turned over to operations. Once over that wall, the product is treated as a ‘black box’: Deployed according to a release schedule, instrumented and measured as an undifferentiated lump of code.
This radical transition from the development team’s tight focus on functional internals to the production team’s attention to operational externals can disrupt the service the product is intended to deliver. The separation between development and production establishes a clear, secure, auditable boundary around the organization’s set of IT services, but it also discourages the flow of information between those organizations.
The ITIL Release Management process can improve the handoff between development and production. To understand how, let’s examine the two processes and their interfaces in more detail.
Software development proceeds from a concept to a specification, then by a variety of routes to a body of code. While there are significant differences between Waterfall, RAD, RUP, Agile and its variants (Scrum, Extreme Programming, etc.), the end result is a set of modules that need to supplement or replace existing modules in production. Testing of various kinds assesses the fitness for duty of those modules. During the 1980s I ran the development process for the MVS operating system (the predecessor of z/OS) at IBM in Poughkeepsie and participated in the enhancement of that process to drive quality higher and defect rates lower. The approach to code quality improvement echoes other well-defined quality improvement processes, especially Philip Crosby’s “Quality Is Free.” Each type of test assesses code quality along a different dimension.
Unit test verifies the correctness of code sequences, and involves running individual code segments in isolation with a limited set of inputs. This is usually done by the developer himself.
Function/component test verifies the correctness of a bounded set of modules against a comprehensive set of test cases, designed jointly with the code, that exercise both “edge conditions” and expected normal processing sequences. This step validates the correctness of the algorithms encoded in the modules: for instance, the calculated withholding tax for various wage levels should be arithmetically correct and conform to the relevant tax laws and regulations.
Function/component test relies on a set of test cases, which are programs that invoke the functions or components with pre-defined sets of input to validate expected module behavior, side effects, and outputs. As a general rule, for each program step delivered the development organization should define three or four equivalent program steps of test case code. These test cases should be part of the eventual release package, for use in subsequent test phases – including tests by the release, production, and maintenance teams.
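As a minimal sketch of what such a test case might look like, here is a hypothetical check of the withholding-tax calculation mentioned above; the function, its rates, and the expected values are invented for illustration.
    # Hypothetical function/component test for a toy withholding-tax calculation.
    # calc_withholding, its brackets, and the expected values are all invented.
    import unittest

    def calc_withholding(wages):
        """Toy progressive withholding: 10% up to 1,000, 20% above that."""
        if wages <= 1000:
            return round(wages * 0.10, 2)
        return round(100 + (wages - 1000) * 0.20, 2)

    class WithholdingTest(unittest.TestCase):
        def test_edge_conditions(self):
            self.assertEqual(calc_withholding(0), 0.0)
            self.assertEqual(calc_withholding(1000), 100.0)  # bracket boundary
        def test_normal_processing(self):
            self.assertEqual(calc_withholding(500), 50.0)
            self.assertEqual(calc_withholding(2500), 400.0)

    if __name__ == "__main__":
        unittest.main()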
(Note that we avoid the notion of “code coverage” as this is illusory. It is not possible to cover all paths in any reasonably complex group of modules. For example, consider a simple program that had five conditional branches and no loops. Complete coverage would require 32 sets of input, while minimal coverage would require 10. Moderately complex programs have a conditional branch every seven program steps, and a typical module may have hundreds of such steps: coverage of one such module with about 200 lines of executable code would require approximately 2**28 variations, or over 250 million separate test cases.
For a full discussion of the complexity of code verification, see “The Art of Software Testing” by Glenford Myers. This book, originally published in 1979 and now available in a second edition with additional commentary on Internet testing, is the strongest work on the subject.)
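The path-counting arithmetic in that note is easy to reproduce; the one-branch-per-seven-statements ratio is the rule of thumb quoted in the text, carried over here as an assumption.
    # Path counts behind the note on code coverage.
    print(2 ** 5)                   # 32 input sets for 5 independent branches
    branches = 200 // 7             # about 28 branches in 200 executable lines
    print(branches, 2 ** branches)  # 28 branches -> 268,435,456 possible paths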
System test validates operational characteristics of the entire environment, but not the correctness of the algorithms. System test consists of a range of specific tests of the whole system, with all new modules incorporated on a stable base. These are:
Performance test: is the system able to attain the level of performance and response time the service requires? This test involves creating a production-like environment and running simulated levels of load to verify response time, throughput, and resource consumption. For instance, a particular web application may be specified to support 100,000 concurrent users submitting transactions at the rate of 200 per second over a period of one hour.
Load and stress test: How does the system behave when pressed past its design limits? This test also requires a production-like environment running a simulated workload, but rather than validating a target performance threshold, it validates the expected behavior beyond those thresholds. Does the system consume storage, processing power, or network bandwidth to the degree that other processes cannot run? What indicators does the system provide to alert operations that a failure is imminent, ideally so automation tools could avert a disaster (by throttling back the load, for instance)?
Installation test: Can the product be installed in a defined variety of production environments? This test requires a set of production environments that represent possible real-world target systems. The goal is to verify that the new system installs on these existing systems without error. For instance, what if the customer is two releases behind? Will the product install, or does the customer first have to install the intermediate release? What if the customer has modified the configuration or added typical add-on functionality? Does the product install cleanly? If the product is intended to support continuous operations, can it be installed non-disruptively, without forcing an outage?
Diagnostic test: When the system fails, does it provide sufficient diagnostic information to correctly identify the failing component? This requires a set of test cases that intentionally inject erroneous data to the system causing various components to fail. These tests may be run in a constrained environment, rather than in a full production-like one.
The QA function may be part of the development organization (traditional) or a separate function, reporting to the CIO (desirable). After the successful completion of QA, the product package moves from development into production. The release management function verifies that the development and QA teams have successfully exited their various validation stages, and performs an integration test, sometimes called a pre-production test, to ensure that the new set of modules is compatible with the entire production environment. Release management schedules these upgrades and tests, and verifies that no changes overlap. This specific function, called “collision detection”, can be incorporated into a Configuration Management System (CMS), or Configuration Management Database (CMDB), as described in ITIL version 3 and 2 respectively.
Ideally this new pre-production environment is a replica of the current production one. When it is time to upgrade, operations transfers all workloads from the existing to the new configuration. Operations preserves the prior configuration should problems force the team to fall back to that earlier environment. So at any point in time, operations controls three levels of the production environment: the current running one, called “n”, the previous one, called “n-1”, and the next one undergoing pre-production testing, called “n+1”.
Additional requirements for availability and disaster recovery may force the operations team to also maintain a second copy of these environments – an “n-prime” copy of the “n” level system and possibly an “n-1-prime” copy of the previous good system, for fail-over and fall-back, respectively.
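One way to picture those levels is a simple release-tracking structure like the sketch below; the level names, fields, and build identifiers are invented for illustration rather than drawn from ITIL.
    # Hypothetical tracking of the environment levels described above.
    environments = {
        "n+1": {"role": "pre-production test",  "build": "2010.12-rc1"},
        "n":   {"role": "current production",   "build": "2010.11"},
        "n-1": {"role": "fall-back",            "build": "2010.10"},
        "n'":  {"role": "fail-over copy of n",  "build": "2010.11"},
    }
    for level, info in environments.items():
        print(f"{level:>4}: {info['role']} ({info['build']})")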
When operations (or the users) detect an error in the production environment, the operations team packages the diagnostic information and a copy of the failing components into a maintenance image. This becomes the responsibility of a maintenance function, which may be a separate team within operations, within development, or simply selected members of the development team itself. Typically this maintenance team is called “level 3” support.
Using the package, the maintenance function may provide a work-around, provide a temporary fix, develop a replacement or upgrade, or specify a future enhancement to correct the defect. How quickly the maintenance team responds follows from the severity level and the relevant service level agreements governing the availability and functionality of the failing component. High severity problems require more rapid response, while lower severity issues may be deferred if a work-around can suffice for a time.
Note that the number of severity levels should be small: three or four at most. Also, the severity levels should be defined by the business owner of the function. Users should not report problems as having high severity simply to get their issues fixed more rapidly; the number of high severity problems should be a small percentage of the total volume of problems over a reasonable time period.
The successful resolution of problems, and the smooth integration of new functions into the production environment, requires a high degree of communication and coordination between development, QA, and production. Rather than passing modules over a wall, they should be transferred across a bridge: The operations team needs access to test cases and test results along with the new code itself, for diagnostic purposes; and the development team benefits from diagnostic information, reports on failure modes and problem history, and operational characteristics such as resource consumption and capacity planning profiles. The ITIL release management function, properly deployed, provides this bridging function.

Sunday, September 26, 2010

The Blanchard Bone and the Big Bang of Consciousness


Found at a cave in southwestern France, the Blanchard Bone is a curious artifact. It appears to be between 25,000 and 32,000 years old. It is only four inches long. It is engraved with about 69 small figures, arranged in a sequence of a flattened figure eight. Archeologists tell us that the carving required twenty-four changes of point or stroke.

What is it? Looking closely at the carving, it seems that the 69 images represent the phases of the moon over two lunar months (image courtesy of Harvard University, Peabody Museum). It isn’t writing: That was still 20,000 to 27,000 years – over four hundred generations and two ice ages – in the future.

What would the night sky mean to our ancestors so long ago? The Sun was directly responsible for heat and light, and defined the rhythm of days. But the Moon moved so slowly, in comparison. What was it? What did our fathers and mothers think they were looking at, when the Moon rose and traveled across the sky, changing its shape from day to day, but always following the same pattern?

Yet the Moon’s travels had implications and meanings: The Sea responded to the Moon in its tides – or did the tides somehow pull the Moon along? How did that happen? What was going on between the Moon and the Sea?

The ancient artist/scientist/priest who carved this artifact carved what she saw – and more. The artifact was useful, for knowing when to plant, when the birds, the herds, or the fish were migrating, when it might be a good time to find a warm cave. The Moon measured fertility and gestation. When people speculated on this, they began to think – about what they saw, and what it meant.

Some wondered if the bone might be magically linked to the Moon and the Sea. Who among them could not be perplexed by the gymnastics, by the dance, of the Moon?

What would that inevitable nighttime procession inspire? How many nights could people look at the slow but predictable change, observe its correlations, and not be challenged to wonder? The first instances of human reasoning could have been inspired by this persistent phenomenon.

In “Hamlet’s Mill: An Essay Investigating the Origins of Human Knowledge and Its Transmission Through Myth,” Giorgio de Santillana and Hertha von Dechend propose that the myths, as Aristotle taught, are about the stars. The authors trace the myth of Hamlet back to Amlodhi, who owned a magical mill. Once it ground out peace, but it fell off its axle and on the beach ground out sand, and now it has fallen into the Sea where it grinds out salt. In their essay, the authors reveal that this myth is a story to capture and preserve the observation of the precession of the equinoxes. This is a 25,950-year-long cycle, during which the Earth’s North Pole traces a great circle through the heavens. Now the North Pole points to the Pole Star in Ursa Minor, but in 13,000 years it will point to Vega. Only a medium as persistent as a story could span the ages, capturing and preserving this observation.

When the Blanchard bone was formed, the sky was much as it will appear tonight. Between then and now we have passed through one Great Year. When the North Pole pointed to Vega last, our species was beginning to colonize the Western hemisphere, the ice age was capturing water and lowering the seas, and the Blanchard bone had been lost for ten thousand years.

Let us remember that ancient scientist/artist/priest, let us regard her qualities of observation, synthesis, and imagination with wonder: her discovery in the sky urged us to consciousness, communication, and endless wonders beyond.