Monday, February 13, 2012

On the Use, and Misuse, of Software Test Metrics

“You will manage what you measure” – Fredrick W. Taylor
Testing verifies that a thing conforms to its requirements. A metric is a measurement, a quantitative valuation. So a test metric is a measurement that helps show how well a thing aligns with its requirements.

Consider a test you get in school. The goal of the test is to show that you understand a topic, by asking questions of you about the topic. Depending on the subject, the questions may be specific, fact-based (When did the USSR launch Sputnik?); they may be logic-based (Sputnik orbits the earth every 90 minutes, at an altitude of 250 km. How fast is it moving?); or they may be interpretative (Why did the Soviet Union launch the Sputnik satellite?)

Or they can be just evil: write an essay about Sputnik. Whoever provides the longest answer will pass the test.

Note that by asking similar questions we learn about the student's capabilities in different dimensions. So when a piece of software shows up, the purpose of testing should not be to find out what it does (a never-ending quest) but to find out if it does what it is supposed to do (conformance to requirements). The requirements may be about specific functions (Does the program correctly calculate the amount of interest on this loan?); about operational characteristics (Does the program support 10,000 concurrent users submitting transactions at an average rate of one every three minutes, while providing response times under 1.5 sec for 95 percent of those users as measured at the network port?); or about infrastructural characteristics (Does the program support any W3C-compliant browser?)

These metrics follow from the program's intended use. Management may use other metrics to evaluate the staff: How many bugs did we find? Who found the most? How much time does it take, on average, to find a bug? How long does it take to fix one? Who created the most bugs?

The problem with these metrics is they generally misinform managers, and lead to perverse behaviors. If I am rated on the number of bugs I write, then I have a reason to write as little code as possible, and stay away from the hard stuff entirely. If I am rated on the number of bugs I find, then I am going to discourage innovations that would improve the quality of new products. So management must focus on those metrics that will meet the wider goal - produce high quality, low defect code, on time.

Software testing takes a lot of thinking: serious, hard, detailed, clear, patient, logical reasoning. Metrics are not testing - they are a side effect, and they can have unintended consequences if used unwisely. Taylor advised care when picking any metric. Often misquoted as "you can't manage what you do not measure," Taylor's intent was to warn us. Lord Kelvin said "You cannot calculate what you do not measure" but he was talking about chemistry, not management. Choose your metrics with care. 

Friday, February 10, 2012

Beyond Risk Quantification

For too many years information security professionals have chased a mirage: the notion that risk can be quantified. It can not. The core problem with risk quantification has to do with the precision of the estimate.
Whenever you multiply two numbers, you need to understand the precision of those numbers, to properly state the precision of the result. That is usually described as the number of significant digits. When you count up your pocket change, you get an exact number, but when you size a crowd, you don't count each individual, you estimate the number of people.

Now suppose the crowd starts walking over a bridge. How would you derive the total stress on the structure? You might estimate the average weight of the people in the crowd, and multiply that by the estimated number of people on the bridge. So you estimate there are 2,000 people, and the average weight is 191 pounds (for men) and 164.3 pounds (for women), and pull out the calculator. (These numbers come from the US Centers for Disease Control, and refer to 2002 data for adult US citizens).

So let's estimate that half the people are men. That gives us 191,000 pounds, and for the women, another 164,300 pounds. So the total load is 355,300 pounds. Right?
No. Since the least precise estimate has one significant digit (2,000) then the calculated result must be rounded off to 400,000 pounds.

In other words, you cannot invent precision, even when some of the numbers are more precise than others.

The problem gets even worse when the estimates are widely different in size. The odds of a very significant information security problem are vanishingly small, while the impact of a very significant information security problem can be inestimably huge. When you multiply two estimates of such low precision, and such widely different magnitudes, you have no significant digits: None at all. The mathematical result is indeterminate, unquantifiable.

Another way of saying this is that the margin of error exceeds the magnitude of the result.

What are the odds that an undersea earthquake would generate a tsunami of sufficient strength to knock out three nuclear power plants, causing (as of 2/5/12) 573 deaths? Attempting that calculation wastes time. (For more on that number, see

The correct approach is to ask, if sufficient force, regardless of origin, could cripple a nuclear power plant, how do I prepare for such an event?

In information security terms, the problem is compounded by two additional factors. First, information security attacks are not natural phenomena; they are often intentional, focused acts with planning behind them. And second, we do not yet understand whether the distribution of intentional acts of varying complexity (both in design and in execution) follow a bell curve, a power law, or some other distribution. This calls into question the value of analytical techniques - including Bayesian analysis.

The core issue is quite simple. If the value of the information is greater than the cost of getting it, the information is not secure. Properly valuing the information is a better starting place than attempting to calculate the likelihood of various attacks.