On September 9, 2010, a gas transmission pipeline rupture in San Bruno, California, killed eight people and caused extensive property damage. The subsequent National Transportation Safety Board (NTSB) investigation revealed that some records in the operator’s Geographic Information System (GIS) were inaccurate. The finding is significant because federal law requires pipeline operators to perform data-driven risk analysis as part of federally mandated pipeline integrity management. The erroneous data may have invalidated the operator’s risk analysis and may have affected the events leading to the rupture. As a result, on January 3, 2011, the NTSB issued a series of urgent pipeline safety recommendations calling on operators to perform extensive verification and validation of their pipeline records.
“It’s tough to make predictions, especially about the future.” – Yogi Berra
Federal regulatory agencies are deeply concerned about pipeline industry data management practices. At the April 18, 2011 National Pipeline Safety Forum, NTSB chairman Ms. Debbie Hersman posed the following: “Unfortunately in the San Bruno accident, we found that the company’s underlying records were not accurate… My question is that if your many efforts to improve safety are predicated on identifying risk, and if your baseline understanding of your infrastructure is not accurate, how confident are you that your risks are being assessed appropriately?”
This troubling question leads to a deeper set of questions for GIS professionals:
1) Why is critical data sometimes missing?
2) How does incorrect data enter the GIS?
3) What can we do to improve data quality?
4) Why are our predictive models fragile (why can’t they tolerate incorrect data inputs)?
5) What can we do to make our predictive models robust in the face of “fuzzy” data?
Let’s examine these questions in light of lessons learned from San Bruno, recognizing the answers may prove generally useful to GIS practitioners.
“The map is not the territory.” – Alfred Korzybski
Korzybski’s famous quote answers the first question. Humans are masters of abstraction; we filter reality through the lens of language. GIS data modelers extend abstraction, reducing complex real-world entities to attributed points, lines and polygons on a digital map. A GIS grossly simplifies fine-grained reality, presenting a reductionist view of the physical world.
With pipelines, GIS reductionism is exacerbated by how the GIS is initially populated. Following common practice, data for the San Bruno GIS was captured from existing paper maps and construction report documents. These in turn summarized “raw” as-built construction records. The existing maps were drafted at a particular scale; objects too small for display at that scale never made it into the GIS. Consequently, the individual lengths of pipe that failed at San Bruno were not explicitly represented in the operator’s GIS. Only generalized properties for the pipe were captured in the GIS; detailed data for individual pipe lengths, or even batches of pipe, was not accessible through the GIS.
Metallurgical analyses of several failed San Bruno pipe segments indicate they were substandard relative to surrounding pipe. (The analyses also revealed manufacturing defects in their seams, which contributed to the rupture.) GIS reductionism aside, the use of lower quality pipe in certain areas seems like important information. One wonders why special note was not made of this in the construction documents. The answer is simple, and sobering. The lower quality pipe employed at San Bruno was deemed sufficient at construction time. In fact, that pipe survived nearly six decades of service. The engineers who built the San Bruno line were not clairvoyant. They could not predict which bits of information would prove critical almost sixty years later. We are not clairvoyant, either. By extension, we may not know which bits of information will prove critical sixty years from now.
[Figure: A generalized GIS data manufacturing loop.]
Data that enters the GIS is only the tip of the data iceberg; the data iceberg itself is only a simulacrum of the real iceberg existing in physical reality. The most prudent course is to store every bit of information possible, even the seemingly irrelevant. We should assume everything is potentially important. We should fight the reductionism inherent in GIS technology. This is a philosophical approach that reflects attitude and outlook, not technology or process. Properly embraced, it guides how we design and build processes that gather, validate, verify, audit, store, analyze, and distribute data. Attention to detail must become a passion.
“If you can’t describe what you are doing as a process, you don’t know what you’re doing.” – W. Edwards Deming
Let’s turn to the question of how bad data enters the GIS, using San Bruno as an example. The generalized pipe attribution in the San Bruno GIS was incorrect. The GIS depicted the pipe as seamless, but the pipe actually had double longitudinal seam welds. (Seamless pipe is formed by extrusion; seam-welded pipe is formed by bending one or two steel plates around a mandrel, and then welding them together longitudinally to form the pipe.) Seam-welded pipe can be weaker than seamless pipe, and seams are more prone to manufacturing defects. Clearly, pipe seam type is a critical bit of information. At San Bruno, incorrect data in the GIS rendered the operator blind to critical properties of the pipe in the ground.
The NTSB investigation determined that the pipe seam type stored in the GIS was sourced from summary project documents related to original construction. A pipe type code value was misinterpreted during GIS data conversion; the resulting error was never detected. Obviously, there was a problem with quality assurance and control. Many operators are now scrambling to make sure they can provide a positive answer to Ms. Hersman’s query. Many would also tell you, sadly, that this is not their first round of GIS data cleanup. Some operators have been through multiple rounds of data cleanup in recent years, indicating a lack of focus on process uniformity and control.
If bad data stems from lack of process control, the answer to our third question is better process control. Data Governance is the marriage of traditional IT data management technology with modern business process management methodologies. In fact, it’s primarily about process management. If we think of data as a manufactured product, then it’s clear that data defects arise from deficiencies in the data manufacturing process. Fortunately for pipeline operators and GIS practitioners, decades of process management expertise may be borrowed from manufacturing to improve GIS data governance.
Three prominent schools of manufacturing process management are applicable to GIS data governance; they may be grossly summarized as follows:
[Figure: Black swans can be frightening.]
Space does not permit discussion of each, but one thing is common to all: they concentrate on defect prevention. Correcting bad data after it enters your GIS is terribly inefficient. The idea is to detect and correct defects before they enter your GIS. It’s far less expensive to correct data defects at, or close to, the point of collection, than it is to ferret out and correct bad data long (in some cases decades) after the fact. An ounce of prevention is worth a pound of cure.
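The idea of catching defects at the point of entry can be sketched in a few lines. The following is only an illustration, assuming a hypothetical seam code table and hypothetical field names (seam_code, source_document); a real operator's codes and schema will differ. Records that fail validation are quarantined for review rather than loaded with a guessed value.

```python
# Hypothetical code table for illustration; real code lists will differ.
SEAM_CODES = {
    "SMLS": "seamless",
    "DSAW": "double submerged arc welded",
    "ERW": "electric resistance welded",
}


def validate_record(record):
    """Return a list of defects; an empty list means the record may be loaded."""
    defects = []
    code = record.get("seam_code")
    if code is None:
        defects.append("seam_code missing")
    elif code.strip().upper() not in SEAM_CODES:
        # Never guess at an unrecognized code; send it back to the source.
        defects.append(f"seam_code {code!r} not in code table")
    if not record.get("source_document"):
        defects.append("source_document missing (no audit trail)")
    return defects


def load(records):
    """Split incoming records into accepted and quarantined lists."""
    accepted, quarantined = [], []
    for rec in records:
        defects = validate_record(rec)
        if defects:
            quarantined.append((rec, defects))
        else:
            accepted.append(rec)
    return accepted, quarantined


incoming = [
    {"seam_code": "SMLS", "source_document": "as-built sheet 12"},
    {"seam_code": "X2", "source_document": "project summary"},
    {"source_document": "project summary"},
]
accepted, quarantined = load(incoming)
for rec, defects in quarantined:
    print(rec, "->", "; ".join(defects))
```

The design choice worth noting is that an unrecognized code is treated as a defect to be resolved against the source document, not silently mapped to the nearest plausible value.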
“Things always become obvious after the fact.” – Nassim Nicholas Taleb
Let us now consider questions four and five. Most pipeline risk models calculate probability of failure stemming from various threats, and then combine the result with an estimate of consequence of failure to arrive at an overall relative risk score. By definition, these models calculate the most probable risk value. Most operators perform sensitivity analysis to determine which variables dominate risk scores. Most also account for unknown inputs, applying conservative estimates in the face of unknown data. However, very few take into account the fact that a certain percentage of their data is incorrect.
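A toy version of such a model makes the gap concrete. This is a minimal sketch, not any operator's actual risk algorithm: the threat names, scores, the additive form, and the unknown_default value are all assumptions made for illustration. Unknown inputs (None) receive a conservative default, but a value that is known and wrong passes straight through.

```python
# Toy relative risk model. All threat names, scores, and the default value
# below are hypothetical and chosen only for illustration.

def relative_risk(threat_scores, consequence, unknown_default=0.9):
    """Sum per-threat likelihood scores, then scale by consequence of failure.

    A score of None means "unknown" and is replaced by a conservative default.
    """
    likelihood = 0.0
    for score in threat_scores.values():
        likelihood += unknown_default if score is None else score
    return likelihood * consequence


# Segment as recorded in the GIS: the seam threat is scored low because the
# pipe is (wrongly) recorded as seamless; third-party damage data is unknown.
as_recorded = {
    "external_corrosion": 0.6,
    "seam_defect": 0.1,
    "third_party_damage": None,
}

# The same segment with the seam threat scored as it should have been.
as_built = dict(as_recorded, seam_defect=0.8)

recorded_risk = relative_risk(as_recorded, consequence=10.0)
actual_risk = relative_risk(as_built, consequence=10.0)
print(f"recorded: {recorded_risk:.1f}  actual: {actual_risk:.1f}")
```

The model treats the missing third-party damage score conservatively, yet it reports a lower score for the segment as recorded than for the segment as built, and nothing in the output signals that the seam input might be wrong.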
At San Bruno, the operator’s risk results indicated external corrosion as the primary threat of concern. The operator’s threat assessment methods and its preventive and mitigative measures all targeted external corrosion, and controlled it effectively. Pipeline risk models do address threats from manufacturing defects, including seam defects. However, because of the incorrect seam type in the San Bruno GIS, the operator’s risk results did not reflect seam defects as a potential threat of concern. The pipe assessment methods used by the operator could not detect faulty pipe seams. As a result of faulty GIS data, the operator’s risk analysis was flawed, and actions carried out on the basis of that analysis had no mitigative effect on the true threat of concern, seam defects.
A recent movie might cause one to associate black swans with neurotic ballerinas. This is unfortunate. The term was coined in relation to Scottish Enlightenment philosopher David Hume’s work on the problem of induction. During much of the 17th century, an Englishman could seemingly state with confidence, “all swans we have seen are white; therefore all swans are white.” Black swans were discovered in Australia in 1697, exposing inductive logic’s flaw. In his seminal treatise, The Black Swan: The Impact of the Highly Improbable, Nassim Taleb characterizes a black swan as any event, positive or negative, that is highly improbable, and results in nonlinear consequences. The black swan is an outlier beyond the realm of expectation; nothing in our past experience convincingly points to its possibility. Human nature being what it is, we concoct an explanation after the fact, and convince ourselves the black swan was predictable.
Tragic incidents like that at San Bruno often result from several improbable factors combining in the worst possible location. We’ve already identified two at San Bruno: 1) the use of several lengths of lower quality pipe containing seam defects, which were still strong enough to last decades, and 2) errors in the GIS that rendered the operator blind to the seamed pipe. A third wildcard was a 2008 sewer replacement project that used a standard pneumatic fracturing technique which likely damaged the nearby suspect pipe. The confluence of these three factors is highly unlikely; the San Bruno incident was unforeseeable. Like stock market crashes, many pipeline accidents are explainable only in hindsight. As with the market, despite predictive risk models, pipeline accidents remain inherently unpredictable. They are black swans.
Taleb emphasizes that black swans are creatures of chance; the best we can hope for is to make ourselves less vulnerable to them. One technique Taleb uses to assess potential outliers is brute force statistical modeling via Monte Carlo simulation. Given a range of potential inputs for an analytical model, Monte Carlo simulation repeatedly draws random input values from that range and runs the model on each draw, building up a picture of the spread of possible outcomes. Given some notion of the level of error (uncertainty) in the input data, when applied to a pipeline risk model Monte Carlo simulation will output a probability distribution of relative risk values, rather than the single number output by risk models currently in use. More certain input data narrows the probability distribution; less certain input data widens it. Correctly applied to situations like San Bruno, Monte Carlo simulation might tell us we don’t know what we think we know.
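To illustrate the widening effect, here is a minimal Monte Carlo sketch using only the Python standard library. Everything in it is assumed for illustration: the toy risk_score formula, the SEAM_THREAT weights, the corrosion range, and the 10% error rate for the recorded seam type. It is not drawn from any real risk model.

```python
import random
import statistics

# Hypothetical relative threat weights; not taken from any real risk model.
SEAM_THREAT = {"seamless": 1.0, "seam_welded": 4.0}


def risk_score(seam_type, corrosion_factor, consequence):
    """Toy model: (likelihood-of-failure proxy) x (consequence of failure)."""
    return (SEAM_THREAT[seam_type] + corrosion_factor) * consequence


def simulate(recorded_seam, error_rate, trials=10_000, seed=42):
    """Sample risk scores, allowing the recorded seam type to be wrong."""
    rng = random.Random(seed)
    other = "seam_welded" if recorded_seam == "seamless" else "seamless"
    scores = []
    for _ in range(trials):
        # With probability error_rate, the GIS record is assumed incorrect.
        seam = other if rng.random() < error_rate else recorded_seam
        corrosion = rng.uniform(1.0, 2.0)  # assumed uncertainty range
        scores.append(risk_score(seam, corrosion, consequence=10.0))
    return scores


def summarize(scores):
    """Report the mean and the 5th and 95th percentile scores."""
    ordered = sorted(scores)
    return {
        "mean": statistics.fmean(ordered),
        "p05": ordered[int(0.05 * len(ordered))],
        "p95": ordered[int(0.95 * len(ordered))],
    }


trusted = summarize(simulate("seamless", error_rate=0.0))
doubted = summarize(simulate("seamless", error_rate=0.10))
print("records trusted:", trusted)
print("records doubted:", doubted)
```

When the seam record is fully trusted, the scores stay within a narrow band. Admitting even a modest chance that the record is wrong stretches the upper tail of the distribution well beyond that band, which is the kind of signal a single-number risk score cannot give.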
A last word of caution: We must consider the impact of the unknown. Like our GIS databases, our risk models are reductionist. They are incapable of addressing factors outside the domain of the model itself. Even if the risk model inputs are well constrained, and the Monte Carlo risk probability distribution narrow, we are still at risk. Black swan events may still emerge from circumstances beyond the scope of our models.