Analyzing Transformer Replacement Policies: A Simulation Approach to

Reducing Failure Risk

Daniel P. Chen

Advisor: Professor Warren B. Powell

Submitted in partial fulfillment

of the requirements for the degree of

Bachelor of Science in Engineering

Department of Operations Research and Financial Engineering

Princeton University

April 14th, 2014

I hereby declare that I am the sole author of this thesis.

I authorize Princeton University to lend this thesis to other institutions or individuals

for the purpose of scholarly research.

Daniel P. Chen

I further authorize Princeton University to reproduce this thesis by photocopying or

by other means, in total or in part, at the request of other institutions or individuals

for the purpose of scholarly research.

Daniel P. Chen

Abstract

PSE&G and utilities nationwide face a considerable amount of operational and financial risk from the possibility of widespread transformer failure. The current policy that PSE&G uses for transformer replacement does not replace transformers until they are close to failure and is not sufficient to protect PSE&G from significant failure risk. This paper implements several replacement policies to reduce failure risk. It focuses on policies that utilize both chronological age and the number of faults experienced as criteria for replacement. The results show that these policies are effective at reducing failure risk while incurring significant opportunity costs. The final part of this paper explores the trade-off between failure risk and opportunity cost in order to inform the future decisions of the utility.

Acknowledgements

First and foremost, I would like to thank my advisor, Professor Warren Powell, without whom this thesis would not have been possible. Thank you for introducing me to the problem of transformer replacement as well as consistently pushing me to explore the problem in new and interesting ways. I am grateful for your constant encouragement, especially during the many times I came to you with what I thought were unsolvable obstacles throughout this process.

I would also like to thank Richard Wernsing and Angela Rothweiler from the Asset Strategy team at PSE&G. Thank you for taking the time to respond to the many inquiries I had and helping me understand the basics of operating a utility company. I hope that the results of this thesis will be half as helpful to you as you have been to me.

I would be remiss if I did not acknowledge the many friends who have not only made this thesis process bearable but also provided an incredible source of support and hilarity throughout these past 4 years. Thank you for putting up with me.

To the Princeton Tower Club and members of the Centennial Room, for always providing lively dinner table conversation and an endless supply of coffee. To the ORF crew, for ensuring that I was never alone throughout my many late nights and supplying much needed distractions to keep me from being too productive. And to the gentlemen of Dod 1S, for being the best roommates a guy could ask for. I can only hope that we will get the chance to continue our adventures after graduation.

Specifically, I would also like to thank Ashley Chiang, Shreya Nathan, Medha Ranka, and Satyajeet Pal for their help with the editing process. Thank you for taking the time out of your busy schedules to read my thesis.

And finally, to my brother, parents, and grandparents who have made everything at Princeton possible. Thank you for always being there for me throughout my life and believing that I can achieve anything I set my mind to.

To Mom, Dad, and Patrick

Contents

1 Introduction
    1.1 PSE&G
        1.1.1 The Grid
    1.2 Literature Review
        1.2.1 Transformer Lifetime
        1.2.2 Transformer Aging
        1.2.3 Handling Transformer Failure
    1.3 Information from Diagnostics
    1.4 Overview of Thesis

2 The Stochastic Model
    2.1 Assumptions
        2.1.1 Transformer Location Correlation
        2.1.2 Aging Process
        2.1.3 Failure Threshold
        2.1.4 Failure and Replacement
    2.2 The Unobservable Model
        2.2.1 Initialization of Model
        2.2.2 The State Variable
        2.2.3 Exogenous Information
        2.2.4 Transition Functions

3 Model Selection
    3.1 Transformer Location Initialization
    3.2 Number of Faults Per Year
        3.2.1 The Poisson Distribution
        3.2.2 The Data
        3.2.3 Fitting the Poisson Distribution
        3.2.4 The Negative Binomial Distribution
    3.3 Location of Faults
    3.4 Transformer Correlation
    3.5 Failure Times and Fault Magnitude
        3.5.1 Methodology
        3.5.2 The Base Model
        3.5.3 Failure Times
        3.5.4 Magnitude of Faults
        3.5.5 Parameter Selection

4 Policies
    4.1 The Observable Model
        4.1.1 The State Variable
        4.1.2 Decision Variables
        4.1.3 Exogenous Information
        4.1.4 Transition Functions
        4.1.5 Objective Function
    4.2 Base Policy
        4.2.1 Simulating DGA Testing
    4.3 Pure Aging Policy
    4.4 Variance Reduction Policy
    4.5 Threshold Policy
    4.6 Lookahead Policy
        4.6.1 Overview of Lookahead Policies
        4.6.2 1 year
        4.6.3 2 year
    4.7 Policies To Estimate the Value of Information
        4.7.1 Lookahead Plus Policy
        4.7.2 Perfect Information Policy

5 Results
    5.1 Evaluation Metrics
        5.1.1 Opportunity Cost
        5.1.2 Cost of Failure
        5.1.3 Empirical Objective Function
        5.1.4 Failure Risk vs. Opportunity Cost per Replacement
    5.2 Comparing Policy Performance
    5.3 Policy Specific Results
        5.3.1 Base Policy
        5.3.2 Pure Aging Policy
        5.3.3 Variance Reduction Policy
        5.3.4 Threshold Policy
        5.3.5 Lookahead Policy
    5.4 Comparison Across Different Risk Measures
        5.4.1 Case Study: Zero Risk Tolerance
    5.5 Limitations of Policies
        5.5.1 Chronological Age vs. Weighted Ranking
        5.5.2 Variance in Fault Magnitude
        5.5.3 Expected Amount of Aging Per Year
    5.6 Value of Information
        5.6.1 Estimating the Value of Information
        5.6.2 Impact of Measurement Noise

6 Conclusion
    6.1 Areas for Further Research
    6.2 Final Remarks

List of Figures

1.1 A map of PSE&G’s service area
1.2 PSE&G’s Maintenance Policies
1.3 Summary of gas concentrations from DGA (Hamrick, 2009)
1.4 Gas concentration over time from Karlsson (2007)
1.5 The basic structure of the U.S. electric grid (Council of Economic Advisers, 2013)

3.1 Histogram showing the ages of current transformers
3.2 Historical number of unique incidents per year
3.3 Comparison of the Poisson and Negative Binomial distributions. The green line shows the p.m.f. of the fitted Negative Binomial distribution and the blue line shows the p.m.f. of the fitted Poisson distribution
3.4 Distribution of the number of transformers affected per fault on average
3.5 The number of failures per year in the Base Model
3.6 Sample paths of failure rates using the fitted stochastic model

5.1 Values of the objective function of policies with different values for the tunable parameters. The orange line represents the minimum value across all policies and the labels above each policy represent the minimum value within each policy.
5.2 The best and worst values of the objective function within each policy, with costs broken down by opportunity cost and failure cost
5.3 The relationship between the number of replacements made in the time interval t = [70, 120] and the opportunity cost per replacement compared across policies
5.4 A sample path of the Base policy from t = [70, 120]. The blue line represents the number of replacements and the red line represents the number of failures
5.5 Performance of the Pure Aging policy across the 3 different risk measures with different values for α. α decreases from left to right.
5.6 Sample paths of the Pure Aging policy with different values for α
5.7 Performance of the Variance Reduction policy across the 3 different risk measures with β = 0.2 and different values for η. η decreases from left to right.
5.8 Sample paths of the Variance Reduction policy with different values for η
5.9 Performance of the Threshold policy across the 3 different risk measures with β = 0.2 and different values for τ. τ decreases from left to right.
5.10 Sample paths of the Threshold policy with different values for τ
5.11 Performance of the Lookahead policies across the 3 different risk measures with β = 0.2, δ1 = 1, δ2 = 0.5 and different values for γ. γ decreases from left to right.
5.12 Performance of different policies in terms of reducing p9 with varying tunable parameters for each policy
5.15 Comparison of different policies across risk measures in the scenario of zero risk tolerance where failure risk is completely minimized
5.16 Fitted probability distribution functions of failure values using the two criteria. The x-axis represents the percentage of the mean value
5.17 Fitted probability distribution function of all fault magnitudes
5.18 Fitted probability distribution function of fault magnitudes separated by fault type
5.19 Distribution functions of the annual amount of aging experienced by transformers
5.20 Values of the objective function of policies including policies to estimate the value of information. The top orange line represents the minimum value across all currently feasible policies. The middle orange line represents the minimum value in the Lookahead Plus policy. The bottom orange line represents the minimum value of the Perfect Information policy.
5.21 Performance of the Lookahead Plus policy under different ε on the p9 risk measure. δp increases from the left to right.
5.22 Comparison of P it with different values of ε at t=90 in the same sample path.
5.23 Impact of measurement noise on Lookahead Plus policy performance compared to Threshold policy.

List of Tables

3.1 Parameter selection for correlation matrix
3.2 Summary of best fit parameters

5.1 Results for Base policy
5.2 Results for Pure Aging policy
5.3 Results for Variance Reduction Policy
5.4 Results for Threshold Policy
5.5 Results for Lookahead Policy
5.6 Statistics for the failure values for the chronological age and weighted ranking
5.7 Statistics for the fault magnitude of m2 and m3 faults
5.8 Statistics for the amount of aging per year
5.9 Summary of the value of information

Chapter 1

Introduction

Power transformers are one of the most important components of a power grid:

they play a key role in the distribution of power to homes, factories, and other units

on the grid. From the perspective of the utility, the protection and maintenance

of transformers is of the utmost importance. The failure of a transformer is most

often the initial cause or an exacerbating factor in large power outages. Because

transformers are extremely costly and difficult to replace, utilities are constrained in

the number of transformers that they can replace in a given year. Given the uncertainty

surrounding when a transformer will fail, the utility faces a considerable amount

of risk in the failure of transformers. This thesis formulates a stochastic model for

transformer failures in order to design a policy that minimizes the failure risk faced

by a utility.

1.1 PSE&G

This thesis was inspired by close collaboration with Professor Warren Powell and

the Asset Strategy Team at PSE&G led by Richard Wernsing and Angela Rothweiler.

After initial meetings with the Asset Strategy Team, it became clear that PSE&G was

running a network with a significant amount of uncertainty surrounding the failure of

its transformers. As a regulated utility with tightly controlled resources for management

of its grid, PSE&G must implement an effective policy to prevent future transformer

failures. Given these resource constraints, PSE&G is especially susceptible

to high levels of risk. Thus, transformer reliability represents a top priority for

PSE&G.

1.1.1 The Grid

Figure 1.1: A map of PSE&G’s service area

PSE&G operates a network that serves a densely populated area of New Jersey

that stretches from New York City to Philadelphia, shown in Figure 1.1. PSE&G’s

service area is home to more than 60% of New Jersey residents despite occupying less

than 30% of the state’s area. The network consists of 358 transformers in operation,

with a significant number of transformers that are around 50-70 years old. The

primary concern of PSE&G is the failure of a large number of transformers within a

short time frame. Given that more than a third of the transformers in the network

are around 60 years old, the probability of a large number of failures occurring in the

same year is high.

More importantly, the problem of having a sizable cluster of transformers around

the same age is not unique to PSE&G. After World War II, a period of global industrial

growth led to a dramatic increase in energy demand. Between 1950 and

1970, global energy consumption grew from 1 billion MWh to 10 billion MWh (Bartley,

2003). This growth in demand necessitated rapid infrastructure investment and

transformer installation. According to the Council of Economic Advisers, U.S. historical

transmission construction reached a peak in 1967, before declining rapidly

through the 1970s and 1980s (Council of Economic Advisers, 2013). Utility companies

nationwide experienced rapid expansion of their networks in a short period of

time, and many now face the same problem as PSE&G.

1.2 Literature Review

1.2.1 Transformer Lifetime

A great deal of literature exists on how to model transformer lifetime in terms

of reliability theory. However, there is no consensus on the best model

for transformer lifetime. Chen and Egan (2006) formulate transformer retirement

patterns using Iowa Curves (asset retirement patterns based on research at Iowa

State University). Given a family of Iowa Curves, Chen and Egan (2006) use a

Bayesian method and a Perks’ hazard function to choose the best-fit curve with

the highest probability.

van Schijndel et al. (2006) explore the concept of an integral transformer lifetime

model. An integral transformer lifetime model is defined as being able to treat all

relevant degradation mechanisms of all relevant subcomponents. Such a model would

ideally be able to take past degradation data and predict future failure probabilities.

van Schijndel et al. (2006) explore an application on winding insulation degradation,

but stop short of developing a full model. Further research is required on other forms

of degradation and on externally measurable parameters in order to make the model

fully usable.

Hong et al. (2009) develop a statistical procedure for modeling lifetimes of transformers

in the system of an unnamed utility company. The procedure recognizes the

difference in lifetime between “New group” and “Old group” transformers and models

them separately. This highlights how the engineering of transformers has changed

over time. Transformers more than 50 years old were built without the aid of

computer simulation models, which caused manufacturers to build them with significant

overcapacity. Additionally, Hong et al. (2009) identify certain trends among

different manufacturers. A limitation of the procedure, however, is that the prediction

interval is extremely wide, since the model was based on limited data. Hong

et al. (2009) recognize that a better model could be built with more detailed usage

data that is not necessarily collected by utilities.

1.2.2 Transformer Aging

The factors that cause transformers to age are well understood. McNutt (1992)

identifies the 3 main factors that impact transformer aging: temperature, moisture,

and oxygen. These 3 factors weaken the insulation, which can significantly

increase the likelihood of transformer failure. McNutt (1992) also highlights

the importance of minimizing the levels of moisture, oxygen, and heat in extending

transformer lifetime.

Arshad et al. (2004) corroborate the same 3 factors that impact transformer

aging. They also identify the occurrence of through faults (“faults”), which

exert significant thermal and electromagnetic forces that weaken the tensile strength

of transformer insulation. Moreover, Arshad et al. (2004) highlight that moisture

and oxygen penetrate the insulation from their presence in the atmosphere. Thus,

given proper maintenance of the transformer, the aging due to moisture and oxygen

is largely constant over time.

Brown (2007) further emphasizes the role of thermal aging and through faults

in transformer failure. A major source of thermal aging is transformer

overload. As Brown notes, “Through faults cause extreme physical stress on transformer windings,

and are the major cause of transformer failures” (Brown, 2007, pg. 19-11).

1.2.3 Handling Transformer Failure

The literature on how to handle transformer failures can be divided into 2 categories:

finding an optimal number of spare transformers and determining when

to replace transformers.

Spares

Kogan et al. (1996) identify a procedure to find an optimal number of spares to

keep on hand. Kogan et al. (1996) treat the problem as a standard inventory problem,

and formulate the expected cost as the sum of the cost of nonavailability of a spare

when needed and the cost of owning a spare. In order to accomplish this, Kogan et al.

(1996) assume a Poisson distribution of transformer failures.

Chowdhury and Koval (2005) build upon Kogan et al. (1996) to determine a

procedure for finding the optimal number of spares. Failure rates, repair times, and

transformer inventory information are taken as inputs to the model. Chowdhury and

Koval (2005) identify the importance of substation design to the reliability requirements

of a system of transformers. Depending on the redundancy of transformers at

substations in the system, a spare may or may not be needed in the case of failure at

a location with transformer redundancy.

Replacement

Fischer et al. (2010) explore how to determine the sequence of replacement for

transformers. This is done using 2 criteria: condition and default risk. Specifically,

default risk incorporates the transformer’s importance to the rest of the grid. Default

risk is assessed based on the condition of the insulating system, which is measured

through scheduled diagnostic tests on each transformer.

Dixon et al. (2010) analyze when a utility should replace an aging transformer

and highlight the importance of the insulation system in the functioning of a transformer.

Although there exist certain symptoms of imminent failure, most of the time

there is significant uncertainty in predicting when a transformer will fail. Dixon

et al. (2010) also note the operational differences between older and newer transformers,

in particular the cooler operation and robust design of older transformers.

1.3 Information from Diagnostics

In order to gain information on the condition of transformers in its network,

PSE&G regularly conducts maintenance diagnostics on its transformers. The primary

method of diagnostics is “Dissolved Gas Analysis” (DGA). These tests

are conducted on an annual basis on each transformer in the network. Figure 1.2 is

an excerpt from PSE&G’s “Substation Maintenance Manual”.

Figure 1.2: PSE&G’s Maintenance Policies

The results of the DGA are what PSE&G currently uses to determine when to

replace transformers. However, the limitation of this strategy is that DGA only detects

when a transformer is about to fail. Figure 1.3 from Hamrick (2009) shows the

difference in gas concentrations between a transformer operating under normal conditions

and one operating under failure conditions. From Figure 1.3, it is clear that there are large

discrepancies in the concentrations of different gases between the “Action Limits” and

the “Normal Limits.” Moreover, according to Hamrick (2009), the change from

“Normal Limits” to “Action Limits” occurs rapidly. “Where DGA results include

a sharp increase in key gas concentration levels and/or normal limits have been exceeded,

it is suggested that an additional sample and analysis be performed to confirm

the previous evaluation...Once key gas concentrations have exceeded normal limits,

other analysis techniques should be considered for determining the potential problem

within the transformer” (Hamrick, 2009, pg. 2).

Figure 1.3: Summary of gas concentrations from DGA (Hamrick, 2009)

Karlsson (2007) describes the accumulation of gases in more detail by defining

the “gas rate” G:

\[ G = \frac{\text{Gas level}(t_2) - \text{Gas level}(t_1)}{t_2 - t_1} \]

The gas rate remains constant during the majority of a transformer’s lifetime, as

the gas level increases linearly. At some point, the gas level sharply increases in

a short period of time, and the gas rate is extremely high. Figure 1.4 shows how

this phenomenon occurs. Karlsson (2007) indicates that once the gas rate exceeds

a certain threshold, the transformer is assumed to have an accelerating wear out

effect. From this, we can see that the results of the DGA will not indicate a problem

throughout the majority of a transformer’s lifetime until the sudden increase in gas

levels occurs.
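
The gas rate calculation can be illustrated with a minimal Python sketch. The function names, the sample readings, and the threshold value are hypothetical and chosen only for illustration; neither Karlsson (2007) nor PSE&G specifies them here.

```python
def gas_rate(level_t1: float, level_t2: float, t1: float, t2: float) -> float:
    """Gas rate G: change in gas level divided by the elapsed time between samples."""
    if t2 == t1:
        raise ValueError("samples must be taken at different times")
    return (level_t2 - level_t1) / (t2 - t1)


def exceeds_threshold(levels, times, rate_threshold):
    """For each consecutive pair of samples, report whether G exceeds the threshold."""
    flags = []
    for i in range(1, len(levels)):
        g = gas_rate(levels[i - 1], levels[i], times[i - 1], times[i])
        flags.append(g > rate_threshold)
    return flags


# Hypothetical annual readings: a slow linear rise followed by a sharp jump.
times = [0, 1, 2, 3, 4]
levels = [10.0, 12.0, 14.0, 16.0, 60.0]

# Hypothetical threshold; only the final interval is flagged.
flags = exceeds_threshold(levels, times, rate_threshold=5.0)
```

In this made-up series the gas rate is constant for the first three intervals and is flagged only in the last one, which mirrors the point that DGA gives no warning during most of a transformer's life.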

Figure 1.4: Gas concentration over time from Karlsson (2007)

Unfortunately, this means that DGA does not give very precise information about

the age of the transformer. Its usefulness lies primarily in its ability to detect when the

transformer is close to failure. It should be noted that the results of DGA require some

interpretation and are not the sole determinant of transformer replacement. However,

for the purposes of this thesis, it is assumed that once DGA results show that there

has been an increase in the gas concentration, the decision is made to replace the

transformer shortly thereafter. Given the fact that DGA provides little information

on the interim aging process leading up to a transformer’s failure, annual DGA tests

alone are not sufficient to protect PSE&G from the widespread transformer failure

it is trying to avoid. If too many transformers fail DGA testing in the same year,

PSE&G will not be able to replace them all in such a short time frame. These

constraints give rise to the need for replacement policies that take into account

the aging process over time.

1.4 Overview of Thesis

If PSE&G believes that its current policy for replacement will lead to years

when a significant number of transformer failures occur, what kind of policy should

PSE&G implement to prevent this from occurring? Given PSE&G’s constraint on

the maximum number of transformer replacements per year, it must begin to replace

transformers before they fail DGA testing. A simple policy that is sure to reduce

failure risk is to replace the 10 oldest transformers in the network every year.

Although this would minimize failure risk, it is also highly inefficient. PSE&G

must consider the opportunity cost of replacing transformers early when evaluating

replacement policies. The literature establishes that DGA testing is effective at identifying

transformers on the brink of failure. Therefore, the problem is to develop a

policy that identifies transformers for replacement in a given year in addition to the

ones that fail DGA testing. The question becomes: what criteria should be used to

identify these transformers efficiently, so that the opportunity cost of early

replacement is minimized? One criterion for identification is chronological age, where

transformers that exceed a certain chronological age threshold are replaced.
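
The two simple selection rules mentioned above (replacing a fixed number of the oldest transformers, and replacing transformers beyond a chronological age threshold) might be sketched as follows. This is an illustrative sketch only: the function names, the tie-breaking rule, the transformer IDs, and the ages are assumptions, not the policies as implemented later in this thesis.

```python
def oldest_n(ages: dict, n: int) -> list:
    """IDs of the n chronologically oldest transformers (ties broken by ID)."""
    ranked = sorted(ages, key=lambda tid: (-ages[tid], tid))
    return ranked[:n]


def over_age_threshold(ages: dict, threshold: float, max_replacements: int) -> list:
    """IDs whose age exceeds the threshold, oldest first, capped by a yearly budget."""
    candidates = [tid for tid in ages if ages[tid] > threshold]
    candidates.sort(key=lambda tid: (-ages[tid], tid))
    return candidates[:max_replacements]


# Hypothetical fleet: transformer ID -> chronological age in years.
ages = {"T1": 62, "T2": 15, "T3": 68, "T4": 40, "T5": 62}

two_oldest = oldest_n(ages, 2)
capped = over_age_threshold(ages, threshold=60, max_replacements=2)
uncapped = over_age_threshold(ages, threshold=60, max_replacements=10)
```

The cap in the second function reflects the constraint on the maximum number of replacements per year; the value used here is arbitrary.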

This thesis introduces the idea of utilizing the occurrence of faults in addition

to chronological age as a criterion for replacement. Although there is a wealth of

literature that explores how to model transformer lifetime, there is very little literature

that incorporates fault information. This arises from the difficulty in detecting the

occurrence of a fault and measuring its magnitude. Although PSE&G does not have

fault data, it does have data about outages in the network. After consultation with

the Asset Strategy team, an assumption was made that outages on a substation

circuit indicate a fault occurrence at transformers on the same circuit. A substation

circuit is shown in green in Figure 1.5. From this assumption, the number of faults

that a transformer has experienced over its lifetime can be found in addition to its

chronological age. Although the severity of each fault is still unobservable to PSE&G,

a preliminary hypothesis is that by utilizing this additional information about the

number of faults, a more efficient replacement policy can be implemented.

Figure 1.5: The basic structure of the U.S. electric grid (Council of Economic Advisers, 2013)

This thesis approaches the problem in two parts. Part 1 develops a stochastic

model of transformer failures that takes into account the arrival of faults in the network.

Using empirical data from PSE&G, appropriate distributions are fit for the

number of faults that occur per year and other stochastic elements of the grid. From

this, failure times are randomly generated for each transformer based on its chronological

age and the amount of aging caused by faults. Part 2 compares the effectiveness

of different replacement policies in reducing transformer failure by implementing

them in the failure model developed in Part 1. Policies that are implemented include

PSE&G’s current policy, policies that utilize chronological age, and policies that use

both chronological age and fault occurrences. In addition to finding the best-performing

feasible policy, policies that utilize information that is currently not observable

to PSE&G are also implemented. These “infeasible policies” are used to estimate

the value of information that could be acquired from investment in new diagnostic

technologies.

Chapter 2

The Stochastic Model

The model simulates the arrival of faults in the system over the course of the

next 50 years. There are currently 358 transformers in the system, with the oldest

transformers between 60 and 70 years of age. Thus, in order to capture the arrival of

faults across all the transformers in the system, we simulate the past 70 years as well,

for a total of 120 years of simulation. When looking at the impact of different policies,

the policies are only implemented after year 70 (present day) in the simulation.

2.1 Assumptions

2.1.1 Transformer Location Correlation

Each of the 358 transformers corresponds to a location in the network, and each

fault is simulated to occur at a random location. The model assumes that there is a

correlation between certain locations in the network. This is based on the fact that

faults affect all transformers on the circuit they occur on. This means that faults affect

all transformers at the same substation and any other transformers in substations on

the same circuit. Moreover, each fault incident can occur at multiple locations in the

network. This means that each fault incident can impact multiple transformers to

different degrees, which is the foundation of the correlation assumption.

2.1.2 Aging Process

The aging process of transformers is assumed to incorporate two factors: chrono-

logical aging and the impact of faults. Chronological age is equal to the time that a

transformer has been in operation. Each year that a transformer is in operation, the

chronological age of the transformer will increase by 1 year. The impact of faults is

simulated in terms of the amount of aging it causes in a transformer. The combina-

tion of these two factors is defined as real age. At any point in time, the real age of a

transformer is simply the sum of its chronological age and the aging that faults have

caused over the course of its lifetime.

2.1.3 Failure Threshold

Failure thresholds are assumed to be in terms of real age. Each transformer in the

network is simulated with its own specific failure threshold. Once a transformer’s real

age passes its failure threshold, it is assumed to be in failure. The failure threshold is

also assumed to be different for transformers of different ages. Transformers produced

more than 50 years ago contain more excess capacity than transformers produced

more recently. Thus, the failure threshold is assumed to be higher for those older

transformers.


2.1.4 Failure and Replacement

When a transformer is in failure, it is recorded as having failed and is replaced

with a new transformer. Similarly, if a transformer is chosen by the policy to be

replaced before failure, it is also replaced with a new transformer in the same location.

No policies for replacement are implemented before year 70 (present day) in the

model. The new transformer that is installed is assumed to be of the younger kind of

transformer and has the lower failure threshold explained in section 2.1.3.

2.2 The Unobservable Model

This model captures the failure rate of transformers. It is important to recognize

that this model is from the perspective of simulating a truth. In this respect, this

model is unobservable to the policies and distinct from the model they will use.

The policies are not able to see this truth, meaning that the problem is “partially

observable” from the perspective of the policies. The model that will be used by the

policies will be discussed in Chapter 4, Policies.

2.2.1 Initialization of Model

The total number of transformers in the network is N = 358 and the transformers

are denoted by their location 1, 2, ..., N . When a new transformer is installed, it

inherits the location denotation of the transformer it is replacing. Time is denoted

by t, with t = 1, 2, ...120, where t = 1 is 70 years ago, t = 70 is present day, and

t = 120 is 50 years from today and the end of the simulation. The model assumes

that decisions are made at the end of time t. Therefore, the time period t denotes

the period after time point t and before t+ 1.


2.2.2 The State Variable

The state variable $s_t$ includes the following:

1. $a^i_t$, a boolean for transformer $i \in [1, N]$

   Denotes whether or not the transformer has entered operation. $a^i_t = 1$ if the transformer at location $i$ has been initialized and 0 otherwise. Because the transformers in the network are of different ages, the model must keep track of when each transformer is first initialized. This is only important in $t = 1, 2, \ldots, 70$. After $t = 70$, $a^i_t = 1$ for all $i$.

2. $c^i_t$ for transformer $i \in [1, N]$

   Denotes the chronological age of transformer $i$.

3. $r^i_t$ for transformer $i \in [1, N]$

   Denotes the real age of transformer $i$.

4. $L^i_t$ for transformer $i \in [1, N]$

   Denotes the failure threshold (limit) of each transformer.

5. $F$, a vector of length 120

   Denotes the number of failures in each time period $t$. At the end of each time period $t$, $F[t]$ is updated to the number of failures that occurred during $t$. This is the ultimate objective of the Stochastic Model.

We denote the unobservable state $s_t = ([a^i_t], [c^i_t], [r^i_t], [L^i_t], F)$.

2.2.3 Exogenous Information

Exogenous information includes any events that are random. As such, the ex-

ogenous information in this case will include the arrival of faults. The exogenous

information is as follows:

1. $\hat{N}^f_t$

   Denotes the number of faults that occur in time period $t$.

Each fault that occurs carries additional exogenous information. Let $j = 1, 2, \ldots, \hat{N}^f_t$. The additional exogenous information is:

2. $l^j_t$ for every fault $j$

   Denotes the location at which the fault occurs. $l^j_t$ is constrained to be an integer in the interval $[1, N]$.

3. $m^j_t$ for every fault $j$

   Denotes the magnitude of the fault. This will be in terms of real age.

In addition, the exogenous information will include which transformers go into operation at the end of time period $t$:

4. $a_t$, a vector

   Denotes which transformers enter operation at the end of time $t$.

Lastly, the exogenous information includes certain static variables, including the correlation matrix between transformer locations and the failure thresholds of new transformers:

5. $\rho$, a matrix of dimensions $N \times N$

   Denotes the correlation between two transformer locations such that $\rho(x, y)$ is the correlation between transformers $x$ and $y$.

6. $L$

   Denotes the failure threshold of a newly installed transformer.


2.2.4 Transition Functions

As noted above, these transition functions occur at the end of time period t.

It is important that these transition functions do not take into account replacement

policies, because the Stochastic Model is used for simulating failure times. Certain

transition functions, such as in the case where a transformer is replaced before reach-

ing the failure threshold, will have to be adjusted in the policy simulations, which is

detailed in Chapter 4, Policies.

1. $a^i_{t+1} = 1$ if $i \in a_t$

   Mark transformer $i$ as entering operation if the exogenous information indicates so.

2. $c^i_{t+1} = c^i_t + 1$ if $a^i_t = 1$

   Increment the chronological age of transformer $i$ if it was in operation during time period $t$.

3. In order to update $r^i_{t+1}$, we must iterate through each of the $\hat{N}^f_t$ faults that occur in time period $t$ and determine which transformers are affected by each. Each operating transformer ages by one year of chronological time, plus the correlation-weighted magnitude of every fault in the period. This updating is done using Algorithm 1.

Algorithm 1 Transition Function for $r^i_{t+1}$

    for i = 1 to N do
        if $a^i_t = 1$ then
            $r^i_{t+1} = r^i_t + 1$
        end if
    end for
    for j = 1 to $\hat{N}^f_t$ do
        for i = 1 to N do
            if $a^i_t = 1$ then
                $r^i_{t+1} = r^i_{t+1} + m^j_t * \rho(l^j_t, i)$
            end if
        end for
    end for

The above transition functions take into account the aging that has occurred to the

transformers over the course of time period t. Next, the model must check if any

of these transformers have experienced failure. This yields the following transition function:

4. If a transformer fails, the transformer must be replaced and the variables of age reset. This means that $L^i_{t+1}$ and $F[t]$ must be updated together. In addition, in the event that transformer $i$ fails, $r^i_{t+1}$ and $c^i_{t+1}$ must also be reset. This is all done through Algorithm 2.

Algorithm 2 Transition Function for $L^i_{t+1}$, $r^i_{t+1}$, $c^i_{t+1}$ in the event of failure

    F[t] = 0
    for i = 1 to N do
        if $r^i_{t+1} > L^i_t$ then
            F[t] = F[t] + 1
            $L^i_{t+1} = L$
            $r^i_{t+1} = c^i_{t+1} = 0$
        else
            $L^i_{t+1} = L^i_t$
        end if
    end for
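The transition functions above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the thesis: the function name `simulate_period`, the 0-based indexing, the list-of-lists form of ρ, and the `new_limit` callback are all assumptions made for the example. It also assumes that each operating transformer ages by one year per period plus the correlation-weighted sum of that period's fault magnitudes.

```python
def simulate_period(real_age, chron_age, limit, active, faults, rho, new_limit):
    """Advance every transformer by one time period and return the failure count.

    faults    : list of (location, magnitude) pairs for this period (hypothetical format)
    rho       : rho[l][i] is the correlation between locations l and i
    new_limit : callable returning the failure threshold of a new transformer
    """
    n = len(real_age)
    # Aging step (transition functions 2 and 3): one year of chronological
    # aging plus the correlation-weighted impact of every fault.
    for i in range(n):
        if active[i]:
            chron_age[i] += 1
            fault_aging = sum(m * rho[loc][i] for loc, m in faults)
            real_age[i] += 1 + fault_aging
    # Failure step (transition function 4): replace failed units and reset ages.
    failures = 0
    for i in range(n):
        if real_age[i] > limit[i]:
            failures += 1
            limit[i] = new_limit()
            real_age[i] = 0.0
            chron_age[i] = 0
    return failures
```

Calling this function once per period for t = 1, ..., 120 and storing the return value in F[t] would reproduce the bookkeeping described for the state variable F.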


Chapter 3

Model Selection

The previous chapter outlines the basics of the failure model. The next task

is to make this model as realistic as possible. This chapter discusses the fitting of

distributions for the stochastic processes outlined in the previous chapter. The parts

of the model incorporating uncertainty are all formulated as part of the exogenous

information. For each of these variables, we choose a distribution and fit appropriate

parameters. Given the data that is available from PSE&G, different variables were

fit using different methodologies. Fitting to existing data was used where possible.

However, in some cases, there was little or no data, requiring assumptions that were

formulated in conjunction with the Asset Strategy Team.

3.1 Transformer Location Initialization

The time of initialization of each transformer location is based on the age of the

transformers in the existing network. This corresponds to the at variable. The distri-

bution of transformer ages is derived from the ages of the transformers in the current

transformer network. PSE&G provided data on their current batch of transformers

which included the year of installation. Figure 3.1 shows the distribution of ages in PSE&G’s current network.

Figure 3.1: Histogram showing the ages of current transformers

180 out of the 358 transformers are between 50 and 70 years of age (Group 1). As Figure 3.1 shows, the number of transformers in this

age range is significantly higher than any other age range. Since this is the basis of

the entire problem, the model for transformer initialization must first and foremost

take this into account. In terms of the ages of the younger transformers, they are

relatively evenly distributed between 0 and 50 years of age (Group 2). There is also

a handful of transformers in the 80 year old range. These older transformers were

built with different capacities than Group 1 or Group 2 transformers. Thus, in order

to avoid needless complexity from these older transformers that are not pertinent to

the problem, the oldest group of transformers is assumed to belong to Group 2.


Thus, 2 different distributions are needed to simulate the ages of Group 1 and

Group 2 transformers. The transformers are indexed by i ∈ [1, 358]. Transformers

[1, 180] belong to Group 1 and transformers [181, 358] belong to Group 2. Based on

the above data, the following model of ages was chosen:

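$$\text{Ages}_i \sim \begin{cases} \text{Unif}(60, 70), & \text{if } i \in [1, 180] \\ \text{Unif}(0, 50), & \text{if } i \in [181, 358] \end{cases}$$

The following equation describes the relationship between Ages and $a_t$:

$$a_t = \{\, i : 70 - \text{Ages}_i = t \,\}$$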
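The age model and its link to the entry vector can be illustrated with a short Python sketch. The function name `sample_entry_times` is hypothetical, and drawing ages as whole years is an assumption of this example, since the text does not state how continuous ages are mapped to integer time periods.

```python
import random

N_GROUP1 = 180   # transformers 1..180 (Group 1)
N_TOTAL = 358    # transformers 181..358 form Group 2
PRESENT = 70     # t = 70 is present day

def sample_entry_times(rng):
    """Return a dict mapping t to the list of transformers entering operation at t."""
    entry = {}
    for i in range(1, N_TOTAL + 1):
        low, high = (60, 70) if i <= N_GROUP1 else (0, 50)
        age = rng.randint(low, high)      # assumption: ages in whole years
        t = PRESENT - age                 # 70 - Ages_i = t
        entry.setdefault(t, []).append(i)
    return entry
```

Under this sketch, Group 1 transformers enter operation in the first 10 periods and Group 2 transformers enter between t = 20 and t = 70.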

3.2 Number of Faults Per Year

The key part of this model is the number of faults that occur in a given year.

PSE&G has historical data on the number of faults that have occurred between 2002

and 2013. This data will be used to fit the distribution of the $\hat{N}^f_t$ variable.

3.2.1 The Poisson Distribution

A Poisson process is a stochastic process which counts the number of events that

occur in a given time interval. In this case, we are concerned with how many faults

occur every year. Thus, the Poisson distribution is a good and realistic distribution

to use in order to simulate faults experienced by transformers.

Let $\hat{N}^f_t$ denote the number of faults that occur in a given year $t$. If $P_t$ is a Poisson process, then:

$$\hat{N}^f_t = P_t - P_{t-1}$$

$P_t$, by virtue of being a Poisson process, has the following properties:

1. $P_0 = 0$

2. $P_t$ follows a Poisson distribution with parameter $\lambda t$, so each yearly increment follows a Poisson distribution with parameter $\lambda$. For every $k = 0, 1, 2, \ldots$, the probability mass function of the increment is equal to

$$P(P_t - P_{t-1} = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

3. The increments are independent such that $P_t - P_{t-1}$ is independent of $P_{t+1} - P_t$

4. The increments are stationary such that the expected number of arrivals in a given interval is $E(P_t - P_{t-1}) = \lambda(t - (t - 1)) = \lambda$

Thus, according to the distribution we have outlined, $E(\hat{N}^f_t) = \lambda$ for a given year $t$. In order to use this model in the simulation, fault occurrences are assumed to follow a homogeneous Poisson process, meaning that λ is constant over time. This is a realistic assumption because the number of faults experienced by transformers should remain roughly the same from year to year. The assumption of homogeneity must hold for the model to be valid over the simulated past.

3.2.2 The Data

The fault data comes from the Plant Operations Report (POR) from PSE&G.

The POR lists all the outages or incidents that have occurred in the network. Note

that these outages are outages of the power grid and not specific to transformers. The

POR is the best data that PSE&G has since it does not have detailed transformer


fault data. After consultation with the Asset Strategy Team, a reasonable assumption

was made that each outage on a substation circuit corresponds to a fault occurring

at a transformer on the circuit. Each outage in the POR is identified by an incident

ID and a circuit location. The assumption is that each outage that occurs in a given

circuit location means that a fault of some sort has occurred at the transformers on

that circuit.

Moreover, analyzing the incident IDs shows that a given incident can impact

multiple locations and by extension the transformers on those circuits. This is the

basis of the assumption of correlation between faults at different transformer locations,

since the data shows that a fault at one transformer is likely tied to the same incident

that caused faults at other transformer locations. Thus, $\hat{N}^f_t$ is fitted to the number

of unique incidents that occur every year. Figure 3.2 shows the graph of the number

of incidents per year in the years for which there is data.

Figure 3.2: Historical number of unique incidents per year


3.2.3 Fitting the Poisson Distribution

Analysis of the data yields:

$$\mu = 230.6$$

$$\sigma^2 = 1{,}767.5$$

A Poisson distribution has 1 parameter λ such that $\mu = \sigma^2 = \lambda$. Given the estimates of $\mu$ and $\sigma^2$ from the data, the preliminary results indicate that the Poisson distribution might not yield a good fit. The poissfit function in MATLAB finds λ using the

Maximum Likelihood Estimator. Using the poissfit function on the historical fault

data yields:

λ = 230.6

Unsurprisingly, the MLE estimate for λ is equal to the mean of the historical data. In

order to check the goodness-of-fit of the fitted model, the Kolmogorov-Smirnov test

(KStest) is used. The results of the KStest are:

$$p_{\text{Poisson}} = 0.3079$$

Although the p-value is high enough that the KStest fails to reject the null hypothesis that the empirical data comes from the Poisson distribution, it suggests that the Poisson distribution is not an especially good fit for the historical fault data. Given $\mu$ and $\sigma^2$, the empirical data is characterized by a variance that is far greater than the mean. This problem is known as overdispersion, where the variance of the empirical data is higher than expected under a given statistical model.
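The extent of the overdispersion can be made concrete with the variance-to-mean ratio (the index of dispersion), which equals 1 for a Poisson distribution. Using the sample estimates reported above:

```latex
\frac{\sigma^2}{\mu} = \frac{1767.5}{230.6} \approx 7.66
```

The sample variance is therefore more than seven times what a Poisson distribution with the same mean would imply.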


3.2.4 The Negative Binomial Distribution

Ismail and Jemain (2007) identify the use of the Negative Binomial distribu-

tion as a suitable variation on the Poisson distribution to handle the problem of

overdispersion. Furthermore, the Negative Binomial distribution is used in the ex-

isting literature to model grid outages (Liu et al., 2005). A process following the

Negative Binomial distribution is characterized by many of the same properties as a

Poisson process, such as independent and stationary increments. Such a process has

the following additional properties:

1. The Negative Binomial distribution has 2 parameters, r and p.

2. If $B_t$ describes a process following a Negative Binomial distribution, the probability mass function is equal to:

$$P(B_t = k) = \binom{k + r - 1}{k} (1 - p)^r p^k$$

3. $\mu_{NB} = \frac{pr}{1 - p}$

4. $\sigma^2_{NB} = \frac{pr}{(1 - p)^2}$

A Poisson distribution has 1 parameter λ that remains constant. The Negative Bi-

nomial distribution has 1 additional parameter, which defines a specific distribution

for λ. In this way, the Negative Binomial distribution is equivalent to a Poisson

distribution where λ comes from a Gamma distribution such that E(λ) = µ.

The nbinfit function in MATLAB is used to compute the MLE estimates for

parameters r and p. The nbinfit function yields the following fitted values:

r = 39.18

p = 0.8548
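As a quick consistency check, the fitted values can be substituted into the moment formulas for the Negative Binomial distribution given above:

```latex
\mu_{NB} = \frac{pr}{1 - p} = \frac{0.8548 \times 39.18}{0.1452} \approx 230.7,
\qquad
\sigma^2_{NB} = \frac{pr}{(1 - p)^2} = \frac{\mu_{NB}}{1 - p} \approx 1588.5
```

The fitted mean agrees with the sample mean of 230.6, and the fitted variance is of the same order as the sample variance of 1,767.5, in contrast to the Poisson fit, whose variance is constrained to equal its mean.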

Figure 3.3 shows the difference between the fitted Negative Binomial and Poisson distributions. As expected, the Poisson distribution is characterized by a much lower variance, with most values concentrated within a small range around the mean. In comparison, the Negative Binomial distribution has a much higher variance.

Figure 3.3: Comparison of the Poisson and Negative Binomial distributions. The green line shows the p.m.f. of the fitted Negative Binomial distribution and the blue line shows the p.m.f. of the fitted Poisson distribution.

In order

to compare the goodness-of-fit between the two models, the KStest is run comparing

the empirical data with the fitted Negative Binomial distribution, yielding:

$$p_{NB} = 0.9625$$

Comparing the two p-values shows that the Negative Binomial distribution has a significantly higher p-value than the Poisson distribution. Thus, the Negative Binomial distribution is a much better fit for the historical data and is chosen as the distribution for the faults.

$$\hat{N}^f_t \sim NB(39.18, 0.8548)$$
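The equivalence between the Negative Binomial distribution and a Poisson distribution with a Gamma-distributed rate suggests one way to draw yearly fault counts using only the Python standard library. The sketch below is illustrative and is not the MATLAB code used for the thesis. The function names are hypothetical, and the Gamma scale p/(1 - p) is chosen so that the mixture has the mean and variance given by the moment formulas above.

```python
import random

R, P = 39.18, 0.8548   # fitted parameters reported in the text

def poisson_count(lam, rng):
    """Count the arrivals of a rate-lam Poisson process within one unit of time."""
    count = 0
    clock = rng.expovariate(lam)          # exponential inter-arrival times
    while clock <= 1.0:
        count += 1
        clock += rng.expovariate(lam)
    return count

def faults_per_year(rng, r=R, p=P):
    """Draw one yearly fault count as a Gamma-mixed Poisson variable."""
    lam = rng.gammavariate(r, p / (1.0 - p))   # shape r, scale p / (1 - p)
    return poisson_count(lam, rng)
```

Averaging many draws of `faults_per_year` should give a sample mean near 230.7 and a sample variance well above the mean, in line with the overdispersion seen in the historical data.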

3.3 Location of Faults

The location of each fault that occurs is assumed to be random. It is assumed

that each fault has an equal probability of occurring at any of the transformer lo-

cations in the grid. Thus, the location of each fault is sampled from a uniform

distribution:

$$l^j_t \sim \text{Unif}\{1, 2, \ldots, 358\}$$

3.4 Transformer Correlation

Correlation between each of the transformer locations is based off of a combi-

nation of the Plant Operations Report and the current list of transformers in the

network. This corresponds to the ρ variable.

Using the unique incident IDs in the POR, we can find the number of circuit locations that each incident affected. We denote the number of circuit locations affected by an incident in 2013 as $N_l$. From the list of transformers, we can find the number of transformers on each circuit. The mean number of transformers on each circuit is denoted $\mu_T$. Thus, the distribution of $\mu_T N_l$ approximates the distribution of the number of transformers impacted by each fault. From the data:

$$\mu_T = 2.73$$

Figure 3.4 shows the distribution of $\mu_T N_l$. The mean of $\mu_T N_l$ is equal to 8.28.

Figure 3.4: Distribution of the number of transformers affected per fault on average

There is a lot of uncertainty surrounding the amount of correlation between transformers.

Certainly, a fault does not affect every single transformer equally. After consultation

with PSE&G, an assumption was made that the correlations between transformers

would vary between 0.5 and 0.9. Thus, Algorithm 3 was used to generate ρ for each

simulation with tunable parameter p.

In order to tune p, we ran 1,000 simulations and checked the average number of

correlated transformers in the generated correlation matrix. Values of p in the interval

of [0.01,0.10] were tested. Table 3.1 shows the results of tuning p. For p = 0.02, the

average number of correlated transformers was 8.24, which was the closest to the

empirical average of 8.28. Thus, ρ was generated using Algorithm 3 with tunable parameter p = 0.02.

Algorithm 3 Generation of Correlation Matrix ρ with Tunable Parameter p

    for i = 1 to N do
        for j = 1 to N do
            if i = j then
                ρ(i, j) = 1
            else
                if Unif(0, 1) < p then
                    ρ(i, j) = Unif(0.5, 0.9)
                    ρ(i, j) = ρ(j, i)
                end if
            end if
        end for
    end for

Table 3.1: Parameter selection for correlation matrix

    p       Average number of correlated transformers
    0.01    4.56
    0.02    8.24
    0.03    11.74
    0.04    14.87
    0.05    18.37
    0.06    22.41
    0.07    25.98
    0.08    29.00
    0.09    33.14
    0.10    36.70
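The generation and tuning of ρ can be sketched as follows. This is one possible reading of Algorithm 3, not a reproduction of the thesis code: it assumes that each unordered pair of locations is drawn once and mirrored so that the matrix is symmetric, and that unselected pairs have zero correlation. The function names are hypothetical. Under these assumptions the expected number of correlated transformers per location is 1 + (N - 1)p, which is about 8.14 for p = 0.02 and close to the values in Table 3.1.

```python
import random

def make_rho(n, p, rng):
    """Generate a symmetric n x n correlation matrix with sparsity parameter p."""
    rho = [[0.0] * n for _ in range(n)]
    for i in range(n):
        rho[i][i] = 1.0
        for j in range(i + 1, n):            # assumption: each pair is drawn once
            if rng.random() < p:
                rho[i][j] = rho[j][i] = rng.uniform(0.5, 0.9)
    return rho

def mean_correlated(rho):
    """Average number of nonzero entries per row, counting the transformer itself."""
    return sum(sum(1 for v in row if v > 0) for row in rho) / len(rho)
```

Repeating `mean_correlated(make_rho(358, p, rng))` over many simulations for each candidate p mirrors the tuning procedure described above.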

3.5 Failure Times and Fault Magnitude

The failure times and fault magnitude are all in terms of a real age rit. Since real

age is a constructed measure of aging that has no corresponding physical data, we

must find reasonable distributions for the failure times and the fault magnitude that

produce failure rates similar to observed or expected failure rates. This corresponds

to the exogenous variables $L$ and $m^j_t$.

A transformer $i$ experiences failure when $r^i_t > L^i_t$. $r^i_t$ is determined from both the chronological age of the transformer and the impact of aging through faults, but the uncertainty in $r^i_t$ is entirely derived from the aging through faults. Thus, $L$ and $m^j_t$ together determine when a transformer will fail, meaning that the distributions for these 2 pieces of exogenous information must be fitted together.

3.5.1 Methodology

There is no data on the real age of transformers and the real age impact of

faults. However, we do know that the result of aging is transformer failures. Thus,

given an assumption of what the rate of transformer failures will be, we can test the

goodness-of-fit of distributions for failure times and fault magnitude.

First, a failure rate path, the Base Model, is determined that represents what the failure rates will look like in t = [70, 120], based on assumptions from PSE&G. Second, distributions are chosen for $L$ and $m^j_t$, with unknown values for parameters.

Third, we iterate through different values for each of the parameters and compare the

sample failure rate using the parameters with the Base Model. The p-value of the

Kolmogorov-Smirnov test between the Base Model and sample failure rate is used to

determine goodness-of-fit.

3.5.2 The Base Model

The Base Model was chosen based on certain assumptions on the failure rate

going forward. Figure 3.5 shows the number of failures per year in the Base Model.

The Base Model was chosen because the characteristics match the assumptions that

PSE&G indicated they would expect to see in the failure rates:

Figure 3.5: The number of failures per year in the Base Model

1. The model shows no failures before t = 70. This is a strong assumption because the current age of transformers is based off of transformers that have not yet

failed and are still in operation today. Thus, the model for failures that is ulti-

mately chosen should not show a high number of failures before t = 70. Given

the uncertainty surrounding the problem, we expect that the final model will

show a small number of transformers failing before t = 70 for some realizations,

but the goal is to minimize the number of these early failures.

2. The mean number of failures per year in the interval t = [70, 120] is 6. Not

accounting for the spikes, the mean number of failures is closer to 5 per year.

Based on consultation with PSE&G, 5 failures per year is the number of failures

that they expect to see in a normal year.

3. There are 3 years in which there are observed spikes in the failure rate. Two of these spikes are higher than 15 failures, and the third is smaller, at 10

failures. Based on consultation with the Asset Strategy team, these spikes are

what they are trying to avoid. It is important to note that the failure rates

are the number of failures observed, not the number of replacements. Given

a replacement policy, it would be possible to observe lower spikes if certain

transformers are replaced early.

4. The frequency and magnitude of the spikes are key characteristics of the Base Model. That spikes occur in only a few years is a strong assumption for PSE&G, which is especially worried about just a couple of years in which there are spikes. It would not be reasonable to see a higher frequency of spikes. The magnitude of the spikes is also important because they

are lower than 20 failures per year. From PSE&G’s perspective, more than 20

failures in a given year would be catastrophic and they do not expect to see

such a high failure rate. These assumptions are also important for the ability to

design a more efficient replacement policy, as a high frequency and magnitude of

failure rates would make it impossible for PSE&G to avoid system-wide failure.

3.5.3 Failure Times

The failure times in terms of real age are assumed to be different for older

transformers and newer transformers. The older transformers are those older than

50 years old, and correspond to the transformers i ∈ [1, 180]. Newer transformers

are assumed to be transformers that are younger than 50 years old, which include

transformers i ∈ [181, 358]. Moreover, all the new transformers that are installed are

assumed to be newer transformers and have lower failure thresholds in terms of real

age.

Thus, distributions must be fit for L for t < 20 and t > 20 denoted L1 and L2


respectively. Based on assumptions from PSE&G, a reasonable difference between L1

and L2 is 30 real age units. We denote the parameter Rf as the failure threshold in

terms of real age for the older transformers. In addition, there is a certain amount of

randomness in the failure thresholds of transformers. The transformers in each group

are not identical and as such would not have the same exact failure thresholds. Thus,

an assumption was made that failure thresholds would be distributed in a ±10 real

age unit band around Rf . Therefore:

L1 ∼ Unif(Rf − 10, Rf + 10)

L2 ∼ Unif(Rf − 30− 10, Rf − 30 + 10)

The parameter for the distribution of failure times that must be fit is Rf .

3.5.4 Magnitude of Faults

“We think that the distribution of faults is such that 10% of the time faults

have no impact, 80% of the time there is some impact, and the other 10%

of the time the fault has a significant impact on the transformer.”

Richard Wernsing, Manager of Asset Strategy

This assumption forms the basis of the model for fault magnitude. We denote m1,

m2, and m3 such that:

$$P(m^j_t \sim m_1) = 0.1$$

$$P(m^j_t \sim m_2) = 0.8$$

$$P(m^j_t \sim m_3) = 0.1$$

Thus, we must find distributions for $m_1, m_2, m_3$. Of these, we are most concerned

with m3, as the assumption is that these are the faults that produce significant impact

and therefore will affect the failure times the most. For m1 , the assumption is that

these faults have no impact on the aging of transformers. Therefore the fitting of m1

is straightforward:

m1 ∼ 0

For $m_2$, we also assume that the impact of these faults is relatively minor. Therefore, it is assumed that $m_2$ is distributed as:

$$m_2 \sim \text{Unif}(0, 0.5)$$

The fitting of m3 is the most important. It is assumed that there is a significant

amount of uncertainty in the faults that fall under the category of $m_3$. PSE&G has experienced significant faults that range from faults that merely age transformers to faults that cause extensive outages of transformers. Therefore, we assume that $m_3$ must also come from a set of distributions to capture the wide range of significant faults. We denote $m_{3,1}, m_{3,2}, m_{3,3}$ such that:

$$P(m_3 \sim m_{3,1}) = 0.1$$

$$P(m_3 \sim m_{3,2}) = 0.8$$

$$P(m_3 \sim m_{3,3}) = 0.1$$

We assume that these distributions increase in severity such that, on average, $m_{3,3} > m_{3,2} > m_{3,1}$. However, we do not have an assumption for the relative scale of these three distributions, so we further denote scaling parameters $a_m, b_m$ such that:

$$m_{3,1} \sim \text{Unif}(0, 1)$$

$$m_{3,2} \sim \text{Unif}(0, a_m(3 + b_m))$$

$$m_{3,3} \sim \text{Unif}(3a_m, a_m(5 + b_m))$$


am and bm control the relative and absolute differences between the distributions. We

can test different relative scales by varying am, bm. The parameters that must be fit

for fault magnitude are thus am, bm.
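The nested mixture for fault magnitude can be sampled as in the following sketch. The function name `fault_magnitude` is hypothetical, and the code simply follows the mixture weights and uniform ranges stated above, with the scaling parameters passed in as arguments.

```python
import random

def fault_magnitude(rng, am, bm):
    """Draw one fault magnitude (in real age units) from the nested mixture."""
    u = rng.random()
    if u < 0.1:
        return 0.0                          # m1: no impact
    if u < 0.9:
        return rng.uniform(0.0, 0.5)        # m2: minor impact
    v = rng.random()                        # m3: significant impact
    if v < 0.1:
        return rng.uniform(0.0, 1.0)                    # m3,1
    if v < 0.9:
        return rng.uniform(0.0, am * (3 + bm))          # m3,2
    return rng.uniform(3 * am, am * (5 + bm))           # m3,3
```

For example, `fault_magnitude(random.Random(0), 1.8, -0.3)` draws a magnitude between 0 and 1.8 × 4.7 = 8.46 real age units, and about 10% of all draws are exactly zero.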

3.5.5 Parameter Selection

The parameters Rf , am, bm were tested iteratively and each combination of the

3 parameters was run for 50 iterations. The ranges for each parameter were as fol-

lows:

Rf = [110, 210] in increments of 5

am = [0.5, 1.9] in increments of 0.1

bm = [−2, 1.9] in increments of 0.1

21 values were tested for Rf , 15 for am, and 40 for bm, yielding a total of 12,600

combinations of the 3 parameters.
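The size of the search grid can be checked by enumerating it directly. The variable names below are illustrative. Values are built from integer step counts and rounded to one decimal place so that floating-point drift does not add or drop grid points.

```python
from itertools import product

RF_VALUES = list(range(110, 211, 5))                         # 110, 115, ..., 210
AM_VALUES = [round(0.5 + 0.1 * k, 1) for k in range(15)]     # 0.5, 0.6, ..., 1.9
BM_VALUES = [round(-2.0 + 0.1 * k, 1) for k in range(40)]    # -2.0, -1.9, ..., 1.9

# Every (Rf, am, bm) combination to be simulated for 50 iterations each.
GRID = list(product(RF_VALUES, AM_VALUES, BM_VALUES))
```

Here `len(GRID)` equals 21 × 15 × 40 = 12,600, matching the count stated above.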

As described earlier, the p-value of the KSTest between the sample failure rate

and the Base Model was used to choose the best combination of parameters. We use

2 versions of this statistic:

1. ptot denotes the p-value when the sample failure rate and the Base Model are

compared for t = [1, 120].

2. p50 denotes the p-value when the sample failure rate and the Base Model are

compared for t = [70, 120].

Both statistics are used in the parameter selection because each statistic tests for

a different characteristic. p50 tests for the similarity of the failure rates specifically

in the simulation time period. p50 is the more important statistic since the failure


rate in the future is central to the problem at hand. However, ptot must also be

considered since, as earlier explained, the chosen model should minimize the number

of transformer failures before t = 70. ptot considers the entire timeframe so it tests

for this characteristic. Without considering ptot, a model that has a high p50 value

might closely match the failure rates in the simulation time period but also have a

significant number of failures before t = 70.
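The statistic underlying both p-values can be sketched without external libraries. The function below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical distribution functions. Treating the yearly failure counts of the sample path and of the Base Model as the two samples is an assumption of this example, since the text does not specify how the KStest was applied, and the function name is hypothetical. Converting the statistic into a p-value is left to a statistics package.

```python
def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past every observation equal to v.
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d
```

Under this reading, the statistic behind ptot would compare the two paths over all 120 periods, while the statistic behind p50 would compare only the entries for t = 70 through t = 120.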

          Rf     am     bm      ptot     p50
    1     210    1.8    -0.3    0.519    0.132
    2     205    1.0    1.3     0.545    0.123
    3     210    1.1    1.6     0.533    0.122
    4     205    1.2    0.6     0.541    0.121
    5     210    1.2    1.3     0.517    0.113
    6     210    1.2    1.5     0.509    0.111
    7     210    1.1    1.3     0.536    0.109
    8     200    1.2    0.2     0.536    0.108
    9     200    1.3    -0.2    0.503    0.108
    10    200    1.0    1.5     0.490    0.108

Table 3.2: Summary of best fit parameters

Table 3.2 shows the top 10 combinations of parameters ranked by p50. All of the

top ranked combinations also have similar ptot, so it is safe to choose based solely on

p50. Thus, combination 1 is chosen and final parameters are:

Rf = 210

am = 1.8

bm = −0.3

Figure 3.6 shows sample paths of the failure model using the chosen parameters.

Figure 3.6: Sample paths of failure rates using the fitted stochastic model (panels a through d)

Chapter 4

Policies

This chapter details the different policies that we have tested. The policies oper-

ate using a different model, denoted as the Observable Model. All of the tested policies

are evaluated in comparison to PSE&G’s current policy of replacement (denoted as

the Base policy), which replaces transformers when they fail DGA testing. In order

to avoid years where there are too many failures to replace (“spikes”), the policies

choose additional transformers to replace in addition to the ones that fail DGA test-

ing. Furthermore, the policies are only implemented for time periods t = [70, 120].

This is because the policies are tested for effectiveness in preventing failures in the

future.

4.1 The Observable Model

In contrast to the Unobservable Model explained in Chapter 2, this section out-

lines the Observable Model. The Observable Model contains only the information

that the policies have access to. There are certain state variables that are in both

the unobservable and observable models, such as the chronological age of the transformers. In addition, the observable state will include statistics that certain policies

utilize.

4.1.1 The State Variable

The state variable st includes the following:

1. $a^i_t$ for transformer at location $i \in [1, N]$

   Denotes whether or not the transformer has entered operation. $a^i_t = 1$ if the transformer at location $i$ has been initialized and 0 otherwise. Because the transformers in the network are of different ages, the model must keep track of when each transformer is first initialized. This is only important in $t = 1, 2, \ldots, 70$. After $t = 70$, $a^i_t = 1$ for all $i$.

2. $c^i_t$ for transformer at location $i \in [1, N]$

   Denotes the chronological age of transformer $i$.

3. $N^i_{t,f}$ for transformer at location $i \in [1, N]$

   Denotes the number of faults that each transformer has experienced over its lifetime.

Thus, the observable state is $s_t = ([a^i_t], [c^i_t], [N^i_{t,f}])$. Compared to the unobservable

state, the number of variables in the observable state is small. This reflects the core of

the problem that PSE&G is facing: there are many factors affecting the transformers

that the utility does not have information for.

4.1.2 Decision Variables

The decision that must be made in each time period is which transformers should

be replaced. The decision variable is as follows:

1. $x_t$, a vector of which transformers to replace

   Denotes the transformers to be replaced at the end of time $t$. Since replacement policies only begin after t = 70, $x_t = \{\}$ for t < 70.

4.1.3 Exogenous Information

The exogenous information at time t in the observable model does not all ar-

rive at the same time. Rather, the exogenous information is divided into two time

periods:

1. The information that arrives before the replacement decision xt is made

2. The information that arrives after the decision is made but before the time

period t+ 1 begins

Moreover, in terms of the simulation, the exogenous information comes from 2 sources:

the unobservable model and events that occur with some randomness. In order to

distinguish between the unobservable and observable exogenous information, the

superscript “o”, in addition to the hat (ˆ), will be used to denote observable exogenous

information.

The observable exogenous information that arrives before the replacement decisions

is as follows:

1. $\hat{a}^o_t$, a vector

Denotes which transformers enter operation in time t. This information is

the same as the unobservable exogenous information $\hat{a}_t$ detailed in section 3.1,

pg. 19.

2. $\hat{N}^{f,o}_t$

Denotes the number of faults that occur in time period t. This information is

the same as the unobservable exogenous information $\hat{N}^f_t$ detailed in section 3.2,

pg. 21.

3. $\hat{l}^{j,o}_t$, a vector for every fault j

Denotes the transformer locations affected by fault j.

$\hat{l}^{j,o}_t$ is distinct from, but related to, the exogenous information denoted $l^j_t$ in the

unobservable model detailed in section 3.3, pg. 27. In the unobservable model, the

impact of fault j is calculated to be $m^j_t \cdot \rho(l^j_t, i)$, where $m^j_t$ is the magnitude of the

fault and ρ is the correlation between locations (see pg. 17). However, $m^j_t$ is not

observable to the policy, so the observable information is just every location on which

the fault had an impact. Therefore,

$$\hat{l}^{j,o}_t = \left[\, i : m^j_t \cdot \rho(l^j_t, i) > 0 \,\right]$$

Thus, the observable model is able to observe the location and the number of faults that

occur in a given year. It is not able to observe the magnitude of the faults.

4. Dt a vector of which transformers fail DGA testing

Denotes which transformers fail DGA testing in a given year. The results of

DGA are a function of the physical condition of the transformers. For simulation

purposes, a threshold with some uncertainty was chosen in order to properly

simulate DGA. The chosen parameters are detailed in section 4.2, pg. 44.

After the replacement decision is made, the exogenous information that arrives is as

follows:

5. Ft a vector of which transformers experience failure

Denotes which transformers have experienced failures. This information comes

from the unobservable model in section 3.5, pg. 29, such that a transformer i

experiences failure when rit > Lit. For the policy simulation, we are primarily

concerned with the interval t = [70, 120]. The stochastic model of failure times is selected

to minimize the number of failures that occur before t = 70, but inevitably

some will occur. These will be assumed to be replaced after failure and are not

considered in the policy testing results.

4.1.4 Transition Functions

The transition functions are as follows:

1. ait+1 = 1 if i ∈ at

Marks transformer i as having entered operation if the exogenous information

indicates that it has.

2. $c^i_{t+1} = \begin{cases} 0, & \text{if } i \in x_t \\ 0, & \text{if } i \in F_t \\ c^i_t + 1, & \text{otherwise} \end{cases}$

Reset the chronological age of a transformer if it fails or is replaced, otherwise

increment the chronological age by one year. Note that in the unobservable

model, the real age rit is also reset when cit is reset.

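3. $N^i_{t+1,f} = \begin{cases} 0, & \text{if } i \in x_t \\ 0, & \text{if } i \in F_t \\ N^i_{t,f} + \sum_{j=1}^{\hat{N}^{f,o}_t} 1_{\{i \in \hat{l}^{j,o}_t\}}, & \text{otherwise} \end{cases}$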

Similar to the chronological age, the counter for the number of faults experienced

is set to 0 if the transformer fails or is replaced. Otherwise, the number of faults

that affected that transformer location is added to the previous total.
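The transition functions for the chronological age and the fault counter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `ObservableState` and `step` are hypothetical, locations are indexed from 0 rather than 1, and each observed fault is passed in as the set of locations it affected.

```python
from dataclasses import dataclass


@dataclass
class ObservableState:
    age: list     # c_t^i, chronological age per location
    faults: list  # N_{t,f}^i, lifetime fault count per location


def step(state, replaced, failed, fault_locations):
    """Advance the age and fault counters by one year.

    replaced: set of indices in x_t
    failed: set of indices in F_t
    fault_locations: one set of affected indices per observed fault j
    """
    new_age, new_faults = [], []
    for i, (c, n) in enumerate(zip(state.age, state.faults)):
        if i in replaced or i in failed:
            # A replaced or failed transformer restarts at zero.
            new_age.append(0)
            new_faults.append(0)
        else:
            # Count how many of this year's faults touched location i.
            hits = sum(1 for affected in fault_locations if i in affected)
            new_age.append(c + 1)
            new_faults.append(n + hits)
    return ObservableState(new_age, new_faults)
```

For example, with three locations, replacing location 0 and observing a failure at location 1 resets both, while location 2 ages by one year and accumulates every fault that reached it.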

4.1.5 Objective Function

The objective function is to minimize the cost to PSE&G and is formally written

as:

$$\min_\pi \left[ \sum_{t=70}^{120} C(s_t, X^\pi(s_t)) \right]$$

In the context of this problem, the costs incurred by PSE&G can be decomposed into

2 costs:

1. $C^o(s_t, X^\pi(s_t))$

Denotes the opportunity cost, which is the cost of replacing a transformer before

it reaches failure. All policies that seek to prevent failure will incur some amount

of opportunity cost.

2. $C^f(s_t, X^\pi(s_t))$

Denotes the cost of failure, which is the cost of too many transformer failures in

time t. PSE&G is concerned with minimizing risk. As such, the costs of failure

must be evaluated in terms of the expected value of some risk measure.

The objective function can be reformulated as:

$$\min_\pi \left[ \sum_{t=70}^{120} C^o(s_t, X^\pi(s_t)) + E \sum_{t=70}^{120} C^f(s_t, X^\pi(s_t)) \right]$$

This objective function represents the trade-off that PSE&G faces in solving this

problem. PSE&G could avoid the possibility of a high number of transformer failures

by replacing all of its transformers well before they are close to failure. However,

this would incur significant opportunity costs. Thus, the objective function seeks

to minimize the sum of these 2 costs, and the best performing policy will be one

that minimizes failures while also keeping opportunity costs low. Although the concepts

of these 2 costs are straightforward, quantifying the costs is a slightly more

difficult problem. The process of quantifying these costs is detailed in section 5.1.3,

pg. 60.

4.2 Base Policy

The Base policy is chosen to be PSE&G’s current policy for transformer replacement.

The current policy is to replace a transformer when a set of diagnostics,

the most prominent being Dissolved Gas Analysis (DGA) testing, indicates that the

transformer is about to fail. The literature (Hamrick (2009), Karlsson (2007)) shows

that DGA testing only indicates a problem when the transformer is on the verge of

failing. Thus, DGA testing does not give any information about the aging of a

transformer before it reaches this point. The exogenous information Dt indicates which

transformers fail DGA testing in a given year t.

PSE&G’s current policy is simply to replace whichever transformers fail DGA testing.

However, PSE&G faces a constraint on how many transformers it can replace

in a given year, denoted as ν. According to the Asset Strategy team, the maximum

number of replacements that can occur in a year is ν = 8 replacements. Thus, even

if DGA testing indicates that more than ν transformers are near failure, we assume

that PSE&G will not be able to replace all of them.

Thus, the base policy can be formulated as:

$$X^{BP}(s_t) = \begin{cases} D_t, & \text{if } n(D_t) \le \nu \\ D_t[1:\nu], & \text{if } n(D_t) > \nu \end{cases}$$

where n(·) denotes the number of elements contained. Moreover, we assume that

elements in Dt are arranged in temporal order, such that the first element in Dt is

the first transformer to fail DGA testing in time t.
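As a minimal sketch, the Base policy reduces to truncating the DGA list at the replacement cap. The function name `base_policy` is hypothetical, and the input list is assumed to already be in the temporal order described above.

```python
def base_policy(dga_failed, nu=8):
    """Return the transformers replaced under the Base policy.

    dga_failed: transformer indices in D_t, in the order they failed DGA testing
    nu: maximum number of replacements per year
    """
    # Slicing covers both cases: it returns all of D_t when n(D_t) <= nu
    # and the first nu elements otherwise.
    return list(dga_failed[:nu])
```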

4.2.1 Simulating DGA Testing

DGA testing results are a function of the physical condition of the transformer.

However, we do not simulate the specific physical conditions of transformers. Thus,

we must find a way, in terms of the conditions that are simulated, to decide when

a transformer fails DGA testing. The process of determining DGA failure will have

access to the full unobservable model.

We assume that the threshold for DGA failure is in terms of the real age rit and

failure threshold Lit. Since transformer failure occurs when rit > Lit, the threshold

for DGA failure must be where rit ≤ Lit. We assume the following distribution for

Dt:

Dt = {i ∈ [1, N ] | rit > Lit − d}

with parameter d indicating the distance, in terms of real age, from failure when DGA

testing would indicate a problem.

Since DGA testing is not exact, d is chosen not to be a static value, but rather

to be a distribution in order to capture this uncertainty:

d ∼ Unif(1, du)

Since failing a DGA test must occur before transformer failure, the lower bound of d

was fixed at 1. However, the upper bound du is a parameter that must be chosen in

order to best match reality. After testing over 1000 iterations of the failure model,

du = 3 was determined to be the most realistic. This was based on 2 criteria:

1. The average number of transformers that fail DGA testing in t = [70, 120]. The

average in reality should be between 5-6 per year, according to PSE&G.

2. The base policy of replacing when DGA testing indicates failure, up to 8 per year,

resulted in a couple of years with spikes in failures. This assumption

is core to the problem we are trying to solve. If the base policy sufficiently

prevented all failures then there would be no need to find better policies.
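The DGA rule above can be sketched as follows. This is an illustration rather than the thesis code: the function name `simulate_dga` is hypothetical, locations are indexed from 0, and d is assumed to be drawn independently for each transformer in each year, which the text does not specify. The result is returned in index order, not in the temporal order assumed for Dt.

```python
import random


def simulate_dga(real_age, failure_threshold, d_upper=3.0, rng=None):
    """Return the indices of transformers that fail DGA testing this year.

    real_age: r_t^i per transformer
    failure_threshold: L_t^i per transformer
    d_upper: upper bound d_u of the uniform distribution for d
    """
    rng = rng or random.Random()
    flagged = []
    for i, (r, limit) in enumerate(zip(real_age, failure_threshold)):
        d = rng.uniform(1.0, d_upper)  # d ~ Unif(1, d_u)
        if r > limit - d:
            flagged.append(i)
    return flagged
```

With d_u = 3, a transformer within 1 unit of real age of its threshold is always flagged, one more than 3 units away is never flagged, and anything in between is flagged with some probability.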

4.3 Pure Aging Policy

The Pure Aging policy chooses transformers to replace based solely on the

chronological age of the transformer. This relatively simple policy will seek to replace

all transformers whose age exceeds a certain threshold, denoted α. As with all

the policies, this policy gives priority to those transformers that fail DGA testing.

The policy will choose additional replacements in years in which n(Dt) < ν.

Let At denote the set of all transformers that exceed a certain chronological age

α, such that:

At = {i ∈ [1, N ] | cit > α}

where α is a tunable parameter. We assume that the elements in At are sorted in

descending age order. The Pure Aging policy is then formulated as:

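$$X^{PA}(s_t) = \begin{cases} X^{BP}(s_t), & \text{if } n(D_t) > \nu \\ D_t \cup A_t[1:(\nu - n(D_t))], & \text{else if } n(D_t) + n(A_t) > \nu \\ D_t \cup A_t, & \text{else if } n(D_t) + n(A_t) \le \nu \end{cases}$$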

The Pure Aging policy will test the correlation between transformer failure and

chronological age. If all transformers fail in a similar age range, the Pure Aging

Policy should perform very well with a properly tuned α parameter. On the other

hand, if the correlation is low, the Pure Aging Policy might still be able to reduce

transformer failure, while incurring a very high opportunity cost.
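A short sketch of the Pure Aging rule is given below. The function name `pure_aging_policy` is hypothetical and locations are indexed from 0. One detail is an assumption not stated in the formulation: transformers already in Dt are skipped when building At, so that a transformer is never counted twice against the cap.

```python
def pure_aging_policy(dga_failed, age, alpha, nu=8):
    """Return the replacement list for one year under the Pure Aging policy.

    dga_failed: indices in D_t, in temporal order
    age: chronological age c_t^i per transformer
    alpha: age threshold; nu: annual replacement cap
    """
    if len(dga_failed) > nu:
        return list(dga_failed[:nu])  # falls back to the Base policy
    chosen = list(dga_failed)
    # A_t: transformers older than alpha, oldest first (assumption: skip D_t members).
    candidates = sorted(
        (i for i, c in enumerate(age) if c > alpha and i not in chosen),
        key=lambda i: age[i],
        reverse=True,
    )
    # Fill whatever capacity remains after the DGA failures.
    return chosen + candidates[: nu - len(chosen)]
```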

4.4 Variance Reduction Policy

The Variance Reduction policy seeks to replace the same number of transformers

every year. The motivation for the Variance Reduction policy is from looking at the

replacement path of the Base policy. In the Base policy, the number of replacements

per year varies significantly, with some years at the maximum of 8 replacements

and other years as low as 1-2 replacements. The Variance Reduction policy tries to

“smooth” the number of replacements per year by replacing a constant number every

year, denoted by parameter η.

The question arises of how to choose which transformers to replace. If n(Dt) < η,

we must find a way to choose the additional η − n(Dt) transformers. We introduce

the concept of the weighted ranking, denoted as $W^i_t$. $W^i_t$ uses the chronological age,

$c^i_t$, and number of faults experienced, $N^i_{t,f}$, to assign a ranking to each transformer

i:

$$W^i_t = c^i_t + \beta N^i_{t,f}$$

where β is a tunable parameter that determines the weighting given to the number

of faults experienced.

We further denote $W_t$ as the array of all $W^i_t$ sorted in descending order. The

Variance Reduction Policy can then be formulated:

$$X^{VR}(s_t) = \begin{cases} X^{BP}(s_t), & \text{if } n(D_t) > \eta \\ D_t \cup W_t[1:(\eta - n(D_t))], & \text{otherwise} \end{cases}$$

The tunable parameters in this policy are β and η.
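The weighted ranking and the Variance Reduction rule can be sketched as below. The function names are hypothetical and locations are indexed from 0. As an assumption beyond the formulation, transformers already in Dt are excluded from the ranked list before the remaining slots are filled.

```python
def weighted_ranking(age, faults, beta):
    """W_t^i = c_t^i + beta * N_{t,f}^i for every transformer i."""
    return [c + beta * n for c, n in zip(age, faults)]


def variance_reduction_policy(dga_failed, age, faults, beta, eta, nu=8):
    """Return the replacement list for one year, targeting eta replacements."""
    if len(dga_failed) > eta:
        return list(dga_failed[:nu])  # X^BP(s_t)
    w = weighted_ranking(age, faults, beta)
    chosen = list(dga_failed)
    # Rank the remaining transformers by W, highest first.
    ranked = sorted(
        (i for i in range(len(w)) if i not in chosen),
        key=lambda i: w[i],
        reverse=True,
    )
    return chosen + ranked[: eta - len(chosen)]
```

With ages [10, 20, 30, 40], fault counts [5, 0, 1, 0] and β = 4, the rankings are [30, 20, 34, 40], so the youngest transformer outranks an older one that has seen no faults.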

4.5 Threshold Policy

The Threshold policy builds upon the intuition of the Variance Reduction policy,

by more specifically utilizing $W^i_t$. Rather than replacing a constant number of transformers

every year, the Threshold policy will replace transformer i when $W^i_t$ exceeds

a threshold value, denoted as τ.

Let $T_t$ denote the set of transformers where $W^i_t$ exceeds τ such that:

$$T_t = \{ i \in [1, N] \mid W^i_t > \tau \}$$

Then, the Threshold Policy is formulated as:

$$X^{T}(s_t) = \begin{cases} X^{BP}(s_t), & \text{if } n(D_t) > \nu \\ D_t \cup T_t[1:(\nu - n(D_t))], & \text{else if } n(D_t) + n(T_t) > \nu \\ D_t \cup T_t, & \text{else if } n(D_t) + n(T_t) \le \nu \end{cases}$$

The tunable parameters in this policy are τ and, as before in the Variance Reduction

policy, β.
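A sketch of the Threshold rule follows, taking the weighted rankings as an input list. The function name `threshold_policy` is hypothetical and locations are indexed from 0. Two details are assumptions, since the formulation does not state them: Tt is ordered with the highest ranking first, and transformers already in Dt are left out of Tt.

```python
def threshold_policy(dga_failed, w, tau, nu=8):
    """Return the replacement list for one year under the Threshold policy.

    dga_failed: indices in D_t, in temporal order
    w: weighted ranking W_t^i per transformer
    tau: ranking threshold; nu: annual replacement cap
    """
    if len(dga_failed) > nu:
        return list(dga_failed[:nu])  # X^BP(s_t)
    chosen = list(dga_failed)
    # T_t: transformers whose ranking exceeds tau (assumption: skip D_t members).
    over = [i for i, wi in enumerate(w) if wi > tau and i not in chosen]
    over.sort(key=lambda i: w[i], reverse=True)  # assumption: highest W first
    return chosen + over[: nu - len(chosen)]
```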

Both the Threshold and Variance Reduction policies rely on the concept of the

weighted ranking, $W^i_t$. The intuition for $W^i_t$ is to better utilize the information that

is available to PSE&G. Currently, PSE&G does not use the number of faults $N^i_{t,f}$ in

any way when making the replacement decision. By implementing the Threshold and

Variance Reduction policies, we will test the effectiveness of incorporating $N^i_{t,f}$ in the

decision-making process.

4.6 Lookahead Policy

4.6.1 Overview of Lookahead Policies

Lookahead Policies describe a class of policies that plan over future periods

in order to make the current decision. In other words, Lookahead policies look at

st+1, st+2, · · · , st+T in order to make the decision at time t, xt.

In contrast to the other policies tested, the Lookahead policy evaluates not just

the current state of the system, but attempts to evaluate future states as well. As

such, the Lookahead policy requires an accurate method of prediction of how st will

evolve over time. A Lookahead policy that looks ahead T time periods can be

generically formulated as:

$$X(s_t) = \arg\min_{x_t, x_{t+1}, \dots, x_{t+T}} \left( C(s_t, x_t) + \sum_{t'=t+1}^{t+T} E[C(s_{t'}, x_{t'})] \right)$$

where C(·) is a cost function that is used to evaluate the cost of being in a state.

In the case of this problem, the objective is to minimize the occurrences of years

when the number of failures exceeds ν. Thus:

$$C(s_t, x_t) = 1 \cdot (n(F_t) - \nu)^+$$

However, the cost function has to also take into account the replacement policy in

time t. We assume that in every case, transformers will fail DGA testing before

entering failure. Thus, PSE&G will replace all transformers that fail DGA testing up

to ν. Thus, the cost function is adapted to be:

$$C(s_t, x_t) = 1 \cdot (n(D_t) - \nu)^+$$

The difficult aspect of the Lookahead policy is in predicting the future. We

will use a point estimate of how many failures occur in future time periods for the

Lookahead policy. This point estimate will be based on the weighted ranking, W it .

We assume that W it increases by some amount every year, denoted by parameter δ.

Similar to the Threshold policy, a threshold parameter, denoted γ in this case, is

chosen such that W it > γ indicates DGA failure. Thus, at time t for some time t′ > t,

the belief model is:

$$E(W^i_t) = W^i_t$$

$$E(W^i_{t'}) = E(W^i_{t'-1}) + \delta$$

$$E(n(D_{t'})) = n(E(W_{t'}) > \gamma)$$

$$E(C(s_{t'}, x_{t'})) = 1 \cdot (E(n(D_{t'})) - \nu)^+$$

We implement two versions of the Lookahead policy. The first is a 1 year Lookahead

policy that at time t predicts the number of failures that will occur at time t+1.

The second is a 2 year Lookahead policy that predicts the number of failures that will

occur at times t+1, t+2. Let Lkt be the set of transformers chosen for replacement by

a k year Lookahead policy. The Lookahead policy can then be formulated as:

$$X^{L^k}(s_t) = \begin{cases} D_t, & \text{if } n(D_t) > \nu \\ D_t \cup L^k_t, & \text{otherwise} \end{cases}$$

The tunable parameters in this policy are γ and δ. The algorithms for deciding Lkt

are detailed in the sections below.

4.6.2 1 year

The 1 year Lookahead policy uses the predicted number of failures that will occur

at time t + 1 to choose xt. Algorithm 4 details the algorithm used to determine L1t .

Algorithm 4 Determine Transformer Replacement in 1 Year Lookahead Policy

    $W_t = [W^1_t, W^2_t, \cdots, W^N_t]$
    $W_{t+1} = W_t + \delta \cdot 1$
    $EN^f_{t+1} = 0$
    for $i = 1$ to $N$ do
        if $W^i_{t+1} \in W_{t+1} > \gamma$ then
            $EN^f_{t+1} = EN^f_{t+1} + 1$
        end if
    end for
    $n = \min(\nu - n(D_t),\; EN^f_{t+1} - \nu)$
    $L^1_t = W_t[1:n]$
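The steps of Algorithm 4 can be sketched in Python as follows. The function name `one_year_lookahead` is hypothetical and locations are indexed from 0. Two simplifications are assumptions rather than part of the algorithm as stated: n is clamped at 0 so that a forecast below ν yields no extra replacements, and the ranked list is not filtered for transformers already in Dt.

```python
def one_year_lookahead(w, n_dga, delta, gamma, nu=8):
    """Return the indices chosen as L_t^1.

    w: weighted ranking W_t^i per transformer
    n_dga: n(D_t), the number of DGA failures this year
    delta: assumed yearly increase in W; gamma: DGA-failure threshold on W
    """
    # Point forecast: every ranking grows by delta, then count those above gamma.
    expected_failures = sum(1 for wi in w if wi + delta > gamma)
    # Replace only the predicted excess over nu, within the remaining capacity.
    n = max(0, min(nu - n_dga, expected_failures - nu))
    ranked = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
    return ranked[:n]
```

For instance, with rankings 1 through 20, δ = 2 and γ = 10, twelve transformers are forecast to exceed the threshold next year. With ν = 8 and five DGA failures this year, the sketch selects the three highest-ranked transformers.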

4.6.3 2 year

The 2 year Lookahead policy looks at the number of failures that will occur at

t + 1, t + 2 to choose xt. In contrast to the 1 year Lookahead policy, the 2 year

Lookahead is conditioned upon the decision in time t + 1. This means that the 2

year Lookahead policy must loop through every potential path in t+ 1. Algorithm 5

details the algorithm used to determine L2t .

Algorithm 5 Determine Transformer Replacement in 2 Year Lookahead Policy

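    for $a = 0$ to $\nu - n(D_t)$ do
        if $i \in W_t[1:a]$ then
            $c^i_t = 0$
        end if
        $W_t = [W^1_t, W^2_t, \cdots, W^N_t]$
        $EW_{t+1} = W_t + \delta \cdot 1$
        $EN^f_{t+1} = 0$
        for $i = 1$ to $N$ do
            if $W^i_{t+1} \in EW_{t+1} > \gamma$ then
                $EN^f_{t+1} = EN^f_{t+1} + 1$
            end if
        end for
        $E[n(\hat{D}_{t+1})] = \min(EN^f_{t+1}, \nu)$
        for $b = 0$ to $\nu - E[n(\hat{D}_{t+1})]$ do
            $EW_{t+1}[1:b] = 0$
            $EW_{t+2} = EW_{t+1} + \delta \cdot 1$
            $EN^f_{t+2} = 0$
            for $i = 1$ to $N$ do
                if $W^i_{t+2} \in EW_{t+2} > \gamma$ then
                    $EN^f_{t+2} = EN^f_{t+2} + 1$
                end if
            end for
            $n_b = \max(EN^f_{t+2} - \nu, 0)$
        end for
        $n_a = \min_b n_b$
    end for
    $n = \arg\min_a n_a$
    $L^2_t = W_t[1:n]$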

4.7 Policies To Estimate the Value of Information

In addition to the policies outlined above, we also test 2 additional policies in

order to estimate the value of information. These 2 policies incorporate information

that is not currently observable to PSE&G, and therefore could not be realistically

implemented. The aim of these policies is to find how much our policies would

improve if given access to better information.

The 2 policies are tested under different assumptions and are summarized as

follows:

1. The Lookahead Plus policy is an extension of the 2 year Lookahead policy outlined

above. This policy will incorporate better information about the current

state of the system, rather than using the weighted ranking as an approximation.

This will estimate the value of better information about the current state

of the grid.

2. The Perfect Information policy assumes perfect information about not only the

current state of the grid but also perfect information about aging that will occur

in the future.

4.7.1 Lookahead Plus Policy

The Lookahead Plus policy assumes that PSE&G can observe how close to failure

each transformer is. In other words, at time t, the policy is assumed to be able to

observe the quantity Lit − rit. Let Pt denote the exogenous variable of the arrival of

this information. Moreover, we can test how the quality of this information impacts

the value of the information, by introducing a noise term ε such that:

$$P^i_t = L^i_t - r^i_t \cdot \mathrm{Unif}(1 - \varepsilon, 1 + \varepsilon)$$

$$P_t = [P^1_t, P^2_t, \cdots, P^N_t]$$

It is assumed that Pt is sorted in ascending order such that the first transformers are

closest to failure. We further let Lpt be the set of transformers chosen for replacement

by the Lookahead Plus policy. Algorithm 6 details the procedure for finding $L^p_t$.

The tunable parameters in this policy are ε and δp. It is important to recognize

that even though the Lookahead Plus policy incorporates better information about

the current state of the system, the δp parameter is still an estimate of how the

transformers will age in the future. Thus, there is still uncertainty about the future

of the system.

Algorithm 6 Determine Transformer Replacement in Lookahead Plus Policy

    for $a = 0$ to $\nu - n(D_t)$ do
        if $i \in P_t[1:a]$ then
            $r^i_t = 0$
        end if
        $P_t = [P^1_t, P^2_t, \cdots, P^N_t]$
        $EP_{t+1} = P_t + \delta_p \cdot 1$
        $EN^f_{t+1} = 0$
        for $i = 1$ to $N$ do
            if $P^i_{t+1} \in EP_{t+1} < 0$ then
                $EN^f_{t+1} = EN^f_{t+1} + 1$
            end if
        end for
        $E[n(\hat{D}_{t+1})] = \min(EN^f_{t+1}, \nu)$
        for $b = 0$ to $\nu - E[n(\hat{D}_{t+1})]$ do
            $EP_{t+1}[1:b] = \infty$
            $EP_{t+2} = EP_{t+1} + \delta_p \cdot 1$
            $EN^f_{t+2} = 0$
            for $i = 1$ to $N$ do
                if $P^i_{t+2} \in EP_{t+2} < 0$ then
                    $EN^f_{t+2} = EN^f_{t+2} + 1$
                end if
            end for
            $n_b = \max(EN^f_{t+2} - \nu, 0)$
        end for
        $n_a = \min_b n_b$
    end for
    $n = \arg\min_a n_a$
    $L^p_t = P_t[1:n]$

4.7.2 Perfect Information Policy

The Perfect Information policy assumes perfect information about the current

state of the system as well as perfect information about how the system will age in

the future. In order to simulate the Perfect Information policy, the amount of aging

due to faults was pregenerated for t = [70, 120].

Algorithm 7 Determine Transformer Replacement in Perfect Information Policy

    for $t = 70$ to $120$ do
        $F_t = \{ i \in [1, N] \mid r^i_t > L^i_t \}$
        if $n(F_t) > \nu$ then
            $x_t = F_t[1:\nu]$
            if $i \in x_t$ then
                $r^i_t = 0$
            end if
            $O = F_t[(\nu + 1) : n(F_t)]$
            while $n(O) > 0$ do
                $t' = t - 1$
                if $n(x_{t'}) < \nu$ then
                    $x_{t'} = [x_{t'}, O[1:(\nu - n(x_{t'}))]]$
                    for $j = t'$ to $t$ do
                        if $i \in O[1:(\nu - n(x_{t'}))]$ then
                            $r^i_j = r^i_{j-1} + A_r[i, j]$
                        end if
                    end for
                    $O[1:(\nu - n(x_{t'}))] = [\ ]$
                end if
            end while
        else
            $x_t = F_t$
            if $i \in x_t$ then
                $r^i_t = 0$
            end if
        end if
    end for

Since the Perfect Information policy assumes perfect information of the future,

let Ar denote the array that keeps track of the amount of aging each transformer

experiences over the entire time period. For example, Ar[i, t] is the amount of aging

experienced by transformer i in time t. The simulation is run without any policy over

all t once to generate Ar. Algorithm 7 details the algorithm used to implement the

Perfect Information policy.

The Perfect Information policy thus utilizes a “spillage” method to prevent failures.

The policy iterates forward from t = 70 until the number of failures exceeds ν

in a given year. When this occurs, the policy iterates backwards to check if there is

excess replacement capacity in the previous years. If there is capacity, the policy will

choose to replace transformers then. The policy continues iterating backward until all

transformer failures that exceed ν in the original year are accounted for. In this way,

this policy will make sure that there are no years in which n(Ft) > ν. Moreover, the

Perfect Information policy finds a lower bound on the opportunity cost of policies.

Policies that we implement will always have a higher opportunity cost compared to

the Base Policy, because they try to replace transformers in advance of failure. The

Perfect Information policy can be used to find how low that opportunity cost can

be while preventing excessive transformer failures, and thus it will give important

context to the performance of the other policies.

Chapter 5

Results

This chapter presents the results of the policy simulations. First, we identify

the metrics that we use to evaluate the performance of each of the policies. Then,

the results of each policy are presented and compared to each other. The simulation

of the policies involved K = 1,000 iterations of each policy. Moreover, many of the

policies have tunable parameters, which can be tuned to different values depending

on the assumed risk tolerance. We show the results of the policies using different

values for the tunable parameters.

5.1 Evaluation Metrics

5.1.1 Opportunity Cost

The opportunity cost is the cost of replacing a transformer before it experiences

failure. The earlier a transformer is replaced, the higher the opportunity cost will be.

We assume that the opportunity cost of replacing a transformer is equal to how far

from the failure threshold it is at the time of replacement. The policy is implemented

in the time interval t = [70, 120] and xt represents the replacement decision at time

t. The opportunity cost for the k-th simulation is then represented by:

$$OC_k = \sum_{t=70}^{120} \sum_{i \in x_t} \max(L^i_t - r^i_t, 0)$$

Note that we do not consider negative opportunity costs: in other words, if a transformer

has already exceeded the failure threshold at time of replacement, the opportunity

cost is assumed to be 0. Furthermore, we define the unit opportunity cost,

$OC^u_k$, as the opportunity cost incurred per replacement in the k-th simulation:

$$OC^u_k = OC_k \Big/ \sum_{t=70}^{120} n(x_t)$$

OCk is the total opportunity cost in terms of real age. In order to quantify this cost,

we translate this opportunity cost in terms of lost operation time. Let µa denote

the average annual increase in real age. Through several thousands of iterations of

the model, we find a policy-independent value of $\mu_a = 2.7$. Thus, $OC_k / \mu_a$ is the

total operation time lost in years due to early replacement of transformers in the k-th

simulation.

PSE&G gives us some insight into the cost of losing 1 year of operation. According

to PSE&G, the cost of purchasing a new transformer is $500,000. Assuming an

average transformer lifetime of 70 years, the value of each year of operation is:

$$c_o = \frac{500{,}000}{70} = \$7{,}142.86$$

Thus, the total dollar value of the opportunity cost, $OC^d_k$, of a policy in simulation k

is defined as:

$$OC^d_k = \frac{OC_k}{\mu_a} \, c_o$$
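The chain from real-age opportunity cost to dollars can be sketched as below. The function names are hypothetical, and each replacement is assumed to be supplied as a pair of failure threshold and real age at the time of replacement. The constants are the values quoted in the text.

```python
MU_A = 2.7                    # average annual increase in real age
COST_PER_YEAR = 500_000 / 70  # dollar value of one year of operation


def opportunity_cost(replacements):
    """OC_k in units of real age.

    replacements: list of (L, r) pairs, one per replaced transformer
    """
    # Transformers already past their threshold contribute zero, not a negative cost.
    return sum(max(limit - r, 0.0) for limit, r in replacements)


def unit_opportunity_cost(replacements):
    """OC_k divided by the number of replacements made."""
    return opportunity_cost(replacements) / len(replacements)


def opportunity_cost_dollars(oc_real_age):
    """Convert real-age cost to lost years of operation, then to dollars."""
    return oc_real_age / MU_A * COST_PER_YEAR
```

As a worked case, a transformer replaced 5.4 units of real age short of its threshold loses 5.4 / 2.7 = 2 years of operation, which is valued at roughly $14,285.71.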

5.1.2 Cost of Failure

Recall from section 4.1.5 that PSE&G is primarily concerned with minimizing

the risk of excessive transformer failures. We introduce three risk measures that were

determined after consultation with PSE&G:

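1. p9 is the probability that the number of failures is greater than or equal to 9.

2. p12 is the probability that the number of failures is greater than or equal to 12.

3. p15 is the probability that the number of failures is greater than or equal to 15.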

These risk measures are derived from the fact that ν = 8. We will use the term

“spike” to mean a given year where the number of failures exceeds ν. 9 failures in

a year indicate that the number of failures exceeds the number that PSE&G could

replace. 12 and 15 failures indicate thresholds where PSE&G would incur increasing

costs due to failure.

These probabilities are computed empirically over the K iterations of each policy.

In this case, the term “failures” is defined loosely to mean the sum of the number of

transformers that fail DGA testing and the number of transformers that experience

failure, Dt + Ft. The probabilities are computed by determining the number of

iterations where the maximum number of failures exceeded the threshold:

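$$p_9 = \frac{\sum_{k=1}^{K} 1_{\{\max_{t \in [70:120]} (D_t + F_t) \ge 9\}}}{K}$$

$$p_{12} = \frac{\sum_{k=1}^{K} 1_{\{\max_{t \in [70:120]} (D_t + F_t) \ge 12\}}}{K}$$

$$p_{15} = \frac{\sum_{k=1}^{K} 1_{\{\max_{t \in [70:120]} (D_t + F_t) \ge 15\}}}{K}$$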

In order to translate these risk measures into quantifiable costs, we must assign a

dollar value to transformer failure. We turn to the existing literature on transformer

failure to estimate this value. Bartley (2003) examines the cost of transformer failures

from the perspective of insurance claims. The paper examines insurance claims

on transformer failures between 1997 and 2001. Specifically, for utility substation

transformers, the average cost per transformer claim, µf , is $520,974, which includes

the cost of property damage and business interruption (Bartley, 2003, pg. 3).

The value from Bartley (2003) is used to derive the cost of failure. We assume

that the cost of failure for a policy is the weighted sum of the risk measures and the

cost of the relevant number of failures. Moreover, since multiple transformer failures

in the same year will incur higher costs per transformer failure, we assume a scaling

factor that multiplies the cost per failure by 1.5 and 2 for years where there are 12

and 15 failures respectively. Thus, the cost of failure is defined as:

$$FC^d = p_9 (9\mu_f) + 1.5\, p_{12} (12\mu_f) + 2\, p_{15} (15\mu_f)$$

This formulation assumes that failure costs are only incurred when the number of

transformer failures in a given year exceeds 8, because otherwise PSE&G would be

able to replace all failing transformers.
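The risk measures and the failure cost can be computed with a short sketch like the one below. The function names are hypothetical, and the input to `spike_probability` is assumed to be the per-iteration maximum of yearly failures over t = [70, 120].

```python
MU_F = 520_974  # average cost per transformer claim (Bartley, 2003)


def spike_probability(max_failures_per_run, threshold):
    """Fraction of iterations whose worst year had at least `threshold` failures."""
    hits = sum(1 for m in max_failures_per_run if m >= threshold)
    return hits / len(max_failures_per_run)


def failure_cost(p9, p12, p15, mu_f=MU_F):
    """FC^d: risk-weighted cost of 9, 12 and 15 failures, with 1.5x and 2x scaling."""
    return p9 * 9 * mu_f + 1.5 * p12 * 12 * mu_f + 2 * p15 * 15 * mu_f


# Risk values reported for the Base policy in Table 5.1.
base_fc = failure_cost(0.994, 0.382, 0.042)
```

With the Base policy risk values, the three weighted terms sum to 17.082 claims, giving a failure cost of about $8.90 million.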

5.1.3 Empirical Objective Function

Recall from section 4.1.5 that the objective function is defined as:

$$\min_\pi \left[ \sum_{t=70}^{120} C^o(s_t, X^\pi(s_t)) + E \sum_{t=70}^{120} C^f(s_t, X^\pi(s_t)) \right]$$

where the first term represents the opportunity cost and the second term represents

the cost of failure. From the formulation of the cost of failure in the previous section,

quantifying the failure cost is done after K iterations of the simulation. $OC^d_k$

represents the opportunity cost of the k-th simulation. We introduce the average

opportunity cost over K simulations, $\overline{OC}^d$, such that:

$$\overline{OC}^d = \frac{1}{K} \sum_{k=1}^{K} OC^d_k$$

And thus, we define the empirical objective function to be:

$$\min_\pi \left( \overline{OC}^d + FC^d \right)$$

5.1.4 Failure Risk vs. Opportunity Cost per Replacement

We introduce another way to present our results through visualizing the trade-off

between failure risk and opportunity cost. This trade-off is presented graphically

where:

1. The average unit opportunity cost in terms of years, $\overline{OC}^u / \mu_a$, where $\overline{OC}^u$ is defined

as the average unit opportunity cost over K simulations, is on the x-axis.

2. One of the risk measures, p9, p12, p15, which shows the probability of a spike

greater than the respective threshold occurring, is on the y-axis.

This graphical representation has some key advantages over simply looking at the

objective function:

1. A very simple representation of the results, from the perspective of the policy

maker. The utility can decide its own risk tolerance and pick a policy that

reduces failure risk to within that tolerance. For example, a utility might decide

that it is willing to incur extra opportunity cost in order to reduce risk further.

Thus, this allows the results of our simulations to be more extendable to different

policy makers.

2. The results are presented without quantified dollar values assigned to them.

This makes the results independent of the assumptions about costs and allows

different utilities to assign their own costs appropriate for their grid.

From this, we can find the efficient frontier. Policies that lie on the efficient

frontier have the lowest unit opportunity cost for a given level of failure risk.
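One way to extract the efficient frontier from a set of simulated policy configurations is a simple dominance check, sketched below. The function name `efficient_frontier` and the input layout are hypothetical: each configuration is a tuple of unit opportunity cost, failure risk, and a label. A configuration is kept when no other configuration is at least as good on both axes and strictly better on one.

```python
def efficient_frontier(points):
    """Return the labels of non-dominated (cost, risk) points.

    points: list of (unit_opportunity_cost, failure_risk, label)
    """
    frontier = []
    for cost, risk, label in points:
        dominated = any(
            c <= cost and r <= risk and (c < cost or r < risk)
            for c, r, _ in points
        )
        if not dominated:
            frontier.append(label)
    return frontier
```

The check is quadratic in the number of configurations, which is adequate for the handful of parameter settings compared per policy.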

5.2 Comparing Policy Performance

First, we evaluate the policy performance in terms of the objective function (or

cost). Figure 5.1 shows the values of the objective function for the simulation results

of the different policies. Each policy is tested with various values for the tunable

parameters. We have chosen in Figure 5.1 to only show the results using the best sets

of tunable parameters for each policy; we will look at the effect of tuning parameters

in detail later on in this section. Even so, we see that tuning the parameters has

a large effect on the objective function in almost all the policies.

Figure 5.1: Values of the objective function of policies with different values for the tunable parameters. The orange line represents the minimum value across all policies and the labels above each policy represent the minimum value within each policy.

The Threshold policy performed the best, with the minimum cost across all policies. The 1-year

and 2-year Lookahead policies incurred costs that were not much higher than the

Threshold and had the next best performance. Neither the Pure Aging nor the Variance

Reduction policies were able to outperform the Base policy in terms of cost. This

does not mean that these 2 policies were ineffective at reducing the failure risk, but

rather that the trade-off in opportunity costs was too high. Next, we look at how the

different policies perform in terms of this trade-off.

Figure 5.2: The best and worst values of the objective function within each policy, with costs broken down by opportunity cost and failure cost

Figure 5.2 shows the highest and lowest costs within each policy and compares

the cost breakdown in terms of cost of failure, FCd, and opportunity cost, OCd.

We see that in the Variance Reduction, Threshold, and both Lookahead policies, the

highest cost occurs when the cost is almost entirely opportunity cost. This suggests

that these policies are able to fully eliminate failure risk but at an extremely high

opportunity cost. In terms of the lowest costs within each policy, the best performing

policies (Threshold and Lookaheads) have a more balanced breakdown between cost

of failure and opportunity cost. This shows that these policies are able to reduce

failure risk such that the trade-off in opportunity cost is worth it. Comparing this

to the lowest cost Pure Aging and Variance Reduction policies, we see that these 2

policies have a much higher proportion of costs due to cost of failure. In fact, the

lowest Pure Aging and Variance Reduction costs do not have much lower costs of

failure than the Base policy. The preliminary conclusion from this is that these 2

policies are not as effective as the Threshold and Lookahead policies at identifying

which transformers to replace, resulting in opportunity costs that outweigh reduced

failure risk. We can look at the relationship between the number of replacements

made in t = [70, 120] and the unit opportunity cost to get a better understanding of

the opportunity cost.

Figure 5.3: The relationship between the number of replacements made in the time interval t = [70, 120] and the opportunity cost per replacement compared across policies

From Figure 5.3, we see that the number of replacements chosen by a policy

has a direct relationship with the unit opportunity cost. This is expected because

as the total number of replacements increases, the number of replacements per year

increases, which results in more transformers being replaced before failure. The slope

of this graph indicates the rate at which the opportunity cost increases as the number

of replacements increases. Since we want to minimize the opportunity cost, a

smaller slope indicates better performance because it shows that a policy is choosing

additional replacements while incurring lower opportunity costs. The Threshold and

Lookahead policies show a very similar shallow slope. In contrast, the Pure Aging

policy has the highest slope, showing that it is choosing to replace a small number of

additional transformers at an extremely high opportunity cost.

As a reference point, the Base policy chooses to replace 279 transformers on

average, while the total number of transformer failures is 290 on average. While

the Base policy on average only fails to replace 11 transformers, these additional

failures all occur in just a couple of years, which results in the spikes we are worried

about. However, given that the total number of transformer failures was 290, we

see from Figure 5.3 that many of the policies over-replace transformers. For many

of the policies, even when choosing to replace far more transformers than 290, the

failure risk is still significant. This suggests that the policies are not very good at

determining which transformers are about to fail, and in order to reduce failure risk,

they must cast a wider net and incur significant opportunity costs.

In this section, we have presented a high-level analysis of the policies by

comparing their performance using the objective function. Next, we will

look at the performance of each policy in detail.

5.3 Policy Specific Results

5.3.1 Base Policy

The Base policy represents PSE&G’s current policy of replacing transformers

when they fail DGA testing. Table 5.1 shows the results for the Base policy, where

the columns represent the unit opportunity cost in years, the 3 risk measures, and

the value of the objective function respectively.

OCu/µa p9 p12 p15 Obj.

0.215 0.994 0.382 0.042 9328.28

Table 5.1: Results for Base policy
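To illustrate how risk measures such as p9, p12 and p15 could be estimated from simulation output, the sketch below counts the fraction of sample paths that contain at least one year with k or more failures. This per-path reading of the measure is an assumption made for the example, and the function name and the failure counts are hypothetical.

```python
def spike_probability(sample_paths, k):
    """Fraction of sample paths with at least one year of >= k failures."""
    hits = sum(1 for path in sample_paths if max(path) >= k)
    return hits / len(sample_paths)

# Hypothetical yearly failure counts for 4 short sample paths (not thesis data).
paths = [
    [3, 5, 9, 4],
    [2, 6, 7, 8],
    [4, 12, 6, 5],
    [5, 5, 15, 3],
]

p9 = spike_probability(paths, 9)
p12 = spike_probability(paths, 12)
p15 = spike_probability(paths, 15)
```

By construction the estimates are nested: a path that reaches 12 failures in a year also counts toward p9, so p9 can never be smaller than p12.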

Since DGA testing indicates when a transformer is close to failure, we see that

the opportunity cost of the Base policy is very low. Since all the other policies choose

replacements in addition to the ones that fail DGA testing, the opportunity cost of

the Base policy is thus a lower bound on the opportunity cost. In terms of failure risk,

the Base policy confirms the fears that PSE&G has regarding spikes in the number

of failures per year. p9 is close to 1, indicating an almost 100% probability that there

will be spikes greater than or equal to 9 in a year. Figure 5.4 shows a sample path

of the Base policy. We see that when the failures do occur, the policy chooses to

replace 8 transformers, which is the maximum number of transformers that can be

replaced in a given year. This indicates that the DGA testing is able to identify the

transformers that are about to fail, but there is not enough capacity to replace them

all. This means that in order to prevent these failures, some transformers will need to

be replaced before the year they fail, indicating that the opportunity cost of policies

that seek to reduce failure risk will always be higher than the opportunity cost of the

Base policy.
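The capacity argument above can be sketched in a few lines. The limit of 8 replacements per year comes from the text; the function name and the yearly counts of transformers flagged by DGA testing are hypothetical.

```python
MAX_REPLACEMENTS_PER_YEAR = 8  # capacity stated in the text

def base_policy_year(flagged_by_dga, capacity=MAX_REPLACEMENTS_PER_YEAR):
    """Return (replaced, unmet) for one year.

    `unmet` counts transformers flagged by DGA testing that could not be
    replaced because the yearly capacity was exhausted.
    """
    replaced = min(flagged_by_dga, capacity)
    unmet = flagged_by_dga - replaced
    return replaced, unmet

flagged_per_year = [5, 8, 11, 6]  # hypothetical counts, not thesis data
outcomes = [base_policy_year(n) for n in flagged_per_year]
```

In this toy run only the third year exceeds capacity, which is the kind of year in which the Base policy cannot keep up even though DGA testing has identified the transformers at risk.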

Figure 5.4: A sample path of the Base policy over t = [70, 120]. The blue line represents the number of replacements and the red line represents the number of failures.

5.3.2 Pure Aging Policy

α     OCu/µa   p9      p12     p15     Obj.
110   0.526    0.996   0.370   0.037   9802.06
105   0.758    0.983   0.346   0.045   10132.14
100   1.016    0.987   0.305   0.032   10102.58
95    1.271    0.975   0.281   0.029   10309.26
90    1.588    0.952   0.261   0.031   10714.34
85    2.011    0.913   0.190   0.013   10487.27
80    2.750    0.807   0.103   0.011   10728.42
75    4.089    0.559   0.024   0.000   11556.87
70    6.079    0.368   0.015   0.000   14954.43

Table 5.2: Results for Pure Aging policy

The Pure Aging policy chooses additional transformers to replace based purely

on chronological age. Recall from section 4.3 that α is a tunable parameter that

indicates a chronological age threshold for replacement. Table 5.2 shows the results

of the Pure Aging policy. We see that as α decreases, the failure risk decreases across

all 3 risk measures since transformers get replaced at younger and younger ages.

However, as failure risk decreases, the opportunity cost increases. Figure 5.5 gives

a graphical representation of the trade-off between failure risk and opportunity cost.
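As a rough illustration of an age-threshold rule of this kind, the sketch below replaces DGA-flagged transformers first and then adds transformers whose chronological age is at least α, up to the yearly capacity of 8. The priority given to flagged units, the oldest-first ordering, the function name and the ages are all assumptions of this example rather than details taken from section 4.3.

```python
def pure_aging_choices(ages, dga_flagged, alpha, capacity=8):
    """Indices to replace this year under an age threshold alpha."""
    chosen = list(dga_flagged)[:capacity]
    already = set(chosen)
    # Assumed tie-breaking: among units at or above alpha, take the oldest first.
    old_enough = sorted(
        (i for i, age in enumerate(ages) if age >= alpha and i not in already),
        key=lambda i: ages[i],
        reverse=True,
    )
    return chosen + old_enough[: capacity - len(chosen)]

ages = [60, 92, 88, 101, 75]  # hypothetical chronological ages
picks_90 = pure_aging_choices(ages, dga_flagged=[0], alpha=90)
picks_110 = pure_aging_choices(ages, dga_flagged=[0], alpha=110)
```

With α = 110 no unit is old enough, so the rule falls back to replacing only the flagged transformer, which mirrors how a high α leaves the policy close to the Base policy.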

Figure 5.5: Performance of the Pure Aging policy across the 3 different risk measures with different values for α. α decreases from left to right.

We see that α can be tuned depending on the risk tolerance of the utility. A

more risk-averse utility would prefer a lower value for α that decreases failure risk

despite incurring higher opportunity costs. We note that the Pure Aging policy is

not able to reduce the failure risk completely: even at low α values, there is still

significant risk of 9 or more failures. Figure 5.6 shows the sample paths of the Pure

Aging policy with different values of α.

The sample paths in Figure 5.6 each show a sustained period where the policy

chooses to replace the maximum number of transformers. As α increases, the time at

which the sustained period begins is later. This suggests that the sustained period of

maximum replacements occurs when a significant portion of the transformer population

has a chronological age close to α. In Figures 5.6a and 5.6b, we clearly see that

the replacement path becomes similar to the Base policy shown in Figure 5.4 after

the sustained period ends. This is when all of the transformers in the network are

younger than α and the Pure Aging policy reverts to the Base policy. The presence of

(a) α = 75 (b) α = 85

(c) α = 95 (d) α = 105

Figure 5.6: Sample paths of the Pure Aging policy with different values for α

a sustained period likely indicates that the α value is too low since the transformers

are consistently reaching α years of age and not failing. In contrast, Figure 5.6d shows

no sustained period, but α = 105 does not significantly reduce the failure risk. From

this, we can conclude that chronological age is not a good predictor of transformer

failure.

5.3.3 Variance Reduction Policy

The Variance Reduction policy is the first policy we test that utilizes the weighted

ranking to choose transformers to replace and seeks to replace a constant number of

transformers over time. The weighted ranking adds information by utilizing both the

chronological age of the transformer as well as the number of jolts it has experienced.

The tunable parameters in this policy are β and η. β is a tunable parameter for the

calculation of the weighted ranking, $W_t^i$. After extensive testing, β was chosen to be

0.2. This value of β is used for the calculation of $W_t^i$ for all other policies that utilize

it. η represents the constant number of replacements per year. Table 5.3 shows the

results of the Variance Reduction policy.

η     OCu/µa   p9      p12     p15     Obj.
4     0.839    0.988   0.284   0.021   9330.07
5     1.676    0.892   0.151   0.017   9348.77
6     3.776    0.251   0.014   0.001   9580.87
7     7.460    0.003   0.000   0.000   18667.92

Table 5.3: Results for Variance Reduction Policy

Similar to α in the Pure Aging policy, η can be tuned depending on the risk tolerance

of the utility. As η increases, the failure risk across the 3 risk measures decreases.

For η = 4, the policy does not reduce p9 by very much but does reduce the other 2 risk

measures significantly over the Base policy. Overall, we see that for η = 4, 5, 6, the

policy does a decent job of reducing risk while not incurring high opportunity costs.

In these cases, the objective function values are higher than that of the Base policy, but not

by much. Figure 5.7 gives a graphical representation of this relationship.

We see that between η = 5 and η = 6, the policy is able to dramatically decrease

the failure risk. The average number of failures per year across the time horizon is

equal to slightly less than 6, so this decrease in failure risk makes sense. Unlike the

Pure Aging policy, the Variance Reduction policy is able to reduce the failure risk

entirely when η = 7. However, we see that the opportunity cost increases significantly

between η = 6 and η = 7, suggesting that the trade-off in opportunity cost may not

be worth it. Figure 5.8 shows sample paths for different values of η. We see that

for η = 5, 6, the number of failures is reduced significantly compared to when η = 4.

Figure 5.7: Performance of the Variance Reduction policy across the 3 different risk measures with β = 0.2 and different values for η. η increases from left to right.

Comparing the graphs for η = 6 and η = 7, we see a dramatic difference. When η = 6,

there are still a few years when the number of replacements exceeds 6. However,

when η = 7, the policy never replaces more than 7 transformers. Since the policy

only chooses to replace more than η transformers when DGA testing indicates more

than η failures, we see that η = 7 is dramatically over-replacing transformers.
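The replacement rule described above, where the policy replaces η transformers per year and exceeds η only when DGA testing flags more than η units, can be sketched as follows. The weighted-ranking scores are passed in precomputed, since the formula from section 4.4 is not reproduced here; the function name, the scores and the priority given to flagged units are assumptions of this example.

```python
def variance_reduction_choices(scores, dga_flagged, eta, capacity=8):
    """Replace DGA-flagged units, then top up to eta by highest score."""
    chosen = list(dga_flagged)[:capacity]
    # Aim for eta replacements, more only if DGA flags more, never above capacity.
    target = min(max(eta, len(chosen)), capacity)
    already = set(chosen)
    ranked = sorted(
        (i for i in range(len(scores)) if i not in already),
        key=lambda i: scores[i],
        reverse=True,
    )
    return chosen + ranked[: target - len(chosen)]

scores = [101.0, 130.5, 95.2, 122.0, 110.3, 118.7]  # hypothetical weighted rankings
quiet_year = variance_reduction_choices(scores, dga_flagged=[2], eta=4)
busy_year = variance_reduction_choices(scores, dga_flagged=[0, 1, 2, 3, 4], eta=4)
```

In the quiet year the policy tops up to exactly η = 4 replacements, while in the busy year the five flagged units push the count above η, which is the only situation in which that happens in this sketch.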

The Variance Reduction policy clearly performs better than

the Pure Aging policy. This suggests that the weighted ranking serves as a better

predictor of failure than the chronological age. However, the Variance Reduction

policy suffers from the fact that η must be a discrete value that realistically must lie

between 4 and 7, meaning that the policy loses some flexibility. This is shown by the

huge difference in results between η = 6 and η = 7. We next look at the Threshold

policy, which will also use the weighted ranking but will allow for the policy to be

more flexible in choosing when to replace transformers.

(a) η = 4 (b) η = 5

(c) η = 6 (d) η = 7

Figure 5.8: Sample paths of the Variance Reduction policy with different values for η

5.3.4 Threshold Policy

Similar to the Variance Reduction policy, the Threshold policy uses the weighted

ranking to choose replacements. The tunable parameters in this policy are β and τ.

As before, β is chosen to be 0.2. τ is a parameter that represents the nominal

“threshold” in terms of $W_t^i$, past which a transformer should be replaced. As with

the other policies, τ can be tuned depending on the utility’s risk tolerance. Table 5.4

shows the results from the Threshold policy.

As τ decreases, the failure risk decreases. Unsurprisingly, as failure risk is reduced,

the opportunity cost increases. However, in contrast to the other policies, the

τ     OCu/µa   p9      p12     p15     Obj.
150   0.218    0.994   0.390   0.046   9472.02
145   0.227    0.991   0.383   0.043   9364.74
140   0.268    0.997   0.348   0.036   9039.88
135   0.392    0.983   0.291   0.028   8570.88
130   0.696    0.951   0.212   0.018   8168.57
125   1.305    0.828   0.139   0.011   8141.12
120   2.313    0.529   0.039   0.004   8006.10
115   3.786    0.171   0.006   0.000   9507.75
110   5.710    0.024   0.000   0.000   13673.72
105   7.952    0.002   0.000   0.000   19836.75
100   9.963    0.002   0.000   0.000   26141.16

Table 5.4: Results for Threshold Policy

objective function actually starts to decrease as the failure risk is reduced up to a

point. The minimum value of the objective function is at τ = 120, which moderately

reduces failure risk while not incurring significant opportunity cost. This trend is

significant because the Pure Aging and Variance Reduction policies, while successful at reducing the failure

risk, always increased the objective function over the Base policy. This suggests that

the Threshold policy is more efficient at reducing failure risk compared to the policies

we have tested. Figure 5.9 gives a graphical representation of this relationship.
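The tuning step described above amounts to a scan over the tabulated settings. The short sketch below uses the (τ, objective) pairs from Table 5.4 and the Base policy objective from Table 5.1; the variable names are ours.

```python
# (tau, objective) pairs transcribed from Table 5.4.
threshold_results = [
    (150, 9472.02), (145, 9364.74), (140, 9039.88), (135, 8570.88),
    (130, 8168.57), (125, 8141.12), (120, 8006.10), (115, 9507.75),
    (110, 13673.72), (105, 19836.75), (100, 26141.16),
]
BASE_OBJECTIVE = 9328.28  # Table 5.1

# Setting with the lowest objective value.
best_tau, best_obj = min(threshold_results, key=lambda row: row[1])

# Settings whose objective value is below that of the Base policy.
improving = [tau for tau, obj in threshold_results if obj < BASE_OBJECTIVE]
```

The scan returns τ = 120 as the minimizer and shows that the settings from τ = 140 down to τ = 120 all fall below the Base policy value.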

Compared to the Variance Reduction policy, Figure 5.9 shows a more gradual

decrease in failure risk. This is expected since τ is a continuous parameter controlling

risk tolerance whereas η is discrete. This also means that the Threshold policy is able

to eliminate risk at a lower opportunity cost, compared to the Variance Reduction

policy. Moreover, at similar risk levels, the Threshold policy has a lower opportunity

cost compared to the Variance Reduction policy. This suggests that the Threshold

policy is overall more efficient than the Variance Reduction policy.

Figure 5.10 shows sample paths for the Threshold policy with varying values for τ.

We see that in Figure 5.10d where τ = 110, the policy suffers from a similar problem as

the low-α Pure Aging policies in terms of a sustained period of maximum replacement.

Figure 5.9: Performance of the Threshold policy across the 3 different risk measures with β = 0.2 and different values for τ. τ decreases from left to right.

In this case, the result tells a similar story: the presence of a sustained period of

maximum replacement indicates that the τ is too low since many transformers are

reaching the threshold before failing. We also see that there are smaller sustained

periods of maximum replacement in Figures 5.10a, 5.10b, and 5.10c. However, these

periods are much shorter and occur around when transformer failures are occurring.

This indicates that spikes are occurring around when many transformers in the system

are close to the threshold. This suggests that given an appropriate τ , the Threshold

policy is better at identifying which transformers are about to fail than the other

policies we have tested. In contrast to the performance in the Pure Aging policy, we

see that this corroborates our hypothesis that the weighted ranking is a more useful

predictor of transformer failure than chronological age. Nevertheless, although the

sustained periods are shorter and occur at more appropriate times, their occurrences

show that the Threshold policy still over-replaces transformers.

(a) τ = 140 (b) τ = 130

(c) τ = 120 (d) τ = 110

Figure 5.10: Sample paths of the Threshold policy with different values for τ

5.3.5 Lookahead Policy

The Lookahead policy that is implemented is based on the same methodology

as the Threshold policy. In this way, the Lookahead policy is more of an extension

of the Threshold policy, and the Threshold policy can be thought of as a “0-year

Lookahead”. Tables 5.5a and 5.5b show the results of the 1 year and 2 year Lookahead

Policies. The tunable parameters in this policy are γ and δ. We assume that $W_t^i$

increases by δ each year, meaning that δ captures our expectation of the average

increase in $W_t^i$ in a given year. After testing, it was determined that δ1 = 1 for

the 1-year Lookahead and δ2 = 0.5 for the 2-year Lookahead resulted in the best

performance. It is interesting that the two versions of the Lookahead policy have

γ     OCu/µa   p9      p12     p15     Obj.
140   0.215    0.997   0.35    0.031   8871.45
135   0.217    0.994   0.363   0.034   9028.82
130   0.288    0.992   0.377   0.042   9421.95
125   0.707    0.954   0.239   0.030   8644.64
120   1.631    0.764   0.099   0.010   8169.88
115   3.001    0.388   0.03    0.002   8804.46
110   4.845    0.065   0.001   0.000   11540.28
105   7.230    0.008   0.000   0.000   17654.57
100   9.613    0.001   0.000   0.000   24861.14

(a) 1 Year Lookahead

γ     OCu/µa   p9      p12     p15     Obj.
140   0.215    0.99    0.368   0.040   9146.86
135   0.215    0.997   0.370   0.037   9150.58
130   0.216    0.991   0.380   0.036   9205.02
125   0.310    0.989   0.324   0.039   8910.25
120   0.930    0.926   0.219   0.026   8735.08
115   2.090    0.637   0.077   0.001   8249.48
110   3.809    0.182   0.007   0.000   9495.73
105   6.046    0.019   0.001   0.000   14360.82
100   8.710    0.001   0.000   0.000   21891.05

(b) 2 Year Lookahead

Table 5.5: Results for Lookahead Policy

different best-fit values for δ. However, since the 2-year Lookahead is projecting

further into the future, it does make sense that it would have a more conservative

and lower value for the expected amount of aging.

γ is the Lookahead analog for τ in the Threshold policy in that it determines the

threshold past which a transformer should be replaced. The difference is that, in the

Lookahead, γ is used to predict transformer failures in future periods rather than the

current period. Similar to τ , γ can be tuned depending on the risk tolerance of the

utility.
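One way to picture the roles of γ and δ is a linear projection of the weighted ranking. The sketch below is an illustration under stated assumptions, not the thesis implementation: it assumes the ranking grows by exactly δ per year and that a transformer becomes a replacement candidate once its projected ranking reaches γ. The function name and the scores are hypothetical; the δ values are the tuned ones reported above.

```python
def lookahead_candidates(scores, gamma, delta, years_ahead):
    """Indices whose projected weighted ranking reaches gamma.

    Assumption for this sketch: the ranking grows linearly, by delta per
    year, and a unit is a candidate once its projection is >= gamma.
    """
    return [
        i for i, w in enumerate(scores)
        if w + delta * years_ahead >= gamma
    ]

scores = [112.0, 119.2, 118.5, 121.0]  # hypothetical weighted rankings
gamma = 120.0

# years_ahead=0 reduces to a plain threshold check on the current ranking.
now = lookahead_candidates(scores, gamma, delta=0.0, years_ahead=0)
one_year = lookahead_candidates(scores, gamma, delta=1.0, years_ahead=1)
two_year = lookahead_candidates(scores, gamma, delta=0.5, years_ahead=2)
```

In this sketch the two tuned settings add the same total amount to the ranking (1 × 1 and 2 × 0.5), so they select the same units. Whether the actual policies behave identically depends on details of section 4.6 that are not reproduced here.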

The results show that the Lookahead policies perform quite similarly to each

other. It is interesting that the 2 year Lookahead Policy does not perform significantly

better than the 1 year Lookahead despite projecting further into the future.

Moreover, both 1 and 2 year Lookahead perform very similarly to the Threshold

policy. Figure 5.11 gives a graphical representation of the relationship between opportunity

cost and failure risk for the Lookahead policies.

(a) 1 Year (b) 2 Year

Figure 5.11: Performance of the Lookahead policies across the 3 different risk measures with β = 0.2, δ1 = 1, δ2 = 0.5 and different values for γ. γ decreases from left to right.

We see that Figures 5.11a and 5.11b look remarkably similar across all 3 of the risk

measures. Overall, the Lookahead policies surprisingly do not perform significantly

better than the Threshold policy. We will analyze why this may be the case later on

in this chapter.

5.4 Comparison Across Different Risk Measures

In this section, we compare the results of the different policies across the 3 risk

measures. Although the Threshold policy performed the best in terms of the objective

function (see Figure 5.1, pg. 62), we want to further investigate which policy performs

the best across different risk tolerances. In this way, we can find the efficient frontier

that defines which policy incurs the lowest opportunity cost for a given amount of

risk that a utility is willing to take on. Graphically, a more efficient policy will have

a steeper slope. Moreover, we analyze the 3 different risk measures separately to

explore if our results vary depending on risk measure.
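A pointwise version of the efficient frontier can be computed by discarding every setting that is dominated in both opportunity cost and risk. The sketch below applies this to a subset of the (OCu/µa, p9) pairs from Tables 5.3 and 5.4; it only considers tabulated settings and does not interpolate between them, and the function name and labels are ours.

```python
def efficient_frontier(points):
    """Keep settings not dominated in (opportunity cost, risk).

    A setting is dominated if another one has cost <= and risk <=,
    with at least one of the two strictly smaller.
    """
    frontier = []
    for name, cost, risk in points:
        dominated = any(
            c <= cost and r <= risk and (c < cost or r < risk)
            for _, c, r in points
        )
        if not dominated:
            frontier.append((name, cost, risk))
    return sorted(frontier, key=lambda p: p[1])

# (label, OCu/mu_a, p9) transcribed from Tables 5.3 and 5.4.
points = [
    ("VR eta=4", 0.839, 0.988), ("VR eta=5", 1.676, 0.892),
    ("VR eta=6", 3.776, 0.251), ("VR eta=7", 7.460, 0.003),
    ("Thr tau=130", 0.696, 0.951), ("Thr tau=125", 1.305, 0.828),
    ("Thr tau=120", 2.313, 0.529), ("Thr tau=115", 3.786, 0.171),
    ("Thr tau=110", 5.710, 0.024), ("Thr tau=105", 7.952, 0.002),
]

frontier_names = [name for name, _, _ in efficient_frontier(points)]
```

On this subset the Variance Reduction settings η = 4 and η = 5 are dominated by Threshold settings, while η = 6 and η = 7 remain on the pointwise frontier.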

Figure 5.12: Performance of different policies in terms of reducing p9 with varying tunable parameters for each policy

Figure 5.12 shows the performance of the different policies on the p9 risk measure.

We see that the performance of the Threshold and Lookahead policies is extremely

similar. These policies perform virtually identically on the interval of p9 = [0.7, 1.0].

However, in the interval p9 = [0.1, 0.7], the Threshold policy manages to outperform

the Lookahead policies, consistently incurring slightly lower opportunity costs for the

same level of risk. In the final interval where p9 = [0, 0.1], again we see that the

Threshold and Lookahead policies perform very similarly. Both of these policies are

consistently more efficient than the Variance Reduction and the Pure Aging policies.

We see that the Pure Aging policy is by far the least efficient at reducing risk. Overall,

the Threshold policy proves to be the most efficient policy across all risk levels.

Figure 5.13: Performance of different policies in terms of reducing p12 with varying tunable parameters for each policy

Figure 5.13 shows the performance of the different policies on the p12 risk measure.

We see that the relative performance on p12 largely tells the same story as p9.

The Threshold policy still consistently outperforms all of the other policies (except for

around when p12 = 0.1 where the 1 year Lookahead policy performs slightly better).

Overall, we see that although the relative performance of the policies remains largely

unchanged, the difference in slopes is smaller. The Pure Aging policy’s performance

on p12 is much closer to that of the other policies than it was on p9. This is because

p12 measures a more extreme event than p9, and the probability of reaching 12 failures in a year is lower.

Thus, we see that the worse performing policies are relatively more efficient compared

to the better performing policies at higher risk measures.

Figure 5.14 shows the performance of the different policies on the p15 risk measure.

We see some interesting behavior from the Lookahead and Pure Aging policies,

where the risk measure actually increases before decreasing. We can attribute this

to the fact that 15 failures is so rarely reached that the policies are unable to consistently

reduce the probability at high risk tolerances. Nevertheless, we still see that

the Threshold policy is consistently the most efficient policy across all risk levels.

Moreover, p15 is extremely low and should not be a significant concern for

the utility.

Figure 5.14: Performance of different policies in terms of reducing p15 with varying tunable parameters for each policy

5.4.1 Case Study: Zero Risk Tolerance

We explore a specific case where we assume that the utility is extremely risk

averse and has zero risk tolerance. In this case, we consider the opportunity cost

incurred across the different policies when the failure risk is reduced to 0. Note

that for p12 and p15, we consider when the policy actually reaches 0. However, since

none of the policies are able to completely reduce p9 to 0, we consider p9 completely

reduced when the risk measure is less than 0.005. Figure 5.15 shows an extrapolated

version of the graphs shown in the previous section to highlight when the risk measure

approaches 0.

(a) p9

(b) p12

(c) p15

Figure 5.15: Comparison of different policies across risk measures in the scenario of zero risk tolerance where failure risk is completely minimized

We see that in Figure 5.15a, the Variance Reduction policy actually reduces p9

below 0.005 at a lower opportunity cost than the other policies. In Figure 5.15b, the

Threshold policy reduces p12 to 0 at the lowest opportunity cost, while the Variance

Reduction policy performs worse than the Threshold and Lookahead policies. Note

that the Pure Aging policy is not shown in Figure 5.15a or 5.15b because the policy

does not actually reduce the risk measure enough to approach 0. In Figure 5.15c, we

see that again the Threshold Policy reduces p15 to 0 at the lowest opportunity cost.

However, we also see that the Pure Aging policy manages to reach 0 in this case and

at a lower opportunity cost than the Lookahead policies. The Variance Reduction

policy does not reduce p15 to 0 until incurring a unit opportunity cost of 7.5 years

and is not shown in Figure 5.15c.
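The p9 part of this case study can be cross-checked against the tabulated settings. The sketch below lists, for each policy, the lowest-risk rows of Tables 5.2 to 5.5 and looks up the cheapest setting with p9 below the 0.005 cutoff. It uses only tabulated settings, so it is coarser than the curves in Figure 5.15; the function and variable names are ours.

```python
P9_CUTOFF = 0.005  # "completely reduced" cutoff used in the text

# (OCu/mu_a, p9) for the lowest-risk settings in Tables 5.2 to 5.5.
results = {
    "Pure Aging": [(4.089, 0.559), (6.079, 0.368)],
    "Variance Reduction": [(3.776, 0.251), (7.460, 0.003)],
    "Threshold": [(5.710, 0.024), (7.952, 0.002), (9.963, 0.002)],
    "1 Year Lookahead": [(7.230, 0.008), (9.613, 0.001)],
    "2 Year Lookahead": [(6.046, 0.019), (8.710, 0.001)],
}

def cheapest_below(settings, cutoff):
    """Lowest opportunity cost among settings with risk below cutoff, or None."""
    costs = [cost for cost, risk in settings if risk < cutoff]
    return min(costs) if costs else None

zero_risk_cost = {
    name: cheapest_below(rows, P9_CUTOFF) for name, rows in results.items()
}
reachable = {k: v for k, v in zero_risk_cost.items() if v is not None}
cheapest_policy = min(reachable, key=reachable.get)
```

Among the tabulated settings, the Variance Reduction policy reaches the cutoff at the lowest opportunity cost and the Pure Aging policy never reaches it, which agrees with the discussion of Figure 5.15a.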

This case study is just one example of exploring a specific risk tolerance. Depending

on what kind of risk the utility is willing to tolerate, we can extrapolate

different parts of the graph to find which policies perform the best in the given risk

interval. Moreover, comparing across the 3 risk measures in this case study highlights

how the relative performance of different policies changes depending on which risk

measure we are looking at.

5.5 Limitations of Policies

From the results presented so far, we see that the Threshold policy performs

the best out of all of the policies. Specifically, we saw that the policies that utilized

the weighted ranking performed better than the Pure Aging policy that only relied

on chronological age. However, the policies with the weighted ranking still incurred

significant opportunity costs, especially at very low risk levels. Finally, we saw that

the Lookahead policies actually performed slightly worse than the Threshold policy

despite attempting to account for failures in the future. In this section, we will

investigate these limitations and seek to explain why the policies perform in this

way.

5.5.1 Chronological Age vs. Weighted Ranking

The results showed that the policies using the weighted ranking performed better

than the Pure Aging policy using only chronological age. Recall from section 4.4,

pg. 47 that the weighted ranking, $W_t^i$, is derived from a combination of the chronological

age and the number of jolts that the transformer has experienced over its lifetime.

By taking into account the number of jolts experienced, the hypothesis is that the

weighted ranking more accurately identifies when transformers will fail. Although the

results of the policy simulations suggest that this is the case, we verify it by looking at the distributions

of each criterion when transformers fail in the simulation. Table 5.6 shows the results of each criterion at failure.

                    µF       σF      σF/µF
Chronological age   72.45    18.99   0.261
Weighted ranking    118.28   11.65   0.099

Table 5.6: Statistics for the failure values for the chronological age and weighted ranking

µF is the mean of the failure values and σF is the

standard deviation of the failure values. We see that the failure values for chrono-

logical age have a significantly higher standard deviation compared to the weighted

ranking, which indicates that transformers fail at a wider range of chronological ages

than weighted rankings. Figure 5.16 shows the difference in the dispersion of failure

values between chronological age and weighted ranking.

Figure 5.16: Fitted probability distribution functions of failure values using the two criteria. The x-axis represents the percentage of the mean value

Figure 5.16 clearly shows that the failure values using the weighted ranking are

significantly less dispersed than the chronological age. Given this dramatic difference,

it is not surprising that policies that use weighted ranking as a criterion for replacement

perform better.
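The dispersion comparison can be reproduced from the rounded entries of Table 5.6. The sketch below recomputes the ratio σF/µF for both criteria; because it starts from rounded inputs, the results can differ from the table in the last digit. The function and variable names are ours.

```python
def coefficient_of_variation(mean, std):
    """Standard deviation expressed as a fraction of the mean."""
    return std / mean

# Means and standard deviations of the failure values from Table 5.6.
cv_age = coefficient_of_variation(72.45, 18.99)
cv_rank = coefficient_of_variation(118.28, 11.65)

# How many times more dispersed chronological age is, relative to its mean.
ratio = cv_age / cv_rank
```

Relative to its mean, chronological age at failure is more than two and a half times as dispersed as the weighted ranking at failure.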

5.5.2 Variance in Fault Magnitude

Although it has been determined that using the weighted ranking as a replacement

criterion offers better performance than chronological age, we still see that the

policies utilizing weighted ranking incur significant opportunity costs. We hypothesize

that this is due to the fact that the variance in fault magnitude is extremely high.

This is problematic for the weighted ranking formula, because the formula weights

every fault the same since the fault magnitude is unobservable to the policy. Recall

from section 3.5.4, pg. 33, that the magnitude of faults is assumed to come from 3

distributions, denoted m1, m2, m3 in increasing magnitude. m1 faults have 0 magnitude

and are not counted in the weighted ranking formula. m2 faults constitute 80%

of the faults, but are assumed to have a minor impact on aging. m3 faults constitute

10% of the faults, but have a significant impact on aging. Figure 5.17 shows the p.d.f. of the fault magnitude.

Figure 5.17: Fitted probability distribution function of all fault magnitudes

As expected, the majority of faults are m2 faults. However, m3 faults, while low in probability compared to

the m2 faults, have high magnitudes such that they have the most significant impact

on transformer failure. Figure 5.18 shows the separate p.d.f.’s for m2 and m3 faults.

From Figure 5.18a, we see that m2 fault magnitudes lie within the interval of [0, 0.5].

From Figure 5.18b, we see that m3 fault magnitudes show an extremely long right

tail. Table 5.7 shows the statistics for the fault magnitudes, where µM denotes mean

fault magnitude, σM denotes the standard deviation, and Min. and Max. represent

the minimum and maximum observed fault magnitudes. We see that the maximum

m3 fault magnitude is more than 5 standard deviations away from the mean. Moreover,

we see that the standard deviation of m3 fault magnitudes is very high relative

to the mean.

(a) m2 faults (b) m3 faults

Figure 5.18: Fitted probability distribution function of fault magnitudes separated by fault type

            µM      σM      Min.   Max.
m2 faults   0.185   0.116   0.00   0.50
m3 faults   2.77    2.21    0.50   13.86

Table 5.7: Statistics for the fault magnitude of m2 and m3 faults
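The two observations about m3 faults follow directly from the entries of Table 5.7. The sketch below recomputes how many standard deviations the largest m3 fault lies above the mean, and the size of the standard deviation relative to the mean; the function and variable names are ours.

```python
def std_devs_from_mean(value, mean, std):
    """Distance of value from the mean, measured in standard deviations."""
    return (value - mean) / std

# m3 row of Table 5.7.
m3_mean, m3_std, m3_max = 2.77, 2.21, 13.86

z_max = std_devs_from_mean(m3_max, m3_mean, m3_std)
rel_spread = m3_std / m3_mean
```

The maximum sits just over 5 standard deviations above the mean, and the standard deviation is roughly 80% of the mean.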

Fault magnitude is in terms of real age, which is the unobservable parameter

that determines transformer failure. In this way, both chronological age and weighted

ranking are criteria that try to approximate real age using observable information.

We have shown in the previous section that the correlation between real age and

chronological age is low. Although weighted ranking is a better approximation of real

age than chronological age, we show in this section that the weighted ranking has

significant limitations as well. The magnitude of faults experienced by transformers

has an extremely high variance. The weighted ranking considers every fault to be

equal, ignoring the dramatic differences in magnitude. This is the biggest shortcoming

of the weighted ranking formula, and most likely explains the reason that the

policies still incur significant opportunity cost when reducing risk. We will explore

the importance of knowing the real age of transformers later on in section 5.6.

5.5.3 Expected Amount of Aging Per Year

The results indicated that the Threshold policy performed better than the 1-year

and 2-year Lookahead policies. This is surprising because the policies are based on

the same methodology, with the Lookahead policies forecasting into the future. Recall

from section 4.6, pg. 49 that the Lookahead policy has a tunable parameter δ that

indicates the expected amount of aging per year. Since δ is constant, if the observed

amount of aging per year has a high variance, δ would not accurately capture the

expected amount of aging.

(a) Fitted p.d.f. (b) Empirical c.d.f

Figure 5.19: Distribution functions of the annual amount of aging experienced by transformers

Figure 5.19 shows the observed distribution functions of the amount of aging in

terms of real age experienced by all transformers over the course of one simulation.

The results show that the amount of aging per year shows a similar trend to the

magnitude of faults (see Figure 5.18). The amount of aging per year is characterized

by an extremely long right tail, indicating that, in a small number of years, a transformer

can experience significantly greater aging. Table 5.8 shows the statistics of the

observed amount of aging per year across all transformers, with µA representing the

mean amount of aging per year, σA representing the standard deviation, and Min.

and Max. indicating the minimum and maximum observed values over the course of

                 µA     σA     Min.   Max.
Aging per year   2.87   2.39   1      27.79

Table 5.8: Statistics for the amount of aging per year

one simulation. Note that µA is different from the value of µa defined earlier since µA

is only the mean across 1 sample path. We see that σA is fairly high and almost equal

to µA. Moreover, the maximum observed value is more than 10 standard deviations

away from the mean. This indicates that the length of the right tail is extremely

large.

These results indicate that the variance in the amount of aging experienced per

year is high. Since δ is constant, even with a finely tuned value, δ would not be

able to accurately forecast the future. Moreover, since the real age is unobservable

to the policy, the Lookahead policy, like the Threshold policy, utilizes the weighted

ranking to approximate real age. We showed in the previous section that the weighted

ranking faced some limitations in predicting real age. This fact, combined with the

inaccuracy of forecasting into the future, explains why the Lookahead policies perform

slightly worse than the Threshold policy. Moreover, the inaccuracy in forecasting also

explains why the 2 year Lookahead Policy does not perform any better than the 1

year Lookahead.
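The limitation of a constant forecast can be illustrated with a generic calculation: among all constant predictions, the sample mean gives the lowest root-mean-square error, and that error equals the standard deviation of the data. The sketch below checks this on a hypothetical right-skewed sample. The numbers are not thesis data, and the example is stated for yearly aging amounts in general rather than for the weighted-ranking units in which δ is expressed.

```python
from statistics import mean, pstdev

def rmse_of_constant(values, c):
    """Root-mean-square error of predicting every value with the constant c."""
    return (sum((v - c) ** 2 for v in values) / len(values)) ** 0.5

# Hypothetical right-skewed yearly aging amounts (not thesis data).
aging = [1.0, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 12.0]

best_constant = mean(aging)
best_rmse = rmse_of_constant(aging, best_constant)
```

However carefully the constant is tuned, its error cannot fall below the spread of the quantity being forecast, so a high-variance aging process leaves a large irreducible error.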

5.6 Value of Information

In this section, we will look at the value of information in terms of the objective

function. The two policies that have been implemented to test the value of information

are the Lookahead Plus policy and the Perfect Information policy. Recall from

section 4.7.1 that the Lookahead Plus policy is a version of the Lookahead policy that

is able to observe the real age of every transformer. The Perfect Information policy

assumes that the policy perfectly knows the future. In this way, these 2 policies can

be used to estimate:

1. $VI^C$: the value of information about the current state of the grid.

This is estimated by the improvement in the objective function due to the

Lookahead Plus policy.

2. $VI^F$: the value of information about the future aging of the grid.

This is estimated by the improvement in the objective function in the Perfect

Information policy over the Lookahead Plus policy.

The total value of perfect information is therefore denoted $VI$, where

$$VI = VI^C + VI^F$$
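The decomposition can be written as two successive differences in the minimum objective value. The sketch below assumes that the first improvement is measured against the best currently feasible policy; the function name and the three objective values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def value_of_information(obj_feasible, obj_lookahead_plus, obj_perfect):
    """Split the total improvement into current-state and future components.

    Arguments are minimum objective values for the best currently feasible
    policy, the Lookahead Plus policy and the Perfect Information policy.
    """
    vi_current = obj_feasible - obj_lookahead_plus
    vi_future = obj_lookahead_plus - obj_perfect
    return vi_current, vi_future, vi_current + vi_future

# Hypothetical objective values, chosen only for illustration.
vi_c, vi_f, vi = value_of_information(100.0, 60.0, 50.0)
```

Because the Lookahead Plus value cancels in the sum, the total depends only on the gap between the best feasible policy and the Perfect Information policy.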

5.6.1 Estimating the Value of Information

Figure 5.20 shows the performance of the Lookahead Plus and Perfect Information

policies compared to the Base policy and Threshold policy. As expected, the

minimum costs of the Lookahead Plus and Perfect Information policies are both far

lower than the current minimum in the Threshold policy. Like the other policies we

have tested, the Lookahead Plus policy can be tuned depending on the utility’s risk tolerance.

Figure 5.20 shows that many of the costs incurred at different risk tolerances

for the Lookahead Plus policy, not just the minimum cost, are lower than the current

feasible minimum. This indicates that, given accurate information on the current

state of the grid, the Lookahead Plus policy is much more efficient than currently

feasible policies. Moreover, we see that $VI_C$ is significantly larger than $VI_F$.

Table 5.9 summarizes the value of each type of information.

Figure 5.20: Values of the objective function of policies, including policies to estimate the value of information. The top orange line represents the minimum value across all currently feasible policies. The middle orange line represents the minimum value in the Lookahead Plus policy. The bottom orange line represents the minimum value of the Perfect Information policy.

          Value (USD)   % of $VI$
$VI_F$    5,295         81.7
$VI_C$    1,187         18.3
$VI$      6,482         100.0

Table 5.9: Summary of the value of information

These results indicate that the value of being able to observe the current real

age of transformers is significantly higher than the value of information about future

aging. From the perspective of the utility, these results estimate how much money

the utility would save if it were to be able to see this information. In this way, the

estimated value of information can serve as a guide for the utility on how much it

should be willing to spend on investing in potential new technologies to acquire these

capabilities. This paper does not explore what potential investments these may be

or whether or not these investments exist.

Regardless of what the limits of new technology are, these results are promising

for the utility. Technology that better captures the current state of transformers is

surely more feasible than forecasting future aging. Therefore, it is promising that

$VI_C$ captures an overwhelming percentage of total $VI$. However, these results assume

that the policy is able to perfectly observe the real age of transformers. Recall from

section 4.7.1 that there is an ε noise parameter in the Lookahead Plus policy that

affects the accuracy of the information. Here, ε is assumed to be zero. In the

next section, we investigate the performance of the policy when ε > 0.

5.6.2 Impact of Measurement Noise

Since there is currently no way to even attempt to measure real age, it is reason-

able to assume that technology that perfectly measures real age will not be developed

for some time. However, early versions of such technology will likely be able to

measure real age only with some amount of noise. In this section, we explore the

impact of measurement noise on the performance of the Lookahead Plus policy.

Recall from section 4.7.1, pg. 53 that the tunable parameters for the Lookahead

Plus policy are δp, which represents the expected amount of aging per year, and ε,

the noise term. In contrast to the Lookahead policies, where δ was fixed after testing,

δp can be tuned depending on the risk tolerance of the utility because the threshold

for failure is fixed at $P_{it} < 0$.

Figure 5.21 compares the performance of the Lookahead Plus policy with increasing

values of ε on the p9 risk measure. We omit the graphs for the p12 and p15 risk

measures because they show largely the same pattern. As expected, the policy becomes

consistently less efficient as ε increases. We see that for ε = 0.2 and ε = 0.25, the

policy starts at a much lower risk level despite the fact that the same δp values are

used. Since these lower risk levels are achieved at lower opportunity costs when ε is

lower, this does not indicate that higher measurement noise is more efficient. Rather,

this suggests that higher measurement noise causes $P_{it} < 0$ for more transformers,

which results in the policy choosing to over-replace transformers.

Figure 5.21: Performance of the Lookahead Plus policy under different ε on the p9 risk measure. δp increases from left to right.

Figure 5.22 compares the distributions of $P_{it}$ under different values of ε in the

same simulation path. Figure 5.22a shows the distribution under the assumption of

zero measurement noise and indicates that 5 transformers are near failure. Figure 5.22b

shows the same distribution but with significant noise in the measurement and

indicates that 29 transformers are near failure. This shows how increasing measurement

noise causes the policy to believe that more transformers are near failure, which

explains the lower risk levels shown in Figure 5.21. Moreover, these distributions

describe the current belief. Since the Lookahead Plus policy uses the current belief to

forecast future failure, error caused by the measurement noise propagates into the future,

affecting the replacement decision.
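The mechanism described above can be illustrated with a small sketch. It assumes additive, zero-mean Gaussian noise with standard deviation ε, which may differ from the noise model of section 4.7.1, and it uses a made-up population of health values rather than the simulated grid. All names and numbers are hypothetical.

```python
import random


def observe_with_noise(true_values, epsilon, rng):
    """Return noisy observations of the true values.

    ASSUMPTION: additive zero-mean Gaussian noise with standard deviation epsilon.
    """
    return [value + rng.gauss(0.0, epsilon) for value in true_values]


def count_near_failure(observed_values, cutoff):
    """Count units whose observed value falls below the near-failure cutoff."""
    return sum(1 for value in observed_values if value < cutoff)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: every unit is truly above the near-failure cutoff.
true_health = [rng.uniform(0.1, 1.0) for _ in range(1000)]
NEAR_FAILURE_CUTOFF = 0.05

flagged_clean = count_near_failure(observe_with_noise(true_health, 0.0, rng), NEAR_FAILURE_CUTOFF)
flagged_noisy = count_near_failure(observe_with_noise(true_health, 0.2, rng), NEAR_FAILURE_CUTOFF)
print(flagged_clean, flagged_noisy)
```

Because most units sit above the cutoff, symmetric noise pushes more observations below it than it lifts above it, so the noisy belief flags more units as near failure even though the true population is unchanged.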

(a) ε = 0 (b) ε = 0.2

Figure 5.22: Comparison of $P_{it}$ with different values of ε at t = 90 in the same sample path.

Although measurement noise certainly degrades the performance of the

Lookahead Plus policy, how much impact does it have on the value

of information on the current grid, $VI_C$? More importantly, how much does measurement

noise affect performance relative to the Threshold policy, the best-performing

feasible policy? Figure 5.23a shows the performance of the policy with different

amounts of measurement noise compared to the Threshold policy on the p9 risk

measure. Figure 5.23b shows the minimum objective function values under each scenario

compared with the minimum Threshold policy value. We see that, even with

significant measurement noise, the Lookahead Plus policy consistently outperforms the

Threshold policy. With ε = 0.25, the policy is still able to improve the objective

function significantly and capture 60% of $VI_C$. This is promising for the utility because

it shows that perfect measurements of real age are not necessary in order to reduce

costs. We note that even though the minimum costs for ε = 0.2 and ε = 0.25 are

lower than that of the Threshold policy, these levels of measurement noise will likely

lead to over-replacement of transformers. Thus, high measurement noise decreases the

flexibility of the policy compared to lower measurement noise, which is able to reduce

risk efficiently across all risk tolerances.
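The share of the current-state value of information that survives measurement noise can be computed as a simple ratio. The sketch below assumes the share is measured against the minimum Threshold policy cost, as in Figure 5.23b. The function name and the cost values are hypothetical and chosen only to produce a round number.

```python
def fraction_of_current_state_value_captured(threshold_min, noiseless_min, noisy_min):
    """Share of the noiseless improvement that the noisy policy retains.

    All arguments are minimum objective values (costs); lower is better.
    """
    noiseless_gain = threshold_min - noiseless_min  # improvement with epsilon = 0
    if noiseless_gain <= 0:
        raise ValueError("the noiseless policy must improve on the Threshold policy")
    noisy_gain = threshold_min - noisy_min          # improvement with epsilon > 0
    return noisy_gain / noiseless_gain


# Hypothetical costs, for illustration only.
captured = fraction_of_current_state_value_captured(
    threshold_min=10_000.0, noiseless_min=9_000.0, noisy_min=9_400.0
)
print(f"{captured:.0%}")
```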

(a) p9 risk measure (b) Objective function

Figure 5.23: Impact of measurement noise on Lookahead Plus policy performance compared to the Threshold policy.

From these results, we conclude that the value of information on the current

state of the grid should not be underestimated. The value of information on the current

state is far higher than the value of information on future aging. Moreover, these

results show the limitations of currently feasible policies that use the weighted ranking.

Given that policies utilizing better information with significant measurement noise

still significantly outperform the best currently feasible policies, the weighted ranking

is shown to be a relatively inaccurate approximation of real age. We hope that these

results encourage PSE&G and other utilities to recognize the importance of investing

in more accurate diagnostic methods that would make this information accessible,

even if the diagnostics are not perfectly accurate.

Chapter 6

Conclusion

The results of the policy simulations confirm the preliminary hypothesis that

policies that utilize the weighted ranking would perform better than policies that use

only chronological age. The weighted ranking policies are able to reduce failure risk at

a lower opportunity cost and are also able to eliminate failure risk entirely.

However, even though the efficient frontier is defined mostly by the Threshold policy,

the question remains of how PSE&G can utilize the results of this thesis to inform

its replacement policy going forward.

The extent to which failure risk is reduced varies depending on the values of the

tunable parameters. The efficient frontier determines the lowest possible opportunity

cost that can be incurred for a given level of failure risk, but it is up to the utility

to decide what level of failure risk is appropriate. The utility must consider that

as the level of failure risk decreases, the marginal reduction in failure risk per unit

opportunity cost also decreases. This suggests that, even on the efficient frontier,

the benefits of completely eliminating failure risk are not worth the opportunity

cost. The objective function formulated in this thesis attempts to quantify

the trade-off between failure risk and opportunity cost based on existing literature;

however, PSE&G and other utilities may operate with different assumptions and risk

tolerances that affect how they consider these costs. Furthermore, the regulatory

environment of public utilities may impact this decision process. The decision to

implement a more aggressive replacement policy that significantly reduces failure risk

would require increased financial outlays that must be budgeted for beforehand. The

tight regulations that utilities face may impose a cap on the amount of spending

utilities can allocate for replacement, which can limit the type of policy that can be

implemented. Ultimately, the final decision on replacement policies must be made

after careful consideration of the level of risk that can be tolerated as well as the

financial capabilities of the utility.
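To make the diminishing marginal benefit concrete, the following sketch extracts an efficient frontier from a set of (opportunity cost, failure risk) pairs and computes the risk reduction per unit of additional cost between neighboring frontier points. The data points, units, and function names are hypothetical and are not simulation results from this thesis.

```python
def efficient_frontier(points):
    """Return the non-dominated (cost, risk) pairs, sorted by increasing cost.

    A point is kept only if it achieves strictly lower risk than every
    cheaper (or equally cheap) point.
    """
    frontier = []
    best_risk = float("inf")
    for cost, risk in sorted(points):
        if risk < best_risk:
            frontier.append((cost, risk))
            best_risk = risk
    return frontier


def marginal_risk_reduction(frontier):
    """Risk reduction per unit of extra cost between consecutive frontier points."""
    return [
        (risk_a - risk_b) / (cost_b - cost_a)
        for (cost_a, risk_a), (cost_b, risk_b) in zip(frontier, frontier[1:])
    ]


# Hypothetical policy outcomes: (opportunity cost, failure risk in percentage points).
outcomes = [(0, 40), (10, 20), (15, 25), (20, 10), (30, 12), (40, 0)]
frontier = efficient_frontier(outcomes)
marginals = marginal_risk_reduction(frontier)
print(frontier)
print(marginals)
```

In this made-up example each additional unit of opportunity cost buys less risk reduction than the previous one, which is the pattern the utility must weigh when choosing a point on the frontier.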

Regardless of how PSE&G approaches its replacement policy going forward, it

must also consider the limitations of these policies and what potential improvements

can be made. A significant limitation is the inability to measure the amount of aging

that a transformer has experienced before it fails DGA testing. The results of the

policy simulations indicate that the value of being able to measure such information

is quite high. This thesis does not examine whether such measurements are

possible, but the results should inform PSE&G and other utilities on how to approach

investing in research that could lead to the future development of such measurement

techniques.

6.1 Areas for Further Research

This thesis represents a first effort at formulating the complex problem of dealing

with transformer failures faced by PSE&G and other utilities. Although great care

was taken to ensure that the problem is formulated as realistically as possible, certain

assumptions were made in order to simplify the problem. Thus, while the results of

this thesis provide many insights into a potential solution, there are still many areas

for further research.

First, a more complex model for transformer lifetime could be used to simulate

transformer failures. The literature includes research that models transformer lifetime

using a hazard function, such as a logistic function or the Perks’ function. The

difficulty in this approach lies in deciding on which model to use since there is no

consensus in the literature. Moreover, many of these more complex functions are

modeled with time being the only independent variable, raising the question of how

to simulate the impact of a fault on a hazard rate function.
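As one illustration of what such an extension might look like, the sketch below uses a logistic hazard rate with time as the only independent variable, derives its cumulative hazard in closed form, and draws lifetimes by inverse-transform sampling. The parameterization (ceiling, steepness, midpoint) and the parameter values are assumptions made for this example; they are not taken from the cited literature and do not address how faults would enter the model.

```python
import math


def logistic_hazard(t, ceiling, steepness, midpoint):
    """Hazard rate h(t) = ceiling / (1 + exp(-steepness * (t - midpoint)))."""
    return ceiling / (1.0 + math.exp(-steepness * (t - midpoint)))


def _softplus(x):
    """ln(1 + exp(x)); adequate for the moderate arguments used here."""
    return math.log1p(math.exp(x))


def cumulative_hazard(t, ceiling, steepness, midpoint):
    """Closed-form integral of the logistic hazard from 0 to t."""
    return (ceiling / steepness) * (
        _softplus(steepness * (t - midpoint)) - _softplus(-steepness * midpoint)
    )


def survival(t, ceiling, steepness, midpoint):
    """Probability that a unit survives beyond age t."""
    return math.exp(-cumulative_hazard(t, ceiling, steepness, midpoint))


def lifetime_from_uniform(u, ceiling, steepness, midpoint):
    """Inverse-transform sampling: solve cumulative_hazard(t) = -ln(u) for u in (0, 1]."""
    y = (steepness / ceiling) * (-math.log(u)) + _softplus(-steepness * midpoint)
    return midpoint + math.log(math.expm1(y)) / steepness


# Hypothetical parameters: hazard rises toward 0.1 per year around age 40.
PARAMS = dict(ceiling=0.1, steepness=0.2, midpoint=40.0)
median_lifetime = lifetime_from_uniform(0.5, **PARAMS)
print(median_lifetime, survival(median_lifetime, **PARAMS))
```

Feeding independent uniform draws on (0, 1] into `lifetime_from_uniform` would produce simulated lifetimes under this hazard model.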

Second, several simplifying assumptions were made about the operation of

transformers. These include a simplified DGA testing process, assumptions about

the correlation between transformer locations, and a single definition of what

a fault entails. As this was not an electrical engineering thesis, assumptions were

made to simplify the modeling process. A more realistic extension of this thesis

would modify these assumptions based on the electrical engineering literature.

Third, we do not consider the availability of spare transformers when evaluating

the replacement decision. Choosing how many spares to keep at any given point in

time adds another layer of complexity to the problem. However, from the perspec-

tive of the utility, the replacement decision must be evaluated in tandem with the

availability of spares at the time.

6.2 Final Remarks

Providing a resilient and stable power grid is the foremost priority of a public

utility. Transformers serve as the heart of the grid, meaning that potential failures

represent an area of enormous risk in achieving this goal. For PSE&G and utilities

nationwide, it is clear that the current policy of relying on DGA testing for transformer

replacement will not be sufficient in preventing transformer failures in the near future.

However, the results of this thesis are promising in showing that by implementing an

early replacement strategy, PSE&G can significantly reduce the risk of transformer

failure. Given careful consideration of its risk tolerance and the opportunity cost

involved, we are confident that PSE&G can implement a policy that matches its

financial capabilities and risk profile.

Bibliography

Arshad, M., Islam, S. M., and Khaliq, A. (2004). Power transformer aging and life

extension. 8th International Conference on Probabilistic Methods Applied to Power

Systems.

Bartley, W. H. (2003). Analysis of transformer failures. International Association of

Engineering Insurers 36th Annual Conference.

Brown, R. E. (2007). Power Systems, chapter Power System Reliability, pages 19–1–

19–14. CRC Press.

Chen, Q. and Egan, D. M. (2006). A Bayesian method for transformers life estimation

using Perks’ hazard function. IEEE Transactions on Power Systems, 21(4).

Chowdhury, A. A. and Koval, D. O. (2005). Development of probabilistic models for

computing optimal distribution substation spare transformers. IEEE Transactions

on Industry Applications, 41(6).

Council of Economic Advisers (2013). Economic benefits of increasing electric grid

resilience to weather outages. Executive Office of the President.

Dixon, F. L., Steward, D., and Hoffmeister, J. (2010). When to replace aging trans-

formers. 2010 Industry Applications Society 57th Annual Petroleum and Chemical

Industry Conference.

Fischer, M., Tenbohlen, S., Schafer, M., and Haug, R. (2010). Determining power

transformers sequence of maintenance and repair in power grids. 2010 IEEE Inter-

national Symposium on Electrical Insulation.

Hamrick, L. (2009). Dissolved gas analysis for transformers. ESCO Energy Services.

Hong, Y., Meeker, W. Q., and McCalley, J. D. (2009). Prediction of remaining life of

power transformers based on left truncated and right censored lifetime data. The

Annals of Applied Statistics, 3(2).

Ismail, N. and Jemain, A. A. (2007). Handling overdispersion with negative binomial

and generalized Poisson regression models. Casualty Actuarial Society Forum.

Karlsson, S. (2007). A review of lifetime assessment of transformers and the use of

dissolved gas analysis. KTH School of Electrical Engineering.

Kogan, V. I., Roger, C. J., and Tipton, D. E. (1996). Substation distribution trans-

formers failures and spares. IEEE Transactions on Power Systems, 11(4).

Liu, H., Davidson, R. A., Rosowsky, D. V., and Stedinger, J. R. (2005). Negative

binomial regression of electric power outages in hurricanes. Journal of Infrastructure

Systems.

McNutt, W. J. (1992). Insulation thermal life considerations for transformer loading

guides. IEEE Transactions on Power Delivery, 7(1).

van Schijndel, A., Wetzer, J. M., and Wouters, P. (2006). Forecasting transformer

reliability. 2006 Annual Report Conference on Electrical Insulation and Dielectric

Phenomena.
