
Journal of the Meteorological Society of Japan, Vol. 81, No. 1, pp. 85--112, 2003

Skill Evaluation of Probabilistic Forecasts

by the Atmospheric Seasonal Predictability Experiments

Shoji KUSUNOKI

Climate Prediction Division, Climate and Marine Department, Japan Meteorological Agency, Tokyo, Japan

and

Chiaki KOBAYASHI

Climate Research Department, Meteorological Research Institute, Japan Meteorological Agency, Tsukuba, Japan

(Manuscript received 17 April 2002, in revised form 17 October 2002)

Abstract

Probabilistic forecast skill of the atmospheric seasonal predictability experiments is evaluated using the Japan Meteorological Agency (JMA) Atmospheric General Circulation Model (AGCM), which is a global spectral model of T63 resolution. Four-month ensemble integrations were carried out with nine consecutive days of initial conditions preceding the target season. All four seasons in a 15-year period from 1979 to 1993 were chosen as target seasons. The model was forced with observed sea surface temperature (SST) during the time integrations.

Probabilistic forecasts of 500 hPa height, 850 hPa temperature and precipitation are verified by four skill measures: the Brier skill score and its decomposition into reliability and resolution, relative operating characteristics (ROC), ranked probability score (RPS), and rank histogram. It is revealed that probabilistic forecast skill bears some similarity to deterministic forecast skill in its seasonality and regionality, such as relatively higher skill in winter of the Northern Hemisphere, and over East Asia and North America. The skill for precipitation is found to be generally lower than that for 500 hPa height and 850 hPa temperature, as is also recognized for deterministic forecast skill.

1. Introduction

It is widely recognized that the atmosphere has predictability on the seasonal time scale associated with the time evolution of the boundary-condition forcing, particularly the sea surface temperature (SST). Although the relative importance of atmospheric initial conditions to SST forcing in seasonal forecasting is not as crucial as in short-range forecasting, it should be recognized that seasonal forecasting is an initial-value problem and that the model's sensitivity to the uncertainty in the initial state must be taken into account. Even if the future time evolution of SSTs were given perfectly, the future state of the atmosphere could not be predicted deterministically because of the chaotic nature of the atmosphere, which originates from the strong nonlinearity of its own internal dynamics. Therefore, a probabilistic approach that estimates the probability density function (PDF) of the atmosphere, that is, the possible future atmospheric states and their chances of occurrence, is considered more appropriate for seasonal forecasting than a single deterministic forecast. A probabilistic forecast is realized by introducing ensemble forecasting,

Corresponding author: Shoji Kusunoki, Climate Prediction Division, Climate and Marine Department, Japan Meteorological Agency, 1-3-4 Otemachi, Chiyoda-ku, Tokyo 100-8122, Japan. E-mail: [email protected] © 2003, Meteorological Society of Japan


in which several deterministic forecasts are started from a set of slightly different initial conditions representing the uncertainty of the initial state. In the present study, nine initial conditions are given by the lagged average forecast (LAF) method, in which a number of forecasts are initialized from a set of consecutive objective-analysis fields. LAF is widely used in seasonal forecasting because of its simplicity and ease of use. Although the ultimate purpose of probabilistic ensemble forecasting is to estimate the PDF of the atmospheric state, the ensemble average is often used as the best estimate of the future state.

To assess and compare the potential seasonal predictability of the atmosphere, three research projects have been carried out worldwide. In these projects, ensemble seasonal forecasting experiments were conducted using several atmospheric general circulation models (AGCMs) of numerical prediction centers and universities. The models were initialized by LAF, and forced with observed SSTs to estimate the upper bound of predictability. The first is called the Dynamical Seasonal Prediction Model Intercomparison Project (SMIP), proposed by the Working Group on Seasonal to Interannual Prediction (WGSIP) in the Climate Variability and Predictability (CLIVAR) Programme, World Climate Research Programme (WCRP) of the World Meteorological Organization (WMO). Five institutions, including the Japan Meteorological Agency (JMA), have participated in the project. The target of the experiment was limited to the summers and winters of four particular years characterized by strong El Niño and Southern Oscillation (ENSO) phases (Kobayashi et al. 2000).

The second is called the Dynamical Seasonal Prediction (DSP) project, in which five institutions in the United States participated (Mo and Straus 1999; Shukla et al. 2000a, b). AGCMs were integrated for more than 14 winters (January to March) in the 1980s and 1990s to evaluate model performance over the North American region in connection with the variability of ENSO.

The third is called PRediction Of climate Variations On Seasonal to interannual Time-scales (PROVOST), in which four institutions in Europe participated. In contrast to SMIP and DSP, which have some limitations in target season and year, in PROVOST nine-member ensembles of four-month integrations were conducted for all four seasons during the 15 years from 1979 to 1993. PROVOST revealed that state-of-the-art AGCMs have the potential ability of seasonal forecasting, depending on region, season and physical variable. The deterministic and probabilistic forecast skill of the models, as well as the advantages of the multi-model approach in DSP and PROVOST, are fully documented in the DSP/PROVOST special issue of the Quarterly Journal of the Royal Meteorological Society (Vol. 126, July 2000, Part B, No. 567).

The deterministic forecast skill of individual forecasts and ensemble-average forecasts in the three projects was evaluated in terms of traditional skill measures such as systematic error or mean bias (Mo and Straus 1999; Kobayashi et al. 2000; Brankovic and Palmer 2000; Palmer et al. 2000; Shukla et al. 2000b), anomaly correlation coefficient (Brankovic and Palmer 2000; Doblas-Reyes et al. 2000; Graham et al. 2000; Kobayashi et al. 2000; Shukla et al. 2000a, b) and root mean square error (Kobayashi et al. 2000; Shukla et al. 2000a). Moreover, the models' deterministic ability to reproduce the year-to-year variability of meteorological variables, low-frequency modes and teleconnection patterns was investigated intensively by many authors and presented in the DSP/PROVOST special issue. The JMA has also conducted almost the same set of seasonal predictability experiments as that of PROVOST. The deterministic forecast skill of these JMA experiments was already investigated, mainly in terms of the anomaly correlation coefficient, and reported by Kusunoki et al. (2001, hereafter referred to as K2001).

The skill evaluation of probabilistic forecasts, as well as that of deterministic forecasts, is indispensable for assessing the overall performance of a model, because seasonal forecasting should intrinsically be realized by predicting the PDF of atmospheric states. One of the most common and widely used skill measures for probabilistic forecasts is the Brier score. Defining a certain meteorological event in advance, the Brier score estimates the discrepancy between the forecast probability, ranging from 0 to 1, and the observed probability, which takes a value of either 0 or 1. The reliability diagram, which


has a close relation with the Brier score, enables us to evaluate model performance visually and intuitively. The meaning and definitions of the Brier score and reliability diagram will be thoroughly described in section 4. Mo and Straus (1999), Palmer et al. (2000) and Shukla et al. (2000b) have plotted reliability diagrams and calculated the Brier score together with related skill scores. They revealed that the skill of 500 hPa height and 850 hPa temperature depends on region and on the amplitude of ENSO. However, their target season was restricted to winter only.

The Relative Operating Characteristic (ROC) is also often used as a skill measure, based on the hit rate and false alarm rate of probabilistic forecasts for a certain meteorological event. See section 5 for details. For 500 hPa height in winter, Mo and Straus (1999) and Palmer et al. (2000) found that the ROC skill score has almost the same regional and ENSO-amplitude dependence as the Brier scores. Graham et al. (2000) calculated the ROC skill score of 850 hPa temperature for all four seasons and clarified that in the Northern Hemisphere the model shows its best performance in winter and its worst in summer, but found no seasonality in the tropics.

The Ranked Probability Score (RPS) is one of the skill measures used to validate categorical probability forecasts, with emphasis on the shape of the predicted PDF. See section 6 for details. For 500 hPa height and the 200 hPa zonal component of wind velocity in winter, Mo and Straus (1999) have shown that skill is generally higher in the tropics than in extratropical regions. A similar result was reported by Doblas-Reyes et al. (2000) for precipitation in all four seasons.

The rank histogram, which can be regarded as a non-parametric test, is rather different from the three skill measures mentioned above in that it does not require any definition of a threshold value to specify meteorological events or categories. This method assesses the model's ability to capture the observed value within the spread of the ensemble forecast. See section 7 for details. For 500 hPa height in winter, Mo and Straus (1999) have shown that the rank histogram can be a useful tool to detect the systematic error of the model's PDF, which is closely connected with the systematic error of the deterministic forecast.

Table 1 summarizes the experimental design and the probabilistic forecast skill evaluation methods applied by the five preceding studies, and compares them with those of the present study. Mo and Straus (1999) calculated all four skill measures for the DSP runs, but the season was restricted to winter. Although Doblas-Reyes et al. (2000) and Graham et al. (2000) investigated the skill of the PROVOST runs for all four seasons, their evaluation method was restricted to a single measure.

The purpose of this study is to assess the probabilistic forecast skill of the JMA AGCM with the same set of seasonal predictability experiments as in K2001 for all four seasons, and to compare our skill with that of other AGCMs. Probabilistic forecasts of 500 hPa height, 850 hPa temperature and precipitation are evaluated by the four skill measures: the Brier skill score and its decomposition into reliability and resolution, ROC, RPS and rank histogram. The dependence of skill on validation measure, meteorological variable, season, region and event criterion will be investigated in a more comprehensive way than in the preceding studies (Table 1).

Section 2 briefly describes the model used and the experimental design. Section 3 explains the verification data used. Section 4 shows the skill validated by the Brier score and its related skill scores, including the reliability diagram. Section 5 shows the skill validated by ROC. Section 6 shows the skill validated by RPS. Section 7 shows the skill validated by the rank histogram. Finally, section 8 summarizes the results.

2. The model and experimental design

The model and experimental design are exactly the same as in K2001, but a brief description is given here for reference. The AGCM used in this study is based on GSM9603, which was used for eight-day and one-month operational forecasts from March 1996 to February 2001 (JMA 1997). This version (T63L30) has a spectral resolution of triangular truncation at wavenumber 63, corresponding to about 180 km horizontal grid spacing, and has 30 levels with a 10 hPa top. A prognostic Arakawa-Schubert scheme by Randall and Pan (1993) is introduced for deep convection. The Simple Biosphere scheme (SiB), developed by Sellers et al.


Table 1. Comparisons between the experimental design and probabilistic forecast skill evaluation methods of the present study and those of five preceding studies.


(1986) and Sato et al. (1989), is implemented for land-surface processes. Deep soil temperature, soil moisture and snow depth are predicted, but their initial values are climatologically specified.

Our experimental design is almost the same as that of PROVOST, except that our integrations are conducted for 15 winters, including the 1993 winter that is missing in PROVOST. In this paper, the 1993 winter means the winter that begins in December 1993. The conditions of the integrations are described below.

(1) Target years: 1979 to 1993, 15 years.

(2) Target seasons: All four seasons, Spring (March–June), Summer (June–September), Autumn (September–December), and Winter (December–March).

(3) Integration time: Four months.

(4) Initial condition: Nine consecutive days at 1200 UTC of the ECMWF 15-year Reanalyses (ERA-15) preceding the first month of the target season. This corresponds to the Lagged Average Forecast (LAF) method of ensemble forecasting. The ERA-15 is documented by Gibson et al. (1997).

(5) Ensemble size: Nine members.

(6) SST: The model was forced with daily SST temporally interpolated from observed monthly SSTs. From February 1979 to December 1981, Global Sea-Ice and Sea Surface Temperature Data Version 2.2 (GISST2.2) are prescribed. From January 1982 to March 1994, Optimum Interpolation (OI) analyses of NCEP by Reynolds and Smith (1994) are prescribed.

(7) Model climatology: For each season, 15 years × 9 members = 135 integrations are averaged to define the model climatology.

(8) Model output: The model output was saved daily at 1200 UTC with 2.5-degree resolution in longitude and latitude. In this paper, all analysis is based on three-month averages derived from this daily data.

3. Verification data

To verify model forecasts of 500 hPa height and 850 hPa temperature, the ERA-15 data from March 1979 to December 1993 and the JMA OI Global Analyses (GANAL) data from January 1994 to March 1994 are combined. For the verification of the 1993 winter, the GANAL is used. The observed climatology is made by combining the ERA-15 and GANAL data. The verification data include only 12 UTC data with 2.5-degree resolution in longitude and latitude.

The verification data for precipitation are the Climate Prediction Center Merged Analysis of Precipitation (CMAP) compiled by Xie and Arkin (1997). A dataset without model-generated precipitation in the data assimilation process is used. All data are interpolated to grid points with 2.5-degree resolution in longitude and latitude.

4. Brier score and reliability diagram

4.1 Brier score

The Brier score (Brier 1950) is one of the most classical and fundamental measures for assessing the skill of probabilistic forecasts of the occurrence of a certain meteorological event E. It verifies a two-category forecast, depending on whether the binary event E occurs or not. For example, we define E as a positive anomaly of 500 hPa height at a grid point. Since the ensemble size of the present study is nine, there are nine different anomalies at each grid point. It is natural to define the forecast probability of E as the fraction of ensemble members with a positive anomaly to the total ensemble size of nine. Actual occurrence of a positive anomaly at the grid point concerned corresponds to an observed probability of 1, so a higher forecast probability is preferable at the same grid point. The Brier score b is defined by

b = \frac{1}{N} \sum_{i=1}^{N} (p_i - v_i)^2,   (1)

where i is the sample number, N is the total number of samples, p_i is the forecast probability of E for sample i, and v_i is the observed probability of E for sample i. If E actually occurs, v_i = 1; otherwise v_i = 0. The Brier score measures the magnitude of the discrepancy between observed probability and forecast probability. For a perfect deterministic forecast, in which p_i always takes the value 0 or 1 and always coincides with v_i, b is 0; for the worst deterministic forecast, b is 1. The Brier score is negatively oriented, just like the root mean square error widely used in deterministic forecast evaluation. Samples can be taken either from single grid points or from regional average values, and also from different time levels.
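The definition above maps directly into code. The following is a minimal sketch (Python with NumPy; the function name and array layout are our own illustrative assumptions, not from the paper) of computing the forecast probability p_i from a nine-member ensemble of anomalies and the Brier score (1):

```python
import numpy as np

def brier_score(ensemble, observed, threshold=0.0):
    """Brier score (1) for the binary event 'anomaly > threshold'.

    ensemble : (n_samples, n_members) array of forecast anomalies
    observed : (n_samples,) array of verifying anomalies
    """
    # Forecast probability p_i: fraction of members for which E occurs
    p = (ensemble > threshold).mean(axis=1)
    # Observed probability v_i: 1 if E actually occurred, else 0
    v = (observed > threshold).astype(float)
    # Mean squared discrepancy between forecast and observed probability
    return float(np.mean((p - v) ** 2))
```

A perfect deterministic forecast, in which every member reproduces the observation, yields b = 0, and b never exceeds 1, consistent with the bounds stated above.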


4.2 Murphy's decomposition

The Brier score (1) can be mathematically decomposed into three terms (Murphy 1973; Wilks 1995; Palmer et al. 2000), which is called Murphy's decomposition:

b = b_{rel} - b_{res} + b_{unc},   (2)

where

b_{rel} = \sum_{j=1}^{J} (p_j - o_j)^2 g_j,   (3)

b_{res} = \sum_{j=1}^{J} (o_c - o_j)^2 g_j,   (4)

b_{unc} = o_c (1 - o_c),   (5)

and

o_j = M_j / N_j,   g_j = N_j / N,   o_c = M / N,   (6)

M = \sum_{j=1}^{J} M_j,   N = \sum_{j=1}^{J} N_j.   (7)

Here j is the category number, and J is the total number of categories, which is 10 = 9 (ensemble size) + 1 in this study. p_j is the forecast probability of category j, defined as the fraction of (j - 1) ensemble members satisfying the criterion of the event E to the ensemble size. M_j is the number of actual occurrences of the event E when forecast probability p_j is predicted. N_j is the total number of forecasts predicting the probability p_j. M is the total number of actual occurrences of the event E. N is the total number of samples. The relation between these variables is summarized in Table 2. Therefore, o_j is the frequency of actual occurrence of E when forecast probability p_j is predicted, g_j is the frequency of predicting forecast probability p_j, and o_c is the climatological frequency of E.

Murphy's decomposition (2) is regarded as an alternative representation of the Brier score through the classification of the whole sample into forecast categories of p_j. b_rel in (3) is called the reliability, which measures the extent to which a set of probabilistic forecasts matches the actual frequency of occurrence of the event E. b_res in (4) is called the resolution, which measures the ability of the forecasts to discern the difference of o_j from the climatological frequency of the event, o_c. b_unc in (5) is called the uncertainty, which measures the inherent uncertainty in the forecasting situation. b_unc depends only on the observed climatology o_c and is never influenced by the forecasts. b_unc takes its maximum value of 0.25 when o_c = 0.5, a marginal state in which the event is equally likely to occur or not. b_unc takes its minimum value of 0 when o_c = 0 or 1, a strictly deterministic state with the least uncertainty about the occurrence of the event. Considering that all three terms are always greater than or equal to zero, and that the resolution term is subtracted in Murphy's decomposition (2), forecasts with smaller reliability, smaller uncertainty and larger resolution lead to a smaller Brier score.

Table 2. Contingency table for probabilistic forecasts of a binary event E. j is the category number. J is the total number of categories and is 10 = 9 (ensemble size) + 1 in this study. p_j is the forecast probability of category j, defined as the fraction of (j - 1) ensemble members satisfying the criterion of the event E. M_j is the number of actual occurrences of the event E when forecast probability p_j is predicted. N_j is the total number of forecasts predicting the probability p_j. M is the total number of actual occurrences of the event E. N is the total number of samples. Data at different grid points are regarded as independent samples, with weight factors proportional to the cosine of latitude. "Yes"/"No" denote the occurrence of the observation.

Category number   Forecast probability        Yes    No           Total number of samples
1                 p_1 = 0                     M_1    N_1 - M_1    N_1
2                 p_2 = 1/(J - 1)             M_2    N_2 - M_2    N_2
3                 p_3 = 2/(J - 1)             M_3    N_3 - M_3    N_3
...               ...                         ...    ...          ...
j                 p_j = (j - 1)/(J - 1)       M_j    N_j - M_j    N_j
...               ...                         ...    ...          ...
J                 p_J = 1                     M_J    N_J - M_J    N_J
Total                                         M      N - M        N
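Because the forecast probabilities take only the J = 10 discrete values (j - 1)/9, the decomposition can be computed exactly by binning samples on forecast probability. The following is a minimal unweighted sketch (the paper additionally weights grid points by the cosine of latitude; the function name is our own assumption):

```python
import numpy as np

def murphy_decomposition(p, v, n_members=9):
    """Decompose the Brier score into reliability, resolution and
    uncertainty following (2)-(7); returns (b_rel, b_res, b_unc)."""
    p = np.asarray(p, dtype=float)  # forecast probabilities, multiples of 1/n_members
    v = np.asarray(v, dtype=float)  # observed occurrences of E (0 or 1)
    N = p.size
    o_c = v.mean()                  # climatological frequency o_c = M/N
    b_rel = b_res = 0.0
    for j in range(n_members + 1):  # J = n_members + 1 probability categories
        in_cat = np.isclose(p, j / n_members)
        N_j = in_cat.sum()
        if N_j == 0:
            continue                # category never forecast: g_j = 0
        o_j = v[in_cat].mean()      # observed frequency o_j = M_j/N_j
        g_j = N_j / N
        b_rel += (j / n_members - o_j) ** 2 * g_j
        b_res += (o_c - o_j) ** 2 * g_j
    return b_rel, b_res, o_c * (1.0 - o_c)
```

With exact binning the identity b = b_rel - b_res + b_unc holds to floating-point precision, which is a convenient check on a skill-score implementation.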

4.3 Reliability diagram and related skill scores

A reliability diagram (Wilks 1995) is a graphical representation that enables us to assess the performance of probabilistic forecasts intuitively. Solid lines in Fig. 1 show the reliability diagrams of the 500 hPa height anomaly ZA for all four seasons. The event E is a positive anomaly (ZA > 0 m). The calculations include all grid points in the Northern Hemisphere (20–87.5°N) for all 15 years. Weight factors proportional to the cosine of latitude are introduced to account for the difference in area represented by each grid point. The reliability diagram is a curve in which o_j is plotted as a function of forecast probability p_j. The distribution of g_j is also plotted in each panel by the dash line. In the case of spring (Fig. 1a), the value of o_j for the forecast probability p_9 = 0.88 is about 0.7, less than 0.88, suggesting that the predicted probability is overestimated by the model. For all four seasons, it is evident that predicted probabilities are overestimated for forecast probabilities greater than 0.5 and underestimated for forecast probabilities less than 0.5. In the case of a perfect forecast, the reliability diagram coincides with the diagonal line (o_j = p_j), which leads to b_rel = 0, considering the definition of reliability (3).

In each panel of Fig. 1, the value of the climatological frequency o_c is also plotted as a horizontal line. From the definition (4) of b_res, it is visually recognized that the difference between the reliability-diagram curve and the horizontal line of o_c corresponds to the magnitude of b_res. For a perfect deterministic forecast with b = b_rel = 0, Murphy's decomposition (2) gives b_res = b_unc.

It is convenient to define a skill score for the Brier score as the rate of improvement with reference to the climatological forecast, in which the climatological frequency o_c is always forecast (Wilks 1995; Palmer et al. 2000). From the viewpoint of the reliability diagram, the climatological forecast corresponds to the intersection of the diagonal line and the horizontal line of o_c, resulting in b_rel = 0. Moreover, the definition (4) gives b_res = 0 for the climatological forecast, since o_j = o_c. Denoting the Brier score of the climatological forecast as b_cli, Murphy's decomposition (2) with b_rel = b_res = 0 gives b_cli = b_unc. The Brier skill score BSS can be defined as

BSS = \frac{b_{cli} - b}{b_{cli} - b_{prf}} = 1 - \frac{b}{b_{unc}},   (8)

where b_prf (= 0) is the value of the Brier score for a perfect deterministic forecast, and the relation b_cli = b_unc is used for the final expression. BSS measures the skill improvement with reference to the climatological forecast, normalized by the maximum possible improvement for a perfect deterministic forecast, so that BSS = 1 for a perfect forecast, and BSS becomes negative for a forecast no better than the climatological forecast. Similarly, following Palmer et al. (2000), the reliability skill score Brel and the resolution skill score Bres are defined by

Brel = 1 - b_{rel} / b_{unc},   (9)

Bres = b_{res} / b_{unc}.   (10)

For a perfect forecast, b_rel = 0 and b_res = b_unc, therefore Brel = Bres = 1; otherwise, Brel and Bres are less than 1. Dividing both sides of equation (2) by b_unc and substituting the relations (8), (9) and (10), Murphy's decomposition (2) can be rewritten in terms of the three skill scores, without any explicit dependence on b_unc, as follows:

BSS = Brel + Bres - 1.   (11)

It is clear that larger Brel and larger Bres contribute to larger BSS. The relation (11) can also be used as a consistency check of skill-score calculations. In Fig. 1, the values (%) of BSS, Brel and Bres are plotted at the top of each panel. In winter, the reliability diagram is much closer to the diagonal line than in the other three seasons, as is confirmed by the highest reliability skill score of all, 97.7%. Moreover, in winter the counter-clockwise tilt of the reliability diagram is largest, and the resolution skill score Bres also shows its highest value. The highest values of both the reliability and resolution skill scores result in the highest Brier skill score, 9.47%, in winter.
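Given the decomposition terms, the skill scores (8)–(10) and the consistency relation (11) are straightforward to compute; a minimal sketch (the helper name is our own, not from the paper):

```python
def brier_skill_scores(b_rel, b_res, b_unc):
    """Skill scores (8)-(10) relative to the climatological forecast,
    for which b_cli = b_unc (since b_rel = b_res = 0)."""
    b = b_rel - b_res + b_unc           # Murphy's decomposition (2)
    bss = 1.0 - b / b_unc               # Brier skill score (8)
    brel_score = 1.0 - b_rel / b_unc    # reliability skill score (9)
    bres_score = b_res / b_unc          # resolution skill score (10)
    return bss, brel_score, bres_score
```

By construction the three returned values satisfy BSS = Brel + Bres - 1, so relation (11) serves as a consistency check on any independent implementation.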

Fig. 1. Reliability diagrams (solid line) of 500 hPa height for all four seasons. The event is a positive anomaly. The calculations include all grid points in the Northern Hemisphere (20–87.5°N) for all 15 years. The dash line shows the relative frequency of occurrence g_j in each probability category. The horizontal line in each panel shows the climatological frequency of occurrence. The values of BSS, Brel and Bres at the top of each panel denote the Brier skill score, reliability skill score and resolution skill score (%), respectively. The diagonal line corresponds to the perfect forecast. (a) Spring (March–May), (b) Summer (June–August), (c) Autumn (September–November), (d) Winter (December–February).

The effect of changing the event criterion, or threshold, is examined next. Figure 2 shows the reliability diagrams for the event of the Northern Hemisphere 500 hPa height anomaly (ZA) greater than +20 m (ZA > +20 m). The fact that the reliability curves (solid lines) tend to lie under the diagonal line for larger values of forecast probability may, at first sight, give the impression that the reliability skill scores are lower than those for the event ZA > 0 m (Fig. 1). Contrary to this expectation, the reliability skill scores for the event ZA > +20 m (Fig. 2) are higher than those for the event ZA > 0 m (Fig. 1). This originates from the distortion of the distribution of forecast probability g_j (dash lines) toward small forecast probabilities. In spite of the large discrepancy between the reliability curves and the diagonal lines for larger values of forecast probability, the contribution of this part to the summation in the original reliability score b_rel is diminished by the smallness of g_j. For smaller values of forecast probability, the contribution to the summation in b_rel is also suppressed, despite the large g_j, because the reliability curves fit the diagonal line well. On the whole, the reliability skill scores for the event ZA > +20 m (Fig. 2) exceed those for the event ZA > 0 m (Fig. 1). For the same reason, the resolution skill scores for the event ZA > +20 m are higher than those for the event ZA > 0 m. As a result, the Brier skill scores are higher for the event ZA > +20 m. It should be noted that the reliability and resolution scores are thus evaluated with emphasis on the characteristics over the range of forecast probabilities that occur with large frequency.

Fig. 2. Same as Fig. 1, but for the event of 500 hPa height anomaly greater than +20 m.

The dependence of the three Brier skill scores for the Northern Hemisphere 500 hPa height anomaly on season and on the event criterion is shown in Fig. 3. All three skill scores show almost the same seasonal variation, with the highest value in winter and the lowest in summer, even if the event criterion is different. This seasonality is similar to that of the anomaly correlation of the ensemble-average forecast for the Northern Hemisphere 500 hPa height (Fig. 3a of K2001). The scores for the events ZA > +20 m and ZA < −20 m are generally larger than those for ZA > 0 m, for all three scores and for all four seasons. Almost all scores are found to be positive, indicating that the ensemble forecast system generally performs better than the climatological forecast. The only exception is the summer case of the Brier skill score for the event ZA > 0 m. Some differences and asymmetry between the scores for the events ZA > +20 m and ZA < −20 m may originate from sampling errors due to insufficient sample size for the statistics.

Figure 4 shows the dependence of the North-ern Hemisphere 500 hPa height anomaly Brierskill scores on longitudinal zone and on season.For the event ZA > 0 m (solid line), the skillis relatively higher over the East Asia (90–177.5�E) and North America (180–92.5�W),than over Europe (0–87.5�E) and the Atlantic(90–2.5�W) except for summer. Relatively highskill over the East Asia and North America isalso similar to that found by the anomaly cor-relation of ensemble average forecast for theNorthern Hemisphere 500 hPa height (Fig. 4 ofK2001). This regionality except for summer isalso recognized for the events ZA > þ20 m andZA < �20 m together with the skill enhance-ment over the event ZA > 0. Using AGCM sim-ulations by the Center for Ocean-Land Atmo-sphere studies (COLA) participated in DSP, Moand Straus (1999) have calculated the Brierskill scores for large positive and large negative

Fig. 3. Dependence of 500 hPa height anomaly (ZA) skill scores (%) on season for the Northern Hemisphere (20–87.5°N). The event ZA > 0 m is shown by the solid line, ZA > +20 m by the dashed line, and ZA < −20 m by the dotted line. (a) Brier skill score BSS, (b) reliability skill score Brel, (c) resolution skill score Bres.


500 hPa height anomaly events defined by the magnitude of the observed standard deviation at each grid point. Although their target season is restricted only to boreal winter, they found a tendency of relatively lower skill over Europe and higher skill over North America, which was also confirmed by our present study.

Figure 5 compares the Brier skill scores in

Fig. 4. Dependence of 500 hPa height anomaly (ZA) Brier skill scores (BSS, %) on longitudinal zone and on season for the Northern Hemisphere (20–87.5°N). The event ZA > 0 m is shown by the solid line, ZA > +20 m by the dashed line, and ZA < −20 m by the dotted line. (a) Spring, (b) Summer, (c) Autumn, (d) Winter.

Fig. 5. Dependence of 500 hPa height anomaly (ZA) Brier skill scores (BSS, %) on season and on region. The event ZA > 0 m is shown by the solid line, ZA > +20 m by the dashed line, and ZA < −20 m by the dotted line. (a) Northern Hemisphere (20–87.5°N), (b) Tropics (17.5°S–17.5°N), (c) Southern Hemisphere (20–87.5°S).


the Northern Hemisphere with those in the tropics and the Southern Hemisphere. Figure 5a is exactly the same as Fig. 3a but with different vertical scaling. In the Southern Hemisphere, skills are relatively high in the coldest season (boreal summer), as in the Northern Hemisphere, but not lowest in the hottest season (boreal winter). Skill enhancement for the events ZA > +20 m and ZA < −20 m is evident in the Northern and Southern Hemispheres. In the tropics, skill for the event ZA > 0 m has large seasonal variability compared with the Northern and Southern Hemispheres. The dependence of skill on event criterion is large, but not as systematic as in the Northern and Southern Hemispheres. Considering that the observed interannual variability of 500 hPa height amounts to only about 10 m (standard deviation) in the tropics, the specified threshold value of 20 m is so large compared to the observed variability that the climatological frequency of the event oc is less than 0.02 in the case of Fig. 5b. This means that the number of sample data that meet the event criteria is small, which leads to the unstable nature of the statistics.

The above-mentioned verifications are based on the cases of no-lead-time forecasts. For the spring case, since the integrations start from the days just before March, the forecast averaged from March to May corresponds to the no-lead-time forecast. The forecast field averaged from April to June can be interpreted as a one-month lead time forecast for spring. Figure 6 illustrates the effect of lead time on the three skill scores. The skill of a one-month lead forecast is generally lower than that of the no-lead forecast for all three scores and for all seasons. As for the Brier skill score, the one-month lead forecast outperforms the climatological forecast only in winter.

As for the 850 hPa temperature anomaly TA, the dependence of the Brier skill scores on season and on region was also examined and is depicted in Fig. 7. The threshold value of the event criteria |TA| > 1°C was chosen such that the climatological frequencies of the event become about 0.2, which is almost comparable to those for the 500 hPa height case of the event |ZA| > 20 m. Although skill is relatively high in winter for the Northern Hemisphere (Fig. 7a), and in boreal summer for the Southern Hemisphere (Fig. 7c), the skill enhancement for the event |TA| > 1°C

Fig. 6. Dependence of 500 hPa height anomaly (ZA) skill scores (%) on season and on lead time for the Northern Hemisphere (20–87.5°N). The event is the positive anomaly. The no-lead-time forecast is shown by the solid line; the one-month lead time forecast, by the dashed line. For spring, the forecast field averaged from March to May corresponds to the no-lead-time forecast, and that averaged from April to June corresponds to the one-month lead time forecast. (a) Brier skill score BSS, (b) reliability skill score Brel, (c) resolution skill score Bres.


in both hemispheres is not as obvious as in the 500 hPa height case of the event |ZA| > 20 m. In the tropics, the skill for the event TA > 0°C (solid line in Fig. 7b) is relatively high in spring and

winter, as in the 500 hPa height case (solid line in Fig. 5b). Although the skill for 850 hPa temperature has some similarity in seasonality with that for 500 hPa height, the skill of 850 hPa temperature is generally lower than that of 500 hPa height for all four seasons and for all three regions. The seasonality of skill for the event TA > 0°C (solid line in Fig. 7) is similar to that of the Northern Hemisphere 850 hPa temperature anomaly correlation for the ensemble average forecast (dotted lines in Fig. 12 of K2001).

In the case of the Northern Hemisphere 850 hPa temperature anomaly, the dependence of the Brier skill scores on longitudinal zone and on season (figure not shown) was found to show similar characteristics as in the 500 hPa height case (Fig. 4). Palmer et al. (2000) have calculated the Brier skill score for the events TA < 0°C and TA < −1°C using AGCM simulations by the European Centre for Medium-Range Weather Forecasts (ECMWF), the United Kingdom Meteorological Office (UKMO), Météo-France (MF) and Électricité de France (EDF), all of which participated in the PROVOST project. While their target season was restricted to only boreal winter, they demonstrated a tendency of relatively lower skill over Europe, which was also confirmed by our present study.

As for the precipitation anomaly RA, the dependence of the Brier skill scores on season and on region was also examined and is depicted in Fig. 8. The threshold value of the event criteria |RA| > 0.2 mm/day was again chosen such that the climatological frequency of the event becomes about 0.2, which is almost comparable to those for the 500 hPa height case of the event |ZA| > 20 m. Skill is relatively high in winter for the Northern Hemisphere (Fig. 8a) and in boreal summer for the Southern Hemisphere (Fig. 8c). In the tropics, skills for all three events are relatively high in spring and winter. In general, however, the skill score is no better than the climatological forecast, except for the Northern Hemisphere winter and for some cases in the tropics.

In summary, the common feature of the Brier skill score for the positive anomaly (solid line) of 500 hPa height (Fig. 5), 850 hPa temperature (Fig. 7) and precipitation (Fig. 8) is that skill is relatively high in the Northern Hemisphere winter, in spring and winter of the tropics, and

Fig. 7. Same as Fig. 5, but for the 850 hPa temperature anomaly (TA) Brier skill score (BSS, %). The event TA > 0°C is shown by the solid line, TA > +1°C by the dashed line, and TA < −1°C by the dotted line. Note that the scales of the vertical axes are different from those in Fig. 5.


in boreal summer of the Southern Hemisphere. Skill for the positive anomaly of 500 hPa height is generally higher than those of 850 hPa temperature and precipitation over the whole globe.

5. Relative operating characteristics

The relative operating characteristic (ROC) was originally developed in the fields of radar signal detection theory and psychological and medical test evaluation (Mason and Graham 2002). Stanski et al. (1989) introduced the ROC into the field of atmospheric science. The ROC is a measure to evaluate the skill of a probabilistic forecast for a specific event E over a range of forecast probability thresholds pt. The events are defined as binary, which means that E is forecast to occur if the forecast probability p ≥ pt and not to occur if p < pt. Table 2 is a contingency table which gives the observed number of occurrences and non-occurrences of an event as a function of forecast probability. Using the definitions of variables in Table 2, the hit rate ht and the false alarm rate ft can be expressed as

h_t = \frac{1}{M} \sum_{j=t}^{J} M_j, \qquad (12)

f_t = \frac{1}{N - M} \sum_{j=t}^{J} (N_j - M_j), \qquad (13)

where t is the category number of the forecast probability threshold pt. In the definitions (12) and (13), pt does not appear explicitly, but both ht and ft are obviously functions of pt and take values from 0 to 1. The ROC curve is a plot of ht on the vertical axis against ft on the horizontal axis. The area A under the ROC curve is a skill score representing the total performance of the ensemble forecast system. A perfect deterministic forecast gives ht = 1 and ft = 0; hence the curve degenerates to the upper-left-hand corner, giving A = 1. A no-skill forecast with the relation ht = ft is represented by the diagonal line, giving A = 0.5. In the case of the climatological probabilistic forecast, where the climatological frequency oc is always predicted, the ROC curve consists of only the two points (ht, ft) = (0, 0) and (1, 1). This means that the ROC curve coincides with the diagonal line, which gives the ROC skill score of A = 0.5. For a further detailed explanation, see the last paragraph in the appendix. The interpretation of A as well as its significance testing were fully discussed by Mason and Graham (2002).

Fig. 8. Same as Fig. 5, but for the precipitation anomaly (RA) Brier skill score (BSS, %). The event RA > 0 mm/day is shown by the solid line, RA > +0.2 mm/day by the dashed line, and RA < −0.2 mm/day by the dotted line. Note that the scales of the vertical axes are different from those in Fig. 5.

Figure 9 shows ROC curves of the Northern Hemisphere 500 hPa height for the positive anomaly event. The bulge of the curve toward the upper-left hand is relatively large in spring and winter, as is also confirmed by the larger values of the skill score A. Figure 10 shows the dependence of the ROC skill score A on season, region, and event criterion. The criteria of events are the same as those in the Brier skill score (Fig. 5). The ROC

skill scores are generally greater than 50% for all four seasons and all three regions, while the skills of summer and autumn in the tropics for the event ZA < −20 m are almost marginal. Skill over all three regions has similar seasonality, in that skill is relatively high in spring and winter, except for spring in the tropics

Fig. 9. Same as Fig. 1, but for the relative operating characteristic (ROC) curve of 500 hPa height for all four seasons. The event is the positive anomaly. The calculations include all the grid points in the Northern Hemisphere (20–87.5°N) for all 15 years. The values on the curve denote the forecast probability threshold pt (%) for the occurrence of the event. The values of A × 100 in the top-left corner outside of each panel denote the ROC skill score (%), which is defined by the area under the ROC curve. (a) Spring, (b) Summer, (c) Autumn, (d) Winter.


for the event ZA > +20 m. This seasonality of the ROC skill score basically resembles that of the Brier skill score (Fig. 5) in all three regions, with some differences. Skill enhancement for the event |ZA| > 20 m is evident in the Northern and Southern Hemispheres (Figs. 10a, c).

The ROC skill score tends to overestimate the performance of the forecast compared with the Brier skill score (Fig. 5), where forecasts have negative skill scores in some cases. This tendency is also confirmed by a simplified example of a probabilistic forecast which shows a negative Brier skill score but gives the marginal ROC skill score of 0.5 (see Appendix).

As for 850 hPa temperature, the ROC skill scores are also depicted in Fig. 11. The criteria of events are the same as those used in the Brier skill score (Fig. 7). The ROC skill scores are greater than 50% for all four seasons and all three regions. The seasonality of the ROC skill score basically resembles that of the Brier skill score (Fig. 7) in all three regions. Overestimation of forecast skill in terms of the ROC skill score over the Brier skill score is again evident, as in the 500 hPa height case. Focussing on the positive anomaly event (solid line), skill is much higher in the tropics than in the Northern and Southern Hemispheres. Graham et al. (2000) have calculated the ROC skill score for the negative 850 hPa temperature anomaly event using AGCM simulations by UKMO and ECMWF, which participated in the PROVOST project. For all four seasons, they found relatively higher skill in the tropics than in the extratropics (their Figs. 2a, 2b, 4), which was also recognized in the present study. Note that the ROC skill score for the negative anomaly event is identical to that of the positive anomaly event, because the two events are complementary, so that the corresponding two ROC curves become symmetric with respect to the false alarm rate. While Graham et al. (2000) have demonstrated that for the Northern Hemisphere skill is highest in spring and lowest in autumn, in our present study the skill is highest in winter and lowest in summer.

In the Northern Hemisphere (Fig. 11a), skill for the events |TA| > 1°C (dashed and dotted lines) is larger than that for the event TA > 0°C for all four seasons. Although the target of their calculation is limited to the Northern Hemisphere winter only, Palmer et al. (2000) have also indicated that skill for the event TA > +1°C is higher than that for the event TA > 0°C. In contrast, the skill enhancement for the events |TA| > 1°C is opposite in the tropics (Fig. 11b), and is not clear in the Southern Hemisphere (Fig. 11c).

Fig. 10. Same as Fig. 5, but for the ROC skill score (%) of the 500 hPa height anomaly (ZA). The event ZA > 0 m is shown by the solid line, ZA > +20 m by the dashed line, and ZA < −20 m by the dotted line.


As for precipitation, the ROC skill scores are also depicted in Fig. 12. The criteria of events are the same as those used in the Brier skill scores (Fig. 8). The ROC skill scores are greater than 50% for all four seasons and all three regions. The seasonality of the ROC skill score basically resembles that of the Brier skill score (Fig. 8) in all three regions. Overestimation of

Fig. 11. Same as Fig. 7, but for the ROC skill score (%) of the 850 hPa temperature anomaly (TA). The event TA > 0°C is shown by the solid line, TA > +1°C by the dashed line, and TA < −1°C by the dotted line. Note that the scales of the vertical axes are different from those in Fig. 10 for 500 hPa height.

Fig. 12. Same as Fig. 8, but for the ROC skill score (%) of the precipitation anomaly (RA). The event RA > 0 mm/day is shown by the solid line, RA > +0.2 mm/day by the dashed line, and RA < −0.2 mm/day by the dotted line. Note that the scales of the vertical axes are different from those in Fig. 10 for 500 hPa height.


forecast skill in terms of the ROC skill score is striking compared with the Brier skill score. Again, these features are similar to the previous two cases of 500 hPa height and 850 hPa temperature. Focussing on the skill of the positive anomaly (solid line), similar to the case of 850 hPa temperature (Fig. 11), skill is much higher in the tropics than in the Northern Hemisphere. As for seasonal differences, skill is relatively high in spring and winter both for the Northern Hemisphere and the tropics. Graham et al. (2000) have also demonstrated similar seasonality in the Northern Hemisphere (their Fig. 9b). In the tropics, however, they found little variation of skill with season (their Fig. 9a), which contrasts with our results of strong seasonal dependence.

6. Ranked probability score

The ranked probability score (RPS) is a measure which is sensitive to the shape of the probability distribution function (PDF) given by a categorical probabilistic forecast (Epstein 1969; Murphy 1969; Murphy 1971; Wilks 1995). The RPS is an extension of the Brier score to the case of a multiple-category event. The RPS can be defined for a probabilistic forecast with a number of categories K as

\mathrm{RPS} = \frac{1}{K - 1} \sum_{k=1}^{K} (P_k - O_k)^2, \qquad (14)

where

P_k = \sum_{l=1}^{k} p_l, \qquad O_k = \sum_{l=1}^{k} o_l.

Here k and l are the category numbers, and pl is the forecast probability of category l. Pk is the cumulative forecast probability up to category k. ol is the observed probability of category l, which, like a delta function, equals 1 if category l actually occurs and equals zero otherwise. Ok is the cumulative observed probability up to category k, which, like a Heaviside or step function, equals 1 for category numbers greater than or equal to that in which the event actually occurs, and zero otherwise. Because OK = PK = 1 by definition, the relation OK − PK = 0 always holds. Considering that there is no contribution of the Kth category to the summation in (14), in practice the summation goes from 1 to K − 1. The RPS is negatively oriented like the Brier score. For a perfect deterministic forecast, RPS = 0. Less accurate forecasts have higher positive values of RPS. In the case of a two-category event, K = 2, the RPS reduces to the Brier score.

The main advantage of the RPS is that, through the use of cumulative forecast probability, it can capture the degree to which the PDF of the forecast concentrates on the category where the event actually occurs. For example, suppose the case of a three-category forecast where we have two different forecasts, A: (p1, p2, p3) = (0.1, 0.4, 0.5) and B: (p1, p2, p3) = (0.4, 0.1, 0.5), with the actual observation occurring in the third category: (o1, o2, o3) = (0, 0, 1). The three-category Brier scores of the two forecasts A and B are the same, because what matters is only the difference between the forecast probability and the observed probability at each category, and the differences of the forecast probability distribution in the first and second categories do not affect the final score. Since the peak of the PDF is more concentrated toward the third category in forecast A than in B, it is obvious that forecast A is better than B from the standpoint of an accurate PDF forecast. In fact, the cumulative probabilities of forecasts A: (P1, P2, P3) = (0.1, 0.5, 1.0) and B: (P1, P2, P3) = (0.4, 0.5, 1.0) give the results of RPS = 0.130 for forecast A and RPS = 0.205 for forecast B, indicating the superiority of forecast A. Another advantage of the RPS is that the score can be defined and calculated for a PDF inferred from a single set of ensemble forecasts at a grid point, whereas the calculation of the Brier score needs a large sample size for the stability of the statistics.
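The worked example can be checked by transcribing (14) directly; the helper name is ours:

```python
def rps(p, o):
    """Ranked probability score, Eq. (14): p are forecast probabilities,
    o the observed 0/1 probabilities over K categories."""
    K = len(p)
    P = [sum(p[:k + 1]) for k in range(K)]   # cumulative forecast P_k
    O = [sum(o[:k + 1]) for k in range(K)]   # cumulative observed O_k
    return sum((Pk - Ok) ** 2 for Pk, Ok in zip(P, O)) / (K - 1)

obs = [0, 0, 1]                              # event observed in category 3
rps_a = rps([0.1, 0.4, 0.5], obs)            # forecast A -> 0.130
rps_b = rps([0.4, 0.1, 0.5], obs)            # forecast B -> 0.205
```

The lower RPS for forecast A reflects exactly the cumulative-probability argument above: A's PDF piles more probability next to the observed category.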

It is natural to define a ranked probability skill score (RPSS) with reference to the climatological probabilistic forecast, as in the case of the Brier skill score. The RPSS is defined as

\mathrm{RPSS} = 1 - \mathrm{RPS} / \mathrm{RPS}_{cli}, \qquad (15)

where RPScli is the RPS of the climatological probabilistic forecast. The RPSS is 1 for a perfect deterministic forecast, and is negative for a forecast no better than the climatological probabilistic forecast.
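Continuing the three-category example of the previous section, (15) can be evaluated against an equal-probability (1/3, 1/3, 1/3) climatological forecast; this is a sketch with our own helper, not the paper's code.

```python
def rps(p, o):
    """Ranked probability score, Eq. (14), over K categories."""
    K = len(p)
    P = [sum(p[:k + 1]) for k in range(K)]   # cumulative forecast
    O = [sum(o[:k + 1]) for k in range(K)]   # cumulative observed
    return sum((Pk - Ok) ** 2 for Pk, Ok in zip(P, O)) / (K - 1)

obs = [0, 0, 1]                               # event observed in category 3
rps_cli = rps([1 / 3, 1 / 3, 1 / 3], obs)     # climatology: 5/18, about 0.278
rpss = 1 - rps([0.1, 0.4, 0.5], obs) / rps_cli  # Eq. (15), about 0.53
```

The sharp forecast A thus scores RPSS ≈ 0.53 against the flat climatology, consistent with the RPSS being positive for forecasts better than climatology.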

Figure 13 shows the distribution of the 500 hPa height anomaly RPSS (%) for all four seasons. Three equally probable (33.33%) categories are defined by the standard deviations (SD) of the observed 15-year time series at each grid point, assuming a Gaussian distribution. The values of the anomaly thresholds between categories are given by ±0.47 × SD at each grid point. The climatological probabilistic forecast is defined to have the same 33.33% probability for the above-normal, near-normal and below-normal categories at each grid point. The RPSS was calculated at each grid point for each year and then averaged over the 15 years. Skill over the tropics is generally higher than that over the mid- and high latitudes for all four seasons, although in autumn the region of high RPSS is limited to the tropical Pacific. Focussing on the tropics,

skill over the maritime continent or the western tropical Pacific is relatively higher than in other regions in spring, summer and winter. In the Northern Hemisphere, there is almost no skill over the continents. Although the season is limited to winter only, Mo and Straus (1999) have calculated the global distribution of the three-category RPSS of the 500 hPa height anomaly using AGCM simulations that participated in the DSP. They found relatively higher skill in the tropics than in the extratropics, which is also recognized in the present study. Focussing on the tropics, they have indicated that skill is relatively higher in the central and eastern tropical Pacific, whereas our results show relatively higher skill in the western tropical Pacific.

Fig. 13. Distribution of the Ranked Probability Skill Score (RPSS, %) of the 500 hPa height anomaly averaged over 15 years. Three equally probable (33.33%) categories are defined by the standard deviations of the 15-year time series of observations at each grid point, assuming a Gaussian distribution. A forecast with negative RPSS is no better than the climatological probabilistic forecast. (a) Spring, (b) Summer, (c) Autumn, (d) Winter.


Using PROVOST simulations, Doblas-Reyes et al. (2000) have also demonstrated the higher skill over the tropics and negative skill over the northern continents in winter (their Fig. 3).

It is noteworthy that the RPSS has a close relation with the deterministic forecast skill in terms of the ensemble average forecast. Figure 14 shows the distribution of interannual temporal correlation coefficients (%) between observations and the model ensemble average forecast of 500 hPa height for 15 years. A large value of the correlation coefficient means a high ability of the model to reproduce year-to-year variability. It is evident that the geographical distribution of RPSS (Fig. 13) is qualitatively similar to that in Fig. 14 for all four seasons.

Figure 15 shows the distribution of RPSS for the 850 hPa temperature anomaly. Skill is generally lower than that of 500 hPa height (Fig. 13) over the whole globe. Skill is relatively higher over the oceans than over the continents. Considering that the lower-troposphere temperature is strongly related to the land surface condition, the low skill over the continents may originate from the model's difficulty in simulating land surface processes. In contrast to 500 hPa height (Fig. 13), skill over the maritime continent is generally low and negative. In the Northern Hemisphere, skill is highest in winter, especially over the North Pacific and North America. This regionality of skill was also documented by Doblas-Reyes et al. (2000) for the winter case.

Figure 16 shows the distribution of RPSS for the precipitation anomaly. Skill is generally lower than that of 850 hPa temperature (Fig. 15)

Fig. 14. Distribution of interannual temporal correlation coefficients (%) between observations and the model ensemble average forecast of 500 hPa height for 15 years. The 95% significance level is 51%. (a) Spring, (b) Summer, (c) Autumn, (d) Winter.


over the whole globe, which means that the skill for precipitation is the lowest of the three physical variables. The region with positive skill is limited to the tropical oceans and the north of Brazil (Nordeste). In the Northern Hemisphere, there is almost no skill for all four seasons, and the seasonality is weak. On the other hand, skill in the tropics is relatively higher in spring and winter over the Pacific Ocean.

7. Rank histogram

The rank histogram is a measure to evaluate the ability of the model to capture the observation within the spread of an ensemble forecast (Hamill and Colucci 1997, 1998). This method is also called the binned probability ensemble (Anderson 1996; Mo and Straus 1999) or the Talagrand diagram (Talagrand and Vautard 1999). It is a non-parametric validation technique which refers only to the relative magnitude of the verifying analysis and the forecast, without any definition of parameters such as an event criterion or category thresholds.

Fig. 15. Same as Fig. 13, but for RPSS (%) of the 850 hPa temperature anomaly.

Suppose a forecast system with an ensemble size of n for a set of m independent samples, where in our case m is the number of years considered in the experiment. For a particular year, we have n values of a certain physical variable at each grid point, which can be sorted from smallest to largest value to define n + 1 ranks. The observation will fall into one of these ranks. Counting the ranks of the observations over all m independent samples, it is expected that each rank should have an equal probability of containing the observation. If the rank distribution is significantly different from the expected flat distribution, it suggests that the model has some deficiency or that the members are not equally weighted. For example, a high frequency of the outermost ranks indicates a possibility of large systematic error in the model, or too small an ensemble spread within the model ensemble system. The degree of deviation from the expected flat distribution can be measured by a quantity called the discrepancy D, which is defined as

D = \sum_{i=1}^{n+1} \frac{m (p_i - e)^2}{e}. \qquad (16)

Here i is the rank number, n the ensemble size, m the number of samples, pi the probability of the observation falling in rank i, and e the expected flat probability 1/(n + 1). In the present study, n = 9 and m = 15 years. Since D obeys a chi-square distribution with n degrees of freedom, the significance of the uniformity of the rank histogram (distribution) can be examined using a chi-square goodness-of-fit test (Anderson 1996; Wilks 1995; Hamill and Colucci 1997; Mo and Straus 1999). The rank histogram can be applied to any type of forecast probability distribution function, because this method is a nonparametric skill measure that makes no reference to event criteria or category threshold values. This is in contrast to the Brier score, ROC and RPSS, which require the definition of an event or of threshold values for categories.
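The rank assignment and the discrepancy (16) can be sketched together; the helpers and the rank counts below are hypothetical (with n = 9 members, hence n + 1 = 10 ranks, and m = 15 samples as in the experiments), though the formula follows (16).

```python
import bisect

def observation_rank(members, obs):
    """Sort the n ensemble values; the observation then falls into one of
    n + 1 ranks (1-based). Ties are broken to the left here."""
    return bisect.bisect_left(sorted(members), obs) + 1

def discrepancy(counts):
    """Discrepancy D of Eq. (16). counts[i]: number of samples whose
    observation fell in rank i + 1; e = 1/(n + 1) is the flat probability."""
    m, e = sum(counts), 1 / len(counts)
    return sum(m * (c / m - e) ** 2 / e for c in counts)

# Hypothetical rank counts for m = 15 samples over 10 ranks:
near_flat = [3, 1, 2, 1, 2, 1, 2, 1, 1, 1]    # roughly uniform -> D = 3.0
u_shaped = [6, 1, 0, 0, 0, 0, 0, 1, 1, 6]     # crowded outer ranks -> D = 35.0
```

A U-shaped count like the second, invented one yields D = 35 and would be flagged as significantly non-uniform by a chi-square test with n = 9 degrees of freedom.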

Examples of the rank histogram for the 500 hPa height anomaly in spring are shown in Fig. 17. The distribution of frequency at the grid point in Japan (Fig. 17a) is approximately uniform, whereas that at the grid point in the eastern equatorial Pacific (Fig. 17b) is far from the flat distribution, and large frequencies are observed for the outermost ranks of 1 and 10, resulting in the large value of D = 35.

Geographical distributions of the discrepancy D for the 500 hPa height anomaly are depicted in Fig. 18 for all four seasons. The discrepancy is larger in the tropics, especially in the eastern tropical Pacific, than in extratropical regions. In winter

Fig. 16. Same as Fig. 13, but for RPSS (%) of the precipitation anomaly.


the discrepancy is also large over the Indian Ocean. For winter, Mo and Straus (1999) have calculated the global distribution of the discrepancy using COLA simulations and also found large values over the Indian Ocean. However, their values over the eastern tropical Pacific were small, in contrast to our result.

These regional differences and seasonal tendencies are also recognized for 850 hPa temperature, with larger values of D in the tropics than for 500 hPa height (figure not shown). In the case of precipitation, it was again found that D is relatively larger in the tropics than in extratropical regions, but the contrast between the two regions is weaker than those of 500 hPa height and 850 hPa temperature (figure not shown). The large discrepancy in the tropics seems to contradict the tendency of the RPSS (Fig. 13), which shows relatively higher skill there.

To identify the reason for the poor performance in the tropics in terms of the rank histogram, we have reexamined the time series of the original observational and forecast data. Figure 19 shows the observed anomalies and the predicted anomalies of all 9 members for all 15 years at the same grid points as in Fig. 17. At the grid point over Japan (Fig. 19a), the observed anomalies are almost always included within the spread of the ensemble members, which leads to a relatively high skill in terms of the rank histogram, with a low discrepancy D. The year-to-year variation of the observed anomaly, however, is not well reproduced by the model in terms of the ensemble average, as is indicated by the low temporal correlation coefficient (26.4%) between the observed anomaly (solid line) and the ensemble average forecast (dashed line). This is also confirmed by the low correlation coefficient over Japan in Fig. 14a. At the same time, the poor performance of the probabilistic forecast over Japan is revealed by the low RPSS in Fig. 13a.

The situation is much different in the tropics (Fig. 19b). Although the reproducibility of the interannual variation is high in terms of the RPSS (Fig. 13a) and the ensemble average forecast (Fig. 14a), the spread of the ensemble is too small to capture the observed anomaly, resulting in the large discrepancy D.

8. Summary

Probabilistic forecast skill of the atmospheric seasonal predictability experiments is evaluated using the JMA AGCM, a global spectral model of T63 resolution which was used for former operational one-month forecasts. Four-month ensemble integrations were carried out with nine consecutive days of initial conditions preceding the target season. All four seasons in the 15-year period from 1979 to 1993 were chosen as target seasons. The model was forced with observed SST during the time integrations.

Probabilistic forecasts are verified by four different skill measures for three meteorological variables: 500 hPa height, 850 hPa temperature and precipitation. The dependence of skill on validation measure, meteorological variable, season, region, and event criterion was investigated in a more comprehensive way than in preceding studies. The results are summarized below.

(1) The Brier skill score (BSS), reliability skill score and resolution skill score, including the reliability diagram, were calculated for the Northern Hemisphere 500 hPa height anomaly

Fig. 17. Examples of the rank histogram for the 500 hPa height anomaly in spring. The horizontal line shows the expected flat distribution of frequency 0.1. The values in the top-left corner outside of each panel denote the measure of discrepancy D between the observed and the expected distribution. (a) Japan (135°E, 35°N), (b) the eastern equatorial Pacific (90°W, equator).


to find that all three skill scores show the same seasonality, with the highest value in winter and the lowest in summer, even if the event criterion is different. The Brier skill score is relatively higher over East Asia and North America. These tendencies are similar to those of the anomaly correlation of the ensemble average forecast. In the tropics, the dependence of the Brier skill score on season and on event criteria is larger than in the extratropics. All three skill scores of one-month lead time forecasts are lower than those of no-lead-time forecasts for all seasons in the Northern Hemisphere. Although the Brier skill scores of 850 hPa temperature and precipitation are generally lower than that of 500 hPa height, they share with 500 hPa height the common feature that skill is relatively higher in winter of the Northern Hemisphere, in spring and winter of the tropics, and in boreal summer of the Southern Hemisphere. The tendency of relatively higher skill over North America and lower skill over Europe is consistent with the results of Mo and Straus (1999) and Palmer et al. (2000), although their target seasons, regions and meteorological variables were restricted.

Fig. 18. Distributions of the discrepancy D between the observed and the expected distributions of 500 hPa height anomaly. The 95% significance level is 20. (a) Spring, (b) Summer, (c) Autumn, (d) Winter.

(2) Relative operating characteristics (ROC) skill scores for 500 hPa height, 850 hPa temperature and precipitation show almost the same seasonality as the Brier skill score in the Northern and Southern Hemispheres and in the tropics. The relatively high skill in the tropics for 850 hPa temperature and precipitation is consistent with the results of Graham et al. (2000); for precipitation in the tropics, however, they found little variation of skill with season, which contrasts with our result of strong seasonal dependence.
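The ROC skill score in (2) can be illustrated schematically. The sketch below (our own, not the paper's code) sweeps descending probability thresholds, accumulates hit and false alarm rates for a binary event, and integrates the curve with the trapezoidal rule; A = 0.5 corresponds to no skill.

```python
import numpy as np

def roc_area(p, o, n_thresholds=11):
    """p: forecast probabilities; o: binary observations containing both
    classes. Returns the area A under the ROC curve."""
    p = np.asarray(p, float)
    o = np.asarray(o, int)
    M = o.sum()                                  # number of occurrences
    NM = o.size - M                              # number of non-occurrences
    ts = np.linspace(1.0, 0.0, n_thresholds)     # descending thresholds
    h = np.array([np.sum((p >= t) & (o == 1)) / M for t in ts])   # hit rate
    f = np.array([np.sum((p >= t) & (o == 0)) / NM for t in ts])  # false alarms
    # Prepend the (0, 0) corner; the threshold t = 0 already gives (1, 1).
    h = np.concatenate(([0.0], h))
    f = np.concatenate(([0.0], f))
    # Trapezoidal integration of h over f.
    return np.sum((f[1:] - f[:-1]) * (h[1:] + h[:-1]) / 2.0)
```

A constant climatological forecast yields points only on the diagonal and thus A = 0.5, as shown analytically in the appendix.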

(3) The ranked probability skill score (RPSS) for 500 hPa height is higher in the tropics than in extratropical regions for all four seasons. The geographical distribution of 500 hPa height RPSS is qualitatively similar to that of the interannual temporal correlation coefficients between observation and the model ensemble-average forecast of 500 hPa height for all four seasons, which means RPSS is closely related to the model's ability to reproduce year-to-year variability. For 850 hPa temperature and precipitation, the distributions of RPSS are almost the same as that of 500 hPa height, although their magnitudes of skill are generally lower. As far as 500 hPa height in winter is concerned, the relatively high skill in the tropics is consistent with the results of both Mo and Straus (1999) and Doblas-Reyes et al. (2000).
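The RPSS of (3) can be sketched as follows, assuming the standard Epstein (1969)/Murphy (1971) definition of RPS over cumulative category probabilities; the climatological reference forecast and all names here are illustrative, not taken from the paper.

```python
import numpy as np

def rps(prob, obs_cat):
    """Ranked probability score for one forecast: `prob` holds the
    forecast probabilities of K ordered categories (summing to 1) and
    `obs_cat` is the index of the observed category."""
    prob = np.asarray(prob, float)
    obs = np.zeros_like(prob)
    obs[obs_cat] = 1.0
    # Sum of squared differences of the cumulative distributions.
    return np.sum((np.cumsum(prob) - np.cumsum(obs)) ** 2)

def rpss(fcst_probs, obs_cats, clim_prob):
    """Skill of a set of forecasts relative to always issuing the
    climatological probabilities `clim_prob`."""
    rps_f = np.mean([rps(p, k) for p, k in zip(fcst_probs, obs_cats)])
    rps_c = np.mean([rps(clim_prob, k) for k in obs_cats])
    return 1.0 - rps_f / rps_c
```

A perfect categorical forecast has RPS = 0 and hence RPSS = 1; RPSS < 0 means the forecast is worse than climatology.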

Fig. 19. Interannual variations of observed and model 500 hPa height anomaly for spring at the same grid points as in Fig. 17. Observation is shown by the solid line, individual forecasts by crosses, and the ensemble-average forecast by the dashed line. The value of R × 100 in the top-right corner outside each panel denotes the temporal correlation (%) between observation and the ensemble-average forecast for 15 years. The values of the discrepancy D and RPSS (%) are also shown at the top outside each panel. (a) Japan (135°E, 35°N), (b) the eastern equatorial Pacific (90°W, equator).

(4) Rank histograms were calculated for 500 hPa height, 850 hPa temperature and precipitation, together with the discrepancy D, which measures the deviation from the expected flat histogram. D is larger, corresponding to lower skill, in the tropics than in extratropical regions for all four seasons. This tendency seems to contradict the RPSS, which shows higher skill in the tropics. The poor performance of the model in the tropics from the viewpoint of the discrepancy D can be attributed to the ensemble spread being too small to capture the observation within the range of ensemble members. Conversely, in the extratropics the ensemble spread of the model is large enough to include the observation within the range of ensemble members. For the winter case of 500 hPa height, Mo and Straus (1999) also found large values of the discrepancy over the Indian Ocean; however, their values were much smaller over the eastern tropical Pacific, where we found large values in our model.
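A rank histogram as in (4) counts, for each case, the rank of the observation within the sorted ensemble; an under-dispersive ensemble piles counts into the outer bins. A minimal sketch follows, with a chi-square-type flatness statistic standing in for the discrepancy D (the paper's exact normalization of D is not reproduced here, so the numerical scale may differ).

```python
import numpy as np

def rank_histogram(ens, obs):
    """ens: (n_cases, m) ensemble values; obs: (n_cases,) observations.
    Counts how often the observation falls into each of the m + 1 bins
    delimited by the sorted ensemble members."""
    ens = np.asarray(ens, float)
    obs = np.asarray(obs, float)
    m = ens.shape[1]
    ranks = np.sum(ens < obs[:, None], axis=1)   # rank of obs: 0 .. m
    return np.bincount(ranks, minlength=m + 1)

def discrepancy(counts):
    """Chi-square-type distance of the histogram from the expected flat
    distribution (an illustrative stand-in for the paper's D)."""
    counts = np.asarray(counts, float)
    expected = counts.sum() / counts.size
    return np.sum((counts - expected) ** 2 / expected)
```

A perfectly reliable ensemble gives a flat histogram and a discrepancy near zero; U-shaped histograms (large discrepancy) indicate a spread that is too small.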

One of the drawbacks of the ranked probability score (RPS) is that skill depends on the number of categories and on the definition of their threshold values. Moreover, the number of categories is limited and cannot be increased to infinity. Hersbach (2000) has proposed a hybrid method called the continuous ranked probability score (CRPS), combining the concept of RPS with that of the rank histogram. CRPS does not require any definition of categories and can be regarded as an RPS with an infinite number of categories, which enables one to deal with an arbitrary ensemble size. It is expected that CRPS would give a comprehensive interpretation and explanation of the contradiction in model skill between RPSS and the rank histogram found in this study.
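For an ensemble forecast, the CRPS has a closed sample form. The sketch below (our own names, not Hersbach's notation) uses the kernel form E|X − y| − ½ E|X − X′| over ensemble members, which is equivalent to integrating the squared difference between the empirical ensemble CDF and the step function at the observation.

```python
import numpy as np

def crps_ensemble(members, y):
    """Sample CRPS of an ensemble forecast for a scalar observation y,
    computed as  E|X - y| - 0.5 E|X - X'|  over the ensemble members."""
    x = np.asarray(members, float)
    term_obs = np.mean(np.abs(x - y))                              # E|X - y|
    term_spread = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))   # 0.5 E|X - X'|
    return term_obs - term_spread
```

For a one-member "ensemble" the CRPS reduces to the mean absolute error, which is one reason it is attractive as a single score spanning deterministic and probabilistic forecasts.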

Acknowledgements

We acknowledge the anonymous reviewers whose valuable comments and suggestions have greatly improved the quality of the manuscript. Thanks are also extended to Ms. Eriko Ooyama of JMA, who gave us valuable technical advice.

Appendix

An example of a probabilistic forecast showing a negative Brier skill score and a ROC skill score of 0.5

The performance of a probabilistic forecast tends to be overestimated by the ROC skill score compared with the Brier skill score. Here, let us consider an example of a simple probabilistic forecast where $o_j = o_c$ for all $j$ $(j = 1, 2, \ldots, J)$. All notations used in this appendix are the same as in sections 4 and 5 of the main text. The reliability diagram of this forecast coincides with the horizontal line indicating the climatological frequency of occurrence $o_c$ in Fig. 1. The distribution of the relative frequency of occurrence $g_j$ is unrestricted and arbitrary. The Brier skill score BSS and the ROC skill score $A$ for this particular forecast are calculated analytically by simple algebra as follows.

A.1 The Brier skill score

From the definition of resolution (4), obviously $b_{\mathrm{res}} = 0$. Using (2), the Brier score $b$ becomes
$$b = b_{\mathrm{rel}} - b_{\mathrm{res}} + b_{\mathrm{unc}} = b_{\mathrm{rel}} + b_{\mathrm{unc}}.$$
Definition (8) gives the Brier skill score BSS as
$$\mathrm{BSS} = 1 - \frac{b}{b_{\mathrm{unc}}} = 1 - \frac{b_{\mathrm{rel}} + b_{\mathrm{unc}}}{b_{\mathrm{unc}}} = -\frac{b_{\mathrm{rel}}}{b_{\mathrm{unc}}}.$$
From the definition of reliability (3) it is clear that $b_{\mathrm{rel}} > 0$ for the forecast concerned, since reliability measures the discrepancy between the diagonal line and the reliability diagram of the forecast. From the definition of uncertainty (5), $b_{\mathrm{unc}}$ is a non-negative quantity and becomes zero only if $o_c = 0$ or $1$, which corresponds to the deterministic situation. Since we are discussing probabilistic forecasts, it is natural to assume $0 < o_c < 1$, which gives $b_{\mathrm{unc}} > 0$. Together with $b_{\mathrm{rel}} > 0$, we finally conclude that $\mathrm{BSS} < 0$.

A.2 The ROC skill score

Setting $o_j = o_c$ in (6), the relations
$$M_j = o_c N_j, \qquad (\mathrm{A.1})$$
$$M = o_c N, \qquad (\mathrm{A.2})$$
can be derived. Substituting (A.1) and (A.2) into (12) and (13), the hit rate $h_t$ and the false alarm rate $f_t$ can be transformed as
$$h_t = \frac{1}{M} \sum_{j=t}^{J} M_j = \frac{1}{o_c N} \sum_{j=t}^{J} o_c N_j = \frac{1}{N} \sum_{j=t}^{J} N_j, \qquad (\mathrm{A.3})$$
$$f_t = \frac{1}{N - M} \sum_{j=t}^{J} (N_j - M_j) = \frac{1}{N - o_c N} \sum_{j=t}^{J} (N_j - o_c N_j) = \frac{1}{N(1 - o_c)} \sum_{j=t}^{J} (1 - o_c) N_j = \frac{1}{N} \sum_{j=t}^{J} N_j. \qquad (\mathrm{A.4})$$
Therefore, the relation $h_t = f_t$ holds for all $t$ $(t = 1, 2, \ldots, J)$. This means all points of the ROC curve lie on the diagonal, resulting in the ROC skill score $A = 0.5$.

Furthermore, let us consider the more specific case of the climatological probabilistic forecast, in which the climatological frequency $o_c$ is always predicted for all sampling grid points included in the ROC statistics. In this case, $N_j$ takes the value $N$ only for the category corresponding to the probability $o_c$, and $N_j = 0$ otherwise. For instance, let $t = 5$ be the category number of the probability $o_c$. Judging from the last right-hand sides of (A.3) and (A.4), $N_5 = N$ contributes to the summation over $j$ for $t = 1$ to $5$, giving $h_t = f_t = 1$. On the contrary, for $t = 6$ to $J$ there is no contribution to the summation, and therefore $h_t = f_t = 0$. In conclusion, the ROC curve consists of only two points: the bottom-left corner $(h_t, f_t) = (0, 0)$ and the top-right corner $(h_t, f_t) = (1, 1)$. This means the ROC curve coincides with the diagonal line, which again gives a ROC skill score of $A = 0.5$.
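The appendix result can also be checked numerically. In the toy construction below (ours, not the paper's data), forecasts take two probability values, 0.2 and 0.8, but the observed frequency within each category equals the climatology $o_c = 0.5$, so $o_j = o_c$: the Brier skill score comes out negative while the ROC area is exactly 0.5.

```python
import numpy as np

# Forecast probabilities: 10 cases at 0.2, 10 cases at 0.8.
p = np.array([0.2] * 10 + [0.8] * 10)
# In each category the event occurs exactly 5 times, so o_j = o_c = 0.5.
o = np.array([1] * 5 + [0] * 5 + [1] * 5 + [0] * 5)

o_c = o.mean()                       # climatological frequency (0.5)
b = np.mean((p - o) ** 2)            # Brier score
b_unc = o_c * (1.0 - o_c)            # uncertainty term
bss = 1.0 - b / b_unc                # Brier skill score (negative here)

# ROC: hit/false-alarm rates over descending thresholds, trapezoidal area.
ts = np.linspace(1.0, 0.0, 11)
h = np.array([np.sum((p >= t) & (o == 1)) / o.sum() for t in ts])
f = np.array([np.sum((p >= t) & (o == 0)) / (o == 0).sum() for t in ts])
h = np.concatenate(([0.0], h))
f = np.concatenate(([0.0], f))
area = np.sum((f[1:] - f[:-1]) * (h[1:] + h[:-1]) / 2.0)

print(bss, area)   # BSS is negative (-0.36 for this construction); area is 0.5
```

This matches the analytic result: an informationless forecast can look no worse than random under the ROC skill score while being penalized by the Brier skill score.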

References

Anderson, J.L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530.

Brankovic, C. and T.N. Palmer, 2000: Seasonal skill and predictability of ECMWF PROVOST ensembles. Quart. J. Roy. Meteor. Soc., 126, 2035–2067.

Brier, G.W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.

Doblas-Reyes, F.J., M. Deque and J.-P. Piedelievre, 2000: Multi-model spread and probabilistic seasonal forecast in PROVOST. Quart. J. Roy. Meteor. Soc., 126, 2069–2087.

Epstein, E.S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987.

Gibson, J.K., P. Kallenberg, S. Uppala, A. Hernandez, A. Nomura and E. Serrano, 1997: ECMWF Re-Analysis Project Report Series 1. ERA description. 72pp.

Graham, R.J., A.D.L. Evans, K.R. Mylne, M.S.J. Harrison and K.B. Robertson, 2000: An assessment of seasonal predictability using atmospheric general circulation models. Quart. J. Roy. Meteor. Soc., 126, 2211–2240.

Hamill, T.M. and S.J. Colucci, 1997: Verification of Eta-RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.

——— and ———, 1998: Evaluation of Eta-RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–724.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570.

Japan Meteorological Agency (JMA), 1997: Outline of the operational numerical weather prediction at the Japan Meteorological Agency. Appendix to Progress report on numerical weather prediction. 126pp.

Kobayashi, C., K. Takano, S. Kusunoki, M. Sugi and A. Kitoh, 2000: Seasonal prediction skill over the Eastern Asia using the JMA global model. Quart. J. Roy. Meteor. Soc., 126, 2111–2123.

Kusunoki, S., M. Sugi, A. Kitoh, C. Kobayashi and K. Takano, 2001: Atmospheric seasonal predictability experiments by the JMA AGCM. J. Meteor. Soc. Japan, 79, 1183–1206.

Mason, S.J. and N.E. Graham, 2002: Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Quart. J. Roy. Meteor. Soc., 128, 1623–1640.

Mo, R. and D.M. Straus, 1999: Statistical verification of dynamical seasonal prediction. Center for Ocean-Land-Atmosphere Studies Technical Report 73, 73pp.

Murphy, A.H., 1969: On the "ranked probability score". J. Appl. Meteor., 8, 988–989.

———, 1971: A note on the ranked probability score. J. Appl. Meteor., 10, 155–156.

———, 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.

Palmer, T., C. Brankovic and D.S. Richardson, 2000: A probability and decision-model analysis of PROVOST seasonal multi-model ensemble integrations. Quart. J. Roy. Meteor. Soc., 126, 2013–2033.

Randall, D.A. and D.M. Pan, 1993: Implementation of the Arakawa-Schubert cumulus parameterization with a prognostic closure. Meteorological Monographs, Vol. 24, No. 46, The representation of cumulus convection in numerical models. American Meteorological Society, Chapter 11, 137–147.

Reynolds, R.W. and T.M. Smith, 1994: Improved global sea surface temperature analyses using optimum interpolation. J. Climate, 7, 929–948.

Sato, N., P.J. Sellers, D.A. Randall, E.K. Schneider, J. Shukla, J.L. Kinter III, Y.-T. Hou and E. Albertazzi, 1989: Effects of implementing the simple biosphere model (SiB) in a general circulation model. J. Atmos. Sci., 46, 2757–2782.

Sellers, P.J., Y. Mintz, Y.C. Sud and A. Dalcher, 1986: A simple biosphere model (SiB) for use within general circulation models. J. Atmos. Sci., 43, 505–531.

Shukla, J., J. Anderson, D. Baumhefner, C. Brankovic, Y. Chang, E. Kalnay, L. Marx, T. Palmer, D. Paolino, J. Ploshay, S. Schubert, D. Straus, M. Suarez and J. Tribbia, 2000a: Dynamical seasonal prediction. Bull. Amer. Meteor. Soc., 81, 2593–2606.

———, D.A. Paolino, D.M. Straus, D. DeWitte, M. Fennessy, J.L. Kinter, L. Marx and R. Mo, 2000b: Dynamical seasonal predictions with the COLA atmospheric model. Quart. J. Roy. Meteor. Soc., 126, 2265–2291.

Stanski, H.R., L.J. Wilson and W.R. Burrows, 1989: Survey of common verification methods in meteorology. World Weather Watch Technical Report No. 8, WMO/TD No. 358, World Meteorological Organization, Geneva, 114pp.

Talagrand, O. and R. Vautard, 1999: Evaluation of probabilistic prediction systems. Workshop proceedings, "Workshop on predictability", 20–22 October 1997, ECMWF, Reading, UK, 1–25.

Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467pp.

Xie, P. and P. Arkin, 1997: Global precipitation: A 17-year monthly analysis based on gauge observations, satellite estimates and numerical model outputs. Bull. Amer. Meteor. Soc., 78, 2539–2558.
