A Multivariate Analysis on the 2004
Summer Olympic Games
Wei Xiong, M.Sc Student, Department of Mathematics and Statistics,
University of Guelph
May 12-13, 2005
OUTLINE
1. Introduction
• 2004 Summer Olympic Games
• Multivariate techniques: cluster analysis, multivariate analysis of variance, multivariate regression analysis
• Literature review of analyses on Olympic Games
2. Data Analysis and Discussion
3. Conclusions
2004 Summer Olympic Games
• The largest event: 11,000 athletes from 202 countries; 929 medals won by 75 countries/regions.
Multivariate (>1 response variable) Techniques
• Cluster Analysis: observations (countries) are classified into clusters (groups) based on their similarity across several variables (numbers of gold, silver, bronze and total medals), by measuring the distance or dissimilarity between any two clusters.
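As a rough illustration of the distance idea, the sketch below measures Euclidean distances between the medal vectors of the six countries shown in Table 1 and repeatedly merges the two closest clusters. It is a Python sketch with average linkage, chosen only for simplicity. It is not the likelihood-based method (method=eml) used in the SAS code of this talk, and the variables are not standardized here.

```python
from itertools import combinations
from math import dist

# Medal counts (gold, silver, bronze, total) for the six countries in Table 1.
medals = {
    "USA": (35, 39, 29, 103),
    "Russia": (27, 27, 38, 92),
    "China": (32, 17, 14, 63),
    "Canada": (3, 6, 3, 12),
    "Syria": (0, 0, 1, 1),
    "Trinidad": (0, 0, 1, 1),
}

def average_linkage(a, b):
    """Mean Euclidean distance over all cross-cluster pairs of countries."""
    return sum(dist(medals[i], medals[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(names, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [(name,) for name in names]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters

groups = agglomerate(list(medals), 3)
for g in groups:
    print(sorted(g))
```

On this six-country subset the grouping need not match the five groups reported for all 75 countries, since both the data and the clustering criterion differ.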
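• Multivariate Analysis of Variance (MANOVA): a generalization of ANOVA, used to compare more than two population mean vectors
Hypothesis:
H0: μ1 = … = μt versus Ha: μj ≠ μk (for some j ≠ k)
H0 is rejected if H = SS(Treatment) >> E = SS(Error), where H and E are the treatment and error sums-of-squares-and-cross-products matrices.
Wilks' statistic: Λ = |E| / |E + H| (small values of Λ lead to rejection of H0)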
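To make the statistic concrete, here is a minimal Python sketch that computes |E| / |E + H| for two small groups. The data are the gold and silver counts of USA and RUS (group 1) and CHN, AUS and GER (group 2) from this talk; restricting to two variables and two groups is an assumption made only to keep the example small, and the function names are illustrative. The sketch uses the identity that the total SSCP matrix about the grand mean equals E + H.

```python
def mean(rows):
    """Column means of a list of equal-length tuples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sscp(rows, center):
    """Sum of outer products of (row - center): a p x p SSCP matrix."""
    p = len(center)
    m = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[k] - center[k] for k in range(p)]
        for a in range(p):
            for b in range(p):
                m[a][b] += d[a] * d[b]
    return m

def det(m):
    """Determinant by cofactor expansion (adequate for small p)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def wilks_lambda(groups):
    """Wilks' statistic |E| / |E + H| for a list of groups of observations."""
    all_rows = [r for g in groups for r in g]
    p = len(all_rows[0])
    E = [[0.0] * p for _ in range(p)]
    for g in groups:
        W = sscp(g, mean(g))  # within-group SSCP
        E = [[E[a][b] + W[a][b] for b in range(p)] for a in range(p)]
    T = sscp(all_rows, mean(all_rows))  # total SSCP, equal to E + H
    return det(E) / det(T)

# (gold, silver) for USA, RUS and for CHN, AUS, GER.
groups = [[(35, 39), (27, 27)], [(32, 17), (17, 16), (14, 16)]]
lam = wilks_lambda(groups)
print(round(lam, 4))
```

A value of Λ near 0 means the between-group variation dominates; a value near 1 means the groups are hardly separated.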
• Multivariate Regression
Model: Y (n×p) = X (n×q) β (q×p) + E (n×p)
where n: observations,
p: response variables,
q: explanatory variables
Least squares estimator of β:
β̂ = (X'X)⁻¹X'Y
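The estimator can be sketched in a few lines of Python by solving the normal equations (X'X) β = X'Y. The data below are synthetic (an assumption for illustration, not Olympic data): Y is generated exactly from a known coefficient matrix, so the estimate should recover it up to rounding. All helper names are illustrative.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(v) for v in B[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def least_squares(X, Y):
    """Least squares estimate (X'X)^-1 X'Y, via the normal equations."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, Y))

# Synthetic design: intercept column plus two explanatory variables (n=5, q=3).
X = [[1, 1, 2], [1, 2, 1], [1, 3, 4], [1, 4, 3], [1, 5, 5]]
B_true = [[1, 0], [2, 1], [0, 3]]  # q x p, with p = 2 responses
Y = matmul(X, B_true)              # noise-free responses

B_hat = least_squares(X, Y)
for row in B_hat:
    print([round(v, 6) for v in row])
```

Each column of the estimate is the ordinary least squares fit for one response, which is why the multivariate fit can be read column by column.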
Literature review
• Condon et al. [1] tried to predict a country's success at the Olympic Games using linear regression models and neural network models.
• Lins et al. [2] developed a Data Envelopment Analysis (DEA)-based model to rank each country based on its ability to win medals in relation to its available resources.
• Churilov and Flitman [3] improved the DEA-based model by combining different sets of input parameters with the DEA model.
This study: uses multivariate techniques to analyze the 2004 Summer Olympic Games and explores the factors that influence the number of medals won.
Table 1: Rankings For Participating Countries
Country               Gold (y1)   Silver (y2)   Bronze (y3)   Total (y4)   Ranking (by Gold) [4]   Ranking (by Cluster Analysis)
USA                   35          39            29            103          1                       1
China                 32          17            14            63           2                       2
Russia                27          27            38            92           3                       1
Canada                3           6             3             12           21                      4
Syria                 0           0             1             1            71                      5
Trinidad and Tobago   0           0             1             1            71                      5
Note: the numbers of countries in clusters 1, 2, 3, 4 and 5 are 2, 3, 7, 7 and 56, respectively.
Table 2: Least Squares Means for Group Medals
Group               y1 (Gold)   y2 (Silver)   y3 (Bronze)   y4 (Total)
1 (USA, RUS)        31.00       33.00         33.50         97.50
2 (CHN, AUS, GER)   21.00       16.33         16.00         53.33
3                   10.43       8.86 #        11.00         30.29
4                   4.86        7.00 #        5.29          17.14
5                   1.23        1.34          1.75          4.32
Note: # close to each other
Multivariate Analysis of Variance (MANOVA): compares the medal means for the 5 groups
MANOVA Test: Hypothesis of No Overall Group Effect
Statistic        Value        F Value   Pr > F
Wilks' Lambda    0.02126952   49.34     <.0001
proc glm;
  class group;
  model y1-y4 = group;
  manova h=group;
  lsmeans group / pdiff;
run;
Least Squares Means for effect group for silver (y2)
Pr > |t| for H0: LSMean(i)=LSMean(j)
i/j   1        2        3        4
2     <.0001
3     <.0001   <.0001
4     <.0001   <.0001   0.0572
5     <.0001   <.0001   <.0001   <.0001
Note: p-values for the other medals are < 0.0001
WHY?
• Why did some countries win more medals than others?
• Hypothesis: the larger the population and GDP, the more medals won
Population (x1): the larger the population, the more outstanding athletes available
GDP (Gross Domestic Product, x2): the higher the GDP, the more funding for athletes' training
Table 3: Multivariate Regression of Medals on Population (x1) [5] and GDP (x2) [6]
                 Number of Gold (y1)   Number of Silver (y2)   Number of Bronze (y3)   Number of Total (y4)
                 β1       p-value      β2       p-value        β3       p-value        β4       p-value
x1 (million)     0.0116   0.0002       0.0043   0.1223         0.0031   0.3712         0.0190   0.0317
x2 ($ billion)   0.0031   <.0001       0.0033   <.0001         0.0027   <.0001         0.0091   <.0001
proc glm;
  model y1-y4 = x1-x2 / xpx i;
run;
Conclusions
The 2004 Summer Olympic Games are analyzed using multivariate methods: Cluster Analysis, Multivariate Analysis of Variance and Multivariate Regression Analysis.
Participating countries are classified into 5 groups based on the numbers of medals they won. The groups differ significantly in their mean numbers of medals (Wilks' Lambda, p < .0001); the only pairwise comparison not significant at the 5% level is silver for groups 3 and 4 (p = 0.0572).
Population and GDP are significant factors for the number of medals won: an increase of 1 million in population is associated with an increase of 0.0116 in the number of gold medals and 0.019 in the total (the population effect on silver and bronze is not significant at the 5% level; see Table 3). An increase of $1 billion in GDP is associated with an increase of 0.0031 in gold, 0.0033 in silver, 0.0027 in bronze and 0.0091 in total medals.
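The arithmetic behind these statements can be sketched as follows. The coefficients are copied from Table 3; the scenario (10 million more people and $100 billion more GDP) is hypothetical and only illustrates how the fitted slopes combine. The population slopes for silver and bronze are included for completeness even though Table 3 reports them as not significant.

```python
# Coefficients copied from Table 3 (medals per 1 million people, per $1 billion GDP).
COEF = {
    "gold":   {"population": 0.0116, "gdp": 0.0031},
    "silver": {"population": 0.0043, "gdp": 0.0033},
    "bronze": {"population": 0.0031, "gdp": 0.0027},
    "total":  {"population": 0.0190, "gdp": 0.0091},
}

def medal_change(d_population_million, d_gdp_billion):
    """Change in fitted medal counts for given changes in the two predictors."""
    return {
        medal: c["population"] * d_population_million + c["gdp"] * d_gdp_billion
        for medal, c in COEF.items()
    }

# Hypothetical scenario: +10 million people and +$100 billion GDP.
change = medal_change(10, 100)
for medal, value in change.items():
    print(f"{medal}: {value:+.3f}")

# Since y4 = y1 + y2 + y3, the "total" slopes equal the sum of the other three.
parts = ("gold", "silver", "bronze")
assert abs(sum(COEF[m]["population"] for m in parts) - COEF["total"]["population"]) < 1e-9
assert abs(sum(COEF[m]["gdp"] for m in parts) - COEF["total"]["gdp"]) < 1e-9
```

The final two checks hold because least squares is linear in the response, so the fit for the total is the sum of the fits for gold, silver and bronze.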
References
[1] Edward M. Condon, Bruce L. Golden and Edward A. Wasil (1999).
Predicting the success of nations at the Summer Olympics using neural
networks. Computers & Operations Research. 26(13),1243-1265.
[2] Marcos P. Estellita Lins, Eliane G. Gomes, João Carlos C. B. Soares de Mello and Adelino José R. Soares de Mello (2003). Olympic ranking based on a zero sum gains DEA model. European Journal of Operational Research. 148(2), 312-322.
[3] L. Churilov and A. Flitman (2004). Towards fair ranking of Olympics
achievements: the case of Sydney 2000. Computers & Operations
Research. Available online 6 November 2004.
[4] http://www.athens2004.com/en/OlympicMedals/medals, accessed
May 11, 2005.
[5] http://www.geohive.com/global/index.php, accessed Nov. 25, 2004.
[6] http://www.geohive.com/global/geo.php?xml=ec_gdp1&xsl=ec_gdp1,
accessed May 11, 2005.
Appendix 1
Table 1. Number of medals for each country/region
Country/Region, Gold, Silver, Bronze, Total
USA, 35, 39, 29, 103
CHN, 32, 17, 14, 63
RUS, 27, 27, 38, 92
AUS, 17, 16, 16, 49
JPN, 16, 9, 12, 37
GER, 14, 16, 18, 48
FRA, 11, 9, 13, 33
ITA, 10, 11, 11, 32
KOR, 9, 12, 9, 30
GBR, 9, 9, 12, 30
CUB, 9, 7, 11, 27
UKR, 9, 5, 9, 23
HUN, 8, 6, 3, 17
ROM, 8, 5, 6, 19
GRE, 6, 6, 4, 16
NOR, 5, 0, 1, 6
NED, 4, 9, 9, 22
BRA, 4, 3, 3, 10
SWE, 4, 1, 2, 7
ESP, 3, 11, 5, 19
CAN, 3, 6, 3, 12
TUR, 3, 3, 4, 10
POL, 3, 2, 5, 10
NZL, 3, 2, 0, 5
THA Thailand, 3, 1, 4, 8
BLR Belarus, 2, 6, 7, 15
AUT Austria, 2, 4, 1, 7
ETH Ethiopia, 2, 3, 2, 7
IRI I.R. Iran, 2, 2, 2, 6
SVK Slovakia, 2, 2, 2, 6
TPE Chinese Taipei, 2, 2, 1, 5
GEO Georgia, 2, 2, 0, 4
BUL Bulgaria, 2, 1, 9, 12
JAM Jamaica, 2, 1, 2, 5
UZB Uzbekistan, 2, 1, 2, 5
MAR Morocco, 2, 1, 0, 3
DEN Denmark, 2, 0, 6, 8
ARG Argentina, 2, 0, 4, 6
CHI Chile, 2, 0, 1, 3
KAZ Kazakhstan, 1, 4, 3, 8
KEN Kenya, 1, 4, 2, 7
CZE Czech Republic, 1, 3, 4, 8
RSA South Africa, 1, 3, 2, 6
CRO Croatia, 1, 2, 2, 5
LTU Lithuania, 1, 2, 0, 3
EGY Egypt, 1, 1, 3, 5
SUI Switzerland, 1, 1, 3, 5
INA Indonesia, 1, 1, 2, 4
ZIM Zimbabwe, 1, 1, 1, 3
AZE Azerbaijan, 1, 0, 4, 5
BEL Belgium, 1, 0, 2, 3
BAH Bahamas, 1, 0, 1, 2
ISR Israel, 1, 0, 1, 2
CMR Cameroon, 1, 0, 0, 1
DOM Dominican Rep., 1, 0, 0, 1
IRL Ireland, 1, 0, 0, 1
UAE U. Arab Emirates, 1, 0, 0, 1
PRK DPR Korea, 0, 4, 1, 5
LAT Latvia, 0, 4, 0, 4
MEX Mexico, 0, 3, 1, 4
POR Portugal, 0, 2, 1, 3
FIN Finland, 0, 2, 0, 2
SCG Serbia and Montenegro, 0, 2, 0, 2
SLO Slovenia, 0, 1, 3, 4
EST Estonia, 0, 1, 2, 3
HKG Hong Kong, 0, 1, 0, 1
IND India, 0, 1, 0, 1
PAR Paraguay, 0, 1, 0, 1
NGR Nigeria, 0, 0, 2, 2
VEN Venezuela, 0, 0, 2, 2
COL Colombia, 0, 0, 1, 1
ERI Eritrea, 0, 0, 1, 1
MGL Mongolia, 0, 0, 1, 1
SYR Syrian Arab Rep., 0, 0, 1, 1
TRI Trinidad and Tobago, 0, 0, 1, 1
SAS coding-1
data Anthemn2004SummerOlympic;
  input Country $ y1-y4;
  cards;
see Table 1 for data
;
/* cluster the countries on the four medal counts */
proc cluster method=eml standard rmsstd rsquare outtree=tree;
  var y1-y4;
  id country;
run;
/* cut the tree into 5 clusters and save the assignments */
proc tree data=tree noprint n=5 out=countryout;
  id country;
run;
proc tree data=tree n=5;
  id country;
run;
/* merge the cluster assignments back onto the medal data */
proc sort;
  by country;
proc sort data=Anthemn2004SummerOlympic out=new;
  by country;
data temp;
  merge new countryout;
  by country;
proc sort;
  by cluster;
proc print;
  id country;
/* the slide listed "rotate=varimax, quartimax"; quartimax matches the note under the factor analysis table */
proc factor heywood rotate=quartimax;
  var y1-y4;
  by cluster;
proc princomp;
  var y1-y4;
run;
proc factor heywood rotate=quartimax;
  var y1-y4;
run;
SAS coding-2
data Anthemn2004SummerOlympic;
  input group y1-y4 x1-x2;
  cards;
5 35 39 29 103 273 10882
5 27 27 38 92 146 433
4 32 17 14 63 1247 1410
4 17 16 16 49 19 518
4 14 16 18 48 82 2401
;
proc glm;
  class group;
  model y1-y4 = group;
  manova h=group / printe printh;
  lsmeans group / pdiff;
run;
SAS coding-3
data Anthemn2004SummerOlympic;
  input group y1-y4 x1-x2;
  cards;
……………..
;
proc corr;
  var y1-y4 x1-x2;
run;
proc glm;
  model y1-y4 = x1-x2 / xpx i;
  manova h=x1 x2 / printe printh;
run;
[Figure: dendrogram, "Cluster analysis: Countries Classified into 5 Groups". Axes: Log Likelihood (tick marks from -644 to 2356) and Country (one leaf per country/region). Group labels on the plot: 5, 4, 3, 2, 1; CAN (Canada) is marked.]
Table 2: Factor Analysis on Medals
Group               1           2         3                   4                   5
Latent factor       1           1         1         2         1         2         1         2
(%)                 (94.71) *   (95.99)   (61.35)   (86.50)   (52.10)   (83.89)   (58.04)   (83.09)
Gold (y1)           0.9634 #    0.9997    0.8783    -.0055    -.0378    0.9844    0.7654    0.0052
Silver (y2)         0.9694      0.9839    0.1470    0.9873    0.6388    -.5449    0.1409    0.9893
Bronze (y3)         0.9595      -.9414    0.8314    -.0629    0.8151    -.1716    0.8546    -.1100
Total (y4)          0.9999      0.9928    0.8682    0.4932    0.9551    0.2727    0.9186    0.3908
Note: * cumulative eigenvalues: percentage of the total variation in the four variables (medals) explained.
# Factor loading: correlation between the latent factor and the variable (factor analysis, rotation = quartimax, which makes each latent factor either strongly or weakly correlated with the variables).
Correlation Between y's and x's: x1 (Population #) [5], x2 (GDP #, Gross Domestic Product) [6]
Pearson Correlation Coefficients (first row) and Prob > |r| under H0: Rho=0 (second row)
      y1        y2        y3        y4
x1    0.46543   0.3038    0.23199   0.34887
      <.0001    0.0081    0.0452    0.0022
x2    0.70219   0.76180   0.60769   0.71640
      <.0001    <.0001    <.0001    <.0001
Note: weak to moderate correlation between the y's and x1, large correlation between the y's and x2.
# Both population and GDP are 2003 figures.
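For reference, a Pearson coefficient of this kind can be computed as in the Python sketch below. It uses only the five data rows shown on the "SAS coding-2" slide (GDP and total medals), so the value it prints is an illustration on a tiny subset and is not expected to reproduce the coefficients in the table above, which are based on all 75 countries.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Five rows from the "SAS coding-2" slide.
gdp = [10882, 433, 1410, 518, 2401]  # x2, $ billion
total = [103, 92, 63, 49, 48]        # y4, total medals
r = pearson(gdp, total)
print(round(r, 3))
```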