Post on 18-Dec-2015
Spatial Statistics: Topic 3
Descriptive Statistics
Assoc. Prof. Dr. Abdul Hamid b. Hj. Mar Iman
Director, Centre for Real Estate Studies
Faculty of Engineering and Geoinformation Science, Universiti Teknologi Malaysia
Skudai, Johor
Spatial Statistics (SGG 2413)
Learning Objectives
Overall: To give students a basic understanding of descriptive statistics.
Specific: Students will be able to:
* understand the basic concept of descriptive statistics
* understand the concept of distribution
* calculate measures of central tendency and dispersion
* calculate measures of kurtosis and skewness
Contents
* What is descriptive statistics
* Central tendency, dispersion, kurtosis, skewness
* Distribution
Descriptive Statistics
Use sample information to explain/make abstraction of population “phenomena”.
Common “phenomena”:
* Association (e.g. σ1,2.3 = 0.75)
* Tendency (left-skew, right-skew)
* Trend, pattern, location, dispersion, range
* Causal relationship (e.g. if X then Y)
Emphasis on meaningful characterisation of data (e.g. central tendency, variability), graphics, and description
Use non-parametric analysis (e.g. χ², t-test, 2-way ANOVA)
E.g. of Abstraction of phenomena
[Chart: Trends in property loan, shop house demand & supply, 1990 - 1997]
Year                                  1990     1991     1992     1993     1994     1995     1996     1997
Loan to property sector (RM million)  32635.8  38100.6  42468.1  47684.7  48408.2  61433.6  77255.7  97810.1
Demand for shop houses (units)        71719    73892    85843    95916    101107   117857   134864   86323
Supply of shop houses (units)         85534    85821    90366    101508   111952   125334   143530   154179
[Chart: No. of houses by district (Batu Pahat, Johor Bahru, Kluang, Kota Tinggi, Mersing, Muar, Pontian, Segamat), 1991 and 2000]
[Chart: Proportion (%) by age category (years old)]
[Chart: Price (RM/sq.ft. built area) against demand (% sales success)]
Inferential Statistics
Using sample statistics to infer some “phenomena” of population parameters
Common “phenomena”: cause-and-effect
* One-way relationship: Y = f(X)
* Feedback relationship: Y1 = f(Y2, X, e1); Y2 = f(Y1, Z, e2)
* Recursive relationship: Y1 = f(X, e1); Y2 = f(Y1, Z, e2)
Use parametric analysis (e.g. α and β) through regression analysis
Emphasis on hypothesis testing
Parametric statistics
Statistical analysis that attempts to explain the population parameter using a sample
E.g. of statistical parameters: mean, variance, std. dev., R², t-value, F-ratio, xy, etc.
It assumes that the distributions of the variables being assessed belong to known parameterised families of probability distributions
Examples of parametric relationship
Coefficients (Model 1; dependent variable: Nilaisma)
Variable     Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)   1993.108           239.632                          8.317    .000
Tanah        -4.472             1.199        -.190               -3.728   .000
Bangunan     6.938              .619         .705                11.209   .000
Ansilari     4.393              1.807        .139                2.431    .017
Umur         -27.893            6.108        -.241               -4.567   .000
Flo_go       34.895             89.440       .020                .390     .697
Dep=9t – 215.8
Dep=7t – 192.6
Non-parametric statistics
First used by Wolfowitz (1942)
Statistical analysis that attempts to explain the population parameter using a sample without making assumptions about the frequency distribution of the assessed variable
In other words, the variable being assessed is distribution-free
E.g. of non-parametric statistics: histogram, stochastic kernel, non-parametric regression
Descriptive & Inferential Statistics (DS & IS)
DS gathers information about a population characteristic (e.g. income) and describes it with a parameter of interest (e.g. mean)
IS uses the parameter to test a hypothesis pertaining to that characteristic. E.g.
Ho: mean income = RM 4,000
H1: mean income < RM 4,000
The result of hypothesis testing is used to make inferences about the characteristic of interest (e.g. Malaysian upper middle income)
Sample Statistics: Central Tendency
Mean (sum of all values ÷ no. of values)
Advantages:
* Best known average
* Exactly calculable
* Makes use of all data
* Useful for statistical analysis
Disadvantages:
* Affected by extreme values
* Can be absurd for discrete data (e.g. family size = 4.5 persons)
* Cannot be obtained graphically
Median (middle value)
Advantages:
* Not influenced by extreme values
* Obtainable even if data distribution unknown (e.g. group/aggregate data)
* Unaffected by irregular class width
* Unaffected by open-ended class
Disadvantages:
* Needs interpolation for group/aggregate data (cumulative frequency curve)
* May not be characteristic of group when: (1) items are only few; (2) distribution irregular
* Very limited statistical use
Mode (most frequent value)
Advantages:
* Unaffected by extreme values
* Easy to obtain from histogram
* Determinable from only values near the modal class
Disadvantages:
* Cannot be determined exactly in group data
* Very limited statistical use
Central Tendency – Mean
For individual observations, $\bar{x} = \sum x / n$. E.g.
X = {3,5,7,7,8,8,8,9,9,10,10,12}
Σx = 96; n = 12
Thus, x̄ = 96/12 = 8
The above observations can be organised into a frequency table and the mean calculated on the basis of frequencies, $\bar{x} = \sum fx / \sum f$:
x   3  5  7   8   9   10  12
f   1  1  2   3   2   2   1
fx  3  5  14  24  18  20  12
Σfx = 96; Σf = 12
Thus, x̄ = 96/12 = 8
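
A minimal Python sketch of the two routes to the mean, using the twelve observations and the frequency table from this slide. The variable names are illustrative only.

```python
# Mean from raw observations and from the equivalent frequency table.
observations = [3, 5, 7, 7, 8, 8, 8, 9, 9, 10, 10, 12]
mean_raw = sum(observations) / len(observations)

# Frequency table from the slide: value x -> frequency f.
freq_table = {3: 1, 5: 1, 7: 2, 8: 3, 9: 2, 10: 2, 12: 1}
sum_fx = sum(x * f for x, f in freq_table.items())
sum_f = sum(freq_table.values())
mean_freq = sum_fx / sum_f

print(mean_raw, mean_freq)  # both routes give the same mean
```
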
Central Tendency - Mean and Mid-point
Let's say we have data like this:
Price (RM ‘000/unit) of Shop Houses in Skudai
Location  Min  Max
Town A    228  450
Town B    320  430
Can you calculate the mean?
Central Tendency - Mean and Mid-point (contd.)
Let’s calculate, using M = ½(Min + Max):
Town A: (228+450)/2 = 339
Town B: (320+430)/2 = 375
Are these figures means?
Central Tendency - Mean and Mid-point (contd.)
Let’s say we have price data as follows:
Town A: 228, 295, 310, 420, 450
Town B: 320, 295, 310, 400, 430
Calculate the means:
Town A: (228 + 295 + 310 + 420 + 450)/5 = 340.6
Town B: (320 + 295 + 310 + 400 + 430)/5 = 351.0
Are the results the same as previously?
Be careful about mean and “mid-point”!
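
The contrast between mid-point and mean can be checked with a short Python sketch. It takes the price lists and the Min/Max pairs exactly as given on the slides; the dictionary names are hypothetical.

```python
from statistics import mean

# Price data (RM '000/unit) as listed on the slide.
prices = {
    "Town A": [228, 295, 310, 420, 450],
    "Town B": [320, 295, 310, 400, 430],
}
# Min/Max pairs as tabulated earlier for the same towns.
min_max = {"Town A": (228, 450), "Town B": (320, 430)}

midpoints = {town: (lo + hi) / 2 for town, (lo, hi) in min_max.items()}
means = {town: mean(values) for town, values in prices.items()}

for town in prices:
    print(town, "mid-point:", midpoints[town], "mean:", means[town])
```

The mid-point depends only on the two extreme values, whereas the mean uses every observation, so the two generally differ.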
Central Tendency – Mean of Grouped Data
House rental or prices in the PMR are frequently tabulated as a range of values. E.g.
Rental (RM/month)    135-140  140-145  145-150  150-155  155-160
Mid-point value (x)  137.5    142.5    147.5    152.5    157.5
Number of Taman (f)  5        9        6        2        1
fx                   687.5    1282.5   885.0    305.0    157.5
What is the mean rental across the areas?
Σf = 23; Σfx = 3317.5
Thus, x̄ = 3317.5/23 = 144.24
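
The grouped-data mean can be sketched in Python as follows, representing each rental class by its mid-point. The tuple layout (lower bound, upper bound, frequency) is an assumption made for this illustration.

```python
# Grouped data: (lower bound, upper bound, number of Taman).
rental_classes = [(135, 140, 5), (140, 145, 9), (145, 150, 6),
                  (150, 155, 2), (155, 160, 1)]

sum_f = sum(f for _, _, f in rental_classes)
# Each class is represented by its mid-point x = (lower + upper) / 2.
sum_fx = sum((lo + hi) / 2 * f for lo, hi, f in rental_classes)
grouped_mean = sum_fx / sum_f

print(sum_f, sum_fx, round(grouped_mean, 2))
```
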
Central Tendency – Median
Let's say house rentals in a particular town are tabulated:
Rental (RM/month)     130-135  135-140  140-145  145-150  150-155
Number of Taman (f)   3        5        9        6        2
Rental (RM/month)     < 135    < 140    < 145    < 150    < 155
Cumulative frequency  3        8        17       23       25
Calculation of the “median” rental needs a graphical aid (an ogive):
1. Median position = (n+1)/2 = (25+1)/2 = 13th Taman
2. This lies between 10 and 15 on the vertical axis of the ogive
3. It corresponds to RM 140-145/month on the horizontal axis
4. There are (17-8) = 9 Taman in the range of RM 140-145/month
5. The 13th Taman is the 5th out of the 9 Taman
6. The rental interval width is 5
7. Therefore, the median rental can be calculated as: 140 + (5/9 x 5) = RM 142.8
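
The interpolation steps above can be sketched as a small Python function. The name `grouped_quantile` and its argument layout are hypothetical; it assumes observations are spread evenly within each class, which is what the linear interpolation on the ogive implies.

```python
def grouped_quantile(classes, position):
    """Interpolate the value at a 1-based rank within grouped data.

    classes: list of (lower bound, upper bound, frequency) in ascending order.
    position: rank of the observation wanted, e.g. (n + 1) / 2 for the median.
    """
    cumulative = 0
    for lower, upper, freq in classes:
        if freq > 0 and position <= cumulative + freq:
            # Share of the class that lies below the wanted rank.
            share = (position - cumulative) / freq
            return lower + share * (upper - lower)
        cumulative += freq
    raise ValueError("position exceeds the number of observations")

rentals = [(130, 135, 3), (135, 140, 5), (140, 145, 9),
           (145, 150, 6), (150, 155, 2)]
n = sum(freq for _, _, freq in rentals)

median = grouped_quantile(rentals, (n + 1) / 2)
print(round(median, 1))
```

The same function accepts quartile positions such as (n + 1)/4 and 3(n + 1)/4.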
Central Tendency – Median (contd.)
Central Tendency – Quartiles (contd.)
Following the same process as in calculating “median”:
Upper quartile = ¾(n+1) = 19.5th Taman
UQ = 145 + (2.5/6 x 5) = RM 147.1/month
Lower quartile = (n+1)/4 = 26/4 = 6.5th Taman
LQ = 135 + (3.5/5 x 5) = RM 138.5/month
Inter-quartile range = UQ – LQ = 147.1 – 138.5 = RM 8.6/month
IQ = 138.5 + (4/5 x 5) = RM 142.5/month
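Variability
Indicates dispersion, spread, variation, deviation
For single population or sample data:
$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$ (population variance)
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ (sample variance)
where σ² and s² = population and sample variance respectively, xi = individual observations, μ = population mean, x̄ = sample mean, and n = total number of individual observations.
The square roots are:
$\sigma = \sqrt{\sigma^2}$ (population standard deviation); $s = \sqrt{s^2}$ (sample standard deviation)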
Variability (contd.)
Why is a “measure of dispersion” important? Consider yields of two plant species:
* Plant A (ton) = {1.8, 1.9, 2.0, 2.1, 3.6}
* Plant B (ton) = {1.0, 1.5, 2.0, 3.0, 3.9}
Mean A = mean B = 2.28 ton
But, different variability! Var(A) = 0.557, Var(B) = 1.367
* Would you choose to grow plant A or B?
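
A short Python check of the two yields, using the standard library's sample variance (divisor n - 1), which is the divisor that reproduces the variances quoted on this slide.

```python
from statistics import mean, variance

plant_a = [1.8, 1.9, 2.0, 2.1, 3.6]
plant_b = [1.0, 1.5, 2.0, 3.0, 3.9]

for name, yields in (("A", plant_a), ("B", plant_b)):
    # statistics.variance uses the sample formula (divisor n - 1).
    print(name, round(mean(yields), 2), round(variance(yields), 3))
```
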
Variability (contd.)
Coefficient of variation – CV – std. deviation as % of the mean:
$CV = \frac{s}{\bar{x}} \times 100$
A better measure compared to std. dev. in cases where samples have different means. E.g.
* Plant X (ton/ha) = {1.2, 1.4, 2.6, 2.7, 3.9}
* Plant Y (ton/ha) = {1.4, 1.5, 2.1, 3.2, 3.9}
Variability (cont.)
Yield (ton/ha)
Farm No.   Species X  Species Y
1          1.2        1.4
2          1.4        1.5
3          2.6        2.1
4          2.7        3.2
5          3.9        3.9
Mean       2.36       2.42
Var.       1.20       1.20
Std. dev.  1.097      1.094
Calculate CV for both species.
CVx = (1.097/2.36) x 100 = 46.5%
CVy = (1.094/2.42) x 100 = 45.2%
Species X is a little more variable than species Y
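
A minimal sketch of the coefficient of variation in Python, following the definition given above (standard deviation as a percentage of the mean) and assuming the sample standard deviation with divisor n - 1. The helper name `coefficient_of_variation` is illustrative.

```python
from statistics import mean, stdev

yields = {
    "Species X": [1.2, 1.4, 2.6, 2.7, 3.9],
    "Species Y": [1.4, 1.5, 2.1, 3.2, 3.9],
}

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

cv = {name: coefficient_of_variation(values) for name, values in yields.items()}
for name, value in cv.items():
    print(name, round(value, 1))
```
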
Variability (cont.)
Std. dev. of a frequency distribution
E.g. age distribution of second-home buyers (SHB):
Probability distribution
Logical probability: if there are 20 lecturers, the probability that A becomes a professor is p = 1/20 = 0.05
Experiential probability: out of 100 births, half were girls (p = 0.5); as the number increased to 1,000, two-thirds were girls (p = 0.67); but from a record of 10,000 new-born babies, three-quarters were girls (p = 0.75)
Subjective probability: the probability of a drug addict recovering from addiction is 50:50
General rule:
Pr (event X) = (No. of times event X occurs) / (Total number of occurrences)
The probability of a certain event X occurring has a specific form of distribution
Probability Distribution
Classical example of tossing two dice:
Dice 2 \ Dice 1  1  2  3  4   5   6
1                2  3  4  5   6   7
2                3  4  5  6   7   8
3                4  5  6  7   8   9
4                5  6  7  8   9   10
5                6  7  8  9   10  11
6                7  8  9  10  11  12
What is the distribution of the sum of tosses?
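
One way to answer the question is to enumerate the 36 outcomes in the table and count how often each sum appears. The Python sketch below does this, assuming two fair six-sided dice so that every cell of the table is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
distribution = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

for total, p in distribution.items():
    print(total, p)
```

The probabilities rise from 1/36 for a sum of 2 to a peak at 7 and fall symmetrically to 1/36 for a sum of 12.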
Probability Distribution (contd.)
Discrete variable
Values of x are discrete (discontinuous)
Sum of lengths of vertical bars: $\sum_{\text{all } x} p(X = x) = 1$
Probability Distribution (cont.)
Age distribution of second-home buyers in probability histogram
Age    Freq  Prob.
36     3     0.02
37     14    0.07
38     10    0.04
39     36    0.18
40     73    0.36
41     27    0.14
42     20    0.10
43     17    0.09
Total  200   1.00
Continuous variable
Pr (Area under curve) = 1
Mean = 39.5
Std. dev = 2.45
Probability Distribution (cont.)
Pr (Age ≤ 36) = 0.02
Pr (Age ≤ 37) = Pr (Age ≤ 36) + Pr (Age = 37) = 0.02 + 0.07 = 0.09
Pr (Age ≤ 38) = Pr (Age ≤ 37) + Pr (Age = 38) = 0.09 + 0.04 = 0.13
Pr (Age ≤ 39) = Pr (Age ≤ 38) + Pr (Age = 39) = 0.13 + 0.18 = 0.31
Pr (Age ≤ 40) = Pr (Age ≤ 39) + Pr (Age = 40) = 0.31 + 0.36 = 0.67
Pr (Age ≤ 41) = Pr (Age ≤ 40) + Pr (Age = 41) = 0.67 + 0.14 = 0.81
Pr (Age ≤ 42) = Pr (Age ≤ 41) + Pr (Age = 42) = 0.81 + 0.10 = 0.91
Pr (Age ≤ 43) = Pr (Age ≤ 42) + Pr (Age = 43) = 0.91 + 0.09 = 1.00
Cumulative probability corresponds to the left tail of a distribution
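
The running totals can be reproduced with a short Python sketch. It uses the probability column exactly as tabulated for the second-home buyers and rounds at each step to two decimals, as the slide does.

```python
from itertools import accumulate

ages = [36, 37, 38, 39, 40, 41, 42, 43]
probabilities = [0.02, 0.07, 0.04, 0.18, 0.36, 0.14, 0.10, 0.09]

# Running total; rounding at each step keeps floating-point noise out.
cumulative = list(accumulate(probabilities, lambda a, b: round(a + b, 2)))

for age, p in zip(ages, cumulative):
    print(f"Pr(Age <= {age}) = {p:.2f}")
```
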
Probability Distribution (cont.)
As larger and larger samples are drawn, the probability distribution gets smoother
[Figure panels: larger sample; very large sample]
There are tens of different types of probability distribution: Z, t, F, gamma, etc.
Most important: normal distribution
Normal Distribution - ND
Salient features of ND:
* Bell-shaped, symmetrical
* Total area under curve = 1
* Area under curve between any two points = prob. of values in that range (shaded area)
* Prob. of any exact value = 0
* Has a function of:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
where μ = mean of variable x; σ = std. dev. of x; π = ratio of circumference of a circle to its diameter = 3.14; e = base of natural log = 2.71828.
Normal Distribution - ND
[Figure: normal curves for Population 1 and Population 2, with means μ1, μ2 and std. deviations σ1, σ2]
* A larger population has a narrower base (smaller variance)
* μ determines location while σ determines shape of ND
Normal Distribution (cont.)
* Has a mean μ and a variance σ², i.e. X ~ N(μ, σ²)
* Has the following distribution of observation:
“Home-buyers example…”
Mean age = 39.3
Std. dev = 2.42
Standard Normal Distribution (SND)
Since different populations have different μ and σ (thus, locations and shapes of distribution), they have to be standardised.
Most common standardisation: standard normal distribution (SND), also called the Z-distribution
P(X=x) is given by area under curve
Has no standard algebraic method of integration → Z ~ N(0,1)
To transform f(x) into f(z):
Z = (x - µ)/σ ~ N(0, 1)
Z-Distribution
Probability is distributed in such a way that:
* Approx. 68% lies within -1 < z < 1
* Approx. 95% lies within -1.96 < z < 1.96
* Approx. 99% lies within -2.58 < z < 2.58
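
These three coverage figures can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `central_probability` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, std. dev. 1

def central_probability(limit):
    """P(-limit < Z < limit) for the standard normal distribution."""
    return z.cdf(limit) - z.cdf(-limit)

for limit in (1.0, 1.96, 2.58):
    print(limit, round(central_probability(limit), 4))
```
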
Z-distribution (cont.)
When X = μ, Z = 0, i.e. Z = (μ - μ)/σ = 0
When X = μ + σ, Z = 1
When X = μ + 2σ, Z = 2
When X = μ + 3σ, Z = 3, and so on.
It can be proven that P(X1 < X < Xk) = P(Z1 < Z < Zk)
SND shows the probability to the right of any particular value of Z.
Normal distribution… Questions
A study found that the mean age, A, of second-home buyers in Johor Bahru is 39.3 years old with a standard deviation of about 2.4 years. Assuming normality, how sure are you that the age is: (a) ≥ 40 years old; (b) 39 to 42 years old?
Answer
(a) P(A ≥ 40) = P[Z ≥ (40 – 39.3)/2.4] = P(Z ≥ 0.2917 ≈ 0.3000) = 0.3821
(b) P(39 ≤ A ≤ 42) = P(A ≥ 39) – P(A ≥ 42) = P[Z ≥ (39 – 39.3)/2.4] – P[Z ≥ (42 – 39.3)/2.4] = P(Z ≥ -0.125) – P(Z ≥ 1.125) = 0.5497 – 0.1303 = 0.4194
Always remember: to convert to SND, subtract the mean and divide by the std. dev.
Use Z-table!
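
As a cross-check on table look-ups, the two probabilities can be evaluated directly in Python. This sketch assumes a mean of 39.3 and a standard deviation of 2.4 (the divisor used in the worked answer) and does not round z before evaluation, so the results can differ a little from Z-table values.

```python
from statistics import NormalDist

# Assumed parameters: mean 39.3 and std. dev. 2.4, as used in the worked answer.
age = NormalDist(mu=39.3, sigma=2.4)

p_a = 1 - age.cdf(40)             # P(A >= 40)
p_b = age.cdf(42) - age.cdf(39)   # P(39 <= A <= 42)

print(round(p_a, 4), round(p_b, 4))
```
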
“Student’s t-Distribution”
Similar to Z-distribution (bell-shaped, symmetrical)
Has a function of:
$f(t) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{t^2}{v}\right)^{-\frac{v+1}{2}}$
where Γ = gamma function; v = n-1 = d.o.f.; π = 3.142
Flatter with thicker tails
Distributed with t(0,σ) and -∞ < t < +∞
As n→∞, t(0,σ) → N(0,1)
Probability calculation requires information on d.o.f.
How Are t-dist. and Z-dist. Related?
Using central limit theorem, x̄ ~ N(μ, σ²/n) will become z ~ N(0, 1) as n→∞
For a large sample, t-dist. of a variable or a parameter is given by:
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
The interval of critical values for variable x is:
$\bar{x} \pm t_{\alpha/2,\,v} \frac{s}{\sqrt{n}}$
Skewness, m3 & Kurtosis, m4
Skewness, m3 measures the degree of symmetry of a distribution
Kurtosis, m4 measures its degree of peakedness
Both are useful when comparing sample distributions with different shapes
Useful in data analysis
$m_3 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n\sigma^3}$; $m_4 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n\sigma^4}$
where Xi = individual sample observation; X̄ = sample mean; σ = std. deviation; n = sample size
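
A minimal Python sketch of the two shape measures, assuming the standardised-moment definitions (third and fourth central moments divided by σ³ and σ⁴, with σ computed using divisor n). The function name `moment_shape` is hypothetical, and the sample is the set of twelve observations used earlier for the mean.

```python
from math import sqrt

def moment_shape(values):
    """Return (skewness m3, kurtosis m4) using the population std. dev."""
    n = len(values)
    mean = sum(values) / n
    sigma = sqrt(sum((x - mean) ** 2 for x in values) / n)
    m3 = sum((x - mean) ** 3 for x in values) / (n * sigma ** 3)
    m4 = sum((x - mean) ** 4 for x in values) / (n * sigma ** 4)
    return m3, m4

sample = [3, 5, 7, 7, 8, 8, 8, 9, 9, 10, 10, 12]
skewness, kurtosis = moment_shape(sample)
print(round(skewness, 3), round(kurtosis, 3))
```

Under this definition a normal distribution has kurtosis 3, so values are read relative to 3 rather than to zero.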
Skewness
[Figure panels: perfectly normal (zero skew); right (+ve) skew; left (-ve) skew; bimodal; uniform; J-shaped]
Kurtosis
Mesokurtic (normal) (zero excess kurtosis)
Leptokurtic (high peak) (+ve excess kurtosis)
Platykurtic (low peak) (-ve excess kurtosis)
Mesokurtic distribution…kurtosis = 3
Leptokurtic distribution…kurtosis > 3
Platykurtic distribution…kurtosis < 3
Occurrence of ganoderma
X-coord. (000)  Y-coord. (000)  Trees with ganoderma
535.60          104.80          8
536.70          107.30          12
536.80          106.80          11
537.30          107.31          12
537.15          105.40          13
537.40          105.37          13
538.48          107.82          9
542.22          106.10          8
540.35          105.91          7
540.10          104.95          7
540.30          104.75          6
538.75          102.80          5
545.10          105.90          4
546.30          105.90          3
547.15          105.90          2
547.75          106.08          5
547.10          105.25          8
547.80          101.05          7
548.18          105.92          8
548.80          105.90          12
548.95          104.85          15
548.94          104.50          13
548.75          103.73          7
548.94          102.80          4
Aluminium residues in the soil
E.g. Al2++ + H2++O-- → Al2O + H2
Al p.p.m.  Freq.
0          0
250        7
500        13
750        25
1000       18
1250       13
1500       9
1750       7
2000       3
2250       4
2500       3
Sum of freq.   102.00
Mean           1073.53
Std. dev. (σ)  553.05
σ²             305867.94
σ³             169161266.28
σ⁴             93555193911.64
Skew           0.77
Kurtosis       13.44
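
The mean, standard deviation and skewness of the aluminium frequency table can be sketched in Python as below. It assumes frequency-weighted moments with divisor n (the total frequency) and skewness taken as the third central moment over σ³; the variable names are illustrative.

```python
from math import sqrt

# Aluminium residue (p.p.m.) -> frequency, from the table above.
al_freq = {0: 0, 250: 7, 500: 13, 750: 25, 1000: 18, 1250: 13,
           1500: 9, 1750: 7, 2000: 3, 2250: 4, 2500: 3}

n = sum(al_freq.values())
mean = sum(x * f for x, f in al_freq.items()) / n
variance = sum(f * (x - mean) ** 2 for x, f in al_freq.items()) / n
sigma = sqrt(variance)
skew = sum(f * (x - mean) ** 3 for x, f in al_freq.items()) / (n * sigma ** 3)

print(n, round(mean, 2), round(sigma, 2), round(skew, 2))
```
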
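Measures of spatial separation
Weighted mean centre (X coord.) = $\bar{X}_w = \frac{\sum f_i x_i}{\sum f_i}$
Weighted mean centre (Y coord.) = $\bar{Y}_w = \frac{\sum f_i y_i}{\sum f_i}$
Standard distance = $\sqrt{\frac{\sum (x_i - \bar{X}_w)^2 + \sum (y_i - \bar{Y}_w)^2}{\sum f_i}}$
Distance between (x1,y1) and (x2,y2) = $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
E.g. WCM = ((545.10-542.86)² + (105.90-105.48)²)^0.5
= (5.0176 + 0.1764)^0.5
= 2.28 (i.e. 2,280 m)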
Spatial distribution – occurrence of ganoderma
Sum f = 191.00; ΣXw = 103687.00; ΣYw = 20147.40; Σ(Xw - X̄)² = 588.46; Σ(Yw - Ȳ)² = 55.50
Weighted mean centre: 542.86, 105.48
Standard distance: 1.84
Point to point distance (e.g.)
x-dist. 5.00
y-dist. 0.17
Distance Wc-M 2.27
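
The summary figures for the ganoderma data can be re-derived with a short Python sketch. It starts from the reported sums rather than the raw table, and it assumes the standard distance is the square root of the two sums of squared deviations divided by the sum of weights, which is the combination that reproduces the reported 1.84.

```python
from math import sqrt

# Summary sums reported for the ganoderma data.
sum_f = 191.0
sum_xw, sum_yw = 103687.00, 20147.40
ss_x, ss_y = 588.46, 55.50   # sums of squared deviations from the centre

centre_x = sum_xw / sum_f
centre_y = sum_yw / sum_f
standard_distance = sqrt((ss_x + ss_y) / sum_f)

# Distance from the weighted mean centre to the point (545.10, 105.90).
distance = sqrt((545.10 - centre_x) ** 2 + (105.90 - centre_y) ** 2)

print(round(centre_x, 2), round(centre_y, 2),
      round(standard_distance, 2), round(distance, 2))
```

Because the centre is not rounded before the distance is taken, the point-to-point result can differ in the last digit from a hand calculation that uses the rounded centre.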
Spatial distribution – point data
Ethnic distribution of residence
Ethnic distribution of residence
x      f    fx  (x - x̄)²
0      81   0   -0.49
1      50   50  0.51
2      9    18  1.51
Total  140  68  1.54
X = no. of observations per quadrat; f = frequency of quadrats; x̄ = Σ(fx)/Σf; σ² = Σ(x - x̄)²/(Σ(fx) - 1); CV = σ²/x̄; std. error of CV = (2/(k-1))½; k = Σ(fx) - 1
Ho: σ² = x̄ (pattern is random)
H1: σ² > x̄ (pattern is clustered) or σ² < x̄ (pattern is scattered)
Test statistics
x̄                 0.49
σ²                0.01
CV                0.02
Std. error of CV  0.12
tc                -8.15
Reject Ho…residence pattern is scattered