BUSS1020 Quantitative Business Analysis Stats Lectures


Post on 05-Jun-2020


Week 1 Lecture 1

Three different branches of analytics used in business

⋅ Descriptive
  o Collecting, summarising, presenting and analysing data
    ▪ Helps summarise business data, e.g. in tables and charts
    ▪ Collect/present/characterise data
⋅ Predictive
  o Using a model and data to make forecasts of outcomes
    ▪ Predicting
⋅ Inferential
  o Using data collected from a small group to draw conclusions about a larger group
    ▪ Also used to develop, quantify, and improve the accuracy of predictive models
    ▪ Estimation/hypothesis testing

Statistics vocab

⋅ Variables – characteristics of an item or individual. Data on a variable are what you analyse when you use a statistical method
⋅ Data – the different values or outcomes associated with a variable
⋅ Operational definitions – universally accepted meanings that are clear to all associated with an analysis. Data values are meaningless unless their variables have operational definitions
⋅ Population – all the items or individuals about which you want to draw a conclusion. The population is the “large group”
⋅ Sample – the portion of the population selected for analysis. The sample is the “small group”
⋅ Parameter – a numerical measure that describes a relevant characteristic of a population
⋅ Statistic – a numerical measure that describes a characteristic of a sample. Often a statistic estimates a parameter

Types of variables

⋅ Categorical (qualitative) – variables whose values can only be placed into categories (e.g. yes, no)
⋅ Numerical (quantitative) – variables whose values represent actual quantities
  o Discrete numerical variables – arise from a counting process
    ▪ E.g. number of children
  o Continuous numerical variables – arise from a measuring process and can take any value within a given interval(s)
    ▪ E.g. cost

Levels of data measurement

⋅ Nominal (lowest level of measurement)
  o Classifies or categorises data without any specified order
    ▪ E.g. employment classification – teacher, chef, lawyer
⋅ Ordinal
  o Classifies and indicates rank or order, often representing an underlying scale
    ▪ E.g. extremely unlikely, unlikely, somewhat unlikely etc.
⋅ Interval
  o Data are numerical and differences between values have a consistent meaning, but there is no true zero point
    ▪ E.g. Celsius temperature, monetary utility
⋅ Ratio (highest level of measurement)
  o Same properties as interval, but has a true, meaningful zero point that represents the absence of the phenomenon being measured
    ▪ E.g. weight/height/volume measurement, profit/price/revenue


Sources of data

⋅ Primary sources (the analyst collects the data)
  o E.g. data from a political survey, from an experiment, or directly observed
⋅ Secondary sources (the analyst did not collect the data)
  o E.g. analysing census data, internet-published data, textbook data
⋅ Sources of data fall into four overall categories:
  o Data distributed by an organisation or an individual
  o A designed experiment
  o A survey
  o An observational study

Organising categorical data

⋅ Summary table
  o Indicates the frequency, amount, or percentage of items in a set of categories so that you can see differences between categories
⋅ Contingency table
  o Used to study patterns that may exist between the responses of two or more categorical variables
  o Cross-tabulates, or jointly tallies, the responses of the categorical variables

Week 2 Lecture 2

Organising numerical data

⋅ Ordered array
  o Sequence of data, in rank order, from the smallest value to the largest value
  o Shows the range (minimum value to maximum value)
  o May help identify outliers (unusual observations)
⋅ Frequency distribution
  o Summary table in which the data are arranged into numerically ordered classes
  o Must give attention to selecting an appropriate number of class groupings for the table, determining a suitable width of a class grouping, and establishing the boundaries of each class grouping to avoid overlapping
  o The number of classes depends on the number of values in the data (larger data sets often have more classes; typically between 5 and 15)
  o Determine the width of a class interval by dividing the range (highest value − lowest value) of the data by the number of class groupings desired, then round up
  o Condenses raw data into a more useful form to allow quick visual interpretation, and enables the determination of major characteristics of the data set, including where the data are concentrated/clustered
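The class-width rule above (range divided by the number of classes, rounded up) can be sketched in Python. This is only an illustration under stated assumptions: the data values, the choice of 5 classes, and the function names `class_width` and `frequency_distribution` are hypothetical and do not come from the lecture, and rounding up to a whole number assumes roughly integer-scaled data.

```python
import math

def class_width(values, num_classes):
    """Range divided by the desired number of classes, rounded up."""
    data_range = max(values) - min(values)
    return math.ceil(data_range / num_classes)

def frequency_distribution(values, num_classes):
    """Count values in equal-width, non-overlapping classes.

    Each class is [lower, upper); the largest value is placed in the
    last class so that every observation is counted exactly once.
    Assumes the values are not all identical (width would be zero).
    """
    width = class_width(values, num_classes)
    start = min(values)
    counts = [0] * num_classes
    for v in values:
        index = min(int((v - start) // width), num_classes - 1)
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i])
            for i in range(num_classes)]

# Hypothetical data: ten observations grouped into 5 classes.
data = [12, 15, 21, 24, 27, 33, 35, 38, 41, 47]
for lower, upper, count in frequency_distribution(data, 5):
    print(f"{lower} to under {upper}: {count}")
```

Here the range is 47 − 12 = 35, so five classes give a width of 7, and the counts show where the data cluster.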

Visualising categorical data by using graphical displays

⋅ Bar chart
  o Shows each category as a bar, the length of which represents the amount, frequency or percentage of values falling into that category, taken from the summary table of the variable
⋅ Pie chart
  o Broken up into slices that represent categories. The size of each slice varies according to the percentage in each category
⋅ Pareto chart
  o Used to portray categorical data (nominal scale)
  o Vertical bar chart where categories are shown in descending order of frequency
  o A cumulative polygon is shown in the same graph
  o Used to separate the “vital few” from the “trivial many”
⋅ Side-by-side bar charts
  o Represent the data from a contingency table (rather than a summary table)
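The ordering behind a Pareto chart (categories in descending frequency, with a running cumulative percentage) can be sketched without any plotting. The complaint categories and the function name `pareto_table` below are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

def pareto_table(observations):
    """Return (category, frequency, cumulative %) rows in descending
    order of frequency, as used for the bars and cumulative polygon."""
    counts = Counter(observations)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    # Sort by frequency (largest first); ties are broken alphabetically.
    for category, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        cumulative += freq
        rows.append((category, freq, round(100 * cumulative / total, 1)))
    return rows

# Hypothetical categorical data: reasons for customer complaints.
complaints = ["late delivery"] * 5 + ["damaged"] * 3 + ["wrong item"] * 2
for row in pareto_table(complaints):
    print(row)
```

The first rows of such a table are the “vital few”: categories that together account for most of the observations.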

Visualising numerical data by using graphical displays

⋅ Histogram


  o Organises data into groups (called bins) so that the height of each bin's bar reflects the amount or percentage of data points in that group
  o In a percentage histogram the vertical axis shows the percentage of observations per class
  o A vertical bar chart of the data in a frequency distribution
  o There are no gaps between adjacent bars for continuous data. There may be gaps for discrete data
  o The class boundaries (or class midpoints) are shown on the horizontal axis
  o The vertical axis is either the frequency, relative frequency or percentage
  o The height of each bar represents the frequency, relative frequency or percentage when the bins (intervals, class widths) have identical width
⋅ Polygon
  o A percentage polygon is formed by having the midpoint of each class represent the data in the class and then connecting the sequence of midpoints at their respective class percentages
  o The cumulative percentage polygon, or ogive, displays the variable of interest along the X axis, and the cumulative percentage along the Y axis
    ▪ In an ogive the percentage of the observations less than each lower class boundary is plotted against the lower class boundaries
  o Useful when there are two or more groups to compare
⋅ Scatter plot
  o Used for numerical data consisting of paired observations taken from two numerical variables
  o One variable is measured on the vertical axis and the other variable is measured on the horizontal axis
  o Scatter plots are used to examine possible relationships between two numerical variables
⋅ Time series plot
  o Used to study patterns in the values of a numeric variable over time
  o The numeric variable is measured on the vertical axis and the time period is measured on the horizontal axis

Principles of graphical excellence

⋅ The graph should not distort the information in the data
⋅ The graph should not contain too many unnecessary adornments
⋅ The scale on the vertical axis should (usually) begin at zero
⋅ All axes should be properly and clearly labelled
⋅ The graph should contain an informative title
⋅ The simplest possible graph should usually be used for a given set of data
⋅ The graph should contain the source of the data
⋅ 3D graphs should have a meaningful 3rd dimension – usually 2D is sufficient
⋅ Graphs should be used to objectively and clearly convey the message (i.e. relevant information) in the data

Numerical descriptive measures

⋅ Central tendency – the extent to which all the data values group around a typical or central value
⋅ Variation – the amount of dispersion or scattering of values around the central value
⋅ Shape – the pattern in the distribution of values from the lowest value to the highest value

Measures of central tendency

⋅ Mean
  o Most common measure of central tendency
  o Is the sum of the values divided by the number of values
  o Affected by extreme values (outliers)
  o $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
⋅ Median
  o The middle number in an ordered array (50% above, 50% below)
  o Not affected by extreme values


  o Location of the median when values are in numerical order (smallest to largest): Median position = $\frac{n+1}{2}$ position in the ordered data
  o If the number of values is odd, the median is the middle number
  o If the number of values is even, the median is the average of the two middle numbers
⋅ Mode
  o Value that occurs most often
  o Not affected by extreme values
  o Used for either numerical or categorical (nominal) data
  o There may be no mode, or several modes
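The three measures above can be sketched in Python. The median is computed from the $(n+1)/2$ position rule so that the link to the notes is visible, and the standard library's `statistics` module is used as a cross-check. The data values and the helper name `median_by_position` are hypothetical.

```python
import statistics

def median_by_position(values):
    """Median found via the (n + 1) / 2 ranked position (1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if position.is_integer():
        # Odd n: the median is the single middle value.
        return ordered[int(position) - 1]
    # Even n: average the two values either side of the position.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

# Hypothetical data with one extreme value (100).
prices = [3, 5, 5, 7, 9, 11, 100]

print("mean:", statistics.mean(prices))        # pulled upwards by the outlier
print("median:", median_by_position(prices))   # unaffected by the outlier
print("mode(s):", statistics.multimode(prices))
```

With this data the mean is 20 while the median is 7, which illustrates why the median is preferred when outliers are present.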

Which measure of central tendency to use?

⋅ The mean is generally used, unless extreme values (outliers) exist
⋅ The median is often used, since it is not sensitive to extreme values (e.g. median house prices may be reported)
⋅ Some situations are suitable for reporting both the mean and the median
⋅ The mode is the most frequent observation. Data may have more than one mode, so it is usually only reported for discrete data

Geometric mean and geometric rate of return

⋅ Geometric mean
  o Often used to measure the rate of change of a variable over time
  o $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
⋅ Geometric mean rate of return
  o Measures the status of an investment over time
  o $\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
  o Where $R_i$ is the rate of return in time period $i$

Week 3 Lecture 3

Measures of variation

⋅ Gives information on the spread/variability/dispersion of the data values
⋅ Range
  o Simplest measure, giving the difference between the largest and smallest values
  o Range = $X_{\text{largest}} - X_{\text{smallest}}$
    ▪ Can be misleading as it ignores the way the data are distributed, and is sensitive to outliers

⋅ Sample variance
  o Average of the squared deviations of values from the mean (dividing by n − 1 rather than n)
  o $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
  o Variance measures the variability (volatility) around an average, and is a measure of risk

⋅ Sample standard deviation
  o Most commonly used measure of variation
  o Shows variation about the mean
  o Is the square root of the variance
  o Has the same units as the original data
  o $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$

⋅ Steps for computing the standard deviation
  o Compute the difference between each value and the mean
  o Square each difference
  o Add the squared differences
  o Divide this total by n − 1 to get the sample variance
  o Take the square root of the sample variance to get the sample standard deviation

Measure of variation: summary characteristics


⋅ The more the data are spread out, the greater the range, variance, and standard deviation (a low, wide bell curve)
⋅ The more the data are concentrated, the smaller the range, variance and standard deviation (a tall, thin bell curve)
⋅ If the values are all the same (no variation), all these measures will be zero
⋅ None of these measures is ever negative
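The steps for computing the sample standard deviation listed earlier, together with the summary characteristics above, can be sketched in Python. The function names and the data values are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
import math

def sample_variance(values):
    """Sample variance following the steps in the notes."""
    n = len(values)
    mean = sum(values) / n
    differences = [x - mean for x in values]      # step 1: value minus mean
    squared = [d ** 2 for d in differences]       # step 2: square each difference
    total = sum(squared)                          # step 3: add the squares
    return total / (n - 1)                        # step 4: divide by n - 1

def sample_std_dev(values):
    """Step 5: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical data: mean is 3, squared deviations sum to 10.
spread = [1, 2, 3, 4, 5]
constant = [4, 4, 4, 4]

print(data_range(spread), sample_variance(spread), sample_std_dev(spread))
print(data_range(constant), sample_variance(constant), sample_std_dev(constant))
```

For the first data set the variance is 10 / 4 = 2.5; for the constant data set all three measures are zero, as the summary characteristics state.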

Coefficient of variation

⋅ Measures relative variation
⋅ Always expressed as a percentage
⋅ Shows variation relative to the mean
⋅ Can be used to compare the variability of two or more sets of data measured in different units
⋅ $CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$
⋅ $S$ = sample standard deviation, $\bar{X}$ = sample mean (e.g. average price)

Sample Z-score: assessing extreme observations

⋅ To compute the z-score of a data value, subtract the mean and divide by the standard deviation
⋅ The z-score is the number of standard deviations a data value is from the mean
⋅ A data value could be considered an outlier (extreme) if its z-score is less than −3.0 or greater than +3.0
⋅ The larger the absolute value of the z-score, the farther the data value is from the mean
⋅ $Z = \frac{X - \bar{X}}{S}$
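Both the coefficient of variation and the z-score are built from the sample mean and sample standard deviation, so they can be sketched together. The function names and data below are hypothetical; `statistics.stdev` from the Python standard library uses the n − 1 divisor, matching the sample formulas in these notes.

```python
import statistics

def coefficient_of_variation(values):
    """CV = (S / mean) * 100, expressed as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100

def z_scores(values):
    """Z = (X - mean) / S for every data value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

def outliers(values, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical data: nineteen typical values and one extreme value.
sales = [10] * 19 + [100]

print(f"CV: {coefficient_of_variation(sales):.1f}%")
print("largest z-score:", round(max(z_scores(sales)), 2))
print("outliers:", outliers(sales))
```

Note that in small samples a z-score beyond ±3 may be impossible to reach, so the ±3.0 guideline is most useful for reasonably large data sets.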

Shape of a distribution

⋅ Describes how the data are distributed
⋅ Skewness
  o Measures the amount of asymmetry in a distribution (the value of the skewness statistic is shown in brackets)
    ▪ Mean < median = left skewed (<0)
    ▪ Mean = median = symmetric (=0)
    ▪ Mean > median = right skewed (>0)
⋅ Kurtosis
  o Measures the relative concentration of values in the centre of a distribution as compared with the tails
    ▪ Flatter than bell shaped (<0)
    ▪ Bell shaped (=0)
    ▪ Sharper peak than bell shaped (>0)
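The sign conventions above can be checked with a small sketch. The notes do not give formulas for these statistics, so the code below assumes the simple moment-based definitions (third and fourth central moments scaled by the variance, with 3 subtracted for kurtosis); spreadsheet software typically applies small-sample adjustments, so its values may differ slightly while the signs are interpreted the same way. The function names and data are hypothetical.

```python
import statistics

def central_moment(values, k):
    """Average of (x - mean)^k over the data."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: < 0 left skewed, 0 symmetric, > 0 right skewed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a bell-shaped curve scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

for name, data in [("symmetric", symmetric), ("right skewed", right_skewed)]:
    print(name,
          "mean:", round(statistics.mean(data), 2),
          "median:", statistics.median(data),
          "skewness:", round(skewness(data), 2))
```

In the right-skewed data set the mean exceeds the median and the skewness statistic is positive, in line with the list above.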

Quartile measures

⋅ Quartiles split the ranked data into 4 segments with an equal number of values per segment
  o Q1 is the value for which 25% of the observations are smaller and 75% are larger
    ▪ Q1 = $\frac{n+1}{4}$th ranked value
  o Q2 is the median, with 50% of the observations larger and 50% smaller
    ▪ Q2 = $\frac{n+1}{2}$th ranked value
  o Q3 is the value for which 25% of the observations are larger and 75% are smaller
    ▪ Q3 = $\frac{3(n+1)}{4}$th ranked value
⋅ If the result is a whole number, then that is the ranked position to use (e.g. the 10th ranked value)
⋅ If the result is a fractional half (1.5, 12.5 etc.), then average the two data values whose ranked positions the result lies between
⋅ If the result is neither a whole number nor a fractional half (any other decimal that doesn’t end in .5), round the result to the nearest integer to find the ranked position (either up or down)
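The three positioning rules above can be sketched in Python. The function names and the data values are hypothetical, and the sketch assumes the data set has at least a handful of values so that every ranked position exists. Library quartile functions often use different interpolation rules, so their results may not match this method exactly.

```python
def value_at_position(ordered, position):
    """Value at a 1-based ranked position, using the rules in the notes."""
    if position == int(position):
        # Whole number: use that ranked value directly.
        return ordered[int(position) - 1]
    if position * 2 == int(position * 2):
        # Fractional half: average the two neighbouring ranked values.
        lower = int(position)
        return (ordered[lower - 1] + ordered[lower]) / 2
    # Any other decimal: round to the nearest ranked position.
    return ordered[round(position) - 1]

def quartiles(values):
    """Return (Q1, Q2, Q3) using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q2 = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3

# Hypothetical data with n = 9: positions are 2.5, 5 and 7.5.
nine_values = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(quartiles(nine_values))

# Hypothetical data with n = 10: positions are 2.75, 5.5 and 8.25.
ten_values = list(range(1, 11))
print(quartiles(ten_values))
```

With nine values, Q1 and Q3 fall on fractional halves and are averaged; with ten values, the positions 2.75 and 8.25 are rounded to the 3rd and 8th ranked values.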

Interquartile range