statistics ch 1-5 notes (2)

Upload: prfktshun1

Post on 05-Apr-2018


  • 7/31/2019 Statistics Ch 1-5 Notes (2)

    Chapter 1: Introduction - Defining the Role of Statistics in Business

    Statistical Analysis: helps extract information from data and provides an indication of the quality of that information

    Data mining: combines statistical methods with computer science & optimization in order to help businesses make the best use of the information contained in large data sets

    Probability: helps you understand risky and random events and provides a way of evaluating the likelihood of various potential outcomes

    1.1 - Why should you learn statistics?

    o Advertising. Effective? Which commercial? Which markets?

    o Quality control. Defect rate? Cost? Are improvements working?

    o Finance. Risk: how high? How to control? At what cost?

    o Accounting. Audit to check financial statements. Is error material?

    o Other: economic forecasting, measuring and controlling productivity

    1.2 What is statistics?

    Statistics: the art and science of collecting and understanding data

    o A complete and careful statistical analysis will summarize the general facts that apply to everyone and will also alert you to any exceptions.

    1.3 The Five Basic Activities of Statistics

    1. Design Phase: resolves the details of data gathering in advance so that useful data will result

    a. Designing the Study: involves planning the details of data gathering. Can avoid the costs & disappointment of finding out too late that the data collected are not adequate to answer the important questions.

    b. The Population: large group of people, firms, or other items

    c. The Sample: a smaller group that consists of some of the population

    d. Statistical Inference: the process of generalizing from the observed sample to the larger population

    e. The Random Sample: best way to select a practical sample, to be studied in detail, from a population that is too large to be examined in its entirety

    i. Ensures the selection process is fair & without bias, so the sample tends to be representative of the population

    ii. The randomness, introduced in a controlled way during the design phase, will help ensure validity of the statistical inferences drawn later

    2. Exploring the Data: involves looking at your data set from many angles, describing it, and summarizing it. Exploration is the first phase once you have data to look at.

    a. Prepares for the formal analysis, either:

    i. By verifying that the expected relationships actually exist in the data, thereby validating the planned techniques of analysis

    ii. By finding some unexpected structure in the data that must be taken into account, thereby suggesting some changes in the planned analysis

    3. Modeling the Data: a system of assumptions & equations is selected in order to provide a framework for further analysis.

    a. Model: a system of assumptions and equations that can generate artificial data similar to the data you are interested in, so that you can work with a few numbers (parameters) that represent the important aspects of the data

    i. Often, a model says that data equals structure plus random noise

    1. Data = Structure + Random Noise

    4. Estimating an Unknown Quantity: an estimate is a numerical summary of an unknown quantity, based on data. It is the best educated guess possible based on the available data. We all want (and often need) estimates of things that are just plain impossible to know exactly.

    a. Provides an indication of the amount of uncertainty or error involved in the guess, accounting for the consequences of random selection of a sample from a large population

    b. Confidence Interval: gives probable upper and lower bounds on the unknown quantity being estimated. Puts the estimate in perspective and helps you avoid the tendency to treat a single number as very precise when, in fact, it might not be precise at all.

    c. NOTES:

    i. Estimating an unknown: best guess based on data

    ii. Wrong, but by how much?

    iii. "We're 95% sure that the unknown is between ... and ..." Never say 100%. Wrong 5% of the time (but by how much?)

    5. Hypothesis Testing: uses the data to help decide what the world is really like in some respect. It is the use of data in deciding between two (or more) different possibilities in order to resolve an issue in an ambiguous situation

    a. Produces a definite decision about which of the possibilities is correct, based on data

    b. Procedure is to collect data that will help decide among the possibilities and to use careful statistical analysis for extra power when the answer is not obvious from just glancing at the data

    c. Each hypothesis makes a definite statement, and it may be either true or false

    d. The result of a statistical hypothesis test is the conclusion that either the data support the hypothesis or they don't.

    e. NOTES:

    i. Hypothesis testing: data decide between two possibilities

    ii. Does it really work? Or is it just randomly better?

    iii. Whiter, brighter wash?
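
    The "Data = Structure + Random Noise" model and the "95% sure" confidence interval from the notes above can be sketched in Python. This is only an illustration, not part of the original notes: the structure value of 50, the noise level, the sample size, and the 1.96 normal-approximation multiplier are all assumptions chosen for the example.

```python
import math
import random
import statistics


def simulate_data(structure, noise_sd, n, seed=0):
    """Generate artificial data: Data = Structure + Random Noise."""
    rng = random.Random(seed)
    return [structure + rng.gauss(0, noise_sd) for _ in range(n)]


def confidence_interval_95(sample):
    """Approximate 95% confidence interval for the population mean.

    Assumes the normal-approximation multiplier 1.96, which is
    reasonable only for fairly large samples.
    """
    estimate = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = 1.96 * standard_error
    return estimate - margin, estimate, estimate + margin


# Hypothetical values: true structure 50, noise standard deviation 5.
data = simulate_data(structure=50.0, noise_sd=5.0, n=100)
lower, estimate, upper = confidence_interval_95(data)
print(f"Estimate: {estimate:.2f}")
print(f"95% sure the unknown is between {lower:.2f} and {upper:.2f}")
```

    The interval, rather than the single estimate, is what answers "wrong, but by how much?".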

    1.4 Data Mining

    Data Mining: a collection of methods for obtaining useful knowledge by analyzing large amounts of data, often by searching for hidden patterns.

    o Goal of data mining is to obtain value from these vast stores of data in order to improve the company with higher sales, lower costs, and better products

    o Marketing and sales: data can be mined for guidance on how (and when) to better reach customers in the future

    o Finance: useful in forming and evaluating investment strategies and in hedging (or reducing) risk

    o Product Design: answers what particular combinations of features customers are ordering in larger-than-expected quantities.

    o Production

    o Fraud Detection: the best methods of protection involve mining data to distinguish between ordinary and fraudulent patterns of usage, then using the results to classify new transactions, and looking carefully at suspicious new occurrences to decide whether or not fraud is actually involved

    o Data Mining involves combining resources from many fields:

    Statistics: All of the basic activities of statistics are involved: a design for collecting the data, exploring for patterns, a modeling framework, estimation of features, and hypothesis testing to assess significance of patterns

    Computer Science: Efficient algorithms (computer instructions) are needed for collecting, maintaining, organizing, and analyzing data.

    Optimization: Helps achieve a goal, which might be very specific such as maximizing profits, lowering production cost, finding new customers, developing profitable new product models, or increasing sales volume.

    Often accomplished by adjusting the parameters of a model until the objective is achieved

    1.5 - Probability

    Probability: a "what if" tool for understanding risk and uncertainty. Shows you the likelihood, or chances, for each of the various potential future events, based on a set of assumptions about how the world works

    o Probability is the inverse of statistics. Whereas statistics helps you go from observed data to generalizations about how the world works, probability goes the other direction.

    o Probability works with statistics by providing a solid foundation for statistical inference.

    [Diagram with arrows: "How the world works" to "What is likely to happen"; "What happened" to "What is likely to happen"]

    If you make assumptions about how the world works, then probability can help you figure out how likely various outcomes are and thus help you understand what is likely to happen. If you have data that tell you something about what

    Chapter 1 Questions

    1. Why is it worth the effort to learn about statistics?

    a. Answer for management in general

    b. Answer for one particular area of business of special interest to you

    2. Skip

    3. How should statistical analysis and business experience interact with each other?

    4. What is statistics?

    5. What is the design phase of a statistical study?

    6. Why is random sampling a good method to use for selecting items for study?

    7. What can you gain by exploring data in addition to looking at summary results from an automated analysis?

    8. What can a statistical model help you accomplish? Which basic activity of statistics can help you choose an appropriate model for your data?

    9. Are statistical estimates always correct? If not, what else will you need (in addition to the estimate values) in order to use them effectively?

    10. Why is a confidence interval more useful than an estimated value?

    11. Give two examples of hypothesis testing situations that a business firm would be interested in.

    12. What distinguishes data mining from other statistical methods? What methods, in addition to those of statistics, are often used in data mining?

    13. Differentiate between probability and statistics.

    14. A consultant has just presented a very complicated statistical analysis, complete with lots of mathematical symbols and equations. The results of this impressive analysis go against your intuition and experience. What should you do?

    15. Why is it important to identify the source of funding when evaluating the results of a statistical study?

    Problems

    6. Which of the five basic activities of statistics is represented by each of the following situations?

    a. A factory's quality control division is examining detailed quantitative information about recent productivity in order to identify possible trouble spots.

    b. A focus group is discussing the audience that would best be targeted by advertising, with the goal of drawing up and administering a questionnaire to this group.

    c. In order to get the most out of your firm's Internet activity data, it would help to have a framework or structure of equations to allow you to identify and work with the relationships in the data.

    d. A firm is being sued for gender discrimination. Data that show salaries for men and women are presented to the jury to convince them that there is a consistent pattern of discrimination and that such a disparity could not be due to randomness alone.

    e. The size of next quarter's gross national product must be known so that a firm's sales can be forecast. Since it is unavailable at this time, an educated guess is used.

    Chapter 2: Data Structures - Classifying the Various Types of Data Sets

    Data Set: consists of observations on items, typically with the same information being recorded for each item

    Elementary Units: the items themselves

    Data sets can be classified:

    o By the number of pieces of information (variables)

    o By the kind of measurement (numbers or categories)

    o By whether or not the time sequence of recording is relevant

    o By whether or not the information was newly created or had previously been created by others for their own purposes

    2.1 How Many Variables?

    Variable: a piece of information recorded for every item (its cost, for example)

    o One = univariate data, two = bivariate data, & many = multivariate data

    Univariate Data Sets (one-variable)

    o Have just one piece of information recorded for each item

    o Statistical methods summarize the basic properties and answer questions such as what is a typical summary value, how diverse are these items, and do any individuals or groups require special attention

    o Examples: incomes of subjects in a marketing survey, number of defects in each TV set sample, interest rate forecasts of 25 experts, and the bond ratings of the firms in an investment portfolio

    Bivariate Data Sets (two-variable)

    o Have exactly two pieces of information recorded for each item

    o In addition to summarizing each of these two variables separately as univariate data sets, statistical methods would also be used to explore the relationship between the two factors being measured.

    o Answers: is there a simple relationship between the two, how strongly are they related, can you predict one from the other & if so with what degree of reliability, and do any individuals or groups require special attention

    o Examples: cost of production (1st variable) & number produced (2nd variable), price of one share of your firm's common stock (1st variable) & the date (2nd variable), and purchase or non-purchase of an item (1st variable) & whether an advertisement for the item is recalled (2nd variable)

    Multivariate Data (many-variable)

    o Have three or more pieces of information recorded for each item

    o Summarizes each variable separately, looks at the relationship between any two variables, AND also looks at the interrelationships among all the variables

    o Answers: is there a simple relationship between the two, how strongly are they related, can you predict one from the other & if so with what degree of reliability, and do any individuals or groups require special attention

    o Examples: growth rate (special variable) and a collection of measures of strategy (the other variables), such as type of equipment, extent of investment, and management style, for each of a number of new entrepreneurial firms; and salary (special variable) and gender (recorded as male or female, 0/1), number of years of experience, job category, and performance record, for each employee
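
    As a small illustration of classifying a data set by its number of variables, the sketch below stores each elementary unit as a Python dictionary and counts the variables recorded per item. The function name and the sample records are hypothetical, made up for this example.

```python
def classify_by_variable_count(records):
    """Classify a data set by how many variables are recorded per item."""
    counts = {len(record) for record in records}
    if len(counts) != 1:
        raise ValueError("the same variables should be recorded for each item")
    (n_variables,) = counts
    if n_variables < 1:
        raise ValueError("each item needs at least one variable")
    if n_variables == 1:
        return "univariate"
    if n_variables == 2:
        return "bivariate"
    return "multivariate"


# Hypothetical data sets; each dictionary is one elementary unit.
incomes = [{"income": 52000}, {"income": 61000}]
production = [{"cost": 1200, "units": 40}, {"cost": 2100, "units": 75}]
employees = [{"salary": 48000, "gender": 0, "years": 3, "job": "clerk"}]

for name, data_set in [("incomes", incomes), ("production", production),
                       ("employees", employees)]:
    print(name, "->", classify_by_variable_count(data_set))
```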

    2.2 Quantitative Data: Numbers

    o Meaningful numbers are numbers that directly represent the measured or observed amount of some characteristic or quality of the elementary units

    Include dollar amounts, counts, sizes, numbers of employees, and miles per gallon

    They exclude numbers that are merely used to code for or keep track of something else (like 1 = buy stock, 2 = sell stock, 3 = buy bond, 4 = sell bond)

    o Quantitative Data: data that consist of meaningful numbers, which represent quantities

    Discrete Quantitative Data

    o Discrete Variable: can assume values only from a list of specific numbers

    Example: the number of children in a household is a discrete variable

    Continuous Quantitative Data

    o Continuous variable: any numerical variable that is not discrete

    o The possible values form a continuum such as the set of all positive numbers, all numbers, or all values between 0 and 100%

    Example: the actual weight of a candy bar marked "net weight 1.7 oz" is a continuous random variable; the actual weight might be 1.70235 or 1.69481 oz

    Watch out for meaningless numbers

    o Make sure the numbers are meaningful

    2.3 Qualitative Data: Categories

    o Qualitative Data: if the data set tells you which one of several non-numerical categories each item falls into (b/c they record some quality that the item possesses)

    Ordinal: for which there is a meaningful ordering but no meaningful numerical assignment

    Can say first, second, third, and so on

    Can rank the data according to this ordering, and ranking will probably play a role in the analysis

    There is a median value (the middle one, once the data are put into order)

    Nominal: for which there is no meaningful order

    There are only categories, with no meaningful order

    There are no meaningful numbers to compute with, and no basis for ranking

    About all that can be done is to count and work with the percentage of cases falling into each category, using the mode (the category occurring most often) as a summary measure
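
    The difference between what can be computed for ordinal and nominal data can be sketched in Python as follows. The category lists are hypothetical, and returning the lower of two middle categories for an even number of items is a choice made for this sketch (categories cannot be averaged), not a rule from the notes.

```python
from collections import Counter


def mode_of_categories(values):
    """Most common category; works for nominal or ordinal data."""
    return Counter(values).most_common(1)[0][0]


def category_percentages(values):
    """Percentage of cases falling into each category."""
    counts = Counter(values)
    return {category: 100 * count / len(values)
            for category, count in counts.items()}


def median_of_ordinal(values, ordering):
    """Middle category once the data are put into order.

    `ordering` lists the categories from lowest to highest. With an even
    number of items, the lower of the two middle categories is returned.
    """
    ranked = sorted(values, key=ordering.index)
    return ranked[(len(ranked) - 1) // 2]


regions = ["West", "East", "West", "South", "West"]   # nominal (hypothetical)
ratings = ["AA", "A", "AAA", "A", "BBB"]              # ordinal (hypothetical)
rating_order = ["BBB", "A", "AA", "AAA"]              # lowest to highest

print(mode_of_categories(regions))
print(category_percentages(regions))
print(median_of_ordinal(ratings, rating_order))
```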

    2.4 Time-Series and Cross-Sectional Data

    o Time-Series Data: if the data values are recorded in a meaningful sequence, such as daily stock prices

    o Cross-Sectional Data: if the sequence in which the data are recorded is irrelevant, such as the first-quarter earnings of eight aerospace firms

    Another way of saying that no time sequence is involved; you simply have a cross-section, or snapshot, of how things are at one particular time

    2.5 Sources of Data, including the Internet

    o Primary Data: when you control the design of the data-collection plan (even if the work is done by others)

    More likely to be able to get exactly the information you want because you control the data-generating process

    Primary data sets are often expensive and time-consuming to obtain

    o Secondary Data: when you use data previously collected by others for their own purposes

    Often inexpensive (or even free) and you might find exactly (or nearly) what you need

    o To look for data on the Internet, most people use a search engine and specify some key words

    Still common for a search to fail to find the information you really want

    Chapter 2 Questions

    1. What is a data set?

    2. What is a variable set?

    3. What is an elementary unit?

    4. What are three basic ways in which data sets can be classified? (Hint: the answer is not univariate, bivariate and multivariate, but is at a higher level)

    5. What general questions can be answered by analysis of:

    a. Univariate data

    b. Bivariate data

    c. Multivariate data?

    6. In what way do bivariate data represent more than just two separate univariate data sets?

    7. What can be done with multivariate data?

    8. What is the difference between quantitative and qualitative data?

    9. What is the difference between discrete and continuous quantitative variables?

    10. What are qualitative data?

    11. What is the difference between ordinal and nominal qualitative data?

    12. Differentiate between time-series data and cross-sectional data.

    13. Which are simpler to analyze, time-series or cross-sectional data?

    14. Distinguish between primary and secondary data

    Chapter 3: Histograms - Looking at the Distribution of Data

    Histogram: a picture that gives you a visual impression of many of the basic properties of the data set as a whole

    o Answers: what values are typical in this data set, how different are the numbers from one another, are the data values strongly concentrated near some typical value, what is the pattern of the concentration (do data values trail off at the same rate at lower values as they do at higher values), are there any special data values that might require special treatment, and do you have a single, homogeneous collection or are there distinct groupings within the data that might require separate analysis

    o Many standard methods of statistical analysis require that the data be approximately normally distributed

    3.1 A List of Data

    List of Numbers: the simplest kind of data set, representing some kind of information (a single statistical variable) measured on each item of interest (each elementary unit)

    Number Line: a straight line with the scale indicated by numbers

    o Used in order to visualize the relative magnitudes of a list of numbers

    o The numbers need to be regularly spaced on a number line so that there is no distortion

    3.2 Using a Histogram to Display the Frequencies

    Histogram: displays the frequencies as a bar chart rising above the number line, indicating how often the various values occur in the data set

    o Horizontal axis = measurements of the data set (dollars, # of people, miles/gallon, etc.)

    o Vertical axis = represents how often these values occur

    o An especially high bar indicates that many cases had data values at this position on the horizontal number line, while a shorter bar indicates a less common value

    A histogram is a bar chart of the frequencies, not of the data

    o The height of each bar in the histogram indicates how frequently the values on the horizontal axis occur in the data set (where values are concentrated & where they are scarce)
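
    To make concrete the idea that a histogram charts frequencies rather than the data values themselves, here is a minimal Python sketch that counts how many values fall into each equal-width bar and prints a text histogram. The miles-per-gallon figures, the starting point, and the bar width are hypothetical choices for the example.

```python
def histogram_counts(values, start, bar_width, n_bars):
    """Count how many values fall into each equal-width bar.

    Bar i covers the range from start + i * bar_width (included)
    up to start + (i + 1) * bar_width (excluded).
    """
    counts = [0] * n_bars
    for value in values:
        index = int((value - start) // bar_width)
        if 0 <= index < n_bars:
            counts[index] += 1
    return counts


def print_histogram(values, start, bar_width, n_bars):
    """Print one row of stars per bar; the row length is the frequency."""
    counts = histogram_counts(values, start, bar_width, n_bars)
    for i, count in enumerate(counts):
        low = start + i * bar_width
        print(f"{low:>4} to {low + bar_width:<4} | {'*' * count}")


# Hypothetical miles-per-gallon measurements for ten cars.
mpg = [21, 24, 25, 27, 28, 28, 29, 31, 32, 36]
print_histogram(mpg, start=20, bar_width=5, n_bars=4)
```

    The tallest row marks where the data values are concentrated; the short rows mark where they are scarce.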

    3.3 Normal Distributions

    Normal Distribution: an idealized, smooth, bell-shaped histogram with all of the randomness removed

    o Represents an ideal data set that has lots of numbers concentrated in the middle of the range, with the remaining numbers trailing off symmetrically on both sides

    o It is common for statistical procedures to assume that the data set is reasonably approximated by a normal distribution

    o It is important to explore the data, by looking at a histogram, to determine whether or not it is normally distributed

    Especially important if a standard statistical calculation will be used that requires a normal distribution

    3.4 Skewed Distributions and Data Transformation

    Skewed Distribution: is neither symmetric nor normal because the data values trail off more sharply on one side than on the other

    In business, you often find skewness in data sets that represent sizes using positive numbers

    o Reason is that data values cannot be less than zero (imposing a boundary on one side), but are not restricted by a definite upper boundary

    One of the problems with skewness in data is that many statistical methods require at least an approximately normal distribution

    Transformation: a solution to skewness; makes a skewed distribution more symmetric. It is replacing each data value by a different number (such as a logarithm) to facilitate statistical analysis

    o If the data include a negative number or zero, this technique (the logarithm) cannot be used

    o Logarithm: using the log often transforms skewness into symmetry because it stretches the scale near zero, spreading out all of the small values, which had been bunched together

    Base 10 (common logs) (*what we will use in this section)

    Base e (natural logs)

    o The logarithm pulls in the very large numbers, minimizing their difference from other values in the set, and stretching out the low values
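
    The effect of the base-10 logarithm on a skewed list of numbers can be sketched in Python as below. The revenue figures are hypothetical, and comparing the average with the median is used here only as a rough, informal indicator of skewness.

```python
import math
import statistics


def log10_transform(values):
    """Replace each data value by its base-10 (common) logarithm.

    Raises ValueError if any value is zero or negative, since the
    logarithm is not defined there.
    """
    if any(value <= 0 for value in values):
        raise ValueError("logarithm requires strictly positive data")
    return [math.log10(value) for value in values]


# Hypothetical firm revenues (in $ millions): bunched near zero, long right tail.
revenues = [1, 2, 3, 5, 8, 10, 20, 50, 100, 1000]
logs = log10_transform(revenues)

print("original:  average", statistics.mean(revenues),
      "median", statistics.median(revenues))
print("log scale: average", round(statistics.mean(logs), 3),
      "median", round(statistics.median(logs), 3))
```

    On the original scale the average is pulled far above the median by the largest values; on the log scale the two summaries sit much closer together.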

    3.5 Bimodal Distributions

    It is important to recognize when a data set consists of two or more distinct groups so that they may be analyzed separately

    o Can be seen in a histogram as a distinct gap between two cohesive groups of bars

    Bimodal Distribution: when two clearly separate groups are visible in a histogram

    o Has two modes, or two distinct clusters of data

    May be an indication that the situation is more complex, or that extra care is required

    o Should find out the reason for the two groups

    o To count as distinct, the groups must be large enough, individually cohesive, and either have a fair gap between them or else represent a large enough sample to be sure that the lower frequencies between the groups are not just random fluctuations

    3.6 Outliers

    Outliers: data values that don't seem to belong with the others because they are either far too big or far too small

    How you deal with outliers depends on what caused them

    o Two causes: (1) mistakes and (2) correct but different data values

    Dealing with outliers

    o Mistakes: change the data value to the number it should have been in the first place

    o Correct outliers are more difficult to deal with

    If it can be argued convincingly that the outliers do not belong to the general case under study, they may then be set aside so that the analysis can proceed with only the coherent data

    Must be able to convince any person for whom the report is intended

    Compromise solution: perform two different analyses, one with the outlier included and one with it omitted. By reporting the results of both analyses, you have not unfairly slanted the results

    Whenever an outlier is omitted, explain what you did and why, in order to inform others and protect yourself from any possible accusations.

    Why must outliers be addressed?

    It is difficult to interpret the detailed structure in a data set when one value dominates the scene and calls too much attention to itself

    Many of the most common statistical methods can fail when used on a data set that doesn't appear to have a normal distribution
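
    The compromise of analyzing the data twice, once with the outlier and once without, might look like the following Python sketch. The sales figures and the choice of 95 as the suspected outlier are hypothetical; deciding which value counts as an outlier is left to the analyst.

```python
import statistics


def summarize(values):
    """Basic summary of a list of numbers."""
    return {
        "n": len(values),
        "average": statistics.mean(values),
        "median": statistics.median(values),
    }


def analyze_with_and_without(values, outlier):
    """Run the same summary with the outlier included and omitted."""
    without = [value for value in values if value != outlier]
    return summarize(values), summarize(without)


# Hypothetical weekly sales; 95 is the suspected outlier.
sales = [12, 14, 15, 15, 16, 18, 95]
with_outlier, without_outlier = analyze_with_and_without(sales, outlier=95)
print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

    Reporting both summaries side by side lets the reader see how much the single extreme value moves the average while the median barely changes.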

    3.7 Data Mining with Histograms

    The histogram is a useful tool for large data sets because you can see the entire data set at a glance

    o Provides a visual impression of the data set, and with large data sets you will be able to see more of the detailed structure

    One advantage of data mining with a large data set is that we can ask for more detail

    o Can have more histogram bars by reducing the width of the bar

    3.8 Histograms by Hand: Stem-and-Leaf

    Stem-and-Leaf: easiest way to construct a histogram by hand, in which the histogram bars are constructed by stacking numbers one on top of the other (or side-by-side).
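
    A stem-and-leaf display can also be produced by a short program. The sketch below assumes non-negative whole numbers and uses the tens as the stem and the ones digit as the leaf; both the data and that splitting rule are assumptions made for this illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build a side-by-side stem-and-leaf display as a list of text rows.

    Assumes non-negative whole numbers: the stem is the value divided
    by 10 (the tens) and the leaf is the ones digit.
    """
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


# Hypothetical data values.
for line in stem_and_leaf([12, 15, 21, 23, 23, 28, 34, 47]):
    print(line)
```

    Each row acts as a histogram bar laid on its side, and the original numbers can still be read back from the display.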

    Chapter 3 Questions

    1. What is a list of numbers?

    2. Name six properties of a data set that are displayed by a histogram.

    3. What is a number line?

    4. What is the difference between a histogram and a bar chart?

    5. What is a normal distribution?

    6. Why is the normal distribution important in statistics?

    7. When a real data set is normally distributed, should you expect the histogram to be a perfectly smooth bell-shaped curve? Why or why not?

    8. Are all data sets normally distributed?

    9. What is a skewed distribution?

    10. What is the main problem with skewness? How can it be solved in many cases?

    11. How can you interpret the logarithm of a number?

    12. What is a bimodal distribution? What should you do if you find one?

    13. What is an outlier?

    14. Why is it important in a report to explain how you dealt with an outlier?

    15. What kinds of trouble do outliers cause?

    16. When is it appropriate to set aside an outlier and analyze only the rest of the data?

    17. Suppose there is an outlier in your data. You plan to analyze the data twice: once with and once without the outlier. What result would you be most pleased with? Why?

    18. What is a stem-and-leaf histogram?

    19. What are the advantages of a stem-and-leaf histogram?

    Chapter 4: Landmark Summaries - Interpreting Typical Values and Percentiles

    Summarization: using one or more selected or computed values to represent the data set

    Discovering and identifying the features that the cases have in common are statistical activities because they treat the information as a whole

    In statistics, one goal is to condense a data set down to one number (or two or a few numbers) that expresses the most fundamental characteristics of the data

    Methods most appropriate for a single list of numbers:

    o One: the average, median, and mode are different ways of selecting a single number that closely describes all the numbers in a data set

    Typical value, center, or location

    o Two: a percentile summarizes information about ranks

    o Three: the standard deviation is an indication of how different the numbers in the data set are from one another (also referred to as diversity or variability)

    Outliers may be described separately. You can summarize a large group of data by 1) summarizing the basic structure of most of its elements and 2) making a list of any special exceptions

    4.1 What is the Most Typical Value?

    Typical Value: the ultimate summary of any data set is a single number that best represents all of the data values

    o Average or Mean: can only be computed for meaningful numbers (quantitative data)

    The most common method for finding a typical value for a list of numbers, found by adding up all the values and then dividing by the number of items

    Excel's AVERAGE function can be used to find the average of a list of numbers

    =AVERAGE(A3:A7)

    The idea of an average is the same whether you view your list of numbers as a complete population or as a representative sample from a larger population; however, the notation differs slightly

    For an entire population, the convention is to use N to represent the number of items and let μ (Greek letter mu) represent the population mean value

    The average may be interpreted as spreading the total evenly among the elementary units (if you replaced each data value by the average, then the total remains unchanged)

    Because the average preserves the total while spreading amounts out evenly, it is most useful as a summary when there are no extreme values (outliers) present and the data set is a more-or-less homogeneous group with randomness

    The average is the only summary measure capable of preserving the total
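
    The claim that the average preserves the total can be checked with a few lines of Python. The salary figures below are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
def average(values):
    """Add up all the values, then divide by the number of items."""
    return sum(values) / len(values)


# Hypothetical salaries, in thousands of dollars.
salaries = [30, 40, 50, 80]
typical = average(salaries)

# Replace every data value by the average: the total is unchanged.
evened_out = [typical] * len(salaries)
print("average:", typical)
print("original total:", sum(salaries), "evened-out total:", sum(evened_out))
```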

    Weighted Average

    Is like the average, except that it allows you to give a different importance, or weight, to each data item

    Gives you the flexibility to define your own system of importance when it is not appropriate to treat each item equally

    The weighted average may best be interpreted as an average to be used when some items have more importance than others; the items with greater importance have more of a say in the value of the weighted average

    When adjusting a sample, it combines the known information about each group (from the sample) with better information about each group's representation (from the population rather than the sample); since the best information of each type is used, the result is improved
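
    A weighted average can be sketched in Python as the sum of each value times its weight, divided by the total weight. The group averages and population shares below are hypothetical numbers used only to show the adjustment idea described above.

```python
def weighted_average(values, weights):
    """Average in which each item counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per value")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_total = sum(value * weight
                         for value, weight in zip(values, weights))
    return weighted_total / total_weight


# Hypothetical: average spending measured in a sample for two customer
# groups, weighted by each group's share of the population.
group_averages = [20.0, 40.0]
population_shares = [0.75, 0.25]
print(weighted_average(group_averages, population_shares))

# With equal weights, the weighted average is the ordinary average.
print(weighted_average(group_averages, [1, 1]))
```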

    o Median (halfway point): can be computed for ordered categories (ordinal data) or for numbers

    Median: the middle value; half of the items in the set are larger and half are smaller

    It must be in the center of the data and provide an effective summary of the list of data

    Find it by putting the data in order and then locating the middle value

    Might have to average the two middle values if there is no single value in the middle

    Ranks: associate the numbers 1, 2, 3, ..., n with the data values so that the smallest has rank 1, the next smallest has rank 2, and so forth up to the largest, which has rank n

    The median has rank (1+n)/2

    How does the median compare to the average?

    o When the data set is normally distributed, they will be close to one another since the normal distribution is so symmetric and has such a clear middle point

    o The average and the median will usually be a little different even for a normal distribution b/c each summarizes in a different way, and there is nearly always some randomness in real data

    o When the data set is not normally distributed, the median and average can be very different b/c a skewed distribution does not have a well-defined center point

    o Typically, the average is more in the direction of the longer tail or of the outlier than the median is because the average "knows" the actual values of these extreme observations, whereas the median knows only that each value is either on one side or on the other
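
    The rank rule for the median, and the way the average is pulled toward a long tail, can be illustrated with a short Python sketch. The income figures are hypothetical.

```python
def median_by_rank(values):
    """Find the median as the value with rank (1 + n) / 2.

    When that rank falls halfway between two whole ranks, the two
    middle values are averaged.
    """
    if not values:
        raise ValueError("need at least one data value")
    ordered = sorted(values)           # smallest value has rank 1
    n = len(ordered)
    rank = (1 + n) / 2
    if rank.is_integer():
        return ordered[int(rank) - 1]  # ranks start at 1, indexes at 0
    lower = ordered[int(rank) - 1]     # value ranked just below the middle
    upper = ordered[int(rank)]         # value ranked just above the middle
    return (lower + upper) / 2


# Hypothetical incomes (in $ thousands) with one very large value.
incomes = [30, 35, 40, 45, 400]
print("median: ", median_by_rank(incomes))
print("average:", sum(incomes) / len(incomes))
```

    The one extreme income moves the average well above the median, which only registers that the value lies on the high side.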

    o Mode (most common category) can be computed for unordered categories (nominal data), ordered categories, or numbers.

    Mode: the most common category, the one listed most often in the data set

    It is the only summary measure available for nominal qualitative data because unordered categories cannot be summed (as for the average) and cannot be ranked (as for the median)

    Easily found for ordinal data by ignoring the ordering of the categories and proceeding as if you had a nominal data set with unordered categories

    Is also defined for quantitative data (numbers), although the definition is ambiguous

    Can be defined as the value at the highest point of the histogram

    Slightly imprecise: there can be two tallest bars, and the construction of the histogram (the bar width and location) will make some changes in the shape of the distribution, so the mode can change as a result
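A minimal Python sketch of finding the mode by counting categories is shown here. The function name `modes` and the sample categories are assumptions for illustration; returning every tied category is one way to make the ambiguity visible rather than hide it.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often (more than one if there is a tie)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Nominal data (unordered categories): no sum or ranking is needed, only counting.
print(modes(["red", "blue", "red", "green"]))  # ['red']

# Quantitative data with two equally common values: the mode is ambiguous.
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```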

    Which Summary should you use?

    o The mode can be computed for any univariate data set (some ambiguity with quantitative data)


    o The average can be computed only from quantitative data (meaningful numbers)

    o The median can be computed for anything except nominal data (unordered categories)

                Quantitative   Ordinal   Nominal
    Average     Yes
    Median      Yes            Yes
    Mode        Yes            Yes       Yes

    For quantitative data, where all three summaries can be computed, how are they different?

    For a normal distribution, there is very little difference among the measures since each is trying to find the well-defined middle of that bell-shaped distribution

    With skewed data, there can be noticeable differences among them

    The average should be used when the data set is normally distributed, and in cases where the need to preserve or forecast total amounts is important since the other summaries do not do this as well

    The median can be a good summary for skewed distributions since it is not distracted by a few very large data items

    It summarizes most of the data better than the average does in cases of extreme skewness

    Also useful when outliers are present because of its ability to resist their effects

    Useful with ordinal data (ordered categories) although the mode should be considered also


    The mode must be used with nominal data (unordered categories) since the others cannot be computed.

    Also useful with ordinal data (ordered categories) when the most represented category is important

    o Biweight: a promising kind of estimate, a robust estimator, which manages to combine the best features of the average and the median

    4.2 What Percentile is it?

    Percentiles: summary measures expressing ranks as percentages from 0% to 100% rather than from 1 to n so that the 0th percentile is the smallest number, the 100th percentile is the largest, the 50th percentile is the median, and so on

    Used in two ways:

    o 1) to indicate the data value at a given percentage (as in "the 10th percentile is $156,293")

    o 2) to indicate the percentage ranking of a given data value (as in "John's performance, $296,994, was in the 55th percentile")

    Extremes, Quartiles, and Box Plots

    o One important use of percentiles is as landmark summary values

    o You can use a few percentiles to summarize important features of the entire distribution

    o The median is the 50th percentile since it is ranked halfway between the largest and smallest

    o Extremes: the smallest and largest values (0th and 100th percentiles, respectively)

    o Quartiles: defined as the 25th and 75th percentiles

    Are the data values ranked one-fourth of the way in from the smallest and largest values; there is some ambiguity as to exactly how to find them

    o Five Number Summary: defined as the following set of five landmark summaries: smallest, lower quartile, median, upper quartile, and largest

    The smallest data value (the 0th percentile)

    The lower quartile (the 25th percentile, 1/4 of the way in from the smallest)

    The median (the 50th percentile, in the middle)

    The upper quartile (the 75th percentile, 3/4 of the way in from the smallest and 1/4 of the way in from the largest)

    The largest data value (the 100th percentile)

    The two extremes indicate the range spanned by the data, the median indicates the center, the two quartiles indicate the edges of the middle half of the data, and the position of the median between the quartiles gives a rough indication of skewness or symmetry
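The five-number summary can be sketched with Python's standard `statistics` module. Because quartile rules are ambiguous, the result depends on the rule chosen: this sketch assumes the module's "inclusive" interpolation method, which may differ slightly from the rule used in the course text. The function name and data values are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (smallest, lower quartile, median, upper quartile, largest)."""
    # quantiles(..., n=4) returns the three cut points that split the data into quarters.
    lower_q, median, upper_q = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), lower_q, median, upper_q, max(values))

data = [1, 3, 5, 7, 9, 11, 13, 15, 17]   # hypothetical data values
print(five_number_summary(data))  # (1, 5.0, 9.0, 13.0, 17)
```

Here the median sits exactly midway between the quartiles, which is the rough signal of symmetry described above.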

    o Box Plot: a picture of the five-number summary

    Serves the same purpose as a histogram (provides a visual impression of the distribution) BUT in a different way


    Shows less detail and is more useful in seeing the big picture and comparing several groups of numbers without the distraction of every detail of each group

    The histogram is still preferable for a more detailed look at the shape of the distribution

    o Detailed Box Plot: is a box plot, modified to display the outliers, which are identified by labels

    Outliers: those data points (if any) that are far from the middle of the data set

    A larger data value will be declared to be an outlier if it is bigger than Upper quartile + 1.5 x (upper quartile - lower quartile)

    A smaller data value will be declared to be an outlier if it is smaller than Lower quartile - 1.5 x (upper quartile - lower quartile)

    In addition to displaying and labeling outliers, you may also label the most extreme cases that are not outliers
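As a rough sketch, the outlier rule (more than 1.5 times the distance between the quartiles beyond either quartile) might be coded as below. The function names and the example quartiles and data are assumptions for illustration; the quartiles are passed in directly so that the choice of quartile rule stays separate.

```python
def outlier_fences(lower_quartile, upper_quartile):
    """Return the (low, high) cutoffs beyond which a value counts as an outlier."""
    spread = upper_quartile - lower_quartile          # width of the middle half of the data
    return (lower_quartile - 1.5 * spread, upper_quartile + 1.5 * spread)

def find_outliers(values, lower_quartile, upper_quartile):
    """Return the values lying below the low cutoff or above the high cutoff."""
    low, high = outlier_fences(lower_quartile, upper_quartile)
    return [v for v in values if v < low or v > high]

# Hypothetical data with quartiles 5 and 13 (taken as given for the example).
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 40]
print(outlier_fences(5, 13))        # (-7.0, 25.0)
print(find_outliers(data, 5, 13))   # [40]
```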

    The Cumulative Distribution Function Displays the Percentiles

    o Cumulative distribution function: is a plot of the data specifically designed to display the percentiles by plotting the percentages against the data values

    Percentages from 0% to 100% on the vertical axis and percentiles (data values) along the horizontal axis

    o Has a vertical jump of height 1/n at each of the n data values and continues horizontally between data points

    o Finding the Percentile Ranking for a Given Number:

    1) Find the data value along the horizontal axis in the cumulative distribution function


    2) Move vertically up to the cumulative distribution function. If you hit a vertical portion, move halfway up

    3) Move horizontally to the left and read the percentile ranking
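The three steps for reading a percentile ranking off the cumulative distribution function can be imitated numerically. In this sketch (function name and data are hypothetical), the "move halfway up a vertical portion" rule is interpreted as counting half of the data values that are exactly equal to the given number.

```python
def percentile_rank(values, x):
    """Percentile ranking of x, read from the cumulative distribution of values."""
    n = len(values)
    below = sum(1 for v in values if v < x)    # height of the function just left of x
    equal = sum(1 for v in values if v == x)   # size of the vertical jump at x, in data points
    # Each data point contributes a jump of 1/n; go halfway up any jump located at x.
    return 100 * (below + 0.5 * equal) / n

data = [10, 20, 30, 40, 50]          # hypothetical data values
print(percentile_rank(data, 30))     # 50.0 (halfway up the jump at 30)
print(percentile_rank(data, 35))     # 60.0 (on the flat portion between 30 and 40)
```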

    Chapter 4 Questions

    1. What is summarization of a data set? Why is it important?

    2. List and briefly describe the different methods for summarizing a data set.

    3. How should you deal with exceptions when summarizing a set of data?

    4. What is meant by a typical value for a list of numbers? Name three different ways of finding one.


    5. What is the average? Interpret it in terms of the total of all values in the data set.

    6. What is a weighted average? When should it be used instead of a simple average?

    7. What is the median? How can it be found from its rank?

    8. How do you find the median for a data set:

    a. With an odd number of values?

    b. With an even number of values?

    9. What is the mode?

    10. How do you usually define the mode for a quantitative data set? Why is this definition ambiguous?

    11. Which summary measure(s) may be used on:

    a. Nominal data?

    b. Ordinal Data?

    c. Quantitative data?

    12. Which summary measure is best for:

    a. A normal distribution?

    b. Projecting total amounts?

    c. A skewed distribution when totals are not important?

    13. What is a percentile? In particular, is it a percentage (e.g. 23%), or is it specified in the same units as the data (e.g. $35.62)?


    14. Name two ways in which percentiles are used.

    15. What are the quartiles?

    16. What is the five-number summary?

    17. What is a box plot? What additional detail is often included in a box plot?

    18. What is an outlier? How do you decide whether a data point is an outlier or not?

    19. Consider the cumulative distribution function:

    a. What is it?

    b. How is it drawn?

    c. What is it used for?

    d. How is it related to the histogram and the box plot?

    Chapter 5: Variability Dealing with Diversity

    We need statistical analysis because there is variability in data

    Variability: the extent to which the data values differ from each other

    o Diversity, uncertainty, dispersion, and spread (similar meaning)

    Three ways of summarizing the amount of variability in a data set:

    o One, standard deviation: summarizes how far an observation typically is from the average.

    If you multiply the standard deviation by itself, you find the variance


    o Two, range: is quick and superficial and is of limited use. It summarizes the extent of the entire data set, using the distance from the smallest to the largest data value

    o Three, coefficient of variation: the traditional choice for a relative (as opposed to an absolute) variability measure and is used moderately often

    Summarizes how far an observation typically is from the average as a percentage of the average value using the ratio of standard deviation to average

    5.1 The Standard Deviation: The Traditional Choice

    Standard Deviation: a number that summarizes how far away from the average the data values typically are

    o Is the basic tool for summarizing the amount of randomness in a situation

    EXAMPLE:

    o If all numbers are the same

    5.5, 5.5, 5.5, 5.5

    The average will be X = 5.5 and the standard deviation will be S = 0

    Variability: Introduction

    Also known as dispersion, spread, uncertainty, diversity, risk

    Example data: 2, 2, 2, 2, 2, 2, 2

    Variability = 0

    Example data: 1, 3, 2, 2, 1, 2, 3

    How much variability?

    Look at how far each data value is from the average X = 2:

    Deviations from average are -1, 1, 0, 0, -1, 0, 1

    Variability should be between 0 and 1

    Examples

    Stock market, daily change, is uncertain

    Not the same, day after day!

    Risk of a business venture

    There are potential rewards, but possible losses

    Uncertain payoffs and risk aversion

    Which would you rather have?

    $1,000,000 for sure

    $0 or $2,000,000, each outcome equally likely

    Both have same average! ($1,000,000)

    Most would prefer the choice with less uncertainty


    o Most data sets have some variability

    43.0, 17.7, 8.7, -47.4

    The average is the same (X = 5.5), but the data values are different (and so is the standard deviation)

    Deviations: the distances from the average (also called residuals), indicate how far above the average (if positive) or below the average (if negative) each data value is

    The standard deviation summarizes the deviations. You can't just take an average of the deviations: since some are positive and some are negative, the end result would be zero, which is not helpful

    o Instead, the standard method:

    1) find the deviations by subtracting the average from each data value

    2) find the square of each number (multiply it by itself) to eliminate the minus sign

    3) add them up

    4) divide the resulting sum by n-1 (this is the variance)

    5) take the square root (which undoes the squaring you did earlier) (this is the standard deviation)
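The five steps can be followed one at a time in Python. This is a sketch rather than a recommended implementation (the standard library's `statistics.stdev` does the same job); the function name is an assumption, and the data are the example values 1, 3, 2, 2, 1, 2, 3 from the slide above.

```python
import math
import statistics

def sample_standard_deviation(values):
    """Sample standard deviation, computed step by step."""
    n = len(values)
    average = sum(values) / n
    deviations = [x - average for x in values]   # step 1: subtract the average
    squared = [d * d for d in deviations]        # step 2: square each deviation
    total = sum(squared)                         # step 3: add them up
    variance = total / (n - 1)                   # step 4: divide by n - 1 (the variance)
    return math.sqrt(variance)                   # step 5: take the square root

data = [1, 3, 2, 2, 1, 2, 3]
s = sample_standard_deviation(data)
print(round(s, 4))                               # 0.8165
print(math.isclose(s, statistics.stdev(data)))   # True
```

The result lies between 0 and 1, as anticipated from the deviations -1, 1, 0, 0, -1, 0, 1.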

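    Standard Deviation S

    Measures variability by answering: Approximately how far from average are the data values? (same measurement units as the data)

    For a sample:

    $$S = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}}$$

    For the population:

    $$\sigma = \sqrt{\frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + \cdots + (X_N - \mu)^2}{N}}$$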

    Example

    On the histogram:

    Average is located near the center of the distribution

    Standard deviation is a distance away from the average

    Standard deviation is the typical distance from average

    [Histogram: spending on the horizontal axis (0 to 7), frequency on the vertical axis (0 to 3); average X = 2.05, with the standard deviation S = 1.83 marked as a distance on either side of the average]


    o The variance (the square of the standard deviation) is sometimes used as a variability measure in statistics, especially by those who work directly with the formulas, but the standard deviation is a better choice

    The variance contains no extra information and is more difficult to interpret than the standard deviation in practice

    o In Excel, the standard deviation for a sample: =STDEV(B3:B6)

    Interpreting the Standard Deviation

    o The standard deviation has a simple, direct interpretation: it summarizes the typical distance from average for the individual data values; the result is a measure of the variability of these individuals

    o The standard deviation represents the typical deviation size; expect some data values to be less than one standard deviation from the average, while others will be more than one standard deviation away from the average (expect individuals to deviate to both sides of the average)

    Normal Distribution and Std. Dev.

    For a normal distribution only:

    2/3 of data within one standard deviation of the average (either above or below)

    95% for 2 std. devs.

    99.7% for 3

    [Fig 5.1.3: normal curve marking one standard deviation on either side of the average, with 2/3 of the data within one standard deviation, 95% within two, and 99.7% within three]


    Interpreting the Standard Deviation for a Normal Distribution

    o When a data set is approximately normally distributed, the standard deviation has a special interpretation: approximately two-thirds of the data values will be within one standard deviation of the average, on either side of the average

    o Expect to find about 95% of the data within two standard deviations from the average, with error rates often limited to 5%

    o Expect nearly all of the data (99.7%) to be within three standard deviations from the average

    o If data is NOT normally distributed, the above percentages do not apply

    Since there are so many different kinds of skewed (or other non-normal) distributions, there is no single exact rule that gives percentages for any distribution
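The two-thirds, 95% and 99.7% figures can be checked against the exact normal distribution using `NormalDist` from Python's standard library. This is a quick sketch; the helper name `share_within` is an assumption for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)   # average 0, standard deviation 1

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the average."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

The exact proportions are about 68.3%, 95.4% and 99.7%, so "two-thirds" and "95%" are convenient roundings of the first two.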

    The Sample and the Population Standard Deviations

    o Two different, but related kinds of standard deviation

    Sample Standard Deviation: for a sample from a larger population, denoted S

    Population Standard Deviation: for an entire population, denoted σ (lower case Greek sigma)

    The sample standard deviation is slightly larger in order to adjust for the randomness of sampling

    To resolve any remaining ambiguity, proceed as follows: if in doubt, use the sample standard deviation

    Using the larger value is usually the careful, conservative choice since it ensures that you will not be systematically understating the uncertainty

    For computation, the only difference between the two methods is that you subtract 1 (dividing by n - 1) for the sample standard deviation, but you do not subtract 1 (dividing by N) for the population (also some notation changes)


    The smaller the number of items (N or n), the larger the difference between the formulas (with reasonably large amounts of data, there is little difference between the two methods)
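A short Python sketch shows how the gap between the two formulas shrinks as the number of items grows. When both formulas are applied to the same values, the sample result is larger by a factor of sqrt(n / (n - 1)). The data values and the helper name `sd_ratio` are illustrative assumptions.

```python
import math
import statistics

orders = [185, 246, 92, 508, 153]              # illustrative data values
sample_sd = statistics.stdev(orders)           # divides by n - 1
population_sd = statistics.pstdev(orders)      # divides by N

def sd_ratio(n):
    """Factor by which the sample formula exceeds the population formula for n items."""
    return math.sqrt(n / (n - 1))

print(math.isclose(sample_sd / population_sd, sd_ratio(len(orders))))  # True
for n in (2, 5, 30, 1000):
    print(n, round(sd_ratio(n), 4))
```

The factor is large for very small data sets and approaches 1 as n grows, which is why the choice matters little with reasonably large amounts of data.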

    5.2 The Range: Quick and Superficial

    Range: the largest minus the smallest data value and represents the size or extent of the data


    o Range of data set (185, 246, 92, 508, 153)

    o = Largest - Smallest

    o = 508 - 92

    o = 416

    In Excel: =MAX(orders)-MIN(orders)

    The range is a sensible measure of diversity in some situations (such as seeking to describe the extent of the data or to search for errors)

    Because of its sensitivity to the extremes, the range is not very useful as a statistical measure of diversity in the sense of summarizing the data set as a whole

    o The range does not summarize the typical variability in the data but rather focuses too much attention on just two data values

    The standard deviation is more sensitive to all of the data & provides a better look at the big picture

    The range is never smaller than the standard deviation (the two are equal only when every data value is the same, so that both are 0)
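The range calculation, and its sensitivity to a single extreme value, can be sketched in Python. The function name `data_range` and the added value 5080 (imagined here as a data-entry error) are assumptions for illustration; the five order values are the ones from the example above.

```python
import statistics

orders = [185, 246, 92, 508, 153]

def data_range(values):
    """Largest data value minus smallest data value."""
    return max(values) - min(values)

print(data_range(orders))                        # 416
print(statistics.stdev(orders) < data_range(orders))  # True

# One mistyped value changes the range completely, which is why the range
# is handy for spotting errors but a poor summary of typical variability.
with_error = orders + [5080]                     # hypothetical data-entry error
print(data_range(with_error))                    # 4988
```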

    5.3 The Coefficient of Variation: A Relative Variability Measure

    Coefficient of Variation: defined as the standard deviation divided by the average, is a relative measure of variability as a percentage or proportion of the average

    o Most useful when there are no negative numbers in the data set

    o Note that the standard deviation is the numerator, as is appropriate because the result is primarily an indication of variability

    o The coefficient of variation has no measurement units; it is a pure number, a proportion or percentage, whose measurement units have canceled each other in the process of dividing standard deviation by average

    Makes the coefficient of variation useful in those situations where you don't care about the actual (absolute) size of the differences, and only the relative size is important

    o Using the coefficient of variation allows you to reasonably compare a large to a small firm to see which one has more variation on a size-adjusted basis

    o Can be larger than 100% even with positive numbers; this could happen with a very skewed distribution or with extreme outliers (the situation is very variable with respect to the average value)

    Coefficient of Variation = Standard Deviation / Average

    For a sample: Coefficient of Variation = $S / \bar{X}$

    For a population: Coefficient of Variation = $\sigma / \mu$


    5.4 Effects of Adding to or Rescaling the Data

    If a number is added to each data value, then this same number is added to the average, median, mode and percentiles to obtain the corresponding summaries for the new data set

    If each data value is multiplied by a fixed number, the average, median, mode, percentiles, standard deviation and range are each multiplied by this same number (by its absolute value, in the case of the standard deviation and range) to obtain the corresponding summaries for the new data set (the coefficient of variation is unaffected)

    If the data values are multiplied by a factor c and an amount d is then added, X becomes cX + d

    o The new average is c × (old average) + d; likewise for the median, mode and percentiles

    o The new standard deviation is |c| × (old standard deviation), and the range is adjusted similarly (note that the added number, d, plays no role here)
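These rescaling rules can be checked numerically. In the sketch below the factor c = -2 and the added amount d = 10 are arbitrary choices made for illustration; a negative factor is used to show why the absolute value |c| appears in the rule for the standard deviation.

```python
import math
import statistics

data = [1, 3, 2, 2, 1, 2, 3]
c, d = -2, 10                                  # hypothetical factor and added amount
rescaled = [c * x + d for x in data]           # X becomes cX + d

old_avg, new_avg = statistics.mean(data), statistics.mean(rescaled)
old_sd, new_sd = statistics.stdev(data), statistics.stdev(rescaled)

print(math.isclose(new_avg, c * old_avg + d))  # True: average follows c * (old) + d
print(math.isclose(new_sd, abs(c) * old_sd))   # True: d plays no role, and |c| is used
```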

    Coefficient of Variation

    A relative measure of variability

    The ratio: standard deviation divided by average

    For a sample: $S / \bar{X}$

    For a population: $\sigma / \mu$

    No measurement units. A pure number. Answers: Typically, in percentage terms, how far are data values from average?

    Useful for comparing situations of different sizes

    To see how variability compares after adjusting for size

    Example: Portfolio Performance

    You have invested $100 in each of 5 stocks

    Results: $116, 83, 105, 113, 98

    Average is $103, std. dev. is $13.21

    Your friend has invested $1,000 in each stock

    Results: $1,160, 830, 1,050, 1,130, 980

    Average is $1,030, std. dev. is $132.10

    Coefficients of variation are identical

    13.21/103 = 132.10/1,030 = 0.128 = 12.8%

    Typically, results for these 5 stocks were approximately 12.8% from their average value
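The portfolio example can be reproduced with a few lines of Python. The function name `coefficient_of_variation` is an assumption for this sketch; it uses the sample standard deviation, which matches the figures quoted above.

```python
import math
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the average (a pure number)."""
    return statistics.stdev(values) / statistics.mean(values)

yours = [116, 83, 105, 113, 98]              # $100 invested in each stock
friends = [1160, 830, 1050, 1130, 980]       # $1,000 invested in each stock

print(round(statistics.stdev(yours), 2))                  # 13.21
print(round(100 * coefficient_of_variation(yours), 1))    # 12.8
print(math.isclose(coefficient_of_variation(yours),
                   coefficient_of_variation(friends)))    # True
```

Multiplying every result by 10 multiplies both the standard deviation and the average by 10, so the ratio is unchanged.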


    Chapter 5 Questions

    1. What is variability?

    2. a. What is the traditional measure of variability?

    b. What other measures are also used?

    3. a. What is a deviation from the average?

    b. What is the average of all of the deviations?

    4. a. What is the standard deviation?


    b. What does the standard deviation tell you about the relationship between individual data values and the average?

    c. What are the measurement units of the standard deviation?

    d. What is the difference between the sample standard deviation and the population standard deviation?

    e. What is the difference between the sample standard deviation and the population standard deviation?

    5. a. What is the variance?

    b. What are the measurement units of the variance?

    c. Which is the more easily interpreted variability measure, the standard deviation or the variance? Why?

    d. Once you know the standard deviation, does the variance provide any additional real information about the variability?

    6. If your data set is normally distributed, what proportion of the individuals do you expect to find:

    a. Within one standard deviation from the average?

    b. Within two standard deviations from the average?

    c. Within three standard deviations from the average?

    d. More than one standard deviation from the average?

    e. More than one standard deviation above the average? (be careful)


    7. How would your answers to question 6 change if the data were not normally distributed?

    8.a. What is the range?

    b. What are the measurement units of the range?

    c. For what purpose is the range useful?

    d. Is the range a very useful statistical measure of variability? Why or why not?

    9.a. What is the coefficient of variation?

    b. What are the measurement units of the coefficient of variation?

    10. Which variability measure is most useful for comparing variability in two different situations, adjusting for the fact that the situations have very different average sizes? Justify your choice.

    11. When a fixed number is added to each data value, what happens to:

    a. The average, median and mode?

    b. The standard deviation and range?

    c. The coefficient of variation?

    12. When each data value is multiplied by a fixed number, what happens to:

    a. The average, median and mode?

    b. The standard deviation and range?


    c. The coefficient of variation?