Data Processing and Analysis of Data
PowerPoint presentation on data processing and analysis of data.
A Talk On
‘Data Processing and Analysis of Data’ (Research Methodology)
Introduction
• Collected data must be processed and analyzed in accordance with the research plan.
• This is essential for scientific study and for making comparisons.
• Processing implies:
– Editing
– Coding
– Classification
– Tabulation
• Analysis implies:
– Computation of certain measures
– Searching for patterns of relationship that exist among data groups.
Processing Operations
1. Editing
– The process of examining the collected raw data to detect errors and omissions and to correct them where possible.
– It involves scrutiny of the completed questionnaires and/or schedules.
– There are two variations of editing:
• Field editing
• Central editing
• Field editing
– Consists of a review of the reporting forms by the investigator to complete (rewrite) what was written in abbreviated form at the time of recording the response.
– This editing should be done as soon as possible after the interview.
– While doing field editing, the investigator should not correct errors or omissions by simply guessing the suitable option.
• Central editing
– Takes place when all forms or schedules have been completed and returned to the office.
– All the forms should be edited by a single editor in a small study, or by a team of editors in a large inquiry.
– Corrections are allowed in this editing.
– Editors should keep certain points in view while performing their work:
a) Editors should be familiar with the instructions given to the interviewers and coders.
b) A single line should be drawn to cross out any information.
c) Entries should be made in some distinctive color and in a standardized form.
d) They should initial all answers which they change or supply.
e) The editor’s initials and the date of editing should be placed on each completed form or schedule.
2. Coding
– Refers to the process of assigning numerals or other symbols to answers so that the responses can be put into a limited number of categories.
– Necessary for efficient analysis.
– Coding decisions are usually taken at the design stage of the questionnaire.
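As a minimal sketch of coding, the snippet below maps free-text answers to numeric codes. The code frame `CODE_FRAME`, the answer labels, the `MISSING` code and the `code_responses` helper are all hypothetical, chosen only for illustration.

```python
# Hypothetical code frame for a five-point agreement question.
CODE_FRAME = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}
MISSING = 9  # assumed code for blank or unrecognized answers


def code_responses(answers):
    """Assign a numeric code to each raw answer."""
    coded = []
    for answer in answers:
        # Normalize case and whitespace; treat None as a blank answer.
        key = (answer or "").strip().lower()
        coded.append(CODE_FRAME.get(key, MISSING))
    return coded


raw = ["Agree", "neutral", "", "Strongly Agree", None]
print(code_responses(raw))  # [4, 3, 9, 5, 9]
```

Fixing the code frame before data collection mirrors the point that coding decisions belong to the questionnaire design stage.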
3. Classification
– Individual data should be reduced into homogeneous groups to get meaningful relationships.
– Classification is the process of arranging data in groups or classes on the basis of some common characteristics.
• Broadly, there are two types of classification, based on the nature of the phenomenon involved:
a) Classification according to attributes.
b) Classification according to class-interval.
• Classification according to attributes:
– Data are classified on the basis of common characteristics, which may be either descriptive or numerical.
– Descriptive characteristics refer to qualitative phenomena which cannot be measured quantitatively.
– Data obtained this way are known as statistics of attributes.
– This classification can be either simple or manifold.
– In simple classification, we consider only one attribute and make two classes: one possessing the considered attribute and the other devoid of it.
– In manifold classification, more than one attribute is considered and the data are divided into a number of classes.
• Classification according to class-interval:
– Data relating to income, production, age etc. are known as statistics of variables and are classified on the basis of class intervals.
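The two kinds of classification can be sketched as follows. The records, the `literate` attribute, the interval width of 10 and both helper functions are hypothetical; the lower-inclusive, upper-exclusive interval convention is an assumption of this example.

```python
def classify_by_attribute(records, attribute):
    """Simple classification: split records by presence of one attribute."""
    possessing = [r for r in records if r.get(attribute)]
    devoid = [r for r in records if not r.get(attribute)]
    return possessing, devoid


def class_interval(value, width, start=0):
    """Return the (lower, upper) class interval containing value.

    Intervals are lower-inclusive and upper-exclusive, so 30 falls in 30-40.
    """
    lower = start + ((value - start) // width) * width
    return lower, lower + width


people = [
    {"name": "A", "literate": True, "age": 23},
    {"name": "B", "literate": False, "age": 41},
    {"name": "C", "literate": True, "age": 30},
]
literate, illiterate = classify_by_attribute(people, "literate")
print(len(literate), len(illiterate))                  # 2 1
print([class_interval(p["age"], 10) for p in people])  # [(20, 30), (40, 50), (30, 40)]
```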
4. Tabulation
– Tabulation refers to the process of summarizing the raw data and displaying it in compact form.
– It is essential because:
• It conserves space and reduces explanatory statements to a minimum.
• It facilitates the process of comparison.
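A one-way frequency table is perhaps the simplest form of tabulation. The sketch below assumes a list of already coded answers; `frequency_table` and the sample data are hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, frequency, percentage) rows sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, n, round(100 * n / total, 1)) for v, n in sorted(counts.items())]


coded_answers = [4, 3, 4, 5, 2, 4, 3, 5]  # hypothetical coded responses
print(f"{'code':>5} {'freq':>5} {'pct':>7}")
for value, freq, pct in frequency_table(coded_answers):
    print(f"{value:>5} {freq:>5} {pct:>6}%")
```

Eight raw answers collapse into four rows, which makes the categories easy to compare at a glance.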
Elements/Types of Analysis
• In the case of survey or experimental data, analysis involves:
– Estimating the values of unknown parameters of the population.
– Testing of hypotheses for drawing inferences.
• Categories of analysis:
a) Descriptive
b) Inferential
• Correlation analysis:
– Studies the joint variation of two or more variables to determine the amount of correlation between them.
• Causal analysis:
– Studies how one or more variables affect changes in another variable.
• Multivariate analysis:
– “All statistical methods which simultaneously analyze more than two variables on a sample of observations.”
– It involves:
a) Multiple regression analysis
b) Multiple discriminant analysis
c) Multivariate analysis of variance
d) Canonical analysis
Statistics in Research
• Statistics functions as a tool in designing research, analyzing its data and drawing conclusions therefrom.
• The important statistical measures used to summarize survey/research data are:
1) Measures of central tendency or statistical averages
2) Measures of dispersion
3) Measures of asymmetry (skewness)
4) Measures of relationship
5) Other measures
Measure of Central Tendency
– It tells us the point about which items have a tendency to cluster.
– Mean, median and mode are the most popular averages.
– Mean is also known as the arithmetic average.
– Median is the value of the middle item of a series when it is arranged in ascending or descending order.
– Mode is the most commonly or frequently occurring value in a series.
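The three averages can be computed with Python's standard `statistics` module. The series `marks` is hypothetical.

```python
import statistics

marks = [4, 7, 7, 9, 13]  # hypothetical series, already in ascending order

mean = statistics.mean(marks)      # arithmetic average: 40 / 5
median = statistics.median(marks)  # middle item of the sorted series
mode = statistics.mode(marks)      # most frequently occurring value

print(mean, median, mode)
```

Here the mean is 8 while the median and mode are both 7, since the single large value 13 pulls the mean upward.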
Measure of Dispersion
– It gives an idea of the scatter of the values of the items of a variable around the average of the series.
– Important measures of dispersion are:
a) Range
b) Mean deviation
c) Standard deviation
• Range
– Is the simplest possible measure of dispersion.
– It is defined as the difference between the values of the extreme items of a series.
• Mean deviation
– It is the average of the differences (ignoring signs) of the values of items from some average of the series.
• Standard deviation
– Most widely used measure of dispersion.
– Denoted by the symbol σ.
– Standard deviation is defined as the square root of the average of the squares of deviations, when such deviations are taken from the arithmetic mean:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}}$$
Where $X_i$ is the value of the $i$-th item, $\bar{X}$ is the arithmetic mean and $n$ is the number of items.
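A minimal sketch of the three measures of dispersion follows. The series is hypothetical, and taking the mean deviation about the arithmetic mean (rather than the median or mode) is an assumption of this example.

```python
import math


def value_range(series):
    """Difference between the extreme items of the series."""
    return max(series) - min(series)


def mean_deviation(series):
    """Average absolute deviation, here taken from the arithmetic mean."""
    mean = sum(series) / len(series)
    return sum(abs(x - mean) for x in series) / len(series)


def standard_deviation(series):
    """Square root of the average squared deviation from the mean (divides by n)."""
    mean = sum(series) / len(series)
    return math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))


series = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical series with mean 5
print(value_range(series))         # 7
print(mean_deviation(series))      # 1.5
print(standard_deviation(series))  # 2.0
```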
Measure of Asymmetry
– When the distribution of the elements in a series happens to be perfectly symmetrical, we get a symmetrical, bell-shaped curve. Technically, such a curve is described as a normal curve.
• If the curve is distorted, it is said to exhibit an asymmetrical distribution, which indicates the presence of skewness.
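As a rough sketch, skewness can be quantified with Pearson's first coefficient, (mean − mode) / σ. Choosing this particular measure is an assumption of the example; other measures of skewness exist. The helper name and both series are hypothetical.

```python
import statistics


def pearson_mode_skewness(series):
    """Pearson's first coefficient of skewness: (mean - mode) / sigma."""
    mean = statistics.mean(series)
    mode = statistics.mode(series)
    sigma = statistics.pstdev(series)  # population standard deviation (divides by n)
    return (mean - mode) / sigma


symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]      # mean and mode coincide at 3
right_tailed = [1, 1, 1, 2, 2, 3, 4, 6, 10]  # long tail toward large values
print(pearson_mode_skewness(symmetric))      # 0.0
print(pearson_mode_skewness(right_tailed))   # a positive value
```

A value of zero corresponds to a symmetrical series, while a positive value indicates a tail stretched toward the larger values.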
Measures of Relationship
– In the context of bivariate and multivariate populations, we need to know how the two or more variables in the data are related to one another.
– Association (correlation) and cause-and-effect relationships are studied using the technique of correlation and the technique of regression, respectively.
• In the case of a bivariate population:
– Correlation can be studied through:
a) Cross tabulation
b) Charles Spearman’s coefficient of correlation
c) Karl Pearson’s coefficient of correlation
– Cause-and-effect relationships can be studied through the simple regression technique.
1. Cross tabulation:
– Useful when the data are in nominal form.
– Classify each variable into two or more categories and then cross-classify the variables in these categories.
– The interaction between them can be:
• Symmetrical
• Reciprocal
• Asymmetrical
• In a symmetrical relationship, the two variables vary together.
• In a reciprocal relationship, the two variables mutually influence or reinforce each other.
• In an asymmetrical relationship, one variable (the independent variable) is responsible for another variable (the dependent variable).
2. Charles Spearman’s coefficient of correlation:
– This technique deals with ordinal data, where ranks are given to the different values of the variables.
– The objective is to determine the extent to which the two sets of rankings are similar or dissimilar.
3. Karl Pearson’s coefficient of correlation:
– Most widely used method of measuring the degree of relationship between two variables.
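The three bivariate techniques above can be sketched in a few lines. All data and function names are hypothetical; the Spearman formula used here, 1 − 6Σd² / (n(n² − 1)), assumes rankings without ties.

```python
import math
from collections import Counter


def cross_tabulate(xs, ys):
    """Count how often each (x, y) pair of categories occurs together."""
    return Counter(zip(xs, ys))


def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for two rankings without ties."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def pearson_r(xs, ys):
    """Karl Pearson's coefficient of correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Nominal data: cross-classify two attributes.
gender = ["M", "F", "F", "M", "F"]
smoker = ["yes", "no", "no", "no", "yes"]
print(cross_tabulate(gender, smoker)[("F", "no")])      # 2

# Ordinal data: two judges ranking the same five items.
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # close to 0.8

# Measured data: a perfectly linear relationship.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))     # 1.0
```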
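• Simple regression analysis:
– Regression is the determination of a statistical relationship between two or more variables. In simple regression there are two variables, where one variable (the independent variable) is the cause of the behavior of the other (the dependent variable).
– If X is the independent variable and Y is the dependent variable, then the regression equation of Y on X is:
$$\hat{Y} = a + bX$$
Where $\hat{Y}$ is the estimated value of Y, $a$ is the intercept and $b$ is the slope of the regression line.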
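One common way to estimate the constants of the regression of Y on X is the method of least squares. The sketch below assumes that method; `fit_simple_regression` and the data are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates of a and b in Y = a + bX."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx    # slope
    a = my - b * mx  # intercept: the line passes through the point of means
    return a, b


x = [1, 2, 3, 4, 5]   # hypothetical independent variable
y = [3, 5, 7, 9, 11]  # hypothetical dependent variable, generated as 1 + 2x
a, b = fit_simple_regression(x, y)
print(a, b)           # 1.0 2.0
print(a + b * 6)      # estimated Y for X = 6: 13.0
```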
• In the case of a multivariate population:
– Correlation can be studied through:
a) Coefficient of multiple correlation.
b) Coefficient of partial correlation.
– Cause-and-effect relationships can be studied through multiple regression equations.
1. Multiple correlation and regression
– When there are two or more independent variables, the analysis of the relationship is known as multiple correlation.
– The equation describing such a relationship is known as the multiple regression equation.
• In the context of two independent variables and one dependent variable, the equation can be given as:
$$\hat{Y} = a + b_1 X_1 + b_2 X_2$$
Where $X_1$ and $X_2$ are the independent variables and $a$, $b_1$, $b_2$ are constants estimated from the data.
• Partial correlation:
– Partial correlation measures separately the relationship between two variables in such a way that the effect of the other related variables is eliminated.
– In other words, the aim is to measure the relation between a dependent variable and a particular independent variable while holding all other variables constant.
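The sketch below illustrates both ideas for the case of two independent variables. It assumes least-squares estimation solved through the normal equations, and the standard first-order partial correlation formula built from pairwise correlation coefficients. Function names and data are hypothetical.

```python
import math


def _centered_sum(us, vs):
    """Sum of products of deviations of us and vs from their means."""
    mu, mv = sum(us) / len(us), sum(vs) / len(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))


def fit_two_predictor_regression(x1, x2, y):
    """Least-squares a, b1, b2 in Y = a + b1*X1 + b2*X2."""
    s11, s22 = _centered_sum(x1, x1), _centered_sum(x2, x2)
    s12 = _centered_sum(x1, x2)
    s1y, s2y = _centered_sum(x1, y), _centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2  # zero when X1 and X2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return a, b1, b2


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y with the effect of Z eliminated."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]  # generated with a=1, b1=2, b2=3
print(fit_two_predictor_regression(x1, x2, y))   # approximately (1.0, 2.0, 3.0)
print(partial_correlation(0.8, 0.6, 0.5))        # roughly 0.72
```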
Other Measures
1. Index number:
– Used when the series are expressed in different units.
– In such a scenario, each series is converted into a series of index numbers.
– For example, the given figures can be expressed as percentages of some base figure.
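A minimal sketch of the conversion, assuming the first period is taken as the base (index 100). The two series, their units and the helper `to_index_numbers` are hypothetical.

```python
def to_index_numbers(series, base_position=0):
    """Express each figure as a percentage of the base-period figure (base = 100)."""
    base = series[base_position]
    return [round(100 * value / base, 1) for value in series]


output_tonnes = [250, 275, 300]  # hypothetical series measured in tonnes
price_rupees = [40, 42, 50]      # hypothetical series measured in rupees
print(to_index_numbers(output_tonnes))  # [100.0, 110.0, 120.0]
print(to_index_numbers(price_rupees))   # [100.0, 105.0, 125.0]
```

Once both series are expressed as index numbers, their movements can be compared directly even though the original units differ.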
2. Time-series analysis:
– When the data collected relate to some time period concerning a given phenomenon, particularly in economic and business contexts, such data are labeled ‘time series’.
– Factors affecting such series are:
I. Secular trend (T): changes taking place over a long period of time.
II. Short-time oscillations: changes taking place over a short period of time.
• Short-time oscillations are affected by the following factors:
a) Cyclical fluctuations (C): fluctuations resulting from business cycles.
b) Seasonal fluctuations (S): fluctuations of short duration occurring in a regular sequence at specific intervals of time.
c) Irregular fluctuations (I): fluctuations that take place in a completely unpredictable fashion.
• For analyzing time series there are two models:
a) Multiplicative model
b) Additive model
– The multiplicative model assumes that the various components interact in a multiplicative manner to produce the given values of the overall time series, and can be stated as:
$$Y = T \times C \times S \times I$$
– The additive model considers the total of the various components as producing the given values of the overall time series, and can be stated as:
$$Y = T + C + S + I$$
Where Y is the observed value of the time series.
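The two models can be illustrated by composing one observed value from its components. All component values are hypothetical. Expressing C, S and I as ratios around 1 in the multiplicative case, and in the units of the series in the additive case, is a convention assumed for this sketch.

```python
def multiplicative_model(t, c, s, i):
    """Y = T x C x S x I, with C, S and I given as ratios around 1."""
    return t * c * s * i


def additive_model(t, c, s, i):
    """Y = T + C + S + I, with every component in the units of the series."""
    return t + c + s + i


# Hypothetical components for a single period.
print(multiplicative_model(t=200, c=1.5, s=0.5, i=1.0))  # 150.0
print(additive_model(t=200, c=10, s=-20, i=3))           # 193
```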