Chapter 15 Multiple Linear Regression Analysis

Post on 25-Dec-2015


TRANSCRIPT

  • Slide 1
  • Chapter 15 Multiple Linear Regression Analysis
  • Slide 2
  • Outline: multiple linear regression; choice of independent variables; applications
  • Slide 3
  • Goal: construct a multiple linear regression model to assess the relationship between one dependent variable and a set of independent variables. Data: the dependent variable is quantitative; the independent variables are all or mostly quantitative, and any qualitative or ranked variables must first be recoded. Application: explanation and prediction. Significance: most outcomes are influenced by many factors, so the change of the dependent variable may be influenced by several independent variables at once. For example, the blood sugar of diabetes patients may be affected by many biochemical indices such as insulin, glycosylated hemoglobin, total serum cholesterol, triglyceride and so on.
  • Slide 4
  • 1 Multiple linear regression
  • Slide 5
  • Variables: one dependent variable y and m independent variables, m+1 variables in total; sample size n; data form as in Table 15-1. (1) Multiple linear regression model. The general regression equation is y = β0 + β1x1 + β2x2 + … + βmxm + e, so the dependent variable y is approximated by a linear function of the independent variables (x1, x2, …, xm). β0 is the constant term; β1, β2, …, βm are partial regression coefficients: βj is the mean change in y when xj increases or decreases by one unit while the other independent variables are held constant. The residual e is the random error in y that the m independent variables do not explain.
  • Slide 6
  • Table 15-1 Data form of multiple regression. Qualifications: (1) there is a linear relationship between y and x1, x2, …, xm; (2) the measured values yi (i = 1, 2, …, n) of the cases are independent of one another; (3) the residual e is independent and normally distributed with mean 0 and variance σ². This is equivalent to requiring that for any values of the independent variables x1, x2, …, xm, the dependent variable y has the same variance and follows a normal distribution.
  • Slide 7
  • General process: (1) estimate the partial regression coefficients and construct the regression equation; (2) test and evaluate the regression equation and the effect of each independent variable.
  • Slide 8
  • 2 Construction of the multiple linear regression equation. Case 15-1: the measured values of total serum cholesterol, triglyceride, insulin, glycosylated hemoglobin and fasting blood glucose are listed in Table 15-2. Construct the multiple linear regression equation of blood sugar on the other indices.
  • Slide 9
  • Table 15-2 Blood sugar of 27 diabetes cases and measured values of the related variables
  • Slide 10
  • Principle: least squares. The partial regression coefficients are estimated by minimizing the residual sum of squares; setting the partial derivative with respect to each coefficient to zero yields the normal equations.
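The least-squares principle above can be sketched numerically. The following Python/NumPy fragment is my own illustration with made-up data (not the Table 15-2 measurements): it builds a design matrix with an intercept column and solves the least-squares problem that the normal equations define.

```python
import numpy as np

def fit_ols(X, y):
    """Return least-squares estimates b = (b0, b1, ..., bm).

    Equivalent to solving the normal equations (X'X) b = X'y obtained by
    setting the partial derivatives of the residual sum of squares to zero.
    """
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])   # design matrix with intercept
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return b

# Simulated data: y = 1 + 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=30)
b = fit_ols(X, y)   # b approximately (1, 2, -3)
```

`np.linalg.lstsq` is used rather than inverting X'X directly because it is numerically more stable; the fitted solution is the same.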
  • Slide 11
  • 3 Hypothesis testing and evaluation. (1) For the regression equation as a whole: 3.1.1 analysis of variance.
  • Slide 12
  • Table 15-3 Framework of analysis of variance for multiple linear regression; Table 15-4 Analysis of variance of Case 15-1
  • Slide 13
  • A. Coefficient of determination R²: the ratio of the regression sum of squares to the total sum of squares, i.e. the proportion of the total variation of y explained by the regression.
  • Slide 14
  • B. Multiple correlation coefficient: R = √R², the correlation between y and its fitted values; it measures the linear association between y and the whole set of independent variables.
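The ANOVA decomposition, F statistic, R² and R described above can be computed directly from a fitted equation. This sketch uses simulated data (my illustration, not Case 15-1):

```python
import numpy as np

# Simulated data with a real linear signal
rng = np.random.default_rng(1)
n, m = 40, 3
X = rng.normal(size=(n, m))
y = 0.5 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
yhat = Xd @ b

ss_total = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_res   = np.sum((y - yhat) ** 2)          # residual sum of squares
ss_reg   = ss_total - ss_res                # regression sum of squares

F  = (ss_reg / m) / (ss_res / (n - m - 1))  # ANOVA F, df = (m, n-m-1)
R2 = ss_reg / ss_total                      # coefficient of determination
R  = np.sqrt(R2)                            # multiple correlation coefficient
```

A large F (relative to the F distribution with m and n-m-1 degrees of freedom) rejects the hypothesis that all partial regression coefficients are zero.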
  • Slide 15
  • (2) For each independent variable: the effect of each independent variable on y should be shown clearly in the equation (the analysis of variance and the test of the coefficient of determination above assess the equation as a whole). A. Sum of squares for partial regression. Significance: in the equation, the partial regression sum of squares of one independent variable Xj measures the contribution of Xj to the dependent variable y in the presence of the other m-1 independent variables. That is, it is the decrease of the regression sum of squares after Xj is excluded from the equation; equivalently, starting from the other m-1 independent variables, it is the increase of the regression sum of squares when Xj is added.
  • Slide 16
  • The partial regression sum of squares measures importance: the larger it is, the more important the corresponding independent variable. In general, the equation with the remaining m-1 independent variables should be refitted to obtain it, rather than simply deleting the Xj term from the equation with all m independent variables.
  • Slide 17
  • Table 15-5 Partial results of the regression analysis of Case 15-1. The partial regression sum of squares of each independent variable can be computed by fitting regression equations from different sets of independent variables; Table 15-5 gives partial results for Case 15-1.
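The refitting procedure just described can be sketched as follows: fit the full equation, then refit with each variable removed in turn; the drop in the regression sum of squares is that variable's partial regression sum of squares. The data are simulated for illustration (x1 and x3 matter, x2 does not):

```python
import numpy as np

def reg_ss(X, y):
    """Regression sum of squares of an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    return np.sum((y - y.mean()) ** 2) - np.sum(resid ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, 1.0]) + rng.normal(size=50)

full = reg_ss(X, y)
# partial SS of variable j = SS_reg(full model) - SS_reg(model without j)
partial_ss = [full - reg_ss(np.delete(X, j, axis=1), y)
              for j in range(X.shape[1])]
```

Here `partial_ss[0]` and `partial_ss[2]` come out large (those variables carry real signal) while `partial_ss[1]` is near zero.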
  • Slide 18
  • results
  • Slide 19
  • B. t test: a method equivalent to the test based on the partial regression sum of squares. The formula is t = bj / SE(bj), where bj is the estimated value of the partial regression coefficient and SE(bj) is the standard error of bj.
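A minimal sketch of this t test, with made-up data: the standard errors come from the diagonal of the estimated covariance matrix s²(X'X)⁻¹ of the coefficients.

```python
import numpy as np

# Simulated data: x1 has a strong effect (coefficient 3), x2 has none
rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 2))
y = 1.0 + 3.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ b
df = n - Xd.shape[1]                     # residual degrees of freedom
s2 = resid @ resid / df                  # residual variance estimate
cov = s2 * np.linalg.inv(Xd.T @ Xd)      # covariance matrix of the estimates
se = np.sqrt(np.diag(cov))               # standard errors SE(bj)
t = b / se                               # t statistics, df = n - m - 1
```

Each t value is referred to the t distribution with n-m-1 degrees of freedom; here |t| for x1 is far larger than for x2.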
  • Slide 20
  • results
  • Slide 21
  • C. Standardized regression coefficients. A variable is standardized by subtracting its mean from the original data and then dividing by its standard deviation. The regression equation fitted to standardized variables is called the standardized regression equation, and its coefficients are standardized regression coefficients. A standardized regression coefficient has no unit, so it can be used to compare the strength of the effect of each independent variable Xj on y. Generally, among the statistically significant variables, the larger the absolute value of the standardized regression coefficient, the more important the effect of the corresponding independent variable on y.
  • Slide 22
  • Attention: an ordinary regression coefficient bj has a unit and interprets the effect of one independent variable on the dependent variable: when the other independent variables are held constant, a one-unit increase or decrease of Xj changes y by bj on average. The bj themselves cannot be used to compare the effects of different variables. Standardized regression coefficients have no unit and can be compared: the larger the absolute value, the larger the effect of the corresponding Xj on y.
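Both routes to standardized coefficients, refitting on standardized data and rescaling the raw coefficients by SD(xj)/SD(y), give the same numbers. A sketch with made-up data in which x2 has a tiny raw coefficient but, because of its much larger scale, the bigger standardized effect:

```python
import numpy as np

def ols(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return b

# x2 is measured on a scale 100x larger than x1
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2)) * np.array([1.0, 100.0])
y = 5.0 + 2.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(size=60)

# Route 1: fit on standardized variables (mean 0, SD 1)
Zx = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Zy = (y - y.mean()) / y.std(ddof=1)
b_std = ols(Zx, Zy)[1:]                  # intercept is ~0 by construction

# Route 2: rescale the raw coefficients: b'_j = b_j * SD(x_j) / SD(y)
b_raw = ols(X, y)[1:]
b_std2 = b_raw * X.std(axis=0, ddof=1) / y.std(ddof=1)
```

Although the raw coefficient of x2 (about 0.05) looks small next to x1's (about 2), the standardized coefficients reverse the ranking, which is exactly why raw coefficients must not be compared across variables with different units.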
  • Slide 23
  • Results: as the output shows, the factors affecting blood sugar can be ranked by effect size as follows: glycosylated hemoglobin (X4), insulin (X3), triglyceride (X2), total serum cholesterol (X1).
  • Slide 24
  • 2 Choice of independent variables. Purpose: to make the prediction and/or explanation given by the equation as good as possible.
  • Slide 25
  • (1) All-subsets selection. Goal: better prediction. Idea: construct regression equations from every possible combination of the independent variables, compare them by a selection criterion, and keep the best one.
  • Slide 26
  • Slide 27
  • Slide 28
  • Case 15-2: use the all-subsets selection method to choose independent variables in Case 15-1.
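All-subsets selection can be sketched in a few lines: enumerate every nonempty subset of variables, score each fitted equation, and keep the best. The criterion here is adjusted R² (one common choice; the textbook case may use a different one), and the data are simulated rather than taken from Table 15-2:

```python
import numpy as np
from itertools import combinations

def adj_r2(cols, X, y):
    """Adjusted R^2 of an OLS fit on the given column subset."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ b
    ss_res, ss_tot = r @ r, np.sum((y - y.mean()) ** 2)
    return 1 - (ss_res / (n - len(cols) - 1)) / (ss_tot / (n - 1))

# Simulated data: variables 0 and 2 carry signal, variable 1 is noise
rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=50)

subsets = [c for k in range(1, 4) for c in combinations(range(3), k)]
best = max(subsets, key=lambda c: adj_r2(c, X, y))   # best-scoring subset
```

With m variables there are 2^m - 1 subsets, so this exhaustive search is only feasible for modest m; that is precisely the motivation for the stepwise methods on the next slides.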
  • Slide 29
  • (2) Stepwise selection. 2.1 Forward selection: introduce the independent variables into the regression equation one by one; a variable, once entered, is never reconsidered, so the equation as a whole is not re-examined. 2.2 Backward elimination: place all the independent variables into the equation, then eliminate those without statistical significance step by step. At each step, select the variable with the smallest partial regression sum of squares and apply an F test to decide whether it should be eliminated; remove it if it is not significant, and fit a new regression equation with the remaining variables. Repeat this process until no variable left in the equation can be eliminated. Theoretically it is the best of the three methods, and we strongly recommend it when it is feasible. 2.3 Stepwise regression: built on the two approaches above, it filters in both directions; essentially it is forward selection with the added possibility of removing variables that become non-significant.
  • Slide 30
  • Setting the test level: for small samples the test level is usually 0.10 or 0.15; for large samples, 0.05. A lower level means a stricter standard for selecting variables, so fewer variables are selected; a higher level means a looser standard, so more variables are chosen. Attention: the entry level of an independent variable must be lower than or equal to its removal level.
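The backward-elimination steps described above can be sketched as follows. This is my illustrative implementation with simulated data (only variables 0 and 2 carry signal), using a removal level of 0.10 as suggested for small samples:

```python
import numpy as np
from scipy import stats

def resid_ss(cols, X, y):
    """Residual sum of squares of an OLS fit on the given columns."""
    Xd = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ b
    return r @ r

def backward_eliminate(X, y, alpha_remove=0.10):
    n = len(y)
    cols = list(range(X.shape[1]))       # start with all variables
    while cols:
        df_res = n - len(cols) - 1
        full = resid_ss(cols, X, y)
        # partial F of each variable = its partial SS / residual mean square
        Fs = [(resid_ss([c for c in cols if c != j], X, y) - full)
              / (full / df_res) for j in cols]
        worst = int(np.argmin(Fs))       # least significant variable
        if stats.f.sf(Fs[worst], 1, df_res) <= alpha_remove:
            break                        # every remaining variable is kept
        cols.pop(worst)                  # eliminate it and refit
    return cols

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + rng.normal(size=60)
kept = backward_eliminate(X, y)          # the signal variables survive
```

Statistical packages implement the same loop (often reported as "Removed" steps, as in Table 15-7); this sketch only shows the mechanics.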
  • Slide 31
  • Table 15-7 the process of stepwise regression
  • Slide 32
  • Table 15-8 Analysis of variance of Case 15-3 and the best regression equation. Result: there is a linear relationship between blood sugar and insulin, glycosylated hemoglobin, total serum cholesterol and triglyceride; the relation with insulin is negative. From the standardized regression coefficients we can conclude that glycosylated hemoglobin has the largest effect on fasting blood glucose.
  • Slide 33
  • Table 15-9 Estimation and test result of regression coefficient in case 15-3
  • Slide 34
  • 3 Applications of Multiple Linear Regression and Cautions
  • Slide 35
  • 1. Application
  • Slide 36
  • 1.1. Analysis of related factors. For example, many factors can affect hypertension, such as age, diet, habits, smoking, tension, family history and so on. It is necessary to find which of these factors are closely related to the disease and which are not.
  • Slide 37
  • In clinical practice it is difficult to ensure that all groups agree in every parameter, because conditions are complicated. For example, regression can help compare two therapies when the groups disagree in age, severity of illness and so on. An easy way to control such confounding factors is to enter them into the regression equation and analyze them together with the other major variables.
  • Slide 38
  • 1.2. Estimation and prediction. For example, estimating the heart surface area of children from their transverse cardiac diameter (TCD); predicting infant weight from gestational age, head diameter, chest diameter and abdominal girth (AG).
  • Slide 39
  • 1.3. Statistical control (inverse estimation). For example, when radio-frequency therapy equipment is used to treat brain tumors, the impaired diameter of the pallium has a linear regression relation with the radio-frequency temperature and the exposure time. Once the regression equation is established, it can be used in reverse: given a target impaired diameter in advance, it helps determine the optimal radio-frequency temperature and exposure time.
  • Slide 40
  • 2 Problems in using multiple regression. 2.1. Quantification of indices. (1) Quantitative indices with a non-linear relation to y may need transforming to linearity. (2) Qualitative indices are converted to quantitative ones with (0,1) variables, also called dummy variables or indicator variables. A binary classification uses one (0,1) variable, e.g. sex: 0 = male, 1 = female. A classification with k categories uses k-1 (0,1) variables, e.g. blood type.
  • Slide 41
  • Data model and regression equation: with blood type O as the reference level, the fitted coefficients are interpreted as b1: the difference of type A compared with type O; b2: the difference of type B compared with type O; b3: the difference of type AB compared with type O.
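The dummy coding above can be made concrete. A minimal sketch, with O as the reference level coded as all zeros:

```python
# Blood type (k = 4 levels) coded as k-1 = 3 (0,1) dummy variables.
# Type O is the reference level: all three dummies are 0.
LEVELS = ["A", "B", "AB"]

def dummy_code(blood_type):
    """Return the (x1, x2, x3) dummy coding for one subject."""
    return [1 if blood_type == lv else 0 for lv in LEVELS]

rows = [dummy_code(bt) for bt in ["O", "A", "B", "AB", "A"]]
# In y = b0 + b1*x1 + b2*x2 + b3*x3 + ..., b1, b2, b3 are the
# differences of types A, B, AB from the reference type O.
```

Coding the four types as a single variable 1/2/3/4 instead would wrongly impose an ordering and equal spacing on an unordered classification; the k-1 dummy coding avoids that.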
  • Slide 42
  • (3) Ranked quantities. A rank is usually coded from strong to weak as x = 1, 2, 3, … (or x = 0, 1, 2, …). For example, education level can be classified into 4 degrees: primary school, junior or senior high school, undergraduate, graduate or PhD; y stands for income. Explanation: b (or b1) represents the average increase of y when x (or x1) increases by one unit, say 500: junior or senior high school graduates earn 500 more than primary-school graduates, undergraduates earn 500 more than junior or senior high school graduates, and so on. Note that this coding assumes equal spacing between adjacent degrees.
  • Slide 43
  • We can also change the k degrees into k-1 (0,1) variables; b1, b2, b3 then represent the income differences of junior or senior high school, undergraduate, and graduate or PhD compared with primary school, without assuming equal spacing between degrees.
  • Slide 44
  • 2.2. Sample size: n should be about 5 to 10 times the number of independent variables (n = 5m to 10m). 2.3. Stepwise regression: do not trust the stepwise result blindly. The so-called best regression equation is not by all means the best, and a variable excluded from the equation does not necessarily lack statistical significance. For example, in Case 15-3, if we change the entry probability to 0.05 and the removal probability to 0.10, the set of ultimately chosen variables changes. Which regression equation to use should be decided with professional knowledge.
  • Slide 45
  • 2.4. Multicollinearity: strong linear relationships may exist among the independent variables; for example, in a study of hypertension, age, years of smoking, years of drinking et al. are highly related. Such collinearity undermines the least-squares method used to fit the equation and can invite several negative results:
  • Slide 46
  • (1) The standard errors of the estimates become large, therefore the t values become small; (2) the regression equation becomes unstable: the estimates can change significantly when observations are added or removed; (3) inaccurate t tests may cause important variables that should be in the model to be discarded; (4) the positive and negative signs of the estimates may be inconsistent with objective reality. Elimination of multicollinearity: discard the independent variable that causes the collinearity and rebuild the regression equation, or use stepwise regression.
  • Slide 47
  • 2.5. Interaction between variables. To test whether there is an interaction between two independent variables, we usually add their product to the equation.
  • Slide 48
  • In analyzing the data in Table 15-2, three variables were chosen: triglyceride (x2), insulin (x3) and glycosylated hemoglobin (x4). Now add the product x3·x4 to the equation. If the product is statistically significant, there is an interaction between insulin and glycosylated hemoglobin. We therefore define the new variable z = x3·x4 and re-estimate the test statistics for the new equation y = b0 + b2x2 + b3x3 + b4x4 + bz·z. If the hypothesis test rejects H0: βz = 0, we conclude that there is an interaction effect in addition to the main effects of x3 and x4; in that case the term z is statistically significant.
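The product-term test can be sketched as follows. The data are simulated (my illustration, not the actual insulin and glycosylated-hemoglobin values), with a genuine interaction built in so the test rejects H0:

```python
import numpy as np

# Simulated data: y depends on x2, x3, x4 and on the product x3*x4
rng = np.random.default_rng(7)
n = 80
x2, x3, x4 = rng.normal(size=(3, n))
y = 1 + 0.5 * x2 + x3 - x4 + 2.0 * x3 * x4 + rng.normal(size=n)

z = x3 * x4                               # interaction term
Xd = np.column_stack([np.ones(n), x2, x3, x4, z])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ b
s2 = resid @ resid / (n - Xd.shape[1])    # residual variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(Xd.T @ Xd)))
t_z = b[-1] / se[-1]                      # t statistic for H0: beta_z = 0
```

A large |t_z| (referred to the t distribution with n-5 degrees of freedom) rejects H0: βz = 0, i.e. the effect of x3 on y depends on the level of x4.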