technology - winona state university - problem 1 – …course1.winona.edu/bdeppa/stat...



STAT 425 – Modern Methods of Data Analysis
Assignment 1 – OLS Regression (105 points)

PROBLEM 1 – ASKING PRICE FOR USED CARS

These data come from a study of the asking price for different makes and models of cars on the used car market. The response of interest is asking price, and the remaining variables are potential predictors. The data frames to use in R are called Usedcars.working and Usedcars; the latter includes the make and model information for these cars. For developing OLS regression models it will be easier to use the Usedcars.working data frame, which you will probably want to rename. These data are also in the file Usedcars.JMP linked to the website.

Variable   Info       Description
asking     Response   Asking price for a used car
year       Predictor  Model year
numopt     Predictor  Number of options
miles      Predictor  Miles on odometer
pricenew   Predictor  Price of car new
loanval    Predictor  Remainder of original loan amount left to pay
avgretail  Predictor  Current blue book value

Grading rubric (25 points)
- Fitting base model, critiquing it, and discussing any deficiencies. (5 pts.)
- Model development, documentation, and discussion. (15 pts.)
    - Consideration of assumptions
    - Possible predictor transformations
    - Stepwise procedures
- Fitting final model, critiquing it, interpreting it, and discussing any deficiencies. (5 pts.)
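As a rough illustration of the first rubric steps (fit a base model, look at diagnostics, try a stepwise search), here is a minimal R sketch. It does not use the real Usedcars.working data: the data frame `uc` is a simulated stand-in with the same column names, and every value and coefficient in it is invented only so the code runs on its own. The cutoffs used to flag cases (|standardized residual| > 2, leverage > 2p/n) are common rules of thumb, not requirements of the assignment.

```r
# Simulated stand-in for Usedcars.working (ASSUMPTION: all values are invented;
# only the column names come from the variable table above).
set.seed(425)
n <- 60
uc <- data.frame(
  year     = sample(1985:1995, n, replace = TRUE),
  numopt   = sample(0:10, n, replace = TRUE),
  miles    = round(runif(n, 5000, 120000)),
  pricenew = round(runif(n, 9000, 30000))
)
uc$avgretail <- round(3000 + 0.5 * uc$pricenew - 0.03 * uc$miles + rnorm(n, 0, 500))
uc$loanval   <- round(0.8 * uc$avgretail + rnorm(n, 0, 300))
uc$asking    <- round(1.05 * uc$avgretail + rnorm(n, 0, 400))

# Base model: asking price regressed on every other column.
base.lm <- lm(asking ~ ., data = uc)
print(summary(base.lm))

# Numeric versions of the usual diagnostics.
std.res <- rstandard(base.lm)          # standardized residuals
lev     <- hatvalues(base.lm)          # leverages
p       <- length(coef(base.lm))       # number of estimated coefficients
flagged <- which(abs(std.res) > 2 | lev > 2 * p / n)
print(flagged)

# The four standard diagnostic plots (only drawn in an interactive session).
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(base.lm)
}

# Backward stepwise search by AIC, starting from the base model.
step.lm <- step(base.lm, direction = "backward", trace = 0)
print(formula(step.lm))
```

With the real data, `uc` would simply be a renamed copy of Usedcars.working, and the simulation lines would be dropped.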


PROBLEM 2 – THE BOSTON HOUSING DATA

The Boston Housing data set was the basis for a 1978 paper by Harrison and Rubinfeld, which discussed approaches for using housing market data to estimate the willingness to pay for clean air. The authors employed a hedonic price model, based on the premise that the price of the property is determined by structural attributes (such as size, age, condition) as well as neighborhood attributes (such as crime rate, accessibility, environmental factors). This type of approach is often used to quantify the effects of environmental factors that affect the price of a property.

Data were gathered for 506 census tracts in the Boston Standard Metropolitan Statistical Area (SMSA) in 1970, collected from a number of sources including the 1970 US Census and the Boston Metropolitan Area Planning Committee. The variables used to develop the Harrison-Rubinfeld housing value equation are listed in the table below and are available in R as the data frame Boston.working.

Variables Used in the Harrison-Rubinfeld Housing Value Equation

VARIABLE TYPE                    DEFINITION                                                           SOURCE
CMEDV    Dependent Variable (Y)  Median value of homes in thousands of dollars                        1970 U.S. Census
RM       Structural              Average number of rooms                                              1970 U.S. Census
AGE      Structural              % of units built prior to 1940                                       1970 U.S. Census
B        Neighborhood            Black % of population                                                1970 U.S. Census
LSTAT    Neighborhood            % of population that is lower socioeconomic status                   1970 U.S. Census
CRIM     Neighborhood            Crime rate                                                           FBI (1970)
ZN       Neighborhood            % of residential land zoned for lots > 25,000 sq. ft.                Metro Area Planning Commission (1972)
INDUS    Neighborhood            % of non-retail business acres (proxy for industry)                  Mass. Dept. of Commerce & Development (1965)
TAX      Neighborhood            Property tax rate                                                    Mass. Taxpayers Foundation (1970)
PTRATIO  Neighborhood            Pupil-teacher ratio                                                  Mass. Dept. of Ed ('71-'72)
CHAS     Neighborhood            Dummy variable indicating proximity to Charles River (1 = on river)  1970 U.S. Census Tract maps
DIS      Accessibility           Weighted distances to major employment centers in area               Schnare dissertation (unpublished, 1973)
RAD      Accessibility           Index of accessibility to radial highways                            MIT Boston Project
NOX      Air Pollution           Nitrogen oxide concentrations (pphm)                                 TASSIM


REFERENCE

Harrison, D., and Rubinfeld, D. L., “Hedonic Housing Prices and the Demand for Clean Air,” Journal of Environmental Economics and Management, 5 (1978), 81-102.

Develop a regression model for CMEDV using the available predictors in the table above. In R, use the data frame Boston.working, as that will allow you to fit the first model using the command:

> bos.lm = lm(CMEDV~.,data=Boston.working)

As the authors of the original paper were primarily interested in the role of air pollution in housing prices, that variable should be retained throughout. Your analysis should be thorough! Document the model development process by copying and pasting relevant R commands, output, and graphics into your write-up. You may also use the Boston.JMP file linked to the website, but I would like you to fit your final model from Arc using R. Include diagnostic plots for your final model from R.
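One possible way to keep NOX in the model during a stepwise search is the `scope` argument of `step()`: any term listed in the `lower` formula can never be dropped. The sketch below is only an illustration under stated assumptions. `boston.sim` is a simulated stand-in with a handful of the column names from the table (all values and coefficients are invented), not the real Boston.working data frame.

```r
# Simulated stand-in for Boston.working (ASSUMPTION: invented values,
# and only a few of the predictors from the variable table).
set.seed(1978)
n <- 200
boston.sim <- data.frame(
  NOX   = runif(n, 0.4, 0.9),
  RM    = rnorm(n, 6.3, 0.7),
  LSTAT = runif(n, 2, 35),
  AGE   = runif(n, 5, 100),
  ZN    = runif(n, 0, 90)
)
boston.sim$CMEDV <- 20 + 5 * boston.sim$RM - 0.5 * boston.sim$LSTAT -
  10 * boston.sim$NOX + rnorm(n, 0, 3)

# Full model, same form as the command given in the assignment.
bos.lm <- lm(CMEDV ~ ., data = boston.sim)

# Backward elimination by AIC; 'lower = ~ NOX' forces NOX to stay in
# every candidate model, whatever its AIC contribution.
bos.step <- step(bos.lm,
                 scope = list(lower = ~ NOX),
                 direction = "backward",
                 trace = 0)

print(formula(bos.step))
print(summary(bos.step)$coefficients)
```

With the real data, the same `scope` list could be passed to `step()` starting from the `bos.lm` fit on Boston.working.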

Grading rubric (30 points)
- Fitting base model, critiquing it, and discussing any deficiencies. (5 pts.)
- Model development, documentation, and discussion. (15 pts.)
    - Consideration of assumptions
    - Possible predictor transformations
    - Stepwise procedures
- Fitting final model, critiquing it, and discussing any deficiencies. (5 pts.)
- Discussion of the role of NOx in your final model, which was the predictor of primary interest to researchers. (5 pts.)
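For the discussion of NOX, a few base R tools may be useful: the coefficient table, a confidence interval from `confint()`, and a partial F-test from `anova()` comparing the final model with and without NOX. The sketch below is a hedged example only. `sim` is a simulated stand-in data set with invented values, and the three-predictor `final.lm` is a hypothetical final model, not a recommended answer.

```r
# Simulated stand-in data (ASSUMPTION: invented values and coefficients).
set.seed(81)
n <- 150
sim <- data.frame(
  NOX   = runif(n, 0.4, 0.9),
  RM    = rnorm(n, 6.3, 0.7),
  LSTAT = runif(n, 2, 35)
)
sim$CMEDV <- 20 + 5 * sim$RM - 0.5 * sim$LSTAT - 10 * sim$NOX + rnorm(n, 0, 3)

# Hypothetical final model and the same model with NOX removed.
final.lm  <- lm(CMEDV ~ NOX + RM + LSTAT, data = sim)
no.nox.lm <- update(final.lm, . ~ . - NOX)

# Estimate, standard error, t statistic, and p-value for NOX.
nox.est <- coef(summary(final.lm))["NOX", ]
print(nox.est)

# 95% confidence interval for the NOX coefficient.
nox.ci <- confint(final.lm, "NOX", level = 0.95)
print(nox.ci)

# Partial F-test: does adding NOX improve on the model without it?
# For a single coefficient this F statistic equals the squared t statistic.
nox.test <- anova(no.nox.lm, final.lm)
print(nox.test)
```

If the response or NOX has been transformed in the final model, the coefficient has to be interpreted on the transformed scale.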


PROBLEM 3 – LISTING PRICE OF HOMES IN THE TWIN CITIES METRO AREA

These data are contained in the TwinCities.csv file on the website. The variable descriptions are below.

Variable       Info          Description
ID             Label         MLS ID Number
Address        Label         Street Address
CITY           Label         Minneapolis, St. Paul, Shoreview, Woodbury, Maplewood, West St. Paul
STATE          Label         MN (for all)
ZIP            Label         Zip Code
ListPrice      Response (Y)  Current List Price ($)
BEDS                         # of Bedrooms
BATHS                        # of Bathrooms (can be fractional)
Location                     Name of neighborhood or region in the Twin Cities metro area. Don’t use for this assignment!
SQFT                         Square footage of home (ft²)
LotSize                      Square footage of lot (ft²) – missing for several of the homes in these data.
YearBuilt                    Year the home was built; could be used to create a new variable called Age = 2014 - YearBuilt
ParkingSpots                 # of Parking Spots (I assume off-street parking)
HasGarage      Nominal       Garage or No Garage
DOM                          Days on the market, i.e. the number of days the home has been listed for sale.
BeenReduced    Nominal       Has the price been reduced from the original listing price? (Y or N)
OriginalList   -------       Original listing price. Don’t use as a predictor!!!
BeenReduced2                 Has the price been reduced from the original listing price? (Y or N) – this is calculated differently than the one above. Use one or the other BUT NOT both!
ReductAmt      -------       Amount of the reduction from the original listing price if it has been reduced. Don’t use as a predictor!!!
PerReduct      -------       Percent reduction from the original listing price. I wouldn’t use this predictor either, but it might be OK to use.
LastSaleDate   Date          MM/DD/YY of most recent previous sale of the home. Do not use!
LastSaleDiff   -------       Current List Price – Last Sale Price. Don’t use!
SoldPrev       Nominal       Has the home been sold previously? (Y or N) This one should be OK to use!
LastSalePrice                Price the home sold for the last time it sold. Don’t use!
Realty                       Realty company the home is listed with. Don’t use!
Latitude                     Latitude (degrees)
Longitude                    Longitude (degrees)
ShortSale                    Is more money owed on the home than what the asking price is? (Y or N)
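The notes in the table imply some data preparation before any model is fit: creating Age, dropping the variables marked "Don't use", keeping only one of the two BeenReduced indicators, and deciding what to do about missing LotSize values. A minimal R sketch of that preparation follows. The data frame `homes` is a tiny hand-made stand-in with invented values and only some of the columns. Dropping the Label columns, keeping BeenReduced rather than BeenReduced2, and using complete cases for LotSize are choices made here for illustration, not instructions from the assignment.

```r
# Tiny hand-made stand-in for TwinCities.csv (ASSUMPTION: invented values,
# subset of columns). With the real file one would start from something like
#   homes <- read.csv("TwinCities.csv")
homes <- data.frame(
  ListPrice    = c(250000, 180000, 420000, 310000),
  BEDS         = c(3, 2, 4, 3),
  BATHS        = c(2, 1, 2.5, 1.75),
  SQFT         = c(1800, 1100, 2600, 2000),
  LotSize      = c(7500, NA, 12000, 9000),
  YearBuilt    = c(1965, 1922, 2004, 1988),
  BeenReduced  = c("Y", "N", "N", "Y"),
  BeenReduced2 = c("Y", "N", "Y", "Y"),
  OriginalList = c(260000, 180000, 420000, 325000),
  ReductAmt    = c(10000, 0, 0, 15000),
  Location     = c("A", "B", "C", "D"),
  stringsAsFactors = TRUE
)

# New predictor suggested in the variable table.
homes$Age <- 2014 - homes$YearBuilt

# Columns to leave out of the modeling data frame: labels, the variables
# marked "Don't use", one of the two reduction indicators, and YearBuilt
# (it carries the same information as Age).
excluded <- c("ID", "Address", "CITY", "STATE", "ZIP", "Location",
              "OriginalList", "ReductAmt", "LastSaleDate", "LastSaleDiff",
              "LastSalePrice", "Realty", "BeenReduced2", "YearBuilt")
homes.working <- homes[, setdiff(names(homes), excluded)]

# Count missing values per column, then keep complete cases only.
n.missing <- colSums(is.na(homes.working))
print(n.missing)
homes.complete <- na.omit(homes.working)
print(names(homes.complete))
```

An alternative to `na.omit()` would be to leave LotSize out of the model and keep all homes; which is preferable depends on how many lot sizes are missing in the real file.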

Grading rubric (35 points)
- Fitting base model, critiquing it, and discussing any deficiencies. (5 pts.)
- Model development, documentation, and discussion. (15 pts.)
    - Consideration of assumptions
    - Possible predictor transformations
    - Stepwise procedures
- Fitting final model, critiquing it, interpreting it, and discussing any deficiencies. (5 pts.)
