SUBMISSION: Docx document and JMP file with your responses. (You should save each regression as a script. In JMP: on the Response output, use Red button > Save Script.)
- (100 points): Please use this Kaggle dataset, Startupdata_kaggle.csv, on profit and spend data from 50 startups. This data has five variables (there are a few zeros).
(Note: We do not know whether these firms really had no expenditures, but let us assume that this is the case. It is not uncommon for startups to not invest in marketing or R&D; they have to focus on product development.) Delete these “0” row observations for the analysis.
- R&D Spending (x1) ($)
- Administrative Spending (x2) ($)
- Marketing Spending (x3) ($)
- State where the startup is registered (x4)
- Profit: This is the Y variable. Your goal is to develop a model for explaining profit as a function of the 4 variables above.
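For anyone cross-checking the JMP work in Python, the zero-row deletion described above can be sketched as follows. The column names are an assumption (they follow the public Kaggle "50 Startups" file and may differ in Startupdata_kaggle.csv), and the small table is made-up stand-in data, not the assignment data.

```python
import pandas as pd

# Assumed column names; rename to match the headers in Startupdata_kaggle.csv.
SPEND_COLS = ["R&D Spend", "Administration", "Marketing Spend"]


def drop_zero_rows(df, cols=SPEND_COLS):
    """Return a copy of df without rows where any spending column is 0."""
    keep = (df[cols] != 0).all(axis=1)
    return df.loc[keep].reset_index(drop=True)


# Made-up stand-in for: df = pd.read_csv("Startupdata_kaggle.csv")
demo = pd.DataFrame({
    "R&D Spend": [100000.0, 0.0, 500.0, 60000.0],
    "Administration": [130000.0, 135000.0, 50000.0, 150000.0],
    "Marketing Spend": [400000.0, 0.0, 0.0, 90000.0],
    "State": ["New York", "California", "New York", "Florida"],
    "Profit": [190000.0, 40000.0, 35000.0, 95000.0],
})

clean = drop_zero_rows(demo)
print(f"{len(demo)} rows -> {len(clean)} rows after dropping zeros")
```

The filter removes a row if any one of the three spending columns is zero, which matches the instruction to delete the "0" row observations.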
- (10) Conduct an exploratory analysis of the data: scatterplots, distributions.
- (15) Develop the multiple linear regression of profit (the y response variable) as a function of R&D spending, marketing spending, and administrative spending. Report and evaluate the fit of this regression using all the metrics we covered in class for MLR:
- Parameter tests (t-tests on the individual coefficients)
- R² and adjusted R²
- Global fit
- Standard error of the regression
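As a possible Python cross-check of the JMP output, the four metrics listed above can be read off a statsmodels OLS fit. This is only a sketch: the short column names (rd, admin, marketing, profit) are hypothetical, and the data are simulated so the block runs on its own; swap in the cleaned assignment data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (NOT the assignment data); hypothetical column names.
rng = np.random.default_rng(0)
n = 46
df = pd.DataFrame({
    "rd": rng.uniform(2e4, 1.6e5, n),
    "admin": rng.uniform(5e4, 1.8e5, n),
    "marketing": rng.uniform(5e4, 4.5e5, n),
})
df["profit"] = 5e4 + 0.8 * df["rd"] + 0.03 * df["marketing"] + rng.normal(0, 9e3, n)

fit = smf.ols("profit ~ rd + admin + marketing", data=df).fit()

# Parameter tests: coefficient estimates with their t statistics and p-values.
print(pd.DataFrame({"coef": fit.params, "t": fit.tvalues, "p": fit.pvalues}))

# R-squared and adjusted R-squared.
print("R2 =", round(fit.rsquared, 4), " adj R2 =", round(fit.rsquared_adj, 4))

# Global fit: overall F test that all slopes are zero.
print("F =", round(fit.fvalue, 2), " p =", fit.f_pvalue)

# Standard error of the regression: square root of the residual mean square.
see = float(np.sqrt(fit.mse_resid))
print("Standard error of regression =", round(see, 2))
```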
- (10) Test the model assumptions (normality, independence, and constant variance) using appropriate test statistics and visuals. Is there a heteroscedasticity problem? (Please remember to use the JMP-based chi-square tests for heteroscedasticity we discussed in class: the BP and White tests.) What could fix this? (Feel free to provide corroborating test statistics from Python.)
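One way to produce the corroborating Python statistics mentioned above is sketched below, using the Breusch-Pagan and White tests from statsmodels plus a Shapiro-Wilk normality test and the Durbin-Watson statistic. The column names are hypothetical and the data are simulated with noise that grows with R&D spending, so this only illustrates the calls; it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson

# Simulated stand-in data (NOT the assignment data); hypothetical column names.
rng = np.random.default_rng(1)
n = 46
df = pd.DataFrame({
    "rd": rng.uniform(2e4, 1.6e5, n),
    "admin": rng.uniform(5e4, 1.8e5, n),
    "marketing": rng.uniform(5e4, 4.5e5, n),
})
# Noise standard deviation proportional to rd, to mimic non-constant variance.
df["profit"] = (5e4 + 0.8 * df["rd"] + 0.03 * df["marketing"]
                + rng.normal(0, 1, n) * 0.1 * df["rd"])

fit = smf.ols("profit ~ rd + admin + marketing", data=df).fit()
resid = fit.resid
exog = fit.model.exog  # design matrix, including the intercept column

# Constant variance: both tests return (LM stat, LM p-value, F stat, F p-value).
bp_lm, bp_p, _, _ = het_breuschpagan(resid, exog)
white_lm, white_p, _, _ = het_white(resid, exog)

# Normality of residuals (Shapiro-Wilk) and independence (Durbin-Watson).
shapiro_w, shapiro_p = stats.shapiro(resid)
dw = durbin_watson(resid)

print(f"Breusch-Pagan: LM = {bp_lm:.2f}, p = {bp_p:.4f}")
print(f"White:         LM = {white_lm:.2f}, p = {white_p:.4f}")
print(f"Shapiro-Wilk:  W = {shapiro_w:.3f}, p = {shapiro_p:.4f}")
print(f"Durbin-Watson: {dw:.2f}")
```

The LM statistics are the chi-square versions of the tests, so they are the ones to compare against the JMP chi-square output.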
- (10) Test for multicollinearity and identify if any transformations are needed.
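A common multicollinearity check is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the other predictors. The sketch below computes it directly with NumPy on simulated data; the function name vif and the demo columns are hypothetical, and the often-quoted cutoff of 10 is a rule of thumb rather than a fixed standard.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (no intercept column)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Simulated demo: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

In the demo the first two VIFs are very large because x1 and x2 carry nearly the same information, while the third stays close to 1.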
- (10) Identify leverage and influence points and then take appropriate measures to address them.
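Leverage and influence can be illustrated with the hat-matrix diagonal and Cook's distance. The sketch below uses plain NumPy on simulated data with one planted unusual point; the function name and the cutoffs (leverage above 2p/n, Cook's D above 4/n) are assumptions based on common rules of thumb, so use whichever thresholds were given in class.

```python
import numpy as np


def leverage_and_cooks(X, y):
    """Return (leverage h, Cook's distance, p) for an OLS fit with intercept."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    n, p = A.shape                           # p counts the intercept too
    H = A @ np.linalg.inv(A.T @ A) @ A.T     # hat matrix
    h = np.diag(H)                           # leverage of each observation
    resid = y - H @ y
    s2 = (resid @ resid) / (n - p)           # residual mean square
    cooks = resid**2 * h / (p * s2 * (1.0 - h) ** 2)
    return h, cooks, p


# Simulated demo with one planted high-leverage, high-influence point (row 0).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 30)
y = 2 + 3 * x + rng.normal(0, 1, 30)
x[0], y[0] = 30.0, 52.0                      # far out in x and well off the line

h, cooks, p = leverage_and_cooks(x, y)
n = len(y)
flagged = np.where((h > 2 * p / n) | (cooks > 4 / n))[0]
print("flagged rows:", flagged)
print("largest Cook's D at row", int(np.argmax(cooks)))
```

Flagged rows are candidates for closer inspection, not automatic deletion; a reasonable follow-up is to refit without them and compare the coefficients.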
- (15) Now, address how you will handle the State variable. Find the right way to include this variable in the model. Run the revised model with the State variable (after addressing multicollinearity and heteroscedasticity if they are found to be an issue).
- (15) Are the state variables significant? How will you test this? What do you conclude?
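One possible approach to the two State questions above is to code State as indicator (dummy) variables and test the indicators jointly with a partial F test against the model without them. The sketch below does this in statsmodels; the column names and state labels are hypothetical, and the data are simulated with no true state effect, so the p-value here carries no information about the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (NOT the assignment data); hypothetical names.
rng = np.random.default_rng(2)
n = 45
df = pd.DataFrame({
    "rd": rng.uniform(2e4, 1.6e5, n),
    "admin": rng.uniform(5e4, 1.8e5, n),
    "marketing": rng.uniform(5e4, 4.5e5, n),
    "state": np.resize(["California", "Florida", "New York"], n),
})
df["profit"] = 5e4 + 0.8 * df["rd"] + 0.03 * df["marketing"] + rng.normal(0, 9e3, n)

# Reduced model: numeric predictors only.
reduced = smf.ols("profit ~ rd + admin + marketing", data=df).fit()

# Full model: C(state) expands into k - 1 = 2 dummies (one level is the baseline).
full = smf.ols("profit ~ rd + admin + marketing + C(state)", data=df).fit()

# Partial F test of H0: all state dummy coefficients are zero.
f_stat, p_value, df_diff = full.compare_f_test(reduced)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}, numerator df = {df_diff:.0f}")
```

Testing the dummies jointly is preferable to reading their individual t-tests, because the individual tests depend on which state is chosen as the baseline.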
- (15) Compare the performance of the two models. Decide which one you will go with and use that prediction equation. Predict profit for Advertising (Marketing Spending, x3) expenditures of $300,000. What is the 95% prediction interval?
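A 95% prediction interval can be obtained in Python from a fitted statsmodels model, as sketched below. Several things here are assumptions: the column names are hypothetical, the data are simulated, and because only the advertising value is specified, the other predictors are held at their sample means purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (NOT the assignment data); hypothetical column names.
rng = np.random.default_rng(4)
n = 46
df = pd.DataFrame({
    "rd": rng.uniform(2e4, 1.6e5, n),
    "admin": rng.uniform(5e4, 1.8e5, n),
    "marketing": rng.uniform(5e4, 4.5e5, n),
})
df["profit"] = 5e4 + 0.8 * df["rd"] + 0.03 * df["marketing"] + rng.normal(0, 9e3, n)

fit = smf.ols("profit ~ rd + admin + marketing", data=df).fit()

# New observation: marketing fixed at 300,000; other inputs at their sample
# means (an assumption made only so the example has complete inputs).
new = pd.DataFrame({
    "rd": [df["rd"].mean()],
    "admin": [df["admin"].mean()],
    "marketing": [300_000.0],
})

row = fit.get_prediction(new).summary_frame(alpha=0.05).iloc[0]
print(f"predicted profit:        {row['mean']:.0f}")
print(f"95% CI for mean profit:  ({row['mean_ci_lower']:.0f}, {row['mean_ci_upper']:.0f})")
print(f"95% prediction interval: ({row['obs_ci_lower']:.0f}, {row['obs_ci_upper']:.0f})")
```

The obs_ci columns give the prediction interval for a single new startup, which is wider than the mean_ci confidence interval for the average profit at those inputs.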
Note: There are parts of this assignment that will rely on material covered on November 11 (very last part of C, and F) (G, H). This is a heads up.