Coursera: Regression Models - Course Project

Executive Summary :

The mtcars dataset from the 1974 Motor Trend US magazine is used to evaluate the effect of transmission design on fuel efficiency in miles per gallon (MPG). When all control variables are considered together, the model as a whole is significant, while no individual control variable, including transmission design, is significant. Considered on its own, transmission design shows a significant difference in MPG. Human effort does improve efficiency: manual transmission is more efficient than auto transmission.

Loading and Exploring the Dataset :

library(knitr)
data(mtcars);kable(summary(mtcars[1:5]));kable(summary(mtcars[6:10]))
mpg             cyl             disp            hp              drat
Min.   :10.4    Min.   :4.00    Min.   : 71.1   Min.   : 52.0   Min.   :2.76
1st Qu.:15.4    1st Qu.:4.00    1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.08
Median :19.2    Median :6.00    Median :196.3   Median :123.0   Median :3.69
Mean   :20.1    Mean   :6.19    Mean   :230.7   Mean   :146.7   Mean   :3.60
3rd Qu.:22.8    3rd Qu.:8.00    3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.92
Max.   :33.9    Max.   :8.00    Max.   :472.0   Max.   :335.0   Max.   :4.93

wt              qsec            vs              am              gear
Min.   :1.51    Min.   :14.5    Min.   :0.000   Min.   :0.000   Min.   :3.00
1st Qu.:2.58    1st Qu.:16.9    1st Qu.:0.000   1st Qu.:0.000   1st Qu.:3.00
Median :3.33    Median :17.7    Median :0.000   Median :0.000   Median :4.00
Mean   :3.22    Mean   :17.8    Mean   :0.438   Mean   :0.406   Mean   :3.69
3rd Qu.:3.61    3rd Qu.:18.9    3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:4.00
Max.   :5.42    Max.   :22.9    Max.   :1.000   Max.   :1.000   Max.   :5.00

The dataset has 32 samples (automobiles), 1 target variable (mpg) and 10 control variables. Details of each variable are available on the dataset's help page (?mtcars in R, or the RStudio help pane).
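The counts above can be checked directly. The following is a minimal sketch using only base R; the name `predictors` is introduced here for illustration and is not part of the original analysis.

```r
data(mtcars)

dim(mtcars)   # number of rows (cars) and columns (variables)
str(mtcars)   # type and first few values of every variable

# Every column except the target variable mpg is treated as a control variable
predictors <- setdiff(names(mtcars), "mpg")
length(predictors)

# ?mtcars opens the help page that describes each variable
```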

Fig 1 and Fig 2 in Appendix I indicate that mileage (mpg) decreases with five of the control variables (carb, cyl, disp, hp and wt) and increases with three others (drat, qsec and vs), irrespective of transmission design. For the control variable gear, mileage shows a decreasing trend for manual transmission and an increasing trend for auto transmission. However, auto-transmission cars in the dataset have 3 or 4 gears, while manual-transmission cars have 4 or 5, which may explain the difference.
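The gear counts per transmission type mentioned above can be tabulated with a cross-table. This is a small sketch; the object name `gear_by_am` is an illustrative choice.

```r
data(mtcars)

# Rows: am (0 = auto, 1 = manual); columns: number of forward gears
gear_by_am <- table(am = mtcars$am, gear = mtcars$gear)
gear_by_am
```

Empty cells in this table show which gear counts never occur for a given transmission type, which is why the gear trend is not directly comparable between the two groups.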

Fig 3, Appendix I is a box plot of mpg (mileage) by am (transmission). The plot suggests that transmission design, considered on its own, has a marked effect on mileage and yields the following hypothesis :

  1. Manual transmission has better mileage (mpg) than auto transmission.
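A box plot alone does not establish significance. One way to check the hypothesis for transmission design on its own is a two-sample t-test; the sketch below uses Welch's test (the default of `t.test`), which is an assumption here and not a step taken in the original analysis.

```r
data(mtcars)

# Welch two-sample t-test of mpg between auto (am = 0) and manual (am = 1)
tt <- t.test(mpg ~ am, data = mtcars)

tt$estimate  # mean mpg in each transmission group
tt$p.value   # compare against the significance level 0.05
```

This test ignores all other control variables, so it only supports the unadjusted comparison; the regression in the next section adjusts for confounders.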

Data Analysis :

mtcars contains data on 32 different automobiles, each described by all 10 control variables, so the mileage of a car reflects the combined effect of these variables. A multivariate regression model is therefore appropriate for evaluating the statistical significance and practical importance of the control variables for the target variable. The model includes transmission design (am) as a factor variable, and the step function (stepwise selection by AIC) screens the control variables so that only those that contribute to the fit are kept.

fitbase=lm(mpg~factor(am)+.-am,data=mtcars) # Multivariate regression model
fittotal=step(fitbase,trace=0) #Select best fit. trace=0 to stop step printing

#List of coefficients with 95% confidence interval, X2.5=Lower limit, X97.5=Upper limit.
kable(data.frame(summary(fittotal)$coef,confint(fittotal)))
            Estimate  Std..Error  t.value  Pr...t..   X2.5..  X97.5..
(Intercept)    9.618      6.9596    1.382    0.1779  -4.6383   23.874
factor(am)1    2.936      1.4109    2.081    0.0467   0.0457    5.826
wt            -3.917      0.7112   -5.507    0.0000  -5.3733   -2.460
qsec           1.226      0.2887    4.247    0.0002   0.6346    1.817
fs=summary(fittotal)$fstat;pval=pf(fs[1],fs[2],fs[3],lower.tail=F)
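The selection made by step can be inspected by comparing the full and the reduced model. The sketch below refits both models with the same calls as above and compares them with `AIC`; step ranks models with its own AIC computation, so treating `AIC()` as an equivalent check is an assumption.

```r
data(mtcars)

# Full model: all control variables, with am entered as a factor
fitbase <- lm(mpg ~ factor(am) + . - am, data = mtcars)

# Stepwise selection by AIC; trace = 0 suppresses the step-by-step printout
fittotal <- step(fitbase, trace = 0)

formula(fittotal)        # terms kept by the selection
AIC(fitbase, fittotal)   # a lower AIC indicates the preferred model
```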

The second column above, “Estimate”, shows the intercept and the coefficient (slope) of each control variable. Auto transmission (am==0) is the reference level, so the intercept is the expected mpg of an auto-transmission car with wt and qsec both equal to 0; it serves as a baseline for the other coefficients rather than as a realistic mileage. Each coefficient is the change in the target variable, mpg, for a one-unit increase in the relevant control variable, keeping all other variables constant. The coefficient for manual transmission, factor(am)1, is positive, indicating that manual transmission has better mpg than auto transmission with all other variables held constant. Its P-value, 0.0467 (column Pr), is below the significance level \(\alpha\)=0.05 (95% confidence), so the effect is statistically significant. The coefficients also show that only two other control variables, wt and qsec, are retained, both with a statistically significant effect on mpg.
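To make "all other variables held constant" concrete, the sketch below predicts mpg for two hypothetical cars that differ only in transmission. Fixing wt and qsec at their sample means is an illustrative choice, and the names `fit`, `new_cars` and `gap` are introduced here.

```r
data(mtcars)

# Same model as the one selected by step in the text
fit <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

# Two hypothetical cars: identical wt and qsec, different transmission
new_cars <- data.frame(am   = c(0, 1),
                       wt   = mean(mtcars$wt),
                       qsec = mean(mtcars$qsec))

pred <- predict(fit, newdata = new_cars)
gap  <- unname(pred[2] - pred[1])   # manual minus auto

gap                                 # predicted difference in mpg
unname(coef(fit)["factor(am)1"])    # the same value as the am coefficient
```

Because the model is linear with no interaction terms, the predicted gap equals the factor(am)1 coefficient whatever values of wt and qsec are chosen.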

The fifth, sixth and seventh columns (Pr, X2.5 and X97.5) show that every term except the intercept has a P-value (Pr) below the significance level \(\alpha\)=0.05 and a 95% confidence interval (X2.5 to X97.5) that does not contain 0. The null hypothesis is therefore rejected for each control variable, i.e. each has a statistically significant effect. The P-value for the regression as a whole, 1.2104 × \(10^{-11}\), is also below \(\alpha\)=0.05, rejecting the null hypothesis that none of the control variables has an effect. Hence we can infer, at the 95% confidence level, that the selected control variables have a significant effect on mpg. The manual-transmission coefficient, 2.936 mpg, is 30.53% of the intercept (9.618), i.e. manual transmission has 30.53% better mpg relative to the auto-transmission reference. Multiple R-squared shows that the model explains 84.97% of the variance in mpg.
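The summary figures quoted above can be recomputed from the fitted model. In this sketch the 30.53% figure is assumed to be the ratio of the factor(am)1 coefficient to the intercept, which matches the reported coefficients; the variable names are illustrative.

```r
data(mtcars)

fit <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)
est <- coef(fit)

# Manual-transmission coefficient as a percentage of the intercept
pct_of_intercept <- unname(100 * est["factor(am)1"] / est["(Intercept)"])

# Share of variance in mpg explained by the model
r_squared <- summary(fit)$r.squared

# P-value of the overall F-test (numerator and denominator degrees of freedom)
fstat     <- summary(fit)$fstatistic
p_overall <- unname(pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE))

round(pct_of_intercept, 2)
round(100 * r_squared, 2)
p_overall
```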

**For the complete model summary refer to Appendix I

Fig 4, Appendix I shows the residual plots for the multivariate regression. The plots show that the standardized residuals lie within [-2,2] and that all Cook’s distances are less than 1 (D<1). This indicates that no single observation has undue leverage or influence on the fit, and that the model fits the data well. However, this evaluation of the residuals assumes that they are normally distributed. A simple Shapiro-Wilk test can check this assumption.
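The quantities read off the residual plots can also be computed numerically. The sketch below uses the base R functions `rstandard` and `cooks.distance`; the cut-offs of 2 and 1 are the ones used in the text, and the object names are illustrative.

```r
data(mtcars)

fit <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

std_res <- rstandard(fit)        # standardized residuals, one per car
cooks_d <- cooks.distance(fit)   # Cook's distance, one per car

range(std_res)                   # compare against the interval [-2, 2]
max(cooks_d)                     # compare against the threshold 1

# Cars, if any, that fall outside the cut-offs
names(which(abs(std_res) > 2))
names(which(cooks_d > 1))
```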

Shapiro-Wilk Test:

s.test=shapiro.test(fittotal$resid);print(s.test)
|| 
||  Shapiro-Wilk normality test
|| 
|| data:  fittotal$resid
|| W = 0.9411, p-value = 0.08043

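The Shapiro-Wilk test P-value, 0.0804, is greater than the significance level \(\alpha\)=0.05, so the null hypothesis that the residuals are normally distributed is not rejected. The residuals are therefore consistent with normality, our evaluation of the residuals holds, and the model is a good fit for the data.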
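The decision rule for the normality test can be written out explicitly. In the Shapiro-Wilk test the null hypothesis is that the data are normally distributed, so a small P-value is evidence against normality. The sketch assumes a significance level of 0.05, in line with the rest of the analysis.

```r
data(mtcars)

fit <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)
sw  <- shapiro.test(residuals(fit))

alpha <- 0.05   # assumed significance level

# TRUE means normality is rejected; FALSE means it is not rejected
reject_normality <- sw$p.value < alpha

sw$statistic
sw$p.value
reject_normality
```

Failing to reject the null hypothesis does not prove that the residuals are normal; it only means the test found no significant departure from normality.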

Conclusion:

The above analysis yields the inference that manual transmission is better than auto transmission, with an estimated 2.94 mpg (30.53% relative to the model intercept) better mileage, while accounting for the significant confounders wt and qsec.

Appendix I :

library(caret)
par(mfrow=c(2,1))
featurePlot(mtcars[mtcars$am==0,-c(1,9)],mtcars$mpg[mtcars$am==0],type=c('p','r'),
col='red',labels=c("Fig 1",""),main="MPG Trend For Auto Transmission",pch=19)

(Fig 1 : plot of MPG Trend For Auto Transmission)

featurePlot(mtcars[mtcars$am==1,-c(1,9)],mtcars$mpg[mtcars$am==1],type=c('p','r'),
col='red',labels=c("Fig 2",""),main="MPG Trend For Manual Transmission",pch=19)

(Fig 2 : plot of MPG Trend For Manual Transmission)

Detailed Summary Of Multivariate Regression :

summary(fittotal)

Call:
lm(formula = mpg ~ factor(am) + wt + qsec, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.481 -1.556 -0.726  1.411  4.661 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.618      6.960    1.38  0.17792    
factor(am)1    2.936      1.411    2.08  0.04672 *  
wt            -3.917      0.711   -5.51    7e-06 ***
qsec           1.226      0.289    4.25  0.00022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.46 on 28 degrees of freedom
Multiple R-squared:  0.85,  Adjusted R-squared:  0.834 
F-statistic: 52.7 on 3 and 28 DF,  p-value: 1.21e-11

par(mar=c(5,4,3,4))
boxplot(mpg~am,data=mtcars,xlab="Fig 3. Transmission Type : 0-Auto, 1-Manual",ylab="Mileage mpg",
        main="MPG Vs. Transmission Design")

(Fig 3 : box plot of MPG Vs. Transmission Design)

par(mfrow=c(2,2),oma=c(6,2,6,2),mar=c(4,4,2,2))
plot(fittotal,1:4);title(main="Residuals For Multivariate",sub="Fig 4",outer=T,cex.main=2,cex.sub=2)

(Fig 4 : Residuals For Multivariate, four diagnostic plots)