#library(car)
library(openintro)
library(broom)
library(mosaic)
library(Stat2Data)
data("Diamonds2")
We’ll use the Diamonds2 data set as an example. It is a subset of 307 cases with the most frequent colors from the Diamonds data. We want to know whether color has a significant effect on the weight of a diamond.
Use Carat as the response and Color as the predictor.
str(Diamonds2)
'data.frame': 307 obs. of 6 variables:
$ Carat : num 1.08 0.31 0.32 0.33 0.33 0.35 0.35 0.37 0.38 0.38 ...
$ Color : Factor w/ 4 levels "D","E","F","G": 2 3 3 1 4 3 3 3 1 2 ...
$ Clarity : Factor w/ 8 levels "IF","SI1","SI2",..: 5 7 7 1 7 5 5 7 1 8 ...
$ Depth : num 68.6 61.9 60.8 60.8 61.5 62.5 62.3 61.4 60 61.5 ...
$ PricePerCt: num 6693 3159 3159 4759 2896 ...
$ TotalPrice: num 7229 979 1011 1570 956 ...
Write the complete linear model:
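One way to write the model, sketched under the assumption that R's default treatment coding is used with color D as the baseline. The indicator symbols $I_E$, $I_F$, $I_G$ are notation introduced here, not taken from the lab text:

```latex
\[
\text{Carat}_i = \beta_0 + \beta_1 I_{E,i} + \beta_2 I_{F,i} + \beta_3 I_{G,i} + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2),
\]
% I_{E,i} = 1 if diamond i has color E and 0 otherwise (likewise for F and G).
```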
Estimate and interpret all coefficients:
diamond_lm <- lm(Carat ~ Color, data = Diamonds2)
tidy(diamond_lm)
We can see a tabular representation of the choice made in coding the dummy regressors using the contrasts function:
contrasts(Diamonds2$Color)
  E F G
D 0 0 0
E 1 0 0
F 0 1 0
G 0 0 1
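Reading each row of the contrast matrix as a set of indicator values gives the mean of each color in terms of the coefficients. A short derivation, where $\mu_c$ (notation introduced here) denotes the mean Carat for color $c$:

```latex
\[
\begin{aligned}
\mu_D &= \beta_0, \\
\mu_E &= \beta_0 + \beta_1, \\
\mu_F &= \beta_0 + \beta_2, \\
\mu_G &= \beta_0 + \beta_3.
\end{aligned}
\]
% Hence the intercept is the mean for the baseline color D, and each slope
% is a difference from that baseline, e.g. \beta_1 = \mu_E - \mu_D.
```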
Calculate the ratio between the maximum sd and the minimum sd. What is your conclusion?
diamond_summrize <- Diamonds2 %>%
group_by(Color) %>%
summarise(mean = mean(Carat), sd = sd(Carat)) %>%
arrange(sd)
`summarise()` ungrouping output (override with `.groups` argument)
diamond_summrize
diamond_summrize$sd[4]/diamond_summrize$sd[1]
[1] 2.073307
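The same ratio can be computed without relying on the row order of the summary table. This is a minimal sketch using base R; the names `sd_by_color` and `sd_ratio` are introduced here for illustration, and the cutoff of 2 is the rule of thumb the worksheet's conclusion appears to use:

```r
library(Stat2Data)  # provides the Diamonds2 data set
data("Diamonds2")

# Standard deviation of Carat within each color group
sd_by_color <- tapply(Diamonds2$Carat, Diamonds2$Color, sd)

# Ratio of the largest group sd to the smallest group sd
sd_ratio <- max(sd_by_color) / min(sd_by_color)
sd_ratio

# TRUE means the ratio exceeds the assumed rule-of-thumb cutoff of 2,
# which casts doubt on the equal-variance condition
sd_ratio > 2
```

Using `max()` and `min()` keeps the calculation correct even if the number of color levels changes.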
Draw a QQ plot. What is your conclusion?
diamond_aug <- augment(diamond_lm)
diamond_aug
ggplot(diamond_aug, aes(sample = .resid)) +
geom_qq()
Draw a side-by-side box plot. What is your conclusion?
ggplot(Diamonds2, aes(y=Carat, x=Color)) + geom_boxplot()
Your conclusion:
The ratio between the maximum sd and the minimum sd is larger than 2 (about 2.07), which suggests the equal-variance condition is not met, and the QQ plot shows that the distribution of the residuals is not normal. So we need to transform the data to correct the skewness.
Draw an appropriate plot of Carat and then decide on a transformation for Carat.
ggplot(Diamonds2, aes(x=log(Carat))) + geom_histogram()
diamond_log_lm <- lm(log(Carat) ~ Color, data = Diamonds2)
tidy(diamond_log_lm)
Draw a residual plot.
diamond_log_aug <- augment(diamond_log_lm)
diamond_log_aug
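The augmented data frame holds the fitted values and residuals, but the plot itself still has to be drawn. One possible sketch plots residuals against fitted values; the object name `resid_plot` and the styling choices are assumptions made here, not part of the original lab:

```r
library(Stat2Data)  # Diamonds2 data
library(broom)      # augment()
library(ggplot2)    # plotting

data("Diamonds2")
diamond_log_lm <- lm(log(Carat) ~ Color, data = Diamonds2)
diamond_log_aug <- augment(diamond_log_lm)

# Residuals against fitted values. With a single categorical predictor the
# fitted values are the four group means, so the points form four columns.
resid_plot <- ggplot(diamond_log_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted log(Carat)", y = "Residual")
resid_plot
```

Columns of similar vertical spread around the dashed line would be consistent with the equal-variance condition.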
Draw a side-by-side box plot.
ggplot(Diamonds2, aes(y=log(Carat), x=Color)) + geom_boxplot()
Draw a QQ plot.
ggplot(diamond_log_aug, aes(sample = .std.resid)) +
geom_qq()
Calculate the ratio between the maximum sd and the minimum sd. What is your conclusion?
diamond_log_summrize <- Diamonds2 %>%
group_by(Color) %>%
summarise(mean = mean(log(Carat)), sd = sd(log(Carat))) %>%
arrange(sd)
`summarise()` ungrouping output (override with `.groups` argument)
diamond_log_summrize
diamond_log_summrize$sd[4]/diamond_log_summrize$sd[1]
[1] 1.559727
diamond_log_aov <- aov(log(Carat) ~ Color, data = Diamonds2)
tidy(diamond_log_aov)
Your conclusion:
Given this very small p-value, we reject the null hypothesis, which means that at least one color has a mean log(Carat) that differs significantly from the others.
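For reference, the hypotheses behind this F-test can be written out as follows. Here $\mu_c$ (notation introduced here) is the mean of log(Carat) for color $c$, and the degrees of freedom assume all 307 cases across the 4 colors are used:

```latex
\[
H_0:\ \mu_D = \mu_E = \mu_F = \mu_G
\qquad \text{versus} \qquad
H_a:\ \text{at least one } \mu_c \text{ differs},
\]
\[
F = \frac{MS_{\text{Color}}}{MS_{\text{Residual}}}
\quad \text{with } 4 - 1 = 3 \text{ and } 307 - 4 = 303 \text{ degrees of freedom.}
\]
```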
diamond_aov <- aov(log(Carat) ~ Color, data = Diamonds2)
TukeyHSD(diamond_aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = log(Carat) ~ Color, data = Diamonds2)
$Color
           diff          lwr       upr     p adj
E-D -0.02298611 -0.227423474 0.1814513 0.9914530
F-D  0.20781627  0.005671473 0.4099611 0.0412729
G-D  0.35757624  0.154992248 0.5601602 0.0000439
F-E  0.23080238  0.053304460 0.4083003 0.0048645
G-E  0.38056235  0.202564412 0.5585603 0.0000004
G-F  0.14975996 -0.025600091 0.3251200 0.1238450
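In the table above, the pairs F-D, G-D, F-E and G-E have adjusted p-values below 0.05, while E-D and G-F do not. A small sketch of how those rows could be pulled out programmatically. The names `tukey_tab` and `sig_pairs` are introduced here, and the 0.05 threshold is assumed from the 95% family-wise confidence level:

```r
library(Stat2Data)  # Diamonds2 data
data("Diamonds2")

diamond_aov <- aov(log(Carat) ~ Color, data = Diamonds2)

# TukeyHSD() returns a list with one matrix per factor; take the Color matrix
tukey_tab <- TukeyHSD(diamond_aov)$Color

# Keep only the pairs whose adjusted p-value is below 0.05
sig_pairs <- tukey_tab[tukey_tab[, "p adj"] < 0.05, , drop = FALSE]
sig_pairs

# Differences are on the log scale, so exponentiating turns them into
# ratios of geometric mean carat weight for each pair
exp(sig_pairs[, c("diff", "lwr", "upr")])
```

Because the model is fit to log(Carat), a back-transformed ratio above 1 means the first color in the pair tends to be heavier than the second.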