library(mosaic)
library(car)
Look at the Ornstein
data set:
str(Ornstein)
'data.frame': 248 obs. of 4 variables:
$ assets : int 147670 133000 113230 85418 75477 40742 40140 26866 24500 23700 ...
$ sector : Factor w/ 10 levels "AGR","BNK","CON",..: 2 2 2 2 2 4 9 2 9 8 ...
$ nation : Factor w/ 4 levels "CAN","OTH","UK",..: 1 1 1 1 1 1 1 1 1 4 ...
$ interlocks: int 87 107 94 48 66 69 46 16 77 6 ...
The assets
variable is severely right skewed:
ggplot(Ornstein, aes(x = assets)) +
geom_histogram()
It’s much better after a log transform:
ggplot(Ornstein, aes(x = log(assets))) +
geom_histogram(binwidth = 1)
Take a look at the UN
data.
UN
Consider infant mortality vs GDP in the following plot. (The warnings you get refer to rows that are removed because their data values are missing.)
ggplot(UN, aes(y = infantMortality, x = ppgdp)) +
geom_point()
Now observe the relationship between the log-transformed versions of these variables:
ggplot(UN, aes(y = log(infantMortality), x = log(ppgdp))) +
geom_point()
Use the mutate
command from the dplyr
package to add new calculated columns to a data frame.
Ornstein2 <- Ornstein %>%
mutate(log_assets = log(assets))
Ornstein2
str(Ornstein2)
'data.frame': 248 obs. of 5 variables:
$ assets : int 147670 133000 113230 85418 75477 40742 40140 26866 24500 23700 ...
$ sector : Factor w/ 10 levels "AGR","BNK","CON",..: 2 2 2 2 2 4 9 2 9 8 ...
$ nation : Factor w/ 4 levels "CAN","OTH","UK",..: 1 1 1 1 1 1 1 1 1 4 ...
$ interlocks: int 87 107 94 48 66 69 46 16 77 6 ...
$ log_assets: num 11.9 11.8 11.6 11.4 11.2 ...
ggplot(Ornstein2, aes(x = log_assets)) +
geom_histogram(binwidth = 1)
Consider the ppgdp
variable which is severely right skewed:
ggplot(UN, aes(x = ppgdp)) +
geom_histogram()
The Box-Cox transformation is \[ X^{(\lambda)} = \frac{X^{\lambda} - 1}{\lambda}.\] The Box-Cox transformation is accomplished in R using thebcPower
function. The lambda
argument is the power. For an easy example:
x <- c(1, 2, 3, 4, 5)
bcPower(x, lambda = 2)
[1] 0.0 1.5 4.0 7.5 12.0
(Make sure you check the above numbers to make sure you understand how they were calculated.)
Now we choose several values of \(\lambda\) to try out for the ppgdp
data:
gdp_transformed <- data.frame(gdp_neg1 = bcPower(UN$ppgdp, lambda = -1),
gdp_neg0.5 = bcPower(UN$ppgdp, lambda = -0.5),
gdp_0 = bcPower(UN$ppgdp, lambda = 0),
gdp_0.5 = bcPower(UN$ppgdp, lambda = 0.5),
gdp_1 = bcPower(UN$ppgdp, lambda = 1))
gdp_transformed
The power of 0 is actually just a log transformation:
all.equal(gdp_transformed$gdp_0, log(UN$ppgdp))
[1] TRUE
The power of 1 is just the original data, except with one subtracted. (Look at the definition of Box-Cox when \(\lambda\) is 1.)
all.equal(gdp_transformed$gdp_1, UN$ppgdp - 1)
[1] TRUE
Here are histograms of these powers:
ggplot(gdp_transformed, aes(x = gdp_neg1)) +
geom_histogram()
ggplot(gdp_transformed, aes(x = gdp_neg0.5)) +
geom_histogram()
ggplot(gdp_transformed, aes(x = gdp_0)) +
geom_histogram()
ggplot(gdp_transformed, aes(x = gdp_0.5)) +
geom_histogram()
ggplot(gdp_transformed, aes(x = gdp_1)) +
geom_histogram()
Use the symbox
command (from the car
package) to check the ladder of powers in one convenient boxplot.
symbox( ~ ppgdp, data = UN)
There is an option for checking other powers. There’s no need in this example since we’ve already seen that the log function is such a good fit, but we’ll do it anyway just to show the syntax. We could use, for example, the vector c(-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2)
in the powers
argument, but there is a shortcut function seq
available that creates a sequence of values.
symbox( ~ ppgdp, data = UN, powers = seq(-2, 2, by = 0.5))
If we have negative values in our data, we can use a “start” to make them positive. This is a number added to every data value to make them all positive. For example, we can’t perform a Box-Cox transformation on the following data:
difficult_data <- c(-2, -1, 0, 3, 5)
But we can if we add, say, 3 to each entry.
symbox(difficult_data + 3)
Or rather than adding 3 manually, we can also use the start
argument of the symbox
function.
symbox(difficult_data, start = 3)
If we have a compressed range of large positive values, we can use a negative start to put the range closer to zero, allowing for the Box-Cox transformations to work more effectively.
We use powerTransform
to find the “best” power (i.e., the one that makes the distribution most normal).
powerTransform(UN$ppgdp)
Estimated transformation parameter
UN$ppgdp
0.02599498
But this is usually a bad idea. This “optimal” power usually has no reasonable physical interpretation. (Note in this case that the power is very close to zero anyway, which is the log transformation.) On the other hand, logarithms are very meaningful; in this case because we understand that GDPs of nations follow a power law.