---
title: "The Bayes classifier"
author: "Sean Raleigh"
output: html_notebook
---


## Introduction

The *Bayes classifier* is a classification method, meaning that it predicts a categorical response for each possible value of a set of explanatory variables. Based on the way it works, one can prove mathematically (though we don't do so in these notes) that the Bayes classifier has an error rate that serves as a lower bound for the expected test error of any other possible classifier. In other words, no matter how good a fancy classification algorithm is, one cannot get (on average) an error rate that is lower than the *Bayes error rate*.

In the real world, we cannot actually use the Bayes classifier or compute the Bayes error rate. This is because the Bayes classifier relies on knowledge of a "true" data-generating process that we never have. (If we knew the true data-generating process, we would know everything we wanted to know statistically about the population we're trying to study, negating the need for applying a classification algorithm to data.)

So why study the Bayes classifier? Because it's an important theoretical concept that tells us there are natural limits on our ability to make predictions from data. There is a certain amount of irreducible error in any classification process.


## Preliminaries

```{r, message = FALSE, warning = FALSE}
library(tidyverse)
```


## Fake data

Because the Bayes classifier requires knowledge of the true probabilities of each response category, we have to simulate some fake data in order to explore it.

The idea here is to imagine that x is the value of some explanatory variable that takes values between 0 and 1. Assume there is some categorical response variable that takes two values, red and blue.

For each point along the x-axis between 0 and 1, we'll define a function that describes the probability of being blue. Here is such a function:

$$prob(x) = \frac{11}{3}x - 8x^{2} + \frac{16}{3}x^{3}.$$

And in R:

```{r}
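# prob(x): probability that a point with explanatory value x is blue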
prob <- function(x) {
    (11/3)*x - 8*(x^2) + (16/3)*(x^3)
}
```

Here is a plot of this function:

```{r}
prob_plot <- ggplot(data.frame(x = c(0, 1)), aes(x)) +
    stat_function(fun = prob) +
    labs(y = "prob")
prob_plot
```

To be clear, this is not a probability density function. The area under this curve is not one. We're not describing the probability density of selecting any given x. When you pick a value of x, the curve height tells you the probability that the chosen value will have blue as its response.

As an example, suppose x = 0.1. The value of the function at x = 0.1 is 29.2%:

```{r}
prob(0.1)
```


```{r}
prob_plot +
    geom_segment(x = 0.1, xend = 0.1,
                 y = 0, yend = prob(0.1),
                 color = "blue", size = 1) +
    geom_segment(x = 0.1, xend = 0.1,
                 y = prob(0.1), yend = 1,
                 color = "red", size = 1) +
    geom_point(x = 0.1, y = prob(0.1), size = 3)
```

There is a 29.2% chance of this point being blue. That means that there is also a 70.8% chance of it being red. Whether such a point actually turns out to be blue or red remains to be seen when we gather data. Or another way to think about it is that in the population of all data points that have x = 0.1, 29.2% will be blue and 70.8% will be red.
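If you like, here's a quick simulation to see this (an added check; the seed and the number of draws below are arbitrary choices): generate many Bernoulli responses at x = 0.1 and compute the fraction that come up blue. It should land near 0.292.

```{r}
set.seed(33333)
# Each uniform draw below prob(0.1) = 0.292 counts as a blue response.
mean(runif(10000) < prob(0.1))
```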

This function was also chosen for convenience so that the probabilities cross the 50% mark at x = 0.25, x = 0.5, and x = 0.75. According to this function, points on the left are more likely to be red up to x = 0.25, at which point the probability shifts to favor blue slightly. At x = 0.5, the probability shifts back to favor red slightly. Finally, to the right of x = 0.75, points are more likely to be blue again. Points near the center are not very well determined because their probabilities are very close to 50%. The following illustrates this a little more vividly:

```{r}
prob_plot_color <- prob_plot +
    geom_ribbon(data = data.frame(x = seq(0, 1, 0.01)),
                aes(x = x, ymin = 0, ymax = prob(x)),
                fill = "blue", alpha = 0.25) +
    geom_ribbon(data = data.frame(x = seq(0, 1, 0.01)),
                aes(x = x, ymin = prob(x), ymax = 1),
                fill = "red", alpha = 0.25) +
    labs(y = "prob") +
    geom_hline(yintercept = 0.5, linetype = "dashed") +
    geom_vline(xintercept = c(0.25, 0.5, 0.75),
               linetype = "dotted")
prob_plot_color
```
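As a quick numerical check (added here), the function returns exactly 0.5 at all three crossing points:

```{r}
prob(c(0.25, 0.5, 0.75))
```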

**Please don't confuse this picture with the pictures of Bayes classifiers from the book!** The book's examples show probabilities defined over a two-dimensional explanatory variable space (X1 and X2). Our example has a one-dimensional explanatory variable space, just the x-axis; the second dimension in this picture is needed only to show the probabilities "sitting above" each x value.

Now that we have our probability function, let's simulate some fake data. Here are 100 random x values:

```{r}
set.seed(11111)
explanatory <- runif(100)
fake_data <- tibble(explanatory)
fake_data
```

Now we generate a response variable. We'll pick colors at random, constrained by the probability at each x value. (The Bernoulli distribution is effectively a coin-flipping mechanism, but instead of heads and tails, we are flipping blues and reds according to the probabilities given by the `prob` function.)

```{r}
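# rowwise() is needed because rbernoulli() takes a single probability,
# so we draw one Bernoulli response per row.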
set.seed(22222)
fake_data <- fake_data %>%
    rowwise() %>%
    mutate(response =
               ifelse(rbernoulli(1, p = prob(explanatory)),
                      "Blue", "Red"))
fake_data
```
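As an aside (this alternative is not in the original notes), base R's `rbinom` is vectorized over its probability argument, so the same kind of simulation can be written without `rowwise()`. The random draws are consumed differently, so the resulting colors won't match the ones above exactly:

```{r}
set.seed(22222)
# One Bernoulli (size = 1 binomial) draw per row, with a per-row
# probability of blue given by prob(explanatory).
fake_data_alt <- tibble(explanatory) %>%
    mutate(response = ifelse(rbinom(n(), size = 1,
                                    prob = prob(explanatory)) == 1,
                             "Blue", "Red"))
fake_data_alt
```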

Let's plot the sample below our previous graph:

```{r}
prob_plot_response <- prob_plot_color +
    geom_jitter(data = fake_data,
                aes(x = explanatory, y = -0.2,
                    color = response),
                position = position_jitter(width = 0,
                                           height = 0.2,
                                           seed = 1)) +
    scale_colour_manual(values = c("Blue", "Red")) +
    guides(color = FALSE)
prob_plot_response
```

There's a lot going on in this picture because we are showing the true probability function and we've vertically jittered the x values so they are all visible. In the real world, this is all we see:

```{r}
x_plot <- ggplot(fake_data, aes(x = explanatory, y = 0)) +
    geom_segment(x = 0, xend = 1, y = 0, yend = 0,
                 color = "black") +
    scale_colour_manual(values = c("Blue", "Red")) +
    guides(color = FALSE) +
    theme(axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),
          axis.title.y = element_blank(),
          panel.grid.major.y = element_blank(),
          panel.grid.minor.y = element_blank())
x_plot_response <- x_plot + 
    geom_point(aes(color = response))
x_plot_response
```


## The Bayes classifier

The Bayes classifier operates using a very simple rule: assign each point to the category with the highest probability, given the values of the explanatory variables. In symbols, we calculate the probability of the response variable taking on each possible class given the explanatory variable:

$$Pr(Response = \text{Blue} \mid Explanatory = x)$$
$$Pr(Response = \text{Red} \mid Explanatory = x)$$

In general, there may be more than two categories. Either way, just choose the category for which the probability listed above is highest.

For example, at x = 0.1, the probability of blue is 29.2%, so the probability of red is 70.8%. A point with x = 0.1 should therefore be classified as red. Keep in mind that an actual data point located at x = 0.1 is not forced to be red. There's still some probability of it being blue, but the Bayes classifier will predict that it is red.
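To make the rule concrete, here is a tiny helper (the name `bayes_classify` is ours, added for illustration) that implements the rule for our one-variable example, using the same >= 0.5 convention we'll use below:

```{r}
# Predict the more probable color at each x (ties go to blue).
bayes_classify <- function(x) {
    ifelse(prob(x) >= 0.5, "Blue", "Red")
}
bayes_classify(0.1)
```

As expected, it predicts red at x = 0.1.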

This sounds easy, but keep in mind, the Bayes classifier is a theoretical construct only; in practice, we never know the true probability distribution that describes the pattern of responses.

However, our data is fake data that we simulated, so we have the true probability distributions. Therefore, we can classify points according to the Bayes classifier. Let's create some new columns in our data frame, one for the predicted value according to the Bayes classifier, and another to indicate if the classifier made a correct prediction (comparing it with the actual response value).

```{r}
fake_data <- fake_data %>%
    mutate(prediction = ifelse(prob(explanatory) >= 0.5,
                               "Blue", "Red"),
           correct = (response == prediction))
fake_data
```

Here are the points as classified by the Bayes classifier:

```{r}
prob_plot_pred <- prob_plot_color +
    geom_jitter(data = fake_data,
                aes(x = explanatory, y = -0.2,
                    color = prediction),
                position = position_jitter(width = 0,
                                           height = 0.2,
                                           seed = 1)) +
    scale_colour_manual(values = c("Blue", "Red")) +
    guides(color = FALSE)
prob_plot_pred
```

This shows only our training data, but keep in mind that any point along the x-axis from 0 to 1 can be classified into blue or red using the Bayes classifier. The dotted vertical lines mark the *Bayes decision boundary*. This boundary separates regions of the x-axis within which the classifier predicts a single color; crossing a boundary point flips the prediction from one color to the other.
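By construction, the boundary points are the solutions of prob(x) = 0.5. As a sanity check (added here; the bracketing intervals below were read off the plot), we can recover them numerically:

```{r}
# Find each root of prob(x) - 0.5 within an interval that brackets it.
sapply(list(c(0.1, 0.4), c(0.4, 0.6), c(0.6, 0.9)),
       function(br) uniroot(function(x) prob(x) - 0.5, interval = br)$root)
```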

**Again, be careful not to confuse this one-dimensional example with the book's two-dimensional example.** In one-dimensional terms, this is what the Bayes decision boundary looks like:

```{r}
x_plot_pred <- x_plot +
    geom_point(aes(color = prediction)) +
    annotate("segment",
             x = c(0.25, 0.5, 0.75), xend = c(0.25, 0.5, 0.75),
             y = -0.005, yend = 0.005,
             color = "black", size = 1)
x_plot_pred
```

Here are the Bayes decision boundaries again, but this time with the actual response values:

```{r}
x_plot_response +
    annotate("segment",
             x = c(0.25, 0.5, 0.75), xend = c(0.25, 0.5, 0.75),
             y = -0.005, yend = 0.005,
             color = "black", size = 1)
```

If we compare the actual responses to the Bayes decision boundary, we note that quite a few points are misclassified by the Bayes classifier.

Returning to the plot with the probability function, the squares indicate correct predictions and the crosses indicate incorrect predictions:

```{r}
prob_plot_all <- prob_plot_response +
    geom_jitter(data = fake_data,
                aes(x = explanatory, y = -0.2,
                    shape = correct),
                size = 4,
                position = position_jitter(width = 0,
                                           height = 0.2,
                                           seed = 1)) +
    scale_shape_manual(values = c(4, 0)) +
    guides(shape = FALSE)
prob_plot_all
```

The Bayes error rate is the expected proportion of responses misclassified by the Bayes classifier. We can estimate it by the fraction of our sample that the classifier got wrong:

```{r}
1 - mean(fake_data$correct)
```

So it's a relatively large 41% here. The theory behind the Bayes classifier says that no classification strategy can be expected to achieve a test error rate lower than the Bayes error rate. Any given classifier on any given data set might get lucky, but on average, no other classifier will do better on new test data generated from the same probability distribution.
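One caveat (this check is an addition to these notes): the 41% above is computed from a single simulated sample of 100 points, so it is only an estimate of the true Bayes error rate. Since x is uniform on [0, 1] and we know `prob`, the true rate is the misclassification probability min(prob(x), 1 - prob(x)) averaged over x, which we can get by numerical integration:

```{r}
# True Bayes error rate: average the per-x misclassification
# probability over the uniform distribution of x on [0, 1].
integrate(function(x) pmin(prob(x), 1 - prob(x)), lower = 0, upper = 1)
```

This comes out to just under 40%, so the 41% observed in our sample is close to the true value.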


## The Bayes classifier in higher dimensions

In our one-dimensional example above, the Bayes decision boundary was a set of three points on the x-axis. The four regions thus created are regions of "constant prediction", meaning that the Bayes classifier classifies everything within a region using a single color. The classifier only changes colors across a point of the Bayes decision boundary.

In a two-dimensional space of explanatory variables (like the book examples), the Bayes decision boundary does not consist of points anymore. It's a curve (or multiple curves) that separates the plane into regions of constant prediction.
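For instance, here is a minimal sketch of such a boundary, using a made-up two-variable probability surface (the logistic surface below is purely hypothetical, not from the book). The Bayes decision boundary is the contour where the probability of blue equals 0.5:

```{r}
# A hypothetical probability-of-blue surface over (x1, x2) and its
# 0.5 contour, which is the Bayes decision boundary in two dimensions.
grid_2d <- crossing(x1 = seq(0, 1, 0.01), x2 = seq(0, 1, 0.01)) %>%
    mutate(p_blue = plogis(6 * (x2 - (0.5 + 0.3 * sin(2 * pi * x1)))))
ggplot(grid_2d, aes(x = x1, y = x2, z = p_blue)) +
    geom_contour(breaks = 0.5, color = "black")
```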

In a three-dimensional space of explanatory variables, the Bayes decision boundary is a surface (or multiple surfaces) separating space into regions of constant prediction.

In higher dimensions, it's harder to visualize what's going on. For an n-dimensional space of explanatory variables, the Bayes decision boundary is an (n-1)-dimensional hypersurface!
