# Introducing the purpose of merging components

Let’s assume we have the following sample $$X = \{x_1,\dots,x_{100} \}$$:

Just looking to the plot, how many groups do you see? I think, it is natural to see two group; one formed with positive values and another formed with the negative ones.

## Common approach in parametric clustering

The most common approach in model based clustering starts by fitting a finite mixture of distribution, for example Gaussian distributions. One way to select the number of components is using the BIC criteria. Using this criteria, we get that the best way to represent our data is using $$3$$ components

$f(x) = \color{black}{\pi_1 f_1(x;\theta_1)} + \color{black}{\pi_2 f_2(x;\theta_2)} + \color{black}{\pi_3 f_3(x;\theta_3)}.$

After calculating the parameters $$\pi_1, \pi_2, \pi_3$$ and $$\theta_1, \theta_2, \theta_3$$, we get the following mixture

We can think a finite mixture as $$3$$ different distributions generating sample independently with $$\pi_1$$, $$\pi_2$$, $$\pi_3$$ being the proportions of sample expected to be generated by component $$f_1$$, $$f_2$$, $$f_3$$ respectively.

$f(x) = \color{red}{\pi_1 f_1(x;\theta_1)} + \color{green}{\pi_2 f_2(x;\theta_2)} + \color{blue}{\pi_3 f_3(x;\theta_3)}$

To classifiy, the observations are assigned to the Gaussian distribution, $$f_j(x;\theta_j)$$, they most likely belong to

We have considered three components and we have assigned each observation $$x_i$$ to the component $$j$$ such that posterior probability

$\tau_{ij} = \frac{\pi_j f_j(x_i;\theta_j)}{\sum_{\ell=1}^3 \pi_\ell f_\ell(x_i;\theta_\ell)}$

is maximum, in other words, to the highest component

$\left(\color{red}{\frac{\pi_1 f_1(x_i;\theta_1)}{\sum_{\ell=1}^3 \pi_\ell f_\ell(x_i;\theta_\ell)}}, \color{green}{\frac{\pi_2 f_2(x_i;\theta_2)}{\sum_{\ell=1}^3 \pi_\ell f_\ell(x_i;\theta_\ell)}}, \color{blue}{\frac{\pi_3 f_3(x_i;\theta_3)}{\sum_{\ell=1}^3 \pi_\ell f_\ell(x_i;\theta_\ell)}} \right)$

or simplifying, to the highest component

$\left(\color{red}{\pi_1 f_1(x_i;\theta_1)}, \color{green}{\pi_2 f_2(x_i;\theta_2)}, \color{blue}{\pi_3 f_3(x_i;\theta_3)} \right)$

As we can see, the number of clusters differs from the number we guess at the beginning.

A different approach consists in considering a mixture of Gaussian mixtures, and classify according to the corresponding components. For example, to decompose mixture $$f$$ with the following two components

$f(x) = \color{red}{\pi_1 f_1(x;\theta_1)} + \color{green}{ (\pi_2+\pi_3)\left\{ \frac{\pi_2}{\pi_2+\pi_3} f_2(x;\theta_2) + \frac{\pi_3}{\pi_2+\pi_3} f_3(x;\theta_3) \right\}}$

and classify observations to one of the two components most likely belong to

$\left(\color{red}{\pi_1 f_1(x_i;\theta_1)}, \color{green}{\pi_2 f_2(x_i;\theta_2)}+\color{green}{\pi_3 f_3(x_i;\theta_3)} \right)$

Following this approach we are modelling our data with the same mixture but considering different components as a single cluster.

## Hierarchical mixture sequence

This lead us to the following strategy, starting from a finite mixture with a certain number of components, we repeatedly merge components “more likely” to be considered as a single cluster

# What strategies are used to merge components?

There are different approaches to merge the components of a finite mixture

• The ridgeline unimodal method [Ray and Lindsay (2005)]
• The ridgeline ratio method [Hennig (2010)]
• The dip test method [Tantrum et al. (2003)]
• The Bhattacharyya distance method [Hennig (2010)]
• Directly estimated misclassification probabilities (DEMP) method [Hennig (2010)]
• The prediction strength method [Tibshirani and Walther (2005)]
• Entropy minimization method [Baudry et el. (2010)]

From all of them, here we are interested on those methods depending only on the posterior probabilities

$\tau_{ij} = \frac{\pi_j f_j(x_i;\theta_j)}{\sum_{\ell=1}^3 \pi_\ell f_\ell(x_i;\theta_\ell)}.$

That is methods which only use the posterior probability vectors

$\tau_{i} = \left( \tau_{i1}, \tau_{i2}, \dots, \tau_{ik} \right)$

From the previous list only the DEMP method and the Entropy minimization method are based on the posterior probability vectors. Next we revise this two methods and discuss the notion of confusion between component they are considering.

## The entropy criteria

Baudry et al. proposed to combine those two components for which after merging the components the total Shannon Entropy is minimized. The Shannon entropy is

$- \sum_{i=1}^n \sum_{j=1}^k \tau_{ij} log(\tau_{ij}).$

As suggested by Baudry et al., this is equivalent to find those two components $$I_a$$ and $$I_b$$ maximizing

$\sum_{i=1}^n (\tau_{i I_a}+\tau_{i I_b}) \log(\tau_{i I_a} + \tau_{i I_b}) - \sum_{i=1}^n \left\{ \tau_{i I_a} \log(\tau_{i I_a}) + \tau_{i I_b} \log(\tau_{i I_b})\right\}.$

The entropy criteria depends on the other components. In fact, it gets smaller when the sum of the other components increase.