In this talk we focus on methods for merging the components of a finite mixture model hierarchically. The merging is performed sequentially: after each merging step, the merged components are treated as a single component.

We focus only on merging the components of a finite mixture using the information contained in the posterior probabilities. The problem of interest can be stated as follows:

Given a sample of posterior probabilities of belonging to each component \(I_j\),

\[ \left[ \begin{array}{ccccc} (\tau_{11}, & \dots & \tau_{1j}, & \dots & \tau_{1k}), \\ \vdots & & \vdots & & \vdots \\ (\tau_{i1}, & \dots & \tau_{ij}, & \dots & \tau_{ik}), \\ \vdots & & \vdots & & \vdots \\ (\tau_{n1}, & \dots & \tau_{nj}, & \dots & \tau_{nk}) \end{array} \right] \in \mathcal{S}^k \]

merge the components (sequentially) to obtain a hierarchy over the set of components.

To solve the stated problem, we propose:

Given the set of components \(\mathcal{I}_s = \{I_1, \dots, I_s\}\) and \(\mathcal{T}_{\mathcal{I}_s} = \{\tau_{1 \mathcal{I}_s}, \dots, \tau_{n \mathcal{I}_s} \}\), we propose to merge the two components \(I_a\) and \(I_b\) into one component \(I_a \cup I_b\) that maximize \[ S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b) = \frac{\sum_{i=1}^n {\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} {\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}}{\sum_{i=1}^n {\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)} }}, \]

where

- the function \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) measures how likely it is to merge components \(I_a\) and \(I_b\) given the information contained in \(\tau_{i \mathcal{I}_s}\), and
- the function \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) measures how relevant observation \(i\) is for computing the overall score for merging \(I_a\) and \(I_b\) given the information contained in \(\tau_{i \mathcal{I}_s}\).

Our proposal is generic in the sense that we do not fix the measure used to combine the components: the user can define the functions \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) and \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) according to their problem.
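As a concrete illustration, the generic score \(S_{\omega, \lambda}\) can be sketched as follows. This is our own minimal sketch, not part of any package; the function name `merge_score` and its signature are hypothetical, and the user-supplied callables `omega` and `lam` play the roles of \({\color{blue}{\omega}}\) and \({\color{red}{\lambda}}\).

```python
import numpy as np

def merge_score(tau, a, b, omega, lam):
    """Generic score S_{omega,lambda}: a weighted average over observations
    of lam(tau_i, a, b) with weights omega(tau_i, a, b).

    tau is the n x s matrix of posterior probabilities; a and b are the
    (0-based) column indices of the two candidate components."""
    w = np.array([omega(t, a, b) for t in tau])
    l = np.array([lam(t, a, b) for t in tau])
    return w.dot(l) / w.sum()
```

At each step of the hierarchy one would evaluate `merge_score` for every pair \((I_a, I_b)\) and merge the pair with the highest score.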

In the next sections, we show how ideas that have appeared in different works fit into the presented framework.

Hennig (2010) proposed a method based on maximizing the Directly Estimated Misclassification Probabilities (the DEMP approach). It is easy to show that the DEMP approach is equivalent to the general score \(S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b)\) using

- \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{a}}\) and \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1{\hskip -2.4 pt}\hbox{I} \left( \forall j\; \tau_{i I_{b}} \geq \tau_{iI_j} \right)\).
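Assuming components are indexed by 0-based columns of the posterior matrix, these two DEMP choices can be sketched as below; the names `omega_prop` and `lambda_demp` are ours, not from any package.

```python
import numpy as np

def omega_prop(t, a, b):
    # weight each observation by its posterior of belonging to I_a
    return t[a]

def lambda_demp(t, a, b):
    # indicator that I_b attains the maximal posterior for this observation
    return float(t[b] >= t.max())
```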

Baudry *et al.* (2010) proposed to merge the components of a finite mixture by minimizing the total entropy

\[ \sum_{i=1}^n \sum_{j=1}^s \tau_{i I_j} \log(\tau_{i I_j}). \]

In this case, it can be seen that their approach is equivalent to the general score \(S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b)\) using

- \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1\) and \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = (\tau_{i I_a}+\tau_{i I_b}) \log(\tau_{i I_a} + \tau_{i I_b}) - \left\{ \tau_{i I_a} \log(\tau_{i I_a}) + \tau_{i I_b} \log(\tau_{i I_b})\right\}\)
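This entropy-based \(\lambda\) can be sketched as follows (the name `lambda_entr` is ours; it assumes strictly positive posteriors so the logarithms are defined):

```python
import numpy as np

def lambda_entr(t, a, b):
    # entropy decrease obtained by pooling the posteriors of I_a and I_b
    m = t[a] + t[b]
    return m * np.log(m) - (t[a] * np.log(t[a]) + t[b] * np.log(t[b]))
```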

Longford & Bartosova (2014) proposed to merge the two components that maximize the probability of classifying an observation to component \(I_a\) given that the observation was generated by either \(I_a\) or \(I_b\). To approximate this probability, a sample \(X^*\) is simulated from the finite mixture with components \(\mathcal{I}_s = \{I_1, \dots, I_s\}\). Using the posterior probabilities \(\mathcal{T}_{\mathcal{I}_s}^*\) of this simulated sample, their approach is equivalent to the score \(S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}^*, I_a, I_b)\) using

- \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1\) and \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \frac{\tau_{i I_{b}}}{\tau_{i I_{a}} + \tau_{i I_{b}}}\)
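A sketch of this \(\lambda\) (the name `lambda_demp_mod` is ours):

```python
def lambda_demp_mod(t, a, b):
    # posterior of I_b conditional on the observation coming from I_a or I_b
    return t[b] / (t[a] + t[b])
```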

Each of the previous methods has considered a different function \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\):

- \(1{\hskip -2.4 pt}\hbox{I} \left( \forall j\; \tau_{i I_{b}} \geq \tau_{iI_j} \right)\) for the DEMP approach
- \((\tau_{i I_a}+\tau_{i I_b}) \log(\tau_{i I_a} + \tau_{i I_b}) - \left\{ \tau_{i I_a} \log(\tau_{i I_a}) + \tau_{i I_b} \log(\tau_{i I_b})\right\}\) for the entropy approach, and
- \(\frac{\tau_{i I_{b}}}{\tau_{i I_{a}} + \tau_{i I_{b}}}\) for the version of Longford & Bartosova.

We use the labels `demp`, `entr`, and `demp.mod` to refer to each function, respectively. Moreover, for the previous three methods we have seen two different possible definitions of \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\):

- \(\tau_{i I_{a}}\) for the DEMP approach and
- \(1\) for the approach presented by Baudry *et al.* and for the one by Longford & Bartosova.

We refer to them as `prop` and `cnst`, respectively. This way, we are able to refer to each method using the label for the function \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) followed by the label for the function \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\). For example,

- we refer to the DEMP approach as `prop-demp`,
- we refer to the approach of Baudry *et al.* as `cnst-entr`, and
- we refer to the approach of Longford & Bartosova as `cnst-demp.mod`.

Using this labeling, we can define three new approaches corresponding to the labels `prop-entr`, `prop-demp.mod`, and `cnst-demp`.

Lambda.Omega | cnst | prop |
---|---|---|
demp | cnst-demp | Hennig (2010) |
demp.mod | Longford & Bartosova (2014) | prop-demp.mod |
entr | Baudry et al. (2010) | prop-entr |

In the previous approaches we have used either \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1\) (`cnst`) or \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{a}}\) (`prop`). In the first case each observation is weighted equally; in the second, each observation is weighted according to its posterior probability of belonging to \(I_a\). A more extreme weighting can be introduced by considering only those observations whose posterior probability of being classified to \(I_a\) is maximal, that is, \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1{\hskip -2.4 pt}\hbox{I} \left( \forall j\; \tau_{i I_{a}} \geq \tau_{iI_j} \right)\). We refer to this weighting as `dich`. This new weighting allows us to introduce three new approaches.
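The three weighting choices can be sketched side by side; the function names are ours, and `t` stands for the posterior vector \(\tau_{i \mathcal{I}_s}\) of one observation.

```python
import numpy as np

def omega_cnst(t, a, b):
    return 1.0                     # every observation weighted equally

def omega_prop(t, a, b):
    return t[a]                    # weight by the posterior of I_a

def omega_dich(t, a, b):
    return float(t[a] >= t.max())  # keep only observations assigned to I_a
```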

Lambda.Omega | cnst | prop | dich |
---|---|---|---|
demp | cnst-demp | Hennig (2010) | dich-demp |
demp.mod | Longford & Bartosova (2014) | prop-demp.mod | dich-demp.mod |
entr | Baudry et al. (2010) | prop-entr | dich-entr |

A very simple function to measure how likely it is to classify an observation into \(I_b\) when the observation comes from \(I_a\) is simply the posterior probability of classifying it to \(I_b\), that is, \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{b}}\). This measure is similar to the one introduced by Longford & Bartosova, but in this case we do not assume that the observation was generated by either component \(I_a\) or \(I_b\).

We refer to this approach as `prop` (not to be confused with the `prop` label used for the function \({\color{blue}{\omega}}\)). With this new function we have three new approaches to merge components.
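For completeness, a sketch of this \(\lambda\) (the name `lambda_prop` is ours):

```python
def lambda_prop(t, a, b):
    # simply the posterior probability of classifying the observation to I_b
    return t[b]
```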

Lambda.Omega | cnst | prop | dich |
---|---|---|---|
demp | cnst-demp | Hennig (2010) | dich-demp |
demp.mod | Longford & Bartosova (2014) | prop-demp.mod | dich-demp.mod |
entr | Baudry et al. (2010) | prop-entr | dich-entr |
prop | cnst-prop | prop-prop | dich-prop |

The log-ratio between the posterior probability of classifying an observation to \(I_b\) and the posterior probability of classifying it to \(I_a\) measures how likely it is to classify an observation to \(I_b\) when the observation was in fact generated by component \(I_a\). This measure can therefore be used to assess how likely it is to merge components \(I_a\) and \(I_b\).

- Consider \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \log(\tau_{i I_{b}} / \tau_{i I_{a}})\).
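A sketch of this log-ratio \(\lambda\) (the name `lambda_log` is ours; it assumes strictly positive posteriors):

```python
import numpy as np

def lambda_log(t, a, b):
    # log-ratio of the posteriors of I_b and I_a
    return np.log(t[b] / t[a])
```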

This approach allows us to introduce three new methods for merging components.

Lambda.Omega | cnst | prop | dich |
---|---|---|---|
demp | cnst-demp | Hennig (2010) | dich-demp |
demp.mod | Longford & Bartosova (2014) | prop-demp.mod | dich-demp.mod |
entr | Baudry et al. (2010) | prop-entr | dich-entr |
prop | cnst-prop | prop-prop | dich-prop |
log | cnst-log | prop-log | dich-log |

Finally, using the norm defined on the simplex (Aitchison) we can obtain a measure of entropy similar to the one introduced by Baudry *et al.* (2010). In this case, the measure has desirable properties such as scale invariance and compositional coherence.

- Consider \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = - \left( \log(\tau_{i I_{b}} / \tau_{i I_{a}}) \right)^2\).
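A sketch of this norm-based \(\lambda\) (the name `lambda_norm` is ours; the sign is negative so that larger scores favour merging components with similar posteriors):

```python
import numpy as np

def lambda_norm(t, a, b):
    # negative squared log-ratio: an Aitchison-type, scale-invariant measure
    return -np.log(t[b] / t[a]) ** 2
```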

We have three new methods

Lambda.Omega | cnst | prop | dich |
---|---|---|---|
demp | cnst-demp | Hennig (2010) | dich-demp |
demp.mod | Longford & Bartosova (2014) | prop-demp.mod | dich-demp.mod |
entr | Baudry et al. (2010) | prop-entr | dich-entr |
prop | cnst-prop | prop-prop | dich-prop |
log | cnst-log | prop-log | dich-log |
norm | cnst-norm | prop-norm | dich-norm |

In this part we examine whether the previous approaches are useful for approximating components. We perform a very simple example, following these steps.

- We generate a sample from a finite mixture with a certain number of components. For example, in the next plot we show three different mixtures with \(K_0=3\) components generated using the package `MixSim` with different levels of overlap \(\hat{\omega} \in (0.1, 0.3, 0.5)\) (see Melnykov *et al.* (2012) for details). Note that for each observation we know the component from which it was generated. In other words, we have an initial classification \(L_0\) which allows us to compare other clusterings with the one given by the components of the generated finite mixture.
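The simulation step above can be sketched with a minimal numpy analogue. `MixSim` is an R package, so this is not its API; the means, number of observations, and seed below are illustrative and, unlike `MixSim`, do not control the overlap level \(\hat{\omega}\).

```python
import numpy as np

rng = np.random.default_rng(1)
K0, n = 3, 300
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])

labels = rng.integers(0, K0, size=n)         # true component L_0 of each point
X = means[labels] + rng.normal(size=(n, 2))  # labelled sample from the mixture
```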