Introduction to the problem of merging components

In this talk we focus on methods for merging the components of a finite mixture model hierarchically. The merging is carried out sequentially: after each merging step, the merged components are treated as a single component.

Merging components based on the posterior probabilities

We focus on merging the components of a finite mixture using only the information contained in the posterior probabilities. The problem of interest can be stated as follows:

Given a sample of posterior probabilities of belonging to each component \(I_j\),

\[ \left[ \begin{array}{ccccc} (\tau_{11}, & \dots & \tau_{1j}, & \dots & \tau_{1k}), \\ \vdots & & \vdots & & \vdots \\ (\tau_{i1}, & \dots & \tau_{ij}, & \dots & \tau_{ik}), \\ \vdots & & \vdots & & \vdots \\ (\tau_{n1}, & \dots & \tau_{nj}, & \dots & \tau_{nk}) \end{array} \right] \in \mathcal{S}^k \]

merge the components (sequentially) to obtain a hierarchy over the set of components.

Our proposal: a generic merging score

To solve the stated problem, we propose:

Given the set of components \(\mathcal{I}_s = \{I_1, \dots, I_s\}\) and \(\mathcal{T}_{\mathcal{I}_s} = \{\tau_{1 \mathcal{I}_s}, \dots, \tau_{n \mathcal{I}_s} \}\), we propose to merge into one component \(I_a \cup I_b\) those two components \(I_a\) and \(I_b\) that maximise \[ S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b) = \frac{\sum_{i=1}^n {\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} {\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}}{\sum_{i=1}^n {\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)} }}, \]

where

  • the function \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) measures how likely it is that components \(I_a\) and \(I_b\) should be merged given the information contained in \(\tau_{i \mathcal{I}_s}\), and
  • the function \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) measures how relevant observation \(i\) is when computing the overall score for merging \(I_a\) and \(I_b\) given the information contained in \(\tau_{i \mathcal{I}_s}\).

Our proposal is generic in the sense that we do not fix the measure used to combine the components. The user can define the functions \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) and \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) according to their problem.
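As a concrete illustration, the generic score can be sketched in a few lines of Python (all names here are ours and purely illustrative; `tau` is the \(n \times s\) matrix of posterior probabilities, and `lam` and `omega` are the user-supplied functions):

```python
import numpy as np

def merging_score(tau, a, b, lam, omega):
    """Generic score S_{omega,lambda}: weighted average of lam over the
    observations, with weights given by omega (illustrative sketch)."""
    w = np.array([omega(t, a, b) for t in tau])
    l = np.array([lam(t, a, b) for t in tau])
    return np.sum(w * l) / np.sum(w)
```

At each step of the hierarchy one would evaluate this score for every pair \((I_a, I_b)\) and merge the pair with the highest score.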

In the next sections, we show how ideas that appeared in different works fit into the presented framework.

The DEMP approach [Hennig (2010)]

Hennig (2010) proposed a method based on maximizing Directly Estimated Misclassification Probabilities (the DEMP approach). It is easy to show that the DEMP approach is equivalent to the general score \(S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b)\) using

  • \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{a}}\) and \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1{\hskip -2.4 pt}\hbox{I} \left( \forall j\; \tau_{i I_{b}} \geq \tau_{iI_j} \right)\).
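Under stated notational assumptions (a row `tau_i` of posteriors, with `a` and `b` the column indices of \(I_a\) and \(I_b\)), the DEMP choice can be sketched in Python as:

```python
import numpy as np

def omega_demp(tau_i, a, b):
    # Weight each observation by its posterior probability for I_a.
    return tau_i[a]

def lambda_demp(tau_i, a, b):
    # Indicator that I_b attains the (weakly) largest posterior.
    return float(np.all(tau_i[b] >= tau_i))
```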

The entropy approach [Baudry et el. (2010)]

Baudry et al. (2010) proposed to merge the components of a finite mixture by minimizing the total entropy

\[ - \sum_{i=1}^n \sum_{j=1}^s \tau_{i \mathcal{I}_j} \log(\tau_{i \mathcal{I}_j}). \]

In this case, it can be seen that their approach is equivalent to the general score \(S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b)\) using

  • \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1\) and \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = (\tau_{i I_a}+\tau_{i I_b}) \log(\tau_{i I_a} + \tau_{i I_b}) - \left\{ \tau_{i I_a} \log(\tau_{i I_a}) + \tau_{i I_b} \log(\tau_{i I_b})\right\}\)
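A minimal Python sketch of this choice (function names are ours):

```python
import numpy as np

def omega_cnst(tau_i, a, b):
    # Every observation is weighted equally.
    return 1.0

def lambda_entr(tau_i, a, b):
    # Entropy change from pooling the posteriors of I_a and I_b.
    ta, tb = tau_i[a], tau_i[b]
    return (ta + tb) * np.log(ta + tb) - (ta * np.log(ta) + tb * np.log(tb))
```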

Longford & Bartosova [Longford & Bartosova (2014)]

Longford & Bartosova (2014) proposed to merge the two components maximizing the probability of classifying an observation to component \(I_b\) given that the observation was generated by either \(I_a\) or \(I_b\). To approximate this probability, a sample \(X^*\) is simulated from the finite mixture with components \(\mathcal{I}_s = \{I_1, \dots, I_s\}\). Using the posterior probabilities \(\mathcal{T}_{\mathcal{I}_s}^*\) of this simulated sample, their approach is equivalent to the score \(S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}^*, I_a, I_b)\) using

  • \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1\) and \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \frac{\tau_{i I_{b}}}{\tau_{i I_{a}} + \tau_{i I_{b}}}\)
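In Python this choice could be sketched as follows (names illustrative; `tau_i` is a row of the simulated posteriors \(\mathcal{T}_{\mathcal{I}_s}^*\)):

```python
def omega_cnst(tau_i, a, b):
    # Constant weights, as in the entropy approach.
    return 1.0

def lambda_demp_mod(tau_i, a, b):
    # Conditional posterior of I_b given that the observation comes
    # from I_a or I_b.
    return tau_i[b] / (tau_i[a] + tau_i[b])
```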

New approaches using the generic merging score

Each of the previous methods considered a different function \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\); we use the labels demp, entr and demp.mod to reference each function, respectively. Moreover, across the previous three methods we have seen two possible definitions of \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\); we refer to them as prop and cnst. Doing so, we are able to refer to each method by the label for function \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\) followed by the label for function \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}\). For example, the DEMP approach of Hennig (2010) corresponds to prop-demp.

Using this labeling, we can define three new approaches corresponding to the labelings prop-entr, prop-demp.mod and cnst-demp.

Lambda.Omega   cnst                           prop
demp           cnst-demp                      Hennig (2010)
demp.mod       Longford & Bartosova (2014)    prop-demp.mod
entr           Baudry et al. (2010)           prop-entr

Using the indicator variable as a weighting function

In the previous approaches we have used either \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1\) (cnst) or \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{a}}\) (prop). In the first case each observation is weighted equally; in the second, each observation is weighted according to its posterior probability of belonging to \(I_a\). A more extreme weighting can be introduced by considering only those observations whose posterior probability of being classified to \(I_a\) is maximal, that is \({\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1{\hskip -2.4 pt}\hbox{I} \left( \forall j\; \tau_{i I_{a}} \geq \tau_{iI_j} \right)\). We refer to this weighting as dich. This new weighting allows us to introduce three new approaches.
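A Python sketch of this weighting (illustrative names; `tau_i` is a row of posteriors and `a` the index of \(I_a\)):

```python
import numpy as np

def omega_dich(tau_i, a, b):
    # Count the observation only when I_a attains the largest posterior.
    return float(np.all(tau_i[a] >= tau_i))
```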

Lambda.Omega   cnst                           prop             dich
demp           cnst-demp                      Hennig (2010)    dich-demp
demp.mod       Longford & Bartosova (2014)    prop-demp.mod    dich-demp.mod
entr           Baudry et al. (2010)           prop-entr        dich-entr

Using the posterior probability of being classified to \(I_b\)

A very simple function measuring how likely it is to classify an observation into \(I_b\) when the observation comes from \(I_a\) is simply the posterior probability of classifying it to \(I_b\), that is \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{b}}\). This measure is similar to the one introduced by Longford & Bartosova, but in this case we do not assume that the observation was generated by either component \(I_a\) or \(I_b\).

We refer to this approach as prop (not to be confused with the prop labeling used for the function \({\color{blue}{\omega}}\)). With this new function we have three new approaches to merge components.
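As a Python one-liner (with `tau_i` a row of posteriors and `b` the index of \(I_b\); names illustrative):

```python
def lambda_prop(tau_i, a, b):
    # Posterior probability of classifying observation i to I_b.
    return tau_i[b]
```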

Lambda.Omega   cnst                           prop             dich
demp           cnst-demp                      Hennig (2010)    dich-demp
demp.mod       Longford & Bartosova (2014)    prop-demp.mod    dich-demp.mod
entr           Baudry et al. (2010)           prop-entr        dich-entr
prop           cnst-prop                      prop-prop        dich-prop

The log-ratio criterion

The log-ratio between the posterior probability of classifying an observation to \(I_b\) and the posterior probability of classifying it to \(I_a\) measures how likely it is that an observation is classified to \(I_b\) when it was in fact generated by component \(I_a\). It can therefore be used to measure how appropriate it is to merge components \(I_a\) and \(I_b\).

  • Consider \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \log(\tau_{i I_{b}} / \tau_{i I_{a}})\)
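A Python sketch of the log-ratio lambda (illustrative names):

```python
import numpy as np

def lambda_log(tau_i, a, b):
    # Log-ratio of the posterior for I_b over the posterior for I_a.
    return np.log(tau_i[b] / tau_i[a])
```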

This approach allows us to introduce three new methods for merging components.

Lambda.Omega   cnst                           prop             dich
demp           cnst-demp                      Hennig (2010)    dich-demp
demp.mod       Longford & Bartosova (2014)    prop-demp.mod    dich-demp.mod
entr           Baudry et al. (2010)           prop-entr        dich-entr
prop           cnst-prop                      prop-prop        dich-prop
log            cnst-log                       prop-log         dich-log

The Aitchison distance criterion

Finally, using the norm defined on the simplex (Aitchison), we can construct a measure of entropy similar to the one introduced by Baudry et al. (2010). In this case, the measure has desirable properties such as scale invariance and compositional coherence.

  • Consider \({\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = - \left\{ \log(\tau_{i I_{b}} / \tau_{i I_{a}}) \right\}^2\)
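A Python sketch (illustrative names); the sign flip makes larger scores correspond to smaller squared log-ratios, i.e. to more similar posterior profiles:

```python
import numpy as np

def lambda_norm(tau_i, a, b):
    # Minus the squared log-ratio of the posteriors of I_b and I_a.
    return -np.log(tau_i[b] / tau_i[a]) ** 2
```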

We thus have three new methods:

Lambda.Omega   cnst                           prop             dich
demp           cnst-demp                      Hennig (2010)    dich-demp
demp.mod       Longford & Bartosova (2014)    prop-demp.mod    dich-demp.mod
entr           Baudry et al. (2010)           prop-entr        dich-entr
prop           cnst-prop                      prop-prop        dich-prop
log            cnst-log                       prop-log         dich-log
norm           cnst-norm                      prop-norm        dich-norm

Experiment: comparing the results for each approach

In this part we would like to see whether the previous approaches are useful for recovering the underlying components. We will carry out a very simple experiment, following these steps.

  1. We generate a sample from a finite mixture with a certain number of components. For example, in the next plot we show three different mixtures with \(K_0=3\) components generated using the package MixSim with different levels of overlap \(\hat{\omega} \in (0.1, 0.3, 0.5)\) (see Melnykov et al. (2012) for details). Note that for each observation we know the component from which it was generated. In other words, we have an initial classification \(L_0\) which allows us to compare other clusterings with the one given by the components of the generated finite mixture.
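The text uses the R package MixSim for this step; the following is a minimal Python analogue (all parameters illustrative, and without MixSim's control over the overlap \(\hat{\omega}\)): draw a sample from a \(K_0 = 3\) Gaussian mixture and keep the generating component of each observation as the reference classification \(L_0\).

```python
import numpy as np

rng = np.random.default_rng(0)
K0, n = 3, 300
weights = np.full(K0, 1.0 / K0)                          # equal mixing proportions
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # illustrative component means
L0 = rng.choice(K0, size=n, p=weights)                   # true component labels
X = means[L0] + rng.standard_normal((n, 2))              # unit-variance components
```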