# Introduction to the problem of merging components

In this talk we focus on methods for merging the components of a finite mixture model hierarchically. The merging is done sequentially: after each merging step, the two merged components are treated as a single component.

# Merging components based on the posterior probabilities

We focus on merging the components of a finite mixture using only the information contained in the posterior probabilities. The problem of interest can be stated as follows:

Given, for each of $$n$$ observations, its posterior probabilities of belonging to each of the $$k$$ components $$I_j$$,

$\left[ \begin{array}{ccccc} (\tau_{11}, & \dots & \tau_{1j}, & \dots & \tau_{1k}), \\ \vdots & & \vdots & & \vdots \\ (\tau_{i1}, & \dots & \tau_{ij}, & \dots & \tau_{ik}), \\ \vdots & & \vdots & & \vdots \\ (\tau_{n1}, & \dots & \tau_{nj}, & \dots & \tau_{nk}) \end{array} \right] \in \mathcal{S}^k$

merge the components (sequentially) to obtain a hierarchy over the set of components.
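The sequential procedure can be sketched as a greedy agglomerative loop. This is a minimal illustration, not code from the talk: the function names, and the convention that the posteriors are stored as an $$(n, k)$$ numpy array with one column per component, are our own assumptions; `score` stands for any pairwise merging score supplied by the user (such as the one defined in the next section).

```python
import numpy as np

def merge_hierarchy(tau, score):
    """Greedy sequential merging of mixture components.

    tau   : (n, k) array; row i holds the posterior probabilities of
            observation i for the k current components.
    score : callable score(tau, a, b) -> float; the ordered pair with
            the highest score is merged at each step.
    Returns the merge history as pairs of sets of original labels.
    """
    groups = [{j} for j in range(tau.shape[1])]  # current components
    history = []
    while tau.shape[1] > 1:
        k = tau.shape[1]
        # Evaluate every ordered pair (scores may be asymmetric in a, b).
        pairs = [(a, b) for a in range(k) for b in range(k) if a != b]
        a, b = max(pairs, key=lambda p: score(tau, p[0], p[1]))
        history.append((groups[a], groups[b]))
        # The merged component's posteriors are the sum of the two columns.
        merged = tau[:, a] + tau[:, b]
        keep = [j for j in range(k) if j not in (a, b)]
        tau = np.column_stack([tau[:, keep], merged])
        groups = [groups[j] for j in keep] + [groups[a] | groups[b]]
    return history
```

Note that after each merge the posteriors of the new component are the sum of the two merged columns, which is what "the merged components are considered as a single component" amounts to at the level of $$\mathcal{T}$$.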

# Our proposal: a generic merging score

To solve the stated problem, we propose:

Given the set of components $$\mathcal{I}_s = \{I_1, \dots, I_s\}$$ and the posterior probabilities $$\mathcal{T}_{\mathcal{I}_s} = \{\tau_{1 \mathcal{I}_s}, \dots, \tau_{n \mathcal{I}_s} \}$$, we propose to merge the two components $$I_a$$ and $$I_b$$ into one component $$I_a \cup I_b$$ that maximise the score $S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b) = \frac{\sum_{i=1}^n {\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} {\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}}{\sum_{i=1}^n {\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)} }},$

where

• the function $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}$$ measures how likely it is that components $$I_a$$ and $$I_b$$ should be merged given the information contained in $$\tau_{i \mathcal{I}_s}$$, and
• the function $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}$$ measures how relevant observation $$i$$ is for computing the overall score for merging $$I_a$$ and $$I_b$$ given the information contained in $$\tau_{i \mathcal{I}_s}$$.

Our proposal is generic in the sense that we do not fix the measure used to compare the components: the user can define the functions $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}$$ and $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}$$ according to their problem.
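As an illustration, the score can be computed directly from its definition. This is a sketch under our own conventions: posteriors in an $$(n, k)$$ numpy array, and `omega` and `lam` as user-supplied callables taking one row of posteriors and the two component indices.

```python
import numpy as np

def merging_score(tau, a, b, omega, lam):
    """Generic score S_{omega,lambda}(T_{I_s}, I_a, I_b): the lambda
    values averaged over observations, weighted by omega."""
    w = np.array([omega(row, a, b) for row in tau])
    l = np.array([lam(row, a, b) for row in tau])
    return float(np.sum(w * l) / np.sum(w))
```

With $$\omega \equiv 1$$ the score reduces to the plain average of $$\lambda$$ over the sample, which is the form taken by several of the methods below.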

In the next sections, we show how ideas that appeared in different works fit into the presented framework.

## The DEMP approach [Hennig (2010)]

Hennig (2010) proposed a method based on maximizing the Directly Estimated Misclassification Probabilities (DEMP approach). It is easy to show that the DEMP approach is equivalent to the general score $$S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b)$$ using

• $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{a}}$$ and $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1{\hskip -2.4 pt}\hbox{I} \left( \forall j\; \tau_{i I_{b}} \geq \tau_{iI_j} \right)$$.
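Under the same conventions as before (one row of posteriors per observation; the function names are ours), the two DEMP ingredients could look like:

```python
import numpy as np

def omega_prop(row, a, b):
    # Weight observation i by its posterior probability of component I_a.
    return row[a]

def lambda_demp(row, a, b):
    # Indicator that I_b is a maximum-a-posteriori component for
    # observation i, i.e. tau_{iI_b} >= tau_{iI_j} for all j.
    return 1.0 if row[b] >= row.max() else 0.0
```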

## The entropy approach [Baudry et al. (2010)]

Baudry et al. (2010) proposed to merge the components of a finite mixture by minimizing the total entropy

$-\sum_{i=1}^n \sum_{j=1}^s \tau_{i I_j} \log(\tau_{i I_j}).$

In this case, it can be seen that their approach is equivalent to the general score $$S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b)$$ using

• $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1$$ and $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = (\tau_{i I_a}+\tau_{i I_b}) \log(\tau_{i I_a} + \tau_{i I_b}) - \left\{ \tau_{i I_a} \log(\tau_{i I_a}) + \tau_{i I_b} \log(\tau_{i I_b})\right\}$$
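This $$\lambda$$ is the reduction in total entropy obtained by merging $$I_a$$ and $$I_b$$ for observation $$i$$, so maximizing the score greedily minimizes the entropy. A sketch with the usual convention $$0 \log 0 = 0$$ (function name ours):

```python
import numpy as np

def lambda_entropy(row, a, b):
    # Entropy reduction for observation i when I_a and I_b are merged,
    # with the convention 0 * log(0) = 0.
    def xlogx(x):
        return x * np.log(x) if x > 0 else 0.0
    return xlogx(row[a] + row[b]) - (xlogx(row[a]) + xlogx(row[b]))
```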

## The Longford & Bartosova approach [Longford & Bartosova (2014)]

Longford & Bartosova (2014) proposed to merge the two components maximizing the probability of classifying an observation to component $$I_a$$ given that the observation was generated by either $$I_a$$ or $$I_b$$. To approximate this probability, a sample $$X^*$$ is simulated from the finite mixture with components $$\mathcal{I}_s = \{I_1, \dots, I_s\}$$. Using the posterior probabilities $$\mathcal{T}_{\mathcal{I}_s}^*$$ of this simulated sample, their approach is equivalent to the score $$S_{\omega, \lambda}( \mathcal{T}_{\mathcal{I}_s}, I_a, I_b)$$ using

• $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1$$ and $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \frac{\tau_{i I_{b}}}{\tau_{i I_{a}} + \tau_{i I_{b}}}$$
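This $$\lambda$$ conditions the posterior on the pair $$\{I_a, I_b\}$$; a one-line sketch (function name ours; the formula is undefined when both posteriors are zero, a case we do not guard here):

```python
def lambda_demp_mod(row, a, b):
    # Posterior probability of I_b given that the observation comes
    # from either I_a or I_b.
    return row[b] / (row[a] + row[b])
```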

# New approaches using the generic merging score

Each of the previous methods considered a different function $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}$$:

• $$1{\hskip -2.4 pt}\hbox{I} \left( \forall j\; \tau_{i I_{b}} \geq \tau_{iI_j} \right)$$ for the DEMP approach
• $$(\tau_{i I_a}+\tau_{i I_b}) \log(\tau_{i I_a} + \tau_{i I_b}) - \left\{ \tau_{i I_a} \log(\tau_{i I_a}) + \tau_{i I_b} \log(\tau_{i I_b})\right\}$$ for the entropy approach, and
• $$\frac{\tau_{i I_{b}}}{\tau_{i I_{a}} + \tau_{i I_{b}}}$$ for the version of Longford & Bartosova

We use the labels demp, entr and demp.mod to refer to these functions, respectively. Moreover, for the previous three methods we have seen two different definitions of $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}$$:

• $$\tau_{i I_{a}}$$ for the DEMP approach and
• $$1$$ for the approach presented by Baudry et al. and for the one by Longford & Bartosova.

We refer to them as prop and cnst, respectively. Doing so, we are able to refer to each method using the label for the function $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}}$$ followed by the label for the function $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}}$$. For example,

• we refer to the DEMP approach as prop-demp,
• we refer to the approach of Baudry et al. as cnst-entr, and
• we refer to the approach of Longford & Bartosova as cnst-demp.mod.

Using this labeling, we can define three new approaches corresponding to the labelings prop-entr, prop-demp.mod and cnst-demp.

| $\lambda \backslash \omega$ | cnst | prop |
|---|---|---|
| demp | cnst-demp | Hennig (2010) |
| demp.mod | Longford & Bartosova (2014) | prop-demp.mod |
| entr | Baudry et al. (2010) | prop-entr |

## Using the indicator variable as a weighting function

In the previous approaches we used either $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1$$ (cnst) or $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{a}}$$ (prop). In the first case each observation is weighted equally; in the second, each observation is weighted according to its posterior probability of belonging to $$I_a$$. A more extreme weighting can be introduced by considering only those observations whose posterior probability of being classified to $$I_a$$ is maximal, that is, $${\color{blue}{\omega(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = 1{\hskip -2.4 pt}\hbox{I} \left( \forall j\; \tau_{i I_{a}} \geq \tau_{iI_j} \right)$$. We refer to this weighting as dich. This new weighting allows us to introduce three new approaches.
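The dich weighting is the $$\omega$$ counterpart of the DEMP indicator, but applied to $$I_a$$; a sketch under the same row-of-posteriors convention as before (function name ours):

```python
import numpy as np

def omega_dich(row, a, b):
    # 1 only if I_a is a maximum-a-posteriori component for observation
    # i; otherwise the observation is ignored by the score.
    return 1.0 if row[a] >= row.max() else 0.0
```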

| $\lambda \backslash \omega$ | cnst | prop | dich |
|---|---|---|---|
| demp | cnst-demp | Hennig (2010) | dich-demp |
| demp.mod | Longford & Bartosova (2014) | prop-demp.mod | dich-demp.mod |
| entr | Baudry et al. (2010) | prop-entr | dich-entr |

## Using the posterior probability of being classified to $$I_b$$

A very simple function measuring how likely it is to classify an observation into $$I_b$$ when the observation comes from $$I_a$$ is the posterior probability of classifying it to $$I_b$$, that is, $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \tau_{i I_{b}}$$. This measure is similar to the one introduced by Longford & Bartosova, but in this case we do not assume that the observation was generated by either component $$I_a$$ or $$I_b$$.

We refer to this approach as prop (not to be confused with the prop label used for the function $${\color{blue}{\omega}}$$). With this new function we obtain three new approaches to merge components.

| $\lambda \backslash \omega$ | cnst | prop | dich |
|---|---|---|---|
| demp | cnst-demp | Hennig (2010) | dich-demp |
| demp.mod | Longford & Bartosova (2014) | prop-demp.mod | dich-demp.mod |
| entr | Baudry et al. (2010) | prop-entr | dich-entr |
| prop | cnst-prop | prop-prop | dich-prop |

## The log-ratio criterion

The log-ratio between the posterior probability of classifying an observation to $$I_b$$ and the posterior probability of classifying it to $$I_a$$ measures how likely it is to classify an observation to $$I_b$$ when it was in fact generated by component $$I_a$$. This quantity can therefore be used to measure how likely it is that components $$I_a$$ and $$I_b$$ should be merged.

• Consider $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = \log(\tau_{i I_{b}} / \tau_{i I_{a}})$$.

This approach allows us to introduce three new methods for merging components.

| $\lambda \backslash \omega$ | cnst | prop | dich |
|---|---|---|---|
| demp | cnst-demp | Hennig (2010) | dich-demp |
| demp.mod | Longford & Bartosova (2014) | prop-demp.mod | dich-demp.mod |
| entr | Baudry et al. (2010) | prop-entr | dich-entr |
| prop | cnst-prop | prop-prop | dich-prop |
| log | cnst-log | prop-log | dich-log |

## The Aitchison distance criterion

Finally, using the norm defined on the simplex (the Aitchison geometry), we can define a measure similar to the entropy introduced by Baudry et al. (2010). In this case, the measure has desirable properties such as scale invariance and compositional coherence.

• Consider $${\color{red}{\lambda(\tau_{i \mathcal{I}_s}, I_a, I_b)}} = - \left( \log(\tau_{i I_{b}} / \tau_{i I_{a}}) \right)^2$$.
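The three new $$\lambda$$ functions introduced in this and the two previous subsections are one-liners under our row-of-posteriors convention (function names ours; prop, log and norm refer to the labels above, and the log-based ones assume strictly positive posteriors):

```python
import numpy as np

def lambda_prop(row, a, b):
    # Posterior probability of classifying observation i to I_b.
    return row[b]

def lambda_log(row, a, b):
    # Log-ratio between the posteriors of I_b and I_a.
    return np.log(row[b] / row[a])

def lambda_norm(row, a, b):
    # Negative squared log-ratio: an Aitchison-distance-type term,
    # largest (zero) when the two posteriors coincide.
    return -np.log(row[b] / row[a]) ** 2
```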

We have three new methods

| $\lambda \backslash \omega$ | cnst | prop | dich |
|---|---|---|---|
| demp | cnst-demp | Hennig (2010) | dich-demp |
| demp.mod | Longford & Bartosova (2014) | prop-demp.mod | dich-demp.mod |
| entr | Baudry et al. (2010) | prop-entr | dich-entr |
| prop | cnst-prop | prop-prop | dich-prop |
| log | cnst-log | prop-log | dich-log |
| norm | cnst-norm | prop-norm | dich-norm |

# Experiment: comparing the results of each approach

In this part we want to see whether the previous approaches are useful for recovering the underlying components. We perform a very simple experiment, following these steps.

1. We generate a sample from a finite mixture with a certain number of components. For example, in the next plot we show three different mixtures with $$K_0=3$$ components generated using the package MixSim with different levels of overlap $$\hat{\omega} \in \{0.1, 0.3, 0.5\}$$ (see Melnykov et al. (2012) for details). Note that for each observation we know the component that generated it. In other words, we have an initial classification $$L_0$$ that allows us to compare other clusterings with the one given by the components of the generated finite mixture.
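MixSim is an R package; as a simple illustrative substitute (not MixSim, and without its overlap control $$\hat{\omega}$$), a labelled sample can be drawn from a hand-specified Gaussian mixture with numpy. The weights, means, sample size and seed below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hand-specified 2-d Gaussian mixture with K0 = 3 components.
weights = np.array([0.3, 0.4, 0.3])
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])

n = 300
# L0: the true component of each observation, kept for later comparison
# against the clusterings produced by the merging methods.
L0 = rng.choice(3, size=n, p=weights)
X = means[L0] + rng.standard_normal((n, 2))
```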