# A non-parametric method to estimate the number of clusters

“An important and yet unsolved problem in unsupervised data clustering is how to determine the number of clusters. The proposed slope statistic is a non-parametric and data driven approach for estimating the number of clusters in a dataset. This technique uses the output of any clustering algorithm and identifies the maximum number of groups that breaks down the structure of the dataset. Intensive Monte Carlo simulation studies show that the slope statistic outperforms (for the considered examples) some popular methods that have been proposed in the literature. Applications in graph clustering, in iris and breast cancer datasets are shown.” (Fujita et al., 2014)

Fujita et al. (2014). A non-parametric method to estimate the number of clusters, Computational Statistics and Data Analysis, 73, 27-39.

Link to the paper

Advertisements

## 31 thoughts on “A non-parametric method to estimate the number of clusters”

1. Christian Hennig says:

Hi Alexandre! Another blog to look at… need to be somewhat econimical with my time, but…
The issue of estimating the numbers of clusters interests me a lot, and would benefit from some more philosophy, which is what you’re also interested in.
I think that this paper, as most others in the area, suffers from the fact that it implicitly assumes that there is such a thing as the unique “true clustering” in a dataset, without ever defining this formally. Would you attempt to define it, you’d realise that there are different possible legitimate definitions that have substantially different implications and will favour different methods. I don’t thing that it makes sense to make statements about the general quality of any method without saying more clearly to what “cluster concept” this applies.
For example, people may say that if you’re interested in Gaussian clusters with correlation and heterogeneous variances, you shouldn’t start with k-means in the first place instead of advertising anything as some magical all-purpose combination of clustering method and validation index.

Here is some stuff you may want to read:
C. Hennig and T. F. Liao How to find an appropriate clustering for mixed type variables with application to socioeconomic stratification. Journal of the Royal Statistical Science, Series C (Applied Statistics) 62, 309-369 (2013), with discussion.
C. Hennig How many bee species? A case study in determining the number of clusters. To appear in Proceedings of the GfKl-2012, available on http://www.homepages.ucl.ac.uk/~ucakche/pp.html.

2. Hi Christian,

Welcome to my simple blog!!

thanks for your comments. I totally agree with you. This paper did not discuss in depth the philosophical issues on the cluster definition. However, there is a brief discussion about this right on the second paragraph of the introduction of the paper:

“Although there are several proposals to determine the number of clusters, it is yet an un-
solved and difficult problem due to the absence of a clear definition of cluster and especially
because it is dependent on both the adopted clustering method and the characteristics of the
data distribution (shape and scale, for instance). One difficulty for the majority of the methods
is to correctly classify the data set when the data points inside the same cluster are correlated
or are not Gaussian, in high dimensional situations or when there is a dominant cluster [5, 6]”

The paper shows that, even for data with clearly separated clods of points, the existent methods have poor performance. The proposed method outperforms (for the considered examples) the existent methods when we have such “clusters” well formed.

In my view, the paper is not blindly or implicitly claiming the existence of true clusters. In many places, the word cluster is being quoted and there is some caveats about this.

I will read your papers carefully, thank you for letting me know.

Best regards,
Alexandre.

3. Christian Hennig says:

You’re right; it wouldn’t be justified to criticise your paper too much, given that the issue that I’m going on about is not treated well in pretty much any paper in the field, and I’m not a too big fan of the alternative indexes that “lose” in your study.
Still, in your simulation study you assume true clusters. For example, you don’t mention that robustniks could claim that very small groups of points are groups of outliers rather than proper clusters, or that if one looks for clusters of Gaussian shape it could be seen as legitimate and correct to fit data from an exponential distribution by more than one Gaussian, as the BIC does. So their performance being “poor” depends on what the problem definition is, which isn’t formally written down but rather you put up some setups and told the reader case by case what the true clusters are supposed to be. (I see that one could find your choices intuitive, but intuition isn’t precise and may go wrong.) I give another reference of mine below, regarding this problem.
Also I’d like to see in every such paper a situation in which the proposed method could be believed to work but doesn’t, because I think it’s important that people know when not to use the method (without such discussion it still smells as if the method is meant as a “universal” proposal). There is much freedom in designing such numerical studies in clustering and I’m quite sure that one can find setups in which your method doesn’t “work”, as you were very successful to find setups where others fail.

Again, don’t get me wrong. It’s a nice paper and what I criticise about it holds for 99% of the “competitors”, too.

C. Hennig Methods for merging Gaussian mixture components . Advances in Data Analysis and Classification (2010) 4, 3-34.

4. Christian,

Do you recommend other approaches to compare those estimating methods than by simulations? Analytical comparisons are quite difficult here. Of course we can also give up of trying to find clusters, since there is no precise definition about it and certainly they do not exist as a state of reality, they are just human interpretations as any other concepts. In my own world view, I adopt a kind of Nietzschian Perspectivism, but, as a co-author, I cannot impose my own interpretation of the world. Moreover, making caveats in every other sentence might be annoying for those who are aware of them.

You said:

“For example, you don’t mention that robustniks could claim that very small groups of points are groups of outliers rather than proper clusters…”

you are right, but your concern is entire embedded in the following excerpt:

“Although there are several proposals to determine the number of clusters, it is yet an un-
solved and difficult problem due to the absence of a clear definition of cluster and especially
because it is dependent on both the adopted clustering method and the characteristics of the data distribution (shape and scale, for instance)”

If you cannot identify in this excerpt what you said quoted above, I just shall conclude that we fail to communicate it. We could discuss it a little bit more, but we didn’t and we found that sentence sufficient to warn the reader.

You said:

“There is much freedom in designing such numerical studies in clustering and I’m quite sure that one can find setups in which your method doesn’t “work”, as you were very successful to find setups where others fail.”

We use the scenarios:

1. Five clusters with equal number of data points and variances
2. Five clusters with different number of data points and different variances
3. Five clusters with different number of data points, different variances and with correlation
among data points
4. Three clusters composed of different distributions
5. Five clusters composed of different distributions
6. Four clusters in high dimensional data
7. One cluster

and we also consider “Overlapping data”. We did not find a setup where the method totally fails (maybe on very weird situations). Let us know which kind of scenario our method could fail. You can use other metrics and clustering algorithms not considered in the paper, but the comparison in the paper is restricted to the Euclidean distance and the k-means algorithm.

My main interest is concentrated in foundations of statistics (evidence measures, hypothesis testing and so forth). I am finishing a paper about it right now. If you are interested I can share with you.

Best,
Alexandre.

5. Christian Hennig says:

Dear Alexandre,
I have been on holidays so only see this now.
“Do you recommend other approaches to compare those estimating methods than by simulations? Analytical comparisons are quite difficult here. Of course we can also give up of trying to find clusters, since there is no precise definition about it and certainly they do not exist as a state of reality, they are just human interpretations as any other concepts. In my own world view, I adopt a kind of Nietzschian Perspectivism, but, as a co-author, I cannot impose my own interpretation of the world. Moreover, making caveats in every other sentence might be annoying for those who are aware of them.”
I realise that often in this problem theoretical comparison is too difficult and that one needs to make do with simulations and toy examples. My opinion is the following: It is not enough to say that “there is no precise definition of a cluster” and then to simulate just *something* that seems intuitive. I think that one should go on and say that “although it is not a general definition of the term cluster, in the present paper we try to find clusters defined/characterised by the following properties:(…)”, followed by a more or less formal characterisation. Then, the chosen simulations can be assessed against to what extent problems with the given characterisation are covered, whether the clustering method is appropriate for this kind of problem etc. That is, the problem supposed to be solved by the method should be made more precise admitting that this the is more restrictive than solving “the clustering problem” in general.
“We did not find a setup where the method totally fails (maybe on very weird situations). Let us know which kind of scenario our method could fail.”
Well, this depends on the precise problem definition, which you didn’t give. For example, a mixture of two distinct Gaussian distributions that is unimodal will be one cluster according to a cluster definition based on modality, and two clusters according to a cluster definition based on Gaussian (or even approximately Gaussian – as long as the two components are not too similar) components. So *any* method that gets it right according to one of these definitions gets it wrong according to the other one, proving by the way that there is absolutely no way to solve the clustering problem without further restriction.
Another potential problem is the robustness problem with the average silhouette width that I give in the following paper (a pdf is on my website), although I didn’t fully think this through regarding your method and may be wrong:
C. Hennig Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis 99 (2008), 1154-1176.
Yet another situation in which many distance-based approaches fail and yours may well fail, too, is clusters with vastly (!) different within-cluster variation.
“My main interest is concentrated in foundations of statistics (evidence measures, hypothesis testing and so forth). I am finishing a paper about it right now. If you are interested I can share with you.”
You can send me whatever you like and I will have a quick look at it. I may well find this interesting but usually I’m not sure before having seen it.

6. Hi Christian,

Thanks again! Can you please point out where we said that we solved “the clustering problem” in general?

You said:
“although it is not a general definition of the term cluster, in the present paper we try to find clusters defined/characterised by the following properties:(…)”

and

“Well, this depends on the precise problem definition, which you didn’t give.”

I already pointed where in the paper there is a caveat about the general definition of a cluster. In the paper there is also a sentence about the clusters considered in the simulations:

“We generated items from several distributions assuming that **items from the same distribution belong to the same cluster**. The seven scenarios are described as follows:…”

That is, we are using that definitions of clusters and as far as I know we did not claim the we have a general solution for any clustering problem, did we?

7. Christian Hennig says:

Alexandre: You didn’t say explicitly that you gave a general solution for the clustering problem, but you said in your previous posting that “We did not find a setup where the method totally fails”, which, depending on the meaning of the term “setup”, implies that you were not aware of *any* possible definition of clusters that could fault your approach. As long as you are not restricting the problem to be solved explicitly, people will read your paper as attempting a general solution, even if you don’t say this explicitly.

““We generated items from several distributions assuming that **items from the same distribution belong to the same cluster**. ”
This is not a useful definition because a “distribution” is nothing well defined in these circumstances. A mixture of k Gaussians is a *single* distribution, too, despite being a mixture of k distributions at the same time, and a single Gaussian is a mixture of two half-Gaussians, which are distributions as well. Identifying clusters with “coming from the same distribution” needs to be complemented by a definition of what kind of distribution qualifies as “cluster-generating”. (Note that if you say that a Gaussian mixture component qualifies as a cluster in general, you cannot have a t-distribution qualifying as a cluster, too, because a t-distribution is better approximated by a mixture of several Gaussians than by a single one. So the business of identifying clusters with mixture components is by far not as easy as most people think.)

8. Christian,

Thanks, I got your point and agree with it in some extend. It is clear for me that the paper does not state any universal or general solution for all clustering problems. From the very beginning it is claimed that:

“Intensive Monte Carlo simulation studies show that the slope statistic outperforms (for the considered examples) some popular methods that have been proposed in the literature.”

that is, the statement “for the considered examples” is an important caveat here. Those examples are explicitly stated in the Simulation section.

I do not want to go through an infinitely debate about these things, which in my view were discussed in the paper, but not in the profoundity they deserve.

9. Christian Hennig says:

Alexandre, my main issue is not attacking your paper. As I said before, it’s a nice paper and there is absolutely no more reason to criticise it than pretty much everything else that is around on this topic. I really don’t want to discuss what you said and what you didn’t in the paper, but rather promote a number of connected thoughts that I have, including the one that facing the impossible aim of defining the clustering problem in general, one should define a restricted problem more precisely (not just “by example”) and demonstrate to what extent this is solved.

10. Hi Christian, I appreciate very much your interventions here. I strongly believe that I understand pretty much where you want to reach.

But, really, I do not believe that it is possible to “define a restricted problem more precisely” in a tentative to reach even approximately some ultimate truth. This platonic world where there exist “real” objects that can be precisely stated in any (in)formal statements does not make sense to me. The platonic world is a fiction, it is itself a negation of the strong dynamism of life. In my opinion, if we are to accept life as it is, we must give up definitively this platonic ideas and start treating science artistically in a deep sense. We have just human interpretations about nature, we use a language and sensors that emerged by necessity of survival rather than necessity of seeing/touching/hearing any ultimate truth.

Any (in)formal definitions are just perspectives, those perspectives cannot be more or less precise in an objective sense regarding a ultimate truth. These perspectives can be strong or weak, it depends on the type of the society these perspectives are being proposed. The powerful of a given perspective is driven by its seduction power: the power of convincing other to adhere. Nowadays, in science, the power of a perspective is driven by some “objective metrics”, but maybe in future this will change as the natural course of life. Of course, this cannot be stated in any statistical paper, people will laugh of it, because they prefer to live in a nihilistic platonic world.

I am not against to defining precise and powerful tools to treat practical problems. However, the idea of defining a problem precisely has a strong platonic background. There are much more to say about it, but, sorry, I cannot fully describe this perspectivistic view in a blog by using a foreign language.

All the best,
Alexandre.

11. Christian Hennig says:

Dear Alexandre, as you may note, I’m not too unhappy to get more philosophical… 😉
I am rather notorious for my antirealist tendencies than for any traces of platonism, so I object against your identification of “defining a problem precisely” with platonism; I rather agree with your general philosophical tendency, but I don’t think that what I’m proposing here is opposed to it.

A problem definition is *within mathematics*, and it is man-made. It doesn’t come from any higher “real” platonic world.
If I define that I want to have a procedure that is good for estimating the number of clusters in Gaussian mixtures where clusters are identified with mixture components, this does *not* mean that I’m saying that real data really comes from Gaussian mixtures. It rather serves to make the discussion clearer, because then we can properly separately discuss two issues:
1) Is the proposed procedure indeed good at estimating the number of Gaussian mixture components?
2) Is (maybe depending on a given application) estimating the number of Gaussian mixture components the appropriate thing to do if we want to estimate the number of clusters?

Using mclust with the BIC is well known (from some theory and an excessive range of simulations by many people) to be fairly good at estimating the number of Gaussian mixture components. However, as I argued in some of the papers cited above, in many applications this is *not* exactly what is required. (There are some well known problems with mclust/BIC for Gaussian mixtures, too, if clusters are too small or near-collinear, although there is research about improvements.)
This however doesn’t kill of mclust/BIC as a method for clustering. It rather means that it is good for a specific range of problems, and all research controbutes to understanding what is within this range of problems and what is outside (and what to do then, of course). The interesting question here is to what extent the “Gaussian mixture components” cluster concept is useful in a given application, (more or less) regardless of whether the underlying data generating mechanism is “really” a Gaussian mixture.
The same I’d like to see for all good methods: An as precise as possible exploration if what the specific problem is at which the method is good, exploring its “implicit assumptions”, and what is outside its application range. Of course, this can in general not be expected from a single paper; people do research about the BIC for decades and neither you nor me could compete with this amount of work immediately when proposing a new method.
Because this is needed so that people can decide in a given application between the many different methods proposed in the literature most of which come with good-looking simulations initially.

But we should always keep in mind that precisely defined problems are just models and don’t match any platonic reality. It’s about how we think about a problem, not what its platonic “objective reality” is.

I use the opportunity to refer to another one of my papers, not yet published:
P. Coretto, C. Hennig “Finding approximately Gaussian clusters via robust improper maximum likelihood”, on arxiv http://arxiv.org/abs/1309.6895
In this paper we propose a new method (not for the number of clusters, though) and give an informal but still to some extent precise description of what kind of clusters we’re after, followed by a simulation study that is more explicit about the “cluster concept” that we explore there.

12. Christian,

Let’s get into the philosophical issues.

“A problem definition is *within mathematics*, and it is man-made. It doesn’t come from any higher “real” platonic world.”

I depends on the scholar beliefs, some thinkers consider that the observed world can be faithfully represented by mathematical objects. Then, following this reasoning, defining a problem more precisely would mean to understand more deeply the nature itself. However, (any sort of) language is itself a detachment of nature and part of the platonic world. If you believe that language has the potentiality of knowing the observed world, in my view, your are being platonic (I’m not saying you said that). Of course, we can get into the meaning of “knowing some thing” and get deeper into other concepts and on and on infinitely… we will invariably conclude that language is made of by ill-defined concepts: postulations of an equality for those that are essentially unequals.

For me, to be ill-defined is *NOT* a bad issue, it is an intrinsic feature of any language and we have to face it. All concepts are metaphors, since we cannot speak of “real objects”, we always speak of something around loaded with human values. Some of these metaphors were so repeated that we rooted them as concrete objects leading us to think that they really exist and make strong sense.

“If I define that I want to have a procedure that is good for estimating the number of clusters in Gaussian mixtures where clusters are identified with mixture components, this does *not* mean that I’m saying that real data really comes from Gaussian mixtures.”

Of course it does not mean that. The problem here is with the cluster definition. In my view, a cluster is a cloud of points that are near in some sense, this depends on the zoom you specify and the metric of being near. The definition of clusters can be the one that the employed clustering algorithm detects as a cluster (e.g., the clusters found by the k-means can be regarded as the definition of clusters). Straightforwardly simple! In my view, we don’t really need any meta-physical precise definition for it (i.e., a platonic definition), this can be adjusted by need. If someone wants to detect different types of clusters, then (s)he can propose a new algorithm with other metrics.

Cluster is a fuzzy word as any other word in a language. And as such, it cannot be precisely defined, since it is a metaphor loaded with human values. If a cluster is defined to be the thing that a clustering algorithm finds, then we have a perspectivistic definition of cluster driven by a mathematical artifact. We do not have to believe on this definition, it may serve just as a direction, not the right direction (since a human-independent right direction does not exist).

We can also invent a platonic world where clusters are unquestionably well defined: even an extreme point would be regarded as part of the cluster…

13. Christian Hennig says:

Alexandre,

I don’t disagree with anything you write here. I’m certainly not part of the “platonic” group characterised by you above. However, I think that we should distinguish between the general use of the term “cluster”, which is necessarily fuzzy, as you write, and a use of modeling and mathematics for the aim of understanding what a specific cluster analysis method does, and what it doesn’t do.

When I define, as above, for example, “clusters” (in the sense of “the specific type of clusters that a specific method tries to find, and that may be needed in a specific application”) as “Gaussian mixture components”, I’m not giving a definition that attempts to be a general definition of the term “cluster”. Instead I define a mathematical/statistical problem, which under some circumstances can be seen as “the problem of cluster analysis”, but in other circumstances, other kinds of clusters may be of interest, and then a different definition is needed.

We apparently agree that there are different possible definitions of “clusters”, according to which in a given dataset (or from a given underlying model) there may be different “true” clusters. But if we agree on this, why isn’t it obvious that stating the specific definition underlying a specific research project/method would be helpful to analyse and understand about its properties?

I am insisting on having the “cluster concept” defined not because I think that there is or should be a generally true one, but rather as a communication and analysis device to understand the methods (such as yours) better.

Consider the distribution 0.3*N(0,1)+0.3*N(0,10000)+0.4*N(6,1). This distribution has three Gaussian components but two modes. Do you understand your method well enough in order to tell me whether it will give two or three clusters here (or even more) without trying out (for a sufficiently large dataset)? Wouldn’t you agree that it would be nice to have a problem definition that actually specifies whether we want 2 or 3 clusters here, so that it can be tested whether this is achieved? (Or that you can give such a specification post hoc, after having tried out, in order to communicate then to potential users what your method will or will not do in such a situation?)

14. HI Christian,

“I am insisting on having the “cluster concept” defined not because I think that there is or should be a generally true one, but rather as a communication and analysis device to understand the methods (such as yours) better.”

If someone defines precisely a cluster and find the best algorithm for attaining it, that is fine for me.

Just a remark: our method does not find clusters and does not need a cluster definition. This definition is embedded into the chosen clustering algorithm, that is, the cluster is being defined by the clustering algorithm. We proposed a method to estimate the number of clusters considering fixed the clustering algorithm (k-means or any other), that is, the cluster definition was already given by this algorithm. All methods were compared by using the same cluster definition that was implicitly imposed by the k-means algorithm.

“Consider the distribution 0.3*N(0,1)+0.3*N(0,10000)+0.4*N(6,1). This distribution has three Gaussian components but two modes. Do you understand your method well enough in order to tell me whether it will give two or three clusters here (or even more) without trying out (for a sufficiently large dataset)? ”

Maybe the k-means algorithm cannot detect more than one cluster for these data, since they are overlapped by the large variability of the second one.

15. I will just repeat the same thing:

In my view, a cluster is a cloud of points that are near in some sense, it depends on the zoom you specify and the metric to say what is near.

A cluster can be defined by the clustering algorithm (e.g., the clusters found by the k-means can be regarded as the definition of clusters). Straightforwardly simple!

In my view, we don’t really need any meta-physical definition for it (i.e., a platonic definition), this can be adjusted by need. If someone wants to detect different types of clusters, then (s)he can propose a new algorithm with other metrics. That is fine…

In my view, the more precise is a cluster definition, the more platonic you are being. We can invent reasons such that even a very distant point from the cloud can be regarded as part of the cloud. This reason is just an interpretation, a perspective. We can always re-interpret in a number of ways the very same “fact” and propose a number of different alternatives. That is fine… but when someone imposes only one perspective as any kind of truth, then I think we have a problem. I am not saying that your are doing this.

Please see the remark of my previous post above.

16. Christian Hennig says:

Alexandre:
“A cluster can be defined by the clustering algorithm (e.g., the clusters found by the k-means can be regarded as the definition of clusters). Straightforwardly simple!”
But this gives a different definition for every k, and doesn’t define a proper problem against which your method could be tested.

Let me put the problem as follows. Assume that mclust (fitting Gaussian mixtures) is used to compute a two- and three-component fit for data from the model I set up above. Would you expect your method to pick the two- or the three-cluster-solution? Why? And does it do what you expect it to do? And do you really see no need in a problem definition that specifies what the method *should* do, or at least what it does post hoc? If you don’t have such a definition, why are the results of your simulation not completely arbitrary? There is no scientific basis for saying that they are “good”, just intuition. Or am I wrong?

By the way, the concept of the silhouette does make implicit assumptions indeed. Consider data from 0.5*N(0,100)+0.5*N(6,1). Many points from the first mixture component will have a negative silhouette width if mixture components are told apart properly. So the silhouette statistic and statistics derived from it like yours will not like a clustering that gets these components right. (Try!)

“In my view, the more precise is a cluster definition, the more platonic you are being.”
That’s nonsense, given my way of interpreting such a definition, as explained before. As said before, the problem definition is *not* there for imposing only one perspective. It’s for elaborating what perspective is implicit in what method, aiding understanding, accepting that from a general perspective there is no objective way to say that one is better than the other.

17. Christian,

“But this gives a different definition for every k, and doesn’t define a proper problem against which your method could be tested.”

As in general science, statistics and mathematics, there exist some deep differences between my personal definitions and the definition that the scientific community adopts. In our discussion, I gave my personal definition, my personal philosophy has nothing to do with this technical result.

Our method just corrects the Rousseeuw’s original proposal. The slope statistic chooses the number of clusters (k) that not only has the highest silhouette (k =arg max s(k)) but also clusters that cannot be partitioned into smaller clusters, such that increasing the number of clusters would result in a much smaller silhouette, i.e., a gap in the silhouette value would be observed.

That is our proposal, if someone wants to find this type of clusters they can use our method. The Rousseeuw’s original proposal fails because the average silhouette can varies around a baseline, the maximum average silhouette will work only in very stable scenarios.

What is wrong with having different definitions of clusters for each k? Let X = {x1,…, xn} be the data, the clustering algorithm finds a partition {C1, …, Ck} of X in which the elements of each Cj have some similarities, which are imposed by the clustering algorithm. I call {C1, …, Ck} by clusters of X. That is the clustering algorithm is defining the clusters, when we decide which kind of similarity we want, we decide also what is a cluster.

It depends on what we mean by “definition of something”, but I can easily accept that for each k the “definition” of the clusters will change: if k=1 we have one cluster; if k=2 we have two clusters, and so on. They can be viewed as clusters with different definitions, since they contain different elements. However the same similarity criteria is being used for all k=1,…n, namely: the chosen clustering algorithm.

A question: how many clusters do you find reasonable to identify in this Graphic? you can say that k=1, 2,…, 500 and find an argument to justify your choice.

About the examples: I do not have the codes here. If you share the code (or at least parts of it), I can try it here to see your results.

18. Christian Hennig says:

Alexandre,
I do agree that there is a problem with maximising the original silhouette width. Another way to account for this is in my Hennig and Liao (2013) paper, but what you’re proposing may as well be fine.

The whole discussion is, from my point of view, about what kind of definition of a problem should be given when a method is introduced to tackle the problem. And I think that such a definition should be so that the proposed method can be tested against that definition. That’s the point. Nothing is wrong in general with having a definition of clusters for fixed k. But this doesn’t allow to test your method against it. You didn’t specify the problem in such a way that anyone could come up with a counterexample, because it isn’t clear what exactly would constitute a counterexample.

“A question: how many clusters do you find reasonable to identify in this Graphic? you can say that k=1, 2,…, 500 and find an argument to justify your choice.”
From a statistician’s point of view, it would be preferable to define “true” clusters in terms of the underlying model. I am not in principle against definitions based on the data alone, but in that case it would be all too easy to say, “if the true clusters can be defined based on the data alone, why not just use them and don’t use any other clustering method?”
I’m not sure how you generated the data, but I guess that in the graph counting density modes (or modes of a sufficiently smoothed density; there are some problems with densities I have discussed on Mayo’s blog from time to time) would define the number of clusters to be 5. There are a number of other conceivable general definitions (“general” not in the sense that they define the clustering problem in general but rather in the sense that they can be applied to general datasets/models) that would give 5 here, too. But try to find a definition that people would agree makes sense and that gives you something else than 5! That will be pretty tough.

“About the examples: I do not have the codes here. If you share the code (or at least parts of it), I can try it here to see your results.”
If you don’t want to engage with them, that’s up to you, but all I have explained here can be generated straightforward by a few lines of code in any language, which is much less work than sending stuff around after having done all the work that is required to make sure we use the same software, documentation etc.

19. Christian,

The principal author of that paper is André Fujita, he invited me to help with some parts of the paper. He conducted the simulations and I am not totally familiarized with the code. We discussed which kind of well-spaced clouds of data the existing methods may fail and we found a way to avoid this. We generated clouds of data like this one I showed in my previous post. If one of these clouds have more data than others (or different dispersions), then some of the existing methods cannot detect such 5 clusters, even for those well-spaced clouds of data. We really tried to find examples where our method fails, but for those well-spaced clouds of data we failed to find it (I tried to make it clear in our discussion, but it seems you did not understand). I think I already explained how the data were generated and all about our method. We only worked with WELL-SPACED clouds of data. Again, when I went through the philosophical issues of the cluster definition, I was not referring to the cluster definition used in that paper. I thought it was clear. I will not say anything more about that. I want to discuss philosophical issues now, but my personal views are not applied to that paper. If it is not clear for you, I will have problems to explain every other post and I am not willing to do that.

From no on, my comments have nothing to do with the methods of that paper, they are my personal views.

“From a statistician’s point of view, it would be preferable to define “true” clusters in terms of the underlying model.”

Some (the majority of) statisticians prefer to define “true” probabilities, “true” hypothesis, convergence to the “true” value, “true” clusters and so on. This meta-physical statements have platonic influences and our language is impregnated with such kind of reasoning, it is hard to avoid them without acting like a sophist. Of course, one can use these statements as shorthands for more complex interpretations without believing in such static true world. However, the discussion is much more complex than just make some caveats. It involves the structure of language and the way we use our rationality as a Kantian believe system where there exists a “thing in itself”. This is a fiction.

“I am not in principle against definitions based on the data alone, but in that case it would be all too easy to say, “if the true clusters can be defined based on the data alone, why not just use them and don’t use any other clustering method?””

I did not say anything about definition based on the data alone. I said that the clustering algorithm defines what is a cluster, simply because the algorithm imposes what is near and what is far away. When we decide which kind of similarity we want (which is embedded in the clustering algorithm), we decide also what is a cluster. Each clustering algorithm provides a different perspective of a cluster definition and I do not have problems in considering different definitions for clusters (***), this is fully consistent with our observable world: a definition changes according to the perspective and the observable world is ultimately a human interpretation and the interpretation changes according to the underlying values.

Someone, who is looking for a “true” static definition, is locked into the platonic world; perhaps is someone who adopted the Kantian believe system as a dogma without questioning its validity in our observable world. That is, someone who strongly believes in the structure of the human language as a trustful tool for reaching a ultimate truth.

As I said, it is a long conversation (maybe a lifetime conversation) and it is not possible to make it clear in a couple of posts, sorry.

(***) You may question our method based on this statement, but I already said what kind of cluster we considered: Well-spaced clouds of data, where you can detect them by eye, just as you made in the above post. This perspective can be formalized by using the silhouette statistics and our slope statistics.

20. Christian Hennig says:

Alexandre,
I think it is better not to go totally away from the topic of the paper, because that’s our major example for discussing the philosophy. Actually in your posting, you don’t go away from the paper really, which is fine by me.
Actually, looking for “well-spaced” point clouds goes some way in the direction I have in mind. It is not a clear formal definition, though. What counts as “well-spaced”? Can you define this formally? “What the eye can see” is not very helpful in other than <=3-d Euclidean settings, and even there it's very subjective. Furthermore, if this is your aim, why would one want to use k-means as the clustering method? Nothing in the definition of k-means supports clusters being "well-spaced" and it is well known that it will not properly separate "well-spaced" clusters if they have very different within cluster variation. (Try a large enough data set from 0.1*a uniform distribution on a unit square around (0,0) and 0.9*a uniform distribution from a rectangle with corner points (1,10),(1,200),(-10,10),(-10,200).)
Again I'd ask for a definition that defines what you would want to find here, which is not entirely clear from what you wrote, although it seems you would want to have the two uniforms separated, which your method won't do with k-means (because k-means won't do it in the first place), although it may with Single Linkage.

"Someone, who is looking for a “true” static definition, is locked into the platonic world"
Well we're also somewhat "locked" in this discussion, aren't we? From the beginning I said that I am *not* asking for a *static* definition, but for a problem-dependent one that serves to analyse the properties of methods and applications; a definition that doesn't claim that it is *really true*, but that is man-made in order to set some clear benchmarks against which a method can be tested.

"I do not have problems in considering different definitions for clusters"
Neither do I, but for a given fixed method, don't you think that it would be helpful to have a definition that makes clear what the method attempts? (Actually I think you already agreed on that by giving your "well-spacedness"-condition.)

21. HI Christian,

“What counts as “well-spaced”? Can you define this formally?”

I think we can say that the clouds are well-spaced if the clustering algorithm can detect those well-spaced clouds.

“Nothing in the definition of k-means supports clusters being “well-spaced” and it is well known that it will not properly separate “well-spaced” clusters if they have very different within cluster variation. (Try a large enough data set from 0.1*a uniform distribution on a unit square around (0,0) and 0.9*a uniform distribution from a rectangle with corner points (1,10),(1,200),(-10,10),(-10,200).)”

Yes, you are right. But, we compared the methods only for the cases where the k-means “rightly” identify the well-spaced clouds of data (it is said in the paper on Section 5.4). We also can use other clustering algorithms in cases where your definition of cluster disagree with the one provided by the algorithm.

“…, but for a given fixed method, don’t you think that it would be helpful to have a definition that makes clear what the method attempts? (Actually I think you already agreed on that by giving your “well-spacedness”-condition.)”

I do not know if you are referring to our method, but I think I already answered what our method does. That is,

We defined a sort of derivative of the silhouette to identify its abrupt decreases: The slope statistic chooses the number of clusters (k) that not only has the highest silhouette (k =arg max s(k)) but also clusters that cannot be partitioned into smaller clusters, such that increasing the number of clusters would result in a much smaller silhouette, i.e., a gap in the silhouette value would be observed. That is what the method does. The mathematical definition is in the paper. We use well-spaced clouds of data (which the k-means can identify “rightly”) to compare with the existing methods.

As the majority of philosophical problems, we got into a language problem. There are many different types of conceivable situations here, it will be impossible to treat each one of them and I do not know exactly what situation you are thinking in your question, maybe because we have different concerns.

1. The proposed method can be rigorously defined by using mathematical tools.

2. The problem of interest will always be defined from a perspective, that is, from an interpretation of the world.

Our proposed method is well-defined in terms of the silhouette statistic. It does what its definition states: it chooses a number of clusters that yields a specific feature in the silhouette statistics.

22. Christian Hennig says:

I give you credit for summarizing your point of view properly. I can see clearly what you’re getting at, but can I make it clear to you what I think is missing?
Let’s try once more.
You’re right that you give a precise mathematical definition of the method (which is “what the method does” in a certain sense, I accept). This relies on clusters equally precisely defined by the underlying clustering method, be it k-means or something else.
However, by doing this, you do not give a *problem definition* against which the method can be tested. Referring to the definition of the method, there is no way of using this to test the method, because obviously it does what it does.
Something to test the method against cannot be the method definition itself.
Actually you do *something* in this direction. Two things: a) you have all the simulated examples and b) you talk about the aim to find well-spaced clusters.
This, however, is not precise enough that somebody else could come with a counterexample that could be used to test the method and where there is a realistic possibility that the method could fail, because the problem definition is not precise enough to make clear what would constitute a counterexample.
For example, if you consider my uniform example from the previous post, it isn’t clear how you would accept that this could be a counterexample. Obviously, in the current form, you can argue that the two clusters that I want to see aren’t clusters in k-means, so that it is not the fault of your method together with k-means that it won’t find the two “correct” clusters (“correct” according to a precise definition to be given, but it is no problem to make “well-spaced” precise enough that it works here).
In this sense, also the method definition is not precise enough, because it isn’t clear in what situation what underlying basic clustering method should be used,

This has something to do with severity in Mayo’s more general philosophical conception. You only have evidence for the method being “good” if it was subjected to “severe” tests that would give it the real possibility to fail. You did some tests, but you didn’t specify what would constitute “failure” precisely enough that readers could test your method with their own examples on their own against your claim.

23. “… if you consider my uniform example from the previous post, it isn’t clear how you would accept that this could be a counterexample.”

Find a clustering algorithm that is able to detect two clouds and compute the slope statistics (by using the silhouette statistics). If the silhouette statistics don’t have an abrupt decreasing right after the “right” number of clusters, then our method fails.

We assume that the clustering algorithm is able to detect the clouds of data. The users must choose the clustering algorithm they prefer. Our method will return the number of clusters that yield a silhouette with some features. It is a kind of descriptive method.

You can use a clustering algorithm that finds clusters for which the silhouette is not a good measure of fitting. Then, our method will fail too, then the silhouette should be corrected to be used together with this clustering algorithm. This is kind of weird clusters I was referring to in a previous post. However, for well-spaced clusters, where the k-means is able to detect, I strong believe that our method is quite good and deserves to be used for those how want to find the number of clusters with the cited features.

24. Christian Hennig says:

That’s all fair enough but where do you stand on the philosophical issue?
Do you think there should be a problem definition that is clear enough that a reader knows precisely what would constitute a counterexample? (As explained before, this is different from a precise definition of the method.)

25. As I said before: I stand on a type of perspectivism.

Let’s concentrate on your question:

“Do you think there should be a problem definition that is clear enough that a reader knows precisely what would constitute a counterexample?”

It depends on what you mean by “clear”, “know”, “definition”, “counterexample”. The concepts can be reinterpreted in a number of ways (perspectives) and therefore the meaning of “clear”, “know”, “definition” and “counterexample” vary with the adopted perspective and the underlying values. It is not trivial at all.

My definitions for “clear” and “know” follow:

“Clear”
Something is clear to us when this thing has strong connections with others concepts that we got used by repetition. That is, something is clear when we are familiar with it (i.e., when it is strongly connected with other concepts).

OBS: A concept is aways ill-defined, since it postulates the equality of intrinsically different things. A concept is “clear” when we became accustomed by repetition.

“Know”
We know something when we create relations between the studied thing and all the other concepts that we got used by repetition. That is, we know something when we familiarize an unknown thing.

That is, different persons can have strongly different “clear knowledges” about the same perceived object. Some of them can be experimentally tested, others cannot.

Just for starts: do you think that someone knows precisely what would constitute a counterexample for language as a way of knowing things?

26. Depending on the adopted perspectives, the reader will never ‘know’ what would constitute a counterexample for a problem definition.

If they are strongly accustomed with a sort of reasoning and never questioned the underlying values behind their language and reasoning, they will probably find obscure something that may be quite clear under other perspective.

I think your question has some hidden assumptions about knowledge, reasoning, coherence and related things that induce to a certain type of conclusion…

27. Christian Hennig says:

Alexandre: Again there is nothing in your responses that I really disagree with, but you don’t really engage with my point (which may well be the consequence of my not having made it clear enough, but I already tried several times).
The thing is that the mathematical framework of statistics gives us the opportunity to give clear and precise definitions. People will not agree on what a cluster generally is and certainly I’d be very skeptical if somebody would claim that he or she “knows precisely” what that is. But mathematical modelling implies that we define “surrogate problems” which allow us to diagnose clearly to what extent they are solved. This, by the way, agrees with your definition of the word “clear” – there is pretty much general agreement about the content of mathematical definitions, which makes them “clear”. Indeed, it is in my interpretation the major aim of mathematics to have such a “clear” language – even in your sense. (Do you know my paper about mathematical modelling and reality?) For example, I can claim the k-means is the best method in terms of misclassification probabilities for clustering data that come from a mixture of k spherical Gaussian distributions with equal covariance matrices, and this is a claim that can be tested by computing the misclassification probabilities from specific such mixture distributions, and everybody with enough knowledge of the relevant bits of cluster analysis knows exactly what the claim points to (don’t get hang up here about details; I know that I’d need to do a little bit more formal work to make the math really clear, but at least it can be done). Or I can say that I assume an underlying density with a certain smotthness constraint on the density, and then the number of clusters is defined as the number of modes, and I can then claim that my method estimates this number well, and this can then be tested submitting any distribution that fulfils the constraints. I’m not saying that these definitions are the ones that you or anyone should use; my point is that they are definitions that serve to make a problem well and clearly defined.

“That is, different persons can have strongly different “clear knowledges” about the same perceived object. Some of them can be experimentally tested, others cannot.”
True. What I’d ask people is to make an effort to make then testable. For which mathematical modelling is a wonderful tool.

What we need is an acknowledgement that such mathematical problems, which can be mathematically tested, are *not* identical to the real problems that we want to solve. The models should *not* be taken literally. I’m *not* talking about knowledge that in reality data behave like this, “true” clusters are those defined like this etc.
Mathematical definitions set up toy problems that are there for testing purposes, and in every case discussion is required about whether the success of a method in a mathematically well defined problem really indicates that it is good at what we are interested in the non-mathematical reality. This may or may not be the case, or partly. The usual situation is probably that the problem definition is such that it defines idealistic situations in which we want a method to work, but indeed we want it to have a more general range of applicability than what’s in the problem definition. But at least it’s a strating point. On the other hand, if somebody sets up a mathematical problem definition and somebody then finds a counterexample, what is achieved is that it can be *learned* that something went wrong here, and with closer analysis, what went wrong. It may be that the method has indeed a flaw that was not known before to the author. Or it may be that the problemn definition was too general and allowed certain examples that with hindsight people would find pathological, so that the problem definition needs revision. In any case, the effect is that we learn something, and that’s a good thing, isn’t it?
Models are not reality. We can use them for transparency; for having a clear situation to play with despite the fact that reality is not so clear. Models are helpers.
From what you write I get the impression, at times, that you think that in order to have a precise mathematical problem definition, one needs to somehow assume or believe that reality is really like the model. Not so. (I grant you that many people in this respect wouldn’t agree with me. But my job is not to defend them.)

28. Christian, I understand what you are saying, mainly because I think, in the core, this is the modern view of mathematics/statistics, at least, in an underlying basis. I believe that you understand what I am saying here.

“For example, I can claim the k-means is the best method in terms of misclassification probabilities for clustering data that come from a mixture of k spherical Gaussian distributions with equal covariance matrices, and this is a claim that can be tested by computing the misclassification probabilities from specific such mixture distributions, …”

Yes, we do this all the time. I think we are saying the same thing by different paths, but I believe my way avoids some language problems about reality. In my view, each clustering algorithm defines what is a cluster (and this is a precise definition), the practitioners should verify if that definition is good or not for their problem: they can verify it by studying the definition of the clustering algorithm or by simulations or theoretical developments or whatever. That is, they should confront their cluster definitions with the one imposed by the clustering algorithm. Usually, the proponent of the method conducts some descriptive simulations in order to illustrate the method.

My opinion is that: a clustering algorithm offers a perspective, the user should verify if this perspective is sufficient to solve the problem at hand.

I think there are many levels of problems, some can be written in very precise terms, others cannot. An intrinsically mathematical problem can and evidently should be written in precise terms: a problem in the number theory, measure theory, probability theory, functional analysis, real analysis and so on. However, when an actual problem is translated into mathematical terms, it always will by dependent of the adopted underlying values and beliefs.

We can try to define coherence and information in precise terms, but those definitions should be wide enough to include many types of coherence and information. In my view, any type of restricted definition for coherence does more harm than good, since some scholars are willing to blindly believe on such restricted versions without questioning them at all. Such wide enough definitions would be considered obscure for some scholars. In general people do not want only a precise definition they want a very precise and restricted definition in order to have a strong principle driving their behaviors.

29. Christian Hennig says:

That’s all fine by me and I agree with much of it, but of course as designers of methods we should help the practitioners as good as we can to make sense of our method definitions. I grant you that these definitions are precise already, but their implications are often not obvious; not only for the practitioners they may be difficult to understand, also for the statisticians.
Anyway, chances are that we got as close to agreement as we can for the moment!?

30. Christian,

I agree with you.

“not only for the practitioners they may be difficult to understand, also for the statisticians.”

I think it is a matter of reinterpreting the old plastered concepts. I also strongly believe that science/math/stats would become much stronger by adopting an artistic perspective rather than a normative one.

Thanks for all of your comments. I am reading your paper: http://www.homepages.ucl.ac.uk/~ucakche/papers/modelpaper.pdf

Later on we can start a discussion about the ideas contained there. I am quite interested in them.

Best regards.