# Talk:Information theoretic clustering

You might want to ignore my comment about including a sentence that cites the connection to graph cut in the sample-estimator definition section. It was inserted before I reached the end.

## Reviewer B

The entry is about an interesting and important topic: developing principled approaches to clustering. Clustering is one of the most widely used data analysis tasks and is very useful in practice, but most of the existing approaches are more or less heuristic. Hence it is very easy, and unfortunately common, to apply them in ways that produce misleading results.

That is why principled methods are needed, and the described machinery includes nice connections between clustering approaches and information-theoretic measures.

The text entry is mostly well-written and understandable. I have one bigger concern and a few suggestions on how to make the entry better.

The bigger concern is that the title suggests a much broader entry, of which the current contents cover only one aspect. For instance, many works by Tishby, Dhillon, Kaski, and some works using minimum description length principles could be included under the title. Maybe you could consider a title like "Renyi entropy-based clustering", or else broaden the entry.

**AUTHOR COMMENTS: We agree, and have changed the title accordingly.**

Here is a list of relatively minor things that should be clarified to make the entry more accessible:

Intro: "... divergence measure is obtained via Parzen windowing, without having to estimate the full density function". But Parzen windowing is a density estimation method; could you clarify?

**AUTHOR COMMENTS: We erased this sentence in order not to create confusion.**

Problem specification: "... include measures based on the Cauchy-Schwarz (CS) inequality and the integrated squared error (ISE)". These are technical side-issues that are still unclear at this point of the text.

**AUTHOR COMMENTS: Agreed. Sentence removed.**

Renyi entropy-based divergence measures: The section introduces several variants, but it does not become clear why several are needed, how they differ, and how to choose which one to use in practice. Could you clarify these issues? In particular, it would be important to know in what sense each of them is principled; otherwise the reader is left with the impression that many kinds of arbitrary approximations were made.

**AUTHOR COMMENTS: Based on your comments, we removed all mention of ISE, since it really is not that important in the current context. We tried to explain in more detail how the CS measure is principled, and why this measure is used.**

Just above eqn 4: "exactly similar".

**AUTHOR COMMENTS: Erased "exactly".**

Gradient descent for clustering optimization: Please explain where the first equation comes from. It is clearly a measure of the difference between clusters, but why precisely this form of normalization etc.

**AUTHOR COMMENTS: Based on your comments, we removed the equation in question.**

Gradient descent for clustering optimization: Towards the end of section, the level of detail is disproportionately high compared to the rest of the text.

**AUTHOR COMMENTS: We agree, and have basically removed all technical details. These are in the cited papers anyway.**

Connection to graph-theoretic clustering: The first reference to equation (10) seems to point to the wrong formula. Apparently the first formula should be renumbered to 10?

**AUTHOR COMMENTS: There is something strange with the numbering, and it is out of our control. Will seek advice from Editor-in-chief upon acceptance.**