This is a nice and concise description of text categorization by two of the leading experts on the topic. I would like to add two remarks.
1. Text categorization is different from traditional statistical classification problems in that the number of features (words) is very large, and text is represented as a sparse vector with many zero components. Computationally, it is important to use a learning algorithm that can work efficiently with sparse data, and the performance of the algorithm should not be affected by the large dimensionality.
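To make this remark concrete, here is a minimal sketch (toy code, all names are mine) of the sparse representation: each document is stored as a dict mapping term to count, so memory and scoring cost scale with the number of nonzero components, not with the vocabulary size.

```python
from collections import Counter

def to_sparse(tokens):
    """Bag-of-words as a sparse term -> count dict (zero entries omitted)."""
    return dict(Counter(tokens))

def dot(doc, weights):
    """Score a sparse document against a (possibly huge) weight vector,
    touching only the document's nonzero terms."""
    return sum(v * weights.get(t, 0.0) for t, v in doc.items())

doc = to_sparse("the cat sat on the mat".split())
weights = {"cat": 1.5, "dog": -2.0, "mat": 0.5}
score = dot(doc, weights)  # cost is O(nonzeros in doc), not O(vocabulary)
```

A linear classifier scored this way never touches the zero components, which is exactly why high dimensionality need not hurt efficiency.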
2. Although bag of words is the most fundamental representation for text categorization, in many industrial applications other types of domain-specific features may be helpful. For example, for web-page classification, the page layout, link structure, etc., may yield useful information. Challenges such as finding effective methods to integrate these additional information sources, and methods to deal with large taxonomies, make text categorization an exciting topic for continuing research.
This is a great summary of text categorization. I just have the following comments, in the hope that the article can become even more comprehensive.
1) In addition to the methods reviewed in the article, Boosting is also a very promising text classification method; it would be worth introducing in some detail as well.
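As a sketch of what such an introduction could cover, here is a toy AdaBoost in the style of boosting-based text classifiers that use one-word decision stumps (this is my own illustrative code, not from the article; labels are +/-1, documents are sets of words):

```python
import math

def stump_predict(word, doc):
    """Weak learner: predict +1 if the word occurs in the document."""
    return 1.0 if word in doc else -1.0

def adaboost(docs, labels, vocab, rounds=5):
    n = len(docs)
    w = [1.0 / n] * n                      # per-example weights
    ensemble = []                          # list of (alpha, word)
    for _ in range(rounds):
        # pick the word-presence stump with the lowest weighted error
        best_word, best_err = None, 1.0
        for word in vocab:
            err = sum(wi for wi, d, y in zip(w, docs, labels)
                      if stump_predict(word, d) != y)
            if err < best_err:
                best_word, best_err = word, err
        best_err = min(max(best_err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best_word))
        # reweight: increase the weight of examples this stump got wrong
        w = [wi * math.exp(-alpha * y * stump_predict(best_word, d))
             for wi, d, y in zip(w, docs, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def classify(ensemble, doc):
    score = sum(a * stump_predict(word, doc) for a, word in ensemble)
    return 1 if score >= 0 else -1

# Toy data: sports (+1) vs. finance (-1)
docs = [{"ball", "goal"}, {"goal", "score"}, {"stock", "market"}, {"market", "price"}]
labels = [1, 1, -1, -1]
vocab = set().union(*docs)
model = adaboost(docs, labels, vocab)
```

Each round the weak learner only needs to beat chance; the weighted vote of many such stumps yields a strong classifier.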
2) Feature selection is also an important part of text categorization. I know Yiming Yang has a very nice paper on that. Maybe some information can be included here.
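For instance, one of the criteria compared in that line of work is the chi-square statistic, which measures the dependence between term presence and category membership; terms are ranked by it and the top k kept. A minimal sketch (my own toy code and data, purely illustrative):

```python
def chi_square(n, a, b, c_):
    """chi2 from a 2x2 contingency table over n documents:
    a = docs in category containing t,  b = docs outside category containing t,
    c_ = docs in category without t,   d = docs outside category without t."""
    d = n - a - b - c_
    num = n * (a * d - c_ * b) ** 2
    den = (a + c_) * (b + d) * (a + b) * (c_ + d)
    return num / den if den else 0.0

def select_features(docs, labels, category, k):
    """Rank terms by chi2 against `category` and keep the top k."""
    n = len(docs)
    vocab = set().union(*docs)
    scores = {}
    for t in vocab:
        a = sum(1 for d, y in zip(docs, labels) if t in d and y == category)
        b = sum(1 for d, y in zip(docs, labels) if t in d and y != category)
        c_ = sum(1 for d, y in zip(docs, labels) if t not in d and y == category)
        scores[t] = chi_square(n, a, b, c_)
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = [{"ball", "goal"}, {"goal", "score"}, {"stock", "market"}, {"market", "price"}]
labels = [1, 1, -1, -1]
top = select_features(docs, labels, 1, 2)  # most category-dependent terms
```

Terms distributed independently of the category score near zero and can be pruned, which directly attacks the dimensionality issue raised in the first comment above.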
3) As for multi-class and multi-label text categorization, many strategies and algorithms have been proposed in the literature. Although one can say that binary classification is the most fundamental case, some important aspects of how to connect it with the multi-class / multi-label cases would be worth introducing and explaining as well.
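The most common such connection is the one-vs-rest ("binary relevance") reduction: train one binary scorer per category, then either threshold each score (multi-label) or take the argmax (multi-class). A sketch, where the per-category scorer is a toy count-difference model of my own invention, purely to show the reduction:

```python
from collections import Counter

def train_binary(docs, positives):
    """Toy binary scorer: each term weighted by (positive count - negative count)."""
    pos, neg = Counter(), Counter()
    for d, is_pos in zip(docs, positives):
        (pos if is_pos else neg).update(d)
    return {t: pos[t] - neg[t] for t in set(pos) | set(neg)}

def score(weights, doc):
    return sum(weights.get(t, 0) for t in doc)

def one_vs_rest(docs, labelsets, categories):
    """One binary model per category, trained on category-vs-rest."""
    return {c: train_binary(docs, [c in ls for ls in labelsets])
            for c in categories}

def predict_multilabel(models, doc, threshold=0):
    return {c for c, w in models.items() if score(w, doc) > threshold}

def predict_multiclass(models, doc):
    return max(models, key=lambda c: score(models[c], doc))

docs = [{"ball", "goal"}, {"stock", "market"}, {"goal", "stock"}]
labelsets = [{"sports"}, {"finance"}, {"sports", "finance"}]
models = one_vs_rest(docs, labelsets, ["sports", "finance"])
```

Any binary learner can be plugged into `train_binary`, which is exactly why the binary case is fundamental, yet the thresholding and calibration questions this raises deserve discussion in their own right.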
4) As mentioned in the article, very large-scale text categorization remains challenging. In this case, some discussion of how to distribute the computations might also be helpful.