Media Content Analysis
Xian-Sheng Hua and Hong-Jiang Zhang (2008), Scholarpedia, 3(2):3712. doi:10.4249/scholarpedia.3712. Revision #91478.
Rapid advances in the Internet and digital storage technologies have caused an explosion of media data. Media computing enables users to access desired content from these data with less effort. It is an area that uses intelligent technologies based on computers or other computing devices to facilitate the acquisition, creation, analysis, understanding, sharing, and experiencing of multimedia content, including the integrated manipulation of graphics, audio, video, and text through programming. Typical sub-topics in media computing include media content analysis, management, rendering, representation, compression, and streaming. This article mainly introduces one of the key sub-topics of media computing: media content analysis.
Feature extraction is the process of converting an image, video and/or audio signal into a sequence of feature vectors that carry characteristics of the signal. These vectors serve as the basis for most media computing algorithms and tasks.
Typical visual features for video and images include color, texture, shape, color layout and spatial location. Visual features can be extracted either from the entire image or key-frame, or from regions. Global features are comparatively simpler, while region-level representations of images have been found to be more consistent with human perception. Video also has motion and other temporal features, such as motion intensity and variance, frame difference, edge difference, motion pattern and optical flow. Audio features are typically computed on a window basis. These window-based features can be considered short-time descriptions of the signal at a particular moment in time.
For more details about image-based features, please refer to [Lowe 2004, Smeulders 2000, Rui 1999, Yoshitaka 1999 and Smith 1996]; for audio features, please see [Gold 1999].
Color is one of the most widely used features in media computing. Colors are defined in a particular color space; widely used ones include RGB, LAB, LUV, HSV and YCrCb. Color is relatively robust to background complication and independent of image size and orientation. Color features can be extracted either from the entire image or from regions.
The color histogram is the most commonly used color feature for image and video retrieval. Statistically, it denotes the joint probability of the intensities of the three color channels. Besides the color histogram, several other color feature representations have been applied in image and video retrieval, including color moments and color sets. To overcome the quantization effects of the color histogram, Stricker and Orengo proposed using color moments. To facilitate fast search over large-scale image collections, Smith and Chang proposed color sets as an approximation to the color histogram [Smith 1996].
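As a rough illustration of the two descriptors just mentioned, the following sketch computes a joint RGB histogram and the first two color moments of each channel. It assumes pixels are given as (r, g, b) tuples with values in 0..255; the function names and the choice of four bins per channel are illustrative assumptions, not part of any cited method.

```python
from math import sqrt

def color_histogram(pixels, bins_per_channel=4):
    """Joint RGB histogram, normalized so that the bins sum to 1.

    pixels: iterable of (r, g, b) tuples with values in 0..255 (assumed).
    """
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    count = 0
    for r, g, b in pixels:
        # Quantize each channel into n bins and combine into one joint index.
        idx = ((r * n // 256) * n + (g * n // 256)) * n + (b * n // 256)
        hist[idx] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist

def color_moments(pixels):
    """Mean and standard deviation of each channel: [mR, sR, mG, sG, mB, sB]."""
    pixels = list(pixels)
    n = len(pixels)
    moments = []
    for c in range(3):
        values = [p[c] for p in pixels]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        moments.extend([mean, sqrt(var)])
    return moments
```

The histogram grows cubically with the number of bins per channel, whereas the moment vector stays at a fixed small size, which is one reason moments are attractive when quantization is a concern.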
Texture refers to the visual patterns that have properties of homogeneity that do not result from the presence of only a single color or intensity. It provides important structural information describing the content of many real-world images such as fruit skin, trees, clouds and fabric.
The texture features commonly used to describe visual information include spectral features, such as Gabor texture or wavelet texture, and statistical features that characterize texture according to local statistical measures, such as the six Tamura texture features and the Wold features. Among the six Tamura features, that is, coarseness, contrast, directionality, line-likeness, regularity and roughness, the first three are more significant; the other three are related to the first three and are less effective for texture description. Among the various texture features, Gabor texture and wavelet texture are widely used for image retrieval and have been reported to match human visual perception well. Texture can also be extracted either from the entire image or from regions.
Some content-based visual information retrieval applications require the shape representation of an object to be invariant to translation, rotation, and scaling, while others do not. Commonly used shape features include aspect ratio, circularity, Fourier descriptors, moment invariants, and consecutive boundary segments.
Besides color and texture, spatial location is also useful for region-based retrieval. For example, ‘sky’ and ‘sea’ could have similar color and texture features, but their spatial locations differ: sky usually appears at the top of an image, while sea appears at the bottom. There are two ways to define spatial location. The first is absolute spatial location, such as upper, bottom, top, and centroid, according to the location of the region in an image. The second is relative spatial relationship, such as the directional relationships between objects: left, right, above, and below.
Handling layout information
Although a global feature (color, texture, edge, etc.) is simple to calculate and can provide reasonable discriminating power in visual information retrieval, it tends to give too many false positives when the image collection is large. Many research results suggest that using layout (both features and spatial relations) is a better solution for image retrieval [Smith 1996]. To extend a global feature to a local one, a natural approach is to divide the whole image into sub-blocks and extract features from each of the sub-blocks. A variation of this approach is the quadtree-based layout approach, in which the entire image is split into a quadtree structure and each tree branch has its own feature describing its content.
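The sub-block idea can be sketched in a few lines. The example below assumes a grayscale image stored as a list of equal-length rows and uses the mean intensity as a stand-in for whatever per-block feature (histogram, moments, texture) an actual system would extract; the function name and grid size are hypothetical.

```python
def block_features(image, rows=2, cols=2):
    """Split a grayscale image into rows x cols sub-blocks and return the
    mean intensity of each block in row-major order.

    image: list of equal-length rows of numbers, at least rows x cols in size.
    """
    height, width = len(image), len(image[0])
    features = []
    for i in range(rows):
        y0, y1 = i * height // rows, (i + 1) * height // rows
        for j in range(cols):
            x0, x1 = j * width // cols, (j + 1) * width // cols
            values = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(values) / len(values))
    return features
```

Concatenating the per-block features in a fixed order preserves coarse spatial layout, so two images with the same global statistics but different arrangements yield different vectors.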
Audio analysis algorithms are typically based on features computed on a window basis. These window-based features can be considered short-time descriptions of the signal at a particular moment in time. A wide range of audio features exists for audio computing tasks. They can be divided into two categories: time-domain and frequency-domain features. The most typical audio features include Mel-frequency cepstral coefficients (MFCCs), zero-crossing rate and short-time energy [Gold 1999].
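Two of the time-domain features named above are simple enough to sketch directly. The code below assumes the signal is a list of numeric samples; the window and hop sizes are parameters chosen by the caller, and the helper names are illustrative.

```python
def frame_signal(samples, window, hop):
    """Yield consecutive windows of `window` samples, advancing by `hop`."""
    for start in range(0, len(samples) - window + 1, hop):
        yield samples[start:start + window]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)
```

Applying both functions to every frame produced by `frame_signal` gives one short feature vector per window, which is the sequence-of-vectors form that later analysis steps consume.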
Temporal features are calculated from a set of consecutive frames (or a period of time) and take two typical forms: scalar and vector. Scalar temporal features are generally statistical measures over a set of consecutive frames (for visual features) or a period of time (for audio features), such as a shot, a scene, or a 1-second window. A typical scalar temporal feature is the average of a feature over a set of frames (for visual features) or a set of time windows (for audio features), for example, the average color histogram of the frames in a shot, or the average onset rate in a scene. Other examples are the average motion intensity and the motion intensity variance within a shot.
Vector temporal features generally describe the temporal pattern or variation of a given video clip. Examples include the trajectory of a moving object, the curve of frame-based feature differences, camera motion speed, speaking rate, and onsets.
Video and audio are time series, and an image series can also be regarded as a time series. Media structuring generates a table of contents for a video or audio clip, a series of images, or a collection of video or audio clips, according to the content consistency or similarity of the media data [Zhang 1993].
Video segmentation, or video content structuring, generates the temporal structure of a video sequence. The basic unit of a video sequence is the shot. A shot is defined as an uninterrupted temporal segment in a video sequence, and shots often form the low-level syntactic building blocks of video content. A shot comprises a number of consecutive frames and is generally filmed with a single camera; its duration is variable.
Depending on the style of transition between two consecutive shots, shot boundaries are classified into two types: cuts and gradual transitions. The cut is the most frequently used transition in video. The common approach to shot detection is to evaluate the difference between consecutive frames as represented by a certain feature. A cut is typically detected by checking for abrupt peaks in the frame difference curve, where the frame difference can be computed directly on pixels or on frame-based features such as the color histogram and the edge map.
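A minimal sketch of cut detection from abrupt peaks follows. It assumes each frame has already been reduced to a normalized histogram and uses a fixed global threshold on the L1 distance; both the distance measure and the threshold value are illustrative assumptions, and practical systems often adapt the threshold to the content.

```python
def histogram_difference(h1, h2):
    """L1 distance between two normalized histograms (range 0..2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_cuts(frame_histograms, threshold=1.0):
    """Return indices i such that a cut is declared between frames i-1 and i."""
    cuts = []
    for i in range(1, len(frame_histograms)):
        d = histogram_difference(frame_histograms[i - 1], frame_histograms[i])
        if d > threshold:
            cuts.append(i)
    return cuts
```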
Gradual transitions are of various types, according to the method of combining adjacent frames of the consecutive shots. Typical gradual transitions include wipe, dissolve, and cross-fade. Because frame differences change gradually at such shot boundaries, the frame difference curve must be examined over a period of time to determine whether a gradual transition is present. A typical method is the twin-comparison approach, which adapts a difference metric to accommodate gradual transitions.
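The twin-comparison idea can be sketched with two thresholds: a high one that signals a cut directly, and a low one that marks the possible start of a gradual transition, after which differences are accumulated. The version below is a simplification under stated assumptions: it sums consecutive-frame differences rather than comparing each frame with the frame at the start of the candidate transition, and the threshold values are left to the caller.

```python
def twin_comparison(diffs, t_high, t_low):
    """Classify shot boundaries from consecutive-frame differences.

    diffs[i] is the difference between frame i and frame i + 1.
    Returns a list of ("cut", i) and ("gradual", start, end) tuples.
    """
    boundaries = []
    i = 0
    while i < len(diffs):
        if diffs[i] >= t_high:
            boundaries.append(("cut", i))
            i += 1
        elif diffs[i] >= t_low:
            start, accumulated = i, 0.0
            # Accumulate while the difference stays between the two thresholds.
            while i < len(diffs) and t_low <= diffs[i] < t_high:
                accumulated += diffs[i]
                i += 1
            # Only a large accumulated change counts as a gradual transition.
            if accumulated >= t_high:
                boundaries.append(("gradual", start, i - 1))
        else:
            i += 1
    return boundaries
```

A short burst of moderate differences that never accumulates to the high threshold is discarded, which is what keeps brief camera or object motion from being reported as a transition.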
Once shot boundaries are detected, the temporal structure of the video is further analyzed using two approaches. One approach divides shots into smaller segments, namely sub-shots: each shot can be divided into one or more consecutive sub-shots. The definition of a sub-shot depends on the application, but the most typical definition follows the camera motions within a shot. In that case, sub-shot segmentation is equivalent to camera motion detection, which means that one sub-shot corresponds to one unique camera motion. For example, if in a shot the camera pans from left to right, zooms in on a specific object, pans to the top, zooms out, and then stops, the shot consists of four sub-shots: one pan to the right, one zoom in, one pan to the top, and one zoom out.
The other approach is to merge shots into groups of shots, i.e., scenes. A scene is defined as a collection of one or more adjacent shots that focus on one topic. For example, a child playing in a backyard would be one scene, even though different camera angles might be shown. Four camera shots showing a dialog between two people may be one scene even if the primary object differs between these shots. One intuitive method of scene detection is to merge the most similar adjacent shots or scenes step by step into bigger ones. The similarity measure can be the same as in shot detection and use the same features. More sophisticated similarity measures take multiple shots (instead of only two) into consideration. For example, the similarity of any two consecutive shots (or, equivalently, the similarity measure at the “connection point” of these two shots) can be defined as the weighted sum of a series of histogram intersections of the shots on the two sides of the “connection point”. Scene segmentation is then equivalent to finding a set of “cut points” among this set of “connection points”.
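The intuitive step-by-step merging described above might look like the following sketch. It assumes each shot is summarized by one normalized histogram, measures similarity by histogram intersection, and stops when no adjacent pair is similar enough; the stopping threshold and the way merged histograms are averaged are illustrative choices rather than a prescribed method.

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def merge_shots_into_scenes(shot_histograms, min_similarity=0.5):
    """Greedily merge the most similar adjacent segments.

    Returns a list of scenes, each a list of consecutive shot indices.
    """
    scenes = [[i] for i in range(len(shot_histograms))]
    hists = [list(h) for h in shot_histograms]
    while len(scenes) > 1:
        sims = [histogram_intersection(hists[k], hists[k + 1])
                for k in range(len(scenes) - 1)]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < min_similarity:
            break
        # The merged histogram is the shot-count-weighted average of the two.
        n1, n2 = len(scenes[best]), len(scenes[best + 1])
        hists[best] = [(a * n1 + b * n2) / (n1 + n2)
                       for a, b in zip(hists[best], hists[best + 1])]
        scenes[best] += scenes[best + 1]
        del scenes[best + 1], hists[best + 1]
    return scenes
```

Because only adjacent segments are ever merged, the result always respects temporal order, which distinguishes scene detection from unconstrained clustering.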
For videos in specific domains, higher level semantic-specific segment detection is frequently applied, such as “play” detection in football videos, “news piece” detection in news videos, and “commercial block” detection in TV programs. Generally multimodal features, including visual, audio, and even textual information (closed captions or recognized speech), are employed in these tasks. The algorithms for this high-level segmentation are different from general scene detection, and often depend on the semantic definition of the “segment”.
Image organization arranges a set of images according to a certain structure. A typical approach is to cluster a set of images into groups according to their content consistency or similarity. There are basically two types: time-constrained grouping and time-free grouping. Time-constrained grouping is similar to video scene detection, but the time-stamp can be taken into account when grouping. Time-free grouping is essentially image clustering. Image-based features, such as color moments and the color histogram, are applied in image clustering and grouping.
For personal photo collections, a higher-level objective is to temporally segment the photos into episodes or meaningful events, and to sort both the events and the photos within an event chronologically. Automatic event or ontology detection for personal photos remains a challenging issue, and most works in the literature reduce it to partitioning the photos’ timestamps into contiguous segments that correspond to the underlying events. Typically, an event is defined as a group of photos captured in relatively close proximity in time.
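One simple way to partition timestamps into contiguous segments is to start a new event whenever the gap between consecutive photos exceeds a fixed limit. The sketch below assumes timestamps expressed in seconds and a one-hour limit; both are assumptions for illustration, and published systems typically use adaptive rather than fixed gaps.

```python
def segment_events(timestamps, max_gap_seconds=3600):
    """Group photo timestamps (in seconds) into chronologically ordered events.

    A new event starts whenever the gap to the previous photo exceeds
    max_gap_seconds.
    """
    events = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] <= max_gap_seconds:
            events[-1].append(t)
        else:
            events.append([t])
    return events
```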
Most photo grouping systems have focused either on time or on content only, or have used both but treated each independently. However, a digital photo is usually recorded together with multimodal metadata such as image content (perceptual features) and contextual information (time and camera settings). A more sophisticated solution to event clustering of personal photos is to automatically incorporate all these multimodal metadata into a unified framework, without being given any a priori knowledge [Mei 2006].
Audio segmentation, in general, is the task of segmenting an audio clip into acoustically homogeneous intervals, where the rule of homogeneity depends on the application. Typical audio segmentation tasks include beat detection, speech/music segmentation, speaker change detection and speaker clustering [Lu 2005].
Because video is a time series, it is often difficult for viewers to grasp the main content of a video in a short period of time or at a glance. A video summary offers a concise representation of the original video by showing the most representative synopsis of a given video sequence. Generally speaking, there are two fundamental types of video summarization: static video abstract and dynamic video skimming. A static abstract, also known as a static storyboard, is a collection of salient images or key-frames extracted from the original video sequence. A dynamic skim consists of a collection of associated audio-video sub-clips selected from the original sequence, with a much shortened total length [Ma 2005].
The static video abstract depends heavily on key-frame extraction algorithms. A well-known approach to extracting multiple key-frames from shots is based on frame content changes computed from features such as the color histogram or motion activity. These methods require a pre-defined threshold or a desired number of key-frames to control the density of key-frames in a shot. Some shot-independent approaches have also been proposed. For example, key-frames can be extracted from an entire video program by using a time-constrained clustering algorithm. Other, more sophisticated methods include the integration of motion and spatial activity analysis with skin-color and face detection technologies, progressive multi-resolution key-frame extraction techniques, and object-based approaches.
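The threshold-controlled approach can be sketched as follows: the first frame of a shot is taken as a key-frame, and a later frame is added whenever its content differs enough from the most recent key-frame. The sketch assumes frames are represented by normalized histograms compared with the L1 distance; the threshold value is an illustrative assumption.

```python
def extract_key_frames(frame_histograms, threshold=0.5):
    """Return indices of key-frames within one shot.

    The first frame is always a key-frame; a later frame becomes one when its
    L1 histogram distance to the most recent key-frame exceeds threshold.
    """
    if not frame_histograms:
        return []
    key_frames = [0]
    for i in range(1, len(frame_histograms)):
        last = frame_histograms[key_frames[-1]]
        distance = sum(abs(a - b) for a, b in zip(last, frame_histograms[i]))
        if distance > threshold:
            key_frames.append(i)
    return key_frames
```

Lowering the threshold increases the density of key-frames in a shot, which is the control knob referred to in the text.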
A static video summary, although effective in presenting the visual content of a video, cannot preserve the time-evolving, dynamic nature of video content. Moreover, the audio track, which is an important content channel of video, is lost. A skimming sequence, on the other hand, can give users a more impressive preview of the entire video. Many publications have argued that dynamic video skimming is an indispensable tool for video browsing. One of the most straightforward approaches is to compress the original video by speeding up the playback. However, the abstraction factor of this approach is limited, because the playback speed must stay low enough to keep the speech comprehensible. The InforMedia system generates a short synopsis of a video by integrating audio, video and textual information. By combining language understanding techniques with visual feature analysis, this system gives reasonable results. However, satisfactory results may not be achievable with such a text-driven approach when speech signals are noisy, which is often the case in real-life video recordings. Another approach to generating semantically meaningful summaries is the event-oriented abstraction scheme.
In summary, the video summarization problem has been studied along two directions. One is to combine multi-modal features, such as image, audio and text, which can be used in specific applications such as movie trailers and news highlights. The other is to attempt to find a generic solution by resorting to formal mathematical methods. Despite numerous efforts in generating static or dynamic video summaries, the results are still far from satisfactory. Approaches based on direct sampling or low-level features are often inconsistent with human perception. Semantic-oriented methods, in general, fall far short of human expectations, because the semantic understanding of video content is beyond current technologies. In addition, textual information may not always be available to drive summarization, while systems that totally neglect the audio track are not able to generate impressive results. Finally, algorithms involving a large number of summarization rules or overly intensive computation are impractical for many applications.
With rapid advances in storage devices, networks, and compression techniques, large-scale visual data (image or video) is becoming available to more and more average users. How to efficiently access these data according to their content has become an urgent research topic in recent years. To deal with this issue, it has been a common theme to develop visual analysis techniques that aim at understanding the data at syntactic and semantic levels. Based on the analysis results, tools and systems for retrieval, summarization and manipulation of these data can be easily created [Liu 2007, Mei 2007, Li 2007, Qi 2007, Hauptmann 2007].
The main task of media understanding is to extract semantics from the visually and/or aurally perceptible signals of the media data. For example, detecting whether a video frame contains a human being is a kind of understanding. Other examples include detecting whether there is a shot at the basket in a basketball video, determining whether a video clip is about a soccer game, detecting whether there is a car in an image, and determining the mood of a music clip. This semantic information is also called “high-level” features for describing, indexing and searching media content.
Three typical tasks in media understanding are object detection, scene classification, and event detection. These answer the 3W questions for a visual sample (an image or a video clip), i.e., who (objects), where (scene), and what (event), as illustrated in Figure 1. It is worth noting that these three techniques are not isolated. For example, scene classification and event detection can benefit from object detection results. Establishing a unified scheme that addresses these three issues simultaneously may be a promising solution.
The two main issues in object detection are the representation of visual samples and the learning approach. For representation, local features extracted from small patches, such as SIFT, have been widely applied. They have shown good generalization abilities; for example, they remain effective even when objects are occluded. The learning methods applied to object detection can be mainly categorized into generative and discriminative approaches. Generative methods model the joint probability distribution of features and labels and then derive posterior probabilities according to Bayes’ rule, whereas discriminative methods directly infer the posterior probabilities. Both generative and discriminative learning have their advantages and disadvantages, so combining the two approaches has been reported to be more promising for tackling this problem.
There are two approaches to scene classification: one is to detect objects first and then classify visual samples according to the detection results; the other, which is more widely applied, is to perform scene classification directly on low-level features. In the second approach, the task is formulated as a standard classification problem, and many learning methods can be applied, such as the Support Vector Machine (SVM) and k-NN. Both local and global features can be applied, including the color histogram, color moments, and the edge distribution histogram.
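As a small illustration of the second approach, the sketch below classifies a low-level feature vector with k-NN. The feature vectors, the scene labels in the usage note, and the use of squared Euclidean distance are assumptions made for the example.

```python
from collections import Counter

def knn_classify(query, training_set, k=3):
    """Return the majority label among the k training vectors nearest to query.

    training_set: list of (feature_vector, label) pairs.
    Distance is squared Euclidean, which preserves the Euclidean ranking.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(training_set, key=lambda item: sq_dist(query, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

For instance, with two-dimensional feature vectors labeled "beach" and "forest", a query close to the "beach" samples is assigned that label by a majority vote of its three nearest neighbors.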
Unlike object detection and scene classification, which can be regarded as learning from 2D information sources, event detection from video clips involves 3D sources, since an event is described by a series of continuous frames. There are also several studies on detecting events from an individual image, but event detection is mainly performed on videos. Thus, the exploration of temporal information plays a crucial role in event detection. Many methods regard events as stochastic temporal processes in low-level feature space and mine them with existing statistical models, such as the Hidden Markov Model (HMM), the Hierarchical Hidden Markov Model (HHMM), and Conditional Random Fields (CRF). There are also several methods that try to detect events based on the results of object and scene detection.
Audio understanding classifies audio signals into different audio types. Like many other pattern classification tasks, audio classification is made up of two main steps: feature extraction and classification.
Besides speech/music classification and speaker identification (which can also be regarded as audio segmentation), other typical audio understanding tasks include audio effect classification, repetitive pattern discovery, music mood/style detection, etc [Lu 2005].
Media search is the task of finding desired media content in media databases or on the Internet. According to the media type, media search is categorized into video search, image search and audio search. According to the type of query used, it can be classified into text-based, example-based, concept-based and multimodal search.
With rapid advances in storage devices, networks and compression techniques, large-scale multimedia data has become available to average users. How to index and search multimedia data according to its real content is a challenging problem that has been studied for decades. Since the 1990s, Content-Based Multimedia Retrieval (CBMR) has become an increasingly active field. It is defined as searching for relevant multimedia data (images or video/audio segments or clips) with issued queries, which can be examples, keywords, phrases, sentences, or any combination of them [Smeulders 2000, Yoshitaka 1999, Rui 1999, Smith 1996].
Compared with text retrieval, CBMR is a more challenging task, as it needs an understanding of the content of multimedia data. It mainly involves two basic problems. One is how to represent queries and multimedia content. The other is how to map between the representations of queries and multimedia content.
There has already been extensive work on CBMR, and different paradigms and techniques have been proposed, such as Query-By-Example (QBE), Annotation-Based Retrieval (ABR), and multi-modality retrieval. All of them aim at addressing the above two issues. Table 1 illustrates several techniques and their corresponding solutions to these two issues. QBE was the most typical scenario for multimedia retrieval before 2000, while query by text (ABR) and query by a combination of text and examples have become two new mainstream scenarios since then [Hauptmann 2007, Qi 2007].
Queries and multimedia content can be described by low-level features, high-level features, or both. QBE adopts low-level features to retrieve desired data, whereas ABR uses high-level features. In QBE, multimedia data are indexed by low-level features, and users provide examples (such as images or video clips) to retrieve similar results. However, in many cases average users would prefer to describe what they want with text rather than by providing examples. ABR aims to address this issue: multimedia data is annotated with a lexicon of semantic concepts (i.e., high-level features), queries are mapped to these concepts, and multimedia data can thus be retrieved using text-based retrieval techniques. Another advantage of the ABR-based approach is that existing text-based indexing and searching technologies can be leveraged to index and search multimedia content.
Search based on text transcripts (from automatic speech recognition and machine translation) is an important component of any video search system. This is because automatic speech recognition (ASR) output, while not fully accurate, is reliable enough to be largely indicative of the topic of a video. Scene and story boundaries are usually detected automatically and used to expand the ASR text associated with each shot; this boundary information has been shown to improve search performance significantly.
For videos on the Internet, the surrounding text, file name, metadata in the video files, user-supplied tags and comments also provide informative text for indexing and searching image, video and audio data.
Whether transcripts or surrounding text are used, the techniques actually applied for media search in this case are those of textual document indexing and searching.
As the amount of available multimedia data has steadily increased, users need to be able to access and manage such enormous multimodal corpora efficiently and effectively. Thus, content-based retrieval (CBR), which analyzes the actual contents of the multimedia and helps users access large-scale video data, has been an increasingly active research area since the 1990s.
In typical CBR systems (Figure 2), various features of the media in the database are extracted to form a feature database. These include colors, texture, shapes, spatial layout or any other information about the media itself, and they are usually described by multi-dimensional feature vectors. When a multimodal query, such as text, images, or videos, is input to the retrieval system, features of the same types as those stored for the media in the database must be extracted from it. Finally, the similarity or distance between the query features and the features in the database is calculated, and retrieval is performed with the help of an indexing structure built on the feature database. A good indexing scheme can greatly speed up retrieval.
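The final matching step can be sketched as a ranking by distance between the query vector and each stored vector. The example below is a simplification under stated assumptions: it scans the whole feature database exhaustively instead of using an indexing structure, it uses Euclidean distance, and the item identifiers are hypothetical.

```python
from math import sqrt

def query_by_example(query_features, feature_database, top_k=3):
    """Return the ids of the top_k items closest to the query.

    feature_database: dict mapping item id -> feature vector of the same
    type and length as query_features.
    """
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(feature_database,
                    key=lambda item: distance(query_features, feature_database[item]))
    return ranked[:top_k]
```

The cost of this scan grows linearly with the size of the database, which is why an indexing structure matters for large collections.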
However, the retrieval accuracy of today’s CBR algorithms is frequently limited. The reason is that human perception of media content is subjective, semantic, and task-dependent, and there still exists a gap between low-level features and the semantic content of the media. Hence, if retrieval is based on low-level features alone, it is difficult to obtain perceptually and semantically meaningful results.
To address this problem, relevance feedback (RF) techniques are often used in interactive retrieval systems. The main idea is that users provide positive or negative (relevant or not relevant) feedback on the retrieval results, indicating the content they are interested in, so that the system can refine the retrieval results based on the feedback and present new results to the user. The whole process can be repeated several times until the user is satisfied with the final results.
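One classical way to turn such feedback into a refined query is the Rocchio update from text retrieval, shown here only as an illustration; the article does not name a specific RF algorithm, and the weights alpha, beta and gamma below are conventional example values rather than recommended settings. The sketch moves the query vector toward the centroid of the items marked relevant and away from the centroid of those marked not relevant.

```python
def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a refined query vector from user feedback (Rocchio-style).

    query: feature vector of the current query.
    relevant / non_relevant: lists of feature vectors the user marked.
    """
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    pos, neg = centroid(relevant), centroid(non_relevant)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]
```

Each feedback round calls the update once and re-runs the search with the new query vector, matching the iterative loop described above.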
Concept-based media search
Early multimedia search, especially image search, can be traced back to the 1970s. Since the 1990s, multimedia search has witnessed a strong renaissance, especially in classical content-based retrieval. There are three paradigms on the methodological spectrum of content-based multimedia search. At one extreme is the purely manual labeling paradigm, in which multimedia content, e.g., images and video clips, is labeled manually with text labels or concepts, and text retrieval techniques are then used to search the multimedia content indirectly. At the other extreme is the automatic content-based search paradigm, which can be fully automatic by using the low-level features obtained from multimedia analysis. Query-by-example is the most typical method in this paradigm. However, difficulties arise in both paradigms. The manual method requires a large amount of human labor, and the manual labels suffer from the subjectivity of human perception of multimedia content. The fully automatic method, on the other hand, is subject to the well-known “semantic gap” between low-level features and high-level semantic concepts.
In the past few years, the promising paradigm of concept-based multimedia search has been introduced into many practical search systems. Compared with the two extreme paradigms on the methodological spectrum, this third, concept-based paradigm strikes a better balance in the middle, and it is largely automated as well. It is not purely automatic, since some content must be labeled at the beginning as the training set. It is not purely manual either, because once a concept detector has been trained on the labeled training set, the detector can automatically annotate new images and video clips with the same concept.
In a general framework for this paradigm, a set of pre-labeled training samples is first used to learn models of a set of semantic concepts or keywords. These models are trained on features extracted from the training samples, such as color moments, color histograms, color correlograms, wavelet textures and region features (e.g., SIFT descriptors and shape-based features). The resulting models can then be used to predict the keywords or concepts of any unlabeled images or video clips. Accordingly, the trained models are referred to as “classifiers”. With these predicted keywords or concepts, text-based retrieval techniques can be adopted to search the multimedia collections.
The key component of the concept-based paradigm is the classifiers that are used to annotate images and video clips with keywords or semantic concepts. Several well-defined classifiers have been successfully adopted in content-based retrieval systems. These classifiers fall into two main categories. The first is generative models. They explicitly assume that the multimedia data is generated by some predefined distribution, such as a Hidden Markov Model or a Gaussian Mixture Model. These models define a joint probability distribution P(x, y) over the observed low-level features x and their corresponding labels y. In the prediction step, a conditional distribution P(y|x) is formed to predict the keywords of images or video clips. In contrast to the generative model, the other category of classifiers is the discriminative model. It directly models the dependence of the labels y on the low-level features x, i.e., the conditional distribution P(y|x). Examples of discriminative models used in multimedia retrieval include the Support Vector Machine (SVM), Boosting, and Conditional Random Fields (CRF). Both generative and discriminative models are called “supervised models” because they are constructed on a pre-labeled training set.
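To make the generative route concrete, the sketch below fits a joint model P(x, y) = P(y) P(x|y) with one Gaussian per label over a single scalar feature, and then forms P(y|x) by Bayes' rule at prediction time. The one-dimensional Gaussian assumption, the function names and the labels used in the usage note are illustrative simplifications, not the models used in any cited system.

```python
from math import exp, pi, sqrt

def fit_gaussian_classes(samples):
    """Fit P(y) and a Gaussian P(x | y) per label from (x, y) pairs.

    Returns {label: (prior, mean, variance)}, which together define P(x, y).
    """
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    model = {}
    for y, xs in by_label.items():
        mean = sum(xs) / len(xs)
        # Fall back to a small variance if all samples of a label coincide.
        var = sum((x - mean) ** 2 for x in xs) / len(xs) or 1e-6
        model[y] = (len(xs) / len(samples), mean, var)
    return model

def posterior(model, x):
    """P(y | x) derived from the joint P(x, y) by Bayes' rule."""
    joint = {y: prior * exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)
             for y, (prior, mean, var) in model.items()}
    total = sum(joint.values())
    return {y: p / total for y, p in joint.items()}
```

A discriminative model would skip `fit_gaussian_classes` entirely and learn a function that outputs P(y|x) directly, without ever describing how x is distributed.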
In recent years, another type of so-called “semi-supervised model” is also applied into concept-based retrieval. Different from supervised models which only involve training samples in the modeling step, the semi-supervised model takes into account the unlabeled samples as well. By leveraging the distribution information revealed by a large amount of unlabeled samples, the semi-supervised model can tackle the problem of insufficiency of training samples and hence the prediction accuracy can be improved. In multimedia retrieval community, existing semi-supervised models include manifold models, co-training models and so on.
Extracted semantic concepts can be used to directly index multimedia content using text-based technologies, but typically they are integrated with other textual information, such as recognized speech, closed captions, and surrounding text, or even integrated into a multi-modality multimedia retrieval system which also uses QBE techniques. Researchers proved that when the number of semantic concepts is relatively large, even the annotation accuracy is low, semantic concepts are still able to significantly improve the accuracy of the search results.
The most substantial work in this field is presented in the TREC Video Retrieval Evaluation (TRECVID), organized by the National Institute of Standards and Technology (NIST). It aims to promote progress in content-based video retrieval through open, metrics-based evaluation on common video datasets and a standard set of queries. The queries consist of text, optionally accompanied by example images and example videos.
A typical multimodal video search system consists of several main components: query analysis, uni-modal search, re-ranking, and multimodal fusion. A generic video search framework is illustrated in Figure 3. After query analysis, the components of the multimodal query (i.e., text, key-frames and shots) are passed to the individual search models, such as the text-based, visual example-based and concept-based models. A fusion and re-ranking model then aggregates the search results.
Video retrieval systems usually gain the most from multimodal fusion and re-ranking that leverage the above three uni-modal search models. In most multimodal fusion systems for video search, different fusion models are constructed for different query classes, with the involvement of human knowledge. However, some query classification methods are designed for a particular video collection and may not be appropriate for other collections. How to fuse these uni-modal search models remains a challenging and meaningful research topic.
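Query-class-dependent fusion can be sketched as a weighted late fusion of normalised uni-modal scores. The example below is an assumption-laden illustration, not the fusion model of any particular system: the query classes, the weight table `FUSION_WEIGHTS`, the min-max normalisation, and the sample scores are all hypothetical, and in practice the weights would be set by experts or learned from training queries.

```python
def min_max_normalize(scores):
    """Rescale one modality's scores to [0, 1] so that modalities are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}

# Hypothetical per-query-class weights for the three uni-modal search models.
FUSION_WEIGHTS = {
    "named_person": {"text": 0.7, "visual": 0.1, "concept": 0.2},
    "general": {"text": 0.4, "visual": 0.3, "concept": 0.3},
}

def fuse(results, query_class):
    """Combine uni-modal result lists into one ranking by weighted score sum."""
    weights = FUSION_WEIGHTS.get(query_class, FUSION_WEIGHTS["general"])
    fused = {}
    for modality, scores in results.items():
        for shot, s in min_max_normalize(scores).items():
            fused[shot] = fused.get(shot, 0.0) + weights[modality] * s
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical raw scores returned by each uni-modal search model.
results = {
    "text": {"a": 2.0, "b": 1.0, "c": 0.0},
    "visual": {"a": 0.1, "b": 0.9, "c": 0.5},
    "concept": {"a": 0.2, "b": 0.4, "c": 0.9},
}

person_ranking = fuse(results, "named_person")
general_ranking = fuse(results, "general")
```

The same uni-modal results yield different final rankings depending on the query class, which is why a query classifier tuned to one collection may transfer poorly to another.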
- Gold, B., Morgan, N. (1999) Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley and Sons.
- Hauptmann, A. G., Yan, R., Lin, W.-H. (2007) How many high-level concepts will fill the semantic gap in news video retrieval? International Conference on Image and Video Retrieval (CIVR).
- Li, L.-J., Fei-Fei, L. (2007) What, where and who? Classifying event by scene and object recognition. In Proceedings of International Conf. on Computer Vision.
- Liu, Y., Zhang, D., Lu, G., Ma, W.-Y. (2007) A survey of content-based image retrieval with high-level semantics, Pattern Recognition, 40, 262-282.
- Lowe, D. G. (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2).
- Lu, L., Cai, R., Hanjalic, A. (2005) Towards A Unified Framework for Content-based Audio Analysis. Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP05), Vol II, 1069-1072, Philadelphia, PA, USA, March 19-23.
- Ma, Y., Hua, X.-S., Lu, L., Zhang, H.-J. (2005) User Attention Model based Video Summarization. IEEE Transactions on Multimedia.
- Mei, T., Hua, X.-S., et al. (2007) MSRA-USTC-SJTU at TRECVID 2007: high-level feature extraction and search. In NIST TRECVID Workshop.
- Mei, T. et al. (2006) Probabilistic Multimodality Fusion for Event Based Home Photo Clustering. In Proceedings of IEEE International Conference on Multimedia & Expo (ICME), 1757-1760, Toronto, Canada.
- Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., Mei, T., Zhang, H.-J. (2007) Correlative Multi-Label Video Annotation. ACM International Conference on Multimedia, Augsburg, Germany.
- Rui, Y., Huang, T.S., Chang, S.-F. (1999) Image Retrieval: Current Techniques, Promising Directions, and Open Issues, Journal of Visual Communication and Image Representation 10, 39–62.
- Smeulders, Arnold W. M., Worring, M., Santini, S., Gupta, A., Jain, R. (2000) Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. on Pattern Analysis and Machine Intelligence. 22(12), 1349-1380.
- Smith, J., Chang, S.-F. (1996) Tools and Techniques for Color Image Retrieval. Proc. SPIE Storage and Retrieval for Still Image and Video Databases IV(2670), 426–437.
- Ulusoy, I., Bishop, C. M. (2005) Generative versus discriminative methods for object recognition. In Proceedings of International Conf. on Computer Vision and Pattern Recognition.
- Yoshitaka, A., Ichikawa, T. (1999) A Survey on Content-Based Retrieval for Multimedia Databases. IEEE Trans. Knowl. Data Eng. 11(1): 81-93.
- Zhang, H.-J., Kankanhalli, A., Smoliar, S.W. (1993) Automatic Partitioning of Full-Motion Video. Multimedia Systems, 2(6), pp 256-266.
- Zhou, X.-S., Huang, T.-S. (2002) Relevance feedback in image retrieval: a comprehensive review. ACM Multimedia Systems Journal.
- Mark Aronoff (2007) Language. Scholarpedia, 2(5):3175.
- F. Gregory Ashby and Daniel M. Ennis (2007) Similarity measures. Scholarpedia, 2(12):4116.
- Jeffay, K., and Zhang, H.J. (2001) Readings in Multimedia Computing and Networking. Morgan Kaufmann. 1st edition. ISBN-10: 1558606513, ISBN-13: 978-1558606517.
- Shah, M. (Series Editor) (2001-2005) The International Series in Video Computing. Springer. ISSN: 1571-5205.