Video Content Structuring

From Scholarpedia
Meng Wang and Hong-Jiang Zhang (2009), Scholarpedia, 4(8):9431. doi:10.4249/scholarpedia.9431 revision #91922

Curator: Hong-Jiang Zhang

Video content structuring (also called video structuring, video structure analysis or video segmentation) is defined as the process of hierarchically decomposing videos into units and building the relationships among them. Direct access to a video without indexing is not easy due to its length and unstructured format. On the other hand, analogous to text documents, which can be decomposed into chapters, paragraphs, sentences and words, videos can be segmented into units such as scenes, shots and keyframes. By building the structure of a video, a table-of-contents can be generated, and access to and manipulation of the video data are thus facilitated.


Basic Description

With the advances in storage devices, networks and compression techniques, more and more video data has become available to ordinary users. However, efficient access to an unstructured video is a challenging task due to its huge number of frames. Video structuring has been proposed to tackle this difficulty. For example, videos can be partitioned into shots, and then one or more keyframes can be selected from each shot for indexing. Since 2001, video shot detection has been established as an evaluation task in the TREC Video Retrieval Evaluation (TRECVID) benchmark organized by the National Institute of Standards and Technology (NIST). Shots can be grouped into scenes, such that the video content can be structured at a higher level. For several video genres, such as news and movies, the “story” is also widely applied as the unit for content organization. Shots can also be further segmented into subshots to facilitate finer manipulations. From each shot or subshot, one or more keyframes can be extracted to represent its content. Therefore, as illustrated in Fig. 1, a video can generally be structured in a hierarchical form as “videos -> stories -> scenes -> shots -> subshots -> keyframes”. These terms are defined as follows:

  • Shot: a shot is an uninterrupted clip recorded by a single camera. It is a physical entity which often forms the building block of video content.
  • Scene: a scene is defined as a collection of semantically related and temporally adjacent shots, depicting and conveying a high-level concept. A scene usually comprises a series of consecutive shots that are recorded in the same location.
  • Story: a story refers to a clip that captures a continuous action or a series of events, and it may be composed of several scenes and shots. Note that story lines are usually clear only for rigidly structured videos. Currently, most story identification methods are developed for news videos; therefore, only the “news story” is considered here. TRECVID defines a news story as “a segment of a news broadcast with a coherent news focus which contains at least two independent, declarative clauses”.
  • Subshot: a subshot is a segment within a shot that corresponds to a unique camera motion. A shot can be divided into one or more consecutive subshots according to the movement of the camera.
  • Keyframe: a keyframe is the frame which best represents the content of a shot or a subshot. According to variation of the content, one or more keyframes can be extracted from each shot or subshot. Keyframes can be used as the entries of the video data for manipulations, such as indexing and browsing.

Therefore, video content structuring may involve the following five techniques: shot detection, scene grouping, story identification, subshot segmentation, and keyframe extraction.

Figure 1: Hierarchical decomposition and representation of video content

Shot Detection

Shot detection is the process of identifying the boundaries between consecutive shots, such that the frame sequence can be grouped into a set of shots. Zhang et al. (1993) proposed the first scheme that partitions videos into shots and then selects the first frame of each shot as the keyframe for indexing. According to the transition style between consecutive shots, shot boundaries can be mainly categorized into two types, i.e., cuts and gradual transitions. A cut indicates that the change between the two shots is abrupt, whereas a gradual transition means that there is a gradual special effect in the transition between the two shots. Many different shot detection methods have been proposed, including methods for detecting and categorizing gradual transitions. The most straightforward approach is to measure the change between every two consecutive frames and declare a shot boundary whenever the change is significant.

Figure 2: A schematic illustration of shot detection

As illustrated in Fig. 2, a typical shot detection scheme consists of the following three components: frame representation, difference estimation, and boundary/non-boundary discrimination. First, each frame is represented by numerical features. Many different representation methods have been investigated, including pixel values, color histograms, edges, and intensities. A comparative study of these methods can be found in (Gargi et al., 2000). Based on the representation, the difference between every two consecutive frames is estimated. Taking color histogram-based representation as an example, the widely-applied difference estimation methods include:

(1) Bin-to-bin difference

\[\tag{1} D_{k,k+1} = \sum_i \left| H_k(i) - H_{k+1}(i) \right| \]

(2) Normalized bin-to-bin difference

\[\tag{2} D_{k,k+1} = \sum_i \frac{| H_k(i) - H_{k+1}(i) |}{\max\{H_k(i), H_{k+1}(i)\}} \]

(3) Histogram intersection

\[\tag{3} D_{k,k+1} = 1 - \sum_i \min\{H_k(i), H_{k+1}(i)\} \]

(4) \(\chi^2 \)-test difference

\[\tag{4} D_{k,k+1} = \sum_i \frac{\left( H_k(i) - H_{k+1}(i) \right)^2}{H_{k+1}(i)} \]

In the above equations, \(H_k(i)\) denotes the \(i\)-th bin of the histogram of the \(k\)-th frame, and \(D_{k,k+1}\) indicates the difference between the \(k\)-th and \((k+1)\)-th frames.
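The four difference measures can be sketched directly from Eqs. (1)-(4). In the sketch below, each frame is assumed to be already represented as a normalized color histogram (a plain list of bin values); the function names are illustrative, not from the literature.

```python
def bin_to_bin(h1, h2):
    """Eq. (1): sum of absolute bin-to-bin differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def normalized_bin_to_bin(h1, h2):
    """Eq. (2): each bin difference scaled by the larger of the two bins."""
    return sum(abs(a - b) / max(a, b) for a, b in zip(h1, h2) if max(a, b) > 0)

def histogram_intersection(h1, h2):
    """Eq. (3): one minus the overlap of the two histograms."""
    return 1 - sum(min(a, b) for a, b in zip(h1, h2))

def chi_square(h1, h2):
    """Eq. (4): chi-square-style difference; zero bins of h2 are skipped."""
    return sum((a - b) ** 2 / b for a, b in zip(h1, h2) if b > 0)

# Toy 4-bin histograms of two consecutive frames
hk, hk1 = [0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]
print(bin_to_bin(hk, hk1))            # large difference suggests a boundary
print(histogram_intersection(hk, hk1))
```

All four measures are zero for identical histograms and grow with dissimilarity, so any of them can feed the thresholding step described next.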

Shot boundaries are then detected based on the estimated frame differences. The simplest way is to set a global or locally adaptive threshold based on certain rules; differences above the threshold are then declared shot boundaries. However, establishing thresholding rules that can deal with videos of varying genres and content is a problem. More recently, several efforts regard the discrimination of shot boundaries and non-boundaries as a classification problem and apply machine learning algorithms (Hanjalic, 2002; Yuan et al., 2007). In comparison with rule-based methods, learning-based methods can be made more robust by building the classification rules from training data (i.e., video sequences in which shot boundaries are labeled).
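A locally adaptive threshold of the kind described above can be sketched as follows: a cut is declared where a frame difference is both a local maximum and several times larger than the mean difference in its neighborhood. The window size and ratio are illustrative choices, not values from the literature.

```python
def detect_cuts(D, window=5, ratio=3.0):
    """D[k] is the difference between frames k and k+1 (e.g., Eqs. 1-4).
    Returns the indices k at which a cut is declared."""
    cuts = []
    for k in range(len(D)):
        lo, hi = max(0, k - window), min(len(D), k + window + 1)
        neighborhood = D[lo:k] + D[k + 1:hi]   # differences around k, excluding k
        if not neighborhood:
            continue
        local_mean = sum(neighborhood) / len(neighborhood)
        # local maximum AND well above the local average
        if D[k] >= max(D[lo:hi]) and D[k] > ratio * local_mean:
            cuts.append(k)                     # cut between frames k and k+1
    return cuts

diffs = [0.1, 0.12, 0.09, 1.5, 0.11, 0.1, 0.08]  # a spike at index 3
print(detect_cuts(diffs))  # [3]
```

Gradual transitions spread the change over many frames, so they defeat this single-spike rule; that is one motivation for the learning-based detectors cited above.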

Nowadays, most researchers regard shot detection as a mature technology, and it has already become an elementary function in many video editing software products. The reports in TRECVID show that both the precision and recall of cut detection can be above 0.95. The detection of gradual transitions is more difficult, but precision and recall can still be above 0.8 (Over et al., 2007). In fact, the shot detection task was removed from TRECVID in 2008, since the performance of state-of-the-art methods is considered to meet the needs of most practical applications. A more comprehensive survey of video shot detection can be found in (Over et al., 2007; Yuan et al., 2007).

Scene Grouping

Scene grouping is usually implemented based upon the results of shot detection. From a global point of view, scene grouping can be simply formulated as a shot clustering task. With an appropriate pairwise similarity definition, many different clustering algorithms can be applied to accomplish this task. Intuitively, two criteria should be considered in a scene grouping algorithm, namely, content similarity and temporal continuity. Content similarity means that the shots within the same scene should contain similar content, whereas the temporal constraint indicates that these shots should be temporally close. The content similarity is usually estimated in a low-level feature space, and many different features can be employed, such as visual, audio and text features (the text features can be extracted by speech recognition, optical character recognition, etc.).
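The two criteria above can be combined into a single pairwise shot similarity. The sketch below uses histogram intersection of the shots' keyframe histograms for content similarity and an exponential decay over the temporal gap for continuity; the decay constant is an illustrative assumption, not a value from any cited method.

```python
import math

def shot_similarity(hist_i, hist_j, time_i, time_j, tau=30.0):
    """Pairwise similarity of two shots for scene grouping.
    hist_*: normalized keyframe histograms; time_*: shot timestamps in seconds."""
    content = sum(min(a, b) for a, b in zip(hist_i, hist_j))  # in [0, 1]
    temporal = math.exp(-abs(time_i - time_j) / tau)          # decays with the gap
    return content * temporal

# Two visually identical shots, 10 s apart versus 300 s apart
h = [0.25, 0.25, 0.25, 0.25]
near = shot_similarity(h, h, 0.0, 10.0)
far = shot_similarity(h, h, 0.0, 300.0)
print(near > far)  # True: the temporal constraint keeps distant shots apart
```

Any of the clustering algorithms discussed below can consume such a similarity; the decay ensures that visually similar but temporally distant shots (e.g., recurring studio sets) are not merged into one scene.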

The existing scene grouping methods can be categorized into three approaches: merging-based, splitting-based, and model-based. The merging-based approach groups shots in a bottom-up way. Kender and Yeo (1998) proposed a method to group shots into scenes based on their coherence. Analogously, Rui et al. (1999) implemented scene grouping via a two-step process: (1) assign the shots into groups according to their visual similarities and temporal continuities; and (2) merge similar groups into a unified scene. Rasheed and Shah (2005) proposed another two-step scene grouping process: (1) cluster shots according to Backward Shot Coherence (BSC); and (2) merge the clusters into scenes based on an analysis of the shot length and the motion content in the potential scenes. Hanjalic et al. (1999) proposed a method that is able to accomplish scene grouping on MPEG compressed videos without complete decoding. The method needs only a single pass through the video and is particularly efficient in computation.

As opposed to the merging-based approach, splitting-based scene grouping methods work in a top-down style, i.e., a video is recursively split into a set of coherent scenes. In (Yeung, 1998) and (Ngo, 2003), two different graph-based splitting methods are proposed. In both methods, a video is represented by a graph in which the vertices represent the shots and the edges are determined by the similarities of the shots and their temporal proximity. The graph is then partitioned into several subgraphs, each of which can be regarded as a scene. In (Yeung, 1998), the graph is called a Scene Transition Graph (STG) and is partitioned into subgraphs with the complete-link method, whereas in (Ngo, 2003) the graph is called a Temporal Graph and is partitioned by the normalized cuts method.

Different from the above two approaches, model-based methods group shots into scenes with statistical models. Tan and Lu (2002) implemented scene grouping using Gaussian Mixture Models (GMMs), with each Gaussian component indicating the distribution of a scene; the number of scenes can be determined by the Bayesian Information Criterion (BIC). Gu et al. (2007) proposed an energy minimization scheme in which the time and content constraints of scene grouping are encoded as energy terms. It is able to take both global and local constraints into consideration, and the number of scenes can be established by the Minimum Description Length (MDL) principle. Recently, Wilson and Divakaran (2009) proposed a supervised learning approach to scene grouping. It classifies shot boundaries into scene boundaries and non-scene-boundaries by learning models from labeled training data; in this way the discrimination rules can be made more robust and can deal with videos with varying content.

Since scene grouping is a subjective task, systematically evaluating and comparing different algorithms is itself a problem. Vendrig and Worring (2002) conducted such a study: they proposed a method to evaluate the performance of different scene grouping algorithms and investigated their dependence on the shot detection results.

Story Identification

Story identification requires a deeper semantic understanding of video content, and it is usually applied only to certain rigidly structured video genres such as news video. In TRECVID 2003 and 2004, news story identification was established as an evaluation task. The existing story identification methods can mainly be classified into two categories, i.e., rule-based and learning-based approaches. Rule-based story identification methods usually exploit certain domain knowledge. For example, it has been observed that many news stories begin with an anchorperson shot and end with the start of another anchorperson shot, as illustrated in Fig. 3. In (Zhang et al., 1995), the pattern of news stories is analyzed and story identification is accomplished by detecting shots of certain types, such as anchorperson shots. However, this approach depends heavily on the adopted knowledge and can hardly handle diverse video sources with different features and production rules. The learning-based approach is able to tackle this difficulty. In typical learning-based story identification methods, a set of story boundary candidates is first established (such as shot boundaries and audio pauses), and then each candidate is labeled as a story boundary or not based on a model learned from a training set (Chua et al., 2004). More details about these story identification methods can be found in (Chua et al., 2004; Zhang et al., 1995) and references therein.
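The anchorperson rule described above can be sketched in a few lines: given per-shot labels (with anchorperson shots detected by some upstream classifier, which is assumed here), each story starts at an anchorperson shot and runs until the next one. The label names are illustrative.

```python
def stories_from_shots(labels):
    """labels: one label per shot, 'anchor' or 'report' (assumed detected upstream).
    Returns stories as inclusive (first_shot, last_shot) index ranges."""
    starts = [i for i, lab in enumerate(labels) if lab == "anchor"]
    bounds = starts + [len(labels)]            # each story ends where the next begins
    return [(bounds[i], bounds[i + 1] - 1) for i in range(len(starts))]

# A toy broadcast: anchor intro, two report shots, anchor intro, one report shot
labels = ["anchor", "report", "report", "anchor", "report"]
print(stories_from_shots(labels))  # [(0, 2), (3, 4)]
```

This captures why the rule-based approach is brittle: it silently drops any material before the first anchorperson shot and fails entirely on channels with different production conventions, which is the motivation for the learning-based alternative.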

Figure 3: A typical news story in a video. The keyframes of the video have been illustrated, and the story boundaries can be identified according to the anchorperson analysis.

Subshot Segmentation

A subshot is a segment within a shot. Generally, it is defined to contain a unique camera motion; therefore, subshot segmentation can be accomplished through camera motion detection. For example, consider a shot in which the camera first zooms out, then pans from left to right, zooms in to a specific object, and stops. This shot can be divided into three subshots: one zoom out, one pan to the right, and one zoom in. The camera motion between two adjacent frames can be estimated with a two-dimensional affine model, in which the motion vector \((v_x, v_y)\) at pixel \((x, y)\) can be expressed as

\[\tag{5} \left( \begin{array}{c} v_x \\ v_y \end{array} \right) = \left( \begin{array}{c} a_1 \\ a_4 \end{array} \right) + \left( \begin{array}{cc} a_2 & a_3 \\ a_5 & a_6 \end{array} \right) \left( \begin{array}{c} x \\ y \end{array} \right) \]

where \(a_i (i = 1, 2, \ldots, 6)\) denote the motion parameters. The motion parameters can be represented by a more meaningful set of terms as follows

\[\tag{6} \displaystyle\left\{ \displaystyle\begin{array}{ll} pan = a_1 \\ tilt = a_4 \\ zoom = \displaystyle\frac{a_2+a_6}{2} \\ rot = \displaystyle\frac{a_5-a_3}{2} \\ hyp = \displaystyle\frac{|a_2-a_6|+|a_3+a_5|}{2} \end{array} \right.\]

where pan corresponds to the pan movement of the camera, tilt corresponds to tilt and boom, zoom corresponds to dolly and focus change, rot corresponds to roll, and hyp indicates that object motion is predominant. For more details on video motion analysis, please refer to (Kim et al., 2000).

Based on the analysis of camera motion, subshot segmentation can be implemented, and each subshot is categorized into one of the following six classes: pan, tilt, zoom, rot, object motion, and static. For an individual frame, the motion category can be determined by thresholding the related terms in Eq. (6). However, it is observed that a camera movement is generally maintained for a period of, say, at least half a second (Kim et al., 2000). Thus, a typical subshot segmentation process consists of three steps, i.e., frame-level motion detection, segment-level motion detection, and post-processing. More details can be found in (Kim et al., 2000). Figure 4 illustrates an example in which the shot is segmented into seven subshots.
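The frame-level step can be sketched directly from Eqs. (5)-(6): derive the interpretable motion terms from the six affine parameters, then label the frame by thresholding the dominant term, using the six classes named above. The threshold value is illustrative, not one from (Kim et al., 2000).

```python
def motion_terms(a):
    """Eq. (6): interpretable motion terms from affine parameters a = (a1..a6)."""
    a1, a2, a3, a4, a5, a6 = a
    return {
        "pan":  a1,
        "tilt": a4,
        "zoom": (a2 + a6) / 2,
        "rot":  (a5 - a3) / 2,
        "hyp":  (abs(a2 - a6) + abs(a3 + a5)) / 2,
    }

def classify_frame(a, threshold=0.1):
    """Frame-level motion label: one of the six classes in the text."""
    terms = motion_terms(a)
    if terms["hyp"] > threshold:          # object motion is predominant
        return "object motion"
    name = max(("pan", "tilt", "zoom", "rot"), key=lambda t: abs(terms[t]))
    return name if abs(terms[name]) > threshold else "static"

# Pure zoom: a2 = a6 = 0.3, all other parameters 0
print(classify_frame((0.0, 0.3, 0.0, 0.0, 0.0, 0.3)))  # zoom
```

The segment-level step and post-processing would then merge consecutive frames with the same label and discard segments shorter than the half-second minimum mentioned above.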

Figure 4: An illustrative example of subshot segmentation

Keyframe Extraction

Keyframes are the frames in a video sequence that best preserve the content of its shots or subshots. They can be used as the entry points of the video for access. The most widely-applied keyframe extraction methods fall into two categories, i.e., analysis-based and clustering-based. Analysis-based methods extract keyframes by analyzing video content, such as the quality and attractiveness of frames. For example, Ma et al. (2002) adopted an attention model, and the frames that attract the most user attention are extracted as keyframes. In the clustering-based approach, a clustering process is carried out and the cluster centroids are then selected as keyframes. Hanjalic and Zhang (1999) adopted a partitioning-based clustering method in which the number of clusters (i.e., the number of keyframes) is determined by a cluster-validity approach.
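A minimal sketch of the clustering-based approach: frames (represented as normalized histograms) are greedily assigned to clusters by a similarity threshold, and the member closest to each cluster centroid becomes a keyframe. This simple one-pass scheme stands in for the cluster-validity method cited above; the similarity threshold is an illustrative assumption.

```python
def extract_keyframes(frames, threshold=0.8):
    """frames: list of normalized histograms. Returns one keyframe index per cluster."""
    sim = lambda a, b: sum(min(x, y) for x, y in zip(a, b))  # histogram intersection

    def centroid(cluster):
        n = len(cluster)
        return [sum(frames[j][d] for j in cluster) / n
                for d in range(len(frames[0]))]

    clusters = []                                # each cluster: list of frame indices
    for i, f in enumerate(frames):
        best = max(clusters, key=lambda c: sim(f, centroid(c)), default=None)
        if best is not None and sim(f, centroid(best)) >= threshold:
            best.append(i)                       # similar enough: join the cluster
        else:
            clusters.append([i])                 # otherwise start a new cluster
    # keyframe = the member closest to its cluster's centroid
    return [max(c, key=lambda j: sim(frames[j], centroid(c))) for c in clusters]

# Two visually distinct groups of frames yield two keyframes
frames = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.85]]
print(len(extract_keyframes(frames)))  # 2
```

Raising the threshold produces more clusters and hence more keyframes; the cluster-validity analysis in (Hanjalic and Zhang, 1999) chooses this granularity automatically instead of fixing it by hand.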

A recent development in keyframe extraction is to formulate it as a learning task (Kang et al., 2005). It is observed that the representativeness of a video frame involves several elements, such as its image quality, user attention, and visual details. Thus, frame representativeness can be modeled from a training set with these elements as features. This learning process is able to simulate human perception in keyframe extraction. The method in (Kang et al., 2005) builds a model based on four elements extracted from each frame (frame quality, visual details, content dominance, and attention measurement), and encouraging results have been reported in both subjective and objective evaluations.


Applications

Video structuring is often regarded as the elementary step in video content analysis, i.e., it is a necessary process in many different applications (Xiong et al., 2005). Here are several examples:

(1) Video summarization. This refers to the process of creating a set of images or a shorter video clip that helps viewers quickly get an abstract view of the original video (Agnihotri et al., 2004; Divakaran et al., 2004; Shao et al., 2006; Sundaram et al., 2002). One popular approach is to first parse videos into shots and extract keyframes; the summarization can then be accomplished by presenting the keyframes on a board in a user-friendly way (Yeung and Yeo, 1997; Mei et al., 2009).

(2) Video search. In many cases, users want to find segments, instead of whole videos, that contain specific persons, objects, events, locations, etc. Therefore, the videos need to be segmented first. Typically, the shot is adopted as the basic unit for search, such as in the evaluation task in TRECVID. But for several video genres such as movies and news videos, scenes and stories can be better units, since they convey more complete and coherent information.

(3) Video annotation. This refers to the process of manually or automatically assigning descriptive keywords to video data in order to facilitate other applications, such as video browsing, search and advertising. Obviously, annotating each frame in a video introduces redundant effort, as the frames in a shot are usually visually and semantically close to each other. Therefore, the shot is often adopted as the unit for annotation. For several video genres, such as home videos, in which long shots with significantly varying content are frequently used, the subshot is also widely adopted.


References

  • Agnihotri, L., Dimitrova, N., Kender, J. R., and Zimmerman, J. (2004). Design and evaluation of a music video summarization system. Proc. International Conference on Multimedia & Expo.
  • Chua, T.-S., Chang, S.-F., Chaisorn, L., and Hsu, W. (2004). Story boundary detection in large broadcast news video archives – techniques, experiences and trends. Proc. ACM Multimedia.
  • Divakaran, A., Peker, K. A., Chang, S.-F., Radhakrishnan, R., and Xie, L. (2004). Video mining: pattern discovery versus pattern recognition. Proc. International Conference on Image Processing.
  • Gargi, U., Kasturi, R., and Strayer, S. H. (2000). Performance characterization of video-shot-change detection methods. IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 1, pp. 1-13.
  • Gu, Z., Mei, T., Hua, X.-S., Wu, X., and Li, S. (2007). EMS: energy minimization based video scene segmentation. Proc. International Conference on Multimedia & Expo.
  • Hanjalic, A. (2002). Shot-boundary detection: unraveled and resolved? IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 2.
  • Hanjalic, A., Lagendijk, R. L., and Biemond, J. (1999). Automated segmentation of movies into logical story units. IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 4, pp. 580-588.
  • Hanjalic, A. and Zhang, H.-J. (1999). An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis. IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 8, pp. 1280-1289.
  • Kang, H.-W. and Hua, X.-S. (2005). To learn representativeness of video frames. Proc. ACM Multimedia.
  • Kender, J. R. and Yeo, B.-L. (1998). Video scene segmentation via continuous video coherence. Proc. International Conference on Computer Vision and Pattern Recognition.
  • Kim, J.-G., Chang, H. S., Kim, J., and Kim, H. M. (2000). Efficient camera motion characterization for MPEG video indexing. Proc. International Conference on Multimedia & Expo.
  • Ma, Y.-F., Hua, X.-S., Lu, L., and Zhang, H.-J. (2005). A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia, vol. 7, no. 5.
  • Mei, T., Yang, B., Yang, S.-Q., and Hua, X.-S. (2009). Video collage: presenting a video sequence using a single image. The Visual Computer, vol. 25, pp. 39-51.
  • Ngo, C.-W., Ma, Y.-F., and Zhang, H.-J. (2003). Automatic video summarization by graph modeling. Proc. International Conference on Computer Vision.
  • Over, P., Kraaij, W., and Smeaton, A. F. (2007). TRECVID 2007 – Overview. Proc. of TRECVID.
  • Rasheed, Z. and Shah, M. (2005). Scene detection in Hollywood movies and TV shows. Proc. International Conference on Computer Vision and Pattern Recognition.
  • Rui, Y., Huang, T. S., and Mehrotra, S. (1999). Constructing table-of-content for video. Multimedia Systems, vol. 7, pp. 359-368.
  • Shao, X., Xu, C., Maddage, N. C., Tian, Q., Kankanhalli, M. S., and Jin, J. S. (2006). Automatic summarization of music videos. ACM Transactions on Multimedia Computing, Communications and Applications, vol. 2, no. 2, pp. 127-148.
  • Sundaram, H., Xie, L., and Chang, S.-F. (2002). A utility framework for the automatic generation of audio-visual skims. Proc. ACM Multimedia.
  • Tan, Y.-P. and Lu, H. (2002). Model-based clustering and analysis of video scenes. Proc. International Conference on Image Processing.
  • Vendrig, J. and Worring, M. (2002). Systematic evaluation of logical story unit segmentation. IEEE Transactions on Multimedia, vol. 4, no. 4, pp. 492-499.
  • Wilson, K. W. and Divakaran, A. (2009). Broadcast video content segmentation by supervised learning. In Multimedia Content Analysis, ISBN: 978-0-387-76569-3, pp. 1-17.
  • Xiong, Z., Radhakrishnan, R., Divakaran, A., Rui, Y., and Huang, T. S. (2005). A unified framework for video summarization, browsing & retrieval. Academic Press.
  • Yeung, M. M. and Yeo, B.-L. (1997). Video visualization for compact presentation and fast browsing of pictorial content. IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 5, pp. 771-785.
  • Yeung, M., Yeo, B.-L., and Liu, B. (1998). Segmentation of videos by clustering and graph analysis. Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94-109.
  • Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F., and Zhang, B. (2007). A formal study of shot boundary detection. IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, pp. 168-186.
  • Zhang, H.-J., Kankanhalli, A., and Smoliar, S. W. (1993). Automatic partitioning of full-motion video. Multimedia Systems, vol. 1, pp. 10-28.
  • Zhang, H.-J., Tan, S. Y., Smoliar, S. W., and Gong, Y. (1995). Automatic parsing and indexing of news video. Multimedia Systems, vol. 2, no. 6, pp. 256-265.

Recommended Reading

  • Feng, D., Siu, W. C., and Zhang, H.-J. (2003). Multimedia information retrieval and management: technological fundamentals and applications. Springer.