Multimedia Question Answering
With the proliferation of text and multimedia information, users are now able to find answers to almost any question on the Web. Meanwhile, they are also bewildered by the huge amount of information routinely presented to them. Question-answering (QA) is a natural direction to address this information overload problem. The aim of QA is to return precise answers to users’ questions. Text-based QA research has been carried out for the past 15 years with good success, especially for answering fact-based questions. The aim of this article is to extend text-based QA research to multimedia QA to tackle a range of factoid, definition and “how-to” QA in a common framework. The system will be designed to find multimedia answers from web-scale media resources such as Flickr (http://www.flickr.com/) and YouTube (http://www.youtube.com/).
From Text QA to Multimedia QA
The amount of information on the Web has been growing at an exponential rate. An overwhelming amount of increasingly multimedia content is now available on almost any topic. When looking for information on the Web, users are often bewildered by the vast quantity of information returned by search engines such as Google or Yahoo. Users often have to painstakingly browse through long ranked lists of results to look for the correct answers. Hence question-answering (QA) research has evolved in an attempt to tackle this information-overload problem. Instead of returning a ranked list of documents as is done in current search engines, QA aims to leverage deep linguistic analysis and domain knowledge to return precise answers to users’ natural language questions (Prager, 2006).
Research on text-based QA has gained popularity following the introduction of QA in the Text Retrieval Conference (TREC: http://trec.nist.gov/) evaluations in the late 1990s. There are many types of QA, depending on the type of questions and the expected answers. They include: factoid, list and definitional QA, and more recently, the “how-to”, “why”, “opinion” and “analysis” QA. Typical QA architecture includes stages of: question analysis, document retrieval, answer extraction, and answer selection and composition (Prager, 2006). In factoid and list QA, such as “What is the most populous country in Africa?” and “List the rice-producing countries”, the system is expected to return one or more precise country names as the answers (Yang et al., 2003a). On the other hand, for definitional QA, such as “What is X?” or “Who is X?”, the system is required to return a set of answer sentences that best describe the question topic (Cui et al., 2007). In a way, definition QA is equivalent to query-oriented summarization, in which the aim is to provide a good summary to describe a topic. These three types of QA have attracted a lot of research in the last 10 years (Prager, 2006). They provide fact-based answers, often with the help of resources such as the Wikipedia (http://www.wikipedia.org/) and WordNet (Fellbaum, 1998). In fact, factoid QA has achieved good performance and commercial search engines have been developed, such as the Powerset (http://www.powerset.com/) that aims to return mainly factoid answers from Wikipedia.
More recently, attention has been shifted to other types of QA such as the “how-to”, “why” and “opinion” type questions. These are harder questions as the results require the analysis, synthesis and aggregation of answer candidates from multiple sources. To facilitate the answering of “how-to” questions, some recent research efforts focus on leveraging the large question-answer archives available in community QA sites such as the Yahoo!Answers (http://answers.yahoo.com/) to provide the desired answers. Essentially, the system tries to find equivalent questions with readily available answers in Yahoo!Answers site, turning the difficult “how-to” QA into a simpler similar question matching problem (Wang et al., 2009).
Given that a vast amount of information on the Web is now in non-textual media, it is natural to extend text-based QA research to multimedia QA. There are several reasons why multimedia QA is important. First, although most media contents are indexed with text metadata, much of this metadata, such as that available on YouTube, is noisy and incomplete. As a result, much multimedia content is not retrievable unless advanced media content analysis techniques are developed to uncover the contents. Second, many questions are better explained in, or with the help of, non-textual media. For example, in providing textual answers to a definition question such as “What is a thumb drive?”, it is better to also show images or videos of what thumb drives look like. Third, media contents, especially videos, are now used to convey many types of information, as is evident in sites such as YouTube and other specialized video/image sharing sites and blogs. Thus many types of questions now have readily available answers in the form of video. This is especially so for the difficult “analysis” and “how-to” type questions. Answering such questions is hard even for text because further analysis and composition of answers are often needed. Given the vast array of readily available answers, it is now possible to find videos for “how-to” questions such as “How to transfer photos from my camera to the computer?”. From the user’s point of view, it would be much clearer and more instructive to show a video detailing the entire transfer process, rather than a text description of the steps involved.
Multimedia QA can thus be considered as a complement to text QA in the whole question-answering paradigm, in which the best answers may be a combination of text and other media answers. Essentially, multimedia QA includes image, video and audio QA. They all aim to return precise images, video clips, or audio fragments as answers to users’ questions. In fact, the factoid QA problem of finding precise video contents at the shot level has partially been addressed by TRECVID (http://trecvid.nist.org/), a large-scale public video evaluation exercise organized yearly in conjunction with TREC. This is done in the form of automated and interactive (shot) video retrieval, where the aim is to find a ranked list of shots that visually contain the desired query target, such as finding shots of “George Bush”. An early system specifically designed to address the multimedia factoid QA is presented in (Yang et al., 2003b) for news video. It follows a similar architecture as text-based QA, with video content analysis being performed at various stages of QA pipeline to obtain precise video answers. It also includes a simple video summarization process to provide the contextual aspects of the answers. Other than factoid QA, as far as we know, no research in the equivalent of multimedia definition and “how-to” QA has been attempted.
This article discusses the extension of text-based QA research to tackle a range of factoid, definition and “how-to” multimedia QA in a common framework. The system will be designed to find multimedia answers from Web-based media resources such as Flickr and YouTube. This article describes the architecture and the recent research on various types of multimedia QA for a range of applications. It focuses only on visual media such as image and video.
Overview of the Paradigm
The aim of question-answering is to present precise information to users instead of a ranked list of results as is done in current search engines. Most text-based QA systems follow the pipeline shown in Figure 1, comprising question analysis, document retrieval, answer extraction, and answer selection (Prager, 2006). Question analysis analyzes a question to extract the question context in the form of a list of keywords, and identifies the answer type and target in order to formulate the question strategy. Document retrieval searches for relevant documents or passages in a given corpus. Answer extraction then extracts a list of answer candidates from the retrieved documents by selecting sentences that cover the expanded query terms and contain the expected answer target. Finally, answer selection aims to pinpoint the correct answer(s) among the extracted candidates. For definition QA, the last step also involves composing the selected answer sentences into a coherent summary. Step 2, document retrieval, aims to achieve high recall, while the last two steps aim to identify precise and correct answers, and often involve deep linguistic analysis together with domain knowledge.
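The four stages described above can be illustrated with a deliberately simplified sketch. It is not the pipeline of any cited system: the function names, the stopword list, the keyword-overlap scoring and the three-document corpus are all assumptions made for illustration, and real systems replace each stage with much deeper linguistic analysis.

```python
import re

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"what", "is", "the", "a", "an", "of", "in", "who", "which", "to", "how"}


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def analyze_question(question):
    """Step 1, question analysis: extract keywords and a coarse answer type."""
    tokens = tokenize(question)
    keywords = [t for t in tokens if t not in STOPWORDS]
    answer_type = "PERSON" if tokens and tokens[0] == "who" else "OTHER"
    return keywords, answer_type


def retrieve_documents(keywords, corpus, top_k=2):
    """Step 2, document retrieval: recall-oriented ranking by keyword overlap."""
    scored = []
    for doc in corpus:
        overlap = len(set(tokenize(doc)) & set(keywords))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def extract_candidates(keywords, documents):
    """Step 3, answer extraction: keep sentences that cover query terms."""
    candidates = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            coverage = len(set(tokenize(sentence)) & set(keywords))
            if coverage:
                candidates.append((coverage, sentence.strip()))
    return candidates


def select_answer(candidates):
    """Step 4, answer selection: return the candidate with the best coverage."""
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


corpus = [
    "Nigeria is the most populous country in Africa. Its largest city is Lagos.",
    "Rice is produced in many countries in Asia.",
    "Egypt is a country in Africa.",
]
keywords, answer_type = analyze_question("What is the most populous country in Africa?")
documents = retrieve_documents(keywords, corpus)
answer = select_answer(extract_candidates(keywords, documents))
print(answer)
```

Retrieval here is intentionally permissive (any overlapping document is kept), while selection is strict, mirroring the recall-then-precision division of labour between Step 2 and the last two steps.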
Multimedia QA uses a retrieval pipeline similar to that of text-based QA, as depicted in Figure 1. For the case of video, we can draw an analogy between text and video by equating word to video frame, sentence to shot, paragraph to scene, and document to video sequence. The first step, query analysis, aims to expand the query by identifying the context of the text search terms and inferring the key high-level visual concepts in the query. In addition, multimedia query expansion (Zha et al., 2009) may be performed to identify visual examples relevant to the query. This gives rise to an expanded multimedia query. In Step 2, the expanded query is matched against the metadata of stored video sequences as well as their keyframes to retrieve the relevant ones. For the case of YouTube video, the metadata includes user-assigned tags and categories; in addition, it can include automatically generated high-level visual concepts (Chang et al., 2006) from a pre-defined concept set, as well as relevant visual examples. In Step 3, a combination of multi-modal analysis involving visual content, high-level visual features and metadata text is normally performed to identify good shot candidates. This step is similar to the processing done in most shot retrieval algorithms used in TRECVID video retrieval evaluations (Snoek and Worring, 2009). In Step 4, a re-ranking algorithm involving additional content features or domain knowledge is applied to produce the final ranking of shots. For the case of definition QA, further processing is performed to identify the set of key shots and to sequence these shots into a coherent summary that meets both the semantic and temporal constraints.
In the next two Sections, the issues and challenges faced in multimedia definition and “how-to” QA research are presented.
Definition QA
Definitional QA was first introduced to the Text REtrieval Conference (TREC) QA Track main task in 2001 (Voorhees, 2001). Questions like “what is X” and “where did X happen”, which correspond to event/entity definitions, account for a large number of queries submitted to Web search engines. To answer such questions, many search engines such as Google and Yahoo tend to rely on existing online definition resources, such as Google Definitions, Wikipedia, Biography.com, s9.com, etc., depending on the type of query. However, the definitions of entities often change over time, with many new entities being introduced daily. Therefore, automatic definitional QA systems are needed that can extract, from multiple documents, definition sentences containing descriptive information about the target entity and summarize these sentences into definitions.
A good overview of text-based definition QA can be found in (Cui et al., 2007). It differs from factoid QA in Step 4 in the way that it selects the relevant sentences that depict the diverse key aspects of the target entity. Most approaches select sentences to meet criteria of good information coverage, diversity, as well as exhibiting good definition patterns. Kor and Chua (2007) added human interest model to the selection criterion by preferring those sentences that are of greater interests to the users, and showed good performance on TREC QA dataset. For semi-structured text such as Wikipedia, Ye et al. (2009) incorporated knowledge of structure by leveraging on links between concepts and info-box to extract good definition summary for Wikipedia pages. Most techniques employed in text definition QA are applicable to community-contributed videos such as those available in YouTube.
In recent years, we have witnessed the exponential growth of community contributed social media on Flickr and YouTube, etc, where users collaboratively create, evaluate and distribute vast quantity of media contents. Take YouTube as an example, which is one of the primary video sharing sites. Studies have shown that it serves 100 million distinct videos and 65,000 uploads daily; and traffic of this site accounts for over 20% of the web in total and 10% of the whole internet, covering 60% of the videos watched on-line (Cha et al., 2007). The prevalence of Web 2.0 activities and contents has inspired intensive research to exploit the freely available metadata in multimedia content analysis. In the scenario of definitional video QA, it is important to exploit both visual and textual metadata information for selecting good video shots and generating high quality video summaries.
However, given an event-type query, the videos retrieved from current popular video sharing sites tend to be diverse and somewhat noisy. For example, in the list retrieved for “September 11 attacks” from YouTube, we see not only relevant video excerpts from TV news, but also re-assembled excerpts of news video clips produced by general users, as well as many irrelevant ones that are extensions to the event, such as interviews with politicians. To navigate this mass of information, we need to be able to identify shots representing key sub-events while removing auxiliary shots, similar to the text-based approach of identifying key relevant sentences while removing irrelevant ones. Fortunately, for the case of YouTube, most videos retrieved tend to share many video shots depicting key sub-events. In fact, recent studies (Wu et al., 2007) on video sharing sites have shown that a significant proportion, over 25%, of the videos in the search results are duplicates. The content redundancy in web videos is categorized into two classes: near duplicate and partial overlap. The former indicates that most of the frames from the two videos are duplicates, and the latter indicates that the video pair shares some near duplicate frames. The case of partial overlap can be viewed from a different perspective: it demonstrates the importance of the repeated video clips. We can exploit such content overlap in Web video sharing systems to automatically answer definitional questions about events (Hong et al., 2009). For a given event or entity, the few scenes that convey the main messages, such as the principal sub-events or key shots, tend to be re-used in multiple news reports and copied in many other self-assembled videos. Thus we can identify such shots by performing near duplicate detection on the set of retrieved videos.
Key shots identified through near duplicate detection meet the criteria of being salient and popular, as they are re-used in multiple video sources. Further textual analysis of the meta-text may also yield information on key sub-events. Together, they form the basis of video definition QA for event and entity queries. Based on the above analysis, a video definition QA system was designed and implemented (Hong et al., 2009). It gives users a quick overview for a definition query by leveraging the content overlap in the Web video search results. The framework consists of four main stages, namely, web video and metadata acquisition; visual processing of key shots (key shot linking, ranking and threading); semantic analysis of key shots (tag filtering and key shot tagging); and summary generation. Figure 2 shows the flowchart of the video definition QA system.
In Stage 1, given an event query, a ranked list of videos and the corresponding metadata for each video (tags, descriptions and titles) are retrieved from YouTube.
In Stage 2, key shot processing, the system first performs shot segmentation and extracts the keyframes from the shots. For each keyframe, it then extracts local point features, such as scale-invariant feature transform (SIFT) features (Lowe, 2004), for matching. To reduce the computation cost, local point features are mapped to a fixed dimension and some keyframes are filtered by offline quantization of keypoint descriptors. The keyframe pairs with a similarity value above a given threshold are retained as near duplicate keyframes, and their corresponding shots are defined as key shots. The key shots are then ranked according to their informative score, which is defined as a linear combination of relevance and normalized significance. After that, the shots are chronologically threaded. Here, the chronological order within the original videos is utilized, and the threading is formulated as the minimization of the time lag between the key shot pairs in different original videos.
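The thresholding and ranking steps of Stage 2 can be sketched as follows. The source states only that the informative score is a linear combination of relevance and normalized significance; the weight `alpha`, the use of the near-duplicate link count as "significance", and all shot identifiers and similarity values below are assumptions made for illustration.

```python
def count_near_duplicate_links(similarity, threshold):
    """Keep keyframe pairs above the threshold and count links per shot.

    `similarity` maps (shot_a, shot_b) pairs to a similarity value.
    Shots with at least one retained pair are treated as key shots.
    """
    links = {}
    for (shot_a, shot_b), value in similarity.items():
        if value > threshold:
            links[shot_a] = links.get(shot_a, 0) + 1
            links[shot_b] = links.get(shot_b, 0) + 1
    return links


def rank_key_shots(links, relevance, alpha=0.5):
    """Rank key shots by alpha * relevance + (1 - alpha) * normalized significance.

    Significance is approximated here by the link count divided by the
    largest link count (an assumption, not the cited paper's definition).
    """
    if not links:
        return []
    max_links = max(links.values())
    scored = []
    for shot, count in links.items():
        significance = count / max_links
        score = alpha * relevance.get(shot, 0.0) + (1 - alpha) * significance
        scored.append((shot, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Hypothetical pairwise keyframe similarities between shots of three videos.
similarity = {
    ("v1_s3", "v2_s1"): 0.92,
    ("v1_s3", "v3_s4"): 0.88,
    ("v2_s5", "v3_s2"): 0.81,
    ("v1_s1", "v2_s2"): 0.40,  # below threshold, so not a near duplicate
}
# Hypothetical relevance values, e.g. derived from the search rank of each video.
relevance = {"v1_s3": 0.9, "v2_s1": 0.8, "v3_s4": 0.5, "v2_s5": 0.6, "v3_s2": 0.3}

links = count_near_duplicate_links(similarity, threshold=0.8)
ranked = rank_key_shots(links, relevance, alpha=0.5)
for shot, score in ranked:
    print(shot, round(score, 2))
```

In this toy input the shot `v1_s3` ranks first because it is both highly relevant and duplicated in two other videos, which is the intuition behind using redundancy as a signal of importance.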
In Stage 3, the system performs tag analysis using the visual similarity-based graph. Given the set of tags associated with each video, the metrics of representativeness and descriptiveness are defined to measure the ability of the tags to represent and describe the event or sub-events. These two metrics are used to remove the noisy and less informative tags with respect to the query. A random walk (Hsu et al., 2007) is then performed on the visual similarity-based graph to spread the tags to other key shots connected by near duplicate keyframe links.
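To make the tag-spreading idea concrete, the following is a minimal sketch of a random walk with restart over a small shot graph. It is a generic formulation, not the exact procedure of Hsu et al. (2007) or Hong et al. (2009); the damping factor `alpha`, the iteration count, the graph weights and the shot names are all hypothetical.

```python
def propagate_tag(weights, initial, alpha=0.8, iterations=50):
    """Spread one tag's score over a shot graph by a random walk with restart.

    `weights[s][t]` is the visual similarity between shots s and t, and
    `initial[s]` is 1.0 if shot s originally carries the tag, else 0.0.
    Update rule: score = alpha * (row-normalized weights @ score)
                         + (1 - alpha) * initial.
    """
    shots = list(initial)
    transition = {}
    for s in shots:
        row = weights.get(s, {})
        total = sum(row.values())
        # Isolated shots get an empty row and receive no propagated score.
        transition[s] = {t: w / total for t, w in row.items()} if total > 0 else {}

    scores = dict(initial)
    for _ in range(iterations):
        scores = {
            s: alpha * sum(p * scores[t] for t, p in transition[s].items())
            + (1 - alpha) * initial[s]
            for s in shots
        }
    return scores


# Hypothetical graph: A-B and B-C are near-duplicate links, D is isolated.
weights = {
    "A": {"B": 0.9},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"B": 0.8},
}
initial = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}  # only shot A has the tag

scores = propagate_tag(weights, initial)
for shot in sorted(scores, key=scores.get, reverse=True):
    print(shot, round(scores[shot], 3))
```

The tag score decays with graph distance from the originally tagged shot, and a shot with no near-duplicate links receives nothing, so tags only spread along visual evidence.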
Stage 4 performs the summarization in the form of video skim by sequencing the selected key shots using a greedy algorithm. The tag descriptions are embedded into the key shots to help users better comprehend the events. Both the duration of the video skims and the number of frames are flexible according to users’ requirements. Figure 3 shows some examples of the generated video summaries or skims by using the above approach (Hong et al. 2009).
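One way to picture the greedy sequencing in Stage 4 is the sketch below. The source does not specify the greedy criterion, so this version simply assumes that shots are picked in descending score order until a user-given duration budget is filled and are then played in their threaded order; the field names and values are hypothetical.

```python
def build_skim(shots, max_duration):
    """Greedily select key shots under a duration budget, then order them.

    Each shot is a dict with an `id`, an informative `score`, a `duration`
    in seconds and an `order` (its position in the chronological thread).
    """
    chosen = []
    used = 0.0
    for shot in sorted(shots, key=lambda s: s["score"], reverse=True):
        if used + shot["duration"] <= max_duration:
            chosen.append(shot)
            used += shot["duration"]
    # Present the selected shots in chronological (threaded) order.
    chosen.sort(key=lambda s: s["order"])
    return [s["id"] for s in chosen]


shots = [
    {"id": "a", "score": 0.9, "duration": 6.0, "order": 2},
    {"id": "b", "score": 0.7, "duration": 5.0, "order": 0},
    {"id": "c", "score": 0.6, "duration": 8.0, "order": 1},
    {"id": "d", "score": 0.4, "duration": 3.0, "order": 3},
]
skim = build_skim(shots, max_duration=15.0)
print(skim)
```

With a 15-second budget, shot `c` is skipped because it would overflow the budget, while the shorter, lower-scored shot `d` still fits. Changing `max_duration` gives skims of different lengths, which corresponds to the flexibility in duration mentioned above.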
Beyond definition QA, the next set of challenges in question-answering is to handle “how-to” and “why” type questions. An example of a “how-to” question is “How to transfer pictures in my digital camera to computer?”. The ability to answer such questions requires understanding of the relevant contents, and often involves the generation of specific answers. This is beyond the capability of current technologies unless it is for a very narrow domain. Because of the strong demand for such services, community-based QA services, such as Yahoo!Answers, have become very popular. Through Yahoo!Answers, people ask a wide variety of “how-to” questions and obtain answers either by searching for similar questions on their own or by waiting for other users to provide the answers. As large archives of question-answer pairs are built up through user collaboration, knowledge is accumulated and is ready for sharing. To facilitate sharing of such knowledge, one emerging research trend in text-based QA is to develop techniques to automatically find similar questions in Yahoo!Answers archives that have ready answers for the users (Wang et al., 2009).
However, even when the best text-based answer is presented to the users, say, for the “picture transfer” question, the user may still have difficulty grasping the answers. This is because from the textual answers, the users may still have no idea on how to deal with the USB cable, from such answer as "... connect your digital camera though USB cable ...". However, if we can present visual answers such as videos, it will be more direct and intuitive for the users to follow. Overall, in addition to normal textual references or instructions, visual references or instructions such as videos should be an ideal complementary source of information for the users to follow.
Some commercial websites like ehow (http://www.ehow.com/videos.html) do provide “how-to” videos. They do so by recruiting general users to produce problem-solving videos, so that other users can easily search or browse them. However, the coverage of topics on this Web site is limited, as only carefully selected videos by certain photographers are published (Chen et al., 2007). On the other hand, community video sharing sites, such as YouTube and Yahoo Video, contain huge collections of videos contributed by users. Many videos on such sites do provide “how-to” instructions on a wide variety of popular topics in the domains of electronics, traveling, cooking, etc. This makes such video sites ideal sources of answers to many popular “how-to” questions. In general, however, metadata tags on community-shared videos tend to be sparse and incomplete. Hence attempts to use the original user text queries to retrieve such videos from sites such as YouTube tend not to be effective. Also, we should exploit the richness of visual contents within the video, in conjunction with the textual information mentioned above, to identify the best video answers. To address these two issues, we need to leverage recent advances in text-based and visual-based query matching and expansion techniques to find related multimedia concepts and examples. On the text front, given a user-issued query, an effective technique has been developed to find semantically related questions with readily available text-based answers from Yahoo!Answers (Wang et al., 2009). The result is a set of semantically similar questions posed in different styles, forms and vocabulary by real users. A natural approach is therefore to utilize the text questions and answers found in Yahoo!Answers to expand the text query so that it incorporates the wide variations of terms used to express the same query. The expanded query will permit better recall in the search for “how-to” videos on YouTube.
In addition, multimedia query expansion may be performed to locate image examples by using the technique described in Zha et al. (2009). The multimedia examples may be used during the query stage to find more relevant “how-to” videos or during the re-ranking stage to identify the most relevant videos. In this way, we are able to combine the strength of both text-based and visual-based approaches to perform multimedia “how-to” QA.
The two main stages of the “how-to” video QA system are as follows (Li et al., 2009). Stage 1 focuses on recall-driven related video search. It first finds similar questions posed using different language styles and vocabulary from the Yahoo!Answers site, and relevant visual examples from Flickr. It then uses the similar questions and visual examples found to improve the coverage of the original query. In a way, this is a multimedia query expansion step in which text terms and visual examples that are related to the query and commonly used by users are extracted. However, community video sites like YouTube can only take in precise text queries. Hence, instead of issuing the long expanded text query, we extract only the key text phrases from the questions found in Yahoo!Answers, and use them to formulate multiple search queries to ensure high recall of search results. An immediate research question is how to utilize the visual samples found to further improve the recall of search results in an effective and efficient manner.
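The idea of turning a set of similar questions into several short search queries can be sketched as below. This is only an illustration: the stopword list, the rule of keeping the first few content words as the "key phrase", and the example questions are assumptions, and the cited system's actual phrase extraction is not described in this article.

```python
import re

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"how", "to", "do", "i", "a", "an", "the", "my", "from",
             "can", "is", "what", "of", "in", "on"}


def key_phrase(question, max_terms=4):
    """Reduce a question to a short keyword query of at most `max_terms` words."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    content = [t for t in tokens if t not in STOPWORDS]
    return " ".join(content[:max_terms])


def formulate_queries(original, similar_questions):
    """Build one short query per distinct phrasing, original question first."""
    queries = []
    for question in [original] + similar_questions:
        phrase = key_phrase(question)
        if phrase and phrase not in queries:
            queries.append(phrase)
    return queries


original = "How to transfer photos from my camera to the computer?"
# Hypothetical similar questions, standing in for results from a community QA archive.
similar_questions = [
    "How do I download pictures from a digital camera to my PC?",
    "How can I transfer photos from my camera to the computer",
]
queries = formulate_queries(original, similar_questions)
print(queries)
```

Each resulting query is short enough for a keyword-oriented video search box, and the union of their result lists gives the higher recall that Stage 1 is after; paraphrases that reduce to the same phrase are issued only once.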
Stage 2 is the precision-driven video re-ranking step, where the related videos are re-ranked based on their relevance to the original question. There are three sources of information that can be utilized to perform the re-ranking and improve the search precision. The first is the redundancy of shots within the videos, found through near duplicate detection as is done in the previous section on “Definition QA”. Similarly, duplicate shots here signify important key concepts within the query and can be used to rank the importance of videos. The second is to detect the presence of key visual concepts within the videos by using pre-defined visual concept detectors. As it is hard to build a comprehensive set of visual detectors with sufficient accuracy to cover the wide range of queries posed by users, a more general approach is to utilize the visual examples found during the multimedia query expansion stage to re-rank the videos. An interesting research issue is how to utilize both approaches to perform visual re-ranking of videos. The third is to leverage the community viewers’ comments. The fusion of these three sources of information will permit effective re-ranking of videos to improve search precision.
The above two-stage framework for “how-to” video QA has been implemented by Li et al. (2009), who focused on the electronics domain. They utilized semantically similar questions from Yahoo!Answers for text-based query expansion to carry out the recall-driven retrieval in Stage 1. They then pre-defined a set of high-level visual concepts, such as camera and computer, that are important in this domain. They manually selected training images for these concepts using Google Image Search, and performed salient object recognition based on image matching techniques to detect the presence of these concepts in the video. They also analyzed community viewers' comments to assess the community’s opinion on each video's popularity. In a way, this is similar to opinion voting. Finally, a rank fusion scheme was adopted to generate a new ranking list based on the evidence from visual cues, opinion voting and video redundancy. Figure 4 shows the top two results of the “how-to” video QA system developed in Li et al. (2009).
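The article does not say which rank fusion scheme was used. As one simple possibility, the sketch below fuses three rankings with a Borda count; this choice, the function name and the video identifiers are assumptions for illustration and should not be read as the method of Li et al. (2009).

```python
def borda_fuse(rankings):
    """Fuse several ranked lists with a Borda count.

    In a list of n items, the item at position p (0-based) receives n - p
    points. Items are returned by descending total points, with ties broken
    alphabetically so that the output is deterministic.
    """
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for position, video in enumerate(ranking):
            points[video] = points.get(video, 0) + (n - position)
    return sorted(points, key=lambda video: (-points[video], video))


# Hypothetical rankings of the same three videos from the three evidence sources.
by_visual_cues = ["v2", "v1", "v3"]
by_opinion_voting = ["v1", "v2", "v3"]
by_redundancy = ["v1", "v3", "v2"]

fused = borda_fuse([by_visual_cues, by_opinion_voting, by_redundancy])
print(fused)
```

A position-based scheme like this needs no score calibration across the three sources, which is convenient when visual matching scores, comment-based votes and duplicate counts live on very different scales.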
With the proliferation of multimedia contents on the Web, research on multimedia information retrieval and question-answering are beginning to emerge. The early work that addresses the issues of QA in video is the system named VideoQA as reported in Yang et al. (2003b). This system extends the text-based QA technology to support factoid QA in news video by leveraging on the text transcripts generated from ASR (Automated Speech Recognition) in news video as its visual contents. Users interact with the system using short natural language questions with implicit constraints on contents, duration, and genre of expected videos. The system comprises two stages. In the preprocessing stage, it performs video story segmentation and classification, as well as video transcript generation and correction. During the question answering stage, it employs modules for: question processing, query reinforcement, transcript retrieval, answer extraction and video summarization.
Following this work, several video QA systems were proposed with most of them relying on the use of text transcripts derived from video OCR (optical character recognition) and ASR outputs. Cao and Nunamaker (2004) developed a lexical pattern matching-based ranking method for domain-dependent video QA. Wu et al. (2004) designed a cross-language (English-to-Chinese) video QA system based on retrieving and extracting pre-defined named entity entries in text captions. The system enables users to query with English questions to retrieve the Chinese captioned videos. The system was subsequently extended to support bilingual video QA that permits users to retrieve Chinese videos through English or Chinese natural language questions (Lee et al., 2009). Wu and Yang (2008) presented a robust passage retrieval algorithm to extend the conventional text-based QA to video QA.
As discussed earlier, shot retrieval as proposed in TRECVID can also be regarded as a kind of base technology for factoid video QA. For example, if the user issues a query “Who is Barack Obama?”, the shot retrieval system would aim to return video shots that visually depict the query subject. In this sense, the body of work done on shot retrieval (Snoek and Worring, 2009) as part of the TRECVID efforts can be considered as research towards factoid multimedia QA. The first step in shot retrieval is to extract relevant semantic information for each shot. This includes ASR text, as well as the possible presence of high-level concepts such as face, car, building, etc. (Chang et al., 2006). Given a query, most shot retrieval systems follow a similar retrieval pipeline of query analysis, shot retrieval, shot ranking and answer selection (Neo et al., 2006). Query analysis performs query expansion and inference of relevant high-level concepts by considering the correlation between query text and concepts. In order to cover concept relations that cannot be inferred from statistics, knowledge-driven approaches to relating high-level concepts to queries have been incorporated. Given the expanded query, a combination of text and high-level concept matching is performed to retrieve a list of relevant shots. A multi-modal approach is then employed to re-rank the shots for presentation to the users (Snoek and Worring, 2009).
Little work has been done on image-based QA, apart from that of Yeh et al. (2008), which describes a photo-based QA system for finding information about physical objects. Their approach comprises three layers. The first layer performs template matching of the query photo to online images to extract structured data from multimedia databases in order to help answer questions about the photo; it uses the question text to filter images based on categories and keywords. The second layer searches an internal repository of resolved photo-based questions to retrieve relevant answers. In the third layer, human-computation QA, it leverages community experts to handle the most difficult cases.
Overall, it can be seen that work on factoid multimedia QA has just been started, whereas little work has been done on the more challenging and practical tasks of definition and “how-to” QA.
With the proliferation of text and multimedia information, users are now able to find answers on almost any topic on the Web. On the other hand, they are also overwhelmed by the huge amount of information routinely presented to them. Question-answering (QA) is a natural direction to overcome this information overload problem. The aim of QA is to return precise answers to users’ questions. Text-based QA research has been carried out over the past 15 years with great success, especially for answering fact-based questions, for which commercial offerings exist. This article extends text-based QA research to multimedia QA to tackle a range of factoid, definition and “how-to” QA in a common framework. The system is designed to find multimedia answers from Web-based media resources such as Flickr and YouTube. This article described some preliminary research in definition QA on events, and “how-to” QA in the electronics domain. The existing research results show that it is feasible to perform factoid, definition and “how-to” QA by leveraging large community-based image and video resources on the Web.
The research on multimedia QA is preliminary. Several follow-on research directions can be identified. First, there is an urgent need to set up large test corpora to promote multimedia QA research, especially on definition and “how-to” QA. Second, there is a need to develop better techniques for visual matching and visual concept detection. Visual concept detection techniques are important for uncovering additional visual contents in image/video clips. To ensure scalability of such techniques to Web-scale problems, we need to exploit the various online visual databases with comprehensive visual concept coverage and visual examples, such as Wikimedia (http://wikimedia.org/) and online visual dictionaries. Finally, we need to extend the existing approaches to general domains.
References
- Cao, J. and Nunamaker, J. F. (2004). Question answering on lecture videos: a multifaceted approach. Proc. ACM/IEEE Joint Conference on Digital Libraries.
- Cha, M., Kwak, H., Rodriguez, P., Ahn, Y.-Y., and Moon, S. (2007). I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. Proc. ACM SIGCOMM Conference on Internet Measurement.
- Chang, S.-F., Hsu, W., Jiang, W., Kennedy, L.S., Xu, D., Yanagawa, A., and Zavesky, E. (2006). Columbia University TRECVID-2006 video search and high-level feature extraction. Proc. TRECVID Workshop.
- Chen, K.-Y., Luesukprasert, L., and Chou, S. T. (2007). Hot topic extraction based on timeline analysis and multidimensional sentence modeling. IEEE Transactions on Knowledge and Data Engineering. 19(8):1016-1025.
- Cui, H., Kan, M.-Y., and Chua, T.-S. (2007). Soft pattern matching models for definitional question answering. ACM Transactions on Information Systems, 25(2).
- Fellbaum, C. (1998). WordNet: an electronic lexical database. Cambridge, USA: The MIT Press.
- Hong, R., Tang, J., Tan, H., Yan, S., Ngo, C.-W., and Chua, T.-S. (2009). Event driven summarization for web videos. Proc. ACM Multimedia 1st Workshop on Social Media.
- Hsu, W. H., Kennedy, L. S., and Chang, S.-F. (2007). Video search reranking through random walk over document-level context graph. Proc. ACM Multimedia.
- Kor, D. and Chua, T.-S. (2007). Interesting nuggets and their impact on definitional question answering. Proc. ACM SIGIR Conference, pp. 335-342.
- Lee, Y. S., Wu, Y. C. and Yang, J.C. (2009). BVideoQA: online English/Chinese bilingual video question answering. Journal of the American Society for Information Science and Technology, 60(3): 509–525.
- Li, G., Ming, Z., Li, H., Zheng, Y., and Chua, T.-S. (2009). Video reference: question answering on YouTube. Proc. ACM Multimedia.
- Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2): 91–110.
- Natsev, A. P., Haubold, A., Tesic, J., Xie, L., and Yan, R. (2007). Semantic concept-based query expansion and re-ranking for multimedia retrieval, Proc. ACM Multimedia, pp. 991–1000.
- Neo, S.-Y., Zhao, J., Kan, M.-Y., and Chua, T.-S. (2006). Video retrieval using high level features: exploiting query matching and confidence-based weighting. Proc. CIVR.
- Prager, J. M. (2006). Open-domain question-answering. Foundations and Trends in Information Retrieval, 1(2).
- Snoek, C. G. M. and Worring, M. (2009). Concept-based video retrieval, Foundations and Trends in Information Retrieval, vol. 4, issue. 2, pp. 215-322.
- Voorhees, E. M. (2001). Overview of the TREC 2001 question answering track. Proc. TREC workshop.
- Wang, K., Ming, Z., and Chua, T.-S. (2009). A syntactic tree matching approach to finding similar questions in community-based QA services. Proc. ACM SIGIR Conference.
- Wu, X., Hauptmann, A. G., and Ngo, C.-W. (2007). Practical elimination of near-duplicates from web video search. Proc. ACM Multimedia.
- Wu, Y. C., Lee, Y. S., and Chang, C.H. (2004). CLVQ: cross-language video question/answering system. Proc. the 6th IEEE international symposium on multimedia software engineering.
- Wu, Y. C. and Yang, J. C. (2008). A robust passage retrieval algorithm for video question answering. IEEE Trans. on Circuits and Systems for Video Technology, 18(10).
- Yang, H., Chua, T.-S., Wang, S., and Koh C.-K. (2003a). Structured use of external knowledge for event-based open-domain question-answering. Proc. ACM SIGIR Conference, pp. 33-40.
- Yang, H., Chaisorn, L., Zhao Y., Neo S.-Y., and Chua T.-S. (2003b). Video QA: Question Answering on News Video. Proc. ACM Multimedia.
- Ye, S., Chua, T.-S., and Lu, J. (2009). Summarizing definition from Wikipedia. Proc. ACL.
- Yeh, T., Lee, J. J., and Darrell, T. (2008). Photo-based question answering. Proc. ACM Multimedia.
- Zha, Z. J., Yang, L., Mei, T., Wang, M., Wang, Z. (2009). Visual Query Suggestion. Proc. ACM Multimedia.
- Xian-Sheng Hua and Hong-Jiang Zhang (2008) Media Content Analysis. Scholarpedia, 3(2):3712.