State of the art and requirements analysis
The state of the art on a number of analysis topics of interest to LinkedTV has been surveyed, and requirements for the analysis-related work within LinkedTV have been derived. The resulting document (D1.1) addresses the following topics: visual information pre-processing and representation, visual object and scene labelling, complementary text and audio analysis, event- and instance-based labelling of visual information, and user-assisted annotation.
Hypervideo analysis: video fragmentation
A preliminary evaluation of a shot segmentation technique that detects both abrupt and gradual transitions performed remarkably well (>90% correct detection), with the main cause of error being photographers' flashes. Augmenting the technique with flash detection corrected over three quarters of the erroneous detections caused by flashes. We have also looked at scene segmentation, as well as techniques for spatial and spatio-temporal video segmentation.
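The two-stage idea above (detect candidate cuts, then discard those caused by a flash) can be sketched as follows. This is a minimal illustration, not the project's actual implementation: abrupt transitions are found as large histogram distances between consecutive grey-level frames, and a flash is modelled as a short brightness spike that produces a pair of spurious cuts; all thresholds are illustrative.

```python
import numpy as np

def shot_boundaries(frames, cut_thresh=0.5):
    """Detect abrupt transitions as large grey-histogram distances
    between consecutive frames (frames: list of 2-D uint8 arrays).
    Returns the index of the first frame of each new shot."""
    cuts = []
    for i, (a, b) in enumerate(zip(frames, frames[1:])):
        ha = np.histogram(a, bins=16, range=(0, 256))[0].astype(float)
        hb = np.histogram(b, bins=16, range=(0, 256))[0].astype(float)
        ha /= ha.sum()
        hb /= hb.sum()
        dist = 1.0 - np.minimum(ha, hb).sum()  # histogram intersection distance
        if dist > cut_thresh:
            cuts.append(i + 1)
    return cuts

def drop_flash_cuts(frames, cuts, spike=40.0):
    """Remove cut pairs caused by a photographic flash: a one-frame
    brightness spike yields two spurious boundaries around it."""
    means = [float(f.mean()) for f in frames]
    def is_spike(j):  # frame j much brighter than both temporal neighbours?
        return (0 < j < len(means) - 1
                and means[j] - means[j - 1] > spike
                and means[j] - means[j + 1] > spike)
    # a cut at c separates frames c-1 and c; drop it if either is a spike
    return [c for c in cuts if not (is_spike(c) or is_spike(c - 1))]

# Synthetic clip: one shot (grey 20) with a flash frame (250), then a real cut (200).
frames = [np.full((8, 8), v, dtype=np.uint8)
          for v in (20, 20, 250, 20, 20, 200, 200, 200)]
cuts = shot_boundaries(frames)       # flash produces two false cuts
clean = drop_flash_cuts(frames, cuts)  # only the real transition survives
```

The flash filter mirrors the reported behaviour: the raw detector fires twice around the flash frame, and the brightness-spike test removes both false boundaries while keeping the genuine one.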
Hypervideo analysis: concept labelling
A key focus in LinkedTV is labelling video fragments with concepts. An initial test used SURF descriptors and a single SVM classifier, with the 346 concepts of the TRECVID Semantic Indexing (SIN) task as labels. We then explored combinations of features, interest-point detectors and codebook-definition strategies, together with multiple SVMs per concept. We are also working on more elaborate SVM classifiers, such as the Linear Subclass SVM. For face detection, we have developed an OpenCV-based solution.
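The descriptor-plus-SVM pipeline can be illustrated with a toy bag-of-visual-words example. This is a hedged sketch, not the project code: SURF extraction is replaced by synthetic 64-D descriptors, the codebook is built with k-means, and a single binary SVM stands in for the per-concept classifiers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for SURF: each keyframe yields a bag of 64-D descriptors drawn
# from one of two synthetic appearance modes (a toy "concept present/absent").
def fake_descriptors(mode, n=50):
    centre = np.zeros(64) if mode == 0 else np.full(64, 3.0)
    return centre + rng.normal(size=(n, 64))

keyframes = [(fake_descriptors(m), m) for m in [0, 1] * 20]

# 1. Codebook: cluster all descriptors into visual words.
all_desc = np.vstack([d for d, _ in keyframes])
codebook = KMeans(n_clusters=8, n_init=4, random_state=0).fit(all_desc)

# 2. Represent each keyframe as a normalised visual-word histogram.
def bow_hist(desc):
    words = codebook.predict(desc)
    h = np.bincount(words, minlength=8).astype(float)
    return h / h.sum()

X = np.array([bow_hist(d) for d, _ in keyframes])
y = np.array([m for _, m in keyframes])

# 3. One binary SVM per concept (a single toy concept here).
clf = SVC(kernel="rbf").fit(X[:30], y[:30])
acc = float((clf.predict(X[30:]) == y[30:]).mean())
```

In the real setting one such classifier (or several, whose outputs are fused) is trained per concept, and each video fragment is scored against all 346 SIN concepts.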
Hypervideo analysis: text and audio analysis
We analyse text using a statistical approach to keyword extraction. Audio analysis focuses on an Automatic Speech Recognition (ASR) tool for German, combined with speaker identification (tested on German parliament speeches from 253 different speakers, it achieved an 8% error rate). Finally, work has been done on extracting non-speech audio features to support event detection.
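A common statistical approach of the kind mentioned is TF-IDF ranking, sketched below; this is an illustrative stand-in, not necessarily the exact scheme used in LinkedTV. Words frequent in the document but rare in a background corpus score highest.

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, k=3):
    """Rank the words of `doc` by TF-IDF against a background corpus:
    high term frequency in the document, low document frequency overall."""
    docs = [d.lower().split() for d in corpus]
    words = doc.lower().split()
    tf = Counter(words)
    n = len(docs)

    def idf(w):
        df = sum(1 for d in docs if w in d)   # documents containing w
        return math.log((n + 1) / (df + 1)) + 1.0  # smoothed IDF

    scored = {w: (c / len(words)) * idf(w) for w, c in tf.items()}
    return [w for w, _ in sorted(scored.items(), key=lambda t: -t[1])[:k]]

corpus = ["the cat sat on the mat",
          "the dog ran in the park",
          "the bird flew over the tree"]
keywords = tfidf_keywords("cat cat chased bird", corpus)
```

A production pipeline would add stop-word removal and lemmatisation on top of this scoring.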
Hypervideo analysis: event and instance based labelling
Combining the concept labels with the results of text and audio analysis to detect events, we participated with a first version of our hybrid technique in the Multimedia Event Detection (MED) task of TRECVID. Object re-detection – where an object is labelled manually once in a video and all its other appearances are then identified throughout the video – addresses the problem of instance-based labelling. We use a baseline technique based on SURF and RANSAC, which we will extend.
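The role of RANSAC in such a re-detection baseline is to fit a geometric transform to tentative keypoint matches while rejecting spurious ones. The sketch below is a simplified, hypothetical version: matches are synthetic point pairs, and the model is a plain 2-D translation rather than the full homography a SURF-based pipeline would estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

def ransac_translation(src, dst, iters=100, tol=2.0):
    """Fit a 2-D translation mapping src -> dst point matches via RANSAC,
    tolerating outlier matches. Returns (translation, inlier mask)."""
    best_t, best_in = None, np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]                             # 1-point hypothesis
        inl = np.linalg.norm(dst - (src + t), axis=1) < tol
        if inl.sum() > best_in.sum():
            best_t, best_in = t, inl
    best_t = (dst[best_in] - src[best_in]).mean(axis=0)  # refine on inliers
    return best_t, best_in

# Toy matches: the object moved by (30, -12); 12 of 40 matches are spurious.
src = rng.uniform(0, 100, size=(40, 2))
dst = src + np.array([30.0, -12.0])
dst[:12] = rng.uniform(0, 100, size=(12, 2))            # outlier matches

t, inliers = ransac_translation(src, dst)
```

If enough matches survive as inliers, the object is declared re-detected at the transformed location; too few inliers means the object is absent from that frame.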
The annotation tool combines the analysis techniques of the hypervideo analysis step and provides a user-friendly interface to visualise and correct the results before exporting the annotations in an appropriate format to the subsequent components of LinkedTV. A first version has been developed based on the EXMARaLDA toolkit, which can aggregate the results of shot segmentation, ASR, speaker recognition, etc.
Complementing the improved accuracy of our automated analysis techniques, an Editor Tool has been developed that allows human curators to check, correct and complete the video annotations via an intuitive, Web-based GUI.