Story segmentation and video classification have been active research topics for many years, and much interesting work has been done. Because video classification is difficult and often subjective, most early works examined only certain aspects of video classification and story segmentation within a structured domain such as sports or news.
Ide et al.  used videotext, motion, and face as the features to tackle the problem of news video classification. They first segmented the video into shots and used multiple techniques, including clustering, to classify each shot into one of five classes: Speech/Report, Anchor, Walking, Gathering, and Computer Graphics. Their classification technique seems effective for this restricted class of problems. Zhou et al.  examined the classification of basketball videos into a restricted set of categories: Left-court, Middle-court, Right-court, and Close-up. They considered only motion, color, and edges as features and employed a rule-based approach to classify each video shot (represented by a key frame). Chen and Wong  also used a rule-based approach to classify news video into six classes: news, weather, reporting, commercials, basketball, and football. They used the feature set of motion, color, text caption, and cut rate in their analysis.
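The rule-based shot classification used in these works can be illustrated with a minimal sketch. The feature names, thresholds, and rules below are hypothetical stand-ins, not the rules actually reported by Zhou et al.; the point is only the overall shape of such a classifier: hand-crafted conditions over low-level key-frame features.

```python
# Illustrative sketch of a rule-based shot classifier in the spirit of
# Zhou et al.'s court-view categories. All thresholds and rules are
# invented for illustration; the original rule set is not reproduced.

def classify_shot(motion_magnitude, dominant_hue, edge_density):
    """Map low-level key-frame features (each normalized to [0, 1])
    to a coarse shot class."""
    # A close-up typically shows little court texture: few edges and a
    # skin-dominant hue (hypothetical hue range).
    if edge_density < 0.1 and 0.0 <= dominant_hue <= 0.1:
        return "Close-up"
    # Fast camera pans often accompany transitions through mid-court.
    if motion_magnitude > 0.5:
        return "Middle-court"
    # Remaining shots are split by a (hypothetical) hue heuristic
    # standing in for an analysis of where the court edges fall.
    return "Left-court" if dominant_hue < 0.3 else "Right-court"
```

In practice such rules are tuned per domain, which is one reason these early systems did not generalize beyond their restricted settings.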
Another category of techniques incorporated information within and between video segments to determine class transition boundaries, mostly using HMM approaches. Eickeler et al.  considered six features, derived from the colour histogram and motion variations across frames, and employed an HMM to classify the video sequence into the classes of Studio Speaker, Report, Weather Forecast, Begin, End, and Editing Effect. Huang et al.  employed audio, colour, and motion as features and classified a TV program (an input news video) into one of the categories of news report, weather forecast, commercial, basketball game, and football game. Alatan et al.  aimed to detect dialogue and its transitions in fiction entertainment videos. They modelled the shots using audio (music/silence/speech), face, and location-change features, and used an HMM to locate the transition boundaries between the classes of Establishing, Dialogue, Transition, and Non-dialogue. Greiff et al.  used only text from the transcript as the feature and employed the HMM framework to model the word sequence. Each word was labelled with a state number from 1 to 251, and a story boundary was located at each word generated by state 1 in the maximum-likelihood state sequence.
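The boundary-detection idea of Greiff et al. can be sketched as standard Viterbi decoding followed by a scan for the designated boundary state. The toy three-state model, vocabulary, and probabilities below are invented for illustration (the original used 251 states trained on transcripts); only the decode-then-mark-boundaries logic reflects the described approach.

```python
import numpy as np

# Sketch of HMM-based story-boundary detection: decode a word sequence
# with Viterbi, then place a boundary at every word the maximum-
# likelihood path assigns to the boundary state (state 0 here plays
# the role of "state 1" in Greiff et al.). Model parameters are toy
# values for illustration only.

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely state sequence for observations `obs`."""
    n_states, T = len(start_p), len(obs)
    logp = np.full((T, n_states), -np.inf)   # best log-prob per state
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    logp[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logp[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            logp[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    # Backtrace from the best final state.
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

start = np.array([0.8, 0.1, 0.1])
trans = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.1, 0.6]])
emit = np.array([[0.7, 0.2, 0.1],   # boundary state favours word 0
                 [0.1, 0.7, 0.2],
                 [0.2, 0.1, 0.7]])
words = [0, 1, 1, 0, 2, 2]           # word IDs from a toy vocabulary
states = viterbi(words, start, trans, emit)
boundaries = [t for t, s in enumerate(states) if s == 0]
```

Because boundaries fall out of a single global decode, this formulation couples segmentation and labelling in one model, which is the main appeal of the HMM family of approaches surveyed here.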
In summary, most reported works considered only a limited set of classes and features, and provided only partial, intermediate solutions to the general video organization problem. In our work, we aim to consider all essential categories of shots and scenes so as to cover potentially all types of news video. A major difference between our approach and existing works is that we perform the story segmentation analysis at two levels, similar to the approach successfully employed in NLP research. Furthermore, we aim to organize video at both the shot and story levels to facilitate user access.