Animation Movies Trailer Computation
Bogdan Ionescu
LAPI - University "Politehnica" Bucharest
061071 Bucharest, Romania
bionescu@alpha.imag.pub.ro

Patrick Lambert∗
LISTIC - University of Savoie
B.P. 806, 74016 Annecy Cedex, France
patrick.lambert@univ-savoie.fr

Didier Coquin
LISTIC - University of Savoie
B.P. 806, 74016 Annecy Cedex, France
didier.coquin@univ-savoie.fr

Laurent Ott
LISTIC - University of Savoie
B.P. 806, 74016 Annecy Cedex, France
laurent.ott@ifrance.com

Vasile Buzuloiu
LAPI - University "Politehnica" Bucharest
061071 Bucharest, Romania
buzuloiu@alpha.imag.pub.ro
ABSTRACT
This paper presents a method for the automatic generation
of animation movie trailers. First, the movie is divided into
shots, by detecting the video transitions (cuts, fades and dis-
solves) and an animation movie specific color effect named
short color change or SCC (i.e. explosions, thunders). The
movie action content is further highlighted by analyzing the
movie at two different granularity levels. First, an inter-shot
analysis is performed by measuring the video transition temporal distribution. As repetitive shot changes are related to action, we define an action shot as a movie segment containing a high shot change ratio. In addition, an inter-frame analysis is performed: for each shot within an action shot, a histogram of cumulative inter-frame distances is computed as a measure of the frame spatial activity, since repetitive color changes are also related to action.
Since movie trailers only show some of the most attractive
movie scenes, the proposed trailer is a moving-image ab-
stract computed on the retained action shots. It provides
the user with a compact and efficient representation of the
movie action content. The proposed approach was tested on
several animation movies.
Categories and Subject Descriptors
I.2.10 [Computing Methodologies]: Artificial Intelligence—
vision and scene understanding; H.3.m [Information Sys-
tems]: Information Storage and Retrieval—miscellaneous
General Terms
Algorithms
∗We thank CICA - International Animated Film Center and
Folimage animation company for providing us with anima-
tion movies and for their support.
Copyright is held by the author/owner(s).
MM’06, October 23–27, 2006, Santa Barbara, California, USA.
ACM 1-59593-447-2/06/0010.
1. INTRODUCTION
As the volume of multimedia content grows continuously, a very large amount of video data is available, and summarization is required to browse the significant video content efficiently. Thanks to ”The International Animated Film Festival” [1], held every year in Annecy, France, since 1960, a very large database of animation movies is available. Managing thousands of videos is a tedious task, so an efficient content abstract would be more than welcome.
There are two fundamentally different types of video abstracts. The still-image abstract, or video summary, is a small collection of salient images (keyframes) that best represent the underlying content. The available methods differ in the way the keyframes are constructed: for example, in [3] the keyframes are extracted using derivatives on a curve of characteristic frame vectors, while mathematical modeling is used in [4] (for a literature survey on video summary see [5][14]). On the other hand, the moving-image abstract, or movie skimming, consists of a collection of image sequences. The possibly higher computational effort of the abstraction process pays off at playback time: it is usually more natural and more interesting for users to watch a trailer than a slide show [5]. There
are basically two types of movie skimming: summary se-
quence which is used to provide users with an impression
about the entire movie content and movie highlight which
only contains the most interesting parts of the movie [6].
One simple and straightforward method to create a movie skimming is to include some of the neighboring frames of the keyframes extracted for the movie summary, as presented in [7]. On the other hand, the existing approaches for the generation of movie highlights are related to the characteristics of the extracted events, for example events with a specific semantic label [8] or events that evoke certain reactions from the narrator [9]. A particular case of the movie highlight is the movie trailer, which only shows some of the most attractive action scenes (i.e. scenes containing special events like dialogs, explosions, text occurrences and
general action [10]). Very little work has been done in this
field as it requires a semantic movie content understanding.
Defining which segments are the highlights is actually a very
subjective process [5].
In this paper we propose a method for the automatic gen-
eration of animation movie trailers in the context of the
”The International Animated Film Festival” [1]. Animation
movies from [1] are different from classical cartoons or con-
ventional movies in many respects: every animation movie
has its own color distribution, artistic concepts are used,
and the predominant motion is object motion [11]. Understanding the movie content is sometimes impossible: some
animation experts say that more than 30% of the animation
movies from [1] do not have any logical meaning.
The proposed method is based on highlighting the movie
action segments by analyzing the movie both at shot and
frame level. At the shot-level the movie rhythm is ana-
lyzed by evaluating the mean shot change ratio using the
video transition time distribution. Also, for each shot a
frame-level analysis is performed by computing a cumula-
tive inter-frame distance histogram in order to capture the
frames spatial activity. An action shot is defined as a movie
segment containing both a high shot change ratio and high
histogram values. The movie trailer is further generated as
a moving-image abstract of the obtained action shots.
The article is organized as follows: Section 2 describes the
proposed movie trailer generation method, in Section 3 we
present some experimental results and Section 4 contains
final considerations and future improvements.
2. THE PROPOSED METHOD
As discussed in Section 1, the proposed trailer generation
method analyzes the movie both at shot and frame level.
The method diagram is illustrated in Figure 1.
Figure 1: Trailer generation method diagram.
2.1 Movie temporal segmentation
First, the movie is divided into shots by detecting the
video transitions (cuts, fades and dissolves) and an anima-
tion movie specific color effect named ”short color change”
or SCC (i.e. thunder, explosions). Specially designed detection algorithms, developed to manage the difficulties raised by the peculiarities of animation movies, are used [12]. Shots are determined by fusing the detected video transitions and then removing less relevant frames that do not contain meaningful information (i.e. black frames
between transitions). A video transition annotation is
generated in order to capture the movie's global transition distribution. The proposed annotation describes the movie temporal evolution as a time-continuous signal interrupted by the occurrence of the video transitions. Different signal shapes are associated with each particular transition, preserving the transition length; e.g. a cut is a 0 signal value, a SCC is a small peak (see the red line graph in Figure 2).
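The annotation signal described above might be sketched as follows. The transition list format and the exact signal levels (baseline 1.0, cut 0.0, a peak for a SCC) are illustrative assumptions, not the authors' exact encoding:

```python
# Sketch: build a per-frame annotation signal from a detected transition list.
# The level values are assumed for illustration; the paper only specifies
# that each transition kind gets a distinct shape and keeps its length.

def annotation_signal(num_frames, transitions):
    """transitions: list of (start_frame, length, kind) tuples,
    kind in {'cut', 'fade', 'dissolve', 'scc'}."""
    signal = [1.0] * num_frames          # time-continuous baseline between transitions
    level = {'cut': 0.0, 'fade': 0.3, 'dissolve': 0.3, 'scc': 1.3}
    for start, length, kind in transitions:
        # overwrite the frames covered by the transition, preserving its length
        for f in range(start, min(start + max(length, 1), num_frames)):
            signal[f] = level[kind]
    return signal

sig = annotation_signal(10, [(3, 1, 'cut'), (6, 2, 'scc')])
```

Regions dense in cuts then show up as dense drops to 0, which is exactly what the inter-shot analysis below looks for.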
2.2 Inter-shot analysis
The shot-level analysis aims at highlighting the movie action segments. Experimental tests showed that in almost all the animation movies from [1] the most attractive scenes are related to fast, repetitive shot changes. On the proposed
video annotation, these situations correspond to graph re-
gions containing high densities of vertical lines (see the two
marked action segments in Figure 2). Based on these considerations we define an action shot as a movie segment containing a high density of video transitions.
The video transition density is analyzed by estimating the
mean shot change speed, v̄T , on a time basis of T seconds.
High values of v̄T correspond to a high number of video
transition occurrences within the time interval T and reflect
a fast movie rhythm.
First, we define a basic indicator, ζT, related to the temporal structure of the movie and representing the number of shot changes, Nsc, within a frame interval of T · 25 frames (as 1 s = 25 frames), i.e. a time interval of T seconds: ζT = Nsc|T·25. Regarding ζT as a discrete random variable, its distribution over the entire movie can be evaluated by computing the Nsc values for all the overlapping time windows of size T seconds. Using ζT, we define the mean shot change speed, v̄T, as:

$$\bar{v}_T = E\{\zeta_T\} = \sum_{t=1}^{T \cdot 25} t \cdot f_{N_{sc}}(t) \qquad (1)$$
where $f_{N_{sc}}$ is the probability density of $N_{sc}$, defined as:

$$f_{N_{sc}}(t) = \frac{1}{N} \sum_{i=1}^{N} \delta(N_{sc}^{i} - t) \qquad (2)$$
where N is the number of analyzed time windows of size T seconds and i indexes the current frame interval $[n_i, n_i + T \cdot 25]$, containing $N_{sc}^{i}$ shot changes. Note that $N = (T_{movie} - T) \cdot 25 + 1$ and $n_{i+1} - n_i = 1$, where $T_{movie}$ is the movie length measured in seconds. The action
shots are further obtained by using the following algorithm:
a. Thresholding: all the frames within the currently analyzed frame window i of size T seconds are marked as action frames if ζT ≥ v̄T. An action segment is a time-continuous interval of action frames and is represented as a binary True/False signal (see graph a in Figure 2).
b. Merging: first, the SCCs are marked as action segments, as they contain attractive movie information. Then, neighboring action segments closer than the analysis window T are merged together. This step erases the small gaps, as can be seen in graph b in Figure 2.
c. Clearing: action segments shorter than the analysis window T are erased. This step removes the small spikes in the action segments (see graph c in Figure 2).
d. Removing: all the action segments containing only one movie shot are removed, as short movie segments with a high value of v̄T can produce false action segments (see graph d in Figure 2).
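Steps a-d above can be sketched on a per-frame boolean action mask; step a is simply `mask = [z >= v_mean for z in zeta]`. The `[start, end)` interval representation and the `shots_in` helper (which counts the movie shots inside a segment) are illustrative assumptions:

```python
# Sketch of steps b-d of the action segment refinement, under the assumption
# that segments are [start_frame, end_frame) tuples.

def to_segments(mask):
    """Turn a per-frame boolean action mask (step a output) into intervals."""
    segs, start = [], None
    for i, m in enumerate(list(mask) + [False]):   # sentinel closes a trailing run
        if m and start is None:
            start = i
        elif not m and start is not None:
            segs.append((start, i))
            start = None
    return segs

def refine_segments(segs, window, shots_in):
    """window: analysis window T in frames; shots_in: assumed helper mapping
    a segment to the number of movie shots it contains."""
    merged = []
    for s in sorted(segs):                          # step b: merge close neighbors
        if merged and s[0] - merged[-1][1] < window:
            merged[-1] = (merged[-1][0], s[1])
        else:
            merged.append(s)
    cleared = [s for s in merged if s[1] - s[0] >= window]   # step c: drop spikes
    return [s for s in cleared if shots_in(s) > 1]           # step d: >1 shot only

final = refine_segments([(0, 30), (32, 80), (100, 104)],
                        window=10, shots_in=lambda s: 3)
```

Here the 2-frame gap between the first two segments is merged away, while the 4-frame segment is cleared as a spike.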
Figure 2: Action shot computation using T = 5s (the
oX axis is the frame index, a − d are the computa-
tion steps): video annotation graph (red line), the
obtained action shots (green line).
Several tests were performed on various animation movies for different values of T (T ∈ {1, ..., 10} seconds). The value of T is related to the granularity of the action shots: small values of T result in a high density of short action shots (the action shots are over-segmented). A good compromise between action shot length and density proved to be T = 5 s (see Figure 2).
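For this sliding-window formulation, the mean shot change speed of equations (1)-(2) reduces to averaging the per-window shot-change counts. A minimal sketch, assuming the shot changes are given as a list of frame indices:

```python
# Sketch of the mean shot change speed v_T: count shot changes in every
# overlapping window of T*25 frames (stride of 1 frame) and average the counts.
# The input format (a list of shot-change frame indices) is an assumption.

def mean_shot_change_speed(change_frames, total_frames, T, fps=25):
    w = T * fps                              # window size in frames
    n_windows = total_frames - w + 1         # N = (T_movie - T) * 25 + 1
    counts = [sum(s <= f < s + w for f in change_frames)
              for s in range(n_windows)]     # zeta_T for each window
    return sum(counts) / len(counts)         # E{zeta_T}

v = mean_shot_change_speed([10, 20, 80], total_frames=100, T=2)
```

Averaging the counts directly is equivalent to building the density $f_{N_{sc}}$ first and taking its expectation, since both are the mean of the same window counts.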
2.3 Intra-shot analysis
As the inter-shot analysis provides us only with the movie
global action information, a frame-level analysis is proposed.
The movie shot content is analyzed using a cumulative inter-frame distance histogram. The method was inspired by median filtering techniques.
First, the movie is both temporally and spatially sub-sampled: only one frame in two is retained, and then only one pixel in each block of 4 × 4 pixels is retained, in order to reduce the computational time. The colors are then reduced using a uniform RGB color space quantization into only 5×5×5 colors, motivated by the fact that animation movies use a reduced color palette. For each frame a color histogram is computed: $H^{l}_{shot_k}(c)$, with k the shot index, l the index of the currently analyzed frame within the shot, and c the color index (c ∈ {1, ..., 125}).
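The spatial sub-sampling and 5×5×5 quantized histogram might look as follows. The frame layout (nested lists of 8-bit `(r, g, b)` tuples) is an assumption made for the illustration; the temporal sub-sampling (one frame in two) would simply select every second frame before calling this:

```python
# Sketch: keep 1 pixel per 4x4 block, quantize RGB uniformly into 5 levels
# per channel, and accumulate a 125-bin color histogram.

def quantized_histogram(frame, bins=5):
    hist = [0] * bins ** 3
    step = 256 / bins                       # uniform quantization step per channel
    for row in frame[::4]:                  # 1 row out of 4
        for r, g, b in row[::4]:            # 1 column out of 4 -> 1 pixel per 4x4 block
            idx = (int(r // step) * bins + int(g // step)) * bins + int(b // step)
            hist[idx] += 1
    return hist

frame = [[(255, 0, 0)] * 8 for _ in range(8)]   # toy 8x8 all-red frame
h = quantized_histogram(frame)
```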
As a histogram similarity measure we propose the Manhattan distance, $d_M$, which requires a reduced computational time and is normalized to 1:

$$d_M(H^{l}_{shot_k}, H^{m}_{shot_k}) = \frac{\sum_{c=1}^{N_c} |H^{l}_{shot_k}(c) - H^{m}_{shot_k}(c)|}{2 \times N_p} \qquad (3)$$
where $N_c = 125$ is the number of colors and $N_p$ is the frame's total number of pixels. The normalized inter-frame cumulative distance for frame l of shot k, $D_{shot_k}(l)$, is defined as:

$$D_{shot_k}(l) = \frac{\sum_{m \in S,\, m \neq l} d_M(H^{l}_{shot_k}, H^{m}_{shot_k})}{Card(S) - 1} \qquad (4)$$

where S is the retained frame set for shot k.
The cumulative inter-frame distance histogram, denoted $H^{D}_{shot_k}$, is computed by quantizing the $D_{shot_k}(l)$ values for l ∈ S into $N_b = 100$ bins (which proved to be a good compromise between histogram size and precision):

$$H^{D}_{shot_k}(d_i) = \sum_{l \in S} \delta(D_{shot_k}(l) - d_i) \qquad (5)$$

with S the retained frame set for shot k, $d_i$ the normalized cumulative inter-frame distance value, and i the bin index (i = 1..100).
Figure 3: An example of a multi-modal histogram
(the oX axis corresponds to the bin index and the
oY axis to the histogram value).
High histogram values correspond to frame spatial activ-
ity. By analyzing the obtained cumulative histograms for
several animation movies from [1], we found that only a limited number of histogram patterns occur. They are:
1. small distance histograms (pattern 1) which cor-
respond to shots with very few color changes (almost
uniform);
2. histograms with both small and high distances
(pattern 2) which correspond to shots with a predomi-
nant frame color similarity but also containing several
important color changes;
3. multi-modal histograms (pattern 3) which corre-
spond to shots containing several similar color groups
of frames usually linked by a camera motion (see in
Figure 3 an example for a shot containing a 3D cam-
era motion with several focuses on different important
objects);
4. single-modal histograms (pattern 4) which match
the shots containing a high amount of color changes.
For each shot within the obtained action shots, the partic-
ular inter-frame histogram patterns are determined by ana-
lyzing the min/max value distribution using the algorithm
proposed in [13].
2.4 Trailer generation
The proposed trailer, denoted $A_{trailer}$, is defined as:

$$A_{trailer} = \bigcup_{m=1}^{M} \bigcup_{k=1}^{N_m} s^{k}_{p\%} \qquad (6)$$

where M is the number of determined action shots, $N_m$ is the number of movie shots in the current action shot m, and $s^{k}_{p\%}$ is a centered image sequence containing p% of the frames of shot k. More detail is captured for longer shots, as they contain more information.
The value of p is related to the $H^{D}_{shot_k}$ pattern. For shots with histogram pattern 1 or 2, which contain similar color information, a smaller value of p is used (around 15%), while for shots with histogram patterns 3 or 4, which contain much more action information, we use p = 35%. The values
of p were empirically determined through the manual analysis of several animation movies, with the constraint of preserving the visual continuity of the obtained trailer.
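The trailer assembly of equation (6) might be sketched as below. Shots are given as `(start_frame, end_frame)` pairs and `pattern_of` is an assumed helper returning the histogram pattern (1-4) of a shot; the 15%/35% mapping follows the text:

```python
# Sketch of eq. (6): take the centered p% of each shot inside every action
# shot and collect the selected frame ranges.

def centred_excerpt(start, end, p):
    """Centered sub-sequence covering p% of the shot [start, end)."""
    length = end - start
    keep = max(1, round(length * p / 100))
    offset = (length - keep) // 2
    return (start + offset, start + offset + keep)

def build_trailer(action_shots, pattern_of):
    """action_shots: list of action shots, each a list of (start, end) shots;
    pattern_of(shot) -> histogram pattern in {1, 2, 3, 4} (assumed helper)."""
    excerpts = []
    for shots in action_shots:
        for shot in shots:
            p = 15 if pattern_of(shot) in (1, 2) else 35   # p from the H^D pattern
            excerpts.append(centred_excerpt(*shot, p))
    return excerpts

trailer = build_trailer([[(0, 100), (100, 300)]], pattern_of=lambda s: 3)
```

Because p is a fixed fraction of the shot length, longer shots automatically contribute longer excerpts, matching the "more detail for longer shots" rule above.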
3. EXPERIMENTAL RESULTS
Evaluating a video abstraction technique is a very subjective task, as it relies on the human perception of the video content. No consistent evaluation framework exists yet, and the proposed approaches usually lack a performance comparison with other existing techniques. Several evaluation methods have been proposed: result description, objective metrics and user studies, the latter being probably the most useful and realistic form of evaluation [14].
The proposed animation trailer was tested on 10 short
animation movies from [1][2] with a total time of 61 min-
utes, namely: ”Casa” (#1), ”Circuit Marine” (#2), ”Fer-
railles” (#3), ”Francois le Vaillant” (#4), ”Gazoon” (#5),
”La Bouche Cousue” (#6), ”La Cancion du Microsillon”
(#7), ”Le Moine et le Poisson” (#8), ”Paroles en l’Air”
(#9) and ”The Buddy System” (#10).
The quality of the obtained results was evaluated by conducting a user study involving 27 animation artists and ordinary people. The test consisted of answering several questions related to the quality of the action content representation of the proposed movie trailers.
For question A, ”Do you think that the proposed trailer contains the movie's most important parts?”, the answers were represented using a score ranging from 1 to 10, meaning: X = don't know, 1-2 = not at all, 3-4 = very few, 5-6 = some, 7-8 = almost all and 9-10 = all of them. We achieved a global mean score of 7.7 with a standard deviation of 1.3.
For question B, ”How do you find the length of the proposed trailer?”, a score ranging from 0 to 4 is used: 0 = very short, 1 = short, 2 = appropriate, 3 = long and 4 = very long. We achieved a global mean score of 2.6 with a standard deviation of 0.6 (see Figure 4).
Per-movie mean scores (Figure 4 data, with ± std. dev. in the original plots; ”don't know” counts of X = 1 and X = 4 were recorded for two of the movies in question A):

Movie:      #1   #2   #3   #4   #5   #6   #7   #8   #9   #10
Question A: 7.5  8.2  8.3  8.1  8.5  6.0  7.4  8.0  7.6  7.6
Question B: 2.3  3.6  3.2  3.1  2.4  1.9  2.2  2.5  2.5  2.5
Figure 4: The obtained scores for the question A
(top) and B (bottom). The oY axis is the achieved
mean score (X is the number of ”don’t know” an-
swers).
The proposed video trailers show very satisfactory scores,
being appreciated as containing almost all the action parts
of the movies and having a correct length.
4. CONCLUSIONS
In this article we have proposed an automatic trailer generation method for the special case of animation movies [1][2]. The proposed method analyzes the movie both at the shot level, by highlighting the movie action segments, and at the frame level, where a cumulative inter-frame color histogram is computed as a measure of the frame spatial activity. The proposed trailer is generated as a moving-image abstract of all the highlighted movie segments with respect to their spatial activity. The validation of the proposed method was performed by conducting a user study involving several animation movie experts from [1]. Future work will consist of adding the movie soundtrack to the proposed video skim.
5. REFERENCES
[1] Centre International du Cinema d’Animation.
”http://www.annecy.org”.
[2] Folimage Company. ”http://www.folimage.com”.
[3] A. D. Doulamis, N. Doulamis and S. Kollias.
Non-sequential video content representation using
temporal variation of feature vectors. IEEE
Transactions on Consumer Electronics, 46(3), 2000.
[4] H.-C. Lee and S.-D. Kim. Iterative key frame selection
in the rate-constraint environment. Signal Processing:
Image Communication, 18:1-15, 2003.
[5] Y. Li, T. Zhang and D. Tretter. An overview of video
abstraction techniques. Tech. Rep. HP-2001-191, HP
Laboratory, July 2001.
[6] A. Hanjalic and H.J. Zhang. An Integrated Scheme for
Automated Video Abstraction Based on Unsupervised
Cluster-Validity Analysis. IEEE Transactions on
Circuits and Systems for Video Technology, 9(8), 1999.
[7] J.-Q. Ouyang, J.-T. Li and Y.-D. Zhang. Replay
Boundary Detection in Mpeg Compressed Video.
International Conference on Machine Learning and
Cybernetics, 5, 2003.
[8] J. Assfalg, M. Bertini, C. Colombo, A. Del Bimbo and
W. Nunziati. Semantic annotation of soccer videos:
automatic highlights identification. Computer Vision
and Image Understanding, 92(2-3):285-305, 2003.
[9] F. Coldefy and P. Bouthemy. Unsupervized Soccer
Video Abstraction Based on Pitch, Dominant Color
and Camera Motion Analysis. ACM Multimedia, pages
268-271, 2004.
[10] R. Lienhart, S. Pfeiffer and S. Fischer. Automatic
Movie Abstracting and Its Presentation on an
HTML-Page. ”http://www.informatik.uni-mannheim.de/pi4/publications/Lienhart1997c.pdf”.
[11] Cees G.M. Snoek and M. Worring. Multimodal Video
Indexing: A Review of the State-of-the-art. Multimedia
Tools and Applications, 25(1):5-35, 2005.
[12] B. Ionescu, V. Buzuloiu, P. Lambert and D. Coquin.
Improved Cut Detection for the Segmentation of
Animation Movies. IEEE International Conference on
Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
[13] H. Cheng and Y. Sun. A Hierarchical Approach to
Color Image Segmentation using Homogeneity. IEEE
Transactions on Image Processing, 9(12):2071-2082,
2000.
[14] B.T. Truong and S. Venkatesh. Video Abstraction: A
Systematic Review and Classification. Accepted for
ACM Transactions on Multimedia Computing,
Communications and Applications, 2006.