3D Human Pose And Shape Estimation From Multi-View Imagery

Atul Kanaujia
ObjectVideo, Inc.
akanaujia@objectvideo.com
Niels Haering
ObjectVideo, Inc.
nhaering@objectvideo.com
Graham Taylor
New York University
gwtaylor@cs.nyu.edu
Chris Bregler
New York University
chris.bregler@nyu.edu
Abstract
In this study we present a robust solution for estimating the
3D pose and shape of human targets from multiple synchronized
video streams. The objective is to automatically estimate
physical attributes of the targets that allow us to analyze
their behavior non-intrusively. The proposed system estimates
the anthropometric skeleton, pose and shape of the human
target from the 3D visual hull reconstructed from multiple
silhouettes of the target. A discriminative (bottom-up) method
is first used to initialize the 3D pose of the target using
low-level features extracted from the 2D images. The pose is
then refined using a generative (top-down) method that also
estimates the optimal skeleton of the target using
anthropometric prior models learned from the CAESAR dataset.
Statistical shape models are also learned from the CAESAR
dataset and are used to model both global and local shape
variability of human body parts. We also propose a novel
optimization scheme that fits the 3D shape by searching in the
parametric space of the local part models while constraining
the overall shape using a global shape model. The system
provides a useful framework for automatically identifying
disproportionate body parts, estimating the size of backpacks
and inferring attributes such as gender, age and ethnicity of
the human target.
1. Introduction
With rapid advancements in computer vision technology and the
emergence of mature technologies for detection and tracking of
human targets from a significant stand-off point, there is a
greater need for cognitive video analytics with the ability to
infer subtle attributes of humans and analyze human behavior.
Tremendous progress has been made in sensor technology in
recent years, enabling the development of advanced video
sensors with gigapixel resolution that are capable of
providing sufficient image resolution for detailed analysis of
human targets from a large distance.
In this paper we propose a framework for inferring various
human attributes from multi-view video based 3D human pose and
shape estimates. Detailed 3D human shape estimation from
multi-view imagery is still a difficult problem that does not
have a satisfactory solution. Our fully automated system
estimates the skeleton, 3D pose and shape of human targets
from multi-view images obtained from synchronized and
calibrated sensors, in a non-intrusive way. Concealed objects
typically manifest as local artifacts on the human body
surface and are difficult to fit using global human shape
models. In order to accurately fit local bulges on the human
body, we use an iterative mechanism to locally fit body part
shapes to the sensor data. The estimated 3D shape of the
target is used to classify its gender, to determine whether it
is carrying a backpack or concealing an object, and to infer
the dimensions of various body parts.
Contributions: Our approach combines the strengths of the
discriminative (bottom-up) approach with model-based
generative (top-down) algorithms for efficient estimation of
the 3D pose and shape of the target. In that respect, our work
is similar to [19]. However, as discussed in section 3, our
human modeling technique differs significantly from theirs in
representation. In addition, we use local, parts-based shape
models for fitting 3D shapes to the data. Local parts-based
shape models provide richer and more flexible representations
of human body shapes, enabling improved fitting to abnormal
target shapes. The following are the key contributions of our
approach: (1) We propose a coarse-to-fine 3D pose and shape
fitting algorithm that uses an intermediate step of pose
refinement with a cylindrical parts-based human shape model.
This allows us to easily enforce anthropometric constraints
such as non-self-penetration of body parts, which is costly to
impose when modeling finer surface deformation; (2) To
efficiently infer both the skeleton and the shape of the human
target in any pose, we model the human body using a joint
subspace of anthropometric skeletons and 3D shapes; (3) We
have developed a novel parts-based 3D shape and pose
optimization scheme that fits the part shapes locally to the
observation while constraining the overall shape globally;
(4) We extend the 3D shape model to fit the shapes of humans
carrying accessories such as a backpack.
Related work: Initial work on marker-less motion-capture
focused on accurate 3D pose estimation from single and
multi-view imagery. A comprehensive survey of existing
state-of-the-art techniques in vision-based motion capture is
provided in [15]. Bregler and Malik [4] proposed a
representation for articulated human models using twists that
has been widely employed in a number of single and multiple
camera based motion capture systems [9, 17, 21, 8, 7].
Compared to earlier approaches [15] that modeled human shapes
with cylindrical or superquadric parts, current methods model
3D human shapes more accurately using SCAPE body models [2] or
the CAESAR dataset [1]. A number of recent multi-camera
systems proposed by Balan and Sigal [2, 3, 19] employed SCAPE
data to model variability in 3D human shapes due to
anthropometry and pose. They used these shape models to
estimate human body shape under loose clothing and also to
track efficiently across multiple frames. Guan et al. [12]
used a SCAPE-based shape model to perform height-constrained
estimation of body shape. These approaches, however, lack an
articulated skeleton underlying the human body shape. The 3D
shape deformation of the body surface is captured by tracking
the 3D mesh surfaces directly. Deforming the 3D mesh while
maintaining surface smoothness is not only computationally
demanding but also ill-constrained, occasionally causing poor
surface deformation due to noisy silhouettes (or visual hull).
In parallel to the above approaches, Munderman et al. [16]
developed a SCAPE model with an underlying skeleton to track
3D shapes of a human target in multi-view image sequences
using an extension of the Iterative Closest Point (ICP)
algorithm. Our proposed system most closely resembles the work
of Gall et al. [8, 21]. In addition to proposing combined
skeleton and 3D shape based human models, they fit the 3D pose
to multi-view image data using a combined local and global
optimization scheme. An extension of the above work [9] used
action-based priors to improve pose tracking and 3D shape
estimation from multi-view image data. Moll et al. [17]
proposed a multi-modal system to improve 3D human pose and
shape estimation from multi-view imagery by using both visual
cues and global orientation information from inertial sensors.
Chen et al. [5] developed a non-linear manifold representation
of the 3D shape variability of humans due to pose and
anthropometry. They use non-linear optimization to search in
the low-dimensional parameter space of shape and camera
parameters to optimally fit the 3D shape to the silhouette.
2. System Overview
Fig. 1 shows an overview of the system. The system uses
synchronized streams of multi-view image sequences of the
human target from a set of calibrated cameras as inputs. It
generates a 3D volumetric reconstruction (visual hull) of the
target from the target silhouettes using space carving
(fig. 1(a)). We use bottom-up predictors to generate initial
hypotheses of the articulated 3D pose of the human
independently from each sensor and fuse them at the semantic
3D pose level (fig. 1(c)).

Figure 1. Overview of the proposed system for 3D human pose and
shape estimation; (a) 3D data is acquired as a volumetric
reconstruction using space carving; (b) Human shape is modeled
by registering a 3D template mesh to laser scans of human
subjects and using it to learn a PCA subspace; (c) Bottom-up
predictors are used to generate initial human pose hypotheses
using features extracted from the image; (d) Pose predictions
are refined by top-down search in pose and shape space using a
coarse 3D human shape model; (e) Detailed 3D shape is estimated
by searching in the parametric space of shape models of
individual parts; (f) The estimated pose and shape are used for
inferring human attributes and anomalous shapes.

The 3D pose is refined by top-down (generative) methods that
use Markov Chain Monte Carlo (MCMC) based search to
efficiently fit a coarse 3D human shape model (with
cylindrical body parts) to the extracted visual hull
(fig. 1(d)). The top-down models are used to search in both
the pose space and the parametric space of skeletons and
coarse 3D human shapes to maximize the overlap with the visual
hull. We model the space of detailed human shape variation
using Principal Component Analysis (PCA). The human 3D shape
model is learned by first establishing one-to-one
correspondence between a hole-filled template 3D mesh model
and a corpus of human body scans from the CAESAR dataset [18]
(fig. 1(b)). The registered 3D mesh data is used to learn
low-dimensional models for local parts-based and global shape
variability in humans. The detailed 3D shape of a target human
is obtained by searching in the PCA-based low-dimensional
parametric shape space for the best-fitting match (fig. 1(e)).
The developed system is used for analyzing 3D human shapes and
inferring attributes of the human target such as gender and
the dimensions of their body parts. In all of our experiments
we employed 4 calibrated cameras placed along directions
chosen to maximally capture the entire viewing sphere around
the target. Although using fewer cameras introduces ambiguity,
we overcome this problem by using efficient anthropometric
priors for searching in both pose and shape space.

Figure 2. The 3D mesh surface and underlying skeleton of a
template human model are iteratively deformed to align them to
the human body scan data (CAESAR dataset). All scans have 73
landmark points on the body surface that are used for 3D shape
registration.

3. 3D Human Pose and Shape Modeling
We model the human body as a combination of an articulated
skeleton and a 3D shape. The shape is modeled both coarsely
(using cylindrical parts) and finely (using a detailed 3D
surface mesh). We learn the 3D shape models for both the
entire human body and the individual body parts (15
components). We assume that the human body shape is deformed
only by the underlying skeleton (and not by other factors such
as clothing). Using a skeleton to deform the 3D mesh surface
is more robust to noisy silhouettes than skeleton-free shape
estimation [2], as it places additional constraints on the
shape fitting by searching in the parametric space of human
shape models.
3D Data Acquisition: Targets are localized using change
detection. We model the background pixel intensity
distribution as a non-parametric kernel density estimate to
extract silhouettes of moving targets. Image streams from
multiple calibrated sensors are used to reconstruct a 3D
volumetric representation (visual hull) of the human target
using space carving. We use an octree-based fast iterative
space carving algorithm to extract the volumetric
reconstruction of the target. A single volume (cube) that
completely encloses the working space of the acquisition
system is defined. Based on its projection onto each camera
image plane, each voxel is classified as inside, outside or on
the boundary of the visual hull using the target silhouette.
The boundary voxels are iteratively subdivided into eight
parts (voxels) until the voxel size falls below a threshold
size.
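
The octree subdivision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: `project` is a hypothetical per-camera callable that maps a 3D point to pixel coordinates, silhouettes are plain 0/1 grids, and a voxel is classified conservatively from the pixel bounding box of its projected corners.

```python
import math
from itertools import product

def corners(origin, size):
    """Return the 8 corner points of an axis-aligned cubic voxel."""
    ox, oy, oz = origin
    return [(ox + dx * size, oy + dy * size, oz + dz * size)
            for dx, dy, dz in product((0, 1), repeat=3)]

def classify(origin, size, views):
    """Label a voxel 'inside', 'outside' or 'boundary'.

    views is a list of (project, silhouette) pairs. project is a
    hypothetical callable mapping a 3D point to pixel coordinates (u, v);
    silhouette is a 2D list of 0/1 values indexed as silhouette[v][u].
    """
    fully_inside = True
    for project, sil in views:
        pts = [project(c) for c in corners(origin, size)]
        u0 = math.floor(min(p[0] for p in pts))
        u1 = math.ceil(max(p[0] for p in pts))
        v0 = math.floor(min(p[1] for p in pts))
        v1 = math.ceil(max(p[1] for p in pts))
        height, width = len(sil), len(sil[0])
        # Pixels covered by the voxel footprint; off-image pixels count as background.
        pixels = [sil[v][u] if 0 <= v < height and 0 <= u < width else 0
                  for v in range(v0, v1) for u in range(u0, u1)]
        if not any(pixels):
            return "outside"      # carved away by this view
        if not all(pixels):
            fully_inside = False  # footprint straddles the silhouette edge
    return "inside" if fully_inside else "boundary"

def carve(origin, size, views, min_size):
    """Recursively subdivide boundary voxels until they are smaller than min_size."""
    label = classify(origin, size, views)
    if label == "outside":
        return []
    if label == "boundary" and size >= min_size:
        half = size / 2.0
        leaves = []
        for dx, dy, dz in product((0, 1), repeat=3):
            child = (origin[0] + dx * half, origin[1] + dy * half, origin[2] + dz * half)
            leaves.extend(carve(child, half, views, min_size))
        return leaves
    return [(origin, size, label)]
```

The returned leaves are `(origin, size, label)` triples; only boundary voxels are ever refined, which is what keeps the octree compact compared with a dense voxel grid.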
As the 2D shape of the silhouette plays a critical role in
discriminative 3D pose prediction (see section 4), the visual
hull is back-projected to obtain clean silhouettes of the
target using Z-buffering. The improved silhouettes yield
cleaner shape descriptors for improved 3D pose estimation
using bottom-up methods.
Human 3D Shape Registration: Laser scans of human bodies from
the CAESAR dataset are used to learn parametric models for 3D
human shapes. Human body scans are first registered to a
perfect, hole-filled reference template human model composed
of both a 3D mesh surface and an accurately aligned skeleton.
We use a detailed template model of standard anthropometry in
order to capture a wide range of variations, including subtle
ones, in human 3D shapes. The CAESAR dataset has 73 landmark
points at various positions on the body, and these are used to
guide the 3D shape registration. The deformation is an
iterative process that gradually brings the template surface
mesh vertices (and the skeleton) close to the laser scan data
points by translating them along the surface normal while
maintaining surface smoothness.
Anthropometric Prior and Coarse Human Shape Modeling: We learn
parametric models for the space of human skeletons and a
coarse representation L of the 3D shape of the human body
using cylindrical parts (see fig. 3). Principal Component
Analysis (PCA) is used to learn the space of human skeletons
and the variability of the dimensions of the cylindrical body
parts from the registered CAESAR dataset [18] (see fig. 2).
The space of human skeletons is parameterized using a
5-dimensional PCA subspace, capturing 94% of the variability
in the lengths of the skeletal links. The coarse 3D human
shape model parameters $L = [\, l \;\; r_1 \;\; r_2 \,]$
comprise the length and the two radii of each tapered
cylindrical body part.
Global and Part-based Shape Modeling: We characterize the
space of human body shapes and of the individual body parts
using Principal Component Analysis (PCA). Global 3D human
shape models are too restrictive to capture shape variability
due to a concealed object or a disproportionate or abnormally
sized body part. In comparison, parts-based 3D shape models
are richer in modeling asymmetries and surface protrusions
arising from object concealment. We use PCA to learn a
subspace for each body part from the part vertices of the
registered shapes, which are in one-to-one correspondence with
the pre-segmented template mesh model.
Detailed Parts Shapes from Coarse Human Model: In order to
efficiently initialize the detailed 3D parts from the coarse
cylindrical body parts, we employ an approach similar to [1]
for learning the relation between the PCA coefficients of the
$i$th body part and the dimensions of its corresponding
cylindrical shape model
($L^{(i)} = [\, l^{(i)} \;\; r_1^{(i)} \;\; r_2^{(i)} \,]$).
Specifically, we learn a linear regression map using the PCA
coefficients $[P]_{N \times k}$ of the $N$ data points in the
$k$-dimensional PCA subspace. The regression function is

$$ M \, [\, l^{(i)} \;\; r_1^{(i)} \;\; r_2^{(i)} \;\; 1 \,]^T = [\, P_1^{(i)} \; \cdots \; P_k^{(i)} \,]^T . $$

The mapping is learned as a pseudo-inverse:

$$ M = P (L L^T + \lambda I)^{-1} \qquad (1) $$

where $\lambda$ is the regularization constant of the ridge
regression. The PCA coefficients of the detailed 3D shape of
the $i$th body part can be directly computed from the
dimensions of the cylindrical body part as
$M \, [\, l^{(i)} \;\; r_1^{(i)} \;\; r_2^{(i)} \;\; 1 \,]^T$.

Figure 3. (Top left) Space of articulated human skeletons; (Top
right) Coarse human shape model used in our system; (Bottom
left) Average detailed human shape model; (Bottom right) Coarse
human shape model with the size of the parts estimated from the
detailed 3D shape.
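
As a reading aid, the ridge-regression solution can be derived under one explicit convention. The convention is an assumption, since the text does not fix the matrix shapes: the augmented cylinder dimensions of the $N$ training shapes are stacked as the columns of $L \in \mathbb{R}^{4 \times N}$, and the corresponding PCA coefficients as the columns of $P \in \mathbb{R}^{k \times N}$.

```latex
\begin{aligned}
E(M) &= \lVert P - M L \rVert_F^2 + \lambda \lVert M \rVert_F^2 , \\
\frac{\partial E}{\partial M} &= -2\,(P - M L)\,L^T + 2\lambda M = 0 , \\
M\,(L L^T + \lambda I) &= P L^T , \\
M &= P L^T (L L^T + \lambda I)^{-1} .
\end{aligned}
```

Under this convention the closed form carries the cross term $P L^T$ where (1) writes $P$, so (1) is perhaps best read with $P$ standing for that product. For any $\lambda > 0$ the $4 \times 4$ matrix $L L^T + \lambda I$ is positive definite, so the inverse always exists.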
4. Bottom-up 3D Pose Estimation
Due to the high degree of articulation of the human body,
searching in the high-dimensional pose space is prone to local
optima. We overcome this problem by initializing the search
near the global optimum using discriminative (bottom-up)
methods. To this end, we employ a regression-based framework
to directly predict multiple plausible 3D poses (obtained as a
probability distribution over pose space) using the visual
cues extracted from individual sensors. The predictive
distribution from multiple sensors is then obtained by simply
summing these distributions. Inferring 3D pose using only 2D
visual observations is an ill-posed problem, due to the loss
of depth information under perspective projection. Learning
therefore involves modeling an inverse perspective mapping
that is one-to-many, as several 3D human configurations can
generate similar 2D visual observations. We therefore model
these relations as multi-valued mappings using the Bayesian
Mixture of Experts (BME) [20] model. Formally, the BME model is
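$$ p(x \mid r) = \sum_{i=1}^{M} g_i(r)\, p_i(x \mid r) \qquad (2) $$

$$ g_i(r) = \frac{\exp(\lambda_i^\top r)}{\sum_k \exp(\lambda_k^\top r)} \qquad (3) $$

$$ p_i(x \mid r) = G(x \mid W_i r, \Omega_i^{-1}) \qquad (4) $$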
where r is the input or predictor variable (image
descriptors), x is the output or response (3D pose
parameters), and $g_i$ are the input-dependent positive gate
functions. The gates $g_i$ output values in [0, 1] and are
computed using (3). For a particular input r, the gates output
the probability of each expert function being used to map r to
the output pose x. In the model, $p_i$ refers to Gaussian
distributions with covariances $\Omega_i^{-1}$ centered at the
different "expert" predictions. The BME is learned in the
Sparse Bayesian Learning (SBL) paradigm, which uses an
Automatic Relevance Determination (ARD) mechanism to train
sparse (less parameterized) regression models. We use an
accelerated training algorithm based on forward basis
selection [6] to train our discriminative models on a large
database of labeled poses observed from different viewpoints.
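
Equations (2)-(4) can be evaluated directly once the gate and expert parameters are known. The sketch below is illustrative only and assumes a scalar pose parameter x with one variance per expert, whereas the model in the text predicts a pose vector with covariance $\Omega_i^{-1}$; all parameter names are hypothetical, and training (SBL/ARD) is not shown.

```python
import math

def softmax_gates(lambdas, r):
    """Eq. (3): input-dependent gate probabilities, one per expert."""
    scores = [sum(l_k * r_k for l_k, r_k in zip(lam, r)) for lam in lambdas]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bme_density(x, r, lambdas, expert_weights, variances):
    """Eq. (2): gate-weighted sum of expert Gaussians centred at W_i r (eq. (4))."""
    gates = softmax_gates(lambdas, r)
    means = [sum(w_k * r_k for w_k, r_k in zip(w, r)) for w in expert_weights]
    return sum(g * gaussian_pdf(x, mu, var)
               for g, mu, var in zip(gates, means, variances))
```

Because the gates depend on r, different image descriptors can switch between experts, which is what lets the mixture represent a one-to-many mapping from 2D observations to 3D poses.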
In multi-camera settings, visual cues can be fused at the
feature level to train a single discriminative model that
predicts the 3D pose from a concatenated feature vector
obtained from multiple sensors. However, such a model would
depend on the camera configuration. Instead, we train a single
mixture-of-experts model to predict the 3D pose from a single
camera input, but with training examples captured from
multiple viewpoints. We use this model to predict poses from
each of the viewpoints independently. The combined predictive
distribution is obtained by simply summing the mixture of
Gaussian distributions obtained from each of the sensor models
$C = \{C_1, \cdots, C_N\}$, with the gate weights re-weighted
to sum to one:
$$ p(x \mid r, W, \Omega, \lambda) = \sum_{j=1}^{N} \sum_{i=1}^{M} g_{ij}(r \mid \lambda_{ij})\, p_{ij}(x \mid r, W_{ij}, \Omega_{ij}^{-1}) \qquad (5) $$
where N is the number of sensors and M is the number of
experts in each mixture-of-experts model used to learn the
mapping.
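
The fusion in (5) amounts to pooling the Gaussian components predicted for each camera and renormalizing their gate weights. A minimal sketch, assuming each per-sensor prediction is already available as a list of hypothetical `(weight, mean, variance)` components for a scalar pose parameter:

```python
def fuse_mixtures(per_sensor_mixtures):
    """Pool the components of all sensors and rescale the weights to sum to one.

    per_sensor_mixtures: list (one entry per sensor) of lists of
    (weight, mean, variance) tuples. The tuple layout is an assumption
    made for this illustration.
    """
    pooled = [comp for mixture in per_sensor_mixtures for comp in mixture]
    total = sum(weight for weight, _, _ in pooled)
    return [(weight / total, mean, var) for weight, mean, var in pooled]
```

With N sensors whose gates each sum to one, this divides every gate by N, so a pose hypothesis supported by several views accumulates more probability mass than one seen from a single camera.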
5. Top-down 3D Pose Refinement and Coarse
Shape Estimation
A generative (top-down) model-based feedback stage is used to
further refine the 3D pose estimates obtained from the
bottom-up methods. Our generative model consists of a coarse
3D human shape model with each body part represented using
simple geometric primitives such as tapered cylinders.
Geometric primitives allow fast image likelihood computation
and make it easy to enforce the non-self-penetration
constraint for the body parts. The top-down search fits the
human model to the visual hull by optimizing the parameters of
the human skeleton model (5 dimensional), the coarse 3D shapes
(5 dimensional) and the joint angles (≈ 15 after
variance-based pruning). We use the predictive distribution
from the feed-forward methods to prune the joint angles having
low variance. The likelihood cost is computed as the sum of
the degree of overlap of each part with the visual hull, with
an added cost for each pair of intersecting parts (see
fig. 4). To compute the self-penetration cost, we compute the
shortest distance D between the axes of two cylindrical body
parts of radii R1 and R2. For two intersecting parts, we add a
penalty term proportional to (R1 + R2 − D) to the likelihood
function.

Figure 4. (left) Top-down model fitting is initialized by
aligning the root joint and the centroid of the visual hull
(shown in blue); (right) The overlap cost is computed as the
number of voxels (visual hull elements) lying inside the
cylindrical body part. Part self-intersection is penalized by
adding a cost proportional to (R1 + R2 − D) for every
self-penetrating part.
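
The self-penetration term can be sketched as below. The closest-distance routine is the standard clamped segment-to-segment computation and treats each part axis as a finite 3D segment; the `weight` factor and the function names are assumptions for illustration.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _clamp01(x):
    return max(0.0, min(1.0, x))

def axis_distance(p1, q1, p2, q2, eps=1e-12):
    """Shortest distance D between segments p1-q1 and p2-q2 (the part axes)."""
    d1, d2, r = _sub(q1, p1), _sub(q2, p2), _sub(p1, p2)
    a, e, f = _dot(d1, d1), _dot(d2, d2), _dot(d2, r)
    if a <= eps and e <= eps:              # both segments degenerate to points
        s = t = 0.0
    elif a <= eps:                         # first segment is a point
        s, t = 0.0, _clamp01(f / e)
    else:
        c = _dot(d1, r)
        if e <= eps:                       # second segment is a point
            s, t = _clamp01(-c / a), 0.0
        else:
            b = _dot(d1, d2)
            denom = a * e - b * b          # zero when the axes are parallel
            s = _clamp01((b * f - c * e) / denom) if denom > eps else 0.0
            t = (b * s + f) / e
            if t < 0.0:
                s, t = _clamp01(-c / a), 0.0
            elif t > 1.0:
                s, t = _clamp01((b - c) / a), 1.0
    c1 = tuple(p + s * d for p, d in zip(p1, d1))
    c2 = tuple(p + t * d for p, d in zip(p2, d2))
    diff = _sub(c1, c2)
    return math.sqrt(_dot(diff, diff))

def penetration_penalty(distance, radius1, radius2, weight=1.0):
    """Penalty proportional to (R1 + R2 - D); zero when the parts do not overlap."""
    return weight * max(0.0, radius1 + radius2 - distance)
```

For two parallel axes one unit apart with radii 0.6 each, the penalty is 0.2 times the weight; with radii 0.3 each the parts are separated and the penalty vanishes.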
Stochastic Optimization using MCMC: We use Markov Chain Monte
Carlo (MCMC) simulation to search the parameter space of the
human skeletal links (L), the coarse shape models (S) and the
3D pose (θ). MCMC is a suitable methodology for computing a
maximum a posteriori (MAP) solution
$\arg\max_x p(x \mid r)$ by drawing samples from the proposal
density (that approximates the posterior) using a random walk
based Metropolis algorithm [14]. At the $t$th iteration, a
candidate $x'$ is sampled from a proposal distribution
$q(x' \mid x_{t-1})$ and accepted as the new state with
probability $a(x_{t-1} \to x')$, where

$$ a(x_{t-1} \to x') = \min\left\{ 1, \frac{p(x' \mid r)\, q(x_{t-1} \mid x')}{p(x_{t-1} \mid r)\, q(x' \mid x_{t-1})} \right\} \qquad (6) $$

and $x' = \{L, S, \theta\}$ are the parameters that are
optimized to maximize the overlap between the coarse 3D human
model and the visual hull. Here S denotes the low-dimensional
PCA coefficients of the anthropometric prior. In order to
avoid local optima, we use simulated annealing, which
gradually introduces the global optima in the distribution to
be maximized, $p(x \mid r)^{1/T_i}$. The parameter $T_i$ is
gradually decreased under the assumption that
$p(x \mid r)^{\infty}$ mostly concentrates around the global
maxima [10].
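
The acceptance rule (6) combined with the tempered target $p(x \mid r)^{1/T}$ can be sketched on a scalar toy problem. Everything below is an illustrative assumption: a symmetric Gaussian random-walk proposal (so the q terms cancel), a user-supplied unnormalized `target`, and a caller-chosen temperature schedule. The ratio is evaluated in the log domain to avoid overflow at low temperatures.

```python
import math
import random

def acceptance(p_new, p_old, q_forward, q_backward, temperature):
    """Eq. (6) applied to the tempered density p^(1/T).

    q_forward = q(x'|x_{t-1}) and q_backward = q(x_{t-1}|x').
    """
    if p_new <= 0.0:
        return 0.0
    if p_old <= 0.0:
        return 1.0
    log_ratio = (math.log(p_new) - math.log(p_old)) / temperature
    log_ratio += math.log(q_backward) - math.log(q_forward)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)

def annealed_metropolis(target, x0, step, temperatures, rng):
    """Random-walk Metropolis with simulated annealing; returns the best state visited."""
    x, best = x0, x0
    for temperature in temperatures:
        candidate = x + rng.gauss(0.0, step)
        # Symmetric proposal: forward and backward densities cancel.
        if rng.random() < acceptance(target(candidate), target(x), 1.0, 1.0, temperature):
            x = candidate
            if target(x) > target(best):
                best = x
    return best
```

A typical call would pass a geometric schedule such as `[5.0 * 0.99 ** i for i in range(500)]` and a seeded `random.Random` instance; at high temperature the chain crosses between modes freely, and as the temperature falls it increasingly rejects downhill moves.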
Proposal Map Computation: The proposal distribution plays a
critical role in the MCMC search and is assumed to be
independent for shape and pose parameters. We adopt the
Metropolis algorithm with proposal maps that are not
conditioned on the current state $x_{t-1}$. The proposal
distribution $q(\theta)$ is obtained as a mixture of Gaussians
from the bottom-up predictors (5), which are ill-suited for
searching in the joint angle space. Sampling from the angular
priors of the joints higher in the skeletal hierarchy (such as
shoulder and femur joints) may produce larger spatial motion
than sampling the lower joints (such as elbow and knee
joints). Optimizing simultaneously in the entire 3D pose space
may cause instability and require more iterations for
convergence. This problem may be resolved by fitting the
joints higher in the skeletal hierarchy first. We adopt a more
principled approach [13] whereby we sample from the spatial
prior as opposed to the angular prior. Specifically, for the
$i$th skeletal link, we sample from
$p(\theta_i, \Sigma_{\theta_i}) = N(F(\theta_i), \Sigma_F)$
with
$F(\theta_i) = F(\theta_i^{(p)}) * R(\theta_i) + T(\theta_i)$,
where $F(\theta_i)$ is the end location of the $i$th joint
link and $\theta_i^{(p)}$ is its parent joint. Sampling from
$F(\theta_i)$ is not straightforward because, unlike
$\theta_i$, it spans a non-linear manifold $\mathcal{M}$. In
order to compute the covariance, we linearly approximate the
manifold at a point by the tangent space at that point. We
compute the Jacobian $J$ and use it to compute the covariance
as $\Sigma_F = J_{\theta_i} \Sigma_{\theta_i} J_{\theta_i}^T$.
At the $t$th iteration, sampling from the distribution
$N(F(\theta_i), \Sigma_F)$ generates locations of the
end-effectors of the joints, which are used to compute the
angle by minimizing

$$ \theta_i^{(t)} = \arg\min_{\theta_i} \lVert F'(t) - F(\theta_i) \rVert^2 \quad \text{s.t.} \quad \theta_i^{\min} \le \theta_i \le \theta_i^{\max} \qquad (7) $$

The minimization is performed using the standard
Levenberg-Marquardt optimization algorithm.
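
The tangent-space covariance and the constrained angle recovery in (7) can be illustrated on a single planar link. This is a simplified sketch under stated assumptions: one revolute joint in 2D with a fixed link length, joint limits inside [-π, π], and a closed-form solution of (7) standing in for the Levenberg-Marquardt solver used in the text.

```python
import math

def end_location(theta, parent, length):
    """F(theta): end point of a planar link attached at `parent`."""
    return (parent[0] + length * math.cos(theta),
            parent[1] + length * math.sin(theta))

def end_covariance(theta, length, var_theta):
    """Sigma_F = J Sigma_theta J^T with J = dF/dtheta (a 2x1 Jacobian here)."""
    j = (-length * math.sin(theta), length * math.cos(theta))
    return [[j[0] * j[0] * var_theta, j[0] * j[1] * var_theta],
            [j[1] * j[0] * var_theta, j[1] * j[1] * var_theta]]

def recover_angle(sampled_end, parent, length, theta_min, theta_max):
    """Solve eq. (7) for one planar joint: the angle whose end point is closest
    to the sampled location, subject to the joint limits."""
    phi = math.atan2(sampled_end[1] - parent[1], sampled_end[0] - parent[0])
    if theta_min <= phi <= theta_max:
        return phi
    def cost(theta):
        fx, fy = end_location(theta, parent, length)
        return (sampled_end[0] - fx) ** 2 + (sampled_end[1] - fy) ** 2
    # Outside the limits the cost is monotone in angular distance from phi,
    # so the optimum is whichever limit gives the smaller cost.
    return min((theta_min, theta_max), key=cost)
```

The rank-one covariance shows why spatial sampling is better conditioned: the same angular variance produces an end-point spread that scales with the link length, so long links high in the hierarchy are perturbed in proportion to the motion they actually cause.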
6. Detailed 3D Shape Estimation
The 3D pose and coarse shape estimated by the top-down method
are used to initialize the search in the parameter space of
detailed 3D human shapes. We model the 3D shape of humans
using polygonal 3D mesh surfaces skinned to an underlying
skeleton. We assume that the 3D mesh surface deforms only
under the influence of the skeleton attached to it. The shape
of the human body can vary due to both anthropometry and the
pose of the target. Anthropometric variability is modeled by
the learned 3D shape models for humans. The shape deformation
due to pose is obtained by first skinning the 3D mesh to the
skeleton and then transforming the vertices under the
influence of the associated skeletal joints.
Skinning 3D Mesh to the Skeleton: We use Linear Blend Skinning
(LBS) for efficient non-rigid deformation of the skin as a
function of the underlying skeleton. LBS is achieved by
associating each vertex with its two nearest joints. The
transformation of a vertex is computed as a weighted sum of
the transformations due to each of the joints, where the
weights are computed as the inverse distance from the joints.
Fig. 5 illustrates the computation of the transformation of
vertices associated with different body segments.

Figure 5. Linear Blend Skinning is used to deform the 3D mesh
under the influence of the skeleton; (left) Rigidly deforming
human body parts causes artifacts around the joints; (middle)
Vertices are transformed using a weighted sum of the
transformations due to multiple associated joints; (right)
Shape deformation with a backpack accessory attached to the
torso.
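
The blending rule described above can be sketched in 2D. The representation is an assumption chosen for brevity: each joint carries a hypothetical rigid transform `(angle, tx, ty)` applied about the origin, and each vertex is bound to its two nearest joints with normalized inverse-distance weights.

```python
import math

def skinning_weights(vertex, joints, k=2):
    """Normalized inverse-distance weights for the k joints nearest to a vertex."""
    nearest = sorted((math.hypot(vertex[0] - j[0], vertex[1] - j[1]), idx)
                     for idx, j in enumerate(joints))[:k]
    if nearest[0][0] == 0.0:               # vertex sits exactly on a joint
        return {nearest[0][1]: 1.0}
    inverse = {idx: 1.0 / dist for dist, idx in nearest}
    total = sum(inverse.values())
    return {idx: w / total for idx, w in inverse.items()}

def apply_rigid(transform, point):
    """Rotate `point` by `angle` about the origin, then translate by (tx, ty)."""
    angle, tx, ty = transform
    c, s = math.cos(angle), math.sin(angle)
    return (c * point[0] - s * point[1] + tx, s * point[0] + c * point[1] + ty)

def skin_vertex(vertex, joints, transforms):
    """Linear blend skinning: weighted sum of the per-joint transformed positions."""
    weights = skinning_weights(vertex, joints)
    moved = {idx: apply_rigid(transforms[idx], vertex) for idx in weights}
    return (sum(w * moved[idx][0] for idx, w in weights.items()),
            sum(w * moved[idx][1] for idx, w in weights.items()))
```

A vertex midway between a fixed joint and a joint translated upward by 2 units moves up by 1 unit, which is the smoothing across the joint that purely rigid per-part deformation lacks.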
Although rich in terms of representation, a global 3D human
shape representation cannot model 3D shapes with
disproportionately sized body parts. In order to support a
rich set of human shapes, we use a combined local part-based
and global optimization scheme that first searches in the
local subspace of human body parts to match the observation
and then constrains the whole shape using the global human
shape model. Fitting body parts independently causes
discontinuities along the joints and generates unrealistic
shapes (see fig. 6). Constraining the shape to lie in the
global shape space therefore ensures that it is a valid shape.
For linear PCA-based shape models, this is done efficiently by
ensuring that the PCA coefficients of the shape (when
projected onto the subspace) lie within a range of variance.
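
The global constraint can be sketched as a project, clamp and reconstruct step. The sketch assumes an orthonormal PCA basis stored as rows, per-component variances (eigenvalues), and a cut-off of three standard deviations; the cut-off value and all names are assumptions, since the text only states that coefficients are kept within a range of variance.

```python
import math

def constrain_to_shape_space(shape, mean, basis, eigenvalues, n_sigma=3.0):
    """Project a shape vector onto a PCA subspace, clamp each coefficient to
    +/- n_sigma standard deviations, and reconstruct the constrained shape."""
    centered = [s - m for s, m in zip(shape, mean)]
    coeffs = [sum(b * c for b, c in zip(row, centered)) for row in basis]
    limits = [n_sigma * math.sqrt(ev) for ev in eigenvalues]
    clamped = [max(-lim, min(lim, c)) for c, lim in zip(coeffs, limits)]
    return [m + sum(c * row[d] for c, row in zip(clamped, basis))
            for d, m in enumerate(mean)]
```

Any component of the input outside the subspace is discarded by the projection, and any in-subspace component that is implausibly large is pulled back to the boundary, so independently fitted parts are reassembled into a shape the global model considers valid.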
Stochastic Search in Local and Global Shape Space: Our
algorithm alternates the search between the parameter spaces
of 3D human pose (θ) and shape (S) to simultaneously refine
the pose and fit a detailed 3D shape to the observation. The
search is performed using Data Driven MCMC with the
Metropolis-Hastings method, wherein the proposal map does not
use the predictive distribution obtained from the bottom-up
methods but rather is modeled as a Gaussian distribution
conditioned on the current state, $q(x' \mid x_{t-1})$, where
$x_{t-1} = \{\theta_{t-1}, S_{t-1}\}$. The likelihood
distribution is modeled using a symmetrical chamfer distance
map [2] to match the 2D projection of the model to the
observed image silhouettes from multiple sensors. For
optimizing the 3D pose, we use the current 3D shape to search
in the parameter space of articulated human pose. The
regression function M (1), which maps the coarse human shape
model to the detailed shape PCA coefficients, is used to
initialize the search. Plausible 3D shapes are sampled from
the Gaussian distributions that the PCA-based subspace
represents for each of the body parts. The search is performed
by alternately fitting the 3D pose first, followed by
optimization of the shape parameters of the individual body
parts. At every iteration, the 3D shape of the human body is
constrained using the global shape model to ensure a valid
shape (see fig. 6).

Figure 6. Detailed 3D shape fitting by sampling from PCA-based
shape models of various body components; (left) Average human
shape model; (middle) Shape with each body part sampled from
the parts shape model; (right) 3D shape obtained after
constraining the shape using the global shape model.

Figure 7. Accurate 3D surface reconstruction of the human body
is provided for all the poses in the I3DPost [11] dataset. 3D
shape fitting algorithms are evaluated by matching the fitted
3D shape (shown as red vertices) with the ground truth surface
reconstruction (shown as blue vertices).
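
The symmetric chamfer cost used to compare a projected model with an observed silhouette can be sketched on contour point sets. The text uses a precomputed chamfer distance map; the brute-force nearest-point form below is a stand-in assumption that gives the same kind of measure for small inputs.

```python
import math

def directed_chamfer(source, reference):
    """Mean distance from each source point to its nearest reference point."""
    return sum(min(math.hypot(sx - rx, sy - ry) for rx, ry in reference)
               for sx, sy in source) / len(source)

def symmetric_chamfer(model_contour, observed_contour):
    """Sum of both directed terms, so that neither missing nor extra contour
    segments go unpenalized."""
    return (directed_chamfer(model_contour, observed_contour)
            + directed_chamfer(observed_contour, model_contour))
```

In a multi-camera setting one such cost would be computed per view and the per-view costs combined, for example by summation, before being turned into a likelihood for the MCMC acceptance test.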
7. Experimental Evaluation
We conducted experiments on both publicly available datasets
and those captured at our motion capture facility. In all our
experiments, we used 4 synchronized image streams from
calibrated sensors to estimate the 3D pose and shape of the
human targets. 3D motion capture data was used to train our
bottom-up predictors. The BME model was trained with 3
experts. For training the bottom-up methods, we used
vector-quantized shape context histograms computed over both
the outer contour and the internal edges of the foreground
object as the inputs for regression. Fig. 8 illustrates the
results of our framework on walking sequences with and without
a backpack. The I3DPost dataset [11] also provides accurate 3D
surface reconstructions of subjects in different walking
poses. We evaluate the accuracy of our shape fitting
algorithms using these as ground truth. The error is computed
as the sum of the distances from each surface vertex to the
nearest vertex of the fitted 3D shape. Fig. 7 illustrates the
technique on an example image.

Figure 8. 3D pose and shape fitting results for different
sequences. The three columns on the right show the results
with a backpack accessory from a walking sequence.
7.1. Shape Fitting to Accessories
Our system also supports automatic estimation of the size of
an accessory bag carried by a human. The backpack is modeled
as a trapezoidal shape and is assumed to be rigidly attached
to the torso, such that the translation and orientation of the
backpack can be computed directly from those of the torso. The
two parameters of the trapezoid (thickness and orientation of
the non-perpendicular face) are iteratively estimated during
the 3D shape fitting. The shape of the accessory is
initialized to the mean thickness of the human torso. The
framework functions as a generative classifier to identify
whether or not a human is carrying a backpack. An improvement
in the likelihood of fit for the model with the attached
accessory implies the presence of a backpack. This is
illustrated in fig. 9(b), where using the model with an
attached accessory reduced the observation matching cost
(chamfer distance) from 1.3441 to 1.043.
7.2. Human Attribute Inference Using 3D Shape
Analysis
The estimated 3D shape of the human target can be used to
infer a variety of human attributes that are useful for
identifying potentially hostile behavior. Demographic features
such as gender and ethnicity, and physical attributes such as
height, weight and body appearance, can be inferred either by
computing spatial statistics of different regions of the
fitted 3D shape or by determining the anthropometric
variations that characterize these features. Various
anthropometric measurements can be inferred directly from the
3D shape fitted to the observed multi-sensor data. Fig. 9(c)
shows the measurements of different body parts estimated from
the 3D shapes fitted to the observations.
Gender Classification: We use Linear Discriminant Analysis
(LDA) to find the feature projections that best discriminate
the shape profiles of the two gender classes. LDA essentially
learns a linear classification boundary between the two
classes under the assumption that the samples from each of the
two classes are normally distributed. The LDA vector can be
used to classify a person's gender based on the fitted 3D
shape. Similarly to gender classification, the age and
ethnicity attributes of a person can be inferred from body
stature. Fig. 9(a) shows the gender classification results
using LDA. Here the threshold for gender classification is set
to 0, and negative LDA coefficients denote female shapes.
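
A two-class LDA of this kind can be sketched in closed form for two features. The sketch is illustrative only: the two-dimensional feature vectors are hypothetical stand-ins for shape descriptors, the decision threshold of 0 is placed at the midpoint of the class means (which assumes equal class priors), and the class ordering is chosen so that negative scores fall on the first class, mirroring the sign convention in the text.

```python
def _mean(points):
    n = float(len(points))
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fit_lda(class_neg, class_pos):
    """Fisher discriminant for 2D features: w = S_w^{-1} (m_pos - m_neg).

    Returns the projection vector w and the midpoint between the class means.
    """
    m_neg, m_pos = _mean(class_neg), _mean(class_pos)
    sxx = sxy = syy = 0.0                  # pooled within-class scatter matrix
    for points, m in ((class_neg, m_neg), (class_pos, m_pos)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy            # assumed non-zero (non-degenerate data)
    dmx, dmy = m_pos[0] - m_neg[0], m_pos[1] - m_neg[1]
    w = ((syy * dmx - sxy * dmy) / det, (sxx * dmy - sxy * dmx) / det)
    midpoint = ((m_neg[0] + m_pos[0]) / 2.0, (m_neg[1] + m_pos[1]) / 2.0)
    return w, midpoint

def lda_score(features, w, midpoint):
    """Signed LDA coefficient; the sign selects the class at threshold 0."""
    return w[0] * (features[0] - midpoint[0]) + w[1] * (features[1] - midpoint[1])
```

In practice the features would be the PCA shape coefficients of the fitted body, and the same construction extends to more dimensions by replacing the 2x2 inverse with a general linear solve.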
8. Conclusions
We have proposed an integrated approach that combines
bottom-up and top-down methods for 3D pose and shape
estimation of human targets from multi-view imagery. We limit
the number of sensors used in our framework to 4. To overcome
the ambiguity and ill-constrained nature of the problem, we
use efficient anthropometric priors of human shape and pose
learned from the CAESAR dataset. The accurate 3D pose and
shape estimated by our framework can be used for inferring
attributes such as gender, age, ethnicity and body weight.
Currently our framework does not use tracking, but fits the
pose and shape for every frame independently. Pose and surface
tracking will be employed in future to obtain smoother 3D
shape deformation in a video.

Figure 9. Human attribute inference using shape analysis; (a)
Gender classification; (b) 3D shape fitting without and with
backpack in the middle and bottom rows respectively. The
observation matching costs (using chamfer distance) without
and with the backpack model were 1.3441 and 1.043
respectively; (c) 3D shape estimation can be used to estimate
the dimensions of various body parts.
Acknowledgements: We thank George Williams, Peter
Birdsall and Kirill Smoleskiy for assisting us in data collec-
tion. We thank Asaad Hakeem for discussions and useful
comments on the work. This work was supported by the Air
Force Research Lab, contract number FA8650-10-M-6094.
References
[1] B. Allen, B. Curless, and Z. Popovic. The space of human body shapes: recon-
struction and parameterization from range scans. ACM SIGGRAPH, 2003. 50,
51
[2] A. Balan, L. Sigal, M. Black, J. Davis, and H. Haussecker. Detailed human
shape and pose from images. CVPR, 2007. 50, 51, 54
[3] A. O. Balan and M. J. Black. The naked truth: Estimating body shape under
clothing. In ECCV (2), pages 15–29, 2008. 50
[4] C. Bregler, J. Malik, and K. Pullen. Twist based acquisition and tracking
of animal and human kinematics. International Journal of Computer Vision,
56(3):179–194, 2004. 50
[5] Y. Chen, T.-K. Kim, and R. Cipolla. Inferring 3D shapes and deformations from
single views. In ECCV (3), pages 300–313, 2010. 50
[6] A. C. Faul and M. E. Tipping. Analysis of sparse Bayesian learning. Proc.
Neural Information Processing Systems, pages 383–389, 2001. 52
[7] J. Gall, B. Rosenhahn, and H.-P. Seidel. Drift-free tracking of rigid and articu-
lated objects. In CVPR. IEEE Computer Society, 2008. 50
[8] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel.
Motion capture using joint skeleton tracking and surface estimation. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition,
pages 1746–1753, 2009. 50
[9] J. Gall, A. Yao, and L. J. V. Gool. 2D action recognition serves 3D human pose
estimation. In ECCV (3), pages 425–438, 2010. 50
[10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6(6):721–741, 1984. 53
[11] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas. The i3DPost multi-
view and 3D human action/interaction. In Proc. Conference on Visual Media
Production, 1(1):159–168, 2009. 54
[12] P. Guan, A. Weiss, A. O. Balan, and M. J. Black. Estimating human shape and
pose from a single image. In ICCV, pages 1381–1388. IEEE, 2009. 50
[13] S. Hauberg, S. Sommer, and K. S. Pedersen. Gaussian-like spatial priors for
articulated tracking. ECCV, 2010. 53
[14] M. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body
pose in static images. Proc. Computer Vision and Pattern Recognition Conf.,
pages 334–341, 2004. 53
[15] T. B. Moeslund, A. Hilton, and V. Krüger. A survey of advances in vision-based
human motion capture and analysis. Computer Vision and Image Understand-
ing, 104(2-3):90–126, 2006. 50
[16] L. Mündermann, S. Corazza, and T. P. Andriacchi. Accurately measuring hu-
man movement using articulated ICP with soft-joint constraints and a repository
of articulated models. In CVPR. IEEE Computer Society, 2007. 50
[17] G. Pons-Moll, A. Baak, T. Helten, M. Müller, H.-P. Seidel, and B. Rosenhahn.
Multisensor-fusion for 3D full-body human motion capture. In CVPR, pages
663–670, 2010. 50
[18] K. Robinette and H. Daanen. The CAESAR project: A 3-D surface anthropometry
survey. Second International Conference on 3-D Imaging and Modeling, 1999.
50, 51
[19] L. Sigal, A. O. Balan, and M. J. Black. Combined discriminative and genera-
tive articulated pose and non-rigid shape estimation. In J. C. Platt, D. Koller,
Y. Singer, and S. T. Roweis, editors, NIPS. MIT Press, 2007. 49, 50
[20] C. Sminchisescu, A. Kanaujia, Z. Li, and D. N. Metaxas. Discriminative density
propagation for 3D human motion estimation. In Proc. Computer Vision Pattern
Recognition, 2005. 52
[21] C. Stoll, J. Gall, E. de Aguiar, S. Thrun, and C. Theobalt. Video-based re-
construction of animatable human characters. ACM Trans. Graph., 29(6):139,
2010. 50