Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/126955
Type: Thesis
Title: How Geometry Meets Learning in Pose Estimation
Author: Cai, Ming
Issue Date: 2020
School/Discipline: School of Computer Science
Abstract: This thesis focuses on one of the fundamental problems in computer vision, six-degree-of-freedom (6DoF) pose estimation, whose task is to predict the geometric transformation from the camera to a target of interest from RGB inputs alone. Solutions to this problem have been proposed using image retrieval or sparse 2D-3D correspondence matching with geometric verification. Thanks to the development of deep learning, direct regression-based approaches (computing the pose via image-to-pose regression) and indirect reconstruction-based approaches (solving the pose via dense matching between the image and a 3D reconstruction) using neural networks have recently drawn growing attention in the community. Although deep-network-based models have been proposed for both camera relocalisation and object pose estimation, open questions remain. In this thesis, we investigate several problems in pose estimation concerning end-to-end object pose inference, uncertainty estimation in regression-based methods, and self-supervision for reconstruction-based learning for both scenes and objects.

The first part of this thesis focuses on end-to-end 6DoF pose regression for objects. Traditional methods that predict the 6DoF pose of objects usually rely on a 3D CAD model and require a multi-step scheme to compute the pose. We instead apply direct pose regression to objects, building on the region-proposal network Mask R-CNN, which is well known for object detection and instance segmentation. Our newly proposed network head regresses a 4D vector from the RoI feature map of each object: a 3D vector in the Lie algebra of SO(3) represents the rotation, and a single scalar for the z-component of the translation is predicted so that, together with the position of the bounding box, the full 3D translation can be recovered. This simplification avoids the spatial ambiguity in the 2D image introduced by RoIPooling. Our method is accurate at inference time and faster than methods that require 3D models and refinement in their pipelines.

The second part estimates the uncertainty of poses regressed by a deep model. A CNN is combined with Gaussian Process Regression (GPR) to build a framework that directly produces a predictive distribution over camera pose. The CNN extracts discriminative features and the GPR performs probabilistic inference over them. To prevent the complexity of uncertainty estimation from growing with the number of training images, we use pseudo inducing points in CNN feature space to represent the whole dataset and learn them with Stochastic Variational Inference (SVI). This makes the GPR a parametric model that can be learnt jointly with the CNN backbone. We test the proposed hybrid framework on the problem of camera relocalisation.

The third and fourth parts share a common objective: seeking self-supervision for learning dense reconstructions for pose estimation from images, without using ground-truth 3D models of scenes (part 3) or objects (part 4). We explore an alternative supervisory signal from multi-view geometry: photometric and/or feature-metric consistency between image pairs from different viewpoints is used to constrain the learning of world-centric coordinates (part 3) and object-centric coordinates (part 4). At inference time, the dense reconstruction model establishes 2D-3D correspondences from which the 6DoF pose is computed using PnP with RANSAC. Our 3D-model-free methods for pose estimation eliminate the dependency on 3D models required by state-of-the-art approaches.
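As a rough illustration of the pose recovery described in the first part: the network head outputs a 3-vector in the Lie algebra of SO(3) plus a depth scalar, and the remaining translation components can be recovered from the bounding-box position and the camera intrinsics. The back-projection of the box centre and the function names below are assumptions for the sketch, not the thesis's exact formulation.

import numpy as np

def so3_exp(omega):
    """Rodrigues' formula: map an so(3) vector (axis-angle) to a rotation matrix."""
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)
    k = omega / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def recover_pose(omega, z, box_center, intrinsics):
    """Turn the 4D network output (3 rotation + 1 depth) into a full 6DoF pose.

    omega      : predicted 3-vector in the Lie algebra of SO(3)
    z          : predicted depth (z-component of the translation)
    box_center : (u, v) centre of the detected bounding box, in pixels
    intrinsics : 3x3 camera intrinsic matrix
    """
    R = so3_exp(np.asarray(omega, dtype=float))
    u, v = box_center
    # Back-project the box centre at the predicted depth to obtain the full translation.
    t = z * np.linalg.inv(intrinsics) @ np.array([u, v, 1.0])
    return R, t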
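The hybrid CNN-GPR framework of the second part can be approximated with an off-the-shelf sparse GP library. The sketch below uses GPyTorch, an illustrative library choice not named in the thesis; the kernel and mean choices are likewise assumptions. The GP head consumes CNN feature vectors, keeps a fixed set of learnable inducing points, and therefore stays parametric regardless of dataset size.

import torch
import gpytorch

class SVGPHead(gpytorch.models.ApproximateGP):
    """Sparse variational GP regression head over CNN feature vectors."""

    def __init__(self, inducing_points):
        # inducing_points: (M, D) tensor of pseudo inputs in CNN feature space.
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution,
            learn_inducing_locations=True)
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        # Predictive Gaussian over one pose target dimension for features x.
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Example: 64 inducing points in a 2048-dimensional CNN feature space.
head = SVGPHead(inducing_points=torch.randn(64, 2048))

In such a setup the inducing inputs are learnt alongside the network weights via stochastic variational inference; the variational ELBO (e.g. gpytorch.mlls.VariationalELBO) can be added to the CNN's loss so both are optimised with the same stochastic gradients.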
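For parts three and four, the supervisory signal is multi-view consistency rather than a ground-truth 3D model. Below is a minimal PyTorch-style sketch of a purely photometric term; the tensor shapes, the omission of the feature-metric variant and of occlusion or visibility handling, and the assumption that view B's camera pose and intrinsics are available during training are simplifications made for the sketch.

import torch
import torch.nn.functional as F

def photometric_consistency_loss(coords_a, image_a, image_b, T_b, K):
    """Photometric consistency between view A and view B.

    coords_a : (B, 3, H, W) dense 3D coordinates predicted for view A (world frame)
    image_a  : (B, 3, H, W) RGB image of view A
    image_b  : (B, 3, H, W) RGB image of view B
    T_b      : (B, 4, 4)    world-to-camera transform of view B
    K        : (B, 3, 3)    camera intrinsics
    """
    B, _, H, W = coords_a.shape
    # Transform the predicted world coordinates into camera B's frame.
    ones = torch.ones(B, 1, H, W, device=coords_a.device)
    pts_h = torch.cat([coords_a, ones], dim=1).view(B, 4, -1)        # (B, 4, HW)
    cam_b = (T_b @ pts_h)[:, :3]                                     # (B, 3, HW)
    # Project into view B's image plane.
    pix = K @ cam_b
    uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)                    # (B, 2, HW)
    # Normalise pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    warped_b = F.grid_sample(image_b, grid, align_corners=True)
    # Penalise colour differences between view A and the warped view B.
    return (image_a - warped_b).abs().mean()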
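At inference time the dense prediction is turned into 2D-3D correspondences and the pose is solved robustly with PnP and RANSAC, as stated in the abstract. OpenCV's solvePnPRansac is one readily available solver; the mask argument and the RANSAC parameters below are illustrative rather than the thesis's settings.

import cv2
import numpy as np

def pose_from_dense_prediction(pred_coords, mask, K):
    """Solve the 6DoF pose from a dense coordinate prediction via PnP + RANSAC.

    pred_coords : (H, W, 3) predicted 3D coordinate for every pixel
    mask        : (H, W)    boolean mask of pixels to trust (e.g. the object region)
    K           : (3, 3)    camera intrinsic matrix
    """
    vs, us = np.nonzero(mask)
    pts_2d = np.stack([us, vs], axis=-1).astype(np.float64)   # pixel locations
    pts_3d = pred_coords[vs, us].astype(np.float64)           # predicted 3D points
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K.astype(np.float64), distCoeffs=None,
        iterationsCount=100, reprojectionError=3.0)
    return ok, rvec, tvec, inliers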
Advisor: Reid, Ian D.
Shen, Chunhua
Dissertation Note: Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2020
Keywords: pose estimation
camera relocalization
deep learning
3d reconstruction
multi-view geometry
Provenance: This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections: Research Theses

Files in This Item:
File: Cai2020_PhD.pdf (34.71 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.