COS598C Spring 2014: Scene Understanding

Overview:

This class lays the foundation for research in scene understanding, an area of computer vision, by focusing on important topics from a practical point of view. We will review popular approaches and discuss the fundamental principles underlying scene understanding in computer vision. We will read a mixture of computer vision papers and influential works from cognitive psychology. We will also emphasize implementation techniques that leverage computational power, crowd sourcing, and big data for computer vision research in general.

Instructor: Jianxiong Xiao
Time: Monday, Wednesday, 3:00PM - 4:20PM
Location: CS402

Schedule:

Date	Topic	Presenter	Slide + Code	Reading
Feb 3 Mon	Introduction + Camera Model	Jianxiong Xiao	pptx pdf panorama	[HZ] Multiple view geometry in computer vision. [SingleViewMetrology] Single view metrology. [ObjectPerspective] Putting objects in perspective. [LabelMe3D] Building a database of 3d scenes from user annotations.
Feb 5 Wed	Class Canceled (Severe Weather)
Feb 10 Mon	Linear Algebra Review + Two View Geometry	Fisher Yu	key pdf [SFMedu code] [Direct code] [Consistency code]	[HZ] Multiple view geometry in computer vision. [PhotoTourism] Photo tourism: exploring photo collections in 3D. [QuasiDense] A quasi-dense approach to surface reconstruction from uncalibrated images. [ceres-solver] Ceres Solver.
Feb 12 Wed	Structure From Motion + Stereo Matching	Fisher Yu		[PMVS] Accurate, dense, and robust multiview stereopsis.
Feb 17 Mon	Factorization for SFM + Non-rigid SFM + Direct Method for RGBD	Fisher Yu		[Nonrigid3D] Recovering non-rigid 3D shape from image streams. [NonrigidSFM] Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. [DirectMethod] Robust odometry estimation for rgb-d cameras. [DirectMethodICCV] Real-Time Visual Odometry from Dense RGB-D Images.
Feb 19 Wed	Kinect Fusion	Sema Berkiten	pdf key [KinFu code] [SUN3Dsfm code] [SiftFu code]	[KinectFusion] KinectFusion: Real-time dense surface mapping and tracking. [EfficientICP] Efficient variants of the ICP algorithm. [GeneralizedICP] Generalized-ICP. [LargeKinectFusion] Scalable real-time volumetric surface reconstruction. [Kintinuous] Kintinuous: Spatially extended kinectfusion. [KintinuousLoop] Deformation-based loop closure for large scale dense rgb-d slam. [KintinuousRobust] Robust real-time visual odometry for dense RGB-D mapping. [NonRigid] Robust Single-View Geometry And Motion Reconstruction. [SelfPortraits] 3D Self-Portraits. [KeyFrameFusion] On unifying key-frame and voxel-based dense visual SLAM at large scales. [HDRslam] 3D High Dynamic Range dense visual SLAM and its application to real-time object re-lighting. [SuperResolutionSLAM] Super-Resolution 3D Tracking and Mapping. [Elastic] Elastic Fragments for Dense Scene Reconstruction.
Feb 24 Mon	Convolutional Neural Network	Zhirong Wu	pdf [Jianxiong's note] [Matlab Demo] [Web Demo] [Alex Code] [Caffe Code]	[CNNnote] Notes on convolutional neural networks. [ParallelCognition] The parallel distributed processing approach to semantic cognition. [Connectionist] Learning and connectionist representations. [DCNN] Imagenet classification with deep convolutional neural networks. [LecunNet] Backpropagation applied to handwritten zip code recognition. [BestCNN] Visualizing and Understanding Convolutional Neural Networks. [Caffe] Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding. [DeepDetection] Deep Neural Networks for Object Detection. [BengioRepresentation] Representation learning: A review and new perspectives.
Feb 26 Wed	Autoencoder	David Dohan	pptx pdf [Autoencoder Code] [RBM code] [DBM code]	[AutoEncoder] Reducing the dimensionality of data with neural networks.
Mar 3 Mon	RBM + DBM + DBN	David Dohan	pptx pdf [Autoencoder Code] [RBM code] [DBM code]	[RBM] Restricted Boltzmann machines for collaborative filtering. [DBM] Deep boltzmann machines. [DBN] A fast learning algorithm for deep belief nets.
Mar 5 Wed	Vision and Action: Reinforcement + Apprenticeship Learning	Chenyi Chen	pdf pptx [demo]	[DeepRL] Playing Atari with Deep Reinforcement Learning. [ApprenticeshipLearning] Apprenticeship learning via inverse reinforcement learning.
Mar 10 Mon	GPU Programming	Maciej Halber	pdf key [example code]	CUDA C Programming Guide; GPU Programming in MATLAB; GPUmat
Mar 12 Wed	MRF + CRF + GC + LBP	Huiwen Chang	pdf pptx [BP Code] [GraphCut Code gco] [MRFsfm]	[BP] Understanding belief propagation and its generalizations. [GraphCut] Fast approximate energy minimization via graph cuts. [DistanceTransform] Distance transforms of sampled functions. [LazySnapping] Lazy snapping. [ConnectedCRF] Efficient inference in fully connected crfs with gaussian edge potentials. [MRFsfm] Discrete-Continuous Optimization for Large-Scale Structure from Motion. [MRFsfmPAMI] SfM with MRFs: Discrete-Continuous Optimization for Large-Scale Reconstruction. [EfficientBP] Efficient belief propagation for early vision. [TextonBoost] Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. [CRFobject] Conditional random fields for object recognition.
Mar 17 Mon	No Class (Spring Recess)
Mar 19 Wed	No Class (Spring Recess)
Mar 24 Mon	Cloud Computing	John McSpedon	pdf pptx demo code
Mar 26 Wed	Object Detection	Shuran Song	pdf pptx [DPM code] [Vlfeat code] [Color SIFT code]	[DevaSVM] Dual coordinate solvers for large-scale structural SVMs. [PictorialStructure] The representation and matching of pictorial structures. [DalalTriggs] Histograms of oriented gradients for human detection. [DPM] Object Detection with Discriminatively Trained Part Based Models. [ExemplarSVMs] Ensemble of exemplar-svms for object detection and beyond. [PartMixtures] Articulated pose estimation with flexible mixtures-of-parts. [Poselet] Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations. [ExemplarSVMsMatching] Data-driven Visual Similarity for Cross-domain Image Matching. [FindingThings] Finding things: Image parsing with regions and per-exemplar detectors. [SelectiveSearch] Segmentation as selective search for object recognition. [Regionlets] Regionlets for Generic Object Detection. [CF] Model recommendation for action recognition. [LDA] Discriminative decorrelation for clustering and classification. [Cuboid] Localizing 3D Cuboids in Single-view Images.
Mar 31 Mon	Features and Datasets	Shuran Song	pdf pptx [DPM code] [Vlfeat code] [Color SIFT code]	[SIFT] Distinctive image features from scale-invariant keypoints. [ColorSIFT] Evaluating Color Descriptors for Object and Scene Recognition. [DalalTriggs] Histograms of oriented gradients for human detection. [DPM] Object Detection with Discriminatively Trained Part Based Models. [GIST] Modeling the shape of the scene: A holistic representation of the spatial envelope. [LBP] Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. [PrinciplesOfCategorization] Principles of categorization. [Visipedia] Vision of a Visipedia. [SUNDB] SUN Database: Exploring a Large Collection of Scene Categories. [PASCAL] The pascal visual object classes (voc) challenge. [ImageNet] Imagenet: A large-scale hierarchical image database. [SUN3D] SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels.
Apr 2 Wed	BOW + SPM + Sparse Coding	Xinyi Fan	pdf key	[SPM] Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. [LLC] Locality-constrained linear coding for image classification. [LSPM] Linear spatial pyramid matching using sparse coding for image classification. [FisherVector] Image Classification with the Fisher Vector: Theory and Practice. [FisherKernel] Improving the fisher kernel for large-scale image classification. [CodingComparison] The devil is in the details: an evaluation of recent feature encoding methods. [SmallCodes] Small codes and large image databases for recognition. [MultidimensionalSpectralHashing] Multidimensional spectral hashing. [SpectralHashing] Spectral hashing. [CompactCodes] Aggregating local image descriptors into compact codes.
Apr 7 Mon	Instance-level Matching	Pingmei Xu	pdf key	[VideoGoogle] Video Google: A text retrieval approach to object matching in videos. [GoogleGoggle] Object retrieval with large vocabularies and fast spatial matching. [Quantization] Lost in quantization: Improving particular object retrieval in large scale image databases. [TotalRecall] Total recall: Automatic query expansion with a generative feature model for object retrieval. [InstanceLevelRecognition] 3d object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. [GeometricEra] Object recognition in the geometric era: A retrospective.
Apr 9 Wed	Web Programming	Pingmei Xu	pdf key	w3schools.com
Apr 14 Mon	WebGL + Blender (Basic + Command Line Tool)	Maciej Halber	WebGL pdf WebGL key WebGL code Blender key Blender pdf BlenderScript BlenderFiles	Learning WebGL Lessons
Apr 16 Wed	Crowd Sourcing	Simin Chen	pdf pptx [Matlab Turk API] [DrawMe code] [TurkCleaner code]	[HumanInTheLoop] Visual recognition with humans in the loop. [Rating] Online crowdsourcing: rating annotators and obtaining cost-effective labels. [InteractiveTraining] Strong supervision from weak annotation: Interactive training of deformable part models. [Turkit] Turkit: human computation algorithms on mechanical turk. [ParallelHuman] Exploring iterative and parallel human computation processes. [ProgrammingHuman] Programming with human computation. [CrowdPowered] Crowd-powered systems. [ImageNet] Imagenet: A large-scale hierarchical image database.
Apr 21 Mon	Scene and Context	Yinda Zhang	pdf pptx	[GeometricContext] Geometric context from a single image. [PhotoPop-up] Automatic photo pop-up. [RGBDcuboid] A Linear Approach to Matching Cuboids in RGBD Images. [ExactLayout] Efficient exact inference for 3d indoor scene understanding. [BoxInBox] Box In the Box: Joint 3D Layout and Object Reasoning from Single Images. [ObjectPerspective] Putting objects in perspective. [Make3D] Make3d: Learning 3d scene structure from a single still image. [HallucinateHuman] Hallucinated Humans as the Hidden Context for Labeling 3D Scenes. [RoomLayout] Recovering the spatial layout of cluttered rooms. [StochasticGrammar] A stochastic grammar of images. [DDMCMC] Image segmentation by data-driven Markov chain Monte Carlo. [ImageParsing] Image parsing: Unifying segmentation, detection, and recognition. [AutoContext] Auto-context and its application to high-level vision tasks. [GrammarParsing] Bottom-up/top-down image parsing with attribute grammar. [AndOrGraph] A numerical study of the bottom-up and top-down inference processes in and-or graphs. [SimulationScene] Simulation as an engine of physical scene understanding. [GrowMind] How to grow a mind: Statistics, structure, and abstraction. [ProbabilisticGraphics] Approximate Bayesian image interpretation using generative probabilistic graphics programs.
Apr 23 Wed	Semantic Segmentation	Bebe Shi	pdf [TextonBoost Code] [TextonForest Code] [SiftFlow Code] [Label Transfer Code] [SuperParsing Code]	[TextonBoost] Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. [TextonForest] Semantic texton forests for image categorization and segmentation. [SiftFlow] SIFT flow: dense correspondence across different scenes. [LabelTransfer] Nonparametric scene parsing via label transfer.
Apr 28 Mon	Compressive Sensing	Li-Fang Cheng	pdf pptx L1 magic	[InformativeSensingArXiv] Informative sensing. [InformativeSensingICIP] Informative sensing of natural images. [InformativeSensing] Informative sensing: theory and applications.
Apr 30 Wed	How to do research + Open Discussion	Jianxiong Xiao	pdf pptx	Bill Freeman's how to do research; Bill Freeman's crowd sourced note; Ramesh Raskar's How to invent: The Idea Hexagon

Tentative Topics:

Geometry
- Camera Model [Jianxiong]
- Structure From Motion [Fisher]
- Stereo Matching [Fisher]
- Kinect Fusion [Sema]
Deep Learning [David][Zhirong]
- Energy-based Models
- Stochastic Gradient Descent
- Markov Chain Monte Carlo and Gibbs Sampling
- Contrastive Divergence
- Basic Neural Network
- Convolutional Neural Net
- Restricted Boltzmann Machine
- Deep Belief Net and Deep Boltzmann Machine
- Vision and Action [Chenyi]
  - Basic Reinforcement Learning
  - Deep Reinforcement Learning
- MRF, CRF, Graph Cut Modeling, and LBP [Huiwen]
Parallel Computing and Big Data
- Basic GPU Concept [Maciej]
- GPU programming with CUDA and C++ [Maciej]
- MEX in Matlab [Maciej]
- GPU in Matlab (PTX kernel, gpuArray, arrayfun) [Maciej]
- Parfor in Matlab [John]
- PBS system [John]
- MapReduce [John]
- Amazon EC2 and other services [John]
Object Detection
- Support Vector Machine [Shuran]
- Sliding Window Object Detection [Shuran]
- Deformable Part-based Model [Shuran]
- Selective Search and Regionlets [Shuran]
- SUN database, PASCAL VOC and ImageNet [Shuran]
- Gist, HOG, SIFT, LBP [Shuran]
- Bag of Words, Spatial Pyramid Matching, Sparse Coding [Xinyi]
- Instance-level Matching [Pingmei]
Crowd Sourcing
- HTML and CSS [Pingmei]
- Javascript and jQuery (sortable table) [Pingmei]
- Ajax and CGI (Python) [Pingmei]
- HTML5 Canvas [Pingmei]
- HTML5 Video and getUserMedia [Pingmei]
- HTML5 WebGL [Maciej]
- Amazon Mechanical Turk [Simin]
- Crowd-sourcing Practice [Simin]
Scene Parsing
- Texton Boost [Bebe Shi]
- Random Forest [Bebe Shi]
- SIFT Flow [Bebe Shi]
- Geometric Context [Yinda]
- Room Layout Estimation [Yinda]
- And Or Graph [Yinda]
- Street Scene Understanding for Autonomous Driving [Chenyi]

Class Requirement:

Each student will sign up for the topic they know best, and students will take turns giving an in-depth tutorial to the class.
There is no exam for the class. The grade directly depends on the quality of your presentation.
Your presentation should assume zero prior knowledge about the subject, and should be as clear and understandable as possible.
Your presentation should start by explaining the main idea and the main equations, and then go into great detail without losing the audience.
Your presentation should be very technical. You should read the papers you will present many times, and read the source code for the papers if available.
You should know everything about the subject you present and be prepared to answer any questions from the class.
You should try to integrate several papers on the topic into a coherent presentation. Instead of dividing your presentation into several disconnected parts, one for each paper, try to give a coherent tutorial.
For each lecture, prepare your slides and drop by my office 3 days before your presentation to discuss them.

Reading List:

[VisionEasier] Vision is getting easier every day.
[SUN3D] SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels.
[SUNDB] SUN Database: Exploring a Large Collection of Scene Categories.
[ApprenticeshipLearning] Apprenticeship learning via inverse reinforcement learning.
[DeepRL] Playing Atari with Deep Reinforcement Learning.
[ExemplarSVMs] Ensemble of exemplar-svms for object detection and beyond.
[ExemplarSVMsMatching] Data-driven Visual Similarity for Cross-domain Image Matching.
[DPM] Object Detection with Discriminatively Trained Part Based Models.
[GeometricContext] Geometric context from a single image.
[DalalTriggs] Histograms of oriented gradients for human detection.
[DCNN] Imagenet classification with deep convolutional neural networks.
[DBM] Deep boltzmann machines.
[DBN] A fast learning algorithm for deep belief nets.
[GeometricEra] Object recognition in the geometric era: A retrospective.
[PMVS] Accurate, dense, and robust multiview stereopsis.
[KinectPose] Real-time human pose recognition in parts from single depth images.
[QuasiDense] A quasi-dense approach to surface reconstruction from uncalibrated images.
[TextonBoost] Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation.
[TextonForest] Semantic texton forests for image categorization and segmentation.
[KinectFusion] KinectFusion: Real-time dense surface mapping and tracking.
[KeyFrameFusion] On unifying key-frame and voxel-based dense visual SLAM at large scales.
[ForestLocalization] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images.
[DecisionJungles] Decision Jungles: Compact and Rich Models for Classification.
[PartMixtures] Articulated pose estimation with flexible mixtures-of-parts.
[FisherVector] Image Classification with the Fisher Vector: Theory and Practice.
[VideoGoogle] Video Google: A text retrieval approach to object matching in videos.
[GoogleGoggle] Object retrieval with large vocabularies and fast spatial matching.
[SPM] Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories.
[ImageNet] Imagenet: A large-scale hierarchical image database.
[SimulationScene] Simulation as an engine of physical scene understanding.
[GrowMind] How to grow a mind: Statistics, structure, and abstraction.
[ProbabilisticGraphics] Approximate Bayesian image interpretation using generative probabilistic graphics programs.
[RBC] Recognition-by-components: a theory of human image understanding.
[PictorialStructure] The representation and matching of pictorial structures.
[PrinciplesOfCategorization] Principles of categorization.
[ObjectPerspective] Putting objects in perspective.
[PhotoTourism] Photo tourism: exploring photo collections in 3D.
[BiedermanScene] Scene perception: Detecting and judging objects undergoing relational violations.
[TinyImage] 80 million tiny images: A large data set for nonparametric object and scene recognition.
[SceneCompletion] Scene completion using millions of photographs.
[IM2GPS] IM2GPS: estimating geographic information from a single image.
[RoomLayout] Recovering the spatial layout of cluttered rooms.
[StochasticGrammar] A stochastic grammar of images.
[DDMCMC] Image segmentation by data-driven Markov chain Monte Carlo.
[ImageParsing] Image parsing: Unifying segmentation, detection, and recognition.
[AutoContext] Auto-context and its application to high-level vision tasks.
[GrammarParsing] Bottom-up/top-down image parsing with attribute grammar.
[AndOrGraph] A numerical study of the bottom-up and top-down inference processes in and-or graphs.
[InstanceLevelRecognition] 3d object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints.
[SingleViewMetrology] Single view metrology.
[PhotoPop-up] Automatic photo pop-up.
[Make3D] Make3d: Learning 3d scene structure from a single still image.
[LabelMe3D] Building a database of 3d scenes from user annotations.
[HallucinateHuman] Hallucinated Humans as the Hidden Context for Labeling 3D Scenes.
[ConnectedCRF] Efficient inference in fully connected crfs with gaussian edge potentials.
[SmallCodes] Small codes and large image databases for recognition.
[MultidimensionalSpectralHashing] Multidimensional spectral hashing.
[SpectralHashing] Spectral hashing.
[CompactCodes] Aggregating local image descriptors into compact codes.
[AutoEncoder] Reducing the dimensionality of data with neural networks.
[JonathanBarron] Shape, Illumination, and Reflectance from Shading.
[ShapeContext] Shape context: A new descriptor for shape matching and object recognition.
[NormalizedCut] Normalized cuts and image segmentation.
[SIFT] Object recognition from local scale-invariant features.
[GraphCut] Fast approximate energy minimization via graph cuts.
[LazySnapping] Lazy snapping.
[CRFobject] Conditional random fields for object recognition.
[BP] Understanding belief propagation and its generalizations.
[PedroSegmentation] Efficient graph-based image segmentation.
[EfficientBP] Efficient belief propagation for early vision.
[DistanceTransform] Distance transforms of sampled functions.
[PASCAL] The pascal visual object classes (voc) challenge.
[ceres-solver] Ceres Solver.
[Poselet] Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations.
[GPB] Contour Detection and Hierarchical Image Segmentation.
[Visipedia] Vision of a Visipedia.
[SpinImages] Using spin images for efficient object recognition in cluttered 3D scenes.
[CF] Model recommendation for action recognition.
[LDA] Discriminative decorrelation for clustering and classification.
[Cuboid] Localizing 3D Cuboids in Single-view Images.
[RGBDcuboid] A Linear Approach to Matching Cuboids in RGBD Images.
[ExactLayout] Efficient exact inference for 3d indoor scene understanding.
[TrafficScene] 3D Traffic Scene Understanding from Movable Platforms.
[SiftFlow] SIFT flow: dense correspondence across different scenes.

Resources:

Books

There is no textbook for this class. The following are just references if you are interested.

Computer vision:

[Sz] Szeliski, Computer Vision: Algorithms and Applications, Springer, 2010 (online draft)
[HZ] Hartley and Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2004
[FP] Forsyth and Ponce, Computer Vision: A Modern Approach, Prentice Hall, 2002
[Pa] Palmer, Vision Science, MIT Press, 1999

Learning:

[Mi] Mitchell, Machine Learning, McGraw-Hill, 1997
[DHS] Duda, Hart and Stork, Pattern Classification (2nd Edition), Wiley-Interscience, 2000

Graphical models:

[KF] Koller and Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009

Related Courses:

Computer Vision Class at Princeton

By Antonio Torralba at MIT:

By Alyosha Efros at CMU/Berkeley:

By James Hays at Brown:

By others:

Deep Learning Summer School 2012
Introduction to Computer Vision, by Noah Snavely
Multiple View Geometry, by Marc Pollefeys
Geometry, by Andrew Zisserman
Recognizing People, Objects and Actions, by Jitendra Malik
Introduction to Computer Vision, by Michael Black
Computer Vision, by Kristen Grauman
Computer Vision, by Rob Fergus
Introduction to Computer Vision, by Fei-Fei Li
The Computer Vision Industry

Code and Datasets

SUN database
SUN360 panorama database
Scene Classification Benchmark
The Steerable Pyramid
DrawMe: a light-weight Javascript library for line drawing on a picture.
Structural SVM
Template Matching
Representation and Synthesis of Visual Texture, Portilla & Simoncelli
Berkeley Segmentation
Pb
Superpixels
Structure from Motion for Unordered Image Collections
Peter Kovesi's Functions for Computer Vision
SIFT implementation by Andrea Vedaldi
Affine Covariant Features
A simple object detector with boosting
OpenCV