Siddhesh Khandelwal

I am a third-year PhD student at the University of British Columbia. I am part of the Computer Vision Lab, working with Prof. Leonid Sigal on problems in computer vision. I completed my Master's in Computer Science at the University of British Columbia and my Bachelor's in Computer Science and Engineering at the Indian Institute of Technology, Guwahati.

My current research lies at the intersection of structured understanding of visual scenes and effective learning with limited supervision. I am also interested in accurately forecasting future trajectories for autonomous driving.

Selected Publications

NeurIPS
Iterative Scene Graph Generation

Siddhesh Khandelwal and Leonid Sigal

In Proceedings of Neural Information Processing Systems, 2022


The task of scene graph generation entails identifying object entities and their corresponding interaction predicates in a given image (or video). Due to the combinatorially large solution space, existing approaches to scene graph generation assume a certain factorization of the joint distribution to make the estimation feasible (e.g., assuming that objects are conditionally independent of predicate predictions). However, this fixed factorization is not ideal under all scenarios (e.g., for images where an object involved in an interaction is small and not discernible on its own). In this work, we propose a novel framework for scene graph generation that addresses this limitation, as well as introduces dynamic conditioning on the image, using message passing in a Markov Random Field. This is implemented as an iterative refinement procedure wherein each modification is conditioned on the graph generated in the previous iteration. This conditioning across refinement steps allows joint reasoning over entities and relations. This framework is realized via a novel and end-to-end trainable transformer-based architecture. In addition, the proposed framework can improve the performance of existing approaches. Through extensive experiments on the Visual Genome and Action Genome benchmark datasets, we show improved performance on scene graph generation.
@inproceedings{khandelwal2022iterative,
  title = {Iterative Scene Graph Generation},
  author = {Khandelwal, Siddhesh and Sigal, Leonid},
  booktitle = {Proceedings of Neural Information Processing Systems},
  year = {2022},
}
ICCV
Segmentation-grounded scene graph generation

Siddhesh Khandelwal, Mohammed Suhail, and Leonid Sigal

In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021


Scene graph generation has emerged as an important problem in computer vision. While scene graphs provide a grounded representation of objects, their locations and relations in an image, they do so only at the granularity of proposal bounding boxes. In this work, we propose the first, to our knowledge, framework for pixel-level segmentation-grounded scene graph generation. Our framework is agnostic to the underlying scene graph generation method and addresses the lack of segmentation annotations in target scene graph datasets (e.g., Visual Genome) through transfer and multi-task learning from, and with, an auxiliary dataset (e.g., MS COCO). Specifically, each target object being detected is endowed with a segmentation mask, which is expressed as a lingual-similarity weighted linear combination over categories that have annotations present in an auxiliary dataset. These inferred masks, along with a novel Gaussian attention mechanism which grounds the relations at a pixel level within the image, allow for improved relation prediction. The entire framework is end-to-end trainable and is learned in a multi-task manner with both target and auxiliary datasets.
@inproceedings{khandelwal2021segmentation,
  title = {Segmentation-grounded scene graph generation},
  author = {Khandelwal, Siddhesh and Suhail, Mohammed and Sigal, Leonid},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages = {15879--15889},
  year = {2021},
}
CVPR
Unit: Unified knowledge transfer for any-shot object detection and segmentation

Siddhesh Khandelwal, Raghav Goyal, and Leonid Sigal

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021


Methods for object detection and segmentation rely on large-scale instance-level annotations for training, which are difficult and time-consuming to collect. Efforts to alleviate this look at varying degrees and quality of supervision. Weakly-supervised approaches draw on image-level labels to build detectors/segmentors, while zero/few-shot methods assume abundant instance-level data for a set of base classes, and none to a few examples for novel classes. This taxonomy has largely siloed algorithmic designs. In this work, we aim to bridge this divide by proposing an intuitive and unified semi-supervised model that is applicable to a range of supervision: from zero to a few instance-level samples per novel class. For base classes, our model learns a mapping from weakly-supervised to fully-supervised detectors/segmentors. By learning and leveraging visual and lingual similarities between the novel and base classes, we transfer those mappings to obtain detectors/segmentors for novel classes, refining them with a few novel class instance-level annotated samples, if available. The overall model is end-to-end trainable and highly flexible. Through extensive experiments on the MS-COCO and Pascal VOC benchmark datasets, we show improved performance in a variety of settings.
@inproceedings{khandelwal2021unit,
  title = {Unit: Unified knowledge transfer for any-shot object detection and segmentation},
  author = {Khandelwal, Siddhesh and Goyal, Raghav and Sigal, Leonid},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages = {5951--5961},
  year = {2021},
}
ICCV
AttentionRNN: a structured spatial attention mechanism

Siddhesh Khandelwal and Leonid Sigal

In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019


Visual attention mechanisms have proven to be integral components of many modern deep neural architectures. They provide an efficient and effective way to utilize visual information selectively, which has been shown to be especially valuable in multi-modal learning tasks. However, all prior attention frameworks lack the ability to explicitly model structural dependencies among attention variables, making it difficult to predict consistent attention masks. In this paper, we develop a novel structured spatial attention mechanism which is end-to-end trainable and can be integrated with any feed-forward convolutional neural network. This proposed AttentionRNN layer explicitly enforces structure over the spatial attention variables by sequentially predicting attention values in the spatial mask in a bi-directional raster-scan and inverse raster-scan order. As a result, each attention value depends not only on local image or contextual information, but also on the previously predicted attention values. Our experiments show consistent quantitative and qualitative improvements on a variety of recognition tasks and datasets, including image categorization, question answering and image generation.
@inproceedings{khandelwal2019attentionrnn,
  title = {AttentionRNN: a structured spatial attention mechanism},
  author = {Khandelwal, Siddhesh and Sigal, Leonid},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages = {3425--3434},
  year = {2019},
}