EfficientPS Demo

Deep Convolutional Neural Networks for Panoptic Segmentation

This demo shows the panoptic segmentation performance of our EfficientPS model trained on four challenging urban scene understanding datasets. EfficientPS is currently ranked #1 for panoptic segmentation on standard benchmark datasets such as Cityscapes, KITTI, Mapillary Vistas, and IDD. Additionally, EfficientPS is ranked #2 on both the Cityscapes semantic segmentation benchmark and the Cityscapes instance segmentation benchmark, among the published methods. To learn more about panoptic segmentation and the approach employed, please see the Technical Approach. View the demo by selecting a dataset to load from the drop-down box below and clicking on an image in the carousel to see live results. The results are shown as an overlay of the panoptic segmentation over the input image. The colors of the overlay denote the semantic category that each pixel belongs to, and instances of objects are indicated with a white boundary.

Please Select a Dataset:

Selected Dataset:

Cityscapes

Technical Approach

What is Panoptic Segmentation?

Network architecture

From an early age, humans are able to effortlessly comprehend complex visual scenes, which forms the basis for learning more advanced capabilities. Similarly, intelligent systems such as robots should have the ability to coherently understand visual scenes at both the fundamental pixel level as well as at the distinctive object instance level. This enables them to perceive and reason about the environment holistically, which facilitates interaction. Such modeling ability is a crucial enabler that can revolutionize several diverse applications including robotics, self-driving cars, augmented reality, and biomedical imaging.

A relatively new approach to scene understanding called panoptic segmentation aims to use a single convolutional neural network to simultaneously recognize distinct foreground objects such as people, cyclists, or cars (a task called instance segmentation), while also labeling pixels in the image background with classes such as road, sky, or grass (a task called semantic segmentation). Most early research has primarily explored these two segmentation tasks separately using different types of network architectures. However, this disjoint approach has several drawbacks, including large computational overhead, redundancy in learning, and discrepancy between the predictions of each network. Although recent methods have made significant strides to address this task in a top-down manner with shared components or in a bottom-up manner sequentially, these approaches still face several challenges, including high computational cost, slow runtimes, and subpar results compared to task-specific state-of-the-art networks. To address these issues, we study the typical design choices for such networks and make several key advances that we incorporate in our EfficientPS architecture, which improves both performance as well as efficiency.
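To make the two sub-tasks concrete, the sketch below shows how a panoptic map can be assembled from a per-pixel semantic prediction and a set of instance masks, in the spirit of the heuristic post-processing used by early methods. The function name `merge_panoptic` and the array layout are illustrative assumptions, not part of the released EfficientPS code.

```python
import numpy as np

def merge_panoptic(semantic, instance_masks, instance_classes, stuff_ids):
    """Heuristically merge semantic and instance predictions.

    semantic:          (H, W) int array of per-pixel class ids
    instance_masks:    list of (H, W) bool arrays, sorted by confidence
    instance_classes:  class id for each instance mask
    stuff_ids:         set of 'stuff' (background) class ids
    Returns an (H, W, 2) array holding (class_id, instance_id) per pixel.
    """
    h, w = semantic.shape
    panoptic = np.zeros((h, w, 2), dtype=np.int32)
    # Background ('stuff') pixels take the semantic label with instance id 0.
    for cid in stuff_ids:
        panoptic[semantic == cid, 0] = cid
    occupied = np.zeros((h, w), dtype=bool)
    # Paste instance ('thing') masks in confidence order; earlier masks
    # win overlapping pixels, mimicking a simple conflict-resolution rule.
    for inst_id, (mask, cid) in enumerate(
            zip(instance_masks, instance_classes), start=1):
        free = mask & ~occupied
        panoptic[free, 0] = cid
        panoptic[free, 1] = inst_id
        occupied |= free
    return panoptic
```

Running two separate networks and merging their outputs this way is exactly the disjoint pipeline whose overhead and prediction discrepancies EfficientPS is designed to avoid.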


EfficientPS Architecture

The design of EfficientPS is driven by our goal of achieving superior performance compared to prior state-of-the-art models while simultaneously being fast and computationally efficient. Early panoptic segmentation methods heuristically combine the predictions of a state-of-the-art instance segmentation network and a semantic segmentation network in a post-processing step, which incurs a large computational overhead, redundancy in learning, and discrepancies between the predictions of the two networks. More recent methods address the task in a top-down manner with shared components or in a bottom-up manner sequentially, but they still fall short of task-specific individual networks in computational efficiency, runtime, and accuracy.


Network architecture
Figure: Illustration of our EfficientPS architecture. It consists of a shared backbone built upon EfficientNet (red) with our new 2-way FPN (purple, blue, and green), our proposed semantic head (yellow), a separable-convolution-based Mask R-CNN instance head (orange), and our proposed panoptic fusion module.

We address the aforementioned challenges with our EfficientPS architecture. It consists of our new shared backbone with mobile inverted bottleneck units and our proposed 2-way Feature Pyramid Network (FPN), followed by task-specific instance and semantic segmentation heads with separable convolutions, whose outputs are combined in our parameter-free panoptic fusion module. The entire network is jointly optimized in an end-to-end manner to yield the final panoptic segmentation output.


Previous panoptic segmentation architectures rely on ResNets or ResNeXts with a Feature Pyramid Network (FPN) as the backbone, which consume a significant number of parameters and have limited representational capacity. In order to achieve better efficiency, we propose a new backbone network consisting of a modified EfficientNet that employs compound scaling to uniformly scale all the dimensions of the network, coupled with our novel 2-way FPN. We identify that the standard FPN is limited in aggregating multi-scale features due to its unidirectional flow of information. Therefore, we introduce the novel 2-way FPN that facilitates bidirectional flow of information, which substantially improves the panoptic quality of foreground classes while remaining comparable in runtime.
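The bidirectional aggregation idea can be sketched as two passes over a feature pyramid: a top-down pass carrying coarse semantics to fine scales, and a bottom-up pass carrying fine details to coarse scales, fused per level. This is a toy numpy sketch of the information flow only; the names `two_way_fpn`, `upsample2`, and `downsample2` are illustrative, and the real network uses learned convolutions instead of plain resampling and addition.

```python
import numpy as np

def upsample2(x):
    # Nearest-neighbour 2x upsampling along both spatial axes.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2(x):
    # Stride-2 subsampling stands in for a strided convolution.
    return x[::2, ::2]

def two_way_fpn(features):
    """Bidirectional multi-scale aggregation, finest level first.

    features: list of (H_i, W_i) arrays, each half the size of the previous.
    Returns one fused map per level, combining a top-down pass with a
    bottom-up pass, illustrating the 2-way flow of information.
    """
    n = len(features)
    # Top-down pass: start at the coarsest map and propagate downwards.
    td = [None] * n
    td[-1] = features[-1]
    for i in range(n - 2, -1, -1):
        td[i] = features[i] + upsample2(td[i + 1])
    # Bottom-up pass: start at the finest map and propagate upwards.
    bu = [None] * n
    bu[0] = features[0]
    for i in range(1, n):
        bu[i] = features[i] + downsample2(bu[i - 1])
    # Fuse the two information flows at every scale.
    return [t + b for t, b in zip(td, bu)]
```

In a standard FPN only the top-down list `td` would exist; the second pass is what lets fine-scale evidence about foreground objects reach the coarser levels as well.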


We incorporate our proposed semantic head with separable convolutions that efficiently captures fine features and long-range context, and correlates them before fusion for better refinement of object boundaries. For the instance head, we build upon Mask R-CNN and augment it with separable convolutions and iABN sync layers. One of the critical challenges in panoptic segmentation is resolving the conflict between overlapping predictions from the semantic and instance heads. In order to thoroughly exploit the logits from both heads, we propose a new panoptic fusion module that dynamically adapts the fusion of logits from the semantic and instance heads based on their mask confidences and congruously integrates instance-specific foreground classes with background classes to yield the final panoptic segmentation output.
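The core of the adaptive fusion can be illustrated with the per-pixel rule described in the EfficientPS paper, which combines the two heads' mask logits weighted by their sigmoid confidences. The sketch below shows only this fusion rule; the function name `fuse_logits` is illustrative, and the full module additionally handles overlap resolution and the integration of 'thing' and 'stuff' classes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_logits(instance_logit, semantic_logit):
    """Adaptively fuse per-pixel mask logits from the two heads.

    Fused logit = (sigmoid(ml_a) + sigmoid(ml_b)) * (ml_a + ml_b),
    where ml_a comes from the instance head and ml_b from the semantic
    head. Pixels on which both heads are confident are amplified, while
    disagreements between the heads are attenuated.
    """
    ml_a, ml_b = instance_logit, semantic_logit
    return (sigmoid(ml_a) + sigmoid(ml_b)) * (ml_a + ml_b)
```

Because the weighting term depends on both confidences, the fusion adapts per pixel rather than applying a fixed preference for one head, which is what makes the module parameter-free yet still able to arbitrate between the two predictions.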


EfficientPS Performance

We evaluate EfficientPS on four challenging urban scene understanding benchmark datasets, namely Cityscapes, Mapillary Vistas, KITTI, and IDD. EfficientPS is ranked #1 for panoptic segmentation on the widely used Cityscapes benchmark leaderboard, exceeding the prior state-of-the-art model by a large margin while consuming fewer parameters, requiring less computation, and achieving a faster inference time. In addition, EfficientPS is ranked #2 on both the Cityscapes semantic segmentation benchmark and the Cityscapes instance segmentation benchmark, among the published methods. EfficientPS consistently achieves state-of-the-art panoptic segmentation performance on the Mapillary Vistas, KITTI, and IDD benchmark datasets. EfficientPS is the first model to benchmark on all four standard urban scene understanding datasets for panoptic segmentation, and it exceeds the state of the art on each of them while simultaneously being the most efficient model. Achieving computationally efficient, rich, and coherent image segmentation has widespread implications for image recognition systems that have to make sense of cluttered real-world environments, where objects move and overlap. Segmenting foreground objects together with the background is important for understanding entire scenes and for performing related actions, such as navigating through dynamic scenes.

KITTI Panoptic Segmentation Dataset


We introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI Vision Benchmark Suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set; therefore, in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually. The dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. The images have a resolution of 1280×384 pixels. We provide annotations for 11 ‘stuff’ classes and 8 ‘thing’ classes, adhering to the Cityscapes ‘stuff’ and ‘thing’ class distribution.

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the KITTI Panoptic Segmentation dataset, please consider citing the paper mentioned in the Publications section.

Videos

Code

Given the exceptional performance of our EfficientPS, we expect it could serve as a new foundation for future segmentation-related research and potentially make high-accuracy panoptic segmentation models practically useful for many real-world applications. Therefore, we have open-sourced all the code and pretrained model checkpoints on our GitHub for academic usage under the GPLv3 license. For any commercial purpose, please contact the authors.

Publications

Rohit Mohan and Abhinav Valada, "EfficientPS: Efficient Panoptic Segmentation",
International Journal of Computer Vision (IJCV), 129(5):1551–1579, 2021.

(Pdf) (Bibtex)


Rohit Mohan and Abhinav Valada, "Robust Vision Challenge 2020 - 1st Place Report for Panoptic Segmentation",
European Conference on Computer Vision (ECCV) Workshop on Robust Vision Challenge, 2020.

(Pdf) (Bibtex)


People