Popular Datasets for SSL in CV

Introduction

This blog is an excerpt from my own paper 《A survey on pre-trained models in text, image and graph: powerful self-supervised deep learning via big data》, which aims to introduce the powerful pre-trained models (PTMs) of the Self-Supervised Learning (SSL) era. This blog serves as a synopsis of benchmark datasets for SSL in computer vision (CV).

Classification

MNIST[1] is a database of handwritten digits containing 60,000 training examples and 10,000 testing examples. The images are fixed-size with 28×28 pixels. The pixel values range from 0 to 255, where 0 means background (white) and 255 means foreground (black). The labels are from 0 to 9, and each image contains exactly one digit. Both traditional and deep learning methods have been benchmarked on this most popular dataset, even though advanced methods now achieve near-perfect results. Thus, Geoffrey Hinton has described it as "the drosophila of machine learning".
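For concreteness, the dataset can be pulled down with torchvision; the snippet below is a minimal sketch, where the root path and the sanity-check values in the comments are illustrative assumptions rather than part of the original dataset description.

```python
# Minimal sketch: loading MNIST with torchvision (root path is a placeholder).
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # scales pixel values from [0, 255] to [0.0, 1.0]

train_set = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

print(len(train_set), len(test_set))   # 60000 10000
image, label = train_set[0]
print(image.shape, label)              # torch.Size([1, 28, 28]) and a digit in 0..9
```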

Also in the digit-recognition domain, the Street View House Numbers (SVHN)[2] dataset collects real-world digits from house numbers in Google Street View images. There are 73,257 digits for training, 26,032 digits for testing, and 531,131 additional samples; all of them are 32×32 color images with both class labels and character-level bounding boxes.

As more advanced methods achieve near-perfect results on simple datasets, more sophisticated datasets such as CIFAR-10/CIFAR-100[3] have been constructed. These two datasets are closer to real-world objects. CIFAR-10 consists of 50,000 training images and 10,000 testing images, with 6,000 images per class and 32×32 pixels in each RGB colour image. CIFAR-100 is similar to CIFAR-10 but with more detailed label information: there are 100 classes containing 500 training images and 100 testing images each. In addition, these 100 "fine" classes are grouped equally into 20 "coarse" classes. Researchers can adapt it to suitable learning methods.
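Both CIFAR datasets are likewise bundled with torchvision; the following is a minimal sketch with placeholder paths. As far as I know, the stock torchvision loader exposes only the 100 fine labels of CIFAR-100, so the 20 coarse labels would have to be read from the original pickled files if needed.

```python
# Minimal sketch: CIFAR-10 / CIFAR-100 via torchvision (root path is a placeholder).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=to_tensor)

img, fine_label = cifar100_train[0]
print(img.shape)                                 # torch.Size([3, 32, 32])
print(len(cifar10_train), len(cifar100_train))   # 50000 50000
```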

Inspired by the CIFAR-10 dataset, STL-10[4] is another 96×96 color image dataset containing 10 similar real-world classes. Each class has 500 training images and 800 test images. The biggest difference is that STL-10 has 100,000 unlabeled images for unsupervised learning. More construction information can be seen in [5].
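Because of the large unlabeled split, STL-10 is a natural small-scale testbed for self-supervised pre-training. The sketch below is illustrative only: the augmentation pipeline and paths are my own assumptions, not a prescribed SSL recipe.

```python
# Minimal sketch: using STL-10 splits for SSL (augmentation choice is illustrative).
from torchvision import datasets, transforms

ssl_aug = transforms.Compose([
    transforms.RandomResizedCrop(96),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# 100,000 unlabeled images for the pretext task (targets are -1 in this split).
unlabeled = datasets.STL10(root="./data", split="unlabeled", download=True, transform=ssl_aug)
# 5,000 labeled training images / 8,000 test images for the downstream classifier.
train = datasets.STL10(root="./data", split="train", download=True, transform=transforms.ToTensor())
test = datasets.STL10(root="./data", split="test", download=True, transform=transforms.ToTensor())
```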

Caltech-101[6] collects roughly 300×200 color images of objects belonging to 101 categories, with 40 to 800 images per category and about 50 on average. The outlines of the objects in the pictures are annotated for the convenience of different learning methods.

ImageNet[7] is one of the most popular large-scale datasets in computer vision. It is built according to the hierarchical structure of WordNet[8]. The full ImageNet dataset contains 14,197,122 images and 21,841 indexed synsets, attaching on average 1,000 images to illustrate each synset. The most frequently used subset of ImageNet is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset from 2010 to 2017, containing classification, localization, and detection tasks. The number of samples in the training and testing sets and the labels of images are determined by the specific task; more details can be found in [9].
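Access to the full ILSVRC images requires registration, so there is no automatic download; assuming the archives have already been extracted into one folder per synset (a common but not automatic layout), a generic ImageFolder reader is enough for a quick sanity check.

```python
# Minimal sketch: reading an extracted ILSVRC-2012 tree with ImageFolder.
# Assumes images are arranged as <root>/<synset>/<image>.JPEG, which the
# official archives do not provide out of the box.
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=preprocess)
print(len(train_set.classes))   # 1000 synsets in the ILSVRC classification subset
```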

In addition to the popular MNIST, there exist many domain-specific datasets used for downstream classification tasks.

HMDB51[10][11] is an action video database with a total of 7,000 clips in 51 action classes; it covers five types of facial actions and body movements. UCF101[12] is another action video dataset designed for more realistic action recognition. It extends the UCF50[13] dataset, which contains only 50 action categories, to 101 action categories collected from YouTube. What makes it a famous recognition benchmark is the ICCV13 workshop that used UCF101 as its main competition benchmark.

Food-101[14] is a real-world food dataset of 101 food categories, with 750 training and 250 testing images per class. Birdsnap[15] is a large-scale fine-grained visual categorization dataset of birds, providing bounding boxes and the locations/annotations of 17 parts per object. It contains 49,829 images of the 500 most common species in North America, with each species having 69 to 100 images and most species having 100. In addition, some images are also labeled as male or female, immature or adult, and breeding or non-breeding plumage.

Targeting scene categorization, the extensive Scene UNderstanding (SUN)[16][17] database fills the gap left by existing datasets with a limited scope of categories. This database contains 899 categories and 130,519 images; only images larger than 200×200 pixels were kept. SUN397 is a well-sampled subset that keeps 397 categories with at least 100 images per category, discarding the categories that contain relatively few unique photographs. The Places205[18] dataset is another large-scale scene dataset, consisting of 2,448,873 images from 205 scene categories.

The Cars[19] dataset contains 16,185 color images of 196 classes of cars (at the level of Make, Model, Year). For convenience, this dataset is split into training and test sets of roughly equal size. Aircraft[20] is another fine-grained visual classification dataset designed for aircraft (also known as FGVC-Aircraft). A popular form of this dataset is the fine-grained recognition challenge 2013 (FGComp2013)[21], which ran in parallel with ILSVRC2013. A four-level hierarchy (Model, Variant, Family, Manufacturer, from finer to coarser) organizes this database; more detailed information is given in [22]. Pets[23] refers to The Oxford-IIIT Pet Dataset, which collects 37 pet categories with roughly 200 images per category. All images have an associated ground-truth annotation of breed for classification, head ROI for detection, and pixel-level trimap for segmentation. Similarly, Flowers[24] is another domain dataset, also collected by Oxford; it contains Oxford-17 Flowers with 17 categories and Oxford-102 Flowers with 102 categories. The Describable Textures Dataset (DTD)[25] is an evolving collection of textural images in the wild, consisting of 5,640 images in 47 categories, with 120 images per category. iNaturalist2018[26] is a large-scale species classification competition held at the FGVC5 workshop at CVPR2018. The dataset contains over 8,000 species categories, with more than 450,000 training and validation images collected from iNaturalist[27].
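Several of these downstream classification sets are packaged in recent torchvision releases; the sketch below is a minimal illustration (assuming torchvision ≥ 0.13, with placeholder root paths), not an exhaustive loader for everything listed above.

```python
# Minimal sketch: a few of the downstream classification datasets above,
# as packaged in recent torchvision releases (assumed >= 0.13).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

food = datasets.Food101(root="./data", split="train", download=True, transform=to_tensor)
dtd = datasets.DTD(root="./data", split="train", download=True, transform=to_tensor)
pets = datasets.OxfordIIITPet(root="./data", split="trainval", download=True, transform=to_tensor)
flowers = datasets.Flowers102(root="./data", split="train", download=True, transform=to_tensor)

for name, ds in [("Food-101", food), ("DTD", dtd), ("Pets", pets), ("Flowers-102", flowers)]:
    print(name, len(ds), "images")
```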

Detection

COCO[28] is a large-scale dataset for object detection, segmentation, and captioning; it contains 330,000 RGB images, with more than 200,000 labelled. There are 1.5 million object instances of 80 object categories involved. Thus, it is one of the most popular benchmark datasets in detection and segmentation, in parallel with the following PASCAL VOC.
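COCO annotations ship as JSON files that are usually read through the pycocotools API; the sketch below uses the torchvision wrapper around it, with placeholder paths for the 2017 split.

```python
# Minimal sketch: reading COCO detection annotations via torchvision + pycocotools.
# The image directory and annotation file paths are placeholders.
from torchvision.datasets import CocoDetection

coco_train = CocoDetection(
    root="/path/to/coco/train2017",
    annFile="/path/to/coco/annotations/instances_train2017.json",
)

image, targets = coco_train[0]
# Each target is a dict with keys such as "bbox", "category_id", and "segmentation".
print(len(targets), targets[0]["category_id"] if targets else None)
```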

The PASCAL VOC project[29] provides standardized image datasets for object class recognition and ran challenges evaluating performance on object class recognition from 2005 to 2012. The main datasets used in self-supervised learning are VOC07, VOC11, and VOC12. The main competitions in VOC07[30] contain classification and detection tasks; both cover 20 object classes, and each image contains at least one object. Thus, VOC07 commonly serves as the downstream task for detection.
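VOC07 can also be fetched directly through torchvision; a minimal sketch (root path is a placeholder) looks like this.

```python
# Minimal sketch: VOC07 detection annotations via torchvision (root is a placeholder).
from torchvision.datasets import VOCDetection

voc07_trainval = VOCDetection(root="./data", year="2007", image_set="trainval", download=True)
voc07_test = VOCDetection(root="./data", year="2007", image_set="test", download=True)

image, target = voc07_trainval[0]
# The target mirrors the VOC XML: object names and bounding boxes live under
# target["annotation"]["object"].
print(target["annotation"]["filename"])
```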

Segmentation

Both VOC11[31] and VOC12[32] contain classification, detection, and segmentation tasks in their main competitions; they are therefore commonly used as downstream tasks for segmentation.

ADE20K[33][34] collects 27,574 images from both the SUN and Places205 databases, of which 25,574 are for training and 2,000 for testing. All 707,868 objects from 3,688 categories appearing in the images are annotated. In particular, this dataset contains 193,238 annotated object parts and parts of parts, plus additional attributes, annotation time, and depth ordering for the benefit of the research community.

The NYU-Depth V2[35] dataset consists of images and video sequences from 464 indoor scenes across 3 cities, recorded by both RGB and depth cameras. It contains 1,449 images with ground-truth depth, and the original RGB values are also provided. In addition, there are 407,024 unlabeled frames and additional class labels for the objects in the images.

Cityscapes[36][37] is a dataset of urban street scenes from 50 cities with ground-truth semantic segmentation. The main instances are vehicles, people, and constructions. The high-quality dense pixel annotations cover 5,000 images. In addition to the fine annotations, coarser polygonal annotations are provided for a set of 20,000 images. Moreover, the annotated images are individual frames drawn from videos, and the surrounding video frames, which do not carry high-quality annotations, are also made available to researchers.
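Cityscapes requires manual registration and download, after which the fine annotations can be read with the torchvision wrapper; the sketch below assumes the archives are already extracted under a placeholder root.

```python
# Minimal sketch: reading Cityscapes fine annotations with torchvision.
# The archives must be downloaded manually from cityscapes-dataset.com first;
# the root path is a placeholder.
from torchvision.datasets import Cityscapes

train_fine = Cityscapes(
    root="/path/to/cityscapes",
    split="train",
    mode="fine",
    target_type="semantic",
)

image, seg_mask = train_fine[0]  # PIL image and its pixel-level label map
print(image.size, seg_mask.size)
```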

LVIS[38] is a dataset for large vocabulary instance segmentation. Its features are that 1) a category or word in one image is related to only one segmentation object; 2) more than 1,200 categories are extracted from roughly 160,000 images; 3) a long-tail phenomenon exists in these categories; and 4) there are more than 2,000,000 high-quality instance segmentation masks.

Densely Annotated VIdeo Segmentation (DAVIS)[39] is a video dataset designed for in-depth analysis of the state of the art in video object segmentation. DAVIS 2017[40] contains both semi-supervised (human-guided at test time) and unsupervised (not human-guided at test time) video sequences with multiple annotated instances.

Others

The Paris StreetView dataset[41] is designed for the image inpainting task; it contains 14,900 training images and 100 test images collected from Google Street View, focusing mainly on the buildings of the city of Paris.

Based on MNIST, Moving-MNIST[42] is a video dataset designed for evaluating sequence prediction or reconstruction; it contains 10,000 sequences. Each video is 20 frames long and consists of two digits (possibly overlapping) moving inside a 64×64 patch. The first benchmark was reported in [43] using LSTMs.
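The construction is straightforward to reproduce. Below is a rough sketch of a Moving-MNIST-style generator; the bouncing dynamics and velocity ranges are my own simplification, not the exact generation script behind [42][43].

```python
# Rough sketch of a Moving-MNIST-style generator: two 28x28 digits bounce
# inside a 64x64 canvas for 20 frames. The dynamics are a simplification.
import numpy as np
from torchvision import datasets

mnist = datasets.MNIST(root="./data", train=True, download=True)

def make_sequence(num_frames=20, canvas=64, rng=np.random.default_rng(0)):
    frames = np.zeros((num_frames, canvas, canvas), dtype=np.uint8)
    digits = [np.array(mnist[int(i)][0]) for i in rng.integers(0, len(mnist), size=2)]
    pos = rng.integers(0, canvas - 28, size=(2, 2)).astype(float)
    vel = rng.uniform(-3, 3, size=(2, 2))
    for t in range(num_frames):
        for d, digit in enumerate(digits):
            # Move the digit, bouncing off the canvas borders.
            pos[d] += vel[d]
            for axis in range(2):
                if pos[d, axis] < 0 or pos[d, axis] > canvas - 28:
                    vel[d, axis] *= -1
                    pos[d, axis] = np.clip(pos[d, axis], 0, canvas - 28)
            y, x = pos[d].astype(int)
            region = frames[t, y:y + 28, x:x + 28]
            frames[t, y:y + 28, x:x + 28] = np.maximum(region, digit)
    return frames  # shape (20, 64, 64); overlapping digits are max-combined

seq = make_sequence()
print(seq.shape)
```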

The Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset[44][45] is the largest public multimedia collection; users can search it for their own targets and browse both images and videos. It is free for researchers to explore and investigate subsets of YFCC100M in real time: subsets of the complete dataset can be retrieved by keyword search and reviewed directly. In addition, the text information attached to each image or video is abundant, such as location information and user tags. Briefly, it is more a multimedia library than a domain dataset.

A more generalized dataset concept in the self-supervised learning era is composed of multimedia websites, apps, or search engines such as Instagram, Flickr, Google Images, etc. I think pictures in the wild will play a major role in future CV studies because of the quantity of data, the available computation resources, and the learning power of PTMs.

Reference

[1] http://yann.lecun.com/exdb/mnist/.

[2] http://ufldl.stanford.edu/housenumbers/.

[3] https://www.cs.toronto.edu/~kriz/index.html.

[4] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, JMLR Workshop and Conference Proceedings, 2011.

[5] https://cs.stanford.edu/~acoates/stl10/.

[6] http://www.vision.caltech.edu/Image_Datasets/Caltech101/.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, IEEE, 2009.

[8] G. A. Miller, WordNet: An electronic lexical database. MIT press, 1998.

[9] https://image-net.org/.

[10] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a large video database for human motion recognition,” in Proceedings of the International Conference on Computer Vision (ICCV), 2011.

[11] https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/.

[12] https://www.crcv.ucf.edu/data/UCF101.php.

[13] https://www.crcv.ucf.edu/data/UCF50.php.

[14] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in European conference on computer vision, pp. 446–461, Springer, 2014.

[15] T. Berg, J. Liu, S. Woo Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur, “Birdsnap: Large-scale fine-grained visual categorization of birds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[16] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 3485–3492, IEEE, 2010.

[17] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, “Sun database: Exploring a large collection of scene categories,” International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.

[18] http://places.csail.mit.edu/downloadData.html.

[19] http://ai.stanford.edu/~jkrause/cars/car_dataset.html.

[20] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” tech. rep., 2013.

[21] https://sites.google.com/site/fgcomp2013/.

[22] https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/.

[23] https://www.robots.ox.ac.uk/~vgg/data/pets/.

[24] https://www.robots.ox.ac.uk/~vgg/data/flowers/.

[25] https://www.robots.ox.ac.uk/~vgg/data/dtd/.

[26] https://sites.google.com/view/fgvc5/competitions/inaturalist.

[27] https://www.inaturalist.org/.

[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740–755, Springer, 2014.

[29] http://host.robots.ox.ac.uk/pascal/VOC/.

[30] http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html.

[31] http://host.robots.ox.ac.uk/pascal/VOC/voc2011/index.html.

[32] http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html.

[33] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641, 2017.

[34] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019.

[35] https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html.

[36] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset,” in CVPR Workshop on The Future of Datasets in Vision, 2015.

[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[38] A. Gupta, P. Dollar, and R. Girshick, “LVIS: A dataset for large vocabulary instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[39] https://davischallenge.org/.

[40] https://davischallenge.org/davis2017/code.html.

[41] C. Doersch, “Data analysis project: What makes Paris look like Paris?”.

[42] http://www.cs.toronto.edu/~nitish/unsupervised_video/.

[43] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using lstms,” in International conference on machine learning, pp. 843–852, PMLR, 2015.

[44] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.

[45] http://projects.dfki.uni-kl.de/yfcc100m/.