Introduction
This blog is an excerpt from my own paper "A survey on pre-trained models in text, image and graph: powerful self-supervised deep learning via big data", which introduces the powerful pre-trained models (PTMs) of the Self-Supervised Learning (SSL) era. This blog serves as a synopsis of SSL benchmark datasets in computer vision (CV).
Classification
MNIST[1] is a database of handwritten digits containing 60,000 training examples and 10,000 testing examples. The images are a fixed size of 28×28 pixels. Pixel values range from 0 to 255, where values near 0 can be understood as background (white) and 255 means foreground (black). The labels run from 0 to 9, and exactly one digit appears in each image. Both traditional and deep learning methods are benchmarked on this most popular dataset, even though advanced methods now show near-perfect results. Thus, Geoffrey Hinton has described it as "the drosophila of machine learning".
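The pixel convention above can be sketched in a few lines; a minimal illustration (using a tiny 2×2 stand-in for a real 28×28 digit) of rescaling the 0–255 values to [0, 1] and separating foreground from background with an assumed threshold of 128:

```python
def normalize_pixels(image):
    """Rescale an image of 0-255 ints to floats in [0.0, 1.0]."""
    return [[value / 255.0 for value in row] for row in image]

def foreground_mask(image, threshold=128):
    """Mark pixels as foreground (True) at or above the threshold.

    The threshold of 128 is an illustrative choice, not part of MNIST.
    """
    return [[value >= threshold for value in row] for row in image]

# A tiny 2x2 "image" stands in for a real 28x28 MNIST digit.
tiny = [[0, 255], [64, 200]]
print(normalize_pixels(tiny))
print(foreground_mask(tiny))  # [[False, True], [False, True]]
```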
Also in the digit domain, The Street View House Numbers (SVHN)[2] dataset collects real-world digits from house numbers in Google Street View images. There are 73,257 digits for training, 26,032 digits for testing, and 531,131 additional, somewhat less difficult samples; all of them are color images with both class labels and character-level bounding boxes.
As advanced methods achieve near-perfect results on simple datasets, more sophisticated datasets such as CIFAR-10/CIFAR-100[3] were constructed. These two datasets are closer to real-world objects. CIFAR-10 consists of 50,000 training images and 10,000 testing images, with 6,000 images per class and 32×32 pixels in each RGB colour image. CIFAR-100 is similar to CIFAR-10 but with more detailed label information: there are 100 classes containing 500 training images and 100 testing images each. In addition, these 100 "fine" classes are grouped equally into 20 "coarse" classes, so researchers can adapt the dataset to suitable learning methods.
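The fine-to-coarse grouping can be made concrete with a small sketch. The two superclasses below are real CIFAR-100 groupings, but only an illustrative subset; the full 100-to-20 mapping ships with the dataset's metadata:

```python
# Two of CIFAR-100's 20 "coarse" superclasses and their "fine" classes.
COARSE_TO_FINE = {
    "aquatic_mammals": ["beaver", "dolphin", "otter", "seal", "whale"],
    "flowers": ["orchid", "poppy", "rose", "sunflower", "tulip"],
}

# Invert the grouping so a fine label can be mapped to its superclass.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in COARSE_TO_FINE.items()
    for fine in fines
}

print(FINE_TO_COARSE["dolphin"])  # aquatic_mammals
print(FINE_TO_COARSE["tulip"])    # flowers
```

Training on coarse labels and evaluating on fine labels (or vice versa) is one way researchers adapt the dataset to different learning setups.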
Inspired by the CIFAR-10 dataset, STL-10[4] is another color image dataset containing similar real-world classes. Each of its 10 classes has 500 training images and 800 test images, at a higher resolution of 96×96 pixels. The biggest difference is that STL-10 provides 100,000 unlabeled images for unsupervised learning. More construction details can be found in [5].
Caltech-101[6] collects roughly 9,000 color images of objects belonging to 101 categories, with 40 to 800 images per category and about 50 on average. The outlines of the objects in the pictures are annotated for the convenience of different learning methods.
ImageNet[7] is one of the most popular large-scale datasets in computer vision. It is built according to the hierarchical structure of WordNet[8]. The full ImageNet dataset contains more than 14 million images and over 21,000 indexed synsets, attaching on average 500–1,000 images to illustrate each synset. The most frequently used subset of ImageNet is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset, run from 2010 to 2017 and containing tasks of classification, localization, and detection. The number of samples in the training and testing datasets and the labels of the images are determined by the specific task; more details are given in [9].
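The WordNet structure means every label sits on a chain of increasingly general hypernyms. A minimal sketch of walking such a chain, using a made-up toy hierarchy rather than ImageNet's real synset IDs:

```python
# Toy child -> parent hierarchy; illustrative only, not real WordNet synsets.
PARENT = {
    "husky": "dog",
    "beagle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
}

def hypernym_chain(synset):
    """Walk child -> parent links up to the root of the toy hierarchy."""
    chain = [synset]
    while synset in PARENT:
        synset = PARENT[synset]
        chain.append(synset)
    return chain

print(hypernym_chain("husky"))  # ['husky', 'dog', 'canine', 'mammal', 'animal']
```

In ImageNet, images attached to a synset implicitly belong to every ancestor synset as well, which is what makes hierarchy-aware evaluation possible.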
In addition to the popular MNIST, many domain-specific datasets are used for downstream classification tasks.
HMDB51[10][11] is an action video database with a total of 6,849 clips in 51 action classes, covering five types of facial actions and body movements. UCF101[12] is another action video dataset designed for more realistic action recognition; it extends the UCF50[13] dataset, which contains only 50 action categories, to 101 action categories collected from YouTube. What makes it a famous recognition benchmark is the workshop at ICCV'13 with UCF101 as its main competition benchmark. Food-101[14] is a real-world food dataset of 101 food categories, with 750 and 250 images per class in the training and testing datasets respectively. Birdsnap[15] is a large-scale fine-grained visual categorization dataset of birds, with bounding boxes and the locations/annotations of parts of each object. It contains 49,829 images of the 500 most common species in North America, with each species containing 69 to 100 images and most species having 100. In addition, some images are also labeled as male or female, immature or adult, and breeding or non-breeding plumage.

To target scene categorization, the extensive Scene UNderstanding (SUN)[16][17] database fills the gap left by existing datasets with a limited scope of categories. This database contains 899 categories and 130,519 images; only images with more than 200×200 pixels were kept. SUN397 is a well-sampled subset that maintains the 397 categories with at least 100 images per category, while the other categories, containing relatively few unique photographs, are discarded. The Places205[18] dataset is another large-scale scene dataset, consisting of 2.5 million images from 205 scene categories.

The Cars[19] dataset contains 16,185 color images of 196 classes of cars (at the level of Make, Model, Year). For convenience, this dataset is split into training and test sets of roughly equal size. Aircraft[20] is another fine-grained visual classification dataset designed for aircraft (also known as FGVC-Aircraft).
A popular form of this dataset is the fine-grained recognition challenge 2013 (FGComp2013)[21], which ran in parallel with ILSVRC2013. A four-level hierarchy, Model, Variant, Family, Manufacturer, from finer to coarser, organizes the database; more detailed information is given in [22]. Pets[23] refers to The Oxford-IIIT Pet Dataset, which collects 37 pet categories with roughly 200 images per category. All images have an associated ground-truth annotation of breed for classification, head ROI for detection, and pixel-level trimap for segmentation. Similarly, Flowers[24] is another domain dataset, also collected by Oxford; it comprises Oxford-17 Flowers with 17 categories and Oxford-102 Flowers with 102 categories. The Describable Textures Dataset (DTD)[25] is an evolving collection of textural images in the wild, consisting of 5,640 images in 47 categories, with 120 images per category. iNaturalist2018[26] is a large-scale species classification competition conducted at the FGVC5 workshop at CVPR2018. Its dataset contains over 8,000 species categories, with more than 450,000 images in the training and validation datasets collected from iNaturalist[27].
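The classification benchmarks above are typically scored with top-1 and top-5 accuracy: a prediction counts as correct if the true label appears among the k highest-scoring classes. A minimal sketch with made-up scores:

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scored classes."""
    hits = 0
    for per_class_scores, label in zip(scores, labels):
        # Rank class indices by descending score.
        ranked = sorted(range(len(per_class_scores)),
                        key=lambda c: per_class_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Two samples over three classes; scores and labels are illustrative.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 2]
print(top_k_accuracy(scores, labels, 1))  # 0.5
print(top_k_accuracy(scores, labels, 3))  # 1.0
```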
Detection
COCO[28] is a large-scale dataset for object detection, segmentation, and captioning; it contains 330K RGB images, more than 200K of them labelled. There are 1.5 million object instances across 80 object categories. Thus, it is one of the most popular benchmark datasets for detection and segmentation, in parallel with the following PASCAL VOC.
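Detection benchmarks like COCO and PASCAL VOC match predicted boxes to ground truth by intersection-over-union (IoU); COCO additionally averages precision over a range of IoU thresholds. A minimal IoU sketch, with boxes given as (x1, y1, x2, y2) corners:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
```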
The PASCAL VOC project[29] provides standardized image datasets for object class recognition and ran challenges evaluating performance on object class recognition from 2005 to 2012. The main datasets used in self-supervised learning are VOC07, VOC11, and VOC12. The main competitions in VOC07[30] comprise classification and detection tasks; both cover 20 object classes, and each image contains at least one object. Thus, VOC07 commonly serves as the downstream task for detection.
Segmentation
Both VOC11[31] and VOC12[32] contain classification, detection, and segmentation tasks in the main competition, leading to their common use as downstream tasks for segmentation.
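Segmentation downstream tasks are usually scored with mean intersection-over-union (mIoU) over classes, computed from per-pixel predictions. A minimal sketch over flattened label arrays (the labels below are illustrative):

```python
def mean_iou(pred, truth, num_classes):
    """mIoU over the classes that appear in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

# A four-pixel "image" with two classes; values are made up.
pred  = [0, 0, 1, 1]
truth = [0, 1, 1, 1]
print(mean_iou(pred, truth, 2))  # (1/2 + 2/3) / 2 = 7/12
```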
ADE20K[33][34] collects images from both the SUN and Places205 databases, with 20,210 images for training, 2,000 for validation, and 3,000 for testing. All objects present in the images are annotated. Notably, this dataset also annotates object parts and parts of parts, together with additional attributes, annotation times, and depth ordering, for the benefit of the research community.
NYU-Depth V2[35] is a dataset of images and video sequences from 464 indoor scenes, recorded by both RGB and depth cameras across 3 cities. It contains 1,449 images with ground-truth depth, for which the original RGB values are also provided. In addition, there are 407,024 new unlabeled frames, and the objects in the labeled images carry class labels.
Cityscapes[36][37] is a dataset of urban street scenes recorded in 50 cities, with ground truth for semantic segmentation. The main instances are vehicles, people, and construction. The high-quality dense pixel annotations cover 5,000 images; in addition to these fine annotations, coarser polygonal annotations are provided for a further set of 20,000 images. Moreover, the annotated frames are drawn from video sequences, so surrounding frames with consistently changing views are also available to researchers.
LVIS[38] is a dataset for large-vocabulary instance segmentation. Its features are that 1) each category or word in an image is related to a single segmentation object; 2) more than 1,000 categories are extracted from roughly 164k images; 3) a long-tail phenomenon exists across these categories; and 4) it provides more than 2 million high-quality instance segmentation masks.
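LVIS makes the long tail explicit by binning categories by how many training images they appear in: "rare" (1–10 images), "common" (11–100), and "frequent" (more than 100). A sketch of that binning, with made-up per-category counts:

```python
def frequency_bin(image_count):
    """LVIS-style frequency bin for a category seen in image_count images."""
    if image_count <= 10:
        return "rare"
    if image_count <= 100:
        return "common"
    return "frequent"

# Illustrative counts, not real LVIS statistics.
category_counts = {"crowbar": 4, "kettle": 57, "person": 20000}
bins = {name: frequency_bin(n) for name, n in category_counts.items()}
print(bins)  # {'crowbar': 'rare', 'kettle': 'common', 'person': 'frequent'}
```

Reporting accuracy separately per bin is what exposes how badly a model degrades on tail categories.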
Densely Annotated VIdeo Segmentation (DAVIS)[39] is a video dataset designed for in-depth analysis of the state of the art in video object segmentation. DAVIS 2017[40] contains both semi-supervised (human-guided at test time) and unsupervised (not human-guided at test time) video sequences with multiple annotated instances.
Others
The Paris StreetView dataset[41] is designed for the image inpainting task; it contains 14,900 training images and 100 test images collected from street views of Paris. Gathered from Google Street View, it mainly focuses on the buildings of the city.
Based on MNIST, Moving MNIST[42] is a video dataset designed for evaluating sequence prediction or reconstruction; it contains 10,000 sequences. Each video is 20 frames long and shows two digits (possibly overlapping) moving inside a 64×64 patch. The first benchmark was reported in [43] using LSTMs.
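The motion in Moving MNIST is simple: digits drift at constant velocity and bounce off the patch borders. A 1-D sketch of that bounce logic for a single coordinate, assuming the usual 64-pixel patch and 28-pixel digit crops:

```python
def step(position, velocity, size=64, digit=28):
    """Advance one frame, reflecting the velocity at the patch borders."""
    position += velocity
    if position < 0:
        position, velocity = -position, -velocity
    elif position > size - digit:
        # Reflect off the far wall: the digit's left edge cannot pass size - digit.
        position, velocity = 2 * (size - digit) - position, -velocity
    return position, velocity

# Start near the right wall so the first step triggers a bounce.
pos, vel = 34, 3
trajectory = []
for _ in range(5):
    pos, vel = step(pos, vel)
    trajectory.append(pos)
print(trajectory)  # [35, 32, 29, 26, 23]
```

A real generator would apply this update independently to the x and y coordinates of each of the two digits, compositing their crops into the frame.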
The Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset[44][45] is the largest public multimedia collection that users can search for their own targets, browsing both images and videos. It is free for researchers to explore and investigate, and subsets of the complete dataset can be retrieved by any keyword search and reviewed directly in real time. In addition, the text information attached to each image or video is abundant, including location information and user tags. Briefly, it is more a multimedia library than a domain dataset.
A more generalized dataset concept in the self-supervised learning era comprises multimedia websites, apps, and search engines such as Instagram, Flickr, Google Images, etc. I think pictures in the wild will play a major role in future CV research because of the quantity of data, the available computation, and the learning power of PTMs.
References
[1] http://yann.lecun.com/exdb/mnist/.
[2] http://ufldl.stanford.edu/housenumbers/.
[3] https://www.cs.toronto.edu/~kriz/index.html.
[4] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, JMLR Workshop and Conference Proceedings, 2011.
[5] https://cs.stanford.edu/~acoates/stl10/.
[6] http://www.vision.caltech.edu/Image_Datasets/Caltech101/.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, Ieee, 2009.
[8] G. A. Miller, WordNet: An electronic lexical database. MIT press, 1998.
[10] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a large video database for human motion recognition,” in Proceedings of the International Conference on Computer Vision (ICCV), 2011.
[11] https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/.
[12] https://www.crcv.ucf.edu/data/UCF101.php.
[13] https://www.crcv.ucf.edu/data/UCF50.php.
[14] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in European conference on computer vision, pp. 446–461, Springer, 2014.
[15] T. Berg, J. Liu, S. Woo Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur, “Birdsnap: Large-scale fine-grained visual categorization of birds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[16] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 3485–3492, IEEE, 2010.
[17] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, “Sun database: Exploring a large collection of scene categories,” International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016.
[18] http://places.csail.mit.edu/downloadData.html.
[19] http://ai.stanford.edu/~jkrause/cars/car_dataset.html.
[20] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” tech. rep., 2013.
[21] https://sites.google.com/site/fgcomp2013/.
[22] https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/.
[23] https://www.robots.ox.ac.uk/~vgg/data/pets/.
[24] https://www.robots.ox.ac.uk/~vgg/data/flowers/.
[25] https://www.robots.ox.ac.uk/~vgg/data/dtd/.
[26] https://sites.google.com/view/fgvc5/competitions/inaturalist.
[27] https://www.inaturalist.org/.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740–755, Springer, 2014.
[29] http://host.robots.ox.ac.uk/pascal/VOC/.
[30] http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html.
[31] http://host.robots.ox.ac.uk/pascal/VOC/voc2011/index.html.
[32] http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html.
[33] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641, 2017.
[34] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019.
[35] https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html.
[36] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset,” in CVPR Workshop on The Future of Datasets in Vision, 2015.
[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[38] A. Gupta, P. Dollar, and R. Girshick, “LVIS: A dataset for large vocabulary instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[39] https://davischallenge.org/.
[40] https://davischallenge.org/davis2017/code.html.
[41] C. Doersch, “Data analysis project: What makes paris look like paris?”.
[42] http://www.cs.toronto.edu/~nitish/unsupervised_video/.
[43] N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsupervised learning of video representations using lstms,” in International conference on machine learning, pp. 843–852, PMLR, 2015.
[44] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.