A recent MIT study critically examines the image datasets commonly used to train modern artificial vision models, finding them overly simplistic. This simplicity can lead to inadequate results when training computer vision systems for precise image recognition and object identification within scenes.

The way computer vision systems are trained to recognise images quickly, and therefore the objects that populate the scene being analysed, has a fundamental flaw.

This issue was highlighted by researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds and Machines (CBMM). They authored “How hard are computer vision datasets? Calibrating dataset difficulty to viewing time”, presented at the Neural Information Processing Systems (NeurIPS) conference in New Orleans from December 10 to 16, 2023.

Fundamentally, in AI studies, ‘recognizing’ an image involves identifying things, people, and places within it. This is a basic requirement for artificial vision models, leading to more advanced operations like image classification, segmentation, and analysis of interactions and movements within the scene.

The MIT team points out a key flaw: despite recent efforts to enhance AI models for image recognition in terms of accuracy and processing time, the standard training datasets still consist of overly simplistic images.

Dataset creators tend to exclude more challenging images, leading to a bias towards simpler images and an overestimation of performance in controlled environments. However, real-world performance, especially in scenarios with distorted shapes, low definition, occlusions, or unusual spatial distributions, is what truly matters.


Traditionally, standard datasets for training artificial vision systems emphasized quantity. Today, the complexity and difficulty of image data can no longer be overlooked.
Drawing inspiration from the extended human processing time for ‘difficult’ images, the MIT researchers have developed a method to calculate the difficulty level of training data.
Tests using the ImageNet and ObjectNet datasets supported the hypothesis that both are biased towards easily recognizable images.

Computer Vision and Image Recognition: Assessing Training Data Difficulty

The efficacy of AI systems depends heavily on the quality of training data. This is especially true in computer vision and image recognition, with applications ranging from autonomous driving to diagnostic imaging and advanced video surveillance.

The authors highlight a lack of difficulty level information in standard training datasets. This absence makes it hard to objectively assess an artificial vision system’s progress and its approximation to human performance.

Historically, dataset compilation focused on size, with a ‘bigger is better’ approach, neglecting the ‘complexity’ inherent to human vision.

However, by focusing on methods to measure image difficulty, it’s possible to calibrate datasets, leading to more balanced AI system performance, the researchers note.

The ‘Minimum Viewing Time’ Metric

Some images take longer to be processed, recognised and classified by the human visual system. This extra time can be due, for example, to poor lighting, blur, or a cluttered, crowded scene in which objects overlap, sit outside the foreground, or are partly concealed.

Based on this principle, the authors of the study on computer vision and image recognition developed a metric called “Minimum Viewing Time” (MVT) to “quantify image recognition difficulty based on viewing time before correct identification”, they explain.
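By way of illustration, here is a minimal sketch, in Python, of how such a per-image metric could be estimated from human trial records. The trial data layout, the 50% accuracy threshold, and the function name are assumptions made for the example; this is not the authors’ released code.

```python
# Minimal sketch (not the authors' released tooling): estimate a
# Minimum Viewing Time per image as the shortest presentation time at
# which viewers identify it correctly often enough. The 50% accuracy
# threshold and the trial record format are illustrative assumptions.
from collections import defaultdict

# Each trial: (image_id, presentation time in ms, answered correctly?)
trials = [
    ("img_001", 17, False), ("img_001", 50, True), ("img_001", 150, True),
    ("img_002", 17, True),  ("img_002", 50, True),
]

def minimum_viewing_time(trials, threshold=0.5):
    """Shortest presentation time at which accuracy reaches `threshold`."""
    by_image = defaultdict(lambda: defaultdict(list))
    for image_id, ms, correct in trials:
        by_image[image_id][ms].append(correct)
    mvt = {}
    for image_id, by_time in by_image.items():
        for ms in sorted(by_time):           # scan from shortest exposure up
            answers = by_time[ms]
            if sum(answers) / len(answers) >= threshold:
                mvt[image_id] = ms
                break
    return mvt

print(minimum_viewing_time(trials))  # {'img_001': 50, 'img_002': 17}
```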

The new metric was tested on a sample of people using subsets of ImageNet and ObjectNet. The former is a large set of real images taken from the Web (over 14 million, all labelled), specifically made for training in the field of computer vision; the latter is a similar dataset, but – unlike the former – the objects portrayed have completely random backgrounds, viewpoints and rotations.

ImageNet and ObjectNet: A Critical Examination of Two Benchmark Datasets

During the experiment, participants were briefly shown images on a screen, for durations ranging from 17 milliseconds to 10 seconds. Their task was to accurately identify the object from a choice of 50 options.

Images identified from brief glimpses were classified as ‘easy’, whereas those requiring longer observation were categorized as ‘difficult’. The primary goal was to determine the difficulty of images from the ImageNet and ObjectNet datasets, which the MIT researchers suspected of under-representing complexity; this hypothesis formed the basis of the study. After more than 200,000 trials, it became evident that both datasets were skewed towards simpler images that could be recognized swiftly, with the majority of the test results stemming from images the participants identified easily.
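As a hypothetical illustration of that easy/difficult split, the short sketch below buckets images into difficulty bands by their measured MVT. The band edges in milliseconds are invented for the example and are not taken from the paper.

```python
# Hypothetical bucketing of images into difficulty bands by their MVT.
# The band edges (in milliseconds) are illustrative, not from the study.
DIFFICULTY_BANDS = [
    (17, "very easy"), (50, "easy"), (150, "moderate"),
    (600, "hard"), (10_000, "very hard"),
]

def difficulty_label(mvt_ms):
    """Map a minimum viewing time to a coarse difficulty label."""
    for upper_ms, label in DIFFICULTY_BANDS:
        if mvt_ms <= upper_ms:
            return label
    return "not recognized within 10 s"

for image_id, mvt_ms in {"img_001": 50, "img_002": 17, "img_003": 2500}.items():
    print(image_id, "->", difficulty_label(mvt_ms))
```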

This depicts the range of test images used to evaluate the ‘Minimum Viewing Time’ metric. The sequence of images transitions from the simplest on the left to the most complex on the right, with the respective minimum viewing times indicated above each image. This visual arrangement corresponds with the findings of the ‘How hard are computer vision datasets? Calibrating dataset difficulty to viewing time’ study, conducted by the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds and Machines (CBMM) at the Massachusetts Institute of Technology.

Upon completing the experiment, the team released the datasets employed, with the images labelled according to their recognition difficulty. Additionally, they provided a set of tools for automatically calculating the Minimum Viewing Time. This enables other research groups to incorporate this metric into existing benchmarks and broaden its application across diverse fields.
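To show what folding the metric into an existing benchmark could look like, here is a hedged sketch that reports a model’s accuracy per difficulty band rather than as a single aggregate number. The prediction records and band labels are assumed for the example; the authors’ actual tools are not reproduced here.

```python
# Sketch: score a model per MVT difficulty band instead of reporting one
# overall accuracy figure. The prediction records are invented examples.
from collections import defaultdict

# Each record: (image_id, model answered correctly?, difficulty band)
predictions = [
    ("img_001", True,  "easy"), ("img_002", True,  "easy"),
    ("img_003", False, "hard"), ("img_004", True,  "hard"),
]

per_band = defaultdict(lambda: [0, 0])   # band -> [correct, total]
for _, correct, band in predictions:
    per_band[band][0] += int(correct)
    per_band[band][1] += 1

for band in sorted(per_band):
    correct, total = per_band[band]
    print(f"{band}: {correct}/{total} = {correct / total:.0%}")
```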

Future Research in Computer Vision and Image Recognition

To enhance machine processing and classification of images, it’s crucial to correlate these operations with the difficulty indicated by the required ‘viewing time’. The aim is to generate more challenging (or easier) versions of training datasets.
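Here is a minimal sketch of what such calibration might look like in practice, assuming per-image MVT labels are available in a hypothetical mvt_labels.csv file; the cutoffs, file name and column names are illustrative assumptions, not the paper’s method.

```python
# Sketch: build easier and harder dataset splits from per-image MVT
# labels. The mvt_labels.csv file, its columns, and the cutoffs are
# hypothetical; only the idea of difficulty-calibrated splits is real.
import csv
import random

def load_mvt_labels(path):
    """Read {image_id: mvt_ms} from a CSV with 'image' and 'mvt_ms' columns."""
    with open(path, newline="") as f:
        return {row["image"]: float(row["mvt_ms"]) for row in csv.DictReader(f)}

def difficulty_splits(labels, easy_cutoff_ms=50, hard_cutoff_ms=600, n=1000):
    """Sample an easy and a hard evaluation split of up to n images each."""
    easy = [img for img, ms in labels.items() if ms <= easy_cutoff_ms]
    hard = [img for img, ms in labels.items() if ms >= hard_cutoff_ms]
    rng = random.Random(0)  # fixed seed so the splits are reproducible
    return (rng.sample(easy, min(n, len(easy))),
            rng.sample(hard, min(n, len(hard))))

# Usage (with a labels file in the assumed format):
# easy_split, hard_split = difficulty_splits(load_mvt_labels("mvt_labels.csv"))
```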

“This will help create more realistic benchmarks, improving artificial vision system performance and allowing fairer comparisons between AI and human visual perception,” the research team states.

Looking to the future, they suggest further adaptations to the recent experiment, proposing that “an MVT difficulty metric could be developed for simultaneously classifying multiple objects. Modifying our approach to mirror human capabilities in a wide array of visual tasks, especially under specific dataset and condition constraints, continues to be a formidable challenge, yet it is now one we consider solvable.”

Future Scenarios Preview

What should we expect – thirty, forty, fifty years from now – from a machine that perceives all visual stimuli in the real world (easy and difficult, simple and complex) better than our optical apparatus and then processes them even faster and more accurately than our brain?

Computer vision and image recognition is one of the most fascinating topics in AI, but it also raises a few eyebrows because of the ‘power’ that – in the distant future – its concrete applications might have.

Apart from the aforementioned autonomous driving and predictive maintenance in industry, it is in the medical field and in public video surveillance that the range of possible uses is currently hardest to gauge.

Just think of image analysis (X-rays, CT scans, MRIs, PET scans) in the early diagnosis of serious chronic, neurodegenerative and oncological diseases, where the smallest details still elude us today. Many lives could be saved, or at least the course of some diseases slowed further, by a computer vision system pushed to its full potential.

Cameras with an on-board video analysis system capable of analysing any type of scene in a very short time could, in 50 years’ time, be systematically used – in the public as well as the private sector – for predictive anti-crime analysis, and not only (as is the case today) for simple deterrence.

These are future scenarios that we can anticipate now by mapping their impacts, so that we are ready in time for the changes, even revolutions, that they will inevitably bring.
