Although multimodal Large Language Models are not specifically designed and trained for deepfake detection, the extensive world knowledge embedded within their neural networks and the semantic reasoning that governs them could, in the future, be applied to this field.

Online multimedia content created using AI-manipulated images, audio, and video is increasingly being used as a tool for misinformation and reality distortion across various fields. Consequently, it is now seen as a threat and has become a persistent cause for concern.

The latest striking example occurred between 8 December 2023 and 8 January 2024, when more than one hundred highly realistic deepfake video ads circulated on Facebook. These videos featured the British Prime Minister, Rishi Sunak, delivering speeches that were entirely fabricated and did not reflect his actual views. This clear attempt at political manipulation came roughly six months before the UK general election of 4 July 2024.

The British newspaper “The Guardian”, reporting on the incident, highlighted that «the fake videos could have reached up to 400,000 people». The deepfakes also included a video in which a well-known BBC journalist, while reading the latest news, announced a fabricated scandal involving Sunak, «accused of secretly pocketing enormous sums from a public works project».

This phenomenon is increasingly facilitated by generative AI tools that are accessible to everyone, requiring no specific technical knowledge and minimal financial investment. Essentially, anyone with little effort and money can generate fake images, videos, and audio content and distribute them online, passing them off as genuine.

The future challenge, as warned by analysts at the World Economic Forum in “4 ways to future-proof against deepfakes in 2024 and beyond” (12 February 2024), will be posed by real-time deepfake generation by AI chatbots capable of creating highly personalised and even more precise manipulations.

Thus, the risks are likely to increase with the evolution of artificial intelligence, necessitating equally innovative deepfake detection systems.


The World Economic Forum has identified deepfakes as the new cyber threat and has warned of future real-time and highly personalized online content manipulations by advanced AI chatbots. Future automatic detection systems must be equally adept at identifying these manipulations.
A study led by the University at Buffalo, New York, examined ChatGPT-4 Vision’s ability to identify deepfakes of human faces, highlighting its versatility and the clear, simple explanations it provides during the detection process.
In the future, the improved performance of GPT-4V in recognising counterfeit images, combined with the development of more sophisticated prompts, could lay the groundwork for interactive discussions on deepfake recognition between the multimodal Large Language Model and users.

Current methods for detecting deepfakes

Existing approaches leverage machine learning models, particularly deep neural networks, trained using data from online media sources containing recognised deepfakes.

Recent research has primarily focused on developing algorithms to detect specific facial landmarks, such as the eyes and mouth. Additionally, extensive studies have been conducted on image and video manipulation techniques that make deepfakes so convincing and realistic. These techniques include systems based on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), the latter being capable of generating entirely novel audio and video data [source: “Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics” – arXiv, 2020].
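To make the generation side concrete, the sketch below shows, in broad strokes, the adversarial training loop on which GAN-based face synthesis rests: a generator learns to produce images that a discriminator can no longer tell apart from real ones. It is a minimal, illustrative PyTorch example; the network sizes, image resolution, and hyperparameters are assumptions, not those of any specific deepfake tool.

```python
# Minimal, illustrative sketch of the adversarial setup behind GAN-based
# face synthesis (PyTorch). Network sizes and hyperparameters are assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 128
IMG_PIXELS = 64 * 64 * 3  # flattened 64x64 RGB face crop

generator = nn.Sequential(          # maps random noise to a synthetic image
    nn.Linear(LATENT_DIM, 512), nn.ReLU(),
    nn.Linear(512, IMG_PIXELS), nn.Tanh(),
)
discriminator = nn.Sequential(      # scores how "real" an image looks
    nn.Linear(IMG_PIXELS, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(real_images: torch.Tensor) -> None:
    """One adversarial step: the discriminator learns to separate real faces
    from generated ones, while the generator learns to fool it."""
    batch = real_images.size(0)
    fake_images = generator(torch.randn(batch, LATENT_DIM))

    # Discriminator update: real images should score 1, fakes 0
    d_loss = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: make fakes that the discriminator scores as real
    g_loss = bce(discriminator(fake_images), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```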

A joint study by the University of Naples Federico II and the Technical University of Munich (“ID-Reveal: Identity-aware DeepFake Video Detection” – Computer Vision Foundation, 2021) highlights that the latest deepfake detection algorithms, primarily trained to identify specific methods of falsification, exhibit limited generalisation across various types of facial manipulations. The authors identify this as the current challenge. To enhance generalisation, the team proposes a machine learning system that does not require training data from fake images and videos but instead trains solely on real videos and utilises «high-level semantic features».

Meanwhile, a methodology termed “multi-attention” is presented by the University of Science and Technology of China in collaboration with Microsoft Cloud AI (“Multi-attentional Deepfake Detection” – Computer Vision Foundation, 2021). This methodology combines self-attention mechanisms, spatial attention, and temporal attention, enabling the AI model to focus «on essential regions while filtering out extraneous data, thus acquiring both global and local contextual information within the manipulated videos».
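As a rough illustration of the idea only (this is not the authors’ actual architecture), the PyTorch sketch below combines attention over regions within each frame with attention over frames within a clip before producing a single real/fake score; the feature dimensions and module choices are assumptions.

```python
# Loose sketch of combining spatial and temporal attention over video-frame
# features (PyTorch). Not the paper's architecture; dimensions are assumptions.
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, feat_dim: int = 256, heads: int = 4):
        super().__init__()
        # Spatial attention: weighs regions within each frame (e.g. eyes, mouth)
        self.spatial = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        # Temporal attention: weighs frames across the clip
        self.temporal = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, 1)  # real-vs-fake logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, regions, feat_dim) - per-region CNN features
        b, t, r, d = x.shape
        # Attend over regions within every frame
        spatial_in = x.reshape(b * t, r, d)
        spatial_out, _ = self.spatial(spatial_in, spatial_in, spatial_in)
        frame_feats = spatial_out.mean(dim=1).reshape(b, t, d)
        # Attend over frames within the clip
        temporal_out, _ = self.temporal(frame_feats, frame_feats, frame_feats)
        clip_feat = temporal_out.mean(dim=1)
        return self.classifier(clip_feat)  # higher = more likely manipulated

# Usage sketch: 2 clips, 8 frames, 49 regions (7x7 feature map), 256-dim features
model = SpatioTemporalAttention()
logits = model(torch.randn(2, 8, 49, 256))
```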

The importance of audio signals in detecting deepfakes is underscored by Facebook AI in “Joint Audio-Visual Deepfake Detection” (IEEE – Institute of Electrical and Electronics Engineers, 2021), which points out that manipulated videos often display discrepancies between audio and video components due to challenges in synchronising fake audio with the false visual content.

The authors propose a joint detection system that simultaneously examines both audio and video elements within the same multimedia content. The employed AI technique, deep learning, is tasked with «extracting relevant features from both modalities, integrating them, and making a joint decision on the content’s authenticity».
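A minimal sketch of that idea, assuming pre-extracted per-clip features rather than Facebook AI’s actual encoders, might look like the following: each modality is projected into a shared space, the two representations are concatenated, and a single classifier issues the joint authenticity decision.

```python
# Minimal sketch of joint audio-visual deepfake detection (PyTorch).
# The "encoders" are placeholder projections, not Facebook AI's networks.
import torch
import torch.nn as nn

class JointAudioVisualDetector(nn.Module):
    def __init__(self, video_dim: int = 512, audio_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)   # stands in for a video encoder
        self.audio_proj = nn.Linear(audio_dim, hidden)   # stands in for an audio encoder
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                        # joint real/fake logit
        )

    def forward(self, video_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # Fuse the two modalities and decide on the content's authenticity jointly
        fused = torch.cat([self.video_proj(video_feat),
                           self.audio_proj(audio_feat)], dim=-1)
        return self.head(fused)

# Usage sketch: pre-extracted clip-level features for a batch of 4 videos
detector = JointAudioVisualDetector()
logit = detector(torch.randn(4, 512), torch.randn(4, 128))
```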

The contribution of multimodal Large Language Models, including ChatGPT-4 Vision

Recently, a research group led by the University at Buffalo in New York reviewed current approaches to deepfake detection, considering the role of Large Language Models (LLMs). These versatile, widely applicable tools have emerged as significant contributors in recent years. LLMs belong to the branch of AI originally aimed at developing systems capable of generating text from a given linguistic input.

They are based on a neural network architecture called “transformer,” which employs an attention mechanism focusing not on individual input elements (such as individual words in a text) but on the structure in which these elements are embedded, capturing their relationships and context [source: “Transformers and Large Language Models” – Stanford University].
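The core of that attention computation can be written in a few lines. The NumPy sketch below shows standard scaled dot-product attention in a simplified, single-head form: each element of the input is re-expressed as a weighted combination of all elements, with the weights capturing how strongly they relate to one another.

```python
# Simplified, single-head scaled dot-product attention (NumPy), for illustration.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (tokens, dim) arrays of query, key, and value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # context-aware mix of tokens

# Usage sketch: 5 tokens, 8-dimensional embeddings
tokens = np.random.randn(5, 8)
contextualised = scaled_dot_product_attention(tokens, tokens, tokens)
```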

Trained on vast volumes of text from the Web, the best-known examples include conversational agents such as OpenAI’s ChatGPT, Google Gemini, and Meta’s open-source Large Language Models.

«In recent years, LLMs have demonstrated a strong ability to encode extensive knowledge bases from existing text corpora. This capability has been further extended to images and videos as recent LLMs introduce visual language models for multimodal understanding, exemplified by the latest ChatGPT-4 Vision», explains the University at Buffalo research team in the article “Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics,” published on arXiv on 11 June 2024.

Specifically, ChatGPT-4 Vision (or GPT-4V), released by OpenAI in September 2023, has augmented ChatGPT-4’s reading, writing, listening, and speaking abilities with the capability to “see,” allowing it to decode input images alongside user-provided text instructions. The US researchers aim to test GPT-4V’s performance in identifying AI-generated human faces through a series of experiments based on specific textual prompts.
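In practice, sending an image together with a textual instruction to a GPT-4 Vision-class model takes only a few lines with the OpenAI Python SDK. The sketch below is a hedged illustration, not the study’s actual pipeline: the model name (“gpt-4o” here, standing in for any vision-capable GPT-4 model) and the prompt wording are assumptions.

```python
# Hedged sketch of sending an image plus a text instruction to a GPT-4
# Vision-class model via the OpenAI Python SDK. Model name and prompt wording
# are illustrative, not those used in the study.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_if_ai_generated(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this image of a human face AI-generated? "
                         "Answer Yes or No, then explain your reasoning."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Usage sketch (hypothetical file path):
# print(ask_if_ai_generated("face_0001.jpg"))
```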

Why ChatGPT-4V is potentially useful for deepfake detection

The experiments conducted by the authors are preliminary but have already provided significant insights into multimodal Large Language Models (LLMs), of which ChatGPT-4V is an extension. 

Firstly, these models have the ability to distinguish between natural images and AI-generated images. This capability stems from their semantic knowledge, which underpins the functions for which they were designed. Multimodal LLMs, such as GPT-4V, can read, see, listen, and “explain” in both written and oral forms.

Secondly, the distinction made by ChatGPT-4 Vision between “genuine” human faces and deepfakes is more comprehensible and interpretable for humans compared to traditional machine learning detection methods.

However, there is a caveat. The team finds GPT-4V’s ability to recognise AI-generated counterfeit images to be “satisfactory,” with a very high evaluation score. Its accuracy in recognising authentic images, on the other hand, is lower.

This performance gap arises because the absence of semantic inconsistencies in unmanipulated images is not, in itself, sufficient for an LLM like ChatGPT-4 Vision to automatically confirm their authenticity and naturalness.

«The automatic detection capabilities of these LLMs cannot be fully exploited through simple binary instructions, which can lead to a refusal to provide clear answers», clarify the researchers from the University at Buffalo, who suggest more effective and incisive prompts to «maximise the potential of ChatGPT-4 Vision in differentiating between real and AI-generated images».

Initial experimental data

The tests conducted by the working group were based on a thousand images of real human faces from the FFHQ (Flickr-Faces-HQ) dataset and two thousand images of human faces created by generative AI models, all subjected to detection by GPT-4V:

«For each batch of input face images, there is a text prompt that requires a Yes/No response on whether the images are AI-generated or, conversely, authentic, accompanied – in case of an affirmative response – by explanations from the machine».

In the examples shown below, on the left (in the pink area) are a series of deepfake images, while on the right (in the green area) are real images (the system is only asked if they are fake or not). Both successful detections by GPT-4V (with green check marks) and failures (with red crosses) are displayed.

Figure showing examples of ChatGPT-4 Vision’s analysis of deepfake images with the corresponding prompts, which request a Yes/No answer on whether the images are AI-generated or, conversely, authentic, accompanied by explanations from the machine: on the left (pink area) are the cases in which the input images were generated by AI models, while on the right (green area) are the cases of real images. Both GPT-4V’s successes (green check marks) and failures (red crosses) are shown [Credit: “Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics” – University at Buffalo, New York State – https://arxiv.org/pdf/2403.14077].

When compared to the performance of traditional methods, the authors comment that, overall, ChatGPT-4 Vision’s results are «slightly better, but not competitive with the former». The noteworthy aspect of ChatGPT-4 Vision lies in the simple yet clear explanations it is able to provide alongside the detection process.

There is a fundamental difference in the way the two approaches work: deepfake detection models identify signal-level statistical differences between the real and the manipulated media seen during training. In contrast, «the decisions of the multimodal Large Language Model are based on anomalies detected at the semantic level, as evidenced by the explanations provided in the responses», since ChatGPT-4 Vision is trained on extensive datasets of images paired with brief descriptive texts, learning to relate images and words.

As previously highlighted, the advantage of the LLM mechanism lies in the fact that semantic reasoning leads to results (and explanations) that are more comprehensible to humans.

Most of the errors made by GPT-4V during this initial part of the experiments occurred in the detection of real images, the researchers note, with a classification accuracy of about 50%, «drastically different from that which characterised the detection of AI-generated images, which is over 90%».
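For illustration, such per-class accuracies could be tallied from the model’s Yes/No answers with a helper like the hypothetical ask_if_ai_generated() sketched earlier; the file lists and the answer parsing below are assumptions, not the study’s actual evaluation code.

```python
# Hypothetical tally of per-class accuracy from Yes/No answers; reuses the
# ask_if_ai_generated() helper sketched earlier (an assumption, not the
# study's pipeline).
def evaluate(real_paths: list[str], fake_paths: list[str]) -> dict[str, float]:
    def flagged_as_fake(path: str) -> bool:
        # Treat any answer beginning with "yes" as "AI-generated"
        return ask_if_ai_generated(path).strip().lower().startswith("yes")

    correct_on_fake = sum(flagged_as_fake(p) for p in fake_paths)
    correct_on_real = sum(not flagged_as_fake(p) for p in real_paths)
    return {
        "accuracy_on_fake": correct_on_fake / len(fake_paths),
        "accuracy_on_real": correct_on_real / len(real_paths),
    }
```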

Glimpses of Futures

Although multimodal Large Language Models (LLMs) in general, and ChatGPT-4 Vision in particular, were not originally designed to detect deepfakes of human faces, their extensive world knowledge might one day be applied to this task, making final decisions more accessible and user-friendly.

Let us now attempt to foresee possible future scenarios by analysing the impacts of this evolving approach from multiple perspectives using the STEPS matrix.

S – SOCIAL: in a future where research expands the applications of ChatGPT-4 Vision in the realm of automatic deepfake detection—encompassing not only the analysis of potentially manipulated human face images but also falsified video and audio content—an integrated approach to combating the spread of false information on the web, generated using artificial intelligence techniques, becomes conceivable. Advances in multimodal LLMs for the automatic processing of all types of multimedia content could one day support forensic media analysis activities. Here, the semantic reasoning of LLMs and their explanatory capabilities regarding the decisions made would offer greater assistance to operators compared to the opacity of traditional machine learning systems.

T – TECHNOLOGICAL: noting that the team’s experiments are in their infancy and have only tested simple queries so far, future efforts must focus on developing strategies for crafting more conceptually complex prompts to present to the LLM. The goal is to move beyond binary instructions (used in traditional deepfake detection systems), which do not yield clear responses from ChatGPT-4 Vision for tasks distinguishing between real images and AI-generated ones. Achieving this milestone could lay the groundwork for interactive conversations with GPT-4V on deepfake recognition, aimed at obtaining increasingly rich and relevant responses from the machine. This would guide users towards a deeper understanding of manipulated content, opening the door to future, possible human-machine interactions against this phenomenon and its consequences.
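By way of illustration only, a less binary, more structured instruction of the kind the researchers envisage might look like the hypothetical prompt below; the wording is an assumption, not taken from the paper.

```python
# Hypothetical example of a richer, non-binary prompt for GPT-4 Vision-class
# models; the wording is illustrative and not taken from the study.
richer_prompt = (
    "Examine this face image as a media-forensics analyst. "
    "1) List any semantic inconsistencies you notice (asymmetric earrings, "
    "irregular teeth, blended hairlines, unnatural lighting or reflections). "
    "2) Rate from 0 to 10 how likely it is that the image is AI-generated. "
    "3) Explain which of the listed cues weighed most in your rating."
)
```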

E – ECONOMIC: «Deepfakes increasingly pose a threat to businesses», highlight analysts from the World Economic Forum in the article “How can we combat the worrying rise in the use of deepfakes in cybercrime?”. In 2022, according to their data, 66% of cybersecurity professionals worldwide experienced deepfake attacks within their organisations, not without economic damage. «Deepfakes are costly», as evidenced in 2022 by 26% of small businesses and 38% of large companies globally, which suffered deepfake frauds resulting in losses of up to $480,000. This presents a new cybersecurity challenge, urgently needing mitigation with innovative and high-performing defence measures. In the future, applying ChatGPT-4 Vision could provide a real-time, automatic deepfake detection tool, enriched with increasingly comprehensive explanations to users, making them more prepared and aware of the dangers (including economic impacts) associated with skillfully counterfeited multimedia content.

P – POLITICAL: the pursuit of new methodologies for the automatic detection of deepfakes in all types of multimedia content, as proposed by the US research group, must be supported by a solid regulatory framework. Specifically, in Europe, it is noteworthy that in February 2024, the EU Commission backed the drafting of a new white paper on the challenges posed by generative AI, along with the ethical issues related to its use for non-virtuous purposes. On 21 May 2024, the AI Act was definitively approved (expected to be published in the Official Journal by July 2024), which categorises AI systems capable of manipulating people through subliminal or deceptive techniques as “unacceptably high risk.” The new white paper on generative artificial intelligence also highlights the «visible power imbalance between content creators, academics, and citizens on one side and major tech companies (such as OpenAI, Microsoft, Google, and Meta) that develop and sell generative AI models on the other».

S – SUSTAINABILITY: whenever Large Language Models are discussed, the negative impacts of their application on environmental sustainability come to mind. Whether the task is decoding written and spoken human language, generating text and speech, or, as in the case of ChatGPT-4 Vision, the multimodal understanding of input images, the processes enabling these functionalities require millions of hours of training and processing, which correspond to high CO2 emissions. This issue, of course, is not exclusive to large language models but pertains more generally to all artificial intelligence techniques and the digital world, whose carbon footprint is the flip side of the coin, and about which we still know little in terms of detailed and transparent reporting.

Written by: