Speaking images. A novel framework for the automated self-description of artworks

Authors: Valentine Bernasconi, Gustavo Marfia

Published: 2025-05-28 09:13:41+00:00

AI Summary

This paper presents a novel framework for creating "speaking images," short videos of artworks where depicted characters are animated to explain the artwork's content. It utilizes open-source large language models, face detection, text-to-speech, and audio-to-animation models to automatically generate these videos from digitized artwork.

Abstract

Recent breakthroughs in generative AI have opened the door to new research perspectives in the domain of art and cultural heritage, where a large number of artifacts have been digitized. There is a need for innovation to ease access to digital collections and to highlight their content. Such innovations develop into creative explorations of the digital image, in relation to its malleability and contemporary interpretation, in confrontation with the original historical object. Based on the concept of the autonomous image, we propose a new framework for the production of self-explaining cultural artifacts using open-source large language, face detection, text-to-speech, and audio-to-animation models. The goal is to start from a digitized artwork and automatically assemble a short video of it in which the main character is animated to explain its content. The whole process questions the cultural biases encapsulated in large language models, the potential of digital images and deepfakes of artworks for educational purposes, and the concerns of the field of art history regarding such creative diversions.


Key findings
The framework successfully generated speaking-image videos, though result quality varied with the artwork's visual features. LLM responses were sensitive to prompting and exhibited biases, while the audio-to-animation model produced good results but showed some image-quality degradation with longer audio clips. The study highlights challenges related to LLM biases, content moderation, and the ethical implications of animating artworks.
Approach
The framework involves four steps: 1) face detection and gender recognition using Deepface, 2) generating a first-person description using Llama 3.2, 3) converting text to speech using Kokoro, and 4) animating the detected face based on the audio using Hallo. The animated video is then integrated back into the original artwork.
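The paper does not include code, so the following is only a minimal sketch of how the four steps could be wired together with the named tools. The file names, artwork title, prompt text, Kokoro voice ids, the use of the Ollama client for Llama 3.2, and the Hallo CLI flags are illustrative assumptions, not the authors' implementation.

```python
import subprocess

import cv2
import numpy as np
import ollama                      # assumes a local Ollama server with llama3.2 pulled
import soundfile as sf
from deepface import DeepFace
from kokoro import KPipeline       # assumes the open-source Kokoro TTS package

ARTWORK = "artwork.jpg"            # hypothetical input file
TITLE = "Portrait of a Lady"       # hypothetical metadata fed to the LLM

# 1) Face detection and gender recognition with Deepface.
analysis = DeepFace.analyze(img_path=ARTWORK, actions=["gender"],
                            detector_backend="mtcnn")
main_face = analysis[0]                           # take the first detected face
x, y, w, h = (main_face["region"][k] for k in ("x", "y", "w", "h"))
gender = main_face["dominant_gender"]             # "Man" or "Woman"

artwork = cv2.imread(ARTWORK)
cv2.imwrite("face_crop.jpg", artwork[y:y + h, x:x + w])  # portrait crop for Hallo

# 2) First-person description with Llama 3.2 (here via the Ollama client).
prompt = (f"You are the main character depicted in the painting '{TITLE}'. "
          "Describe the scene around you in the first person, in three sentences.")
reply = ollama.chat(model="llama3.2",
                    messages=[{"role": "user", "content": prompt}])
description = reply["message"]["content"]

# 3) Text-to-speech with Kokoro; pick a voice matching the recognized gender.
voice = "af_heart" if gender == "Woman" else "am_adam"   # illustrative voice ids
tts = KPipeline(lang_code="a")                            # 'a' = American English
audio = np.concatenate([chunk for _, _, chunk in tts(description, voice=voice)])
sf.write("speech.wav", audio, 24000)                      # Kokoro outputs 24 kHz audio

# 4) Audio-driven animation of the face crop with Hallo (assumed CLI of the repo).
subprocess.run(["python", "scripts/inference.py",
                "--source_image", "face_crop.jpg",
                "--driving_audio", "speech.wav",
                "--output", "face_animated.mp4"], check=True)

# Composite each animated frame back into the original artwork.
clip = cv2.VideoCapture("face_animated.mp4")
writer = cv2.VideoWriter("speaking_image.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                         clip.get(cv2.CAP_PROP_FPS),
                         (artwork.shape[1], artwork.shape[0]))
while True:
    ok, frame = clip.read()
    if not ok:
        break
    canvas = artwork.copy()
    canvas[y:y + h, x:x + w] = cv2.resize(frame, (w, h))
    writer.write(canvas)
clip.release()
writer.release()
```

The compositing loop simply resizes each animated frame to the detected bounding box; the paper's own integration step may blend or mask the region differently.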
Datasets
A small dataset of 15 images, including portraits, religious images, and contemporary subjects, spanning various artistic periods.
Model(s)
Llama 3.2 (LLM), Deepface (face detection and gender recognition), Kokoro (text-to-speech), Hallo (audio-driven portrait animation), MediaPipe, DLIB, OpenCV, MTCNN
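MediaPipe, DLIB, OpenCV, and MTCNN appear in this list presumably because Deepface exposes them as interchangeable face-detector backends. A minimal sketch of switching between them (the artwork file name is an assumption):

```python
from deepface import DeepFace

# Deepface lets the detector backend be swapped without changing the rest of
# the pipeline; the detectors listed above map to these backend names.
for backend in ["mediapipe", "dlib", "opencv", "mtcnn"]:
    faces = DeepFace.extract_faces(img_path="artwork.jpg",
                                   detector_backend=backend,
                                   enforce_detection=False)
    print(f"{backend}: {len(faces)} face(s) detected")
```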
Author countries
Italy