STB-VMM: Swin Transformer Based Video Motion Magnification

Authors: Ricard Lado-Roigé, Marco A. Pérez

Published: 2023-02-20 14:21:56+00:00

AI Summary

This research introduces STB-VMM, a state-of-the-art video motion magnification model built on the Swin Transformer. STB-VMM surpasses existing methods by producing higher-quality outputs with less noise and blurriness and fewer artifacts, which enables more precise measurements in applications that rely on magnified video.

Abstract

The goal of video motion magnification techniques is to magnify small motions in a video to reveal previously invisible or unseen movement. Its uses extend from bio-medical applications and deepfake detection to structural modal analysis and predictive maintenance. However, discerning small motion from noise is a complex task, especially when attempting to magnify very subtle, often sub-pixel movement. As a result, motion magnification techniques generally suffer from noisy and blurry outputs. This work presents a new state-of-the-art model based on the Swin Transformer, which offers better tolerance to noisy inputs as well as higher-quality outputs that exhibit less noise, blurriness, and artifacts than prior-art. Improvements in output image quality will enable more precise measurements for any application reliant on magnified video sequences, and may enable further development of video motion magnification techniques in new technical fields.

Key findings

STB-VMM outperforms the previous state-of-the-art model (LB-VMM) in image quality as measured by MUSIQ, showing better noise tolerance and less blurriness. Although STB-VMM is computationally more expensive, its improved output quality benefits applications that require precise measurements, such as vibration monitoring.

Approach

STB-VMM employs a three-stage architecture: a feature extractor built from Swin Transformer blocks, a manipulator that magnifies motion, and a reconstructor that produces the output frame. The model is trained end-to-end on a synthetic dataset with an L1 loss, and regularization is used to improve feature extraction.
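The three-stage flow and the L1 training objective can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`magnify`, `l1_loss`, `magnify_frames`), the linear form of the manipulator, and the identity functions standing in for the Swin-based feature extractor and the reconstructor are all assumptions made for illustration, and frames are reduced to flat lists of floats.

```python
from typing import Callable, List

Vector = List[float]


def magnify(feat_a: Vector, feat_b: Vector, alpha: float) -> Vector:
    """Hypothetical linear manipulator: scale the feature difference by alpha.

    Assumption: the real manipulator may apply learned, non-linear
    transformations; only the basic amplification idea is shown here.
    """
    return [a + alpha * (b - a) for a, b in zip(feat_a, feat_b)]


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error between a predicted and a ground-truth frame."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def magnify_frames(
    frame_a: Vector,
    frame_b: Vector,
    alpha: float,
    extract: Callable[[Vector], Vector],
    reconstruct: Callable[[Vector], Vector],
) -> Vector:
    """Run the three stages: extract features, manipulate, reconstruct."""
    feat_a = extract(frame_a)
    feat_b = extract(frame_b)
    return reconstruct(magnify(feat_a, feat_b, alpha))


def identity(x: Vector) -> Vector:
    """Placeholder for the learned extractor and reconstructor (assumption)."""
    return list(x)


# Toy example: one "pixel" changes by 0.5 between the two frames.
frame_a = [0.0, 1.0, 0.0]
frame_b = [0.0, 1.5, 0.0]
magnified = magnify_frames(frame_a, frame_b, 4.0, identity, identity)

# In training, the output would be compared against a synthetic
# ground-truth magnified frame using the L1 loss.
ground_truth = [0.0, 3.0, 0.0]
loss = l1_loss(magnified, ground_truth)
```

With a magnification factor of 1, this simplified manipulator returns the features of the second frame unchanged, which is a convenient sanity check for the formula.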

Datasets

A synthetic dataset generated by Oh et al. (2018) using segmented objects from the PASCAL VOC dataset and background images from MS COCO.

Model(s)

Swin Transformer based architecture with Residual Swin Transformer Blocks (RSTB) and a Mixed Magnified Transformer Block (MMTB).

Author countries

Spain
