ViCo
Conversational Head Generation Challenge
held in conjunction with ACM Multimedia 2022
In face-to-face communication, the speaker transmits verbal and non-verbal messages while keeping pace with the listener's reactions, and the listener receives the message and provides real-time responsive feedback to the speaker through non-verbal behaviors (e.g., nodding, smiling, head shaking). Generating vivid talking head videos and proper responsive listening behaviors are both essential for digital humans in face-to-face human-computer interaction.
This challenge is based on our ``ViCo'' dataset, which is, to date, the first video conversation dataset containing face-to-face dialogue video clips in various scenarios. With this challenge, we aim to bring face-to-face interactive head video generation into a visual competition. A comprehensive collection of conversational video clips was selected from YouTube, each containing two people's frontal faces, strictly following the principle that a video clip contains only one uniquely identified listener and one speaker, and that the listener provides responsive non-verbal feedback to the speaker on the content of the conversation.
ViCo conversational head generation challenge is organized in conjunction with ACM Multimedia 2022. The challenge includes two tracks:
- Vivid Talking Head Video Generation conditioned on the identity and audio signals of the speaker.
- Responsive Listening Head Video Generation conditioned on the identity of the listener, responding to the speaker's behaviors in real time.
Individuals and teams with top submissions will present their work at the workshop. We also encourage every team to upload a paper (up to 4 pages) that briefly describes their system. Papers should follow the ACM proceedings style.
If there are any questions, please let us know by raising an issue.
All of the videos in our ViCo dataset are collected from the online video-sharing platform YouTube. The dataset contains rich video clips drawn from various face-to-face conversational scenarios that highlight bi-directional information flow between two people. All video clips are manually checked and labeled by a professional production team.
Example videos from our ViCo dataset:
For more details about the ViCo dataset, please refer to: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei. "Responsive Listening Head Generation: A Benchmark Dataset and Baseline".
[arXiv] [Project]
Train+Valid+Test Set: OneDrive
Train Set: OneDrive
Validation Set: OneDrive
Test Set: OneDrive
Note: For Train Set, the data in `listening_head.zip` contains the data in `talking_head.zip`.
Note: Except for the driver part, all other parts (e.g., the renderer) may use additional data, but teams must declare any pretrained models or additional data used at submission time. For example, using additional talking head generation data is not allowed, but using pretrained data/models to tune the renderer is allowed.
Guidelines
In Train Set, for each track, the data consists of three parts:
- `videos/*.mp4`: all videos without audio track
- `audios/*.wav`: all audios
- `*.csv`: meta data about all videos/audios

| Name | Type | Description |
| --- | --- | --- |
| video_id | str | ID of video |
| uuid | str | ID of video sub-clips |
| speaker_id | int | ID of speaker |
| listener_id | int | ID of listener, only in listening_head |

Given a `uuid`, the corresponding audio is `audios/{uuid}.wav`, the listener's video is `videos/{uuid}.listener.mp4`, and the speaker's video is `videos/{uuid}.speaker.mp4`.
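As a minimal sketch of how the meta-data CSV ties the pieces together, the snippet below parses rows with the column names described above and resolves the audio/video paths for a clip. The sample `uuid` and ID values are hypothetical; only the column names and path conventions come from this page.

```python
import csv
import io

# Hypothetical sample rows mirroring the meta CSV schema described above.
META_CSV = """video_id,uuid,speaker_id,listener_id
abc123,clip_001,7,12
abc123,clip_002,7,12
"""

def clip_paths(row):
    """Resolve the audio and video file paths for one meta-data row,
    following the audios/{uuid}.wav and videos/{uuid}.*.mp4 conventions."""
    uuid = row["uuid"]
    return {
        "audio": f"audios/{uuid}.wav",
        "listener_video": f"videos/{uuid}.listener.mp4",
        "speaker_video": f"videos/{uuid}.speaker.mp4",
    }

rows = list(csv.DictReader(io.StringIO(META_CSV)))
paths = clip_paths(rows[0])
print(paths["audio"])           # audios/clip_001.wav
print(paths["listener_video"])  # videos/clip_001.listener.mp4
```

In practice, `csv.DictReader` would read the released `*.csv` file from disk instead of the inline string used here for illustration.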
In Test Set, generated results should be placed in the `output/` directory. The inputs consist of these parts:
- `videos/*.mp4`: speaker videos, only in listening_head
- `audios/*.wav`: all audios
- `first_frames/*.jpg`: first frames of expected listener/speaker videos
- `ref_images/(\d+).jpg`: reference images by person id
- `*.csv`: meta data about all videos/audios, same as the CSVs in Train Set
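The reference images follow the `ref_images/(\d+).jpg` naming pattern above, so person IDs can be recovered directly from file names. A small sketch, with a hypothetical file listing (the actual names come from the released test set):

```python
import re

# Hypothetical listing of test-set input files for illustration.
files = ["ref_images/7.jpg", "ref_images/12.jpg", "first_frames/clip_001.jpg"]

# Mirrors the ref_images/(\d+).jpg convention from the test-set description.
REF_PATTERN = re.compile(r"ref_images/(\d+)\.jpg$")

# Map person id -> reference image path.
refs = {}
for f in files:
    m = REF_PATTERN.match(f)
    if m:
        refs[int(m.group(1))] = f

print(refs)  # {7: 'ref_images/7.jpg', 12: 'ref_images/12.jpg'}
```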
Example generations on train set:
Terms and Conditions
The dataset users have requested permission to use the ViCo database. In exchange for such permission, the users hereby agree to the following terms and conditions:
- The database can only be used for non-commercial research and educational purposes.
- The authors of the database make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
- You accept full responsibility for your use of the Database and shall defend and indemnify the Authors of ViCo against any and all claims arising from your use of the Database, including but not limited to your use of any copies of copyrighted images that you may create from the Database.
- You may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
- If you are employed by a for-profit, commercial entity, your employer shall also be bound by these terms and conditions, and you hereby represent that you are authorized to enter into this agreement on behalf of such employer.
Submission platform and Leaderboard are publicly available at this link. Submission format and ranking rules are also included.
The quality of generated videos will be quantitatively evaluated from the following perspectives:
- generation quality (image level): SSIM, CPBD, PSNR
- generation quality (feature level): FID
- identity preserving: Cosine Similarity (Arcface)
- expression: L1 distance of 3DMM expression features
- head motion: L1 distance of 3DMM angle & translation features
- lip sync (speaker only): AV offset and AV confidence (SyncNet)
- lip landmark distance: L1 distance of lip landmarks
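To make the pixel-level and feature-distance metrics above concrete, here is a minimal pure-Python sketch of PSNR and the mean L1 distance (the form used for the 3DMM feature metrics). The frame values are toy data; the official evaluation implementation may differ in details such as value range and averaging.

```python
import math

def psnr(ref, gen, max_val=255.0):
    """Peak signal-to-noise ratio between two equally sized frames
    (given here as flat lists of pixel values)."""
    mse = sum((r - g) ** 2 for r, g in zip(ref, gen)) / len(ref)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)

def l1_distance(a, b):
    """Mean absolute error between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy 4x4 "frames" flattened to lists: constant offset of 10 per pixel.
ref = [100] * 16
gen = [110] * 16
print(round(psnr(ref, gen), 2))  # 28.13
```

SSIM, CPBD, FID, and SyncNet scores require model- or library-based implementations and are not sketched here.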
Sign up to receive updates using this form.
The submission deadline is at 11:59 p.m. of the stated deadline date Anywhere on Earth.
| Event | Date |
| --- | --- |
| Dataset available for download (training set) | March 31, 2022 |
| Challenge launch date | April 8, 2022 |
| Test set of each track available for download | May 23, 2022 |
| Challenge submissions deadline | June 3, 2022 |
| Evaluation results and challenge award announcement | June 8, 2022 |
| Paper submission deadline | June 18, 2022 |
| Grand challenge paper notification of acceptance | July 7, 2022 |
| MM Grand Challenge camera-ready papers due | July 20, 2022 |
- Yalong Bai, JD Explore Academy
- Mohan Zhou, HIT, Harbin, China
- Tong Shen, JD Explore Academy
- Wei Zhang, JD Explore Academy
- Ting Yao, JD Explore Academy
- Xiaodong He, JD Explore Academy
- Tao Mei, JD Explore Academy