Challenge Overview
Conversational Head Generation Challenge

In face-to-face communication, the speaker transmits the verbal and non-verbal messages explicitly by keeping pace with the listener's reactions, and the listener receives the message and provides real-time responsive feedback to the speaker through non-verbal behavior (e.g., nod, smile, shake, etc.). Generating vivid talking head video and proper responsive listening behavior are both essential for digital humans during face-to-face human-computer interaction.

This challenge is based on our extended "ViCo" dataset, which is so far the first video conversation dataset containing face-to-face dialogue video clips in various scenarios. In contrast to the 2022 challenge, more videos have been collected to enable the use of more advanced machine learning methods. In this challenge, we aim to bring face-to-face interactive head video generation into a visual competition. A comprehensive collection of conversational video clips was selected from YouTube, each showing the frontal faces of two people. The selection strictly follows two principles: a video clip contains only one uniquely identified listener and one uniquely identified speaker, and the listener gives responsive non-verbal feedback to the speaker on the content of the conversation.

The ViCo conversational head generation challenge is organized in conjunction with ACM Multimedia 2023. The challenge includes two tracks:

  • Vivid Talking Head Video Generation conditioned on the identity and audio signals of the speaker.
  • Responsive Listening Head Video Generation conditioned on the identity of the listener and responding to the speaker's behaviors in real time.
The generated videos are expected to be clear, lively, and identity-preserving. In general, we encourage digital humans to simulate talking, seeing, and listening to users, for example by understanding the meaning behind the words during face-to-face conversations.

Individuals and teams with top submissions will present their work at the workshop. We also encourage every team to upload a paper (up to 4 pages) that briefly describes their system. The paper format follows the ACM proceedings style.

If you would like to receive notifications about the challenge (e.g. challenge process, launch time, deadline, awards announcement), please subscribe through this url.

If you have any questions, please let us know by raising an issue or contacting us at <vico-challenge[at]outlook[dot]com>.

Dataset Overview
ViCo: The First Video Corpus for Face-to-face Conversation

All of the videos in our ViCo dataset are collected from the online video sharing platform YouTube. The dataset contains rich video clips from various face-to-face conversational scenarios in which the bi-directional information flow between two people is highlighted. All video clips are manually checked and labeled by the production expert team.


For more details about the ViCo dataset, please refer to:
Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei. "Responsive Listening Head Generation: A Benchmark Dataset and Baseline".
[ECCV 2022] [Project]

Dataset Downloads

Train Set: OneDrive
Validation Set (same as 2022): OneDrive
Test Set: TBD
Note: Except for the driver part, all other parts (e.g. render) can use additional data, but teams need to declare any pretrained model or additional data used when submitting. For example, using a Talking Head Generation dataset is not allowed, but using pretrained data or a pretrained model to adjust the render is allowed.


In the Train Set, the data for each track consists of three parts:

  • videos/*.mp4: all videos without audio track
  • audios/*.wav: all audios
  • *.csv: metadata about all videos/audios, with the following fields:
    video_id (str): ID of video
    uuid (str): ID of video sub-clip
    speaker_id (int): ID of speaker
    listener_id (int): ID of listener, only in listening_head
    Given a uuid, the corresponding audio is uniquely identified as audios/{uuid}.wav, the listener's video is videos/{uuid}.listener.mp4, and the speaker's video is videos/{uuid}.speaker.mp4.
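The naming convention above can be sketched in Python as follows. This is only an illustration: the helper names `clip_paths` and `load_metadata`, the root directory name, and the assumption that the CSV header row uses the field names listed above are not taken from the official tooling. Only the three path patterns come from the description.

```python
import csv
import io
from pathlib import PurePosixPath


def clip_paths(root, uuid, with_listener=True):
    """Map a sub-clip uuid to its audio and video files under `root`."""
    base = PurePosixPath(root)
    paths = {
        "audio": str(base / "audios" / f"{uuid}.wav"),
        "speaker": str(base / "videos" / f"{uuid}.speaker.mp4"),
    }
    # Assumption: listener videos exist only for the listening_head track.
    if with_listener:
        paths["listener"] = str(base / "videos" / f"{uuid}.listener.mp4")
    return paths


def load_metadata(csv_file):
    """Read metadata rows and convert the integer ID columns."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["speaker_id"] = int(row["speaker_id"])
        if row.get("listener_id"):
            row["listener_id"] = int(row["listener_id"])
        rows.append(row)
    return rows


# Made-up example row; real values come from the released CSV files.
sample = io.StringIO(
    "video_id,uuid,speaker_id,listener_id\n"
    "vid0,clip0,3,7\n"
)
rows = load_metadata(sample)
paths = clip_paths("listening_head", rows[0]["uuid"])
```

For a real archive, `sample` would be replaced by an opened CSV file, and `root` by wherever the track data was extracted.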
The Validation Set is organized in the same way as the final Test Set, excluding the output/ directory.
The inputs consist of these parts:
  • videos/*.mp4: speaker videos, only in listening_head
  • audios/*.wav: all audios
  • first_frames/*.jpg: first frames of expected listener/speaker videos
  • ref_images/(\d+).jpg: reference images by person id
  • *.csv: metadata about all videos/audios, same as the CSVs in the Train Set
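As a rough sketch of how the `ref_images/(\d+).jpg` pattern might be used, the snippet below builds a lookup from person ID to reference image. The function name `index_ref_images` and the example listing are hypothetical, and it is an assumption that the number in the file name corresponds to the speaker_id or listener_id values in the CSV.

```python
import re

# Pattern from the directory description: ref_images/<digits>.jpg
REF_IMAGE = re.compile(r"ref_images/(\d+)\.jpg")


def index_ref_images(paths):
    """Return {person_id: path} for paths matching ref_images/<digits>.jpg."""
    index = {}
    for path in paths:
        match = REF_IMAGE.fullmatch(path)
        if match:
            index[int(match.group(1))] = path
    return index


# Hypothetical listing; real file names come from the released archive.
listing = ["ref_images/3.jpg", "ref_images/12.jpg", "first_frames/clip0.jpg"]
ref_by_person = index_ref_images(listing)
```

Paths that do not match the pattern, such as the first-frame image here, are simply skipped.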
Meanwhile, a baseline method and evaluation scripts are released on GitHub.
Example generations on train set:

Competition Submission


Competition Evaluation

The quality of generated videos will be quantitatively evaluated from the perspectives of visual quality, naturalness, and task-specific goals:

| Category | Metric | Note |
|---|---|---|
| Visual Quality | CPBD | image sharpness |
| Visual Quality | FID | distance between synthetic and real data distributions |
| Visual Quality | SSIM | perceptual similarity of image contrast, luminance and structure |
| Visual Quality | PSNR | the error of synthetic data compared to real |
| Visual Quality | CSIM | identity preservation from ArcFace |
| Naturalness | ExpL1 | $L_1$ distance of expression coefficients from 3DMM reconstruction |
| Naturalness | PoseL1 | $L_1$ distance of translation/rotation coefficients from 3DMM reconstruction |
| Speaker-specified | AVOffset | the synchronization offset of lip motion with input audio |
| Speaker-specified | AVConf | the confidence of AVOffset |
| Speaker-specified | LipLMD | the lip landmark distance |
| Listener-specified | ExpFD | Fréchet distance in expression coefficients from 3DMM reconstruction |
| Listener-specified | PoseFD | Fréchet distance in translation/rotation coefficients from 3DMM reconstruction |
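To make two of the simpler metrics concrete, here is a minimal Python sketch of an $L_1$ distance and of PSNR over flattened values. It is not the official evaluation code: averaging over elements (rather than summing) and the 8-bit peak value of 255 are assumptions, and the released evaluation scripts remain the authoritative definition.

```python
import math


def l1_distance(pred, target):
    """Mean absolute difference between two equal-length coefficient lists."""
    if len(pred) != len(target):
        raise ValueError("inputs must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB over flattened pixel values."""
    if len(pred) != len(target):
        raise ValueError("inputs must have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up toy values, for illustration only.
exp_l1 = l1_distance([0.0, 1.0], [0.5, 0.5])
toy_psnr = psnr([0.0, 255.0], [0.0, 0.0])
```

Lower is better for the $L_1$ distance, while higher is better for PSNR.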
The final ranking of all teams is comprehensively measured by all the above evaluation metrics through a ranking-based scoring system based on an additive function. Evaluation scripts can be accessed from this github repo.
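One possible reading of a ranking-based additive score is sketched below: each team is ranked per metric, and the ranks are summed. The exact scheme used by the organizers is not specified here, so the function name `rank_based_scores`, the tie handling, the direction assigned to each metric, and all numbers are assumptions for illustration.

```python
def rank_based_scores(results, higher_is_better):
    """Sum per-metric ranks (1 = best) per team; a lower total is better.

    Ties are broken by team order in `results` (an arbitrary choice).
    """
    totals = {team: 0 for team in results}
    for metric, higher in higher_is_better.items():
        ordered = sorted(
            results, key=lambda team: results[team][metric], reverse=higher
        )
        for rank, team in enumerate(ordered, start=1):
            totals[team] += rank
    return totals


# Made-up scores for three hypothetical teams.
results = {
    "team_a": {"SSIM": 0.60, "FID": 30.0},
    "team_b": {"SSIM": 0.55, "FID": 25.0},
    "team_c": {"SSIM": 0.50, "FID": 40.0},
}
# Assumed directions: higher SSIM is better, lower FID is better.
higher_is_better = {"SSIM": True, "FID": False}
totals = rank_based_scores(results, higher_is_better)
```

Summing ranks rather than raw values keeps metrics with very different scales (for example FID and SSIM) from dominating one another.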

Important Dates
Important Dates and Details of the Conversational Head Generation Challenge.

Dataset available for download (training set): March 27, 2023
Test set of each track available for download: TBD
Challenge submission deadline: TBD
Evaluation results and challenge award announcement: TBD
Paper submission deadline: TBD
Grand challenge paper notification of acceptance: TBD
MM Grand Challenge camera-ready papers due: TBD
Organizers of this Conversational Head Generation Challenge

Yalong Bai
JD Explore Academy

Mohan Zhou
HIT, Harbin, China

Wei Zhang
JD Explore Academy

Ting Yao
JD Explore Academy

Abdulmotaleb El Saddik
University of Ottawa

Xiaodong He
JD Explore Academy

Tao Mei
JD Explore Academy