October 3: workshop program
June 10: modified submission dates
April 13: submission dates
April 10: site open


The 2nd International Workshop on “Computer Vision for Audio-Visual Media” (CVAVM) – ICCV 2017 is dedicated to the role of computer vision for audio-visual media. Audio-visual data is readily available since it is simple to acquire, and the great majority of videos today contain an audio track. Audio-visual media are ubiquitous in our daily life: from movies to TV programs to music videos to YouTube clips, to cite just a few. Moreover, audio-visual media exist on various platforms: TVs, movie theaters, tablets and smartphones. Audio-visual media are also used in many casual and professional contexts, such as entertainment, machine learning, biomedicine, games, education, and movie special effects, among many others.

The goals of this workshop are to (1) investigate the great research opportunities of audio-visual data/media processing and editing, (2) gather researchers working on audio-visual data/media and (3) present and discuss the latest trends in research and technology with paper presentations and invited talks.

Our workshop covers applications and algorithms that combine visual and audio information. The first major thrust is how the combination of audio and visual information can simplify or improve “traditional” computer vision applications, in particular (but not limited to) action recognition, video segmentation and 3D reconstruction. The second major thrust is the exploration of emerging, novel and unconventional applications of audio-visual media, for example movie trailer generation, video editing, and video-to-music alignment.

We invite anyone who is interested in audio-visual data, multi-modal learning and multimedia applications to submit papers and attend the workshop.


The CVAVM workshop is part of ICCV 2017. It will take place in Venice, Italy, on 23 October 2017, at the same venue as the main ICCV conference: Palazzo del Cinema – Venice Convention Center (Lungomare Guglielmo Marconi 30, 30126 Lido di Venezia – Venice; see Google Maps). Please see the ICCV webpage for more information on the venue, accommodations, and other details.


Important dates

Paper registration (title, abstract and authors): July 21, 2017 (extended from July 19 due to a CMT issue)
Full paper submission: July 21, 2017
Acceptance notification: August 11, 2017
Camera-ready paper due: August 25, 2017
Workshop date: October 23, 2017 (morning)
ICCV main conference date: October 24-27, 2017

Paper Submission

Our CVAVM workshop invites paper submissions on any application or algorithm that combines visual and audio information. See the list of topics below.

Paper submissions are handled through the workshop’s CMT website. If you have any issues or questions, do not hesitate to contact us (bazinjc AT

The paper registration deadline (title, abstract and authors) and the full paper submission deadline are both July 21, 2017 (see the important dates above). The submission process follows that of the ICCV main conference; see the guidelines and template on the ICCV webpage. Papers are limited to 8 pages (excluding references), including figures and tables. Reviewing will be double-blind, and each submission will be reviewed by at least two reviewers. Papers that are not anonymized, do not use the template, or exceed 8 pages (excluding references) will be rejected without review. All accepted papers will be published in the ICCV workshop proceedings.

Topics include (but are not limited to):

– multi-modal learning and deep learning
– automatic video captioning
– joint audio-visual processing
– scene/action recognition, and video classification
– 3D reconstruction and tracking
– video segmentation and saliency
– speaker identification
– speech recognition in videos
– virtual/augmented reality and tele-presence
– human-computer interaction
– robotics
– automatic generation of videos
– trailer generation
– video and movie manipulation
– video synchronization
– image sonification
– video-to-music alignment
– joint audio-video retargeting


Workshop program

The workshop will take place on October 23, 2017. See venue information above.

08:40 – 08:50 Welcome and Opening Remarks
08:50 – 09:35 Invited keynote 1 by Rif A. Saurous (Google)
09:35 – 09:50 Oral 1: “Improving Speaker Turn Embedding by Crossmodal Transfer Learning From Face Embedding”, by Nam Le and Jean-Marc Odobez
09:50 – 10:05 Oral 2: “Unsupervised Cross-Modal Deep-Model Adaptation for Audio-Visual Re-Identification With Wearable Cameras”, by Alessio Brutti and Andrea Cavallaro
10:05 – 10:20 Oral 3: “Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking”, by Yutong Ban, Laurent Girin, Xavier Alameda-Pineda and Radu Horaud
10:20 – 10:40 Coffee break
10:40 – 11:25 Invited keynote 2 by Rémi Ronfard (INRIA)
11:25 – 11:40 Oral 4: “Improved Speech Reconstruction From Silent Video”, by Ariel Ephrat, Tavi Halperin and Shmuel Peleg
11:40 – 11:55 Oral 5: “Visual Music Transcription of Clarinet Video Recordings Trained With Audio-Based Labelled Data”, by Pablo Zinemanas, Pablo Arias, Gloria Haro and Emilia Gómez
11:55 – 12:40 Invited keynote 3 by Josh McDermott (MIT)
12:40 – 12:45 Closing Remarks

Keynote speakers

Josh McDermott, MIT, USA
Rémi Ronfard, INRIA, France
Rif A. Saurous, Google, USA

Workshop chairs

Jean-Charles Bazin, KAIST
Zhengyou Zhang, Microsoft Research
William T. Freeman, MIT

Committee members