DeepMind, a Google AI research laboratory based in the UK, has shared progress on its video-to-audio (V2A) technology, which makes synchronized audiovisual generation possible.
DeepMind’s V2A pairs a video with a text description of its soundtrack to generate an unlimited array of music, sound effects and dialogue that matches the characters and tone of the footage.
Users are also given finer control over V2A’s audio output: a ‘positive prompt’ guides the generated output toward desired sounds, while a ‘negative prompt’ steers it away from undesired ones.
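As a rough illustration, many diffusion systems implement this kind of steering with classifier-free guidance, extrapolating the model’s prediction away from the negative prompt and toward the positive one. DeepMind has not published V2A’s exact mechanism, so the sketch below is an assumption; the `model` object, its `predict_noise` method, and the guidance formula are illustrative only.

```python
# Hypothetical sketch only: DeepMind has not published how V2A combines
# positive and negative prompts. This shows one common diffusion-model
# approach (classifier-free-guidance-style extrapolation).

def guided_noise_estimate(model, noisy_audio, video_embedding,
                          positive_prompt, negative_prompt,
                          guidance_scale=3.0):
    """Steer generation toward the positive prompt and away from the negative one."""
    eps_pos = model.predict_noise(noisy_audio, video_embedding, positive_prompt)
    eps_neg = model.predict_noise(noisy_audio, video_embedding, negative_prompt)
    # Extrapolate from the undesired prediction toward the desired one.
    return eps_neg + guidance_scale * (eps_pos - eps_neg)
```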
However, DeepMind does not plan to release V2A until it has undergone rigorous safety assessments and testing. The lab says it is committed to developing and deploying AI technologies responsibly and is focused on ensuring that V2A has a positive impact on the creative community. It is also gathering feedback from a range of creatives to inform its ongoing research and development.
How Does DeepMind’s V2A Work?
DeepMind’s V2A system first encodes the video input into a compressed representation. A diffusion model then iteratively refines audio from random noise, guided by the visual input and natural-language prompts, toward the desired sound. Finally, the audio output is decoded into a waveform and combined with the video data.
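A minimal sketch of that pipeline is shown below, assuming hypothetical `encoder`, `diffusion_model`, and `audio_decoder` components; DeepMind has not released V2A code, so every name and method here is illustrative rather than the actual implementation.

```python
# Illustrative pipeline only; class and method names are assumptions.

def generate_audio_for_video(video_frames, prompt, encoder,
                             diffusion_model, audio_decoder, num_steps=50):
    """Encode video, iteratively denoise a random latent conditioned on the
    video and the text prompt, then decode the result into a waveform."""
    # 1. Encode the video input into a compressed representation.
    video_embedding = encoder.encode(video_frames)

    # 2. Start from random noise and let the diffusion model refine it,
    #    step by step, guided by the visuals and the prompt.
    audio_latent = diffusion_model.sample_noise()
    for step in reversed(range(num_steps)):
        audio_latent = diffusion_model.denoise_step(
            audio_latent, step, video_embedding, prompt)

    # 3. Decode the refined latent into an audio waveform; the caller can
    #    then combine it with the original video data.
    return audio_decoder.decode(audio_latent)
```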
To produce higher-quality audio and steer the model toward more accurate, specific sounds, extra information is added to the training process: V2A is trained on video, audio, and additional annotations, so the technology learns to associate specific audio events with particular visual scenes.
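As a hedged illustration of how such annotations could enter training, the sketch below conditions a denoising loss on both the video and its annotation text, so the model learns to tie audible events to the scenes they appear in. The training loop, field names, and methods are assumptions; DeepMind has not published its training code.

```python
# Illustrative training step; all objects and method names are assumptions.

def training_step(model, encoder, example, optimizer):
    """One gradient step on a (video, audio, annotations) training example."""
    video_embedding = encoder.encode(example["video_frames"])
    target_audio_latent = encoder.encode_audio(example["audio_waveform"])

    # Conditioning on the annotation text nudges the model toward the
    # specific sounds that occur in this visual scene.
    loss = model.denoising_loss(
        target_audio_latent, video_embedding, example["annotations"])

    loss.backward()        # assumes a PyTorch-style autograd interface
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```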