UniSpeaker

UniSpeaker: A unified speech generation model for multimodality-driven voice control

[Paper] [Code]

Anonymous authors

Abstract: Recent advances in zero-shot personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, while multimodal voice creation is still an emerging area. In many scenarios, users want to control and create voice characteristics through different voice description modalities. To address the limited versatility and performance of voice control in previous methods, this paper introduces UniSpeaker, a unified multimodality-driven speech generation model that integrates face images, text descriptions, voice attribute descriptions, and reference speech for comprehensive voice control and creation. Specifically, we propose a unified voice aggregator based on KV-Former and apply a soft contrastive loss to map diverse voice description modalities into a shared voice space, so that the generated voice aligns more closely with the input descriptions. In addition, multimodal voice control is incorporated into a large-scale speech generation framework, with self-distillation used to enhance voice disentanglement. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, which focuses on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated on five tasks using the MVC benchmark, and the experimental results show that it outperforms previous modality-specific models.
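To make the idea of a soft contrastive loss over a shared voice space more concrete, here is a minimal sketch of one possible formulation. It is not the authors' implementation: the function name `soft_contrastive_loss`, the use of cosine similarity, the temperatures, and the choice of deriving soft targets from voice-to-voice similarity are all assumptions made for illustration. The sketch replaces the one-hot targets of a standard contrastive loss with a soft distribution, so that descriptions are also pulled toward voices that sound similar to their paired voice.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(row, temperature):
    # Numerically stable softmax over one row of similarities.
    scaled = [v / temperature for v in row]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_contrastive_loss(desc_emb, voice_emb, temperature=0.1, target_temperature=0.1):
    """Hypothetical soft contrastive loss (illustrative assumption).

    desc_emb[i]  : embedding of a voice description (face, text, ...) for item i
    voice_emb[i] : embedding of the paired reference voice for item i
    """
    n = len(desc_emb)
    loss = 0.0
    for i in range(n):
        # Predicted distribution: how strongly description i matches each voice.
        pred = softmax([cosine(desc_emb[i], v) for v in voice_emb], temperature)
        # Soft target: how similar the paired voice i is to every voice in the batch.
        target = softmax([cosine(voice_emb[i], v) for v in voice_emb], target_temperature)
        # Cross-entropy between the soft target and the prediction.
        loss -= sum(t * math.log(p) for t, p in zip(target, pred))
    return loss / n

# Toy batch: three 2-D "voice" embeddings and two candidate description batches.
voices = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = soft_contrastive_loss(voices, voices)
shuffled_loss = soft_contrastive_loss([voices[1], voices[2], voices[0]], voices)
```

In this toy batch, descriptions that coincide with their paired voices yield a lower loss than descriptions paired with the wrong voices, which is the behavior a contrastive alignment objective is meant to reward.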

Experiment

Face-Driven Personalized Text-to-Speech
Face-Driven Voice Conversion
Text Description-Driven Personalized Text-to-Speech
Text Description-Driven Voice Conversion
Attribute-Driven Voice Editing

Discussion

Face-Driven Voice Consistency
More Face-Driven Voice Samples
More Text-Driven Voice Samples
Joint Face and Text-driven Voice Samples
Joint Face and Attribute-driven Voice Samples

Face-Driven Personalized Text-to-Speech

Note that the goal of face-driven speech synthesis is to generate voice characteristics that match the face, rather than to reproduce voice characteristics identical to those of the actual speaker.

Text	Face Image	Reference Speech	Baseline	UniSpeaker
It impacts sentencing.
She found us we found her disease.
And that's even if you're working way more than full time hours.
They were unexpected pleasures.
Look at the end to see the animal.
We were so excited.

Face-Driven Voice Conversion

Source Speech	Face Image	Reference Speech	Baseline	UniSpeaker

Text Description-Driven Personalized Text-to-Speech

Content Prompt	Speaker Identity Description	Baseline	UniSpeaker
Here I, for instance, quite naturally want to live, in order to satisfy all my capacities for life, and not simply my capacity for reasoning, that is, not simply one twentieth of my capacity for life.	W is a lovely princess, and he is noted with respect.
They were now close to the gate, and Cheesacre paused before he entered.	C is the fearless beast of the Nocknian will, a seasoned warrior of every battle, known for his brutality.
And they did push so.	J is a visionary leader, at the helm of an ambitious expansionist nation. Despite being disabled in the war, he steadfastly leads his country with unwavering determination.
Off for the prison ship.	M is a character full of contradictions; she embraces humanity while punishing the hypocritical and fallen. She opposes injustice and fights against the forces that suppress the truth.
It's part of my secret.	E is an adventurous wanderer full of vitality with mysterious abilities yet unknown, he is brave and fearless,

Text Description-Driven Voice Conversion

Source Speech	Speaker Identity Description	Reference Speech	Converted Speech
	W is a lovely princess, and he is noted with respect.
	C is the premier peacekeeper, known for her composure, intelligence, and unique technological rifle, which often compensates for her partner's impulsive actions.
	B is a character who has been abandoned but never holds a grudge. He is naive yet cruel, yearning for power and a reunion with his brothers. He is passionate about war.
	M is a self-proclaimed detective. Although her actions are clumsy, she proves to be quite reliable in crucial moments. She is filled with genuine passion for detective work.
	A merciless leader, wielding magic and traditional martial arts as weapons, controls dangerous shadow forms with an elegant demeanor and immense power.

Attribute-Driven Voice Editing

Source Speech	Voice Attribute Description	Generated Speech
	I hope this voice becomes more magnetic.	0.7
	I wish for this sound to be brighter.	0.8
	I want this sound to be thinner.	0.8
	I want this voice to be coarser	0.8
	I desire this voice to turn thinner.	0.8
	I hope this voice becomes slimmer.	0.7

Face-Driven Voice Consistency

Face Image1	Face Image2	Face Image3	Face Image4

More Face-Driven Voice Samples

Face Image1	Face Image2	Face Image3	Face Image4 (Out of Domain)

More Text-Driven Voice Samples

Content Prompt	Speaker Identity Description (Out of Domain)	UniSpeaker
What does you look like?	Kyle is a barbarian with immense strength. His muscles are well-developed, and his body is covered with tribal tattoos and battle scars. He fights with a massive warhammer. His roar in battle is as terrifying as his weapon.
"Exchange" means funds in other cities made available by bankers' drafts on such places.	Lillian Stone is a professional dancer with a slender frame and well-defined muscles. Her long hair cascades over her shoulders like a waterfall, and her eyes always sparkle with confidence. Every step she takes exudes grace and poise, whether she is on stage or in daily life, she always manages to execute each movement to perfection.
Priscilla took Honors in Classics, and Phil in Mathematics. Stella obtained a good all round showing.	Allison is a compassionate doctor, always dressed in a clean white coat with a kind smile on her face. She constantly soothes her patients with a gentle voice and treats everyone in need of help with patience and care.
From Thompson, that-.	Eilara is a stealthy assassin known for her agility and precision. Dressed in black form-fitting attire, she can move silently within the shadows, excelling in the use of dual daggers. Her cold and calculated demeanor makes her a formidable adversary in the darkness.
"Nay, nay, not two," The other softly said.	Roderick is a noble knight clad in shining armor, embodying the spirit of chivalry. His sword is engraved with symbols of honor, and he courageously protects the innocent and defends justice. His bravery is renowned far and wide, and his shield bears the scars of numerous battles.

Joint Face and Text-driven Voice Samples

Face Image	Speaker Identity Description	Generated Speech
	Max Brown is an energetic athlete who always wears sports gear and often has sweat on his brow. He has a muscular build, moves with agility, and always has a bright smile on his face. He is enthusiastic about any challenge and enjoys helping others.	0.7

Joint Face and Attribute-driven Voice Samples

Face Image	Voice Attribute Description	Generated Speech
	I hope this voice becomes more magnetic.	0.7

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.