What does it take to tune an audio recording of one's own voice and match it to the way we hear ourselves?

@j4k3@lemmy.world · 1 year ago

To clarify, I am really curious about speech comprehension when a text-to-speech AI is trained to sound like oneself (myself).

Over the last couple of weeks I have been working on learning about and using offline AI. I don’t know why, but I decided to ask here first.

If anyone is curious, this is what my current favorite LLM has to say (Llama 2 70B Q4 GGML):

You

what does it take to tune a recording of my voice and match it to how I hear my own voice?

Assistant

To tune your recorded voice so that it matches the way you perceive your own voice, there are several steps involved in audio processing. Here’s an overview of the process:

1. Recording: Firstly, ensure that you have a high-quality recording of your voice using proper equipment such as a microphone and soundproofing to minimize background noise. This will provide a good starting point for further editing.
2. Equalization (EQ): Use an equalizer to adjust the frequency balance in your recorded audio. You can boost or cut specific frequencies that are too prominent or lacking, respectively. For example, if you feel like your voice sounds nasal, you may want to reduce some of the mid-range frequencies around 1 kHz and above.
3. Compression: Apply compression to even out the dynamic range in your recording. This will help balance the loudness between different parts of your speech or singing, making it sound more consistent.
4. Reverb/Delay: Add a small amount of reverb or delay to create a sense of space and depth around your voice. This can make it feel like you’re hearing yourself in a room rather than through headphones or speakers.
5. Stereo Imaging: Use stereo imaging tools to adjust the width and positioning of your voice within the mix. This will help create an illusion that your voice is coming from a specific location, making it sound more natural.
6. Monitoring: Finally, listen back to your processed audio on different devices such as headphones or speakers in various environments (e.g., quiet room vs noisy cafe) to ensure that the changes you’ve made translate well across different listening scenarios. Make further adjustments if necessary until you are satisfied with how it sounds.

Remember, everyone perceives their own voice differently due to bone conduction and other factors, so there may be some trial-and-error involved in finding a sound that feels right for you. It’s also important to keep in mind that the goal is not necessarily to make your recorded voice identical to how it sounds when speaking but rather to create an enjoyable listening experience for others who will hear it through speakers or headphones.
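
For anyone who wants to experiment, the EQ and compression steps above can be roughed out in plain Python. This is only a minimal sketch and not something the model produced: the function names, the first-order filter, and the threshold and ratio defaults are all assumptions chosen for illustration, and a real editor or plugin would use better filters plus attack and release times.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """First-order low-pass filter, a crude stand-in for one EQ band."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in samples:
        state += alpha * (x - state)
        out.append(state)
    return out

def shelve_highs(samples, cutoff_hz, sample_rate, gain_db):
    """Split the signal into low and high parts, then scale only the high part."""
    low = one_pole_lowpass(samples, cutoff_hz, sample_rate)
    gain = 10 ** (gain_db / 20)
    return [l + gain * (x - l) for x, l in zip(samples, low)]

def compress(samples, threshold=0.5, ratio=4.0):
    """Static hard-knee compressor (no attack/release): shrink peaks above threshold."""
    out = []
    for x in samples:
        mag = abs(x)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio
        out.append(math.copysign(mag, x))
    return out
```

Samples are assumed to be floats in the range -1.0 to 1.0. Something like `compress(shelve_highs(samples, 1000.0, 44100, -6.0))` would correspond to the "cut around 1 kHz and above, then even out the loudness" idea, with the numbers tuned by ear.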

You

Is there any formulaic approximation or rule regarding bone conduction and how humans perceive their own voice?

Assistant

Bone conduction is the process by which sound vibrations are transmitted directly to the inner ear via the skull, bypassing the outer and middle ears. This occurs when we speak or sing, causing us to hear our voices differently than others do because of this direct transmission pathway.

There isn’t a specific formulaic approximation or rule that can accurately predict how each individual perceives their own voice through bone conduction since it varies from person to person due to factors such as head shape and size, vocal tract resonance, and the acoustics of the surrounding environment. However, in general, people tend to hear their voices as deeper or richer when speaking because low-frequency sounds are more efficiently transmitted through bone conduction than high frequencies.

It’s important to note that while understanding how we perceive our own voice can be helpful for tuning a recording of it, the goal should ultimately be to create an enjoyable listening experience for others who will hear your recorded voice through speakers or headphones rather than trying to perfectly replicate what you hear when speaking.
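
Since the answer says there is no formula but that low frequencies carry better through bone, one thing to try is mixing a low-passed copy of the recording back into itself. Below is a rough sketch of that idea in plain Python. It is not a model of bone conduction: `simulate_self_heard`, the 1000 Hz cutoff, and the 6 dB boost are hypothetical placeholders that would have to be adjusted by ear for each person.

```python
import math

def lowpass(samples, cutoff_hz, sample_rate):
    """First-order low-pass filter (simple, not a studio-grade EQ)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in samples:
        state += alpha * (x - state)
        out.append(state)
    return out

def simulate_self_heard(samples, sample_rate, cutoff_hz=1000.0, low_gain_db=6.0):
    """Boost the low end by adding an extra, scaled low-passed copy (assumed values)."""
    low = lowpass(samples, cutoff_hz, sample_rate)
    extra = 10 ** (low_gain_db / 20) - 1.0  # amount of low band added on top
    return [x + extra * l for x, l in zip(samples, low)]

def rms(samples):
    """Root-mean-square level, used here only to compare loudness."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def tone(freq_hz, sample_rate, count):
    """Generate a sine test tone."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate) for n in range(count)]

if __name__ == "__main__":
    sr = 44100
    for freq in (100.0, 5000.0):
        dry = tone(freq, sr, 4410)
        wet = simulate_self_heard(dry, sr)
        print(freq, "Hz level change:", round(rms(wet) / rms(dry), 2))
```

Running it on the two test tones should show the 100 Hz tone getting noticeably louder while the 5000 Hz tone barely changes, which is the "deeper or richer" direction described above. Whether it actually sounds like your own voice in your head is something only trial and error can answer.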