AI and Voice Deepfakes: How Did We Create One?
We live in an era where technology is advancing at an incredibly fast pace. Tools that were once exclusive to laboratories and specialists are now accessible to everyone. Among all these advancements, one that grabs our attention here at Hakai is the creation of voice deepfakes.
A striking example of the malicious use of this technology occurred in Hong Kong, where a company fell victim to an attack involving voice deepfakes. The attackers created deepfake voices of the CFO and other employees of the company to deceive the finance department, convincing them to transfer approximately $25.6 million to fraudulent accounts. (Read more at: https://www.cfo.com/news/deepfake-cfo-hong-kong-25-million-fraud-cyber-crime/706529/)
Even though it’s gaining popularity online (and, as we’ve seen, in attacks targeting companies as well), many people still don’t know how this technology works. There’s a lot of talk about the dangers and ethical implications of deepfakes, but few have the chance to see the entire creation process up close.
With that in mind, I recently gave an internal talk on this topic to the Hakai team, where I demonstrated the step-by-step process of creating a voice deepfake using the voice of our founder, Oliveira Lima. In this post, I’ll share part of the content from that presentation, revealing the details behind the creation of the deepfake — from building the target’s dataset to the final result. Of course, some portions of the original audio can’t be shared here, but enough will be included for you to understand the power of this technology.
Preparing Our AI Model
To create the deepfake, I used a tool called RVC (Retrieval-based Voice Conversion). This tool allows us to build an AI model of our target’s voice and also apply that model to another existing audio file (I’ll share more details on this below).
RVC was developed in Python, and I chose it for several reasons:
- The entire AI training process can be done locally, using only my computer’s resources. This ensures the security of all the data we use
- It was originally developed with music in mind, meaning its focus is to enable trained models to sing. Because of this, the tool not only trains the “texture” of the target’s voice but also captures specific nuances, breathing, vocal pitches, and more
- Since it identifies vocal pitches (high, low, etc.), the resulting model tends to sound more natural
- The tool is not text-to-speech but speech-to-speech. In other words, we provide an existing audio file, and the tool replaces the voice in that audio with the voice of our model while preserving nuances, accents, breathing patterns, and more
- Lastly, it comes with a web interface that significantly boosts productivity and makes it easier to fine-tune the details
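
For reference, here’s a minimal sketch of how the web interface can be started locally. It assumes the RVC-Project WebUI repository has already been cloned and its dependencies installed, and that the entry point is still called infer-web.py; check the repository’s README, since these details can change between versions.

```python
# Minimal sketch: start the RVC web interface locally.
# Assumptions: the Retrieval-based-Voice-Conversion-WebUI repository is cloned
# into the path below and its requirements are installed; the entry point name
# ("infer-web.py") may differ between versions, so verify against the README.
import subprocess
import sys
from pathlib import Path

RVC_DIR = Path("Retrieval-based-Voice-Conversion-WebUI")  # assumed clone location

def launch_webui() -> None:
    """Run the web UI with the current Python interpreter; it serves on localhost."""
    subprocess.run([sys.executable, "infer-web.py"], cwd=RVC_DIR, check=True)

if __name__ == "__main__":
    launch_webui()
```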

Now That We Have the Tool and Understand the Strengths of Our Deepfake, How Do We Create Our Model?
Step 1: Preparing the Target’s Dataset
To start preparing the target’s dataset, we need to find some audio samples of them online. Fortunately, since my target is Oliveira Lima, we have a few appearances of him in Hakai’s live streams on YouTube and Instagram. After reviewing some of Hakai’s live streams, I decided to use the first Break the Code live stream we hosted, where Oliveira gives an introduction and closing remarks.

After identifying the audio source, the next step was to download the live stream and isolate the exact moments where Oliveira speaks. This was the most manual part of the entire process. In total, it was possible to extract about 9 and a half minutes of raw audio. While this may seem like a small amount, it’s fascinating to see just how far we can go with such limited material.
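
To give an idea of what this step looks like in code, here’s a rough sketch using yt-dlp and pydub. The URL and timestamps are placeholders rather than the ones actually used; in practice, I identified the segments manually while reviewing the stream.

```python
# Rough sketch of the collection step: download the stream's audio and cut out
# the moments where the target speaks. The URL and timestamps below are
# placeholders; the real segments were identified manually while listening.
import yt_dlp
from pydub import AudioSegment

STREAM_URL = "https://www.youtube.com/watch?v=XXXXXXXXXXX"  # placeholder
SEGMENTS = [(65, 143), (4210, 4325)]  # (start_s, end_s), illustrative only

# Download the best available audio track and convert it to WAV.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "stream.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([STREAM_URL])

# Keep only the parts where the target is speaking.
audio = AudioSegment.from_file("stream.wav")
for i, (start, end) in enumerate(SEGMENTS):
    clip = audio[start * 1000 : end * 1000]  # pydub works in milliseconds
    clip.export(f"raw_segment_{i:02d}.wav", format="wav")
```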

With all the raw audio ready, it’s a good idea to break this large file into smaller clips of less than 10 seconds each to make the training process easier. For this, I used another tool called AutoSplitter, which is designed to remove silence from the audio and split it into small chunks of up to 10 seconds.

(As you can see in the screenshot, the tool trims moments where the energy is less than 0.01% for more than 0.6 seconds.)
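AutoSplitter handled this for me, but the same idea can be sketched with pydub’s silence utilities. The 0.6-second window mirrors the setting above, while the dBFS threshold below is only an approximation of “very low energy”, not AutoSplitter’s exact parameter.

```python
# Sketch of the splitting idea using pydub instead of AutoSplitter: drop long
# silences and keep chunks of at most 10 seconds. The -60 dBFS threshold is an
# approximation of "very low energy", not AutoSplitter's exact parameter.
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("oliveira_raw.wav")  # the ~9.5 minutes of raw audio
os.makedirs("dataset", exist_ok=True)

chunks = split_on_silence(
    audio,
    min_silence_len=600,  # silence longer than 0.6 s is treated as a cut point
    silence_thresh=-60,   # dBFS level considered "silence" (approximation)
    keep_silence=100,     # keep a little padding so words aren't clipped
)

clip_id = 0
for chunk in chunks:
    # Enforce the <10 s limit expected by the training step.
    for start in range(0, len(chunk), 10_000):
        piece = chunk[start : start + 10_000]
        piece.export(f"dataset/oliveira_{clip_id:03d}.wav", format="wav")
        clip_id += 1
```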
At the end of the process, several short audio clips of Oliveira were created: 83 files in total. Once again, this would normally be considered a small amount, but for demonstration purposes it was more than enough.

Below are a few random audio samples from the splitting process:
Now we can proceed to train Oliveira’s model locally using RVC.
Step 2: Training Our Model
With the tool running on localhost and the web interface set up for convenience, we can input the training information in the train tab. Initially, we provide the model’s name, enable or disable pitch differentiation (I kept it enabled so the model can be used in other cases), and select the number of processors to be used.

Next, we specify the folder containing all the trimmed audio clips to perform an initial processing step by clicking the “process data” button.

Once the processing is complete, we specify which GPU will be used for the training and select the extraction algorithm. I chose “harvest” because, based on my tests, it provides the highest extraction quality, albeit at the cost of longer training times. After selecting the algorithm, simply click “feature extraction,” and the tool will perform the initial preparations with all the previously provided audio clips.
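
“Harvest” here refers to the Harvest pitch (F0) estimation algorithm from the WORLD vocoder family. The sketch below, using the pyworld package, shows what this kind of extraction looks like on a single clip; it’s an illustration of the algorithm, not RVC’s internal code.

```python
# Illustration of what "harvest" does: estimate the fundamental frequency (F0)
# of a clip over time. This uses pyworld's implementation of the Harvest
# algorithm and is not RVC's internal code.
import numpy as np
import pyworld
import soundfile as sf

x, sr = sf.read("dataset/oliveira_000.wav")  # float64 audio by default
if x.ndim > 1:
    x = x.mean(axis=1)  # fold stereo down to mono

f0, timestamps = pyworld.harvest(x, sr, f0_floor=50.0, f0_ceil=1100.0)

voiced = f0[f0 > 0]
print(f"frames: {len(f0)}, voiced: {len(voiced)}")
print(f"median F0: {np.median(voiced):.1f} Hz")
```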

Finally, we define the total number of epochs for the training process. Each epoch goes through all the audio files, extracting various details. It’s important to set a reasonably high number so the tool can work thoroughly with the provided audio, but not so high that the model starts overfitting the limited material and loses quality and fidelity, especially since we only provided 83 audio files.
With this in mind, I opted to configure 500 epochs and set the tool to save a backup every 50 epochs. This way, I’d have checkpoints in case the training process was interrupted unexpectedly.
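For easier reference, here are the training choices from this step gathered in one place. The key names are my own shorthand for the web UI fields, not RVC’s internal parameter names.

```python
# Summary of the training choices described above, using my own shorthand for
# the web UI fields (these are not RVC's internal parameter names).
training_settings = {
    "model_name": "oliveira",
    "pitch_guidance": True,       # pitch differentiation kept enabled
    "f0_extraction": "harvest",   # slower, but best quality in my tests
    "total_epochs": 500,
    "save_every_n_epochs": 50,    # periodic checkpoints in case training is interrupted
    "dataset_clips": 83,          # short clips of less than 10 s each
}

for key, value in training_settings.items():
    print(f"{key}: {value}")
```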

After that, all that was left was to click the “one-click training” button and wait for the training to complete. Details about the training process can be monitored in the log console.

Step 3: Using Our Model to Generate the Deepfake
After the training is complete, the log console will display a success message, and two files will be generated: a .pth file and an .index file. Both files are crucial for exporting the AI model to other devices, eliminating the need to retrain the model on each computer whenever we create a deepfake.
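
As a quick sanity check when moving the model to another machine, both artifacts can be opened directly: the .pth with PyTorch and the .index with FAISS, which RVC uses for feature retrieval. The file names below are illustrative, and checkpoint contents vary between RVC versions, so treat the output as informational only.

```python
# Quick sanity check of the exported artifacts on another machine: the .pth is
# a PyTorch checkpoint and the .index is a FAISS index of voice features.
# File names are illustrative; only load checkpoints you trained yourself.
import faiss
import torch

checkpoint = torch.load("oliveira.pth", map_location="cpu")
if isinstance(checkpoint, dict):
    print("checkpoint keys:", list(checkpoint.keys()))

index = faiss.read_index("added_oliveira.index")
print("feature vectors stored:", index.ntotal)
```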


With everything set up, we can now use the model to alter the voice in an existing audio file. This step is the most crucial for creating a deepfake that feels realistic, as the tool will extract all the nuances from the provided audio and overlay the trained model’s voice onto it. For instance, if you submit an audio file with a Brazilian carioca accent, the model will reproduce that accent. The same applies to a Brazilian paulista accent, audio in English or Italian, or any other distinct feature. The tool focuses solely on replacing the voice in the provided audio with the voice of the trained model.
In complex attacks using deepfakes, the target is thoroughly studied. The way they speak, the words they use, the topics they engage with, and much more are analyzed. With this in mind, I reached out to Oliveira’s brother, who is exceptionally good at mimicking him, and he kindly agreed to record an audio for this project. Here’s a portion of the audio he recorded for the project:
With our base audio and the trained Oliveira model, we can now create the final deepfake result.
In the tool’s web interface, under “model interface”, you’ll find all the settings needed for the deepfake. In the first field, I selected the already trained model (in this case, Oliveira’s). Then, in the “Transpose” field, you can specify how many semitones higher or lower the model will speak relative to the provided base audio.
For example, a male voice is typically about one octave (12 semitones) lower than a female voice. So, if the base audio features a male voice but the trained model is a female voice, you would need to enter -12 in this field to adapt the deepfake audio to the female voice. Conversely, if the base audio features a female voice and the model is a male voice, you would enter 12 to adapt it accordingly.
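The transpose value maps directly to a frequency ratio: each semitone multiplies the fundamental frequency by 2^(1/12), so -12 halves it and +12 doubles it. A tiny sketch of that arithmetic:

```python
# The "Transpose" value is a shift in semitones; each semitone scales the
# fundamental frequency by 2 ** (1/12), so -12 halves it and +12 doubles it.
def transpose_ratio(semitones: int) -> float:
    return 2 ** (semitones / 12)

base_f0 = 120.0  # Hz, a typical male speaking pitch (illustrative value)
for shift in (-12, 0, 12):
    print(f"transpose {shift:+d}: {base_f0 * transpose_ratio(shift):.1f} Hz")
```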
In this case, since the base audio was a male voice and the model is also a male voice, there was no need to adjust this value, leaving it set to 0. After confirming this, further down, we select the location of the base audio file on the device and choose the algorithm to be used for replacing the voice in the audio with the voice of our model. I kept the algorithm as “harvest,” as it usually delivers better quality.
Next to it, there are more specific settings to fine-tune small characteristics of the final deepfake audio.
Here’s how the configurations for this part were set:

After that, by clicking “Convert,” I got the final audio within minutes, and this was the result:
Real example
Deepfake created
Unfortunately, the final result presented during the internal talk contained internal information, so it won’t be shared in full here. However, this trimmed version is enough to demonstrate the power of simple tools like this one.
It’s worth noting that this result was achieved using only about 9 and a half minutes of the target’s audio. In other words, there’s still plenty of room for improvement in this model.
Extra: Reflections and the Future of Identity
Creating deepfakes is no longer something out of this world. What was once considered extremely complex is now within reach for many people. The entire process presented in the talk was completed in less than a day with decent hardware for training the model. As impressive as this technology is, it comes with immense responsibility.
The accessibility of this kind of technology raises important questions about the future of identity and truth perception. If someone can “become” another person in less than a day, how will we know who to trust? These audio files can be used in social engineering attacks, convincing victims to cooperate with a malicious actor without even realizing it.
Throughout the talk, we collectively concluded that the human element is the only factor that can balance this power. Teaching people to question, verify sources, and be aware of potential manipulations is more critical than ever. In the future, trust may need to be built differently, and what we consider proof of authenticity today might not suffice tomorrow.
Finally, I hope this post has provided a deeper understanding of the topic. Technology is advancing rapidly, and here at Hakai, we’re always focused on keeping up with its evolution and, above all, ensuring its security.
Thank you for reading and Keep Hacking!