Hosting Your Own AI with Two-Way Voice Chat Is Easier Than You Think!

by HeraHaven AI, January 8th, 2025

The integration of LLMs with voice capabilities has created new opportunities in personalized customer interactions.


This guide will walk you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

Prerequisites

Before we begin, make sure you have the following installed:

  • Python: Version 3.9 or higher.
  • PyTorch: For running the models.
  • Transformers: Provides access to the Qwen model.
  • Accelerate: Required in some environments.
  • FFmpeg & pydub: For audio processing.
  • FastAPI: To create the web server.
  • Uvicorn: ASGI server to run FastAPI.
  • Bark: For text-to-speech synthesis.
  • python-multipart & SciPy: To parse uploaded form data and write WAV files.


FFmpeg can be installed via apt install ffmpeg on Linux or brew install ffmpeg on macOS.


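You can install most of the Python dependencies using pip: pip install torch transformers accelerate pydub fastapi uvicorn python-multipart scipy

Note that the bark package on PyPI is not Suno's Bark. Install Bark from its GitHub repository instead: pip install git+https://github.com/suno-ai/bark.git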
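Before moving on, it can help to confirm that everything is importable. The following is a minimal sketch using only the standard library; the helper name missing_modules and the module list are assumptions for illustration (python-multipart is left out because its import name differs between versions).

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages listed above
REQUIRED = ["torch", "transformers", "accelerate", "pydub", "fastapi", "uvicorn", "bark", "scipy"]

missing = missing_modules(REQUIRED)
print("Missing Python packages:", missing or "none")

# pydub shells out to the ffmpeg binary, so it has to be on PATH
print("ffmpeg found:", shutil.which("ffmpeg") is not None)
```

If anything shows up as missing, install it before loading the models, since the downloads in the later steps are large.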

Step 1: Setting Up the Environment

First, let’s set up our Python environment and choose our PyTorch device:


import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'


This code checks if a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly.


If no such GPU is available, PyTorch will instead run on the CPU, which is much slower.


For newer Apple Silicon devices, the device can also be set to mps to run PyTorch on Metal, but PyTorch's Metal backend does not yet support every operation.
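If you want the mps option to be picked up automatically, the selection could be sketched as below. The helper name pick_device is hypothetical, and the fallback branch only exists so the snippet runs on a machine without PyTorch.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's Metal backend (mps), then the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    # torch.backends.mps is absent in very old PyTorch builds, hence the getattr
    mps_backend = getattr(torch.backends, "mps", None)
    device = pick_device(
        torch.cuda.is_available(),
        bool(mps_backend and mps_backend.is_available()),
    )
except ImportError:
    # Fallback so the sketch still runs where PyTorch is not installed
    device = pick_device(False, False)

print(device)
```

Keeping the priority logic in a plain function makes it easy to check without any GPU attached.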

Step 2: Loading the Model

Most open-source LLMs only support text input and text output. Since we want to create a voice-in-voice-out system, a text-only LLM would require two additional models: (1) one to convert the speech into text before it's fed into the LLM and (2) one to convert the LLM output back into speech.


By using a multimodal LLM like Qwen Audio, we can get away with one model to process speech input into a text response, and then only need a second model to convert the LLM output back into speech.


This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, but also usually yields better results, since the input audio is sent straight to the LLM instead of first passing through a separate transcription step.


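If you're running on a cloud GPU host like Runpod or Vast, you'll want to set the HuggingFace home & Bark directories to your volume storage by running export HF_HOME=/workspace/hf && export XDG_CACHE_HOME=/workspace/bark before downloading the models. Use && rather than a single &, which would run the first export in a background subshell and leave HF_HOME unset in your current shell.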
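The same cache locations can also be set from Python instead of the shell. This is a minimal sketch; the helper name configure_cache_dirs is hypothetical and the paths are simply the ones from the export commands. To be safe, it should run before transformers or bark are imported, since the cache paths are typically resolved at import time.

```python
import os

# Paths taken from the export commands above; adjust them to your volume mount
CACHE_DIRS = {
    "HF_HOME": "/workspace/hf",
    "XDG_CACHE_HOME": "/workspace/bark",
}


def configure_cache_dirs(dirs=None):
    """Set cache locations unless the shell has already exported them."""
    dirs = CACHE_DIRS if dirs is None else dirs
    for name, path in dirs.items():
        # setdefault keeps any value that was exported before Python started
        os.environ.setdefault(name, path)
    return {name: os.environ[name] for name in dirs}


print(configure_cache_dirs())
```

Using setdefault means a value exported in the shell still wins, so the script and the shell setup do not fight each other.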


from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
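# Load the weights straight onto the chosen device; chaining .to(device) onto a model dispatched with device_map="auto" can fail
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map=device)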


We chose to use the small 7B variant of the Qwen Audio model series here in order to reduce our computational requirements. However, Qwen may have released stronger and bigger audio models by the time you are reading this article. You can view all the Qwen models on HuggingFace to double check you're using their latest model.


For a production environment, you may want to use a fast inference engine like vLLM for much higher throughput.

Step 3: Loading the Bark Model

Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.


from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()


Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones might be more performant, they come at a much higher cost. The TTS arena keeps an up-to-date comparison.


With both Qwen Audio 7B & Bark loaded into memory, the approximate (V)RAM usage is 24GB, so make sure your hardware supports this. Otherwise, you may use a quantized version of the Qwen model to save on memory.
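To get a feel for how much quantization could save, here is a rough back-of-the-envelope sketch. It assumes a nominal 7 billion parameters (taken from the "7B" in the model name, not an exact count) and covers the weights only, ignoring activations, the KV cache, and Bark's own models.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NOMINAL_PARAMS = 7e9  # assumption: the "7B" in the model name, not an exact count

for label, nbytes in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{label:>17}: {weight_memory_gib(NOMINAL_PARAMS, nbytes):5.1f} GiB")
```

Actual usage will be higher than these numbers, so treat them as a lower bound when sizing a GPU.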

Step 4: Setting Up the FastAPI Server

We’ll create a FastAPI server with two routes to handle incoming audio or text inputs and return audio responses.


from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)


This server accepts POST requests at two endpoints: /voice takes an uploaded audio file, and /text takes a text form field.

Step 5: Processing Audio Input

We’ll use pydub (which relies on FFmpeg for decoding) to process the incoming audio and prepare it for the Qwen model.


from pydub import AudioSegment
from io import BytesIO
import numpy as np

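def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample, downmix to mono, and force 16-bit samples so the int16 cast below is valid
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    # Scale signed 16-bit integers into the [-1.0, 1.0) float range
    samples = samples.astype(np.float32) / 32768.0

    return samples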

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array
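The division by 32768 is what maps signed 16-bit PCM into the floating-point range the model expects. The sketch below illustrates just that scaling with the standard library; both helper names are hypothetical and the test tone is synthetic.

```python
import math
from array import array


def int16_to_unit_floats(samples):
    """Scale signed 16-bit PCM samples into the range [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def sine_int16(n_samples=160, freq_hz=440.0, rate=16000, amplitude=0.5):
    """Generate a short 16-bit test tone (illustrative helper, not part of the server)."""
    return array(
        "h",
        (int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / rate)) for i in range(n_samples)),
    )


tone = sine_int16()
floats = int16_to_unit_floats(tone)
print(len(floats), min(floats), max(floats))
```

The most negative int16 value (-32768) maps exactly to -1.0, while the most positive (32767) lands just below 1.0, which is why the range is half-open.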

Step 6: Generating Textual Response with Qwen

With the processed audio, we can generate a textual response using the Qwen model. This will need to handle both text & audio inputs.


The processor will convert our input to the model's chat template (ChatML in Qwen's case).


def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

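    # max_new_tokens caps only the reply; max_length would also count the (potentially long) audio prompt tokens
    generate_ids = model.generate(**inputs, max_new_tokens=256)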
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    return response
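To make the chat template step more concrete, here is a simplified, text-only sketch of what a ChatML prompt looks like. The function render_chatml is hypothetical; the real Qwen template applied by apply_chat_template also handles audio placeholders and other details, so treat this purely as an illustration of the layout.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in a simplified ChatML layout (illustration only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant turn tells the model it is its turn to speak
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hey"}]))
```

The trailing open assistant turn is what add_generation_prompt=True contributes: without it, the model has no cue to start a reply.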


Feel free to experiment with generation parameters such as temperature in the model.generate call.
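As a rough intuition for what temperature does, the sketch below applies it to a handful of made-up token scores. The function name and the logits are assumptions for illustration; the real sampling happens inside model.generate.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, while higher ones spread it out. In Transformers, temperature only takes effect when sampling is enabled, for example model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=256).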

Step 7: Converting Text to Speech with Bark

Finally, we’ll convert the generated text response back to speech.


from scipy.io.wavfile import write as write_wav

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer
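Bark is geared towards fairly short clips, so longer LLM responses may need to be split up before synthesis. The following is a minimal sketch of one way to do that; the helper name split_into_chunks and the 200-character limit are assumptions, not values from Bark's documentation.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of at most roughly max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("Hello there. How are you? Fine!", max_chars=15))
```

Each chunk could then be passed to generate_audio separately and the resulting arrays joined with np.concatenate before writing the WAV file. A single sentence longer than max_chars is kept whole rather than cut mid-sentence.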

Step 8: Integrating Everything in the APIs

Update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.


@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")


@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

You may choose to also add a system message to the conversations to gain more control over the assistant responses.
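One way to add such a system message is to build the conversation through a small helper. This is a sketch under the assumption that the chat template accepts a plain string as the system message's content; the helper name build_conversation and the example prompt are hypothetical.

```python
def build_conversation(user_content, system_prompt=None):
    """Build a conversation list, optionally starting with a system message."""
    conversation = []
    if system_prompt:
        # Plain string content, so the audio-collecting loop in generate_response skips it
        conversation.append({"role": "system", "content": system_prompt})
    conversation.append({"role": "user", "content": user_content})
    return conversation


conversation = build_conversation(
    [{"type": "text", "text": "Hey"}],
    system_prompt="You are a concise, friendly voice assistant.",
)
print(conversation)
```

Keeping replies short via the system prompt also helps on the speech side, since shorter text is quicker to synthesize.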

Step 9: Testing Things Out

We can use curl to ping our server as follows:


# Audio input
# The form field must be named "file"; replace the placeholder path with your own recording
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@/path/to/input.wav"

# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
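If you would rather test the /text route from Python, the request can be sketched with the standard library alone. The helper names build_text_request and fetch_speech are hypothetical, and the base URL assumes the server is running locally on port 8000 as configured above.

```python
import urllib.parse
import urllib.request


def build_text_request(text, base_url="http://localhost:8000"):
    """Build a form-encoded POST request for the /text endpoint."""
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/text",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_speech(text, out_path="output.wav", base_url="http://localhost:8000"):
    """Send the text to the server and save the returned WAV audio."""
    request = build_text_request(text, base_url)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out_file:
        out_file.write(response.read())
    return out_path


# Usage (requires the server to be running):
# fetch_speech("Hey")
```

Splitting request construction from sending makes it possible to inspect the request without a running server.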

Conclusion

By following these steps, you’ve set up a simple local server capable of two-way voice interactions using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.

Applications

If you’re exploring ways to monetize AI-powered language models, consider these potential applications:

Full code

import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
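# Load the weights straight onto the chosen device; chaining .to(device) onto a model dispatched with device_map="auto" can fail
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map=device)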

preload_models()

app = FastAPI()

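def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample, downmix to mono, and force 16-bit samples so the int16 cast below is valid
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    # Scale signed 16-bit integers into the [-1.0, 1.0) float range
    samples = samples.astype(np.float32) / 32768.0

    return samples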

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

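    # max_new_tokens caps only the reply; max_length would also count the (potentially long) audio prompt tokens
    generate_ids = model.generate(**inputs, max_new_tokens=256)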
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    return response


def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer


@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")


@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)