Building a Real-Time AI Chatbot with Vision and Voice Capabilities Using OpenAI, LiveKit, and Deepgram on GPU Droplets


Introduction

In this tutorial, you will learn how to build a real-time AI chatbot with vision and voice capabilities using OpenAI, LiveKit, and Deepgram, deployed on DigitalOcean GPU Droplets. This chatbot will be able to engage in real-time conversations with users, analyze images captured from your camera, and provide accurate and timely responses.

Enhancing Chatbot Capabilities with Advanced Technologies

In this tutorial, you will leverage three powerful technologies to build your real-time AI chatbot, each serving a specific purpose that enhances the chatbot’s capabilities, all while running on the robust infrastructure provided by DigitalOcean’s GPU Droplets:

  1. OpenAI API: The OpenAI API will generate human-like text responses based on user input. By employing advanced models like GPT-4o, our chatbot will be able to understand context, engage in meaningful conversations, and provide accurate answers to user queries. This is crucial for creating an interactive experience where users feel understood and valued.

  2. LiveKit: LiveKit will facilitate real-time audio and video communication between users and the chatbot. It allows us to create a seamless conversational experience, enabling users to speak to the chatbot and hear voice responses. This is essential for building a voice-enabled chatbot that can truly engage users, making the interaction feel more personal and intuitive.

  3. Deepgram: Deepgram will be employed for speech recognition, converting spoken language into text. This allows the chatbot to process user voice inputs effectively. By integrating Deepgram’s capabilities, you can ensure that the chatbot accurately understands user commands and queries, enhancing the overall interaction quality. This is particularly important in a real-time setting where quick and accurate responses are essential for maintaining user engagement.

Why GPU Droplets?: Using DigitalOcean’s GPU Droplets is particularly beneficial for this setup, as they provide the computational and GPU infrastructure needed to handle the intensive processing required by these AI models and real-time communication. The GPUs are optimized for running AI/ML workloads, significantly speeding up model inference and video processing tasks. This ensures the chatbot can deliver responses quickly and efficiently, even under heavy load, improving user experience and engagement.

Prerequisites

Before you begin, ensure you have:

  • A DigitalOcean Cloud account.
  • A GPU Droplet deployed and running.
  • Basic knowledge of Python programming.
  • An OpenAI API key set up for using the GPT-4o model.
  • A LiveKit server up and running on your GPU Droplet.
  • A Deepgram API Key.

Step 1 - Set Up the GPU Droplet

1. Create a New Project - You will need to create a new project from the cloud control panel and tie it to a GPU Droplet.

2. Create a GPU Droplet - Log into your DigitalOcean account, create a new GPU Droplet, and choose AI/ML Ready as the OS. This OS image installs all the necessary NVIDIA GPU drivers. You can refer to our official documentation on how to create a GPU Droplet.

Create a GPU Droplet that is AI/ML Ready

3. Add an SSH Key for authentication - An SSH key is required to authenticate with the GPU Droplet, and by adding the SSH key, you can log in to the GPU Droplet from your terminal.

Add an SSH key for authentication

4. Finalize and Create the GPU Droplet - Once all of the above steps are completed, finalize and create a new GPU Droplet.

Create a GPU Droplet
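
Optionally, before moving on, you can confirm that the NVIDIA drivers bundled with the AI/ML Ready image are working. A minimal check from your terminal (replace <your_droplet_ip> with your Droplet's public IPv4 address; the GPU model reported will depend on the Droplet size you chose):

ssh root@<your_droplet_ip>
nvidia-smi

If nvidia-smi prints a table listing the GPU and driver version, the Droplet is ready for the rest of the tutorial.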

Step 2 - Set up a LiveKit account and install the CLI on the GPU Droplet

First, you will need to create an account or sign in to your LiveKit Cloud account and create a LiveKit Project. Please note down the LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET environment variables from the Project Settings page, as you will need them later in the tutorial.

Install the LiveKit CLI

The below command will install the LiveKit CLI on your GPU Droplet.

curl -sSL https://get.livekit.io/cli | bash

For LiveKit Cloud users, you can authenticate the CLI with your Cloud project to create an API key and secret. This allows you to use the CLI without manually providing credentials each time.

lk cloud auth

Then, follow the instructions and log in from a browser.

You will be asked to add the device and authorize access to the LiveKit Project you created earlier in this step.

Authorize the app

Access granted

Step 3 - Bootstrap an agent from an existing LiveKit template

The template provides a working voice assistant to build on. The template includes:

  • Basic voice interaction
  • Audio-only track subscription
  • Voice activity detection (VAD)
  • Speech-to-text (STT)
  • Language model (LLM)
  • Text-to-speech (TTS)

Note: By default, the example agent uses Deepgram for STT and OpenAI for TTS and LLM. However, you aren’t required to use these providers.

Clone the starter template for a simple Python voice agent:

lk app create

This will show you multiple existing LiveKit templates that you can use to deploy an app.

Output

voice-assistant-frontend
transcription-frontend
token-server
multimodal-agent-python
multimodal-agent-node
voice-pipeline-agent-python
voice-pipeline-agent-node
android-voice-assistant
voice-assistant-swift
outbound-caller-python

You will use the voice-pipeline-agent-python template.

lk app create --template voice-pipeline-agent-python

Now, enter your Application name, OpenAI API Key, and Deepgram API Key when prompted. If you aren’t using Deepgram and OpenAI, you can check out the other supported plugins.

Output

Cloning template...
Instantiating environment...
Cleaning up...
To setup and run the agent:

  cd /root/do-voice-vision-bot
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  python3 agent.py dev

Step 4 - Install dependencies and create a Virtual Environment

First, switch to your application’s directory, which was created in the previous step.

cd <app_name>

You can list the files that were created from the template.

ls

Output

LICENSE README.md agent.py requirements.txt

Here, agent.py is the main application file, which contains the logic and source code for the AI chatbot.

Now, you will create and activate a Python virtual environment using the below commands:

apt install python3.10-venv
python3 -m venv venv

Add the following API keys to your environment:

export LIVEKIT_URL=<>
export LIVEKIT_API_KEY=<>
export LIVEKIT_API_SECRET=<>
export DEEPGRAM_API_KEY=<>
export OPENAI_API_KEY=<>

You can find the LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET on the LiveKit Project Settings page.
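
As an alternative to exporting these variables in your shell, the template already imports python-dotenv (you can see the import in the final agent.py later in this tutorial), so you could keep the keys in a .env file in the app directory instead. This is a minimal sketch, assuming load_dotenv() is called near the top of agent.py (add the call yourself if your copy of the template does not include it):

LIVEKIT_URL=<>
LIVEKIT_API_KEY=<>
LIVEKIT_API_SECRET=<>
DEEPGRAM_API_KEY=<>
OPENAI_API_KEY=<>

from dotenv import load_dotenv
load_dotenv()  # reads key=value pairs from a .env file into the process environment

By default, python-dotenv looks for a file named .env starting from the script's directory, so keep the file next to agent.py.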

Activate the virtual environment:

source venv/bin/activate

Note: On Debian/Ubuntu systems, you need to install the python3-venv package using the following command.

apt install python3.10-venv

Now, let’s install the dependencies required for the app to work.

python3 -m pip install -r requirements.txt

Step 5 - Add Vision Capabilities to your AI agent

To add vision capabilities to your agent, you will need to modify the agent.py file with the below imports and functions.

First, let’s start by adding these imports alongside the existing ones. Open your agent.py file using a text editor like vi or nano.

vi agent.py

Copy the below imports alongside the existing ones:

agent.py

from livekit import rtc
from livekit.agents.llm import ChatMessage, ChatImage

These new imports include:

  • rtc: Access to LiveKit’s video functionality
  • ChatMessage and ChatImage: Classes you’ll use to send images to the LLM

Enable video subscription

Find the ctx.connect() call in the entrypoint function. Change AutoSubscribe.AUDIO_ONLY to AutoSubscribe.SUBSCRIBE_ALL:

agent.py

await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

Note: If you find it difficult to edit the agent.py file using the vi or nano text editor on the GPU Droplet, you can copy the agent.py file contents to your local system, make the required edits in a code editor like VSCode, and then copy-paste the updated code back.

This will enable the assistant to receive video tracks as well as audio.

Add video frame handling

Add these two helper functions after your imports but before the prewarm function:

agent.py

async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for participant_id, participant in room.remote_participants.items():
        for track_id, track_publication in participant.track_publications.items():
            if track_publication.track and isinstance(
                track_publication.track, rtc.RemoteVideoTrack
            ):
                logger.info(
                    f"Found video track {track_publication.track.sid} "
                    f"from participant {participant_id}"
                )
                return track_publication.track
    raise ValueError("No remote video track found in the room")

This function searches through all participants to find an available video track. It’s used to locate the video feed to process.

Now, you will add the frame capture function:

agent.py

async def get_latest_image(room: rtc.Room):
    """Capture and return a single frame from the video track."""
    video_stream = None
    try:
        video_track = await get_video_track(room)
        video_stream = rtc.VideoStream(video_track)
        async for event in video_stream:
            logger.debug("Captured latest video frame")
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            await video_stream.aclose()

The purpose of this function is to capture a single frame from the video track while ensuring proper cleanup of resources. Using aclose() releases system resources such as memory buffers and video decoder instances, which helps prevent memory leaks.

Add nan LLM Callback

Now, within the entrypoint function, add the below callback function, which will inject the latest video frame just before the LLM generates a response. Search for the entrypoint function in the agent.py file:

agent.py

async def before_llm_cb(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
    """
    Callback that runs right before the LLM generates a response.
    Captures the current video frame and adds it to the conversation context.
    """
    try:
        if not hasattr(assistant, '_room'):
            logger.warning("Room not available in assistant")
            return

        latest_image = await get_latest_image(assistant._room)
        if latest_image:
            image_content = [ChatImage(image=latest_image)]
            chat_ctx.messages.append(ChatMessage(role="user", content=image_content))
            logger.debug("Added latest frame to conversation context")
        else:
            logger.warning("No image captured from video stream")
    except Exception as e:
        logger.error(f"Error in before_llm_cb: {e}")

This callback is the key to efficient context management: it only adds visual information when the assistant is about to respond. If visual information were added to every message, it would quickly fill up the LLM’s context window, which would be highly inefficient and costly.
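
If you want to be even stricter about context growth over a long conversation, you could also drop previously injected frames before appending the new one. The helper below is a hypothetical sketch, not part of the template; it assumes frame messages are exactly those whose content is a non-empty list of ChatImage objects:

def prune_old_frames(chat_ctx: llm.ChatContext) -> None:
    """Drop earlier image-only messages so only the newest frame stays in the context."""
    chat_ctx.messages[:] = [
        msg for msg in chat_ctx.messages
        if not (
            isinstance(msg.content, list)
            and msg.content
            and all(isinstance(part, ChatImage) for part in msg.content)
        )
    ]

You could call prune_old_frames(chat_ctx) at the top of before_llm_cb, right before the new ChatImage is appended.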

Update the system prompt

Find the initial_ctx creation within the entrypoint function and update it to include vision capabilities:

agent.py

initial_ctx = llm.ChatContext().append(
    role="system",
    text=(
        "You are a voice assistant created by LiveKit that can both see and hear. "
        "You should use short and concise responses, avoiding unpronounceable punctuation. "
        "When you see an image in our conversation, naturally incorporate what you see "
        "into your response. Keep visual descriptions brief but informative."
    ),
)

Update the assistant configuration

Find the VoicePipelineAgent creation within the entrypoint function and add the callback:

agent.py

assistant = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    chat_ctx=initial_ctx,
    before_llm_cb=before_llm_cb
)

The major update here is the before_llm_cb parameter, which uses the callback created earlier to inject the latest video frame into the conversation context.

Final agent.py file with voice & vision capabilities

This is how the agent.py file should look after adding all the necessary functions and imports:

agent.py

from asyncio.log import logger
from livekit import rtc
from livekit.agents.llm import ChatMessage, ChatImage
import logging
from dotenv import load_dotenv
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    JobProcess,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import openai, deepgram, silero


async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for participant_id, participant in room.remote_participants.items():
        for track_id, track_publication in participant.track_publications.items():
            if track_publication.track and isinstance(
                track_publication.track, rtc.RemoteVideoTrack
            ):
                logger.info(
                    f"Found video track {track_publication.track.sid} "
                    f"from participant {participant_id}"
                )
                return track_publication.track
    raise ValueError("No remote video track found in the room")


async def get_latest_image(room: rtc.Room):
    """Capture and return a single frame from the video track."""
    video_stream = None
    try:
        video_track = await get_video_track(room)
        video_stream = rtc.VideoStream(video_track)
        async for event in video_stream:
            logger.debug("Captured latest video frame")
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            await video_stream.aclose()


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    async def before_llm_cb(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
        """
        Callback that runs right before the LLM generates a response.
        Captures the current video frame and adds it to the conversation context.
        """
        try:
            if not hasattr(assistant, '_room'):
                logger.warning("Room not available in assistant")
                return

            latest_image = await get_latest_image(assistant._room)
            if latest_image:
                image_content = [ChatImage(image=latest_image)]
                chat_ctx.messages.append(ChatMessage(role="user", content=image_content))
                logger.debug("Added latest frame to conversation context")
            else:
                logger.warning("No image captured from video stream")
        except Exception as e:
            logger.error(f"Error in before_llm_cb: {e}")

    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a voice assistant created by LiveKit that can both see and hear. "
            "You should use short and concise responses, avoiding unpronounceable punctuation. "
            "When you see an image in our conversation, naturally incorporate what you see "
            "into your response. Keep visual descriptions brief but informative."
        ),
    )

    logger.info(f"connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

    participant = await ctx.wait_for_participant()
    logger.info(f"starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
        chat_ctx=initial_ctx,
        before_llm_cb=before_llm_cb
    )

    agent.start(ctx.room, participant)
    await agent.say("Hey, how can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )

Testing your agent

Start your assistant and test the following:

python3 agent.py dev

  • Test Voice Interaction: Speak into your microphone and hear the chatbot respond.

  • Test Vision Capability: Ask the chatbot to describe objects it sees through your camera stream.

You should observe the following logs in your console:

Output

2024-12-30 08:32:56,167 - DEBUG asyncio - Using selector: EpollSelector
2024-12-30 08:32:56,168 - DEV livekit.agents - Watching /root/do-voice-vision-bot
2024-12-30 08:32:56,774 - DEBUG asyncio - Using selector: EpollSelector
2024-12-30 08:32:56,778 - INFO livekit.agents - starting worker {"version": "0.12.5", "rtc-version": "0.18.3"}
2024-12-30 08:32:56,819 - INFO livekit.agents - registered worker {"id": "AW_cjS8QXCEnFxy", "region": "US East", "protocol": 15, "node_id": "NC_OASHBURN1A_BvkfVkdYVEWo"}

Now, you will need to connect the app to a LiveKit room with a client that publishes both audio and video. The easiest way to do this is by using the hosted agent playground.

Connect your Project to Hosted Playground

Since this agent requires a frontend application to communicate with, you can use one of the example frontends in livekit-examples, create your own following one of the client quickstarts, or test instantly against one of the hosted Sandbox frontends.

In this example, you will use the existing hosted agent playground. Simply open https://agents-playground.livekit.io/ in your browser and connect your LiveKit Project. It should auto-populate with your Project.

Hosted AI agent deployed on a GPU Droplet
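
If you later replace the hosted playground with your own frontend, that client will need an access token to join the LiveKit room. As a rough sketch, the LiveKit CLI can mint one; the room and identity names below are placeholders, and flags can vary between CLI versions, so check lk token create --help:

lk token create \
  --api-key <LIVEKIT_API_KEY> \
  --api-secret <LIVEKIT_API_SECRET> \
  --join --room my-first-room --identity user1 \
  --valid-for 24h

The generated token is then passed to your frontend along with the LIVEKIT_URL so it can connect to the same room as the agent.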

How it works

With the above changes in place, your assistant now:

  1. Connects to both audio and video streams.

  2. Listens for user speech as before.

  3. Just before generating each response:

  • Captures the current video frame.
  • Adds it to the conversation context.
  • Uses it to inform the response.

  4. Keeps the context clean by only adding frames when needed.

Conclusion

Congratulations! You have successfully built a real-time AI chatbot with vision and voice capabilities using OpenAI, LiveKit, and Deepgram on DigitalOcean GPU Droplets. This powerful combination enables efficient, scalable, and real-time interactions for your applications.

You can refer to LiveKit’s official documentation and its API reference for more details on building AI agents.
