Building a ChatGPT-4 Voice Assistant With Vivid Unit

by Gavin85 in Circuits > Computers

Lucy.jpg

Vivid Unit comes with a screen, speaker and microphone, which makes it an ideal piece of hardware for a voice assistant. Lucy is a voice assistant that runs on a Vivid Unit and is powered by Google Speech Recognition, Google Text-to-Speech and ChatGPT-4.

Supplies

VividUnit_speaker_microphone.jpg
PowerAdapter.jpg
  • 1 x Vivid Unit
  • 1 x DC 5V USB Type-C adapter (or 1 x Ethernet cable if you have a PoE port available)

Vivid Unit has all the hardware we need, so we only have to power it and connect it to the Internet.

Vivid Unit is not that power hungry, so a 5V/2A power adapter should be good enough. If you are still worried, give it 2.5A.

You also need an Internet connection, either wired or wireless. It is needed to install the software, and Lucy also needs it while running.

The Idea

simple.png

ChatGPT is an advanced artificial intelligence developed by OpenAI. It is designed to engage in natural language conversations. ChatGPT can assist users with tasks, answer questions, brainstorm ideas, and even generate text in different styles, making it a versatile tool for communication, learning, and problem-solving.

Text is the bridge between ChatGPT and humans. If we use speech recognition to convert what we say to text, we can talk to ChatGPT. If we use a Text-to-Speech engine to read out the text generated by ChatGPT, we can hear ChatGPT too.

The idea is straightforward and really nothing unique, but I like it and I am doing it for fun. I will use only free APIs and services, so the only investment is time; I learned a lot from it and enjoyed the process.

Preparation

The program will be written in Python, and I also need to find packages that provide the required functionality.

Speech to Text (STT)

To convert our speech to text, we need a speech recognition package. I stumbled across the SpeechRecognition project and was very impressed: it offers APIs to access different Speech-to-Text (STT) tools, and some of them can even work offline. I decided to use it because it makes it very easy to switch between different STT tools, which maximizes the fun.

ChatGPT (or alike)

There are many projects that provide APIs to access ChatGPT. My favorite is the gpt4free project. It provides APIs to access different AI engines from various providers. Again, I chose it because that maximizes the fun.

Text to Speech (TTS)

After ChatGPT responds with text, we need to convert it to speech (usually in MP3 format). I was hoping to find a project that lets me easily switch between different TTS engines, but I could not find one. I tried pyttsx3 but found its voice quality (using espeak) terrible. I eventually chose gTTS, which offers much better voice quality. The downside is that it needs a network connection during usage. Since we need a network connection for the ChatGPT service anyway, this is acceptable.

Playback

We still need to play back the MP3 generated by the TTS engine. The simplest way is to save the MP3 as a file and use the os.system() function to call any player that can play MP3 files. However, I find it less elegant to generate an MP3 file. I ended up using the mixer in pygame, which can play back MP3 data without actually generating the file.

Packages Installation

Vivid Unit comes with Python3 installed, but it doesn't have PIP (the Python package manager) yet. It will be convenient to have PIP to help install some packages, so we install PIP first:

sudo apt install pip

We install the "SpeechRecognition" package, and we will use it to convert our speech to text:

pip install SpeechRecognition

Install the "gpt4free" package (its PIP name is g4f), which provides access to ChatGPT-4:

pip install -U g4f

Install the "pygame" package, which is used for playback without actually generating the MP3 file.

pip install pygame

Install the "gTTS" package, so it can actually read the text aloud.

pip install gTTS

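We also need to install some packages for audio input and output. SpeechRecognition relies on PyAudio to read the microphone and on the flac encoder to prepare audio for Google Speech Recognition: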

sudo apt install flac
pip install sounddevice
sudo apt-get install portaudio19-dev python3-pyaudio

Now we have installed all required packages.

A Very Simple Prototype

Here is a very simple and straightforward Python program (simple.py):

from io import BytesIO
import re
import speech_recognition as sr
import g4f
import pygame
from gtts import gTTS
import sounddevice

pygame.init()
r = sr.Recognizer()

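def speak(txt):
  # generate the MP3 in memory instead of saving it to a file
  mp3_file_object = BytesIO()
  speech = gTTS(text=txt, slow=False, lang='en', tld='us')
  speech.write_to_fp(mp3_file_object)
  mp3_file_object.seek(0)    # rewind so the mixer reads from the start
  pygame.mixer.init()
  pygame.mixer.music.load(mp3_file_object, 'mp3')
  pygame.mixer.music.play()  # returns immediately, playback continues in the background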

if __name__ == '__main__':
  while True:
    try:
      with sr.Microphone() as mic:
      
        print("Say something please...")
        
        voice = r.listen(mic)
        txt = r.recognize_google(voice)
        print('\n\nQ: ' + txt + '\n')
  
        resp = g4f.ChatCompletion.create(
              model=g4f.models.gpt_4,
              provider=g4f.Provider.Bing,
              messages=[{"role": "user", "content": txt}],
              stream=False,
        )
        resp = re.sub('[^A-Za-z0-9 ,.:_\'\"]+', '', resp)
  
        print('A: ' + resp + '\n')
  
        speak(resp)

    except sr.UnknownValueError:
      print("Something goes wrong.\n")

The speak() function accepts a text parameter and calls gTTS to generate the MP3 accordingly. It then uses pygame.mixer to play the MP3 without saving it to a file.

This program demonstrates how to convert human speech (from the microphone) to text, and then forward that text to ChatGPT-4. As a very simple example, it works and you can already chat with it. Try asking it some simple questions like "what day is today?" or "which country has tallest people?", and you will find it actually answers your question.
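
One detail of the listing that is easy to overlook is the re.sub() call, because it decides what gTTS gets to read. Below is a minimal sketch of what that pattern does. The clean() helper name and the sample strings are made up for illustration; only the pattern is taken from simple.py.

```python
import re

# Same pattern as in simple.py: every run of characters that is NOT a letter,
# a digit, a space or one of , . : _ ' " gets removed.
PATTERN = '[^A-Za-z0-9 ,.:_\'\"]+'

def clean(text):
    """Hypothetical helper: strip symbols the TTS engine should not read."""
    return re.sub(PATTERN, '', text)

print(clean('**Hi**, there!'))   # markdown asterisks and "!" are dropped
print(clean('Really?\nYes'))     # "?" and the line break are dropped too
```

Note that the pattern also removes "?", "!" and line breaks, so two sentences can end up glued together in the cleaned text.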

However, there are still some issues to address:

Lack of context

This is the biggest problem. The AI starts a new conversation whenever you ask a new question. You will feel frustrated because the AI doesn't remember anything you said previously, which prevents you from discussing a topic in depth.

Long waiting time when ChatGPT output is big

This is also very obvious. ChatGPT likes to talk, a lot. It sometimes generates thousands of words to answer a short and simple question, and you may have to wait a long time while all that output is processed.

Lack of GUI

It would be much nicer if the voice assistant had its own GUI, instead of printing the output on the console.

In the following steps I will address these issues one by one.

Keep the Context

To let ChatGPT remember what we previously discussed, we need to save the chat history and send it to ChatGPT every time we ask a new question. In that history, our questions are stored with the "user" role and ChatGPT's answers with the "assistant" role.

What to do?

If you look at the function call that gets response from ChatGPT:

        resp = g4f.ChatCompletion.create(
              model=g4f.models.gpt_4,
              provider=g4f.Provider.Bing,
              messages=[{"role": "user", "content": txt}],
              stream=False,
        )

You can see the "messages" parameter is actually an array (a Python list), and each element is an object (a dictionary with "role" and "content" keys). In the simple example above, we always pass a new array that contains only one element (the new question) as the messages parameter.

If we always use the same array as the messages parameter, and append each answer from ChatGPT to that array, the AI will know what we have discussed before. Of course, each newly asked question must also be appended to the same array.

You can imagine this puts some pressure on the device, the network and ChatGPT, because you send more and more data as the chat goes on. Although the data is just plain text and not that big, we should still limit how much context we keep. This is practical too: the AI is unlikely to need something you mentioned 58 questions ago. So we define a constant, MAX_CONTEXT, and set it to 32. Every time we append an answer to the array, we check the array size; if it is bigger than MAX_CONTEXT, we delete the first two elements (the oldest question and its answer).

Below is the code snippet:

            chat_data.append({'role': 'user', 'content': txt})

            resp = g4f.ChatCompletion.create(
                model=g4f.models.gpt_4,
                provider=g4f.Provider.Bing,
                messages=chat_data,
                stream=False,
            )
            resp = re.sub('[^A-Za-z0-9 ,.:_\'\"]+', '', resp)

            print('A: ' + resp + '\n')

            speak(resp)

            chat_data.append({'role': 'assistant', 'content': resp})

            if len(chat_data) > MAX_CONTEXT:
                del chat_data[0]
                del chat_data[0]
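
To see the trimming in isolation, here is a small sketch that can be run without ChatGPT. The remember() helper is hypothetical (it is not part of Lucy) and, as a simplification, it appends the question and the answer in one go; the questions and answers are placeholder strings.

```python
MAX_CONTEXT = 32  # same value as in the text

def remember(chat_data, question, answer, max_context=MAX_CONTEXT):
    """Hypothetical helper: store one question/answer pair, then trim."""
    chat_data.append({'role': 'user', 'content': question})
    chat_data.append({'role': 'assistant', 'content': answer})
    if len(chat_data) > max_context:
        del chat_data[:2]  # drop the oldest question and its answer
    return chat_data

history = []
for i in range(20):  # 20 pairs would be 40 entries without trimming
    remember(history, f'question {i}', f'answer {i}')

print(len(history))           # never grows beyond MAX_CONTEXT
print(history[0]['content'])  # the oldest pair that is still remembered
```

Because entries are always removed in pairs, the array keeps starting with a "user" entry, so the conversation sent to ChatGPT never begins with an orphaned answer.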

Speak During Processing

Wouldn't it be a good idea to start playing part of the audio before the whole response from ChatGPT has been processed?

The good news is that ChatGPT supports streamed output: instead of outputting everything at once, it outputs the text piece by piece. You will eventually get the same text, but this way you can access the early text while the rest is still being output.

The bad news, however, is that I could not find a TTS engine that can generate streamed audio from streamed text.

The solution

I start two threads: one as the text generator and the other as the text consumer.

The text generator thread runs a loop that keeps getting output text from ChatGPT and puts it into a queue.

The text consumer thread runs a loop that keeps taking text from the queue and assembles it into a sentence. Once a sentence is complete, it calls the speak() function to read it out.

This way each sentence is read out as soon as it is ready, with no need to wait for the other sentences to be processed.
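
The two-thread idea can be tried on its own with a minimal sketch like the one below. It is not Lucy's actual code: a fixed list stands in for the ChatGPT stream, list.append stands in for speak(), and a None sentinel (an assumption of this sketch) marks the end of the stream instead of the global flags used in the full program.

```python
import queue
import threading

def producer(chunks, q):
    """Text generator: push each streamed chunk into the queue."""
    for chunk in chunks:
        q.put(chunk)
    q.put(None)  # sentinel meaning "no more text" (sketch-only convention)

def consumer(q, speak):
    """Text consumer: assemble chunks and speak each finished sentence."""
    sentence = ''
    while True:
        chunk = q.get()
        if chunk is None:
            break
        sentence += chunk
        if sentence.rstrip().endswith(('.', '!', '?')):
            speak(sentence.strip())
            sentence = ''
    if sentence.strip():
        speak(sentence.strip())  # whatever is left when the stream ends

# Stand-ins: a fixed list instead of ChatGPT, list.append instead of TTS.
fake_stream = ['Hello', ' there.', ' How', ' are', ' you?', ' Bye']
spoken = []
q = queue.Queue()
t1 = threading.Thread(target=producer, args=(fake_stream, q))
t2 = threading.Thread(target=consumer, args=(q, spoken.append))
t1.start()
t2.start()
t1.join()
t2.join()
print(spoken)
```

The queue is what decouples the two sides: the generator never waits for playback, and the consumer only blocks while there is nothing to read.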

The speak() function

The speak() function becomes a member function of the QueueProcessingThread class (the text consumer thread). Because mixer.music.play() does not block during playback, I have to add a while loop to make it blocking; otherwise it would try to play the next sentence before the current playback is done.

  def speak(self, txt):
    if(txt and txt.strip()):
      mp3_file_object = BytesIO()
      speech = gTTS(text=txt, slow=False, lang='en', tld='us')
      speech.write_to_fp(mp3_file_object)
      mp3_file_object.seek(0)
      pygame.mixer.music.load(mp3_file_object, 'mp3')
      pygame.mixer.music.play()
      while pygame.mixer.music.get_busy():    # wait until playing done
        pygame.time.wait(50)                  # sleep 50 ms instead of spinning the CPU

Make the GUI

states.png

I created a fullscreen window as the GUI for this voice assistant. The conversation is displayed on the screen as the chat goes on. I also defined three states for the program: inactive, active and listening.

The three states

When the program has just launched, it is in the "inactive" state: the screen is black and it will not react to what you say.

If you say something that contains "Lucy", that triggers it: its state becomes "active" and then immediately goes to "listening", where the screen is green and it listens to your question.

After you ask a question, the state goes back to "active" while ChatGPT is outputting the answer: the screen is purple and your speech is ignored. After all the output has been read out, the state goes to "listening" again.

If you don't say anything for a while (MAX_INACTIVE, 60 seconds in the program), Lucy says "Talk to you later." and the state goes back to "inactive".
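
The transitions described above can be written down as a small state machine. This is only a sketch for reasoning about the behavior: the full program stores the state as two booleans (active and listening), and the event names 'heard', 'speech_done' and 'tick' are invented here.

```python
from enum import Enum

class State(Enum):
    INACTIVE = 'inactive'    # black screen, waiting for the name
    ACTIVE = 'active'        # purple screen, an answer is being read out
    LISTENING = 'listening'  # green screen, waiting for a question

NAME = 'Lucy'
MAX_INACTIVE = 60  # seconds of silence before going inactive

def next_state(state, event, text='', idle_seconds=0):
    """Return the state that follows `state` for a given event.

    The event names are assumptions of this sketch:
    'heard' (speech recognized), 'speech_done' (playback finished),
    'tick' (periodic check of the idle time).
    """
    if state is State.INACTIVE and event == 'heard' and NAME in text:
        return State.ACTIVE        # Lucy first answers "Yes?"
    if state is State.ACTIVE and event == 'speech_done':
        return State.LISTENING
    if state is State.LISTENING and event == 'heard':
        return State.ACTIVE        # the question goes to ChatGPT
    if (state is State.LISTENING and event == 'tick'
            and idle_seconds > MAX_INACTIVE):
        return State.INACTIVE
    return state                   # everything else is ignored

s = State.INACTIVE
s = next_state(s, 'heard', text='Hey Lucy')
s = next_state(s, 'speech_done')
print(s)
```

Keeping the transitions in one function makes it easy to check that speech is ignored while an answer is playing and that the name is needed only in the inactive state.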

The GUI uses CSS to change the color of widgets.

GPT-4 Providers

By using the gpt4free API, we can easily choose between different providers for ChatGPT. We would have many more choices if we accepted ChatGPT-3.5, but I still prefer ChatGPT-4 because it is available and better.

There are several providers for the ChatGPT-4 service, and the gpt4free project gives a very detailed list of them. Two of them in the list (openai and raycast) need authentication, which makes them harder to use and (most probably) not free. GeekGpt is no longer available, so currently only three remain:

  • Bing (bing.com)
  • Liaobots (liaobots.site)
  • You (you.com)

When I tested them, I could not make Liaobots work. I am not sure if it was a temporary issue.

Bing and You both work quite well. I personally like You better because of the way it speaks: its answers tend to be shorter and simpler. Bing, on the other hand, likes to talk more, sometimes a little too much.

Switching provider

Switching provider is very easy: you just replace the "provider" parameter when calling the g4f.ChatCompletion.create() function. To use Bing, set provider to "g4f.Provider.Bing"; to use You, set it to "g4f.Provider.You".

            resp = g4f.ChatCompletion.create(
                model=g4f.models.gpt_4,
                provider=g4f.Provider.Bing, # set provider here
                messages=chat_data,
                stream=True,
            )
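
Since a provider can stop working without notice, one possible refinement is to try several providers in order. The sketch below is not part of Lucy: ask_with_fallback() and fake_create() are hypothetical names, and the real API call is passed in as a parameter so the example runs without network access.

```python
def ask_with_fallback(create, providers, **kwargs):
    """Try each provider in order and return the first successful reply.

    `create` is any callable that accepts a `provider` keyword, for
    example g4f.ChatCompletion.create (assumed here, not imported).
    """
    errors = []
    for provider in providers:
        try:
            return create(provider=provider, **kwargs)
        except Exception as exc:  # providers can fail in many different ways
            errors.append(f'{provider}: {exc}')
    raise RuntimeError('All providers failed: ' + '; '.join(errors))

# Stand-in for the real API call, for demonstration only.
def fake_create(provider, messages):
    if provider == 'Liaobots':
        raise ConnectionError('no response')
    return f'{provider} says hi'

reply = ask_with_fallback(
    fake_create,
    ['Liaobots', 'You'],
    messages=[{'role': 'user', 'content': 'hi'}],
)
print(reply)
```

This fits the stream=False call best. With stream=True the reply is produced piece by piece, so an error may only show up while iterating over it, after this function has already returned.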


The Complete Program

Below you can find the complete Python program.

import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk, Gdk, Pango, GLib
import threading
import time
import re
import queue
from io import BytesIO
import speech_recognition as sr
import g4f
import pygame
from gtts import gTTS
import sounddevice


NAME = 'Lucy'
RESP = 'Yes?'
BYE = 'Talk to you later.'
MAX_CONTEXT = 32
MAX_INACTIVE = 60


pygame.init()
q = queue.Queue()
r = sr.Recognizer()

get_sentence = False
sentence = ''
output_done = False
speech_done = True
chat_data = []
active_ts = 0


class ChatView(Gtk.TextView):
  def __init__(self):
    Gtk.TextView.__init__(self)
    self.set_wrap_mode(Gtk.WrapMode.WORD)
    self.set_editable(False)
    self.set_cursor_visible(False)
    text_buffer = self.get_buffer()
    text_iter_end = text_buffer.get_end_iter()
    self.text_mark_end = text_buffer.create_mark("", text_iter_end, False)


  def append_text(self, text):
    text_buffer = self.get_buffer()
    text_iter_end = text_buffer.get_end_iter()
    text_buffer.insert_markup(text_buffer.get_end_iter(), text, -1)
    self.scroll_to_mark(self.text_mark_end, 0, False, 0, 0)


  def clear_text(self):
    text_buffer = self.get_buffer()
    text_iter_start = text_buffer.get_start_iter()
    text_iter_end = text_buffer.get_end_iter()
    text_buffer.delete(text_iter_start, text_iter_end)


class LucyWindow(Gtk.Window):
  active = False
  listening = False
  chat_view = ChatView()
  
  def __init__(self):
    Gtk.Window.__init__(self)
    self.set_title('Lucy')
    self.fullscreen()

    self.set_default_size(640, 360)
    self.grid = Gtk.Grid()
    self.scrolled_win = Gtk.ScrolledWindow()
    
    self.scrolled_win.set_hexpand(True)
    self.scrolled_win.set_vexpand(True)
    self.scrolled_win.add(self.chat_view)
    self.scrolled_win.set_policy(Gtk.PolicyType.NEVER, Gtk.PolicyType.AUTOMATIC)

    text_box = Gtk.Box(orientation=Gtk.Orientation.VERTICAL, spacing=20)
    text_box.set_margin_top(20)
    text_box.set_margin_bottom(20)
    text_box.set_margin_start(20)
    text_box.set_margin_end(20)
    text_box.add(self.scrolled_win)

    self.grid.add(text_box)
    self.add(self.grid)
    self.connect('destroy', Gtk.main_quit)
    self.show_all()
    
  def set_state(self, active, listening):
    self.active = active
    self.listening = listening
    
    window_context = self.get_style_context()
    window_context.remove_class('inactive')
    window_context.remove_class('active')
    window_context.remove_class('listening')
    
    view_context = self.chat_view.get_style_context()
    view_context.remove_class('inactive')
    view_context.remove_class('active')
    view_context.remove_class('listening')
    
    if active:
      if listening:
        window_context.add_class('listening')
        view_context.add_class('listening')
      else:
        window_context.add_class('active')
        view_context.add_class('active')
    else:
      window_context.add_class('inactive')
      view_context.add_class('inactive')


class QueueProcessingThread(threading.Thread):
  window = None
  
  def __init__(self, win):
    threading.Thread.__init__(self)
    self.window = win
    self.daemon = True
    
  def speak(self, txt):
    global active_ts
    if(txt and txt.strip()):
      active_ts = time.time()
      mp3_file_object = BytesIO()
      speech = gTTS(text=txt, slow=False, lang='en', tld='us')
      speech.write_to_fp(mp3_file_object)
      mp3_file_object.seek(0)
      pygame.mixer.music.load(mp3_file_object, 'mp3')
      pygame.mixer.music.play()
      while pygame.mixer.music.get_busy():    # wait until playing done
        pygame.time.wait(50)                  # sleep 50 ms instead of spinning the CPU
      active_ts = time.time()
      
  def run(self):
    global get_sentence
    global sentence
    global output_done
    global speech_done
    while True:
      if (get_sentence):
        item = q.get()
        sentence += item
        q.task_done()
        if item.endswith(".") or item.endswith("!") or item.endswith("?") or (output_done and q.empty()):
          self.speak(sentence)
          sentence = ''
          get_sentence = False
      else:
        if q.empty():
          if output_done:
            if not speech_done:
              speech_done = True
              if self.window.active:
                self.window.set_state(True, True)
        else:
          if output_done:
            get_sentence = True


class VoiceRecognizingThread(threading.Thread):
  window = None
  
  def __init__(self, win):
    threading.Thread.__init__(self)
    self.window = win
    self.daemon = True
    
  def run(self):
    global get_sentence
    global output_done
    global speech_done
    global chat_data
    global active_ts
    
    while True:
      Gtk.main_iteration_do(False)
      try:
        with sr.Microphone(sample_rate=44100) as mic:


          if not self.window.active and not self.window.listening:
              self.window.set_state(False, False)
          
          if not speech_done:
              continue
              
          ts = time.time()
          if self.window.active and active_ts and (ts - active_ts) > MAX_INACTIVE :
              active_ts = 0
              self.window.set_state(False, False)
              speech_done = False
              q.put(BYE)
              get_sentence = True
              output_done = True
              GLib.idle_add(self.window.chat_view.clear_text)
          
          voice = r.listen(mic)
          txt = r.recognize_google(voice)
          active_ts = ts
          
          if not self.window.active:
            if NAME in txt:
                self.window.set_state(True, False)
                speech_done = False
                q.put(RESP)
                get_sentence = True
                output_done = True
          else:
            active_ts = ts
            output_done = False
            speech_done = False
            GLib.idle_add(self.window.chat_view.append_text, '\n\nQ: ' + txt + '\nA: ')
            self.window.set_state(True, False)
            
            chat_data.append({'role': 'user', 'content': txt})
    
            resp = g4f.ChatCompletion.create(
                model=g4f.models.gpt_4,
                provider=g4f.Provider.You,
                messages=chat_data,
                stream=True,
            )


            answer = ''
            for message in resp:
                # raw string, so that \+ \- \* \/ are not treated as invalid string escapes
                msg = re.sub(r'[^A-Za-z0-9 ,.:_\'\"\+\-\*\/=]+', '', message.replace('**', ''))
                GLib.idle_add(self.window.chat_view.append_text, msg)
                answer += msg
                q.put(msg)
                if msg.endswith("."):
                    get_sentence = True


            output_done = True
            chat_data.append({'role': 'assistant', 'content': answer})
              
            if len(chat_data) > MAX_CONTEXT:
                del chat_data[0]
                del chat_data[0]
            
            active_ts = time.time()


      except sr.UnknownValueError:    # the speech could not be understood
        output_done = True


if __name__ == '__main__':
  
  # load CSS
  screen = Gdk.Screen.get_default()
  provider = Gtk.CssProvider()
  style_context = Gtk.StyleContext()
  style_context.add_provider_for_screen(
      screen, provider, Gtk.STYLE_PROVIDER_PRIORITY_APPLICATION
  )
  css = b"""
  textview {
      font: 25px Arial;
      background: transparent;
  }
  textview text {
      color: white;
      background: transparent;
  }
  textview.inactive text {
      color: black;
  }
  window.inactive {
      background: black;
  }
  window.active {
      background: #7700df;
  }
  window.listening {
      background: #008c8c;
  }
  """
  provider.load_from_data(css)  
  
  # Lucy window
  win = LucyWindow()

  # voice recognizing thread
  thread1 = VoiceRecognizingThread(win)
  thread1.start()
  
  # queue processing thread
  thread2 = QueueProcessingThread(win)
  thread2.start()

  Gtk.main()


The lucy.py source file is also attached.

The Result

Below is a video that shows how Lucy works. As you can see, it does remember the context during the conversation.

Sometimes sentences are run together without a period or comma, and the gTTS engine just reads them out that way. I think this can be improved by tuning the text consumer thread (QueueProcessingThread).
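
One possible direction for that tuning is to look for sentence ends anywhere in the buffered text, not only at the end of a chunk. The sketch below is an untested idea rather than a fix that has been tried in Lucy. The SentenceBuffer class is hypothetical, and it assumes the chunks still contain their sentence punctuation and spaces when they reach it.

```python
import re

# A sentence ends at . ! or ? that is followed by whitespace. Requiring the
# whitespace keeps numbers such as 3.14 in one piece.
SENTENCE_END = re.compile(r'[.!?]+(?=\s)')

class SentenceBuffer:
    """Hypothetical helper: collect chunks, hand back complete sentences."""

    def __init__(self):
        self.buffer = ''

    def feed(self, chunk):
        """Add a chunk and return the sentences completed by it."""
        self.buffer += chunk
        sentences = []
        while True:
            match = SENTENCE_END.search(self.buffer)
            if not match:
                break
            sentences.append(self.buffer[:match.end()].strip())
            self.buffer = self.buffer[match.end():]
        return sentences

    def flush(self):
        """Return whatever is left once the stream has ended."""
        rest, self.buffer = self.buffer.strip(), ''
        return [rest] if rest else []

buf = SentenceBuffer()
print(buf.feed('Hello there. How'))
print(buf.feed(' are you? I am'))
print(buf.feed(' 3.14 tall'))
print(buf.flush())
```

The trade-off is a slightly later start for each sentence, because a sentence is only released once the text after it begins to arrive (or when flush() is called at the end).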

Lucy's performance can be significantly affected by the network. Lucy uses several APIs that require an Internet connection. If the network is slow, or the server responds late, Lucy may answer much later than you expect.

Offline version?

I can't help thinking, can this voice assistant work offline?

The SpeechRecognition library does provide some APIs that can work offline (e.g. the Vosk API). I tried them and can confirm they do work locally on Vivid Unit. However, the recognition accuracy is not as good as Google Speech Recognition.
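
Because SpeechRecognition exposes its engines as methods named recognize_<engine> on the Recognizer object (recognize_google is the one used in Lucy), switching engines can be reduced to changing one string. The sketch below assumes that naming scheme; transcribe() and FakeRecognizer are hypothetical and only stand in for the real library so the example runs anywhere.

```python
def transcribe(recognizer, audio, engine='google'):
    """Call recognizer.recognize_<engine>(audio) and return its result.

    `recognizer` is expected to behave like speech_recognition.Recognizer;
    it is passed in rather than created here (an assumption of this sketch).
    """
    method = getattr(recognizer, 'recognize_' + engine, None)
    if method is None:
        raise ValueError('Unknown STT engine: ' + engine)
    return method(audio)

# Stand-in recognizer for demonstration; the real one needs a microphone.
class FakeRecognizer:
    def recognize_google(self, audio):
        return 'online result'

    def recognize_vosk(self, audio):
        return 'offline result'

r = FakeRecognizer()
print(transcribe(r, None))                 # default engine
print(transcribe(r, None, engine='vosk'))  # offline engine
```

Engines may return their result in different formats and need their own models or packages installed, so check what each one gives back before feeding it to ChatGPT.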

The text-to-speech engine can also be switched to an offline one, pyttsx3, but the voice quality is really bad and you will not like it.

As for the ChatGPT-4 service, it definitely needs an Internet connection. It may be possible to run a small LLM locally on Vivid Unit, but that would be very slow and not practical.

With that said, if we really took Lucy offline, it would unfortunately be quite unusable.

Can Lucy do more?

Definitely! Vivid Unit comes with GPIOs and ADC channels, so it is possible to let Lucy control external circuits, read data from sensors and so on. It could actually become the central unit of a home automation system.