Real-Time Streaming Medical Speech Transcription Using Google's AI Speech Model
In this knowledge-sharing article, I would like to show how we can use Google's AI Speech model to perform real-time speech transcription between two or more individuals. For our sample here we can assume these individuals are a doctor and a patient, since we are going to use the "medical_conversation" model for our task. The value of capturing such a conversation is that the transcript can be saved in a database and analyzed later to gain insights into the patient's medical history, demand for medicines, side effects, etc.
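As a rough illustration of that analytics idea (a minimal sketch, assuming SQLite and a table layout of my own choosing; the streaming code below does not depend on it), each finalized transcript line could be persisted like this:

import sqlite3

# Hypothetical storage for transcript lines; the schema is illustrative only.
conn = sqlite3.connect("conversations.db")
conn.execute("""CREATE TABLE IF NOT EXISTS transcripts (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP,
                    speaker TEXT,
                    transcript TEXT)""")

def save_transcript(speaker, transcript):
    # Called once per finalized transcript line.
    conn.execute("INSERT INTO transcripts (speaker, transcript) VALUES (?, ?)",
                 (speaker, transcript))
    conn.commit()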
To run our code we need a project in Google Cloud Platform. We can run the code in a local Jupyter notebook, but an account in GCP (Google Cloud Platform) is mandatory. I am using a local Jupyter notebook, and listed below are the steps we need to follow to run our Python script.
Step 1: To start, we need to set up a Google Cloud project for Speech-to-Text and create a service account for this project. Once the service account is created, we need to enable the Cloud Speech-to-Text API. For a complete reference on how to perform these three tasks, please check this link.
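If you prefer the command line to the Cloud Console, the same three tasks can be done with the gcloud CLI. This is a sketch assuming the CLI is installed and authenticated, with PROJECT_ID and SERVICE_ACCOUNT_NAME as placeholders:

gcloud config set project PROJECT_ID
gcloud services enable speech.googleapis.com
gcloud iam service-accounts create SERVICE_ACCOUNT_NAME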
Step 2: To access the Speech-to-Text API we need to create a JSON key for the service account created above. While creating the key it automatically downloads as a JSON file, as shown in the screenshot below, so we can simply save the downloaded .json file in our current Python working directory. I am using Jupyter Notebook in Anaconda for Windows, but the same procedure applies on Linux or in any other IDE.
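The key can also be created from the command line (again a sketch with placeholder names; the file is written to the current directory):

gcloud iam service-accounts keys create speech-key.json --iam-account=SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com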
Step 3: Prior to running the code we need to have the following libraries installed for Google Cloud: google-auth, google-auth-oauthlib, google-cloud-speech and PyAudio (google-cloud-speech pulls in the required Google Cloud core packages); the rest come by default with Python. Here we are using the client libraries to transcribe speech to text, so it is good to have an idea about client libraries by following this link. A typical install command is shown below.
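Note that PyAudio wraps the native PortAudio library, so on Anaconda it may be easier to install it via conda (e.g., conda install pyaudio):

pip install google-auth google-auth-oauthlib google-cloud-speech PyAudio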
Note: I have divided the entire code into three cells of a Jupyter notebook, but all of it can be run in one cell as well.
# Jupyter Cell 1
import queue  # thread-safe FIFO used to buffer audio chunks

import pyaudio  # microphone access
from google.oauth2 import service_account
from google.cloud import speech
After installing the libraries, we need to define a context-manager class, MicrophoneStream. Its __enter__ method opens the audio stream asynchronously by accessing the device microphone through PyAudio, and its __exit__ method closes the stream that __enter__ opened. For more information on context managers you can refer to this Python link. The generator method yields chunks of speech data received from a buffer object: it first blocks until one chunk of data is available, then a continuous while loop drains any additional buffered data without blocking, and when the queue (a linear data structure) is found empty the loop breaks and the concatenated chunks are yielded as a single bytes object.
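As a quick generic illustration of that protocol (my own sketch, not part of the application code): entering the with block calls __enter__, and leaving it, even via an exception, calls __exit__.

class Demo:
    def __enter__(self):
        print("acquire resource")  # runs at the start of the with block
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        print("release resource")  # runs when the with block exits

with Demo():
    print("use resource")  # prints acquire / use / release in order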
# Jupyter Cell 2
RATE = 16000  # sample rate in Hz, matched to sample_rate_hertz below
CHUNK = int(RATE / 10)  # 100 ms of audio per buffer

class MicrophoneStream(object):
    """Opens a microphone stream as a generator yielding audio chunks."""
    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk
        self._buff = queue.Queue()  # buffer filled by the PyAudio callback
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,  # 16-bit samples (LINEAR16)
            channels=1,  # mono
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk,
            # PyAudio invokes _fill_buffer asynchronously as audio arrives
            stream_callback=self._fill_buffer,
        )
        self.closed = False
        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so streaming_recognize
        # does not block waiting for more audio.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Block until at least one chunk is available.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]
            # Drain any additional buffered chunks without blocking.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break
            yield b"".join(data)
After the class is defined with a proper constructor, context-manager methods and generator method, the other two functions are listen_print_loop and main(). The listen_print_loop function prints the recorded response once the API identifies a pause in the sentence. In the main() function we use the JSON key we saved earlier as credentials to access the Speech-to-Text API. We then build a SpeakerDiarizationConfig, which takes parameters such as the number of speakers involved. In RecognitionConfig we define parameters like the audio encoding format, the language, the model to be used, etc. As we do not need interim results from the speech-to-text transcription, we set interim_results to False in StreamingRecognitionConfig.
# Jupyter Cell 3
def listen_print_loop(responses):
    """Print the transcript of each final response as it arrives."""
    for response in responses:
        if not response.results:
            continue
        # The first result is the one currently being finalized.
        result = response.results[0]
        if not result.alternatives:
            continue
        transcript = result.alternatives[0].transcript
        print(transcript)

def main():
    language_code = "en-US"
    # Credentials from the service-account JSON key saved earlier.
    google_speechtotext_apikey = service_account.Credentials.from_service_account_file(
        '***************.json')
    client = speech.SpeechClient(credentials=google_speechtotext_apikey)
    diarization_config = speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=3,
    )
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code=language_code,
        model='medical_conversation',
        diarization_config=diarization_config,
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=False,  # only final results, no partial hypotheses
    )
    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        requests = (speech.StreamingRecognizeRequest(audio_content=content)
                    for content in audio_generator)
        responses = client.streaming_recognize(streaming_config, requests)
        listen_print_loop(responses)

if __name__ == "__main__":
    try:
        print()
        main()
    except KeyboardInterrupt:
        print()
        print("Stopping.. as requested")
Once we run the above code, the transcript output will look something like the screenshot below. In the coming days I will try to differentiate the conversation of different speakers, e.g. "Doctor speaking:", "Patient speaking:", etc., and I hope this article will be helpful if someone is trying real-time streaming speech transcription with GCP. If this code gets stuck somewhere, please ask in the comments and I will be happy to answer.
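As a possible starting point for that speaker labeling (a sketch of my own, not part of the article's code): when diarization is enabled, each recognized word in the final result carries an integer speaker_tag, so a variant of listen_print_loop could group words by tag. Mapping tags such as 1 and 2 to roles like Doctor and Patient is left to application logic.

def print_with_speaker_tags(response):
    # Illustrative sketch: with diarization enabled, the words list of the
    # final result carries a speaker_tag per word (1, 2, ...).
    result = response.results[0]
    for word_info in result.alternatives[0].words:
        # Mapping tags to "Doctor"/"Patient" is application logic.
        print(f"Speaker {word_info.speaker_tag}: {word_info.word}")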
Complete code on GitHub.