Mastering Google Cloud Speech API: 3 Powerful Ways to Convert Speech and Text


For a junior cloud architect, working with Google Cloud’s Speech API can be an exciting and rewarding challenge. In this guide, we’ll walk through three ways to put it to work: text-to-speech synthesis, speech-to-text transcription, and language translation with language detection.

Why Use Google Cloud Speech API?

The capabilities covered in this guide come from three Google Cloud services: Text-to-Speech, Speech-to-Text, and Cloud Translation. Together they convert speech to text, synthesize natural-sounding speech, translate text, and detect languages, which lets developers build voice-enabled applications, automated transcription services, and multilingual support tools.


Step 1: Create an API Key

To use the Google Cloud Speech API, you need to generate an API key.

Steps to Create an API Key:

  1. Go to the Google Cloud Console and navigate to the APIs & Services section.

  2. Select Credentials and click on "Create Credentials."

  3. Choose API Key, and a new key will be generated.

  4. Copy and save the API key, as you will need it for authentication in the upcoming steps.
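
The curl commands later in this guide authenticate with an OAuth access token rather than the key created here. If you would rather use the API key, Google Cloud REST endpoints such as the ones in this guide accept it as a `key` query parameter. The sketch below shows one way to build such a URL in Python. The helper name `with_api_key` and the environment variable name `API_KEY` are assumptions made for illustration, not part of the original walkthrough.

```python
import os
from urllib.parse import urlencode


def with_api_key(endpoint: str, api_key: str) -> str:
    """Append the API key to an endpoint URL as the `key` query parameter."""
    if not api_key:
        raise ValueError("API key is empty; set it before calling the API")
    # Use '&' if the URL already carries a query string, '?' otherwise.
    separator = "&" if "?" in endpoint else "?"
    return f"{endpoint}{separator}{urlencode({'key': api_key})}"


if __name__ == "__main__":
    # API_KEY is a hypothetical environment variable name chosen for this sketch.
    key = os.environ.get("API_KEY", "")
    if key:
        print(with_api_key("https://texttospeech.googleapis.com/v1/text:synthesize", key))
    else:
        print("Set the API_KEY environment variable first.")
```

Reading the key from the environment keeps it out of scripts and shell history, which matters because anyone holding the key can make billable requests against your project.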


Step 2: Create and Connect to a VM Instance

Before using the Cloud Speech API, you need to create and connect to a VM instance.

Steps to Create a VM Instance via Google Cloud Console:

  1. Go to the Google Cloud Console and navigate to Compute Engine.

  2. Click on "Create Instance" and configure the necessary settings (name, region, machine type, etc.).

  3. Allow HTTP and HTTPS Traffic in the firewall settings.

  4. Click on "Create" to provision the VM instance.

Steps to Create a VM Instance via Command Line (gcloud CLI):

gcloud compute instances create INSTANCE_NAME \
    --machine-type=e2-medium \
    --image-project=debian-cloud \
    --image-family=debian-11 \
    --scopes=https://www.googleapis.com/auth/cloud-platform

Steps to Connect to the VM Instance:

  1. Open Google Cloud Console.

  2. Navigate to Compute Engine → VM Instances.

  3. Find your provisioned instance and click "SSH" to connect.

If You Already Have a VM Instance:

Connect to it using the following command:

gcloud compute ssh INSTANCE_NAME

Way 1: Convert Text to Speech

Google’s Text-to-Speech API allows developers to convert written text into natural-sounding speech.

Steps to Synthesize Speech from Text:

  1. Activate the Virtual Environment (this assumes one named venv already exists in the current directory)

     source venv/bin/activate
    
  2. Create a JSON Configuration File (e.g., synthesize-text.json)

    Use nano or vim to create the file:

     nano synthesize-text.json
    
     {
         "input": {"text": "Cloud Text-to-Speech API allows developers to include natural-sounding, synthetic human speech as playable audio."},
         "voice": {"languageCode": "en-GB", "name": "en-GB-Standard-A", "ssmlGender": "FEMALE"},
         "audioConfig": {"audioEncoding": "MP3"}
     }
    

    Save and exit: Press Ctrl + X, then Y, then Enter.

  3. Call the Text-to-Speech API

     curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data @synthesize-text.json \
     "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt
    

    This will send the JSON request to the Text-to-Speech API and store the response in synthesize-text.txt.

  4. Create a decode.py Script to Convert the Response into a Playable Audio File
    Use nano or vim to create the file:

     nano decode.py
    

    Add this Python script:

     """Decode the base64 audio in a Cloud Text-to-Speech response.

     Usage:
             python decode.py --input "synthesize-text.txt" \
             --output "synthesize-text-audio.mp3"
     """

     import argparse
     from base64 import b64decode
     import json


     def decode_tts_output(input_file, output_file):
         """ Decode output from Cloud Text-to-Speech.

         input_file: the response from Cloud Text-to-Speech
         output_file: the name of the audio file to create
         """
         # The response is JSON; the audio is a base64 string under 'audioContent'.
         with open(input_file) as response_file:
             response = json.load(response_file)
         audio_data = response['audioContent']

         # Write the decoded bytes in binary mode so the audio is not corrupted.
         with open(output_file, "wb") as new_file:
             new_file.write(b64decode(audio_data))


     if __name__ == '__main__':
         parser = argparse.ArgumentParser(
             description="Decode output from Cloud Text-to-Speech",
             formatter_class=argparse.RawDescriptionHelpFormatter)
         parser.add_argument('--input',
                            help='The response from the Text-to-Speech API.',
                            required=True)
         parser.add_argument('--output',
                            help='The name of the audio file to create',
                            required=True)

         args = parser.parse_args()
         decode_tts_output(args.input, args.output)
    

    Save and exit: Press Ctrl + X, then Y, then Enter.

    The API returns the synthesized speech as base64-encoded text, which isn't directly playable. This script decodes it into a playable audio file (.mp3).

  5. Decode the Response to an MP3 File

     python decode.py --input "synthesize-text.txt" --output "synthesize-text-audio.mp3"
    

    This will create an MP3 file named synthesize-text-audio.mp3.

  6. Download the Generated MP3 File

    1. Open your VM instance's SSH session in Google Cloud.

    2. Click the DOWNLOAD FILE option.

    3. Select synthesize-text-audio.mp3 and download it to your local machine.

Now, you have a fully functional audio file generated from text!
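
Steps 2 to 5 can also be condensed into a single Python script. The following is a minimal sketch using only the standard library, not the method the walkthrough itself uses. The function names (`build_synthesize_request`, `synthesize`, `decode_audio`) are hypothetical, and the access token is assumed to come from `gcloud auth application-default print-access-token`.

```python
import base64
import json
import urllib.request

TTS_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(text, language_code="en-GB",
                             voice_name="en-GB-Standard-A", gender="FEMALE"):
    """Return the same request body as synthesize-text.json, as a dict."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name,
                  "ssmlGender": gender},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def synthesize(body, access_token):
    """POST the body to the Text-to-Speech endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def decode_audio(response):
    """Decode the base64 'audioContent' field into raw MP3 bytes."""
    return base64.b64decode(response["audioContent"])
```

With a valid token, passing the result of `synthesize(build_synthesize_request("Hello"), token)` to `decode_audio` and writing the bytes to a file opened in "wb" mode should produce the same kind of MP3 as the curl and decode.py steps.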


Way 2: Convert Speech to Text

Google’s Speech-to-Text API can transcribe speech into text in multiple languages.

Steps to Transcribe Audio:

  1. Upload an Audio File or Use a URI

    Before making the API request, you need an audio file. You can either upload a local file to Google Cloud Storage or use a pre-existing publicly available URI.

    Option 1: Upload a Local File to Google Cloud Storage

    If you have a local audio file (e.g., audio.flac), upload it to your Cloud Storage bucket:

     gsutil cp audio.flac gs://your-bucket-name/
    

    Replace your-bucket-name with your actual bucket name.

    Option 2: Use a Pre-existing Public URI

    I am using this Google sample audio file:

     "uri": "gs://cloud-samples-data/speech/corbeau_renard.flac"
    
  2. Create a JSON Configuration File (e.g., speech_request.json)
    Once inside the VM, create the JSON request file using nano:

     nano speech_request.json
    
     {
         "config": {
             "encoding": "FLAC",
             "languageCode": "fr-FR"
         },
         "audio": {
             "uri": "gs://cloud-samples-data/speech/corbeau_renard.flac"
         }
     }
    

    Save and exit: Press Ctrl + X, then Y, then Enter.

  3. Call the Speech-to-Text API

     curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data @speech_request.json \
     "https://speech.googleapis.com/v1/speech:recognize" > speech_response_fr.json
    
  4. Check the Response

     cat speech_response_fr.json
    

    You should now see a JSON response with the French transcription under results → alternatives → transcript. Because step 3 redirects the output, the same response is stored in speech_response_fr.json.
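
To pull only the transcript out of the response, a short Python script is enough. This is a sketch under stated assumptions: the helper name `extract_transcripts` is hypothetical, and `sample_response` is a hand-written stand-in that follows the documented response shape, not real output from the sample audio file.

```python
import json
from pathlib import Path
from typing import List


def extract_transcripts(response: dict) -> List[str]:
    """Return the top transcript of each result in a speech:recognize response."""
    transcripts = []
    # An empty response ({}) means no speech was recognized, so default to [].
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            # Alternatives are ordered with the most probable one first.
            transcripts.append(alternatives[0].get("transcript", ""))
    return transcripts


# Hand-written stand-in for a real response; the transcript text is made up.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "bonjour", "confidence": 0.9}]},
        {"alternatives": [{"transcript": "tout le monde", "confidence": 0.8}]},
    ]
}
print(extract_transcripts(sample_response))

# If the file from step 3 exists in the current directory, parse it as well.
response_path = Path("speech_response_fr.json")
if response_path.exists():
    saved = json.loads(response_path.read_text(encoding="utf-8"))
    print(extract_transcripts(saved))
```

Longer recordings can come back as several results, one per consecutive portion of audio, so joining the returned list with spaces gives the full text.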


Way 3: Translate and Detect Language

Google Cloud’s Translation API allows you to translate text and detect unknown languages.

Steps to Translate Text:

  1. Create the Translation Request File (e.g., translate_request.json)
    Once inside the VM, create the JSON request file using nano:

     nano translate_request.json
    
     {
       "q": "これは日本語です。",
       "source": "ja",
       "target": "en",
       "format": "text"
     }
    

    Save and exit: Press Ctrl + X, then Y, then Enter.

  2. Call the Cloud Translation API

     curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
         -H "Content-Type: application/json; charset=utf-8" \
         --data @translate_request.json \
         "https://translation.googleapis.com/language/translate/v2" > translation_response.txt
    

    This sends the request to the Cloud Translation API and stores the response, including the translated text, in translation_response.txt.

  3. Verify the Translation Output

    Run the following command:

     cat translation_response.txt
    

    The output is a JSON object with the English translation in data.translations[0].translatedText.
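
The translated text can be extracted from that response with a few lines of Python. This is a minimal sketch: the helper name `extract_translations` is hypothetical, and `sample_response` is hand-written to match the v2 response shape, with placeholder text instead of a real translation.

```python
from typing import List


def extract_translations(response: dict) -> List[str]:
    """Return every translatedText value from a Translation v2 response."""
    # Failed calls return an 'error' object instead of 'data'.
    if "error" in response:
        message = response["error"].get("message", "unknown error")
        raise RuntimeError(f"Translation request failed: {message}")
    translations = response.get("data", {}).get("translations", [])
    return [item["translatedText"] for item in translations]


# Hand-written stand-in; the translated text here is a placeholder.
sample_response = {
    "data": {"translations": [{"translatedText": "placeholder translation"}]}
}
print(extract_translations(sample_response))
```

To run it against the real output, load translation_response.txt with `json.load` and pass the resulting dict to `extract_translations`.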

Steps to Detect Language:

  1. Create the Language Detection Request File (e.g., detect_language_request.json)
    Once inside the VM, create the JSON request file using nano:

     nano detect_language_request.json
    
     {
       "q": "Este é japonês."
     }
    

    Save and exit: Press Ctrl + X, then Y, then Enter.

  2. Detect the Language of a Sentence

     curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
         -H "Content-Type: application/json; charset=utf-8" \
         --data @detect_language_request.json \
         "https://translation.googleapis.com/language/translate/v2/detect" > detection_response.txt
    

    This sends a request to the Cloud Translation API to detect the language and stores the response in detection_response.txt.

  3. Verify the Detection Output

    Run the following command:

     cat detection_response.txt
    

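    The output is a JSON object that lists the detected language and a confidence score.

    "language": "pt" means the text was detected as Portuguese, and "confidence": 1 is the top of the 0-to-1 confidence range. It signals that the API is highly confident, not that the result is guaranteed to be correct.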

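The detection response nests its results one level deeper than the translation response, which is easy to trip over. The sketch below shows one way to read it. The helper name `best_detection` is hypothetical, and `sample_response` is hand-written from the `"language": "pt"` and `"confidence": 1` values described above, not captured output.

```python
from typing import Tuple


def best_detection(response: dict) -> Tuple[str, float]:
    """Return (language, confidence) for the first input string in a v2 detect response."""
    # 'detections' holds one inner list per input string;
    # each inner list holds the candidate languages for that string.
    candidates = response["data"]["detections"][0]
    best = max(candidates, key=lambda candidate: candidate.get("confidence", 0))
    return best["language"], float(best.get("confidence", 0))


# Hand-written stand-in built from the values discussed in this guide.
sample_response = {"data": {"detections": [[{"language": "pt", "confidence": 1}]]}}
language, confidence = best_detection(sample_response)
print(language, confidence)
```

Treating the confidence as a score to threshold on, rather than as proof, is the safer design when short or mixed-language strings are involved.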

By mastering these three techniques, you now have the skills to create cloud-based applications that leverage speech and language processing effectively. Whether you're developing automated transcription services, real-time translation tools, or voice-enabled applications, Google Cloud's Speech API provides the capabilities you need.

Thanks for reading! Keep exploring, keep building, and take your cloud development skills to the next level!