Mastering Google Cloud Speech API: 3 Powerful Ways to Convert Speech and Text

As a junior cloud architect, working with Google Cloud's speech and language APIs can be an exciting and rewarding challenge. In this guide, we'll walk through three powerful ways to use them: text-to-speech synthesis (Text-to-Speech API), speech-to-text transcription (Speech-to-Text API), and translation with language detection (Cloud Translation API).
Why Use Google Cloud Speech API?
Google Cloud Speech API provides accurate, fast, and scalable solutions for converting speech to text, synthesizing natural-sounding speech, translating text, and detecting languages. These capabilities enable developers to build sophisticated voice-enabled applications, automated transcription services, and multilingual support tools.
Step 1: Create an API Key
To use the Google Cloud Speech API, you need to generate an API key.
Steps to Create an API Key:
Go to the Google Cloud Console and navigate to the APIs & Services section.
Select Credentials and click on "Create Credentials."
Choose API Key, and a new key will be generated.
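Copy and save the API key. Note that the curl commands later in this guide authenticate with an OAuth access token obtained through gcloud; the API key is the alternative for requests sent without such a token.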
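If you prefer to authenticate with the key itself, the REST endpoints used in this guide can typically accept it as a key query parameter. The sketch below is only an illustration: with_api_key is a hypothetical helper, and GOOGLE_API_KEY is an environment variable name assumed here, not something the guide sets up.

```python
import os
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(url, api_key):
    """Return url with the API key appended as the `key` query parameter."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # GOOGLE_API_KEY is an assumed variable name; export it before running.
    api_key = os.environ.get("GOOGLE_API_KEY", "YOUR_API_KEY")
    print(with_api_key(
        "https://texttospeech.googleapis.com/v1/text:synthesize", api_key))
```

Reading the key from the environment keeps it out of scripts and shell history files that might be shared later.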
Step 2: Create and Connect to a VM Instance
This guide runs every command from a Compute Engine VM, so create one and connect to it before calling the APIs.
Steps to Create a VM Instance via Google Cloud Console:
Go to the Google Cloud Console and navigate to Compute Engine.
Click on "Create Instance" and configure the necessary settings (name, region, machine type, etc.).
Allow HTTP and HTTPS Traffic in the firewall settings.
Click on "Create" to provision the VM instance.
Steps to Create a VM Instance via Command Line (gcloud CLI):
gcloud compute instances create INSTANCE_NAME \
--machine-type=e2-medium \
--image-project=debian-cloud \
--image-family=debian-11 \
--scopes=https://www.googleapis.com/auth/cloud-platform
Steps to Connect to the VM Instance:
Open Google Cloud Console.
Navigate to Compute Engine → VM Instances.
Find your provisioned instance and click "SSH" to connect.
If You Already Have a VM Instance:
Connect to it using the following command:
gcloud compute ssh INSTANCE_NAME
Way 1: Convert Text to Speech
Google’s Text-to-Speech API allows developers to convert written text into natural-sounding speech.
Steps to Synthesize Speech from Text:
Activate the Virtual Environment
This assumes a virtual environment named venv already exists in the current directory:
source venv/bin/activate
Create a JSON Configuration File (e.g., synthesize-text.json)
Use nano or vim to create the file:
nano synthesize-text.json
Add the following request body:
{
  "input": {
    "text": "Cloud Text-to-Speech API allows developers to include natural-sounding, synthetic human speech as playable audio."
  },
  "voice": {
    "languageCode": "en-GB",
    "name": "en-GB-Standard-A",
    "ssmlGender": "FEMALE"
  },
  "audioConfig": {
    "audioEncoding": "MP3"
  }
}
Save and exit: Press Ctrl + X, then Y, then Enter.
Call the Text-to-Speech API
curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data @synthesize-text.json \
  "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt
This sends the JSON request to the Text-to-Speech API and stores the response in synthesize-text.txt.
Create the decode.py File
The response is not playable yet, so create a script that converts it into an audio file. Use nano or vim to create the file:
nano decode.py
Add this Python script:
"""
Usage:
    python decode.py --input "synthesize-text.txt" \
        --output "synthesize-text-audio.mp3"
"""
import argparse
import json
from base64 import decodebytes


def decode_tts_output(input_file, output_file):
    """Decode output from Cloud Text-to-Speech.

    input_file: the response from Cloud Text-to-Speech
    output_file: the name of the audio file to create
    """
    # The response is JSON; the audio is base64 text under 'audioContent'.
    with open(input_file) as response_file:
        response = json.load(response_file)
        audio_data = response['audioContent']

    # Decode the base64 text and write the raw bytes as the audio file.
    with open(output_file, "wb") as new_file:
        new_file.write(decodebytes(audio_data.encode('utf-8')))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Decode output from Cloud Text-to-Speech",
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('--input',
                        help='The response from the Text-to-Speech API.',
                        required=True)
    parser.add_argument('--output',
                        help='The name of the audio file to create',
                        required=True)

    args = parser.parse_args()
    decode_tts_output(args.input, args.output)
Save and exit: Press Ctrl + X, then Y, then Enter.
The API returns the synthesized speech as base64-encoded text, which isn't directly playable. This script decodes it into a playable audio file (.mp3).
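The decoding step can also be checked without calling the API. The sketch below builds a stand-in response (the bytes are made up, not real audio) and decodes its audioContent field the same way the script does. extract_audio is a hypothetical helper name used only for this illustration.

```python
import base64
import json


def extract_audio(response_text):
    """Return the raw audio bytes from a Text-to-Speech JSON response string."""
    response = json.loads(response_text)
    return base64.b64decode(response["audioContent"])


# Stand-in response: placeholder bytes, not real MP3 data.
fake_audio = b"not-really-mp3-bytes"
fake_response = json.dumps(
    {"audioContent": base64.b64encode(fake_audio).decode("ascii")})

if __name__ == "__main__":
    audio = extract_audio(fake_response)
    print(len(audio), "bytes decoded")
```

A round trip like this is a quick way to confirm that the decoder restores exactly the bytes that were encoded.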
Decode the Response to an MP3 File
python decode.py --input "synthesize-text.txt" --output "synthesize-text-audio.mp3"
This will create an MP3 file named synthesize-text-audio.mp3.
To download the generated MP3 file:
Open your VM instance's SSH session in Google Cloud.
Click the DOWNLOAD FILE option.
Select synthesize-text-audio.mp3 and download it to your local machine.
Now, you have a fully functional audio file generated from text!
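The same request can also be assembled in Python with only the standard library. This is a sketch under assumptions: build_synthesize_request is a hypothetical helper, the access token is supplied by the caller (for example, the output of the gcloud command used in the curl call), and nothing is sent until the request is passed to urllib.request.urlopen.

```python
import json
import urllib.request

TTS_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(text, access_token,
                             language_code="en-GB",
                             voice_name="en-GB-Standard-A"):
    """Build (but do not send) a POST request for text:synthesize."""
    body = {
        "input": {"text": text},
        "voice": {
            "languageCode": language_code,
            "name": voice_name,
            "ssmlGender": "FEMALE",
        },
        "audioConfig": {"audioEncoding": "MP3"},
    }
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_synthesize_request("Hello from Python.", "ACCESS_TOKEN_PLACEHOLDER")
    print(request.get_method(), request.full_url)
    # To actually send it (needs a valid token and the API enabled):
    # with urllib.request.urlopen(request) as response:
    #     payload = json.load(response)
```

Keeping request construction separate from sending makes the body easy to inspect before any quota is spent.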
Way 2: Convert Speech to Text
Google’s Speech-to-Text API can transcribe speech into text in multiple languages.
Steps to Transcribe Audio:
Upload an Audio File or Use a URI
Before making the API request, you need an audio file. You can either upload a local file to Google Cloud Storage or use a pre-existing publicly available URI.
Option 1: Upload a Local File to Google Cloud Storage
If you have a local audio file (e.g., audio.flac), upload it to your Cloud Storage bucket:
gsutil cp audio.flac gs://your-bucket-name/
Replace your-bucket-name with your actual bucket name.
Option 2: Use a Pre-existing Public URI
I am using this Google sample audio file:
"uri": "gs://cloud-samples-data/speech/corbeau_renard.flac"
Create a JSON Configuration File (e.g., speech_request.json)
Once inside the VM, create the JSON request file using nano:
nano speech_request.json
Add the following request body:
{
  "config": {
    "encoding": "FLAC",
    "languageCode": "fr-FR"
  },
  "audio": {
    "uri": "gs://cloud-samples-data/speech/corbeau_renard.flac"
  }
}
Save and exit: Press Ctrl + X, then Y, then Enter.
Call the Speech-to-Text API
curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data @speech_request.json \
  "https://speech.googleapis.com/v1/speech:recognize" > speech_response_fr.json
Check the Response
cat speech_response_fr.json
The response, now stored in speech_response_fr.json, should contain the French transcription of the audio.
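To pull just the text out of the saved response, a short script can walk the results list. The sketch below assumes the usual speech:recognize response shape (a results list whose entries each hold an alternatives list with a transcript field). The helper names and the sample values are placeholders, not real API output.

```python
import json


def extract_transcripts(response):
    """Return the top transcript of each result in a speech:recognize response."""
    transcripts = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            transcripts.append(alternatives[0].get("transcript", ""))
    return transcripts


def load_transcripts(path):
    """Read a saved JSON response file and return its transcripts."""
    with open(path, encoding="utf-8") as response_file:
        return extract_transcripts(json.load(response_file))


# Illustrative shape only; the text and confidence values are placeholders.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "premier segment", "confidence": 0.9}]},
        {"alternatives": [{"transcript": "deuxième segment", "confidence": 0.8}]},
    ]
}

if __name__ == "__main__":
    print(" ".join(extract_transcripts(sample_response)))
    # On the VM: print(" ".join(load_transcripts("speech_response_fr.json")))
```

Using .get with defaults means a response with no recognized speech yields an empty list instead of raising an error.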
Way 3: Translate and Detect Language
Google Cloud’s Translation API allows you to translate text and detect unknown languages.
Steps to Translate Text:
Create the Translation Request File (e.g., translate_request.json)
Once inside the VM, create the JSON request file using nano:
nano translate_request.json
Add the following request body:
{
  "q": "これは日本語です。",
  "source": "ja",
  "target": "en",
  "format": "text"
}
Save and exit: Press Ctrl + X, then Y, then Enter.
Call the Cloud Translation API
curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data @translate_request.json \
  "https://translation.googleapis.com/language/translate/v2" > translation_response.txt
This sends a request to the Google Cloud Translation API and stores the translated text in translation_response.txt.
Verify the Translation Output
Run the following command:
cat translation_response.txt
You should see the English translation in the response.

Steps to Detect Language:
Create the Language Detection Request File (e.g., detect_language_request.json)
Once inside the VM, create the JSON request file using nano:
nano detect_language_request.json
Add the following request body. Because the text travels in a JSON body rather than in the URL, it needs no URL encoding:
{
  "q": "Este é japonês."
}
Save and exit: Press Ctrl + X, then Y, then Enter.
Detect the Language of a Sentence
curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data @detect_language_request.json \
  "https://translation.googleapis.com/language/translate/v2/detect" > detection_response.txt
This sends a request to the Google Cloud Translation API and stores the detected language in detection_response.txt.
Verify the Detection Output
Run the following command:
cat detection_response.txt
In the response, "language": "pt" means the text was detected as Portuguese, and "confidence": 1 is the API's confidence score for that detection on a scale from 0 to 1. A score of 1 signals maximum confidence, not a guarantee of accuracy.
By mastering these three techniques, you now have the skills to create cloud-based applications that use speech and language processing effectively. Whether you're developing automated transcription services, real-time translation tools, or voice-enabled applications, Google Cloud's speech and language APIs provide the capabilities you need.
Thanks for reading! Keep exploring, keep building, and take your cloud development skills to the next level!
