In an age where communication happens through various mediums, the ability to convert spoken language into written text has become increasingly valuable. Speech-to-text technology has a wide range of applications that can enhance productivity, accessibility, and user experience.
Speech to Text Use Cases
Here are some examples of how you might use speech-to-text:
Meeting Transcription: Automatically create text records of business meetings, conferences, or interviews.
Closed Captioning: Generate captions for videos, making content accessible to deaf or hard-of-hearing viewers.
Voice Commands: Enable voice control in apps or smart home devices by converting spoken commands to text.
Podcast Transcription: Create text versions of podcasts to improve SEO and make content searchable.
Voice Notes: Allow users to dictate notes or memos, which are then converted to text for easy editing and sharing.
Call Center Analytics: Transcribe customer service calls to analyze common issues, sentiment, and agent performance.
Medical Dictation: Help healthcare professionals create patient notes or reports by speaking rather than typing.
Legal Documentation: Transcribe court proceedings, depositions, or client interviews for accurate record-keeping.
Educational Content: Convert lectures or educational videos into text for students to review or for creating study materials.
Accessibility Tools: Help people with disabilities interact with digital content by converting audio to text.
These are just a few examples of how speech-to-text technology can be applied across various industries and scenarios. Now, let's dive into how Groq's Whisper API can help you implement this technology in your own applications.
What Can the Whisper API Do?
Groq's Whisper API can turn audio into text.
How to Use It
To use the API, send your request to this endpoint:
- For transcription:
https://api.groq.com/openai/v1/audio/transcriptions
The API uses a model called whisper-large-v3. This model is very good at transcribing audio.
Things to Remember
You can only upload files up to 25 MB in size.
The API can work with these types of audio files: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
If your file has more than one audio track (like a video with different language tracks), the API will only use the first track.
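You can catch the size and file-type limits above before you ever call the API. Here's a minimal sketch of a pre-flight check. The helper name checkUpload is made up for this example, and it assumes "25 MB" means 25 * 1024 * 1024 bytes, which the limit above does not spell out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// maxUploadBytes mirrors the 25 MB upload limit.
// Assumption: 1 MB is treated as 1024*1024 bytes.
const maxUploadBytes = 25 * 1024 * 1024

// supportedExts lists the audio file types named in this post.
var supportedExts = map[string]bool{
	".mp3": true, ".mp4": true, ".mpeg": true, ".mpga": true,
	".m4a": true, ".wav": true, ".webm": true,
}

// checkUpload is a hypothetical helper. It returns an error when the
// file extension or size falls outside the limits listed above.
func checkUpload(name string, size int64) error {
	ext := strings.ToLower(filepath.Ext(name))
	if !supportedExts[ext] {
		return fmt.Errorf("unsupported file type %q", ext)
	}
	if size > maxUploadBytes {
		return fmt.Errorf("file is %d bytes, limit is %d", size, maxUploadBytes)
	}
	return nil
}

func main() {
	// In a real program, take the size from os.Stat(path).Size().
	fmt.Println(checkUpload("audio.m4a", 3000000))
	fmt.Println(checkUpload("notes.txt", 100))
}
```

The check only looks at the file name and size. It does not inspect the audio data itself, so a mislabeled file would still pass.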
Preparing Your Audio File
Before using the API, you may need to convert your audio file. The API works best with 16,000 Hz mono audio. Here's how you can convert your file using a tool called ffmpeg:
ffmpeg \
-i <your file> \
-ar 16000 \
-ac 1 \
-map 0:a \
<output file name>
Replace <your file> with the name of your audio file, and <output file name> with what you want to call the new file.
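If your app is written in Go, you can run the same conversion from code instead of the terminal. This is only a sketch: ffmpegArgs and resample are names invented for this example, it uses -map 0:a to select the audio streams, and it assumes ffmpeg is installed and on your PATH.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// ffmpegArgs builds the arguments for the conversion command shown above:
// 16,000 Hz sample rate, one channel (mono), audio streams only.
func ffmpegArgs(input, output string) []string {
	return []string{
		"-i", input,
		"-ar", "16000",
		"-ac", "1",
		"-map", "0:a",
		output,
	}
}

// resample runs ffmpeg and returns its combined output on failure.
// Assumption: the ffmpeg binary is available on PATH.
func resample(input, output string) error {
	out, err := exec.Command("ffmpeg", ffmpegArgs(input, output)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ffmpeg failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Print the command that resample would run for these example names.
	fmt.Println("ffmpeg", strings.Join(ffmpegArgs("input.mp4", "output.wav"), " "))
}
```

Call resample("input.mp4", "output.wav") when you want to actually run the conversion. Keeping the argument list in its own function makes it easy to check without ffmpeg installed.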
Sample Code
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

const (
	apiBaseUrl        = "https://api.groq.com/openai"
	STTWhisperLargeV3 = "whisper-large-v3"
)

type GroqClient struct {
	ApiKey string
}

type GroqMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

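func main() {
	// Read the API key from the environment instead of hard-coding it.
	apiKey := os.Getenv("GROQ_API_KEY")
	gclient := &GroqClient{
		ApiKey: apiKey,
	}

	transcriptText, err := gclient.TranscribeAudio("audio.m4a")
	if err != nil {
		panic(err)
	}
	fmt.Println("transcriptText", transcriptText)

	// Save the response so the message below is accurate.
	if err := os.WriteFile("transcript.txt", []byte(transcriptText), 0644); err != nil {
		panic(err)
	}
	fmt.Println("Transcript saved to transcript.txt")
}
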
// TranscribeAudio uploads the audio file and returns the raw JSON response body.
func (g *GroqClient) TranscribeAudio(audioFile string) (string, error) {
	// Open the file to upload.
	file, err := os.Open(audioFile)
	if err != nil {
		return "", fmt.Errorf("opening file: %w", err)
	}
	defer file.Close()

	transcriptionUrl := "/v1/audio/transcriptions"
	finalUrl := fmt.Sprintf("%s%s", apiBaseUrl, transcriptionUrl)

	// Prepare the multipart form data.
	body := &bytes.Buffer{}
	writer := multipart.NewWriter(body)

	// Add the audio as the "file" field.
	part, err := writer.CreateFormFile("file", audioFile)
	if err != nil {
		return "", fmt.Errorf("creating form file: %w", err)
	}
	if _, err = io.Copy(part, file); err != nil {
		return "", fmt.Errorf("copying file: %w", err)
	}

	// Add the other fields, checking each write instead of ignoring errors.
	fields := [][2]string{
		{"model", STTWhisperLargeV3},
		{"temperature", "0"},
		{"response_format", "json"},
		{"language", "en"},
	}
	for _, f := range fields {
		if err = writer.WriteField(f[0], f[1]); err != nil {
			return "", fmt.Errorf("writing field %s: %w", f[0], err)
		}
	}

	// Close the writer to finish the multipart body.
	if err = writer.Close(); err != nil {
		return "", fmt.Errorf("closing writer: %w", err)
	}

	// Create the POST request.
	req, err := http.NewRequest(http.MethodPost, finalUrl, body)
	if err != nil {
		return "", err
	}

	// Set headers. The API key is sent as a bearer token.
	req.Header.Set("Content-Type", writer.FormDataContentType())
	req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", g.ApiKey))

	// Send the request.
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		return "", fmt.Errorf("sending request: %w", err)
	}
	defer resp.Body.Close()

	// Read the response body.
	responseBody, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("reading response body: %w", err)
	}

	// Treat any non-2xx status as an error and include the body for debugging.
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return "", fmt.Errorf("unexpected status %s: %s", resp.Status, responseBody)
	}

	return string(responseBody), nil
}
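The sample above returns the raw JSON response as a string. If you only want the transcript, you can decode it. Here's a small sketch. The extractText helper and the transcriptionResponse type are made up for this example, and it assumes that a response_format of json returns a body with a top-level "text" field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// transcriptionResponse models only the field this sketch needs.
// Assumption: the JSON body contains a top-level "text" field.
type transcriptionResponse struct {
	Text string `json:"text"`
}

// extractText is a hypothetical helper that pulls the transcript
// out of the raw JSON string returned by TranscribeAudio.
func extractText(raw string) (string, error) {
	var tr transcriptionResponse
	if err := json.Unmarshal([]byte(raw), &tr); err != nil {
		return "", fmt.Errorf("decoding transcription response: %w", err)
	}
	return tr.Text, nil
}

func main() {
	// A hand-written stand-in for an API response, not real output.
	raw := `{"text": "Hello from the transcript."}`
	text, err := extractText(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

Any extra fields in the response are ignored by json.Unmarshal, so the sketch keeps working if the API returns more than the text.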
With Groq's Whisper API, turning speech into text is now easier than ever. Whether you're building an app for meeting transcriptions, creating accessibility tools, or any of the other use cases we've explored, this technology can help you achieve your goals.
Give it a try in your next project and see how it can transform your approach to handling audio content!
If you're interested in learning more about developing apps with Generative AI, subscribe to my blog for more tutorials and sample code. You can also follow me on Twitter or connect with me on LinkedIn.