Speaking instead of typing: How AI transforms speech into structured data
Voice input is revolutionizing interaction with digital applications. While typing is slow and prone to errors, voice offers a faster, more natural, and more accessible alternative. With OpenAI’s Whisper API, spoken words—even Swiss German—can be reliably converted into structured text.
Our demo app shows how voice commands can replace traditional forms. The technology's potential goes far beyond that, opening up new possibilities for intuitive user interfaces in a wide range of applications.
Speech to text
In many digital applications, interaction with users still takes place via traditional text forms. However, voice is becoming increasingly popular as a form of input, especially on mobile devices.
Typing is slow, error-prone, and tedious, particularly on small screens. Voice is faster, more natural, and more accessible.
With modern tools such as OpenAI's Whisper API, voice input has become genuinely practical, as they can transcribe reliably even in the presence of ambient noise, dialects, and colloquial language.
We therefore analyzed the technology in more detail to find out in which other use cases it can improve the user experience.
Whisper API in practice
OpenAI's Whisper API can be easily integrated into your own applications. The connection is made using a valid API key and a suitable client package for the respective programming language.
Key generation: https://platform.openai.com/api-keys
Overview of packages: https://platform.openai.com/docs/libraries
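As a minimal sketch, this is roughly what the connection looks like with the official Python client (the language choice and file name are ours; the article does not prescribe a specific setup):

```python
# pip install openai
from openai import OpenAI

# The client picks up the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Transcribe a local audio recording (file name is a placeholder).
with open("recording.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```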
After these simple steps, the API can be used to convert speech into text. We tested this right away and were pleasantly surprised, as the API was able to transcribe spoken Swiss German in a wide variety of dialects into written High German without any problems.
After this successful test, we looked at the potential this technology could offer – especially when integrated into a user flow and combined with other tools. This is exactly the approach we are taking with our demo app.
Our demo app: Voice-controlled clothing search
The demo app helps customers find suitable clothing items—without having to type a single word. They can describe what they are looking for using voice input.
From this description, the app generates a structured JSON object containing the relevant search fields and the desired characteristics of the garment.
In addition, the app shows which information is still missing or could be added to further refine the search. The result is an interactive, voice-based interface that replaces classic filter forms and is much closer to natural communication.
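As a rough sketch, such a pipeline could combine a Whisper transcription with a GPT-4-class chat model in JSON mode. The model variant, field names, and prompt below are illustrative assumptions, not the demo app's actual implementation:

```python
import json
from openai import OpenAI

client = OpenAI()

# Step 1: transcribe the spoken search request.
with open("search_request.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: turn the transcript into structured search fields.
# The field list is a hypothetical example for a clothing search.
system_prompt = (
    "Extract a clothing search query from the user's request. "
    "Respond with a JSON object using the keys category, color, size, "
    "material, and price_max. Use null for anything not mentioned."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: the article only says "GPT-4"
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcript.text},
    ],
)

search_fields = json.loads(response.choices[0].message.content)
print(search_fields)
# e.g. {'category': 'jacket', 'color': 'blue', 'size': None, ...}
```

Fields that come back as null directly mark the information that is still missing, which is what drives the app's follow-up hints.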
Wide range of applications for voice-based interfaces
The method used in our demo can be applied far beyond clothing searches. Wherever users want to enter, search for, or understand information, a combination of language processing, structured output, and targeted prompt engineering can create added value.
Possible areas of application:
Interactive instructions: Instead of clicking through help pages, users describe the problem verbally. The system asks questions and guides them to the solution.
Information retrieval: Users can ask questions about complex content, such as contracts, which the system answers based on context.
Form entry: Instead of manually typing in information, users can briefly describe the situation. The system automatically fills in the fields and points out any missing information.
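The form-entry case can build on the same null-field pattern as the demo sketch above. A small, self-contained example of generating a follow-up question for missing details (model choice and field names are again assumptions):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical extraction result in which some fields went unanswered.
search_fields = {"category": "jacket", "color": "blue", "size": None, "material": None}

# Collect the fields the user has not mentioned yet.
missing = [key for key, value in search_fields.items() if value is None]

if missing:
    follow_up = client.chat.completions.create(
        model="gpt-4o",  # assumption: the article only says "GPT-4"
        messages=[{
            "role": "system",
            "content": "Formulate one short, friendly question asking the "
                       "user for these missing details: " + ", ".join(missing),
        }],
    )
    print(follow_up.choices[0].message.content)
```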
At Bitforge, we are constantly on the lookout for technologies that create real added value for users. The combination of OpenAI Whisper and GPT-4 has proven to be particularly promising: it enables complex voice inputs to be captured, intelligently structured, and processed in a targeted manner.