Voice-to-Text: Xeoma

Voice-to-Text: Xeoma’s Intellectual Module for Speech Recognition


The AI-powered Voice-to-Text module of the Xeoma video surveillance software ‘listens’ to the audio stream from a camera or a separate microphone, hears speech, and saves the transcript of it in a CSV report or overlays it on the preview as text. Alternatively, you can set it to react to certain words or phrases. It can also work with .mp3 audio files – recordings of conversations, training videos, etc. – transcribing speech and providing it as text.

Working with Xeoma’s Voice-to-Text does not require specialized equipment: the sound stream from any camera or a separate microphone as well as regular off-the-shelf computers and video graphics cards are suitable.

Warning: this module is available starting from Xeoma 24.8.12 and is still in beta, so transcripts may skip words or contain looped (repeated) fragments.


APPLICATION SCENARIOS

The Voice-to-Text module is a flexible tool that can be used for various purposes:

Call center: transcription of ongoing calls or call recordings to monitor compliance with the company policy and conversation scripts
Taking care of the elderly: the ability to instantly react to a cry for help
City surveillance: recognition of words that signal danger, for counter-terrorism security
Parental control: assistance in ensuring the safety of a child, protecting them against bullying or contact with scammers and molesters
Police: part of the body-worn cameras for transcription of conversations between a police officer and a suspect and the ability to detect a dangerous situation
Banks, pawnshops: panic button that doesn’t need to be physically pressed
Research, analytics: background collection of statistics on frequency of use of various words and other speech-related studies
Marketing: finding out whether customers are discussing a promotional campaign, their reaction to a banner or an ad, etc.
Any business: automated control of the customer service quality (for example, detection of swear words)
Filtering and automation: detection of unwanted, prohibited words or phrases in conversations, and directing certain such episodes for closer inspection, without having to listen to all conversations

As you can see, the “Voice-to-Text” tool of the Xeoma video surveillance program can be used in a wide range of scenarios! Not only does it help to improve security in private life, the life of the city and citizens, as well as in the commercial sphere, but also contributes to the optimization of business operations.

ADVANTAGES OF THE VOICE-TO-TEXT MODULE:

No special equipment required: regular, commonly available computers and almost any camera can be used.
Simply flexible: various reactions and integration with third-party systems.
Real-time work: speech is transcribed on the fly, in real time, and processing happens only on your computer.
Affordable solution: the module is already included in Xeoma Pro licenses!

HOW IT WORKS:

First of all, it is worth noting that the module is shown in the list of modules only when the server part of Xeoma is running on suitable hardware. If you do not find the module in the list of modules, you should make sure that you are using a suitable processor and a suitable edition of Xeoma (the module is only available in the Xeoma Pro edition). Since the module works with an audio stream, you need to have some kind of sound source in the chain: either a microphone built into the camera, or a separate USB or IP microphone.

For example, let’s assume that the sound stream in your case is coming from the IP camera itself. In this case, simply use a chain of modules that has “Universal Camera” – “Voice-to-Text” – “Preview and Archive” in your Xeoma:

A sample of a chain with the Voice-to-Text intellectual module

Click on the Voice-to-Text icon in the chain to open the module settings. The first step in working with the Voice-to-Text module is to download additional resources it requires for work. The download process will start automatically when you first open the module settings. When the process of downloading additional resources is complete, the “Downloading in progress” message will disappear.

Settings of the Voice-to-Text intellectual module

Additional resources contain data arrays for artificial intelligence that the Voice-to-Text is based on, and are downloaded on request from servers of FelenaSoft. They are not supplied with the software to keep the program size small, as they are not required in all CCTV systems.

New options that appear once the additional resources have been downloaded let you choose one of several AI-powered models for speech recognition. Each model has its own strengths and weaknesses: as a rule, they differ in recognition accuracy and in the load they put on the processor. The models are named tiny, base, small, medium, and large, in order of increasing model size, recognition quality, and hardware load.

Settings of the Voice-to-Text intellectual module

In the “Language” field, select the language in which the transcript of the speech will be provided (note that the language of the speech itself does not need to be specified).

If you need to transcribe all audible conversations, you can go to the “Save data in CSV report” checkbox directly and check it. This way, the transcript of conversations will be saved in a spreadsheet file on the disk in the directory you specified, which can be integrated into other systems, for example, statistical ones.
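
As a rough sketch of feeding the CSV report into a statistics tool, the Python snippet below counts how often each word occurs in the transcripts. The column layout of the report is not described in this article, so the column that holds the transcript text (text_column), the delimiter, and the sample rows are all assumptions to adjust to the actual file.

```python
import csv
import io
import re
from collections import Counter


def count_words(csv_file, text_column=-1, delimiter=","):
    """Count word frequencies in one column of a CSV report.

    text_column is an ASSUMPTION: the report layout is not documented here,
    so by default the last column of each row is treated as the transcript.
    """
    counts = Counter()
    for row in csv.reader(csv_file, delimiter=delimiter):
        if not row:
            continue  # skip empty lines
        words = re.findall(r"\w+", row[text_column].lower())
        counts.update(words)
    return counts


if __name__ == "__main__":
    # Hypothetical sample rows standing in for a real report file.
    sample = io.StringIO(
        '2024-08-14 10:00:00,"Help, I need help"\n'
        '2024-08-14 10:00:05,"Thank you"\n'
    )
    print(count_words(sample).most_common(3))
```

For a real report, pass a file opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample.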

“Voice-to-Text” can also detect specific words or phrases. Specify them in the “Keywords for recognition” field. The module will still listen to all speech in the vicinity of the camera or microphone, but will react only when it hears the keywords. Connect the desired reaction module after the “Voice-to-Text” module to receive a notification, start recording, or send a command in this case.

In our case, the “Preview and Archive” module is connected as the destination module, so when the set keywords are detected, it will start recording the camera stream and let you search the archive for episodes containing the keyword you specify. This option can also easily be combined with saving to a CSV report: to do this, check the corresponding box below.

The “Voice-to-Text” module has its own macro, %VOICE%, that can be used in destination modules like “Email Sending”, “Application Runner” or “HTTP Request Sender” if you want to send the transcription of speech to them.
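
A minimal sketch of a script that “Application Runner” could launch is shown below. It assumes the module is configured so that the %VOICE% transcript arrives as the first command-line argument; the exact parameter setup in “Application Runner” is not covered in this article, and the watch list is hypothetical.

```python
import re
import sys

# Hypothetical watch list; replace with the words that matter to you.
WATCH_WORDS = {"help", "fire", "police"}


def find_watch_words(transcript, watch_words=WATCH_WORDS):
    """Return the watch words that occur in the transcript, sorted."""
    spoken = set(re.findall(r"\w+", transcript.lower()))
    return sorted(spoken & set(watch_words))


# ASSUMPTION: the transcript (%VOICE%) is passed as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    hits = find_watch_words(sys.argv[1])
    if hits:
        print("Watch words heard: " + ", ".join(hits))
```

Matching on whole words rather than substrings keeps a word like “helpful” from triggering on “help”.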

INTEGRATION WITH EXTERNAL PROGRAMS

Xeoma’s Voice-to-Text also has the ability to be used from external programs – for example, to transcribe VoIP conversations. Following the instructions below, you can give an .mp3 file to Voice-to-Text to decode, and get the result as text. Thus, this module can be used even for working with operator workstations where there is no Xeoma or cameras. This can be done in two ways: via the Xeoma API or by running a console command. Important: only .mp3 files are supported.

1. API. For the first option, you need to use the Xeoma API with JSON requests. Using commands, you can make a request to a remote or local Xeoma server for it to transcribe an .mp3 file into text.

For example:
curl -F "audio_file=@speech.mp3" "http://192.168.0.135:10090/api?login=Administrator&password=123&speech_recognition=recognition&model=large&language=en&denoise=true"

where
“speech.mp3” should be replaced with the path to the audio file on your computer;

“192.168.0.135:10090” should be replaced with the IP address of a running Xeoma server that is appropriate to run Voice-to-Text and its port (usually 10090);

“Administrator” should be kept as is since this is only available for the Xeoma’s Administrator profile;

“123” should be replaced with the password of the Xeoma’s Administrator profile;

“model=large” is where you choose the recognition model. See more about options above;

“denoise=true” is included if you’d like to also enable noise cancellation which in some cases helps increase recognition accuracy;

“en” should be replaced with the 2-3 character code (see below) of the language that you’d like to get the transcribed text in. If it differs from the actual speech language that the Voice-to-Text listens to, it will be automatically translated to the language you specified.

Note: This request returns the text transcription of the file directly in the console or whatever tool you use to send the request. If you’d like to save the transcription as a text file instead, add “>savetext.txt” after the command:

curl -F "audio_file=@speech.mp3" "http://192.168.0.135:10090/api?login=Administrator&password=123&speech_recognition=recognition&model=large&language=en&denoise=true">savetext.txt
where
savetext.txt should be replaced with the name you’d like the transcription file to have.
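
The same request can be assembled from a script. The sketch below builds the URL from the parameters listed above and hands it to curl through Python's subprocess module; it assumes curl is installed and on the PATH, and that leaving out “denoise” simply disables noise cancellation. The function names are illustrative and not part of the Xeoma API.

```python
import subprocess
from urllib.parse import urlencode


def build_api_url(host, password, model="large", language="en",
                  denoise=True, port=10090, login="Administrator"):
    """Build the request URL using only the parameters shown above."""
    params = {
        "login": login,
        "password": password,
        "speech_recognition": "recognition",
        "model": model,
        "language": language,
    }
    if denoise:
        # ASSUMPTION: the parameter is omitted when denoising is not wanted.
        params["denoise"] = "true"
    return f"http://{host}:{port}/api?{urlencode(params)}"


def build_curl_command(audio_path, url):
    """Return the curl invocation as an argument list (no shell involved)."""
    return ["curl", "-F", f"audio_file=@{audio_path}", url]


def transcribe(audio_path, url):
    """Run curl and return whatever text the server sends back."""
    result = subprocess.run(build_curl_command(audio_path, url),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, transcribe("speech.mp3", build_api_url("192.168.0.135", "123")) sends the same request as the curl command above and returns the response as a string, which can then be written to a file or processed further.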

2. Launch command. The second option allows you to perform recognition not through the API, but locally on the PC via commands that you can execute in a console.

Example:

{Path to Xeoma executable file} -speech2text file.mp3;out.log;large;en;denoise

where
“file.mp3” should be replaced with the path to the audio file on your computer;

“out.log” should be replaced with the path to the resulting transcription text file and its name;

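“large” is where you choose the recognition model. See more about options above;

“en” should be replaced with the code of the language that you’d like to get the transcribed text in (see the list of language codes below);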

“denoise” is included if you’d like to also enable noise cancellation which in some cases helps increase recognition accuracy.
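
When the launch command is run from a script, note that an unquoted “;” ends a command in POSIX shells, so the whole semicolon-separated value has to reach Xeoma as one argument. The sketch below does this by passing an argument list to subprocess. The path to the Xeoma executable is a placeholder, and the sketch assumes that the command returns only after the output file has been written and that the file is UTF-8 text; neither is confirmed by this article.

```python
import subprocess


def build_speech2text_command(xeoma_path, audio_path, output_path,
                              model="large", language="en", denoise=True):
    """Build the argument list for the -speech2text launch command."""
    fields = [audio_path, output_path, model, language]
    if denoise:
        fields.append("denoise")
    # The semicolon-separated value is passed as ONE argument, so a shell
    # never gets the chance to treat ';' as a command separator.
    return [xeoma_path, "-speech2text", ";".join(fields)]


def run_speech2text(xeoma_path, audio_path, output_path, **options):
    """Run the command, then return the text read from output_path."""
    command = build_speech2text_command(xeoma_path, audio_path,
                                        output_path, **options)
    # ASSUMPTION: the process exits once the transcription is complete.
    subprocess.run(command, check=True)
    # ASSUMPTION: the result file is plain UTF-8 text.
    with open(output_path, encoding="utf-8") as result_file:
        return result_file.read()
```

When typing the command by hand in a POSIX shell, wrapping the value in quotes ("file.mp3;out.log;large;en;denoise") has the same effect.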

List of language codes:

“en”: “english”,
“zh”: “chinese”,
“de”: “german”,
“es”: “spanish”,
“ru”: “russian”,
“ko”: “korean”,
“fr”: “french”,
“ja”: “japanese”,
“pt”: “portuguese”,
“tr”: “turkish”,
“pl”: “polish”,
“ca”: “catalan”,
“nl”: “dutch”,
“ar”: “arabic”,
“sv”: “swedish”,
“it”: “italian”,
“id”: “indonesian”,
“hi”: “hindi”,
“fi”: “finnish”,
“vi”: “vietnamese”,
“he”: “hebrew”,
“uk”: “ukrainian”,
“el”: “greek”,
“ms”: “malay”,
“cs”: “czech”,
“ro”: “romanian”,
“da”: “danish”,
“hu”: “hungarian”,
“ta”: “tamil”,
“no”: “norwegian”,
“th”: “thai”,
“ur”: “urdu”,
“hr”: “croatian”,
“bg”: “bulgarian”,
“lt”: “lithuanian”,
“la”: “latin”,
“mi”: “maori”,
“ml”: “malayalam”,
“cy”: “welsh”,
“sk”: “slovak”,
“te”: “telugu”,
“fa”: “persian”,
“lv”: “latvian”,
“bn”: “bengali”,
“sr”: “serbian”,
“az”: “azerbaijani”,
“sl”: “slovenian”,
“kn”: “kannada”,
“et”: “estonian”,
“mk”: “macedonian”,
“br”: “breton”,
“eu”: “basque”,
“is”: “icelandic”,
“hy”: “armenian”,
“ne”: “nepali”,
“mn”: “mongolian”,
“bs”: “bosnian”,
“kk”: “kazakh”,
“sq”: “albanian”,
“sw”: “swahili”,
“gl”: “galician”,
“mr”: “marathi”,
“pa”: “punjabi”,
“si”: “sinhala”,
“km”: “khmer”,
“sn”: “shona”,
“yo”: “yoruba”,
“so”: “somali”,
“af”: “afrikaans”,
“oc”: “occitan”,
“ka”: “georgian”,
“be”: “belarusian”,
“tg”: “tajik”,
“sd”: “sindhi”,
“gu”: “gujarati”,
“am”: “amharic”,
“yi”: “yiddish”,
“lo”: “lao”,
“uz”: “uzbek”,
“fo”: “faroese”,
“ht”: “haitian creole”,
“ps”: “pashto”,
“tk”: “turkmen”,
“nn”: “nynorsk”,
“mt”: “maltese”,
“sa”: “sanskrit”,
“lb”: “luxembourgish”,
“my”: “myanmar”,
“bo”: “tibetan”,
“tl”: “tagalog”,
“mg”: “malagasy”,
“as”: “assamese”,
“tt”: “tatar”,
“haw”: “hawaiian”,
“ln”: “lingala”,
“ha”: “hausa”,
“ba”: “bashkir”,
“jw”: “javanese”,
“su”: “sundanese”,
“yue”: “cantonese”.
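
A script that calls the API or the launch command may want to reject an unknown language code before starting a long transcription. The sketch below copies the codes from the list above into a set and checks a code against it. The helper name is illustrative, and normalizing case and whitespace is a convenience of the sketch, not documented Xeoma behavior.

```python
# Language codes copied from the list above.
LANGUAGE_CODES = frozenset("""
en zh de es ru ko fr ja pt tr pl ca nl ar sv it id hi fi vi
he uk el ms cs ro da hu ta no th ur hr bg lt la mi ml cy sk
te fa lv bn sr az sl kn et mk br eu is hy ne mn bs kk sq sw
gl mr pa si km sn yo so af oc ka be tg sd gu am yi lo uz fo
ht ps tk nn mt sa lb my bo tl mg as tt haw ln ha ba jw su yue
""".split())


def check_language(code):
    """Return the normalized code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    if normalized not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized
```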

HOW TO TEST

1. Download Xeoma from our website and launch it. Make sure that the server part of Xeoma is running on a machine with a supported processor (see the list below).
Also make sure that Xeoma is running in the Trial edition, or activate a Xeoma Pro license to work with this module.
2. Add a camera or wait while Xeoma adds cameras found in your network automatically. If you need to work with a separate microphone that is not built into the camera, connect the “Microphone” module and select the appropriate sound source.
3. Add the “Voice-to-Text” module to the chain and set it up.
4. If needed, add other modules to set up the necessary reactions, e.g. archive recording, sending email, or your own reaction.
5. Done! Now you can use Xeoma’s outstanding intellectual speech recognition.

*The Voice-to-Text module is shown and works only on the following processors:

Intel 64-bit processors of the following series:
- Intel Core processors starting from the 4th generation (including the 10th generation and newer);
- Xeon processors starting from the 6th generation;
- Atom processors of the “C23”, “C25”, “C27”, “C33”, “C35”, “C37”, “C38”, “C39”, “P59”, “Z34”, “Z35”, “x5-E39”, or “x5-E8000” series;
- Intel Xeon E5-24 series processors, i5-2450M or i7-2600.

Although this module can work using the CPU capacity, it is recommended to have a video graphics card on the server machine.

Xeoma has more!
Xeoma also offers other modules that process audio streams:
• Microphone is a module that allows you to select a USB microphone or a separate IP microphone as the sound source.
• Sound Detector is a module that allows you to analyze audio streams and trigger when the sound level exceeds a specified limit.
• Sound Events Detector is an intelligent module capable of recognizing certain sounds: car alarms, a child crying, gunshots, screams, breaking glass.


Do you need something else? We can develop it and add it to Xeoma as paid custom development.


Have questions? Need help? Please contact us! We’ll be happy to help!

August 14, 2024