Streaming transcription
Invox Medical offers the ability to transcribe audio captured directly from the microphone. To use this service, the following tasks must be performed on the client side:
- Capture audio from a microphone
- Encode audio to single-channel PCM with a sampling rate of 16 kHz
- Interact with our service using web sockets
The sections below walk through each of these steps to make the integration process easier.
Capture audio from a microphone
Audio capture relies on an input device acquiring sound from its environment and turning it into data that applications can process. There are many ways to do this, so each integrator's environment will look different.
Below is an example of how this process can be carried out in JavaScript/TypeScript:
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(mediaStream);
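// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet,
// but it is used here for simplicity and broad browser support.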
const processor = audioContext.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);
processor.onaudioprocess = (event) => {
// get audio from microphone
const inputData = event.inputBuffer.getChannelData(0);
// continue with the logic
};
Encode audio to single-channel PCM with a sampling rate of 16 kHz
Our streaming transcription service needs to receive audio encoded under certain standards: single-channel PCM with a sampling frequency of 16 kHz.
When audio is captured directly from the microphone, it isn't encoded as required, so we'd have to perform a transformation on the received information. Below, we'll show some examples in JavaScript/Typescript that perform this transformation:
// Resample to 16 kHz
function downsampleBuffer(buffer, inputSampleRate, targetSampleRate = 16000) {
if (targetSampleRate === inputSampleRate) return buffer;
const sampleRateRatio = inputSampleRate / targetSampleRate;
const newLength = Math.round(buffer.length / sampleRateRatio);
const result = new Float32Array(newLength);
let offsetResult = 0;
let offsetBuffer = 0;
while (offsetResult < result.length) {
const nextOffsetBuffer = Math.round((offsetResult + 1) * sampleRateRatio);
    let accum = 0;
    let count = 0;
    for (let i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
      accum += buffer[i];
      count++;
    }
    // Guard against an empty window at the tail of the buffer
    result[offsetResult] = count > 0 ? accum / count : 0;
offsetResult++;
offsetBuffer = nextOffsetBuffer;
}
return result;
}
// Convert Float32 → PCM16
function floatTo16BitPCM(float32Array) {
const buffer = new ArrayBuffer(float32Array.length * 2);
const view = new DataView(buffer);
let offset = 0;
for (let i = 0; i < float32Array.length; i++, offset += 2) {
let s = Math.max(-1, Math.min(1, float32Array[i]));
view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
}
return buffer;
}
Using the two functions above, we can extend the audio-capture code so that everything sent to the transcription service has the expected encoding and sampling frequency.
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(mediaStream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);
processor.onaudioprocess = (event) => {
// get audio from microphone
const inputData = event.inputBuffer.getChannelData(0);
const downsampled = downsampleBuffer(
inputData,
audioContext.sampleRate,
16000
);
const pcm16 = floatTo16BitPCM(downsampled);
// continue with the logic
};
Interact with our service using web sockets
Once the audio is captured and encoded, it is time to send it to the WebSocket endpoint we expose to complete the transcription process.
The connection URL to our websocket will be the following: wss://live.invoxmedical.com
Web socket initialization
To create a new connection via websocket we need three parameters:
- Web socket URL
- accessToken: authorization token for the request
- callbackUrl: public URL to which we will send the transcript once completed (it must be URL-encoded when passed as a query parameter)
Below is a JavaScript/TypeScript example of how the WebSocket is initialized:
const webSocketUrl = "wss://live.invoxmedical.com";
const accessToken = "mockAccessToken";
const callbackUrl = encodeURIComponent("https://example.mydomain.com");
const ws = new WebSocket(
`${webSocketUrl}?callbackUrl=${callbackUrl}&accessToken=${accessToken}`
);
If you need a reminder of how to generate an accessToken, see the following section: Credentials
If the web socket can be opened successfully, the response structure received will be as follows:
Successful request (200)
Response example:
{
"message": "Connection established.",
"connectionId": "mock-connectionId"
}
type OpenWebSocketType = {
message: string,
connectionId: string
}
Important: It is recommended to store the connectionId, since this value is sent along with the completed transcription and identifies the recording it belongs to.
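For example, the connectionId can be captured from the first message received after the socket opens (the storage variable here is our own):
let connectionId;

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  // The connection message carries the connectionId; keep it so the
  // callback payload can later be matched to this recording
  if (data.connectionId) {
    connectionId = data.connectionId;
  }
};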
If we are unable to open a websocket connection, we will get one of the following errors:
Unauthorized request (401)
Returned when the request is not authorized.
Response body:
type TranscribeStreamingAudioResponseType = {
message: string
messageType: 'invalidAccessToken'
}
Forbidden request (403)
Returned when the API Key is not allowed to consume this service.
Response body:
type TranscribeStreamingAudioResponseType = {
message: string
errorType: 'insuficientPermissions'
}
Bad request (400)
Returned when the connection request is malformed.
Response body:
type TranscribeStreamingAudioResponseType = {
  message: string
  errorType: OPEN_WS_BAD_REQUEST_ERROR
}
enum OPEN_WS_BAD_REQUEST_ERROR {
  MISSING_CALLBACK_URL = 'missingCallbackUrl',
  INVALID_INPUT_PARAMETER = 'invalidInputParameter',
}
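As a sketch of how these errors could be handled on the client, assuming the error body is delivered over the socket before it closes (the routing logic here is our own):
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.errorType) {
    case 'insuficientPermissions':
      // the API Key cannot use the streaming service
      break;
    case 'missingCallbackUrl':
    case 'invalidInputParameter':
      // fix the query parameters and open a new connection
      break;
    default:
      // not an error payload; handle normally
      break;
  }
};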
Send data to web socket
Once we have the web socket open, we can proceed to send data. The structure the web socket expects is as follows:
type WebSocketPayload = {
  message: string
  messageType: 'AUDIO' | 'PING'
}
Below we will explain each field:
Message attribute
Refers to the Base64 representation of the audio fragment captured by the microphone. It is used when the messageType attribute is AUDIO.
An example of how to convert the audio captured by the microphone to Base64, after it has been transformed into single-channel PCM with a sampling frequency of 16 kHz, would be:
function arrayBufferToBase64(buffer) {
// Create an 8-bit array
const uint8Array = new Uint8Array(buffer);
// Convert array to base64 string
let binary = "";
for (let i = 0; i < uint8Array.length; i++) {
binary += String.fromCharCode(uint8Array[i]);
}
return btoa(binary);
}
If this type of message is sent, the expected response would be:
type AudioMessageResponse = {
audioReceived: boolean, // Must be true
duration: number // duration related to the received audio
}
MessageType attribute
It tells us the type of message we are sending:
- Audio message
- Message to keep the service alive
The purpose of the AUDIO message is clear: it sends audio data to be transcribed (the transcription itself is produced once the Web Socket is closed).
PING messages are used to prevent the Web Socket from being closed due to inactivity. For example, if a consultation recording is paused partway through, the Web Socket should not be closed; while the recording remains paused, it is recommended to send this type of message every 10 to 20 seconds, telling the server to keep the session open. A minimal keep-alive sketch follows the expected response below.
If this type of message is sent, the expected response would be:
type PingMessageResponse = {
audioReceived: boolean, // Must be false
message: string
}
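A minimal keep-alive sketch while a recording is paused; the content of the message field for PING messages is not specified above, so an empty string is assumed:
let pingInterval = null;

function onPause() {
  // Send a PING every 15 seconds (within the 10-20 second range recommended above)
  pingInterval = setInterval(() => {
    ws.send(JSON.stringify({ message: "", messageType: "PING" }));
  }, 15000);
}

function onResume() {
  clearInterval(pingInterval);
  pingInterval = null;
}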
Capturing and sending audio
Putting together each of the previous sections, the complete example looks like this:
const webSocketUrl = "wss://live.invoxmedical.com";
const accessToken = "mockAccessToken";
const callbackUrl = encodeURIComponent("https://example.mydomain.com");
const ws = new WebSocket(
`${webSocketUrl}?callbackUrl=${callbackUrl}&accessToken=${accessToken}`
);
ws.onopen = () => {
console.log("✅ Connected to Web Socket server");
};
ws.onclose = () => {
console.log("❌ Closed");
};
ws.onerror = (err) => {
console.error("⚠️ WebSocket error:", err);
};
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log("🏓 Server response:", data);
};
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(mediaStream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);
processor.onaudioprocess = (event) => {
// get audio from microphone
const inputData = event.inputBuffer.getChannelData(0);
const downsampled = downsampleBuffer(
inputData,
audioContext.sampleRate,
16000
);
const pcm16 = floatTo16BitPCM(downsampled);
ws.send(
JSON.stringify({
message: arrayBufferToBase64(pcm16),
messageType: "AUDIO",
})
);
};
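When the recording ends, stop capturing audio and close the socket so the platform can finish the transcription; a minimal cleanup sketch, using the variables from the example above:
function stopRecording() {
  processor.disconnect();
  source.disconnect();
  mediaStream.getTracks().forEach((track) => track.stop());
  audioContext.close();
  // Closing the Web Socket tells the platform that no more audio will be sent
  ws.close();
}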
Transcription response
Once the web socket is closed, the platform assumes that no more audio will be sent for the current recording. At that point, the system completes the transcription process and sends the result to the callbackUrl provided when the web socket was opened. The message sent will have the following structure:
type TranscriptionFinished = {
transcriptionStatus: 'SUCCESS' | 'FAILED'
errorMessage?: string
connectionId: string
signature: string,
transcription?: ITranscriptionSegment[]
}
interface ITranscriptionSegment {
speakerId: string
text: string
start: number
end: number
speakerName: string
}
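An illustrative example of the callback body (all values are placeholders):
{
  "transcriptionStatus": "SUCCESS",
  "connectionId": "mock-connectionId",
  "signature": "mockSignature",
  "transcription": [
    {
      "speakerId": "spk_0",
      "text": "Good morning, how are you feeling today?",
      "start": 0,
      "end": 2.4,
      "speakerName": "Speaker 1"
    }
  ]
}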
Important note: The endpoint you prepare to receive the transcription result must accept POST requests and must allow requests from our origin.
Security Warning: To verify that the request genuinely comes from Invox Medical, you must validate the signature.
Using Node.js, an example of how to decrypt the signature and validate it would be:
import * as CryptoJS from 'crypto-js';
export const validateSignature = (signature: string): boolean => {
try {
    const secret = process.env.APP_SECRET;
    const appId = process.env.APP_ID;
    const apiKey = process.env.API_KEY;
    // All three credentials are required to rebuild the expected value
    if (!secret || !appId || !apiKey) {
      return false;
    }
    const expectedValue = `${appId}~${apiKey}`;
    // The signature is the AES-encrypted "<appId>~<apiKey>" pair
    const bytes = CryptoJS.AES.decrypt(signature, secret);
    const originalText = bytes.toString(CryptoJS.enc.Utf8);
    if (!originalText) {
      return false;
    }
    return originalText === expectedValue;
} catch (error) {
return false
}
}
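For completeness, here is a minimal sketch of the callback endpoint itself, assuming an Express server (the route path and port are our own choices, not requirements):
import express from 'express';
import { validateSignature } from './validate-signature';

const app = express();
app.use(express.json());

app.post('/transcription-callback', (req, res) => {
  const payload = req.body; // TranscriptionFinished
  // Reject requests whose signature cannot be verified
  if (!validateSignature(payload.signature)) {
    return res.status(401).send();
  }
  if (payload.transcriptionStatus === 'SUCCESS') {
    // Match payload.connectionId to the stored recording and persist payload.transcription
  }
  return res.status(200).send();
});

app.listen(3000);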