Streaming guide


Invox Medical offers the ability to transcribe audio captured directly from the microphone. To use this service, the following tasks must be performed on the client side:

  • Capture audio from a microphone
  • Encode audio to single-channel PCM with a sampling rate of 16 kHz
  • Interact with our service using web sockets

The sections below walk through each of these points to make integration easier.

Capture audio from a microphone

Audio capture consists of letting an input device pick up sound from its environment and turn it into data that applications can process. There are many ways to do this, so each integrator's environment and scenario will differ.

The recommended approach is to use the AudioWorklet API, which runs audio processing on a dedicated thread without blocking the main UI thread. The older ScriptProcessorNode API is deprecated and should not be used in new integrations.

ScriptProcessorNode has been deprecated since the Web Audio API specification update and may be removed from browsers in the future. Use AudioWorkletProcessor for all new integrations.

The worklet processor handles the complete pipeline inside a dedicated audio thread:

  • Buffers raw input samples (microphone is typically 44.1 kHz or 48 kHz)
  • Downsamples to 16 kHz using averaging (required by our transcription service)
  • Converts Float32 samples to 16-bit signed PCM (little-endian)
  • Posts ready-to-send ArrayBuffer chunks (~85 ms each) back to the main thread via port.postMessage

The processor code can be loaded as an inline blob URL — no extra file needs to be hosted on your server:

const processorCode = `
class AudioProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.inputSampleRate = sampleRate; // AudioWorkletGlobalScope global (typically 44100 or 48000)
    this.targetSampleRate = 16000;
    // ~85 ms of output audio per chunk
    this.targetChunkSamples = Math.floor(this.targetSampleRate * 0.085);
    this.samplesNeeded = Math.floor(
      this.targetChunkSamples * (this.inputSampleRate / this.targetSampleRate)
    );
    this.bufferSize = this.inputSampleRate * 2; // 2-second safety buffer
    this.buffer = new Float32Array(this.bufferSize);
    this.writeIndex = 0;
  }

  downsample(length) {
    const ratio = this.inputSampleRate / this.targetSampleRate;
    const outLen = Math.round(length / ratio);
    const result = new Float32Array(outLen);
    for (let i = 0; i < outLen; i++) {
      const start = Math.floor(i * ratio);
      const end = Math.min(Math.floor((i + 1) * ratio), length);
      let sum = 0;
      for (let j = start; j < end; j++) sum += this.buffer[j];
      result[i] = sum / (end - start);
    }
    return result;
  }

  floatTo16BitPCM(input) {
    const buf = new ArrayBuffer(input.length * 2);
    const view = new DataView(buf);
    for (let i = 0; i < input.length; i++) {
      const s = Math.max(-1, Math.min(1, input[i]));
      view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }
    return buf;
  }

  process(inputs) {
    const input = inputs[0];
    if (!input || !input[0]) return true;
    const data = input[0];

    if (this.writeIndex + data.length <= this.bufferSize) {
      this.buffer.set(data, this.writeIndex);
      this.writeIndex += data.length;
    } else {
      // Buffer overflow protection: reset and continue
      this.writeIndex = 0;
      return true;
    }

    while (this.writeIndex >= this.samplesNeeded) {
      const downsampled = this.downsample(this.samplesNeeded);
      const pcm16 = this.floatTo16BitPCM(downsampled);
      // Shift remaining samples to the front
      this.buffer.copyWithin(0, this.samplesNeeded, this.writeIndex);
      this.writeIndex -= this.samplesNeeded;
      // Transfer ownership for zero-copy delivery
      this.port.postMessage({ pcm16, samples: downsampled.length }, [pcm16]);
    }
    return true;
  }
}
registerProcessor('audio-processor', AudioProcessor);
`;

// Load processor as inline blob — no file hosting required
const blob = new Blob([processorCode], { type: 'application/javascript' });
const processorUrl = URL.createObjectURL(blob);

const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();

await audioContext.audioWorklet.addModule(processorUrl);
URL.revokeObjectURL(processorUrl); // safe to revoke after addModule resolves

const source = audioContext.createMediaStreamSource(mediaStream);
const workletNode = new AudioWorkletNode(audioContext, 'audio-processor');
source.connect(workletNode);
workletNode.connect(audioContext.destination);

workletNode.port.onmessage = (event) => {
  const { pcm16 } = event.data; // ArrayBuffer: PCM16 @ 16 kHz, mono
  // continue with the logic
};

Encode audio to single-channel PCM with a sampling rate of 16 kHz

Our streaming transcription service requires audio encoded as single-channel PCM at 16 kHz, 16-bit signed little-endian. When using the AudioWorkletProcessor shown above, this encoding is handled automatically inside the worklet thread — the pcm16 value delivered in onmessage is already in the correct format.

If you are working in an environment where AudioWorklet is not available (e.g., certain server-side or non-browser runtimes), the same transformation can be performed on the main thread using these helper functions:

// Resample Float32 buffer from any sample rate to 16 kHz
function downsampleBuffer(
  buffer: Float32Array,
  inputSampleRate: number,
  targetSampleRate = 16000
): Float32Array {
  if (targetSampleRate === inputSampleRate) return buffer;
  const ratio = inputSampleRate / targetSampleRate;
  const outLength = Math.round(buffer.length / ratio);
  const result = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), buffer.length);
    let sum = 0;
    for (let j = start; j < end; j++) sum += buffer[j];
    result[i] = sum / (end - start);
  }
  return result;
}

// Convert Float32 samples → PCM16 ArrayBuffer
function floatTo16BitPCM(float32Array: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Array.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
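As a quick sanity check of these helpers, 100 ms of audio captured at 48 kHz should yield 1,600 output samples and 3,200 PCM bytes. The functions are re-declared here so the snippet runs standalone:

```typescript
// Re-declared from the guide so the snippet is self-contained
function downsampleBuffer(
  buffer: Float32Array,
  inputSampleRate: number,
  targetSampleRate = 16000
): Float32Array {
  if (targetSampleRate === inputSampleRate) return buffer;
  const ratio = inputSampleRate / targetSampleRate;
  const outLength = Math.round(buffer.length / ratio);
  const result = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), buffer.length);
    let sum = 0;
    for (let j = start; j < end; j++) sum += buffer[j];
    result[i] = sum / (end - start);
  }
  return result;
}

function floatTo16BitPCM(float32Array: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Array.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// 100 ms of audio captured at 48 kHz → 4800 input samples
const captured = new Float32Array(4800);
const down = downsampleBuffer(captured, 48000); // 1600 samples @ 16 kHz
const pcm = floatTo16BitPCM(down);              // 3200 bytes (2 bytes/sample)
```

Note that the averaging downsampler is a simple low-pass approximation; for higher fidelity you could apply a proper anti-aliasing filter before decimation.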

The AudioWorklet emits small chunks of approximately 85 ms each. Sending every tiny chunk as a separate WebSocket message is inefficient — it increases protocol overhead and gives the transcription engine less audio context per segment. We recommend accumulating at least 0.5 seconds of PCM audio before sending each message.

The threshold in bytes for 0.5 s of PCM16 at 16 kHz mono:

TARGET_BYTES = 16000 samples/s × 2 bytes/sample × 0.5 s = 16 000 bytes
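For reference, the same arithmetic as a small helper (the function name is ours):

```typescript
// Bytes of PCM16 audio for a given duration; defaults match the service format
function pcmBytes(seconds: number, sampleRate = 16000, bytesPerSample = 2): number {
  return sampleRate * bytesPerSample * seconds;
}

pcmBytes(0.5);   // → 16000 bytes, the recommended send threshold
pcmBytes(0.085); // ≈ 2720 bytes, one ~85 ms worklet chunk
```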

The following accumulator can be placed between the worklet's onmessage handler and your WebSocket send call:

// 0.5 s of PCM16 @ 16 kHz mono = 16 000 bytes
const TARGET_PCM_BYTES = 16000 * 2 * 0.5;

let pcmBuffer: ArrayBuffer[] = [];
let pcmBytesCollected = 0;

function onPcmChunk(chunk: ArrayBuffer): void {
  pcmBuffer.push(chunk);
  pcmBytesCollected += chunk.byteLength;

  if (pcmBytesCollected >= TARGET_PCM_BYTES) {
    flushBuffer();
  }
}

function flushBuffer(): ArrayBuffer | null {
  if (pcmBuffer.length === 0) return null;

  // Combine all buffered chunks into one contiguous ArrayBuffer
  const combined = new Uint8Array(pcmBytesCollected);
  let offset = 0;
  for (const buf of pcmBuffer) {
    combined.set(new Uint8Array(buf), offset);
    offset += buf.byteLength;
  }

  pcmBuffer = [];
  pcmBytesCollected = 0;

  return combined.buffer;
}

Wire it into the worklet's message handler:

workletNode.port.onmessage = (event) => {
  const { pcm16 } = event.data;
  onPcmChunk(pcm16);
};

And in flushBuffer, once you have a combined buffer, encode it and send it through the WebSocket (see the next sections for how to send data).

Interact with our service using web sockets

Once the audio has been captured and encoded, it is ready to be sent for transcription through the WebSocket endpoint we expose.

The WebSocket connection URL is the following: wss://live.invoxmedical.com

Web socket initialization

To create a new connection via WebSocket, the following parameters are needed:

  • Web socket URL
  • accessToken: authorization token for the request
  • sessionId: Unique identifier for the transcription session. The sessionId must be unique for each transcription session and can be used to create a new transcription session or continue an existing one. This field must be a UUID.
  • callbackUrl: Public URL where we'll send the completed transcript. This field isn't required; if you don't send it, you'll need to use another method to retrieve the transcript.
  • version: The version of the API to use for the request. This is important for ensuring compatibility with the service. This parameter is optional; if not provided, the latest version will be used.

Below is an example using JavaScript/TypeScript of how a WebSocket is initialized:

If you don't remember how to generate an accessToken, you can go to the following section: Credentials

If the web socket can be opened successfully, you can immediately begin sending audio data for further transcription.

Remember that when a WebSocket connection is opened successfully, a response will be received indicating that the connection has been established. You can use the following code to handle it:

ws.onopen = () => {
  console.log("✅ Connected to Web Socket server");
};

If an error occurs while opening the web socket, it can be handled as follows:

ws.onerror = (error) => {
  console.error("❌ Web Socket error:", error);
};

Remember that to consume this service, the API Key must have the permission: REAL-TIME-TRANSCRIBE

Send data to web socket

Once we have the web socket open, we can proceed to send data. The structure the web socket expects is as follows:
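In outline (the field names are taken from the code samples in this guide; treat this as a sketch rather than a formal schema):

```typescript
type StreamingMessage = {
  // Base64-encoded PCM16 @ 16 kHz mono; required when messageType is 'AUDIO'
  message?: string;
  // 'AUDIO' streams audio, 'PING' keeps the session alive,
  // 'END_TRANSCRIPTION' finishes the session
  messageType: 'AUDIO' | 'PING' | 'END_TRANSCRIPTION';
  // The same UUID used when opening the connection
  sessionId: string;
  // Optional API version; the latest is used when omitted
  version?: string;
};

const example: StreamingMessage = {
  message: 'AAAA', // base64 audio payload (placeholder)
  messageType: 'AUDIO',
  sessionId: '123e4567-e89b-12d3-a456-426614174000',
};
```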

If you don't send version in the payload, the latest version will be used.

The maximum payload size that can be sent in a single WebSocket message is 32 KB. Ensure your audio chunks are properly sized to avoid exceeding this limit.

We recommend sending audio chunks of 0.5 seconds for optimal transcription quality and processing performance.

Message attribute

Refers to the Base64 representation of the audio fragment received by the microphone. It is used when the messageType attribute is Audio.

An example of how to convert the microphone audio to Base64, after it has been transformed into single-channel PCM at a 16 kHz sampling rate, would be:

function arrayBufferToBase64(buffer) {
  // Create an 8-bit array
  const uint8Array = new Uint8Array(buffer);

  // Convert array to base64 string
  let binary = "";
  for (let i = 0; i < uint8Array.length; i++) {
    binary += String.fromCharCode(uint8Array[i]);
  }

  return btoa(binary);
}
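A quick self-check of the helper: the two bytes of the ASCII string "Hi" encode to "SGk=". The function is re-declared so the snippet runs standalone:

```typescript
// Re-declared from the guide so the snippet is self-contained
function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

const hi = new Uint8Array([72, 105]).buffer; // the bytes of "Hi"
arrayBufferToBase64(hi); // → "SGk="
```

Since each flush is only about 16 000 bytes of PCM (0.5 s), the simple byte-by-byte loop is fast enough here.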

If this type of message is sent, the expected response would be:

type AudioMessageResponse = {
  audioReceived: boolean; // Must be true
  duration: number; // duration related to the received audio
};
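One way to use these acknowledgments is to keep a running total of the audio the server has confirmed. The helper below is our own illustration, not part of the API (duration is marked optional here so the same handler can also receive PING acknowledgments):

```typescript
type AckResponse = { audioReceived: boolean; duration?: number };

// Tracks how much acknowledged audio the server has received (illustrative)
function makeAckTracker() {
  let totalDuration = 0;
  return {
    handle(res: AckResponse): void {
      if (res.audioReceived && typeof res.duration === 'number') {
        totalDuration += res.duration;
      }
    },
    confirmedDuration: () => totalDuration,
  };
}
```

In ws.onmessage, pass the parsed payload to tracker.handle(...).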

MessageType attribute

The purpose of the AUDIO message is clear: it sends audio data to be transcribed (the final transcript is produced once the Web Socket session is closed).

PING messages are used to prevent the Web Socket from being closed due to inactivity. For example:

If we are recording a consultation and it is paused partway through, the Web Socket should not be closed. While the recording is paused, we recommend sending this type of message every 10–20 seconds, telling the server to keep the session open.

If this type of message is sent, the expected response would be:

type PingMessageResponse = {
  audioReceived: boolean; // Must be false
  message: string;
};
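The pause behavior described above can be sketched as a small keep-alive timer. The helper names and the 15-second default are our own choices (within the recommended 10–20 s window):

```typescript
function buildPingMessage(sessionId: string): string {
  return JSON.stringify({ messageType: 'PING', sessionId });
}

// Starts pinging and returns a stop function; call it when recording resumes
function startKeepAlive(
  send: (payload: string) => void,
  sessionId: string,
  intervalMs = 15_000 // inside the recommended 10–20 s window
): () => void {
  const timer = setInterval(() => send(buildPingMessage(sessionId)), intervalMs);
  return () => clearInterval(timer);
}
```

While paused: `const stopPings = startKeepAlive((p) => ws.send(p), SESSION_ID);` and on resume, call `stopPings()`.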

Capturing and sending audio

Continuing the example developed through the previous sections, we can bring everything together as follows.

This example combines all the techniques described above:

  • AudioWorklet for modern, non-blocking audio capture (PCM16 @ 16 kHz)
  • 0.5 s audio accumulation before each WebSocket send
  • In-memory send queue so chunks captured while the socket is not yet open are held and drained automatically once connected
  • Online/offline detection to pause sending during network loss and resume when connectivity returns
  • Exponential backoff reconnection so temporary outages are handled gracefully without manual retries

// ─── Helpers ──────────────────────────────────────────────────────────────────

function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

function generateSessionId(): string {
  return crypto.randomUUID();
}

function jitter(ms: number): number {
  return Math.floor(ms * (Math.random() * 0.4 + 0.8)); // ±20%
}

// ─── WebSocket with reconnection ──────────────────────────────────────────────

const WS_URL       = 'wss://live.invoxmedical.com';
const ACCESS_TOKEN = 'your_access_token';
const CALLBACK_URL = encodeURIComponent('https://example.mydomain.com');
const SESSION_ID   = generateSessionId();

let ws: WebSocket | null = null;
let backoffMs = 500;
const MAX_BACKOFF_MS = 30_000;
let intentionalClose = false;

// In-memory queue: chunks captured before the socket opened, or during reconnect
const sendQueue: string[] = []; // base64-encoded PCM16 payloads
let isOnline = navigator.onLine;

function flushSendQueue(): void {
  while (sendQueue.length > 0 && ws?.readyState === WebSocket.OPEN && isOnline) {
    const payload = sendQueue.shift()!;
    ws.send(payload);
  }
}

function connectWebSocket(): void {
  if (intentionalClose) return;

  ws = new WebSocket(
    `${WS_URL}?callbackUrl=${CALLBACK_URL}&accessToken=${ACCESS_TOKEN}&sessionId=${SESSION_ID}`
  );

  ws.onopen = () => {
    backoffMs = 500; // reset on successful connection
    flushSendQueue();
  };

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    if (data.isTranscriptionCompleted === true) {
      // Transcription is ready — retrieve it via the Get Transcription endpoint
      console.log('Transcription complete', data);
    }
  };

  ws.onclose = () => {
    if (intentionalClose) return;
    // Reconnect with exponential backoff
    setTimeout(() => {
      backoffMs = Math.min(MAX_BACKOFF_MS, backoffMs * 2);
      connectWebSocket();
    }, jitter(backoffMs));
  };

  ws.onerror = () => {
    ws?.close(); // triggers onclose → reconnect loop
  };
}

// Pause/resume queue processing based on network availability
window.addEventListener('online',  () => { isOnline = true;  flushSendQueue(); });
window.addEventListener('offline', () => { isOnline = false; });

// ─── Audio accumulator (0.5 s threshold) ─────────────────────────────────────

const TARGET_PCM_BYTES = 16000 * 2 * 0.5; // 0.5 s of PCM16 @ 16 kHz mono
let pcmBuffer: ArrayBuffer[] = [];
let pcmBytesCollected = 0;

function onPcmChunk(chunk: ArrayBuffer): void {
  pcmBuffer.push(chunk);
  pcmBytesCollected += chunk.byteLength;

  if (pcmBytesCollected >= TARGET_PCM_BYTES) {
    flushAudioBuffer();
  }
}

function flushAudioBuffer(): void {
  if (pcmBuffer.length === 0) return;

  const combined = new Uint8Array(pcmBytesCollected);
  let offset = 0;
  for (const buf of pcmBuffer) {
    combined.set(new Uint8Array(buf), offset);
    offset += buf.byteLength;
  }
  pcmBuffer = [];
  pcmBytesCollected = 0;

  const payload = JSON.stringify({
    message: arrayBufferToBase64(combined.buffer),
    messageType: 'AUDIO',
    sessionId: SESSION_ID,
  });

  if (ws?.readyState === WebSocket.OPEN && isOnline) {
    ws.send(payload);
  } else {
    // Queue the chunk — it will be sent once the socket reconnects
    sendQueue.push(payload);
  }
}

// ─── AudioWorklet capture ─────────────────────────────────────────────────────

const processorCode = `
class AudioProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.inputSampleRate = sampleRate; // AudioWorkletGlobalScope global
    this.targetSampleRate = 16000;
    this.targetChunkSamples = Math.floor(this.targetSampleRate * 0.085);
    this.samplesNeeded = Math.floor(
      this.targetChunkSamples * (this.inputSampleRate / this.targetSampleRate)
    );
    this.bufferSize = this.inputSampleRate * 2;
    this.buffer = new Float32Array(this.bufferSize);
    this.writeIndex = 0;
  }
  downsample(length) {
    const ratio = this.inputSampleRate / this.targetSampleRate;
    const outLen = Math.round(length / ratio);
    const result = new Float32Array(outLen);
    for (let i = 0; i < outLen; i++) {
      const start = Math.floor(i * ratio);
      const end = Math.min(Math.floor((i + 1) * ratio), length);
      let sum = 0;
      for (let j = start; j < end; j++) sum += this.buffer[j];
      result[i] = sum / (end - start);
    }
    return result;
  }
  floatTo16BitPCM(input) {
    const buf = new ArrayBuffer(input.length * 2);
    const view = new DataView(buf);
    for (let i = 0; i < input.length; i++) {
      const s = Math.max(-1, Math.min(1, input[i]));
      view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }
    return buf;
  }
  process(inputs) {
    const input = inputs[0];
    if (!input || !input[0]) return true;
    const data = input[0];
    if (this.writeIndex + data.length <= this.bufferSize) {
      this.buffer.set(data, this.writeIndex);
      this.writeIndex += data.length;
    } else {
      this.writeIndex = 0;
      return true;
    }
    while (this.writeIndex >= this.samplesNeeded) {
      const downsampled = this.downsample(this.samplesNeeded);
      const pcm16 = this.floatTo16BitPCM(downsampled);
      this.buffer.copyWithin(0, this.samplesNeeded, this.writeIndex);
      this.writeIndex -= this.samplesNeeded;
      this.port.postMessage({ pcm16, samples: downsampled.length }, [pcm16]);
    }
    return true;
  }
}
registerProcessor('audio-processor', AudioProcessor);
`;

const blob = new Blob([processorCode], { type: 'application/javascript' });
const processorUrl = URL.createObjectURL(blob);

const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();

await audioContext.audioWorklet.addModule(processorUrl);
URL.revokeObjectURL(processorUrl);

const source = audioContext.createMediaStreamSource(mediaStream);
const workletNode = new AudioWorkletNode(audioContext, 'audio-processor');
source.connect(workletNode);
workletNode.connect(audioContext.destination);

workletNode.port.onmessage = (event) => {
  const { pcm16 } = event.data;
  onPcmChunk(pcm16);
};

// ─── Start ────────────────────────────────────────────────────────────────────

// Open WebSocket immediately — chunks queued before it opens will be flushed in onopen
connectWebSocket();

// To stop recording and trigger transcription:
// 1. Flush any remaining buffered audio
// 2. Send END_TRANSCRIPTION
// 3. Mark the socket as intentionally closing

function stopRecording(): void {
  intentionalClose = true;

  // Flush any sub-threshold audio still in the accumulator
  flushAudioBuffer();

  // Tell the service no more audio is coming
  ws?.send(JSON.stringify({
    messageType: 'END_TRANSCRIPTION',
    sessionId: SESSION_ID,
  }));
}

For production use, consider adding acknowledgment handling, per-chunk retries, and a circuit breaker to protect against sustained service outages. See Audio Protection for a complete reference implementation.

Transcription response

When you receive the transcription result, the structure will be as follows:

type TranscriptionFinished = {
  transcriptionStatus: "SUCCESS" | "FAILED";
  errorMessage?: string;
  connectionId: string;
  signature: string;
  transcription?: ITranscriptionSegment[];
  vttTranscription?: string;
};

interface ITranscriptionSegment {
  speakerId: string;
  text: string;
  start: number;
  end: number;
  speakerName: string;
}

Important note: The endpoint you need to prepare to receive the transcription result must be of type POST and must be able to accept requests from our source.

Security Warning: To verify that the request genuinely comes from Invox Medical, you must validate the signature.

Using Node.js, here is an example of how to decrypt the signature and validate it:

import * as CryptoJS from "crypto-js";

export const validateSignature = (signature: string): boolean => {
  try {
    const secret = process.env.APP_SECRET;
    const appId = process.env.APP_ID;
    const apiKey = process.env.API_KEY;
    const expectedValue = `${appId}~${apiKey}`;

    const bytes = CryptoJS.AES.decrypt(signature, secret);
    const originalText = bytes.toString(CryptoJS.enc.Utf8);

    if (!originalText) {
      return false;
    }
    return originalText === expectedValue;
  } catch (error) {
    return false;
  }
};
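To tie the callback together, the handling can be reduced to a pure function that your POST endpoint delegates to. validateSignature is injected as a parameter so the sketch stays framework-agnostic; the function name and return convention here are our own:

```typescript
interface ITranscriptionSegment {
  speakerId: string;
  text: string;
  start: number;
  end: number;
  speakerName: string;
}

type TranscriptionFinished = {
  transcriptionStatus: 'SUCCESS' | 'FAILED';
  errorMessage?: string;
  connectionId: string;
  signature: string;
  transcription?: ITranscriptionSegment[];
  vttTranscription?: string;
};

// Returns the segments on success, or null when the request must be rejected
function processCallback(
  body: TranscriptionFinished,
  validateSignature: (signature: string) => boolean
): ITranscriptionSegment[] | null {
  if (!validateSignature(body.signature)) return null; // not from Invox Medical
  if (body.transcriptionStatus !== 'SUCCESS') return null; // see body.errorMessage
  return body.transcription ?? [];
}
```

A rejected signature should translate into an HTTP error response (for example 401) from your endpoint.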