Streaming guide


Invox Medical offers the ability to transcribe audio captured directly from the microphone. To use this service, the following tasks must be performed on the client side:

  • Capture audio from a microphone
  • Encode audio to single-channel PCM with a sampling rate of 16 kHz
  • Interact with our service using web sockets

The sections below walk through each of these points to make integration easier.

Capture audio from a microphone

Audio capture consists of letting an input device pick up sound from its environment and turn it into data that applications can process. There are many ways to do this, so each integrator's environment and scenario will differ.

The recommended approach is to use the AudioWorklet API, which runs audio processing on a dedicated thread without blocking the main UI thread. The older ScriptProcessorNode API is deprecated and should not be used in new integrations.

ScriptProcessorNode has been deprecated since the Web Audio API specification update and may be removed from browsers in the future. Use AudioWorkletProcessor for all new integrations.

The worklet processor handles the complete pipeline inside a dedicated audio thread:

  • Buffers raw input samples (microphone is typically 44.1 kHz or 48 kHz)
  • Downsamples to 16 kHz using averaging (required by our transcription service)
  • Converts Float32 samples to 16-bit signed PCM (little-endian)
  • Posts ready-to-send ArrayBuffer chunks (~85 ms each) back to the main thread via port.postMessage

The processor code can be loaded as an inline blob URL — no extra file needs to be hosted on your server:

const processorCode = `
class AudioProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.inputSampleRate = sampleRate; // AudioWorkletGlobalScope global (typically 44100 or 48000)
    this.targetSampleRate = 16000;
    // ~85 ms of output audio per chunk
    this.targetChunkSamples = Math.floor(this.targetSampleRate * 0.085);
    this.samplesNeeded = Math.floor(
      this.targetChunkSamples * (this.inputSampleRate / this.targetSampleRate)
    );
    this.bufferSize = this.inputSampleRate * 2; // 2-second safety buffer
    this.buffer = new Float32Array(this.bufferSize);
    this.writeIndex = 0;
  }

  downsample(length) {
    const ratio = this.inputSampleRate / this.targetSampleRate;
    const outLen = Math.round(length / ratio);
    const result = new Float32Array(outLen);
    for (let i = 0; i < outLen; i++) {
      const start = Math.floor(i * ratio);
      const end = Math.min(Math.floor((i + 1) * ratio), length);
      let sum = 0;
      for (let j = start; j < end; j++) sum += this.buffer[j];
      result[i] = sum / (end - start);
    }
    return result;
  }

  floatTo16BitPCM(input) {
    const buf = new ArrayBuffer(input.length * 2);
    const view = new DataView(buf);
    for (let i = 0; i < input.length; i++) {
      const s = Math.max(-1, Math.min(1, input[i]));
      view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }
    return buf;
  }

  process(inputs) {
    const input = inputs[0];
    if (!input || !input[0]) return true;
    const data = input[0];

    if (this.writeIndex + data.length <= this.bufferSize) {
      this.buffer.set(data, this.writeIndex);
      this.writeIndex += data.length;
    } else {
      // Buffer overflow protection: reset and continue
      this.writeIndex = 0;
      return true;
    }

    while (this.writeIndex >= this.samplesNeeded) {
      const downsampled = this.downsample(this.samplesNeeded);
      const pcm16 = this.floatTo16BitPCM(downsampled);
      // Shift remaining samples to the front
      this.buffer.copyWithin(0, this.samplesNeeded, this.writeIndex);
      this.writeIndex -= this.samplesNeeded;
      // Transfer ownership for zero-copy delivery
      this.port.postMessage({ pcm16, samples: downsampled.length }, [pcm16]);
    }
    return true;
  }
}
registerProcessor('audio-processor', AudioProcessor);
`;

// Load processor as inline blob — no file hosting required
const blob = new Blob([processorCode], { type: 'application/javascript' });
const processorUrl = URL.createObjectURL(blob);

const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();

await audioContext.audioWorklet.addModule(processorUrl);
URL.revokeObjectURL(processorUrl); // safe to revoke after addModule resolves

const source = audioContext.createMediaStreamSource(mediaStream);
const workletNode = new AudioWorkletNode(audioContext, 'audio-processor');
source.connect(workletNode);
workletNode.connect(audioContext.destination);

workletNode.port.onmessage = (event) => {
  const { pcm16 } = event.data; // ArrayBuffer: PCM16 @ 16 kHz, mono
  // continue with the logic
};

Encode audio to single-channel PCM with a sampling rate of 16 kHz

Our streaming transcription service requires audio encoded as single-channel PCM at 16 kHz, 16-bit signed little-endian. When using the AudioWorkletProcessor shown above, this encoding is handled automatically inside the worklet thread — the pcm16 value delivered in onmessage is already in the correct format.

If you are working in an environment where AudioWorklet is not available (e.g., certain server-side or non-browser runtimes), the same transformation can be performed on the main thread using these helper functions:

// Resample Float32 buffer from any sample rate to 16 kHz
function downsampleBuffer(
  buffer: Float32Array,
  inputSampleRate: number,
  targetSampleRate = 16000
): Float32Array {
  if (targetSampleRate === inputSampleRate) return buffer;
  const ratio = inputSampleRate / targetSampleRate;
  const outLength = Math.round(buffer.length / ratio);
  const result = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), buffer.length);
    let sum = 0;
    for (let j = start; j < end; j++) sum += buffer[j];
    result[i] = sum / (end - start);
  }
  return result;
}

// Convert Float32 samples → PCM16 ArrayBuffer
function floatTo16BitPCM(float32Array: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Array.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
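As a quick sanity check of these helpers, 100 ms of audio captured at 48 kHz should yield 1,600 output samples and 3,200 PCM bytes. The functions are re-declared here so the snippet runs standalone:

```typescript
// Re-declared from the guide so the snippet is self-contained
function downsampleBuffer(
  buffer: Float32Array,
  inputSampleRate: number,
  targetSampleRate = 16000
): Float32Array {
  if (targetSampleRate === inputSampleRate) return buffer;
  const ratio = inputSampleRate / targetSampleRate;
  const outLength = Math.round(buffer.length / ratio);
  const result = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), buffer.length);
    let sum = 0;
    for (let j = start; j < end; j++) sum += buffer[j];
    result[i] = sum / (end - start);
  }
  return result;
}

function floatTo16BitPCM(float32Array: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Array.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// 100 ms of audio captured at 48 kHz → 4800 input samples
const captured = new Float32Array(4800);
const down = downsampleBuffer(captured, 48000); // 1600 samples @ 16 kHz
const pcm = floatTo16BitPCM(down);              // 3200 bytes (2 bytes/sample)
```

Note that the averaging downsampler is a simple low-pass approximation; for higher fidelity you could apply a proper anti-aliasing filter before decimation.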

The AudioWorklet emits small chunks of approximately 85 ms each. Sending every tiny chunk as a separate WebSocket message is inefficient — it increases protocol overhead and gives the transcription engine less audio context per segment. We recommend accumulating at least 0.5 seconds of PCM audio before sending each message.

The threshold in bytes for 0.5 s of PCM16 at 16 kHz mono:

TARGET_BYTES = 16000 samples/s × 2 bytes/sample × 0.5 s = 16 000 bytes
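For reference, the same arithmetic as a small helper (the function name is ours):

```typescript
// Bytes of PCM16 audio for a given duration; defaults match the service format
function pcmBytes(seconds: number, sampleRate = 16000, bytesPerSample = 2): number {
  return sampleRate * bytesPerSample * seconds;
}

pcmBytes(0.5);   // → 16000 bytes, the recommended send threshold
pcmBytes(0.085); // ≈ 2720 bytes, one ~85 ms worklet chunk
```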

The following accumulator can be placed between the worklet's onmessage handler and your WebSocket send call:

// 0.5 s of PCM16 @ 16 kHz mono = 16 000 bytes
const TARGET_PCM_BYTES = 16000 * 2 * 0.5;

let pcmBuffer: ArrayBuffer[] = [];
let pcmBytesCollected = 0;

function onPcmChunk(chunk: ArrayBuffer): void {
  pcmBuffer.push(chunk);
  pcmBytesCollected += chunk.byteLength;

  if (pcmBytesCollected >= TARGET_PCM_BYTES) {
    flushBuffer();
  }
}

function flushBuffer(): ArrayBuffer | null {
  if (pcmBuffer.length === 0) return null;

  // Combine all buffered chunks into one contiguous ArrayBuffer
  const combined = new Uint8Array(pcmBytesCollected);
  let offset = 0;
  for (const buf of pcmBuffer) {
    combined.set(new Uint8Array(buf), offset);
    offset += buf.byteLength;
  }

  pcmBuffer = [];
  pcmBytesCollected = 0;

  return combined.buffer;
}

Wire it into the worklet's message handler:

workletNode.port.onmessage = (event) => {
  const { pcm16 } = event.data;
  onPcmChunk(pcm16);
};

And in flushBuffer, once you have a combined buffer, encode it and send it through the WebSocket (see the next sections for how to send data).

Interact with our service using web sockets

Once the audio has been captured and encoded, it is ready to be sent for transcription through the WebSocket endpoint we expose.

The WebSocket connection URL is the following: wss://live.invoxmedical.com

Web socket initialization

To create a new connection via WebSocket, the following parameters are needed:

  • Web socket URL
  • accessToken: authorization token for the request
  • sessionId: Unique identifier for the transcription session. The sessionId must be unique for each transcription session and can be used to create a new transcription session or continue an existing one. This field must be a UUID.
  • callbackUrl: Public URL where we'll send the completed transcript. This field isn't required; if you don't send it, you'll need to use another method to retrieve the transcript.
  • version: The version of the API to use for the request. This is important for ensuring compatibility with the service. This parameter is optional; if not provided, the latest version will be used.

Below is an example using JavaScript/TypeScript of how a WebSocket is initialized:

If you don't remember how to generate an accessToken, you can go to the following section: Credentials

If the web socket can be opened successfully, you can immediately begin sending audio data for further transcription.

Remember that when a WebSocket connection is opened successfully, a response will be received indicating that the connection has been established. You can use the following code to handle it:

ws.onopen = () => {
  console.log("✅ Connected to Web Socket server");
};

If an error occurs while opening the web socket, it can be handled as follows:

ws.onerror = (error) => {
  console.error("❌ Web Socket error:", error);
};

Remember that to consume this service, the API Key must have the permission: REAL-TIME-TRANSCRIBE

Send data to web socket

Once we have the web socket open, we can proceed to send data. The structure the web socket expects is as follows:
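In outline (the field names are taken from the code samples in this guide; treat this as a sketch rather than a formal schema):

```typescript
type StreamingMessage = {
  // Base64-encoded PCM16 @ 16 kHz mono; required when messageType is 'AUDIO'
  message?: string;
  // 'AUDIO' streams audio, 'PING' keeps the session alive,
  // 'END_TRANSCRIPTION' finishes the session
  messageType: 'AUDIO' | 'PING' | 'END_TRANSCRIPTION';
  // The same UUID used when opening the connection
  sessionId: string;
  // Optional API version; the latest is used when omitted
  version?: string;
};

const example: StreamingMessage = {
  message: 'AAAA', // base64 audio payload (placeholder)
  messageType: 'AUDIO',
  sessionId: '123e4567-e89b-12d3-a456-426614174000',
};
```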

If you don't send version in the payload, the latest version will be used.

The maximum payload size that can be sent in a single WebSocket message is 32 KB. Ensure your audio chunks are properly sized to avoid exceeding this limit.

We recommend sending audio chunks of 0.5 seconds for optimal transcription quality and processing performance.

Message attribute

Refers to the Base64 representation of the audio fragment received by the microphone. It is used when the messageType attribute is Audio.

An example of how to convert the microphone audio to Base64, after it has been transformed into single-channel PCM at a 16 kHz sampling rate, would be:

function arrayBufferToBase64(buffer) {
  // Create an 8-bit array
  const uint8Array = new Uint8Array(buffer);

  // Convert array to base64 string
  let binary = "";
  for (let i = 0; i < uint8Array.length; i++) {
    binary += String.fromCharCode(uint8Array[i]);
  }

  return btoa(binary);
}
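A quick self-check of the helper: the two bytes of the ASCII string "Hi" encode to "SGk=". The function is re-declared so the snippet runs standalone:

```typescript
// Re-declared from the guide so the snippet is self-contained
function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

const hi = new Uint8Array([72, 105]).buffer; // the bytes of "Hi"
arrayBufferToBase64(hi); // → "SGk="
```

Since each flush is only about 16 000 bytes of PCM (0.5 s), the simple byte-by-byte loop is fast enough here.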

If this type of message is sent, the expected response would be:

type AudioMessageResponse = {
  audioReceived: boolean; // Must be true
  duration: number; // duration related to the received audio
};
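One way to use these acknowledgments is to keep a running total of the audio the server has confirmed. The helper below is our own illustration, not part of the API (duration is marked optional here so the same handler can also receive PING acknowledgments):

```typescript
type AckResponse = { audioReceived: boolean; duration?: number };

// Tracks how much acknowledged audio the server has received (illustrative)
function makeAckTracker() {
  let totalDuration = 0;
  return {
    handle(res: AckResponse): void {
      if (res.audioReceived && typeof res.duration === 'number') {
        totalDuration += res.duration;
      }
    },
    confirmedDuration: () => totalDuration,
  };
}
```

In ws.onmessage, pass the parsed payload to tracker.handle(...).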

MessageType attribute

The purpose of the AUDIO message is clear: it sends audio data to be transcribed (the final transcript is produced once the Web Socket session is closed).

PING messages are used to prevent the Web Socket from being closed due to inactivity. For example:

If we are recording a consultation and it is paused partway through, the Web Socket should not be closed. While the recording is paused, we recommend sending this type of message every 10–20 seconds, telling the server to keep the session open.

If this type of message is sent, the expected response would be:

type PingMessageResponse = {
  audioReceived: boolean; // Must be false
  message: string;
};
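The pause behavior described above can be sketched as a small keep-alive timer. The helper names and the 15-second default are our own choices (within the recommended 10–20 s window):

```typescript
function buildPingMessage(sessionId: string): string {
  return JSON.stringify({ messageType: 'PING', sessionId });
}

// Starts pinging and returns a stop function; call it when recording resumes
function startKeepAlive(
  send: (payload: string) => void,
  sessionId: string,
  intervalMs = 15_000 // inside the recommended 10–20 s window
): () => void {
  const timer = setInterval(() => send(buildPingMessage(sessionId)), intervalMs);
  return () => clearInterval(timer);
}
```

While paused: `const stopPings = startKeepAlive((p) => ws.send(p), SESSION_ID);` and on resume, call `stopPings()`.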

Capturing and sending audio

Continuing the example developed through the previous sections, we can bring everything together as follows.

This example combines all the techniques described above:

  • AudioWorklet for modern, non-blocking audio capture (PCM16 @ 16 kHz)
  • 0.5 s audio accumulation before each WebSocket send
  • In-memory send queue so chunks captured while the socket is not yet open are held and drained automatically once connected
  • Online/offline detection to pause sending during network loss and resume when connectivity returns
  • Exponential backoff reconnection so temporary outages are handled gracefully without manual retries

// ─── Helpers ──────────────────────────────────────────────────────────────────

function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

function generateSessionId(): string {
  return crypto.randomUUID();
}

function jitter(ms: number): number {
  return Math.floor(ms * (Math.random() * 0.4 + 0.8)); // ±20%
}

// ─── WebSocket with reconnection ──────────────────────────────────────────────

const WS_URL       = 'wss://live.invoxmedical.com';
const ACCESS_TOKEN = 'your_access_token';
const CALLBACK_URL = encodeURIComponent('https://example.mydomain.com');
const SESSION_ID   = generateSessionId();

let ws: WebSocket | null = null;
let backoffMs = 500;
const MAX_BACKOFF_MS = 30_000;
let intentionalClose = false;

// In-memory queue: chunks captured before the socket opened, or during reconnect
const sendQueue: string[] = []; // base64-encoded PCM16 payloads
let isOnline = navigator.onLine;

function flushSendQueue(): void {
  while (sendQueue.length > 0 && ws?.readyState === WebSocket.OPEN && isOnline) {
    const payload = sendQueue.shift()!;
    ws.send(payload);
  }
}

function connectWebSocket(): void {
  if (intentionalClose) return;

  ws = new WebSocket(
    `${WS_URL}?callbackUrl=${CALLBACK_URL}&accessToken=${ACCESS_TOKEN}&sessionId=${SESSION_ID}`
  );

  ws.onopen = () => {
    backoffMs = 500; // reset on successful connection
    flushSendQueue();
  };

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    if (data.isTranscriptionCompleted === true) {
      // Transcription is ready — retrieve it via the Get Transcription endpoint
      console.log('Transcription complete', data);
    }
  };

  ws.onclose = () => {
    if (intentionalClose) return;
    // Reconnect with exponential backoff
    setTimeout(() => {
      backoffMs = Math.min(MAX_BACKOFF_MS, backoffMs * 2);
      connectWebSocket();
    }, jitter(backoffMs));
  };

  ws.onerror = () => {
    ws?.close(); // triggers onclose → reconnect loop
  };
}

// Pause/resume queue processing based on network availability
window.addEventListener('online',  () => { isOnline = true;  flushSendQueue(); });
window.addEventListener('offline', () => { isOnline = false; });

// ─── Audio accumulator (0.5 s threshold) ─────────────────────────────────────

const TARGET_PCM_BYTES = 16000 * 2 * 0.5; // 0.5 s of PCM16 @ 16 kHz mono
let pcmBuffer: ArrayBuffer[] = [];
let pcmBytesCollected = 0;

function onPcmChunk(chunk: ArrayBuffer): void {
  pcmBuffer.push(chunk);
  pcmBytesCollected += chunk.byteLength;

  if (pcmBytesCollected >= TARGET_PCM_BYTES) {
    flushAudioBuffer();
  }
}

function flushAudioBuffer(): void {
  if (pcmBuffer.length === 0) return;

  const combined = new Uint8Array(pcmBytesCollected);
  let offset = 0;
  for (const buf of pcmBuffer) {
    combined.set(new Uint8Array(buf), offset);
    offset += buf.byteLength;
  }
  pcmBuffer = [];
  pcmBytesCollected = 0;

  const payload = JSON.stringify({
    message: arrayBufferToBase64(combined.buffer),
    messageType: 'AUDIO',
    sessionId: SESSION_ID,
  });

  if (ws?.readyState === WebSocket.OPEN && isOnline) {
    ws.send(payload);
  } else {
    // Queue the chunk — it will be sent once the socket reconnects
    sendQueue.push(payload);
  }
}

// ─── AudioWorklet capture ─────────────────────────────────────────────────────

const processorCode = `
class AudioProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.inputSampleRate = sampleRate; // AudioWorkletGlobalScope global
    this.targetSampleRate = 16000;
    this.targetChunkSamples = Math.floor(this.targetSampleRate * 0.085);
    this.samplesNeeded = Math.floor(
      this.targetChunkSamples * (this.inputSampleRate / this.targetSampleRate)
    );
    this.bufferSize = this.inputSampleRate * 2;
    this.buffer = new Float32Array(this.bufferSize);
    this.writeIndex = 0;
  }
  downsample(length) {
    const ratio = this.inputSampleRate / this.targetSampleRate;
    const outLen = Math.round(length / ratio);
    const result = new Float32Array(outLen);
    for (let i = 0; i < outLen; i++) {
      const start = Math.floor(i * ratio);
      const end = Math.min(Math.floor((i + 1) * ratio), length);
      let sum = 0;
      for (let j = start; j < end; j++) sum += this.buffer[j];
      result[i] = sum / (end - start);
    }
    return result;
  }
  floatTo16BitPCM(input) {
    const buf = new ArrayBuffer(input.length * 2);
    const view = new DataView(buf);
    for (let i = 0; i < input.length; i++) {
      const s = Math.max(-1, Math.min(1, input[i]));
      view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }
    return buf;
  }
  process(inputs) {
    const input = inputs[0];
    if (!input || !input[0]) return true;
    const data = input[0];
    if (this.writeIndex + data.length <= this.bufferSize) {
      this.buffer.set(data, this.writeIndex);
      this.writeIndex += data.length;
    } else {
      this.writeIndex = 0;
      return true;
    }
    while (this.writeIndex >= this.samplesNeeded) {
      const downsampled = this.downsample(this.samplesNeeded);
      const pcm16 = this.floatTo16BitPCM(downsampled);
      this.buffer.copyWithin(0, this.samplesNeeded, this.writeIndex);
      this.writeIndex -= this.samplesNeeded;
      this.port.postMessage({ pcm16, samples: downsampled.length }, [pcm16]);
    }
    return true;
  }
}
registerProcessor('audio-processor', AudioProcessor);
`;

const blob = new Blob([processorCode], { type: 'application/javascript' });
const processorUrl = URL.createObjectURL(blob);

const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();

await audioContext.audioWorklet.addModule(processorUrl);
URL.revokeObjectURL(processorUrl);

const source = audioContext.createMediaStreamSource(mediaStream);
const workletNode = new AudioWorkletNode(audioContext, 'audio-processor');
source.connect(workletNode);
workletNode.connect(audioContext.destination);

workletNode.port.onmessage = (event) => {
  const { pcm16 } = event.data;
  onPcmChunk(pcm16);
};

// ─── Start ────────────────────────────────────────────────────────────────────

// Open WebSocket immediately — chunks queued before it opens will be flushed in onopen
connectWebSocket();

// To stop recording and trigger transcription:
// 1. Flush any remaining buffered audio
// 2. Send END_TRANSCRIPTION
// 3. Mark the socket as intentionally closing

function stopRecording(): void {
  intentionalClose = true;

  // Flush any sub-threshold audio still in the accumulator
  flushAudioBuffer();

  // Tell the service no more audio is coming
  ws?.send(JSON.stringify({
    messageType: 'END_TRANSCRIPTION',
    sessionId: SESSION_ID,
  }));
}

For production use, consider adding acknowledgment handling, per-chunk retries, and a circuit breaker to protect against sustained service outages. See Audio Protection for a complete reference implementation.

Transcription response

When you receive the transcription result, the structure will be as follows:

type TranscriptionFinished = {
  transcriptionStatus: "SUCCESS" | "FAILED";
  errorMessage?: string;
  connectionId: string;
  signature: string;
  transcription?: ITranscriptionSegment[];
  vttTranscription?: string;
};

interface ITranscriptionSegment {
  speakerId: string;
  text: string;
  start: number;
  end: number;
  speakerName: string;
}

Important note: The endpoint you need to prepare to receive the transcription result must be of type POST and must be able to accept requests from our source.

Security Warning: To verify that the request genuinely comes from Invox Medical, you must validate the signature.

Using Node.js, here is an example of how to decrypt the signature and validate it:

import * as CryptoJS from "crypto-js";

export const validateSignature = (signature: string): boolean => {
  try {
    const secret = process.env.APP_SECRET;
    const appId = process.env.APP_ID;
    const apiKey = process.env.API_KEY;
    const expectedValue = `${appId}~${apiKey}`;

    const bytes = CryptoJS.AES.decrypt(signature, secret);
    const originalText = bytes.toString(CryptoJS.enc.Utf8);

    if (!originalText) {
      return false;
    }
    return originalText === expectedValue;
  } catch (error) {
    return false;
  }
};
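To tie the callback together, the handling can be reduced to a pure function that your POST endpoint delegates to. validateSignature is injected as a parameter so the sketch stays framework-agnostic; the function name and return convention here are our own:

```typescript
interface ITranscriptionSegment {
  speakerId: string;
  text: string;
  start: number;
  end: number;
  speakerName: string;
}

type TranscriptionFinished = {
  transcriptionStatus: 'SUCCESS' | 'FAILED';
  errorMessage?: string;
  connectionId: string;
  signature: string;
  transcription?: ITranscriptionSegment[];
  vttTranscription?: string;
};

// Returns the segments on success, or null when the request must be rejected
function processCallback(
  body: TranscriptionFinished,
  validateSignature: (signature: string) => boolean
): ITranscriptionSegment[] | null {
  if (!validateSignature(body.signature)) return null; // not from Invox Medical
  if (body.transcriptionStatus !== 'SUCCESS') return null; // see body.errorMessage
  return body.transcription ?? [];
}
```

A rejected signature should translate into an HTTP error response (for example 401) from your endpoint.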