Streaming transcription
Invox Medical offers the ability to transcribe audio captured directly from the microphone. To use this service, the following tasks must be performed on the client side:
- Capture audio from a microphone
- Encode audio to single-channel PCM with a sampling rate of 16 kHz
- Interact with our service using web sockets
The sections below walk through each of these steps to make the integration process easier.
Capture audio from a microphone
Audio capture relies on an input device acquiring sound from its environment and turning it into data that applications can process. There are many ways to do this, so each integrator's environment will look different.
Below is an example of how this process can be carried out in JavaScript/TypeScript:
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(mediaStream);
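// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet,
// but it is used here for simplicity and broad browser support.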
const processor = audioContext.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);
processor.onaudioprocess = (event) => {
// get audio from microphone
const inputData = event.inputBuffer.getChannelData(0);
// continue with the logic
};
Encode audio to single-channel PCM with a sampling rate of 16 kHz
Our streaming transcription service needs to receive audio encoded under certain standards: single-channel PCM with a sampling frequency of 16 kHz.
When audio is captured directly from the microphone, it isn't encoded as required, so we'd have to perform a transformation on the received information. Below, we'll show some examples in JavaScript/Typescript that perform this transformation:
// Resample to 16 kHz
function downsampleBuffer(buffer, inputSampleRate, targetSampleRate = 16000) {
if (targetSampleRate === inputSampleRate) return buffer;
const sampleRateRatio = inputSampleRate / targetSampleRate;
const newLength = Math.round(buffer.length / sampleRateRatio);
const result = new Float32Array(newLength);
let offsetResult = 0;
let offsetBuffer = 0;
while (offsetResult < result.length) {
const nextOffsetBuffer = Math.round((offsetResult + 1) * sampleRateRatio);
    let accum = 0;
    let count = 0;
    for (let i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
      accum += buffer[i];
      count++;
    }
    // Guard against an empty window at the tail of the buffer
    result[offsetResult] = count > 0 ? accum / count : 0;
offsetResult++;
offsetBuffer = nextOffsetBuffer;
}
return result;
}
// Convert Float32 → PCM16
function floatTo16BitPCM(float32Array) {
const buffer = new ArrayBuffer(float32Array.length * 2);
const view = new DataView(buffer);
let offset = 0;
for (let i = 0; i < float32Array.length; i++, offset += 2) {
let s = Math.max(-1, Math.min(1, float32Array[i]));
view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
}
return buffer;
}
Using the two functions above, we can extend the audio-capture code so that everything sent to the transcription service has the expected encoding and sampling frequency.
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(mediaStream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);
processor.onaudioprocess = (event) => {
// get audio from microphone
const inputData = event.inputBuffer.getChannelData(0);
const downsampled = downsampleBuffer(
inputData,
audioContext.sampleRate,
16000
);
const pcm16 = floatTo16BitPCM(downsampled);
// continue with the logic
};
Interact with our service using web sockets
Once the audio is captured and encoded, it is time to send it to the WebSocket endpoint we expose to complete the transcription process.
The connection URL to our websocket will be the following: wss://live.invoxmedical.com
Web socket initialization
To create a new connection via websocket we need three parameters:
- Web socket URL
- accessToken: authorization token for the request
- callbackUrl: public URL to which we will send the transcript once completed (it must be URL-encoded when passed as a query parameter)
Below is a JavaScript/TypeScript example of how the WebSocket is initialized:
const webSocketUrl = "wss://live.invoxmedical.com";
const accessToken = "mockAccessToken";
const callbackUrl = encodeURIComponent("https://example.mydomain.com");
const ws = new WebSocket(
`${webSocketUrl}?callbackUrl=${callbackUrl}&accessToken=${accessToken}`
);
If you need a reminder of how to generate an accessToken, see the following section: Credentials
If the web socket can be opened successfully, the response structure received will be as follows:
Successful request (200)
Response example:
{
"message": "Connection established.",
"connectionId": "mock-connectionId"
}
type OpenWebSocketType = {
message: string,
connectionId: string
}
Important: It is recommended to store the connectionId, since this value is sent along with the completed transcription and identifies the recording it belongs to.
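For example, the connectionId can be captured from the first message received after the socket opens (the storage variable here is our own):
let connectionId;

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  // The connection message carries the connectionId; keep it so the
  // callback payload can later be matched to this recording
  if (data.connectionId) {
    connectionId = data.connectionId;
  }
};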
If we are unable to open a websocket connection, we will get one of the following errors:
Unauthorized request (401)
Returned when the request is not authorized.
Response body:
type TranscribeStreamingAudioResponseType = {
message: string
messageType: 'invalidAccessToken'
}
Forbidden request (403)
Returned when the API Key is not allowed to consume this service.
Response body:
type TranscribeStreamingAudioResponseType = {
message: string
errorType: 'insuficientPermissions'
}
Bad request (400)
Returned when the connection request is malformed.
Response body:
type TranscribeStreamingAudioResponseType = {
  message: string
  errorType: OPEN_WS_BAD_REQUEST_ERROR
}
enum OPEN_WS_BAD_REQUEST_ERROR {
  MISSING_CALLBACK_URL = 'missingCallbackUrl',
  INVALID_INPUT_PARAMETER = 'invalidInputParameter',
}
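As a sketch of how these errors could be handled on the client, assuming the error body is delivered over the socket before it closes (the routing logic here is our own):
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.errorType) {
    case 'insuficientPermissions':
      // the API Key cannot use the streaming service
      break;
    case 'missingCallbackUrl':
    case 'invalidInputParameter':
      // fix the query parameters and open a new connection
      break;
    default:
      // not an error payload; handle normally
      break;
  }
};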
Send data to web socket
Once we have the web socket open, we can proceed to send data. The structure the web socket expects is as follows:
type WebSocketPayload = {
  message: string
  messageType: 'AUDIO' | 'PING'
}
Below we will explain each field:
Message attribute
Refers to the Base64 representation of the audio fragment captured by the microphone. It is used when the messageType attribute is AUDIO.
An example of how to convert the audio captured by the microphone to Base64, after it has been transformed into single-channel PCM with a sampling frequency of 16 kHz, would be:
function arrayBufferToBase64(buffer) {
// Create an 8-bit array
const uint8Array = new Uint8Array(buffer);
// Convert array to base64 string
let binary = "";
for (let i = 0; i < uint8Array.length; i++) {
binary += String.fromCharCode(uint8Array[i]);
}
return btoa(binary);
}
If this type of message is sent, the expected response would be:
type AudioMessageResponse = {
audioReceived: boolean, // Must be true
duration: number // duration related to the received audio
}
MessageType attribute
It tells us the type of message we are sending:
- Audio message
- Message to keep the service alive
The purpose of the AUDIO message is clear: it sends audio data to be transcribed (the transcription itself is produced once the Web Socket is closed).
PING messages are used to prevent the Web Socket from being closed due to inactivity. For example, if a consultation recording is paused partway through, the Web Socket should not be closed; while the recording remains paused, it is recommended to send this type of message every 10 to 20 seconds, telling the server to keep the session open. A minimal keep-alive sketch follows the expected response below.
If this type of message is sent, the expected response would be:
type PingMessageResponse = {
audioReceived: boolean, // Must be false
message: string
}
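A minimal keep-alive sketch while a recording is paused; the content of the message field for PING messages is not specified above, so an empty string is assumed:
let pingInterval = null;

function onPause() {
  // Send a PING every 15 seconds (within the 10-20 second range recommended above)
  pingInterval = setInterval(() => {
    ws.send(JSON.stringify({ message: "", messageType: "PING" }));
  }, 15000);
}

function onResume() {
  clearInterval(pingInterval);
  pingInterval = null;
}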
Capturing and sending audio
Putting together each of the previous sections, the complete example looks like this:
const webSocketUrl = "wss://live.invoxmedical.com";
const accessToken = "mockAccessToken";
const callbackUrl = encodeURIComponent("https://example.mydomain.com");
const ws = new WebSocket(
`${webSocketUrl}?callbackUrl=${callbackUrl}&accessToken=${accessToken}`
);
ws.onopen = () => {
console.log("✅ Connected to Web Socket server");
};
ws.onclose = () => {
console.log("❌ Closed");
};
ws.onerror = (err) => {
console.error("⚠️ WebSocket error:", err);
};
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log("🏓 Server response:", data);
};
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(mediaStream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);
processor.onaudioprocess = (event) => {
// get audio from microphone
const inputData = event.inputBuffer.getChannelData(0);
const downsampled = downsampleBuffer(
inputData,
audioContext.sampleRate,
16000
);
const pcm16 = floatTo16BitPCM(downsampled);
ws.send(
JSON.stringify({
message: arrayBufferToBase64(pcm16),
messageType: "AUDIO",
})
);
};
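When the recording ends, stop capturing audio and close the socket so the platform can finish the transcription; a minimal cleanup sketch, using the variables from the example above:
function stopRecording() {
  processor.disconnect();
  source.disconnect();
  mediaStream.getTracks().forEach((track) => track.stop());
  audioContext.close();
  // Closing the Web Socket tells the platform that no more audio will be sent
  ws.close();
}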
Transcription response
Once the web socket is closed, the platform assumes that no more audio will be sent for the current recording. At that point, the system completes the transcription process and sends the result to the callbackUrl provided when the web socket was opened. The message sent will have the following structure:
type TranscriptionFinished = {
transcriptionStatus: 'SUCCESS' | 'FAILED'
errorMessage?: string
connectionId: string
signature: string,
transcription?: ITranscriptionSegment[]
}
interface ITranscriptionSegment {
speakerId: string
text: string
start: number
end: number
speakerName: string
}
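An illustrative example of the callback body (all values are placeholders):
{
  "transcriptionStatus": "SUCCESS",
  "connectionId": "mock-connectionId",
  "signature": "mockSignature",
  "transcription": [
    {
      "speakerId": "spk_0",
      "text": "Good morning, how are you feeling today?",
      "start": 0,
      "end": 2.4,
      "speakerName": "Speaker 1"
    }
  ]
}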
Important note: The endpoint you prepare to receive the transcription result must accept POST requests and must allow requests from our origin.
Security Warning: To verify that the request genuinely comes from Invox Medical, you must validate the signature.
Using Node.js, an example of how to decrypt the signature and validate it would be:
import * as CryptoJS from 'crypto-js';
export const validateSignature = (signature: string): boolean => {
try {
    const secret = process.env.APP_SECRET;
    const appId = process.env.APP_ID;
    const apiKey = process.env.API_KEY;
    // All three credentials are required to rebuild the expected value
    if (!secret || !appId || !apiKey) {
      return false;
    }
    const expectedValue = `${appId}~${apiKey}`;
    // The signature is the AES-encrypted "<appId>~<apiKey>" pair
    const bytes = CryptoJS.AES.decrypt(signature, secret);
    const originalText = bytes.toString(CryptoJS.enc.Utf8);
    if (!originalText) {
      return false;
    }
    return originalText === expectedValue;
} catch (error) {
return false
}
}
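For completeness, here is a minimal sketch of the callback endpoint itself, assuming an Express server (the route path and port are our own choices, not requirements):
import express from 'express';
import { validateSignature } from './validate-signature';

const app = express();
app.use(express.json());

app.post('/transcription-callback', (req, res) => {
  const payload = req.body; // TranscriptionFinished
  // Reject requests whose signature cannot be verified
  if (!validateSignature(payload.signature)) {
    return res.status(401).send();
  }
  if (payload.transcriptionStatus === 'SUCCESS') {
    // Match payload.connectionId to the stored recording and persist payload.transcription
  }
  return res.status(200).send();
});

app.listen(3000);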