I am trying to optimize the performance of a Node.js function that generates audio segments (for example, via OpenAI's TTS API) and then concatenates the resulting audio buffers. My goal is to speed up the reading of each audio response into a buffer and the final concatenation of the audio buffers.
Current Approach
Naively, I assumed it would be as simple as requesting all of the audio segments, reading each one into its own buffer, and joining them. Here is a simplified version of what I am doing now:
```javascript
const createAudioSegment = async (text) => {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: "echo",
    input: text,
  });
  return response;
};

const audio_texts = ["text 1", "text 2"]; // list of text to turn into audio

const segmentsTimeStart = new Date().getTime();
// Process dialogue segments in parallel
const audioSegments = await Promise.all(
  audio_texts.map((text) => createAudioSegment(text))
);
const segmentsTimeEnd = new Date().getTime();
const segmentsTimeDiff = segmentsTimeEnd - segmentsTimeStart;
console.log(`Total Audio Segment Time: ${segmentsTimeDiff}ms`);

const audioBufferReadStart = new Date().getTime();
let responseTimes = [];
const audioBuffers = await Promise.all(
  audioSegments.map(async (segment) => {
    const responseStartTime = new Date().getTime();
    const arrayBuffer = await segment.arrayBuffer();
    const responseTimeDiff = responseEndTime - responseStartTime;
    responseTimes.push(responseTimeDiff);
    return Buffer.from(arrayBuffer);
  })
);
const audioBufferReadEnd = new Date().getTime();
console.log(`Audio Buffer Total Read Time: ${audioBufferReadEnd - audioBufferReadStart}ms`);
console.log(`Audio Buffer Individual Read Times (ms): ${responseTimes}`);

const concatStartTime = new Date().getTime();
const finalBuffer = Buffer.concat(audioBuffers);
const concatEndTime = new Date().getTime();
console.log(`Buffer Concatenation Time: ${concatEndTime - concatStartTime}ms`);
```
Performance Issues
Overall, this usually takes 40-60 seconds. However, when logging the individual operations, I noticed that reading each audio segment into its individual buffer takes the majority of the time. As an example, for a scenario with 12 audio segments, I see the following timing:
- Total Audio Segment Time: 2333 ms
- Audio Buffer Total Read Time: 27455 ms
- Audio Buffer Individual Read Times (ms): [1035,3420,6497,96,150,360,70,88,20344,32,83,254]
- Buffer Concatenation Time: 2 ms
It takes over 20 seconds to read all of the responses into their buffers. Why are some under 1 second and some over 20 seconds? I don't understand.
Question
What could be causing such a large discrepancy between the individual read times? Is this avoidable with a more efficient approach?
Answer
Note: This is not the final answer, but my attempt to narrow down the problem.
I have noticed a few issues with your code, the first being:
Issue 1
```javascript
const audioSegments = await Promise.all(
  audio_texts.map((text) => createAudioSegment(text))
);
```
Notice that you are passing a string (`text`) to `createAudioSegment`, but the definition expects an array of strings:
```javascript
const createAudioSegment = async ([text]) => {
```
Strings are iterable, so the destructuring grabs only the very first character (just `t`), which is all you were sending to OpenAI.
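This pitfall can be reproduced without any API call; array destructuring in a parameter list consumes the argument's iterator, and a string's iterator yields characters:

```javascript
// Strings are iterable, so an array pattern in the parameter list
// pulls off only the first character of a string argument.
const firstCharOnly = ([text]) => text;

console.log(firstCharOnly("hello")); // "h", not "hello"
```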
Issue 2
```javascript
const responseTimeDiff = responseEndTime - responseStartTime
```
`responseEndTime` is never defined.
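A corrected timing pattern captures the end timestamp after the awaited operation. This is a minimal, self-contained sketch (`timeAsync` is a hypothetical helper, and a `setTimeout` stands in for the network read):

```javascript
// Hypothetical helper: time any async operation by capturing both
// timestamps around the await, so the end time actually exists.
const timeAsync = async (fn) => {
  const responseStartTime = Date.now();
  const result = await fn();
  const responseEndTime = Date.now(); // defined here, after the await
  return { result, ms: responseEndTime - responseStartTime };
};

// Usage with a stand-in async operation instead of segment.arrayBuffer():
timeAsync(() => new Promise((resolve) => setTimeout(() => resolve("done"), 50)))
  .then(({ result, ms }) => console.log(result, ms));
```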
After simplifying, you get this:
Final solution
```javascript
import OpenAI from 'openai'

const openai = new OpenAI()

/** @param {string} text */
const createAudioSegment = (text) =>
  openai.audio.speech.create({
    model: 'tts-1',
    voice: 'echo',
    input: text,
  })

const audio_texts = ['I am text 1', 'I am text 2']

console.time('audioSegments')
const audioSegments = await Promise.all(audio_texts.map((text) => createAudioSegment(text)))
console.timeEnd('audioSegments')

console.time('audioArrayBuffers')
const audioArrayBuffers = await Promise.all(audioSegments.map((segment) => segment.arrayBuffer()))
console.timeEnd('audioArrayBuffers')

console.time('buffers')
const buffers = audioArrayBuffers.map((ab) => Buffer.from(ab))
console.timeEnd('buffers')

console.time('finalBuffer')
const finalBuffer = Buffer.concat(buffers)
console.timeEnd('finalBuffer')

console.log(finalBuffer.length, finalBuffer.byteLength)
```
Now run this code as-is and share the whole terminal output here. (I don't have access to OpenAI, so I could not test it.)
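Beyond the cleanup, one way to collapse the separate "read" phase entirely is to download each body inside the same promise that made the request, so the network cost stays attached to its request. This is only a sketch: `fakeClient` is a hypothetical stand-in so it runs without credentials (swap in the real `openai` client), and it assumes the SDK response exposes `arrayBuffer()` as in the question.

```javascript
// Stand-in client; `fakeClient` is hypothetical and simply echoes the
// input text back as bytes so the sketch runs without an API key.
const fakeClient = {
  audio: {
    speech: {
      create: async ({ input }) => ({
        arrayBuffer: async () => new TextEncoder().encode(input).buffer,
      }),
    },
  },
};

// Request and read in one step: arrayBuffer() waits for the full body,
// so each mapped promise resolves with a finished Buffer.
const createAudioBuffer = async (client, text) => {
  const response = await client.audio.speech.create({
    model: "tts-1",
    voice: "echo",
    input: text,
  });
  return Buffer.from(await response.arrayBuffer());
};

// Usage: concatenation happens once every download has completed.
Promise.all(["text 1", "text 2"].map((t) => createAudioBuffer(fakeClient, t)))
  .then((buffers) => console.log(Buffer.concat(buffers).toString())); // "text 1text 2"
```

With the real client this does not make the downloads faster, but it removes the idle gap in which finished responses sit unread while `Promise.all` waits for the request phase to end.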
Comments:
- James (Mar 13 at 18:05): In `const createAudioSegment = async ([i, segment])` the parameter is an array pattern, but when you call it with `createAudioSegment(text)` you are passing a string. Please fix/clarify.
- AKX (Mar 18 at 15:33): `response` there is just an HTTP response; you're essentially making `N` parallel calls to OpenAI, so sure, some of them might finish quicker, others more slowly.