I'm new to Swift and to native development in general. I'm working on an application that captures application and microphone audio, formats it to PCM16, and sends it to another service for processing. I'm having trouble combining the application and microphone audio; if you have experience with this, any advice would help.
Here is the code that processes the audio data:
import AVFoundation
import ScreenCaptureKit

class AudioStreamOutputHandler: NSObject, SCStreamOutput {
    private let audioEngine = AVAudioEngine()
    let deepgramService: DeepgramService

    init(deepgramService: DeepgramService) {
        self.deepgramService = deepgramService
        super.init()
    }

    nonisolated func stream(_ stream: SCStream, didStopWithError error: Error) {
        print("SCStream stopped with error: \(error.localizedDescription)")
    }

    // Called when a sample buffer is received.
    nonisolated func stream(
        _ stream: SCStream,
        didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
        of type: SCStreamOutputType
    ) {
        switch type {
        case .audio:
            guard let pcmBuffer = createPCMBuffer(from: sampleBuffer) else { return }
            if let convertedPCMData = convertTo16BitPCM(from: pcmBuffer) {
                deepgramService.sendAudioData(convertedPCMData)
            }
        case .microphone:
            // Microphone buffers are currently ignored; this is where they would
            // need to be converted and combined with the .audio stream.
            break
        default:
            break
        }
    }

    private func createPCMBuffer(from sampleBuffer: CMSampleBuffer) -> AVAudioPCMBuffer? {
        guard CMSampleBufferIsValid(sampleBuffer),
              let formatDescription = sampleBuffer.formatDescription,
              let absd = formatDescription.audioStreamBasicDescription
        else {
            NSLog("Invalid CMSampleBuffer or missing format description.")
            return nil
        }

        // Copy the AudioBufferList structure. The copied list still points at
        // sample data owned by the CMSampleBuffer, so the resulting PCM buffer
        // is only valid while the sample buffer is alive (which is the case
        // here, since it is consumed synchronously in the callback).
        var audioBufferListCopy: AudioBufferList?
        do {
            try sampleBuffer.withAudioBufferList { audioBufferList, _ in
                audioBufferListCopy = audioBufferList.unsafePointer.pointee
            }
        } catch {
            NSLog("Error accessing AudioBufferList: \(error.localizedDescription)")
            return nil
        }

        guard
            let format = AVAudioFormat(
                standardFormatWithSampleRate: absd.mSampleRate,
                channels: AVAudioChannelCount(absd.mChannelsPerFrame)
            )
        else {
            NSLog("Failed to create AVAudioFormat.")
            return nil
        }

        return AVAudioPCMBuffer(
            pcmFormat: format,
            bufferListNoCopy: &audioBufferListCopy!
        )
    }

    private func convertTo16BitPCM(from buffer: AVAudioPCMBuffer) -> Data? {
        guard let floatChannelData = buffer.floatChannelData else {
            NSLog("Failed to get floatChannelData.")
            return nil
        }
        let frameLength = Int(buffer.frameLength)
        let channelCount = Int(buffer.format.channelCount)
        var pcmData = Data(capacity: frameLength * channelCount * MemoryLayout<Int16>.size)

        // Iterate frame-by-frame so multichannel output comes out interleaved
        // (L R L R ...) rather than planar, which is what 16-bit PCM consumers
        // generally expect.
        for sampleIndex in 0..<frameLength {
            for channel in 0..<channelCount {
                let clamped = max(-1.0, min(1.0, floatChannelData[channel][sampleIndex]))
                let intSample = Int16(clamped * Float(Int16.max))
                pcmData.append(contentsOf: withUnsafeBytes(of: intSample.littleEndian) { Data($0) })
            }
        }
        return pcmData
    }
}
1 Answer
Yes! You can combine .audio and .microphone into a single PCM stream in Swift. You simply mix the two streams together by summing their samples.
With some caveats:
1. the sample timestamps may not line up
2. the stream formats may not match
3. the stream sample rates may not match
4. the sample channel counts may not match
5. the .microphone stream often contains a delayed copy of .audio
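To see which of these apply to your two streams, one quick diagnostic is to log each buffer's format as it arrives. A minimal sketch (the helper name is mine, not part of ScreenCaptureKit):
import AVFoundation
import CoreMedia

// Diagnostic sketch: log the basic stream format of an incoming sample buffer
// so the .audio and .microphone streams can be compared side by side.
func logFormat(of sampleBuffer: CMSampleBuffer, label: String) {
    guard let asbd = sampleBuffer.formatDescription?.audioStreamBasicDescription else {
        print("\(label): no audio format description")
        return
    }
    let isFloat = (asbd.mFormatFlags & kAudioFormatFlagIsFloat) != 0
    print("\(label): \(asbd.mSampleRate) Hz, \(asbd.mChannelsPerFrame) ch, "
        + "\(asbd.mBitsPerChannel)-bit \(isFloat ? "float" : "integer")")
}
Call it from your didOutputSampleBuffer handler with "audio" or "microphone" as the label.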
That's a lot of caveats! Here are some possibilities for dealing with them:
1. you could do your best to line the sample timestamps up, or you could make the simplifying assumption that the most recently arrived CMSampleBuffers are "close enough" to mix together, regardless of their timestamps.
2. you've already got code for converting float to integer samples for .audio, so you can do the same for .microphone. This is slightly odd because I thought ScreenCaptureKit .audio was often integer and .microphone was float. In any case, you can do something similar for .microphone. If you're mixing the samples as integers, take care to avoid overflow (see the mixing sketch after this list).
3. to match sample rates most people reach for an AVAudioConverter, and this can also cover points 2 and 4. Note that ScreenCaptureKit uses CMSampleBuffers instead of AVAudioPCMBuffers, so you'll need to convert. This is per stream, so you should convert to mono to allow mixing, although keep in mind that AVAudioConverter's idea of converting stereo to mono is to discard the right channel. This is probably a fine simplifying assumption (a converter sketch also follows this list).
4. see point 3.
5. When mixing microphone and system audio, if the microphone can hear the speakers then you get a second copy of the system audio in the result. This is called an echo. As far as I know, macOS doesn't have a general echo canceller available to 3rd-party apps, so I use a chunk of the webrtc code.
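For point 2, here's a minimal mixing sketch: sum two already-aligned Int16 buffers, widening to Int32 and clamping so the sum can't overflow. The function name and the assumption that both buffers share the same length, rate and channel layout are mine:
import Foundation

// Mixing sketch: sum two Int16 buffers sample-by-sample, clamping to the
// Int16 range to avoid overflow. Assumes both buffers are already at the
// same sample rate, channel layout and length.
func mixSamples(_ a: [Int16], _ b: [Int16]) -> [Int16] {
    precondition(a.count == b.count, "buffers must be the same length")
    return zip(a, b).map { x, y in
        Int16(clamping: Int32(x) + Int32(y))  // widen, sum, clamp back to Int16
    }
}
If clamping causes audible clipping, you could halve the sum instead of clamping.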
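For points 3 and 4, a rough AVAudioConverter sketch that takes an AVAudioPCMBuffer (for example the one you build in createPCMBuffer) and produces mono 16-bit PCM at a fixed rate. The 16 kHz target and the helper name are illustrative, not anything ScreenCaptureKit or Deepgram requires:
import AVFoundation

// Converter sketch: resample/remix a float PCM buffer to 16 kHz mono
// 16-bit integer PCM in a single AVAudioConverter pass.
func convertToMono16k(_ input: AVAudioPCMBuffer) -> AVAudioPCMBuffer? {
    guard let outputFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                           sampleRate: 16_000,
                                           channels: 1,
                                           interleaved: true),
          let converter = AVAudioConverter(from: input.format, to: outputFormat)
    else { return nil }

    let ratio = outputFormat.sampleRate / input.format.sampleRate
    let capacity = AVAudioFrameCount(Double(input.frameLength) * ratio) + 1
    guard let output = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: capacity)
    else { return nil }

    var fed = false
    var error: NSError?
    let status = converter.convert(to: output, error: &error) { _, outStatus in
        if fed {
            outStatus.pointee = .noDataNow   // single-buffer conversion: feed once
            return nil
        }
        fed = true
        outStatus.pointee = .haveData
        return input
    }
    return (status == .error || error != nil) ? nil : output
}
In a real app you'd keep one converter per stream rather than creating one per buffer, so the converter's internal resampler state carries over between buffers.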
One slight simplification could be not mixing the two streams at all, by using Deepgram's multichannel feature: put a mono version of .audio in the left channel and a mono version of .microphone in the right channel. This plus diarization may be useful to you. It's only a slight simplification because you still need to (maybe) synchronize, rate and format convert, and deal with echoes. A sketch of the interleaving follows.
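Here's what interleaving two already-aligned mono Int16 buffers into 2-channel PCM could look like. The function name is mine, and you'd still need to check Deepgram's docs for the exact multichannel/channel-count parameters on the live endpoint:
import Foundation

// Interleaving sketch: application audio goes to channel 0 (left) and the
// microphone to channel 1 (right), producing interleaved little-endian
// 16-bit PCM. Assumes both mono buffers are aligned and the same length.
func interleaveStereo(system: [Int16], mic: [Int16]) -> Data {
    precondition(system.count == mic.count, "channels must be the same length")
    var interleaved = [Int16]()
    interleaved.reserveCapacity(system.count * 2)
    for i in 0..<system.count {
        interleaved.append(system[i].littleEndian)  // channel 0: application audio
        interleaved.append(mic[i].littleEndian)     // channel 1: microphone
    }
    return interleaved.withUnsafeBytes { Data($0) }
}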
Good luck!