Synchronizing two USB audio streams

To calculate the direction of a sound source we need synchronized audio data. The Kinect has a microphone array with four microphones that are synchronized at the hardware level.

This project uses two separate USB audio adapters, so the synchronization has to be done in software.

1. Collecting data packets

The Windows Audio Session API (WASAPI) provides low-level access to the audio streams of the audio devices.

Image 1 shows a high-level view of the system. The Data Collector class collects audio packets from the audio devices and hands them to the Data Consumer for synchronization and analysis. The devices and the consumer each run on their own thread. The Data Collector is protected by a critical section (shown by the red dashed line in the image).

Image 1

A WASAPI device expects its callback to return quickly when it has audio data available (it calls DataCollector::AddData). The size of the audio data can vary between calls. For that reason the collector simply copies the audio buffer from the device, appends it to a linked list of audio packets, and releases the buffer back to the device.

The WASAPI device provides a time stamp for each audio packet.
The time stamp, or QPCPosition, has a resolution of 100 nanoseconds.
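As a minimal sketch of working with these time stamps: the QPCPosition values that IAudioCaptureClient::GetBuffer reports are expressed in 100-nanosecond units, so converting one to seconds is a single division (the helper name below is ours, not part of the project):

```cpp
#include <cstdint>

// QPCPosition is in 100-nanosecond units, so there are 1e7 ticks
// per second. Hypothetical helper, for illustration only.
double QPCPositionToSeconds(uint64_t u64QPCPosition)
{
    return static_cast<double>(u64QPCPosition) / 1e7;
}
```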

class AudioDataPacket
{
public:
    AudioDataPacket(
            const BYTE* pData, 
            DWORD cbBytes, 
            UINT64 u64QPCPosition, 
            bool bDiscontinuity, 
            bool bSilence) :
        m_pData(NULL),
        m_cbBytes(0),
        m_u64QPCPosition(u64QPCPosition),
        m_bDiscontinuity(bDiscontinuity),
        m_bSilence(bSilence),
        m_next(NULL)
    {
        if (!m_bDiscontinuity && cbBytes > 0)
        {
            m_cbBytes = cbBytes;
            m_pData = new BYTE[cbBytes];

            BYTE* d = m_pData;
            const BYTE* const dend = m_pData + cbBytes;
            while (d != dend)
                *d++ = *pData++;
        }
    }

    ~AudioDataPacket()
    {
        if (m_pData != NULL) delete[] m_pData;
    }

    const BYTE* Data() const { return m_pData; }
    DWORD Bytes() const { return m_cbBytes; }

    UINT64 Position() const { return m_u64QPCPosition; }

    bool Discontinuity() const { return m_bDiscontinuity; }
    bool Silence() const { return m_bSilence; }

    void SetNext(AudioDataPacket* next) { m_next = next; }
    AudioDataPacket* Next() const { return m_next; }

private:
    BYTE* m_pData;
    DWORD m_cbBytes;

    UINT64 m_u64QPCPosition;

    bool m_bDiscontinuity;
    bool m_bSilence;

    AudioDataPacket* m_next;
};


AddData makes a copy of the audio buffer and appends it to the linked list. If the packet is flagged as a discontinuity or silence, no audio data needs to be copied.

void DataCollector::AddData(size_t device, 
    BYTE* pData, DWORD cbBytes, UINT64 u64QPCPosition, 
    bool bDiscontinuity, bool bSilence)
{
    if (m_store)
    {
        m_packetCounts[device] = m_packetCounts[device] + 1;

        AudioDataPacket* item = new AudioDataPacket(
            pData, cbBytes, u64QPCPosition, 
            bDiscontinuity, bSilence);

        if (m_audioDataFirst[device] == NULL)
        {
            m_audioDataFirst[device] = item;
            m_audioDataLast[device] = item;
        }
        else
        {
            m_audioDataLast[device]->SetNext(item);
            m_audioDataLast[device] = item;
        }
    }
}

To keep the function call fast, no audio data is copied between the Collector and the Consumer. The Collector allocates the memory and then transfers ownership to the Consumer, which takes care of the deallocation.

DeviceInfo DataCollector::RemoveData(
    size_t device,
    AudioDataPacket** first, AudioDataPacket** last, 
    size_t *count, bool* error)
{
    *first = m_audioDataFirst[device];
    *last = m_audioDataLast[device];
    *count = m_packetCounts[device];
    *error = m_error;

    m_audioDataFirst[device] = NULL;
    m_audioDataLast[device] = NULL;
    m_packetCounts[device] = 0;

    DeviceInfo info = m_devices[device];

    return info;
}

The StoreData function sets a flag that tells whether incoming audio data is stored or discarded.

void DataCollector::StoreData(bool store)
{
    m_store = store;
}

2. Synchronizing channels

The Consumer takes the linked list and converts it to a data vector. It assigns each sample a time stamp derived from the packet time stamps.

Image 2
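The per-sample time stamps can be derived from the packet's time stamp and the sample rate. A sketch of that step, assuming 32-bit float samples (the `Sample` struct and `ExpandPacket` name are ours, for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Sample
{
    uint64_t qpcPosition;  // time stamp in 100-nanosecond units
    float value;
};

// Expand one packet into (time stamp, value) pairs. Each sample is
// offset from the packet time stamp by its index times the sample
// period, expressed in 100-ns ticks (1e7 ticks per second).
std::vector<Sample> ExpandPacket(uint64_t packetQPCPosition,
                                 const float* samples, size_t count,
                                 uint32_t sampleRate)
{
    std::vector<Sample> out;
    out.reserve(count);
    const double ticksPerSample = 1e7 / sampleRate;
    for (size_t i = 0; i < count; ++i)
    {
        Sample s;
        s.qpcPosition = packetQPCPosition
                      + static_cast<uint64_t>(i * ticksPerSample + 0.5);
        s.value = samples[i];
        out.push_back(s);
    }
    return out;
}
```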

Once we have data for each channel, we match the samples that have the closest time stamps.

Image 3
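Since both channels' time stamps are in ascending order, the matching can be done in a single pass. A sketch under that assumption (the `MatchClosest` helper is illustrative, not from the project):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Pair each time stamp in channel A with the closest time stamp in
// channel B using a two-pointer walk. Both inputs must be sorted in
// ascending order. Returns index pairs (a, b).
std::vector<std::pair<size_t, size_t>> MatchClosest(
    const std::vector<uint64_t>& a, const std::vector<uint64_t>& b)
{
    std::vector<std::pair<size_t, size_t>> pairs;
    size_t j = 0;
    for (size_t i = 0; i < a.size(); ++i)
    {
        // Advance j while the next B sample is strictly closer to a[i].
        while (j + 1 < b.size() &&
               (b[j + 1] > a[i] ? b[j + 1] - a[i] : a[i] - b[j + 1]) <
               (b[j] > a[i] ? b[j] - a[i] : a[i] - b[j]))
        {
            ++j;
        }
        pairs.push_back(std::make_pair(i, j));
    }
    return pairs;
}
```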

2.1 Discontinuity

The WASAPI device reports any missing data in the audio stream as a discontinuity. If a packet has the discontinuity flag set, as packet 4 does in image 4, the data before it is also invalidated, since we cannot be sure when the previous audio packet ended.

A discontinuity means that we must flush the data before it and restart the collection after it.

Image 4
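The flush rule above can be sketched as a scan over the packets' discontinuity flags: everything up to and including the last flagged packet is discarded, and collection restarts from the packet after it (the helper name is ours, for illustration):

```cpp
#include <cstddef>
#include <vector>

// Given the discontinuity flags of the collected packets in order,
// return the index of the first packet that is safe to keep. Packets
// before the last discontinuity are invalid, because we cannot be
// sure when the audio preceding the gap ended.
size_t FirstValidPacket(const std::vector<bool>& discontinuity)
{
    size_t first = 0;
    for (size_t i = 0; i < discontinuity.size(); ++i)
        if (discontinuity[i])
            first = i + 1;  // flush this packet and everything before it
    return first;
}
```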

GitHub repository