Posted by u/dumbest a year ago
Ask HN: How do browsers isolate internal audio from microphone input?
I've noticed an interesting feature in Chrome and Chromium: they seem to isolate internal audio from the microphone input. For instance, when I'm on a Google Meet call in one tab and playing a YouTube video at full volume in another tab, the video’s audio isn’t picked up by Google Meet. This isolation doesn’t happen if I use different browsers for each task (e.g., Google Meet on Chrome and YouTube on Chromium).

Does anyone know how Chrome and Chromium achieve this audio isolation?

Given that Chromium is open source, it would be helpful if someone could point me to the specific part of the codebase that handles this. Any insights or technical details would be greatly appreciated!

padenot · a year ago
The way this works (and I'm obviously taking a high-level view here) is by comparing what is being played to what is being captured. There is an inherent latency between what is called the capture stream (the mic) and the reverse stream (what is being output to the speakers, be it people talking or music or whatever), and by finding this latency and comparing, one can cancel the music from the speech captured.

Within a single process, or a tree of processes that can cooperate, this is straightforward to do (modulo the actual audio signal processing, which isn't): keep what you're playing around for a few hundred milliseconds, compare it to what you're getting from the microphone, find correlations, cancel.
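
As a rough illustration of that correlate-and-cancel step (a toy sketch only, not Chromium's or Firefox's actual code; real cancellers adapt a full filter rather than a single delay and gain), in TypeScript over mono Float32Array buffers at the same sample rate:

    // Find the playback->capture latency by looking for the lag with the
    // strongest correlation between what we played and what the mic heard.
    function estimateDelay(played: Float32Array, captured: Float32Array, maxLag: number): number {
      let bestLag = 0;
      let bestCorr = -Infinity;
      for (let lag = 0; lag <= maxLag; lag++) {
        let corr = 0;
        for (let n = lag; n < captured.length; n++) {
          corr += captured[n] * played[n - lag];
        }
        if (corr > bestCorr) {
          bestCorr = corr;
          bestLag = lag;
        }
      }
      return bestLag; // latency in samples
    }

    // Naive cancellation: subtract the delayed, scaled playback from the mic.
    function naiveCancel(played: Float32Array, captured: Float32Array, lag: number, gain: number): Float32Array {
      const out = new Float32Array(captured.length);
      for (let n = 0; n < captured.length; n++) {
        out[n] = captured[n] - gain * (n >= lag ? played[n - lag] : 0);
      }
      return out;
    }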

If the processes aren't related, there are multiple ways to do this. The OS may provide a capture API that does the cancellation (this is what happens on macOS for Firefox and Safari, for example), and you can use that: the OS knows what is being output. This is often available on mobile as well.

Sometimes (Linux desktop, Windows) the OS provides a loopback stream: a way to capture the audio that is being played back, and that can similarly be used for cancellation.

If none of this is available, you mix the audio output and perform the cancellation yourself, and the behaviour you observe happens.

Source: I do that, but at Mozilla; we unsurprisingly have the same problems and solutions.

Johnie · a year ago
This reminds me of:

>The missile knows where it is at all times. It knows this because it knows where it isn't. By subtracting where it is from where it isn't, or where it isn't from where it is (whichever is greater), it obtains a difference, or deviation

https://knowyourmeme.com/memes/the-missile-knows-where-it-is

enva2712 · a year ago
CasperH2O · a year ago
Up to a point, that text makes a lot of sense as a description of a PID controller, which is a form of control that only really looks at the error and tries to drive it to zero.
bitwize · a year ago
I've observed that hacker culture exists because DARPA funded institutions like MIT for AI research, because the military wanted the missile to know where it is.
Sakos · a year ago
This is almost weirdly philosophical. I've been thinking about this all morning.
wormius · a year ago
For a little more context on negative feedback, for those who want to know more (I believe this is what you're referring to?):

Here's a short historical interview with Harold Black from AT&T on his discovery/invention of the negative feedback technique for noise reduction. It's not super explanatory but a nice historical context: https://youtu.be/iFrxyJAtJ7U?si=8ONC8N2KZwq3Jfsq

Here's a more in-depth circuit explanation: https://youtu.be/iFrxyJAtJ7U?si=8ONC8N2KZwq3Jfsq

IIRC the issue was that AT&T was trying to get cross-country calling working, but to make the signal carry further you needed a louder signal. Amplifying the signal also amplified the distortion.

So Harold came up with this method, which ultimately reduced the distortion enough to allow calls to cross the country within the available power constraints.

For some reason I recall something about Denver being the cut-off point before the signal became too degraded... But I'm old and forgetful, so I could be misremembering something I read a while ago. If anyone has more specific info/context/citations, that'd be great, since this is just hearsay from memory, but I think it's something like this.

gpvos · a year ago
It just seems more logical for the OS to do that, rather than the application. Basically every application that uses microphone input will want to do this, and will want to compensate for all audio output of the device, not just its own. Why does the OS not provide a way to do this?
duped · a year ago
> Basically every application that uses microphone input will want to do this

The OS doesn't have more information about this than applications, and it's not that obvious whether an application wants the OS to fuck around with the audio input it sees. Even in the applications where this might be the obvious default behavior, you're wrong, since most listeners don't use loudspeakers at all, and this is not a problem when they wear headphones. And detecting that (also: is the input a microphone at all?) is not straightforward.

Not all audio applications are phone calls.

swatcoder · a year ago
> Why does the OS not provide a way to do this?

Some do.

But you need to have a strong-handed OS team that's willing to push everybody towards their most modern and highly integrated interfaces and sunset their older interfaces.

Not everybody wants that in their OS. Some want operating systems that can be pieced together from myriad components maintained by radically different teams, some want to see their APIs/interfaces preserved for decades of backwards compatibility, some want minimal features from their OS and maximum raw flexibility in user space, etc.

dilap · a year ago
On macOS/iOS, you get this using the AVAudioEngine API if you set voiceProcessingEnabled to true on the input node. It corrects for audio being played by all applications on the device.
kbolino · a year ago
This assumes there is an OS-managed software mixer sitting in the middle of all audio streams between programs and devices. Historically, that wasn't the case, because it would introduce a lot of latency and jitter in the audio. I believe it is still possible for a program to get exclusive access to an audio output device on Windows (WASAPI) and Linux (ALSA).
asveikau · a year ago
The OS doesn't know that the application doesn't want feedback from the speaker, and not 100% of applications will want such filtering. I think a best practice from the OS side would be to provide it as an optional flag. (Default could be on or off, with reasonable possibility for debate in either direction, but an app that really knows what it wants should be able to ask for it.)
rlpb · a year ago
There is a third place: a common library that all the apps use. If it is in the OS then it becomes brittle. If there's an improvement in the technology which requires an API change, that becomes difficult without keeping backwards compatibility or the previous implementation forever. Instead, there would be a newer generation common library which might eventually replace the first but only if the entire ecosystem chooses to leave the old one behind. Meanwhile there'd be a place for both. Apps that share use of a library would simply dynamically link to it.

This is the way things usually work in the Free Software world. For example: need JPEG support? You'll probably end up linking to libjpeg or an equivalent. Most languages have a binding to the same library.

Is that part of the OS? I guess the answer depends on how you define OS. On a Free Software platform it's difficult to say when a given library is part of the OS and when it is not.

bryanrasmussen · a year ago
I suppose the OS probably makes something like this available. When using Voiceover on a Mac and presenting in Teams, by default only the mic comes into Teams; you need to do something extra to share the other processes' audio.

That's on a Mac, of course, but in my experience Windows is much more trusting about what it gives applications access to, so I suppose the same thing is available there.

hpen · a year ago
How sure are you that basically every application wants this? Should there be a flag at the OS level for enabling the cancellation? How do you control that flag?
IvyMike · a year ago
Did you just invent yet another Linux audio stack?
Log_out_ · a year ago
At the lowest level, it's a Fourier transform over a system (your room; the echo chamber's response is known from some test sound), and the expected output, passed through that transform on its way to the mic, is subtracted. Most SoCs and machines have dedicated systems for that. The very same chip produces the echo estimate of the surroundings.
generalizations · a year ago
Is there any way to apply this outside the browser? Like, is there a version of this that can be used with Pulseaudio?
correct-horse · a year ago
To spare others from googling:

https://docs.pipewire.org/page_module_echo_cancel.html

https://wiki.archlinux.org/title/PipeWire/Examples#Echo_canc...

If you're still on pulseaudio for some reason, it ships with a similar module named "module-echo-cancel":

https://www.freedesktop.org/wiki/Software/PulseAudio/Documen...
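
For example (assuming a stock PulseAudio setup; see the linked docs for the module arguments), loading it from a shell creates an echo-cancelled source/sink pair that you can then select as your input and output devices:

    pactl load-module module-echo-cancel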

umutisik · a year ago
It's called Acoustic Echo Cancellation. An implementation is included in WebRTC, which is included in Chrome. A FIR filter (1D convolution) is applied to what the browser knows is coming out of the speakers, and this filter is continually optimized to cancel out as much as possible of what's coming into the microphone (this is a first approximation; the actual algorithm is more involved).
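
To make the "continually optimized FIR filter" part concrete, here is a toy NLMS (normalized least-mean-squares) canceller in TypeScript. It is a textbook sketch only, not the actual WebRTC implementation, which is considerably more involved:

    // Toy NLMS echo canceller: `farEnd` is a sample the browser is playing,
    // `nearEnd` is the corresponding mic sample; the return value is the mic
    // sample with the estimated echo removed.
    class NlmsCanceller {
      private weights: Float32Array; // FIR estimate of the echo path
      private history: Float32Array; // recent far-end samples, newest first

      constructor(private taps: number, private mu = 0.5, private eps = 1e-6) {
        this.weights = new Float32Array(taps);
        this.history = new Float32Array(taps);
      }

      process(farEnd: number, nearEnd: number): number {
        // Shift in the newest far-end sample.
        this.history.copyWithin(1, 0, this.taps - 1);
        this.history[0] = farEnd;

        // Predict the echo and compute the residual.
        let echo = 0;
        let power = this.eps;
        for (let k = 0; k < this.taps; k++) {
          echo += this.weights[k] * this.history[k];
          power += this.history[k] * this.history[k];
        }
        const error = nearEnd - echo;

        // Nudge the filter toward whatever cancels more of the echo.
        const step = (this.mu * error) / power;
        for (let k = 0; k < this.taps; k++) {
          this.weights[k] += step * this.history[k];
        }
        return error;
      }
    }
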
sojuz151 · a year ago
Remember that convolution is multiplication in the frequency domain, so this also handles different responses at different frequencies, not just delays
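
In symbols (standard adaptive-filter notation, nothing Chromium-specific): with x the playback (far-end) signal, d the mic signal, and ĥ the estimated echo path, the canceller computes

    e[n] = d[n] - \sum_{k=0}^{N-1} \hat{h}[k] \, x[n-k]
    \qquad\Longleftrightarrow\qquad
    E(\omega) = D(\omega) - \hat{H}(\omega) \, X(\omega)

so a single FIR filter absorbs both the bulk delay and the room's different gain and phase at each frequency.
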
meindnoch · a year ago
Search for the compilation flag "CHROME_WIDE_ECHO_CANCELLATION" in the Chromium sources, and you will find your answer.

Can't tell you anything else due to NDAs.

Wowfunhappy · a year ago
It's kind of nuts that (I'm assuming) the source code is publicly available but the developers who wrote it can't talk about it.

(I realize this situation isn't up to you and I appreciate that you chimed in as you could!)

ygjb · a year ago
This is super common.

When I worked at Mozilla, most stuff was open, but I still couldn't talk about stuff publicly because I wasn't a spokesperson for Mozilla. Same at OpenDNS/Cisco, or at Fastly, and now at Amazon. Lots of stuff I can talk about, but I generally avoid threads and comments about Amazon, or if I do, it's strictly to reference public documentation, public releases, or that sort of thing.

It's easier to simply not participate, link a document, or say no comment than it is to cross reference what I might say against what's public, and what's not.

geor9e · a year ago
Thanks, I see it's a user toggle too: chrome://flags#chrome-wide-echo-cancellation or edge://flags/#edge-wide-echo-cancellation. All these years I was praising my MacBook, thinking it was the hardware doing the cancellation, but it was Chromium the whole time.
codetrotter · a year ago
Side note: this can also cause a bit of difficulty in some situations, apparently, as seen in an HN post from a few months ago that didn't get much attention:

https://news.ycombinator.com/item?id=39669626

> I've been working on an audio application for a little bit, and was shocked to find Chrome handles simultaneous recording & playback very poorly. Made this site to demo the issue as clearly as possible

https://chrome-please-fix-your-audio.xyz/

filleokus · a year ago
Not sure if it's the whole story, but the latest response in the linked Chrome ticket seems to indicate that the APIs were used incorrectly by the author:

> <he...@google.com>

> Status: Won't Fix (Intended Behavior)

> Looking at the sample in https://chrome-please-fix-your-audio.xyz, the issue seems to be that the constraints just aren't being passed correctly [...]

> If you supply the constraints within the audio block of the constraints, then it seems to work [...]

> See https://jsfiddle.net/40821ukc/4/ for an adapted version of https://chrome-please-fix-your-audio.xyz. I can repro the issue on the original page, not on that jsfiddle.

https://issues.chromium.org/issues/327472528#comment14
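
For reference, the shape the Chromium engineer describes, with the processing constraints nested inside the audio object rather than at the top level, looks roughly like this (echoCancellation, noiseSuppression and autoGainControl are standard MediaTrackConstraints; the exact values the original page wanted may differ):

    // Per-track audio processing constraints belong inside the `audio` object,
    // not at the top level of the getUserMedia constraints dictionary.
    async function getRawMic(): Promise<MediaStream> {
      return navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: false,
          noiseSuppression: false,
          autoGainControl: false,
        },
      });
    }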

supriyo-biswas · a year ago
The technical term that you're looking for is acoustic echo cancellation[1].

It's a fairly common problem in signal processing, and comes up in "simple" devices like telephones too.

[1] https://www.mathworks.com/help/audio/ug/acoustic-echo-cancel...

mananaysiempre · a year ago
I seem to remember analog telephone lines used a very simple but magic-looking transformer-based circuit of some sort for this purpose. Presumably that worked because they didn’t need to worry about a processing delay?
ajb · a year ago
That's slightly different. Analog phone lines suffer from 'line echo', which is an echo generated by the phone line itself, because the same pair of wires is used for signals travelling in both directions. If you're thinking of the 'hybrid', which connects the phone line to the speaker and mic, that matches the impedance of each of the three pairs of wires (to the mic, speaker, and phone line) to avoid generating echo in the first place. But they weren't perfect (and echo could be generated at other points in the line), so later, digital electronics was used to do first "echo suppression" (just zero any signal below a certain magnitude) and then echo cancellation, which is very similar to acoustic echo cancellation (an AEC can probably do anything an LEC can do, but the LEC is less capable).

I'm not aware of anyone doing echo cancellation using an analog circuit, but that doesn't mean no one did. I guess it's theoretically possible, but I don't see how the adaptation could work.

chadcmulligan · a year ago
There used to be a transformer in phones for sidetone: a small amount of what you say is piped back into the earpiece. They did this because they found people would shout if they couldn't hear their own voice. I've often wished mobiles would do this.
kajecounterhack · a year ago
Google Meet uses source separation technology to denoise the audio. It's a neural net that's been trained to separate speech from non-speech and ensure that only speech is being piped through. It can even separate different speakers from one another. This technology got really good around 2021 when semi-supervised ways of training the models were developed, and is still improving :)
atoav · a year ago
A side effect of echo cancellation. The browser knows what audio it is playing and can correlate that with whatever comes in through the mic, maybe even by outputting inaudible test signals, or by picking widely supported defaults.

This is needed because many people don't use headphones, and if you have more than one endpoint with mic and speakers open, you will get feedback galore if you don't do something to suppress it.

j45 · a year ago
I've used audio a lot on Windows/Mac for a long time, and a bit on Linux too.

I'd say how the audio routing comes together depends on the combination of the hardware, software, and OS, each of which handles pieces of it.

Generally you have to see what's available, how it can or can't be routed, and what software or settings could be enabled or added to introduce more flexibility in routing, and then make the audio routing work how you want.

More specifically, some data points:

SOUND DRIVERS: Part of this can be managed by the sound drivers on the computer. Applications like web browsers can access those settings or list of devices available.

Software drivers can let you pick up what's playing on a computer, and then specifically in browsers it can vary.

CHANNELS: There are often different channels for everything. Physical headphone/microphone jacks, etc. They all become devices with channels (input and output).

ROUTING: The input into a microphone channel can be just the voice and/or system audio. System audio can further be broken down into specific sources. OBS has some nice examples of this functionality.

ADVANCED ROUTING: There are some virtual audio drivers that can also help you achieve the audio isolation or workflow folks are after.