Posted by u/ralfelfving 2 years ago
Show HN: Open-source macOS AI copilot using vision and voice (github.com/elfvingralf/ma...)
Heeey! I built a macOS copilot that has been useful to me, so I open sourced it in case others would find it useful too.

It's pretty simple:

- Use a keyboard shortcut to take a screenshot of your active macOS window and start recording the microphone.

- Speak your question, then press the keyboard shortcut again to send your question + screenshot off to OpenAI Vision.

- The Vision response is presented in-context/overlayed over the active window, and spoken to you as audio.

- The app keeps running in the background, only taking a screenshot/listening when activated by keyboard shortcut.

It's built with NodeJS/Electron, and uses OpenAI Whisper, Vision and TTS APIs under the hood (BYO API key).
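For anyone curious about the plumbing, the Vision call described above boils down to a chat-completions request that pairs the Whisper transcript with the screenshot as a base64 data URL. A minimal sketch, assuming function and model names that are illustrative rather than the actual repo code:

```javascript
// Hypothetical sketch of the request body such an app might send to the
// OpenAI Vision API: the transcribed question as a text part plus the
// window screenshot as a base64-encoded data URL image part.
function buildVisionRequest(question, screenshotBase64) {
  return {
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${screenshotBase64}` },
          },
        ],
      },
    ],
    // Keeping max_tokens low keeps the spoken TTS answer short.
    max_tokens: 300,
  };
}
```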

There's a simple demo and a longer walk-through in the GH readme https://github.com/elfvingralf/macOSpilot-ai-assistant, and I also posted a different demo on Twitter: https://twitter.com/ralfelfving/status/1732044723630805212

e28eta · 2 years ago
Did you find that calling it “OSX” in the prompt worked better than macOS? Or was that just an early choice that you didn’t spend much time on?

I was skimming through the video you posted, and was curious.

https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s

code link: https://github.com/elfvingralf/macOSpilot-ai-assistant/blob/...

ralfelfving · 2 years ago
No, this is an oversight by me. To be completely honest, up until the other day I thought it was still called OSX. So the project was literally called cOSXpilot, but at some point I double-checked and realized it's been called macOS for many years. Updated the project, but apparently not the code :)

I suspect OSX vs macOS has marginal impact on the outcome :)

e28eta · 2 years ago
Haha, makes perfect sense, thanks for the reply!
hot_gril · 2 years ago
Heh. I remember calling it Mac OS back in the day and getting corrected that it's actually OS X, as in "OS ten," and hasn't been called Mac OS since Mac OS 9. Glad Apple finally saw it my way (except it's cased macOS).
jondwillis · 2 years ago
You should add an option for streaming text as the response instead of TTS, and maybe text in place of the voice command as well. I've been tire-kicking a similar kind of copilot for a while; hit me up on Discord @jonwilldoit
ralfelfving · 2 years ago
There are definitely some improvements to be made to shuttling the data between interface<->API; all of that was done in a few hours on day 1, and there are a few things I decided to fix later.

I prefer speaking over typing, and I sit alone, so probably won't add a text input anytime soon. But I'll hit you up on Discord in a bit and share notes.

jondwillis · 2 years ago
Yeah, just some features I could see adding value and not being too hard to implement :)
tomComb · 2 years ago
> text in place of the voice command as well

That would be great for people with Mac mini who don't have a mic.

ralfelfving · 2 years ago
Hmmm... what if I added functionality that uses the webcam to read your lips?

Just kidding. Text seems to be the most requested addition, and it wasn't on my own list :) Will see if I add it; it should be fairly easy to make it configurable and render a text input window with a button instead of triggering the microphone.

Won't make any promises, but might do it.
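The configurable-input idea could be as small as a single branch that routes to either the mic recorder or a text prompt window. A hypothetical sketch (config key and function names are mine, not from the repo):

```javascript
// Hypothetical input-mode toggle: read a flag from config and route to
// either the microphone recorder or a text prompt. Both inputs are passed
// in as functions so the routing logic stays testable.
function getQuestion(config, { recordVoice, promptText }) {
  return config.inputMode === "text" ? promptText() : recordVoice();
}
```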

ralfelfving · 2 years ago
Added text input instead of voice as an option today.
faceless3 · 2 years ago
Wrote some similar scripts for my Linux setup, that I bind with XFCE keyboard shortcuts:

https://github.com/samoylenkodmitry/Linux-AI-Assistant-scrip...

F1 - ask ChatGPT API about current clipboard content

F5 - same, but opens editor before asking

num+ - starts/stops recording microphone, then passes to Whisper (locally installed), copies to clipboard

I find myself rarely using them however.

ralfelfving · 2 years ago
Nice!
Art9681 · 2 years ago
Make sure to set OpenAI API spend limits when using this or you'll quickly find yourself learning the difference between the cost of the text models and vision models.

EDIT: I checked again and it seems the pricing is comparable. Good stuff.

ralfelfving · 2 years ago
I think a prompt cost estimator might be a nifty thing to add to the UI.

Right now there's also a daily request limit on the Vision API that kicks in before it gets really bad, 100+ requests depending on what your max spend limit is.
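A minimal sketch of what such a cost estimator could look like. The rates and the image-token formula are assumptions based on OpenAI's published gpt-4-vision pricing at the time (85 base tokens plus 170 per 512px tile, roughly $0.01 per 1K input and $0.03 per 1K output tokens); check the current pricing page before trusting the numbers:

```javascript
// Hypothetical per-request cost estimator for a Vision call.
// ASSUMED pricing: image tokens = 85 + 170 per 512px tile,
// $0.01 / 1K input tokens, $0.03 / 1K output tokens.
function estimateVisionCost(widthPx, heightPx, promptTokens, outputTokens) {
  const tiles = Math.ceil(widthPx / 512) * Math.ceil(heightPx / 512);
  const imageTokens = 85 + 170 * tiles;
  const inputCost = ((imageTokens + promptTokens) / 1000) * 0.01;
  const outputCost = (outputTokens / 1000) * 0.03;
  return inputCost + outputCost; // USD
}
```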

krschacht · 2 years ago
I love it! I’ve been circling around a similar set of ideas, although my version integrates with the web-based ChatGPT:

https://news.ycombinator.com/item?id=38244883

There are some pros and cons to that. I’m intrigued by your stand-alone macOS app.

hackncheese · 2 years ago
Love it! Will definitely use this when a quick screenshot will help specify what I am confused about. Is there a way to hide the window when I am not using it? i.e. I hit cmd+shift+' and it shows the window, then when the response finishes reading, it hides again?
ralfelfving · 2 years ago
There's a way for sure, it's just not implemented. Allowing for more configurability of the window(s) is on my list, because it annoys me too! :)
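The auto-hide behavior described above would roughly be: show the overlay when the shortcut fires, await TTS playback, then hide again. A hypothetical sketch with the Electron BrowserWindow passed in as a plain object so the flow is clear (names are illustrative, not from the repo):

```javascript
// Hypothetical auto-hide flow: `win` stands in for an Electron
// BrowserWindow, `speak` for a TTS playback function that resolves
// when audio finishes.
async function answerAndHide(win, speak, answer) {
  win.show();
  await speak(answer); // resolves when TTS playback ends
  win.hide();
}
```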
hackncheese · 2 years ago
Annoyance Driven Development™
poorman · 2 years ago
Currently imagining my productivity while waiting 10 seconds for the results of the `ls` command.
ralfelfving · 2 years ago
It's a basic demo to show people how it works. I think you can imagine many other examples where it'll save you a lot of time.
hot_gril · 2 years ago
The demo on Twitter is a lot cooler, partially because you scroll to show the AI what the page has. Maybe there's a more impressive demo to put on the GH too?
thomashop · 2 years ago
Just used it with the digital audio workstation Ableton Live. It is amazing! Its tips were spot-on.

I can see how much time it will save me when I'm working with software or a domain I don't know very well.

Here is the video of my interaction: https://www.youtube.com/watch?v=ikVdjom5t0E&feature=youtu.be

Weird, these negative comments. Did people actually try it?

ralfelfving · 2 years ago
So glad when I saw this, thanks for sharing! It was exactly music production in Ableton that sparked this idea in my head the other week. I tried to explain to a friend who doesn't use GPT much that with Vision, you can speed up your music production and learn how to use advanced tools like Ableton more quickly. He didn't believe me. So I grabbed an Ableton screenshot off Google and used ChatGPT -- then I felt there had to be a better way, realized that I have my own use-cases, and it all evolved into this.

I sent him your video, hopefully he'll believe me now :)

thomashop · 2 years ago
You may be interested in two proofs of concept I've been working on. I work with generative AI and music at a company.

MidiJourney: ChatGPT integrated into Ableton Live to create MIDI clips from prompts. https://github.com/korus-labs/MIDIjourney

I have some work on a branch that makes ChatGPT a lot better at generating symbolic music (a better prompt and music notation).

LayerMosaic lets you combine MusicGen text-to-music loops with our company's music library. https://layermosaic.pixelynx-ai.com/

mikey_p · 2 years ago
Is it just me or is it incredibly useless?

"Here's a list of effects. Here's a list of things that make a song. Is it good? Yes. What about my drum effects? Yes here's the name of the two effects you are using on your drum channel"

None of this is really helpful and I can't get over how much it sounds like Eliza.

thomashop · 2 years ago
I just made a video where I test it with a proper use case. It helps me find effects to make a bassline more dubby and helps carve out frequencies in the kick drum to make space for the bass.

https://www.youtube.com/watch?v=zyMmurtCkHI

thomashop · 2 years ago
I made that video right at the start, but since then I've asked it, for example, what kind of compression parameters would fit a certain track, and it explained how to find an expert function that I would otherwise have had to consult a manual for.
urbandw311er · 2 years ago
Yeah I thought the same. Ultra generic advice and no evidence it has actually parsed anything unique or useful from the user’s actual composition.
pelorat · 2 years ago
I mean it does send a screenshot of your screen off to a 3rd party, and that screenshot will most likely be used in future AI training sets.

So... beware when you use it.

zwily · 2 years ago
OpenAI claims that data sent via the API (as opposed to ChatGPT) will not be used in training. Whether or not you believe them is a separate question, but that's the claim.
thomashop · 2 years ago
Beware of it seeing a screenshot of my music set? OpenAI will start copying my song structure?

You can turn it on and off. Not necessary to turn it on when editing confidential documents.

You never enable screen-sharing in videoconferencing software?

Deleted Comment