eduren (u/eduren) - Readit News

eduren commented on Launch HN: DAGWorks – ML platform for data science teams · Posted by u/krawczstef

eduren · 3 years ago

Hey Stefan and Elijah, I really like the approach you're taking, especially with Hamilton being the open core.

I've got recent experience with data eng / pipeline startups and am wondering if you are hiring for your first engineers at this time.

eduren commented on Automatic supercuts on the command line with Videogrep lav.io/notes/videogrep-tu... · Posted by u/saaaam

eduren · 4 years ago

If anyone else decides to give this a try on video files with multiple audio tracks, there doesn't seem to be an easy way to tell it to select a certain track.

I got it working by manually adding `-map 0:2` (`2` being the trackid I'm interested in) when calling ffmpeg.

You'll have to make that edit in both `videogrep/transcribe.py` and `moviepy/audio/io/readers.py`.

And I'm not sure how easy adding real support for that would be, considering that moviepy doesn't currently have a way to support it (https://github.com/Zulko/moviepy/issues/1654)
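For illustration, here is a minimal sketch of what an ffmpeg invocation with an explicit `-map` looks like when built from Python. The helper name `build_ffmpeg_audio_cmd` and the file names are hypothetical; this is not how videogrep or moviepy actually assemble their ffmpeg calls, only the shape of the argument list the manual edit produces.

```python
from typing import List


def build_ffmpeg_audio_cmd(src: str, dst: str, stream_index: int = 2) -> List[str]:
    """Build an ffmpeg argument list that keeps one stream of `src`.

    `-map 0:N` selects stream N of the first input (input index 0).
    Hypothetical helper for illustration only.
    """
    return [
        "ffmpeg",
        "-i", src,                    # first (and only) input, index 0
        "-map", f"0:{stream_index}",  # keep only this stream in the output
        dst,
    ]


# Example: keep stream 2 of a multi-track file.
print(build_ffmpeg_audio_cmd("episode.mkv", "episode.wav"))
```

Note that `0:2` counts across all streams in the file (video, audio, subtitles). ffmpeg also accepts type-relative specifiers such as `0:a:1`, which means "the second audio stream of the first input".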

eduren commented on Automatic supercuts on the command line with Videogrep lav.io/notes/videogrep-tu... · Posted by u/saaaam

867-5309 · 4 years ago

Great project! Since it relies heavily on subtitle files, and as an alternative to generating your own, which websites would you recommend for finding subtitles for videos that are not on YouTube, i.e. movies and series? Preferably ones with rating systems similar to guitar tab websites. I can envisage a musical similarity in the variance and quality of user-submitted content (e.g. timing, volume, tone, punctuation, expression, improvisation, etc.), since I doubt many are composed from the actual scripts. I have never used vosk, so I am also wondering whether it would be quicker and more reliable than filtering and spot-checking, say, a few subtitle files per video.

eduren · 4 years ago

I just started playing around with the transcription part after seeing this blog post. Consider giving it a try.

I'm not sure how well most subtitle sources will work with this. I don't think they'll generally embed the word timings needed for picking out fragments (just line timings). The blog post mentions this being the case for `.srt` specifically. I'm not 100% sure, though; someone with a better understanding of the subtitle formats may be able to correct me.
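To make the line-timing point concrete, here is a small sketch that parses SubRip (`.srt`) cues. The sample cue and the `parse_srt` helper are made up for illustration; the point is only that each cue carries one start and one end time for the whole line, with nothing per word.

```python
import re
from typing import List, Tuple

# One timing line per cue: "HH:MM:SS,mmm --> HH:MM:SS,mmm"
SRT_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def parse_srt(text: str) -> List[Tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, line_text) for each cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = SRT_TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                cues.append((start, end, " ".join(lines[i + 1:])))
                break
    return cues


SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello there, how are you?
"""

for start, end, text in parse_srt(SAMPLE):
    # A single (start, end) pair covers all five words of the line.
    print(start, end, text)
```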

FWIW I'm finding the video transcription to be working quite well (and I even decided to use Japanese-speaking media because I wanted to see how well vosk handles it).

It might be my system, but the transcription is unfortunately a bit slow and single-threaded. I quickly added GNU `parallel` in front of the transcription step to speed up processing an entire season.
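The same idea can be sketched in Python instead of GNU `parallel`: start one transcription process per file and cap how many run at once. The helper names and file names below are hypothetical, and the `videogrep --input FILE --transcribe` invocation is an assumption about the CLI; substitute whatever command you actually run.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def transcribe_cmd(path: str) -> List[str]:
    """Command line for transcribing one file (assumed videogrep flags)."""
    return ["videogrep", "--input", path, "--transcribe"]


def run_all(commands: Sequence[List[str]], workers: int = 4) -> List[int]:
    """Run each command as its own process, at most `workers` at a time.

    Returns the exit codes in the same order as `commands`.
    """
    def run(cmd: List[str]) -> int:
        return subprocess.run(cmd, check=False).returncode

    # Threads only wait on child processes here, so each transcription
    # still runs in its own process and can use its own CPU core.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


# Usage (hypothetical file names):
# exit_codes = run_all([transcribe_cmd(p) for p in ["ep01.mkv", "ep02.mkv"]])
```

A thread pool is enough because the heavy work happens in the child processes; the Python side just waits for them to finish.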

eduren commented on Ask HN: Who wants to be hired? (May 2022) · Posted by u/whoishiring

eduren · 4 years ago

Location: WA-US

Remote: Preferred

Willing to relocate: No

Technologies: Go, Python, Rust, Typescript, k8s, airflow, kafka, postgres

Résumé/CV: If requested

Linkedin: https://www.linkedin.com/in/reina-feather-7b676763/

Email: feather dot rw at gmail

eduren commented on Ask HN: Who wants to be hired? (November 2021) · Posted by u/whoishiring

eduren · 4 years ago

Location: Seattle, WA

Remote: Yes (require 100%)

Willing to relocate: No

Technologies: k8s, airflow, docker, kafka, golang, postgres, python, rust, C/C++, JS/TS, CI/CD

Resume: available on request, https://www.linkedin.com/in/reina-feather-7b676763

Email: feather dot rw at gmail

Looking for cloud infrastructure focused roles, background in Data Engineering, devops, backends.

eduren commented on Ask HN: Who wants to be hired? (September 2021) · Posted by u/whoishiring

eduren · 5 years ago

Location: Seattle, WA

Remote: Yes (require 100%)

Willing to relocate: No

Technologies: k8s, time series databases, kafka, golang, postgres, python, rust, C++, docker, JS/TS, CI/CD

Resume: available on request, https://www.linkedin.com/in/reina-feather-7b676763

Email: feather dot rw at gmail

Looking for infrastructure focused roles, background in Data Engineering, devops, backends.

eduren commented on Apache Arrow Datafusion 5.0.0 release arrow.apache.org/blog/202... · Posted by u/houqp

seddonm1 · 5 years ago

Disclosure: I am a contributor to Datafusion.

I have done a lot of work in the ETL space in Apache Spark to build Arc (https://arc.tripl.ai/) and have ported a lot of the basic functionality of Arc to Datafusion as a proof-of-concept. The appeal to me of the Apache Spark and Datafusion engines is the ability to a) separate compute and storage and b) express transformation logic in SQL.

Performance: From those early experiments Datafusion would frequently finish processing an entire job _before_ the SparkContext could be started - even on a local Spark instance. Obviously this is at smaller data sizes but in my experience a lot of ETL is about repeatable processes not necessarily huge datasets.

Compatibility: Those experiments were done a few months ago and the SQL compatibility of the Datafusion engine has improved extremely rapidly (WINDOW functions were recently added). There is still some missing SQL functionality (for example to run all the TPC-H queries https://github.com/apache/arrow-datafusion/tree/master/bench...) but it is moving quickly.

eduren · 5 years ago

Oh hey, thanks for the info!

I spent some time evaluating Arc for my team's ETL purposes and I was really impressed. I hesitated somewhat to move forward with it because it seemed really tied into the Spark ecosystem (for great reasons). We just weren't at all familiar with deploying and operating Spark, so we ended up rolling our own scripts on top of (an existing) Airflow cluster for now.

Besides performance reasons, are there any other advantages to porting Arc to run on top of datafusion? If the porting effort was shared somewhere I'd love to dig in and see what the proof-of-concept looks like.

eduren commented on Apache Arrow Datafusion 5.0.0 release arrow.apache.org/blog/202... · Posted by u/houqp

houqp · 5 years ago

One of the Arrow Datafusion committers here. Happy to help answer any question.

eduren · 5 years ago

I've been following Arrow and Datafusion dev for a little bit, mostly because the architecture and goals look interesting.

What I'd be curious about is one of the possible use cases mentioned in the Readme: ETL processes. I have yet to come across any projects that are building ETL/ELT/pipeline tools that leverage Datafusion. Might not be looking in the right places.

Would anyone have insight into whether this is simply unexplored territory, or just not as good of a fit as other use cases?

eduren commented on Amazon Braket – Get Started with Quantum Computing aws.amazon.com/blogs/aws/... · Posted by u/aloknnikhil

eduren · 6 years ago

Interesting choice for the name: https://en.wikipedia.org/wiki/Bra%E2%80%93ket_notation

eduren commented on Please Buy Less pleasebuyless.com/... · Posted by u/anilshanbhag

daenz · 6 years ago

>Without having to shame people into removing themselves from the economy.

No, you'll just remove them from the economy without their consent, by introducing regulation to artificially lower supply. Everyone is against this: the companies who won't make as much profit and the consumers who won't be able to purchase the goods that they want. Good luck with that.

eduren · 6 years ago

>No, you'll just remove them from the economy without their consent, by introducing regulation to artificially lower supply

A few things:

1. Presumably any carbon tax would have to be secured and defended by our democratic institutions. Thus we would have consent (or as close as you can get to large scale consent in our multi-actor society). While I agree that regulating basic consumption for large swaths of the economy has a bit of an authoritarian bent to it, I'm not sure how else we incentivize ourselves to decrease consumption.

2. Lowered supply is not a given. Companies would be incentivized to find production chains, energy sources, and materials that had a lower impact (and thus a lower tax). Less impactful products would be able to price themselves under the high-impact products and satiate the demand.

EDIT: Added 3. Consumption itself is not the enemy. The thing we want to minimize is negative externalities. It just so happens that under our current system, manipulating levels of consumption is the only lever our society has for affecting industrial emissions.