faizshah · 4 years ago
Great post. For this pipeline I would probably have used a Makefile for the batch pipeline instead of Airflow, just to keep it simple. I would also make my sink a SQLite database so that you can easily search through it with a web interface using Datasette.
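For the SQLite sink, a minimal sketch of what I mean (the file, table, and column names are just placeholders):

    import sqlite3
    import pandas as pd

    # Placeholder: the cleaned-up records produced by the pipeline
    df = pd.read_csv("music_activity.csv")

    # Use a single SQLite file as the sink
    conn = sqlite3.connect("music.db")
    df.to_sql("activity", conn, if_exists="replace", index=False)
    conn.close()

    # Then browse it in a web UI with:
    #   pip install datasette
    #   datasette music.db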

For the places where bash was used I would just use Python, and for any CLI tools you want to call, I just use subprocess. It's much simpler, and I can run the scripts in a REPL and execute cells in Jupyter or plain PyCharm, so it's quick and interactive.
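Something like this, where ffprobe and the paths are only stand-ins for whatever CLI tool and directory layout the pipeline actually uses:

    import subprocess
    from pathlib import Path

    def track_duration(path: Path) -> float:
        # Call a CLI tool via subprocess instead of a bash loop and capture its stdout;
        # check=True raises CalledProcessError if the tool exits non-zero.
        result = subprocess.run(
            ["ffprobe", "-v", "error", "-show_entries", "format=duration",
             "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
            check=True,
            capture_output=True,
            text=True,
        )
        return float(result.stdout.strip())

    for mp3 in Path("takeout/music").glob("*.mp3"):
        print(mp3.name, track_duration(mp3))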

Love that you included something on building a data dictionary; I'm honestly guilty of not including a good data dictionary for the source data in the past. I would just leave the output of df.describe() or df.info() at the top of the Jupyter notebook where the source data gets restructured before processing. I now think you should build a data dictionary for both the source data and the final data and save it as a CSV, since that's more maintainable, or at least leave a comment in your script.
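A rough sketch of what I mean (the source path is a placeholder and the descriptions still have to be written by hand):

    import pandas as pd

    df = pd.read_csv("source_data.csv")  # placeholder for the raw source data

    # One row per column: name, dtype, null count, and an example value
    data_dict = pd.DataFrame({
        "column": df.columns,
        "dtype": df.dtypes.astype(str).values,
        "nulls": df.isna().sum().values,
        "example": [df[c].dropna().iloc[0] if df[c].notna().any() else "" for c in df.columns],
    })
    data_dict["description"] = ""  # fill in by hand, then keep it under version control

    data_dict.to_csv("data_dictionary.csv", index=False)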

Otherwise everything else is pretty similar to what I would do. I just went to my Google Takeout and apparently all my Google Play data and songs are gone, so I guess I can't try this myself…

wodenokoto · 4 years ago
My first thought was also "why not SQLite?", but the author says he already has a MariaDB running. So, using the tools you know.

I guess it is the same for make vs airflow. I had no idea they could be used interchangeably for single machine workloads.

While I've seen Datasette mentioned in a lot of places, I still don't really know what it is, but if it makes exploring SQLite databases easy, I should give it a try!

faizshah · 4 years ago
The Makefile data pipeline is definitely an underrated technique; there are a couple of great HN comments on it:

- https://news.ycombinator.com/item?id=22283368

- https://news.ycombinator.com/item?id=18896204

I personally learned it from bioinformaticians; there's great coverage of this and other command-line data skills in this book: https://www.oreilly.com/library/view/bioinformatics-data-ski...

The SQLite, pandas, bash, make stack for quick data science projects is a great and maintainable one that doesn’t require too much specialized knowledge.
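As a toy sketch of the Make approach (the script and file names are made up), each pipeline step becomes a file target, so only the steps whose inputs changed get re-run:

    # Each pipeline step is a file target, so `make` only re-runs steps
    # whose inputs changed. (Recipe lines must be indented with a tab.)
    all: music.db

    activity.csv: parse_takeout.py takeout
    	python parse_takeout.py takeout > activity.csv

    music.db: load_db.py activity.csv
    	python load_db.py activity.csv music.db

    clean:
    	rm -f activity.csv music.db

    .PHONY: all clean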

otter-in-a-suit · 4 years ago
I use SQLite quite a bit and I think it's fantastic, especially since Hipp seems to be a wonderful guy (judging by The Changelog podcast).

And you're right - use the tools you know and have running. I have all sorts of schemas and tables on that old instance, since I tend to use it whenever I need "anything SQL" - when I'm at home, it's as easy as using SQLite. My latest article used Trino and Hadoop-adjacent stuff... while fascinating in its own right, sometimes it's nice to just say "jdbc://..." :)

progbits · 4 years ago
So are the MP3 files not the same as what the author uploaded? I could imagine weird organization for tracks from the service, but for self-uploaded data I would be surprised if they didn't just give them back unchanged.

The article never mentioned how this showed up in the GPM app itself, which feels like an omission.

Otherwise a nice article but it reminds me why I long ago gave up on media metadata organization. So much work, so much mess...

randomifcpfan · 4 years ago
IIRC, GPM stored user uploads as MP3. If you uploaded a non-MP3 file, it was transcoded into an MP3 during upload. This is the file that GPM Takeout provides.

Separate from that, GPM matched your uploaded MP3 file against the service's music corpus, and if there was a match, the service streamed the canonical version instead. Originally the streaming service used 320 kbps MP3, but it later switched to 256 kbps AAC. GPM Takeout does not provide the canonical version.

kiloDalton · 4 years ago
In the case of lossless files, the takeout files are emphatically not the same files that were uploaded. Google Music would allow a user to upload lossless FLAC files, but internally it converted them to 320 kbps MP3 files. So GPM certainly transcoded a portion of uploaded files. I'm not sure to what extent it left files alone if they met Google's formatting specifications. Perhaps someone else knows.

jeffbee · 4 years ago
I don't think they did very much leaving things alone. One of my biggest problems with GPM was that my uploads would seemingly get de-duplicated alongside some other record that wasn't exactly the same, like a reissue or a remaster of the same record that sounded noticeably different. Sometimes an album I uploaded would gain a mysterious bonus track. They also at some point hosed up the whole system in such a way that many of my records contained every track twice, which meant I had to make playlists out of my old albums just to remove the even-numbered tracks and make it listenable again.

If you do a takeout from YTM, it says your music files are "Your originally uploaded audio file", which is nice. Since music in YTM may have been migrated from GPM, that seems to imply that GPM retained the originals.

When they shut down GPM I migrated to YTM, which doesn't seem to have these specific catalog problems. I also just re-organized my local copy of my FLACs using MusicBrainz Picard. Unlike this author I no longer have the giant wall of CDs!

wodenokoto · 4 years ago
> The script should be decently self-explanatory [...] Please note that this is all single-threaded, which I don’t recommend - with nohup and the like, you can trivially parallelize this.

How do you parallelize a loop in bash without all the echoes getting intertwined and jumbled together?

karlding · 4 years ago
In general, you can partition the loop to delegate to "workers" and have each instance pipe its output to its own set of files, one per output stream. This avoids the need for mutual exclusion around your output streams. If you need to aggregate the logs afterwards, run them through a log aggregator.

otter-in-a-suit · 4 years ago
Maybe not nohup itself, but I often do simple background tasks + a blocking `wait` statement in a loop, for instance here: https://github.com/chollinger93/debian-scripts/blob/master/u...

If you don't use nohup, you can just redirect stdout and stderr to a different file descriptor (or log file) within said loop.

Or, of course, script something in a language of your choice.
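A minimal sketch of the pattern described in the two comments above: background jobs, one log file per job so nothing interleaves, and a blocking wait (process_track and the paths are made-up placeholders):

    #!/usr/bin/env bash
    set -euo pipefail
    mkdir -p logs

    i=0
    for f in takeout/music/*.mp3; do
        # Each background job writes stdout and stderr to its own log file,
        # so nothing gets interleaved on the terminal.
        process_track "$f" > "logs/job-$i.log" 2>&1 &
        i=$((i + 1))

        # Throttle: block until the current batch of 4 jobs finishes.
        if (( i % 4 == 0 )); then
            wait
        fi
    done
    wait  # block until the last batch is done

    cat logs/job-*.log  # aggregate the logs afterwards if you need them in one place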