Hi HN! I'm excited to share Autolabel, an open-source Python library to label and enrich text datasets with any Large Language Model (LLM) of your choice.
We built Autolabel because access to clean, labeled data is a huge bottleneck for most ML/data science teams. The most capable LLMs are able to label data with high accuracy, and at a fraction of the cost and time compared to manual labeling. With Autolabel, you can leverage LLMs to label any text dataset with <5 lines of code.
We’re eager for your feedback!
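To make the workflow concrete, here's an illustrative sketch of what LLM-assisted labeling looks like under the hood (this is a generic example, not Autolabel's actual API — the prompt template, label set, and `llm` callable are all placeholders):

```python
# Generic sketch of LLM-assisted text labeling (not Autolabel's real API):
# send each example plus a task prompt to an LLM and collect the returned label.

def label_dataset(examples, llm, labels=("positive", "negative")):
    prompt = "Classify the sentiment as one of {}: {{text}}".format(", ".join(labels))
    results = []
    for text in examples:
        raw = llm(prompt.replace("{text}", text)).strip().lower()
        # fall back to None when the model answers outside the label set
        results.append(raw if raw in labels else None)
    return results

# Stub LLM for demonstration; a real deployment would call an API or local model.
fake_llm = lambda p: "positive" if "love" in p else "negative"
print(label_dataset(["I love this", "terrible service"], fake_llm))
```

The real library wraps this loop with prompt construction, provider integrations, caching, and quality estimation, but the core idea is the same.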
Autolabel is quite orthogonal to this - it's a library that makes interacting with LLMs very easy for labeling text datasets for NLP tasks.
We are actively looking at integrating function calling into Autolabel, though, to improve label quality and support downstream processing.
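For readers unfamiliar with function calling: the idea is to constrain the model to return structured output instead of free-form text. A hedged sketch of how that could help labeling (the `record_label` function name, schema, and mock response below are hypothetical, OpenAI-style payloads, not anything Autolabel ships today):

```python
import json

# Hypothetical sketch: OpenAI-style function calling used to force the model
# to return a label from a fixed set, rather than free-form text.
label_fn = {
    "name": "record_label",  # hypothetical function name
    "parameters": {
        "type": "object",
        "properties": {
            "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "confidence": {"type": "number"},
        },
        "required": ["label"],
    },
}

def parse_call(response):
    # The API returns function-call arguments as a JSON string; parse and
    # extract the label (and confidence, if the model provided one).
    args = json.loads(response["function_call"]["arguments"])
    return args["label"], args.get("confidence")

# Mocked model response shaped like the API's function_call payload.
mock = {"function_call": {"name": "record_label",
                          "arguments": '{"label": "positive", "confidence": 0.92}'}}
```

Because the schema enumerates the allowed labels, the model can't drift into answers outside the label set, which simplifies downstream parsing.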
But the key issue is going to be privacy. I’m not big on LLMs, so I’m sorry if this is obvious, but can I use something like this without sending my data outside my own organisation?
https://github.com/ggerganov/llama.cpp
You need to be careful about licensing - for some of these models, it's a legal grey area whether you can use them for commercial projects.
The 'best' models require quite large hardware to run, but a popular compression technique at the moment is 'quantization': using lower-precision model weights. I find it a bit hard to evaluate which open-source models are better than others, and how they are affected by quantization.
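The core idea behind quantization is simple enough to sketch in a few lines: map float weights onto a small integer range via a scale factor, trading a bounded amount of precision for memory. A minimal sketch of symmetric per-tensor int8 quantization (real schemes like those in llama.cpp use per-block scales and more elaborate formats):

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: map floats onto [-127, 127]
    # using a single scale factor derived from the largest magnitude.
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int8 representation.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.03, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Per-weight reconstruction error is bounded by scale / 2 (rounding error).
```

Storing int8 instead of float32 cuts weight memory roughly 4x, which is why quantized variants of large models fit on consumer hardware; the open question the comment raises is how much task quality that rounding error actually costs.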
You can also use the OpenAI API. They don't use the data for training. They store it for 30 days for abuse monitoring, then delete it. It doesn't seem hugely different from using something like Slack/Google Docs/AWS.
I think some people imagine their data will end up in the knowledge base of GPT-5 if they use OpenAI products, but this would be a clear breach of the TOS.
https://openai.com/policies/api-data-usage-policies
I wonder if one day they will sell a “self-hosted” version of GPT. We wouldn’t mind having a ChatGPT with its 2021 data set and no ability to use the internet if it meant it lives up to regulations.
But can you do that? Can you “download” a model and then just use it?
As far as the hardware goes I think we will be fine. My sector uses a lot of expensive hardware like mainframes for old legacy systems where we come together as organisations and buy the service from companies like IBM (or similar, typically there are 3-5 companies that take turns winning the 8-12 year contracts) who then operate the stuff inside our country. I’m sure we can do the same with LLMs.
How does this work exactly?
Pirate all LLMs. They're all yours anyway.
Outputs from models that they trained on stolen ebooks, unpaid Reddit data, data scraped from millions of websites without credit, etc. It's like stealing a bike and then getting mad that it got stolen again later, because it was clearly rightfully yours.
https://i.pinimg.com/originals/d7/72/22/d77222df469b50e3b4cd...
>use output from the Services to develop models that compete with OpenAI;
Well, I still can use ChatGPT labeling for many other purposes anyway.
It's one thing to Show HN / share; it's another thing to spam it with your ads.
The earlier post was a report summarizing LLM labeling benchmarking results. This post shares the open source library.
Neither is intended to be an ad. Our hope with sharing these is to demonstrate how LLMs can be used for data labeling, and to get feedback from the community.