krawfy commented on Rawdog: A Natural Language CLI   github.com/AbanteAI/rawdo... · Posted by u/granawkins
krawfy · 2 years ago
How is this different from other solutions like Open Interpreter?
krawfy commented on Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs   github.com/hegelai/prompt... · Posted by u/krawfy
mmaia · 2 years ago
I like that it's not limited to single prompts and also lets you pass chat messages. It would be great if `OpenAIChatExperiment` could also handle OpenAI's function calling.
krawfy · 2 years ago
Good catch! We're looking to add function calling support very soon, and we have an open issue for it on our GitHub. If you want to raise a PR to add it, we'll help you land it and get it merged.
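For reference, the raw OpenAI call that `OpenAIChatExperiment` would need to pass through looks roughly like this (a sketch against the 2023-era `openai` Python SDK, with a made-up `get_weather` schema for illustration, not our final API):

```python
import openai

# Illustrative function schema the model is allowed to call (example only).
functions = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call the function
)

# If the model chose to call the function, the name and JSON arguments come back here.
message = response["choices"][0]["message"]
if message.get("function_call"):
    print(message["function_call"]["name"], message["function_call"]["arguments"])
```

The experiment harness would mostly need to accept `functions` and `function_call` as extra arguments and record the `function_call` field of the response alongside the usual completion text.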
krawfy commented on Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs   github.com/hegelai/prompt... · Posted by u/krawfy
neelm · 2 years ago
Something like this is going to be needed to evaluate models effectively. Evaluation should be integrated into automated pipelines/workflows that can scale across models and datasets.
krawfy · 2 years ago
Thanks Neel! We totally agree that automated evals will become an essential part of production LLM systems.
krawfy commented on Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs   github.com/hegelai/prompt... · Posted by u/krawfy
tikkun · 2 years ago
This looks great, thanks

See also this related tool: https://news.ycombinator.com/item?id=36907074

krawfy · 2 years ago
Awesome! Let us know if there's anything from that tool that you think we should add to PromptTools.
krawfy commented on Monitor and Optimize your large-scale model training   trainy.ai/blog/introducin... · Posted by u/roanakb
krawfy · 2 years ago
This is really cool! When we were trying to launch the GSPMD feature for PyTorch/XLA at Google, one of our biggest bottlenecks was network overhead, but we didn't really have any robust tools to dig into it and perform root-cause analysis. I'm loving the tools coming out of Trainy.
krawfy commented on Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs   github.com/hegelai/prompt... · Posted by u/krawfy
esafak · 2 years ago
I'd like to see support for qdrant.
krawfy · 2 years ago
We've actually been in contact with the Qdrant team about adding it to our roadmap! Andre (their CEO) has been asking for an integration. If you want to work on the PR, we'd be happy to work with you and get it merged.
krawfy commented on Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs   github.com/hegelai/prompt... · Posted by u/krawfy
fatso784 · 2 years ago
I like the support for vector DBs and Llama 2. I'm curious what influences shaped PromptTools and how it differs from other tools in this space. For context, we've also released an open-source prompt engineering IDE, ChainForge, which has many of the features here, such as querying multiple models at once, prompt templating, evaluating responses with Python/JS code and LLM scorers, plotting responses, etc. (https://github.com/ianarawjo/ChainForge and a playground at http://chainforge.ai).

One big problem we're seeing in this space is over-trust in LLM scorers as 'evaluators'. I've personally seen that minor tweaks to a scoring prompt can sometimes result in vastly different evaluation 'results.' Given recent debacles (https://news.ycombinator.com/item?id=36370685), I'm wondering how we can design LLMOps tools for evaluation that both support the use of LLMs as scorers and caution users about their results. Are you thinking about this question along similar lines, or have you seen usability testing that points to over-trust in 'auto-evaluators' as an emerging problem?

krawfy · 2 years ago
Great question, and ChainForge looks interesting!

We offer auto-evals as one tool in the toolbox. We also consider structured output validations, semantic similarity to an expected result, and manual feedback gathering. If anything, I've seen that people are more skeptical of LLM auto-eval because of the inherent circularity, rather than over-trusting it.
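For example, the semantic-similarity check boils down to something like this (a rough sketch using sentence-transformers and cosine similarity, not our exact implementation; the expected answer and outputs are made up):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical expected answer and two candidate model outputs to score.
expected = "The capital of France is Paris."
outputs = [
    "Paris is the capital of France.",
    "France's largest city is Marseille.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
expected_emb = model.encode(expected, convert_to_tensor=True)

for text in outputs:
    emb = model.encode(text, convert_to_tensor=True)
    score = util.cos_sim(expected_emb, emb).item()  # cosine similarity, higher is closer
    print(f"{score:.3f}  {text}")
```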

Do you have any suggestions for other evaluation methods we should add? We just got started in July and we're eager to incorporate feedback and keep building.

krawfy commented on Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs   github.com/hegelai/prompt... · Posted by u/krawfy
catlover76 · 2 years ago
Super cool. The need for tooling like this is something one realizes pretty quickly when starting to build apps that leverage LLMs.
krawfy · 2 years ago
Glad you think so, we agree! If you end up trying it out, we'd love to hear what you think, and what other features you'd like to see.
krawfy commented on Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs   github.com/hegelai/prompt... · Posted by u/krawfy
politelemon · 2 years ago
Similar tool I was about to look at: https://github.com/promptfoo/promptfoo

I've seen this in both tools, but I wasn't able to understand it: in the screenshot with feedback, I see thumbs-up and thumbs-down options. Where do those values go, and what's their purpose? Do they get preserved across runs? It's just not clicking in my head.

krawfy · 2 years ago
For now, we aggregate those ratings across the models / prompts / templates you're evaluating so you get an overall score for each. You can export to CSV, JSON, MongoDB, or Markdown files, and we're working on more persistence features so you can keep a history of which models / prompts / templates you scored highest and track your manual evaluations over time.
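Concretely, the aggregation amounts to something like this pandas sketch (illustrative column names and data, not our exact schema):

```python
import pandas as pd

# One row per manually rated response; "feedback" is 1 (thumbs up) or 0 (thumbs down).
df = pd.DataFrame(
    {
        "model": ["gpt-3.5-turbo", "gpt-3.5-turbo", "gpt-4", "gpt-4"],
        "prompt_template": ["v1", "v2", "v1", "v2"],
        "feedback": [1, 0, 1, 1],
    }
)

# Aggregate score per model / template combination.
scores = df.groupby(["model", "prompt_template"])["feedback"].mean()
print(scores)

# Persist the run so scores can be compared across future runs.
df.to_csv("feedback_run.csv", index=False)
df.to_json("feedback_run.json", orient="records")
```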

u/krawfy

Karma: 61 · Cake day: March 4, 2023