AaronFriel · 3 years ago
I hope this is only a slight tangent: since the authors discuss their model serving throughput, I'd like a gut-check on my understanding of the state of the art in model serving.

The success of ChatGPT and my current work has had me thinking a lot about the "product" applications of large language models. I work at Pulumi on www.pulumi.com/ai; it's a GPT-3.5 and GPT-4 interface using retrieval augmented generation to generate Pulumi programs, and user experience is top of mind for me.

(Fingers crossed this doesn't hug our site to death here for the reasons I'm about to explain.)

To be blunt: I have found it surprisingly difficult to find the right tools to host models without dramatically worsening the UX. In theory we should be able to fine-tune a model against our own SDKs and synthetically generated code to improve the model's output and to guard against hallucination when retrieval fails. In practice, self-hosted model serving APIs have really poor time-to-first-token or even completely lack streaming behavior. It's a non-starter to build a product on something where a user has to sit and watch a spinner for a minute or more. I've been looking at the vLLM project with great interest, but haven't found much else.
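To make the UX point concrete, here's a toy sketch (pure Python, with a simulated model standing in for any real serving stack) of why time-to-first-token, not total latency, is what the user actually feels:

```python
import time

def fake_model_tokens(n_tokens: int, per_token_s: float):
    """Simulate a model that yields one token every per_token_s seconds."""
    for i in range(n_tokens):
        time.sleep(per_token_s)
        yield f"tok{i}"

def batch_latency(n_tokens: int, per_token_s: float) -> float:
    """Non-streaming API: the user stares at a spinner for the whole completion."""
    start = time.monotonic()
    list(fake_model_tokens(n_tokens, per_token_s))
    return time.monotonic() - start

def time_to_first_token(n_tokens: int, per_token_s: float) -> float:
    """Streaming API: the user sees output as soon as the first token arrives."""
    start = time.monotonic()
    next(iter(fake_model_tokens(n_tokens, per_token_s)))
    return time.monotonic() - start

if __name__ == "__main__":
    # 100 tokens at 20ms/token: ~2s of total generation, but ~20ms to first token.
    print(f"batch:     {batch_latency(100, 0.02):.2f}s before anything renders")
    print(f"streaming: {time_to_first_token(100, 0.02):.3f}s to first token")
```

Same total generation time in both cases; the only difference is when the user starts reading.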

---

For folks in MLops, deploying models with streaming APIs:

1. Is it mostly accurate that none of the model serving tools created prior to ChatGPT are great for streaming, interactive use cases?

2. How are you currently serving these models as an API and what upcoming tools are you exploring?

For the authors: How does your inference optimization compare to vLLM, or other tools using techniques such as continuous batching and paged attention?
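(For readers unfamiliar with the term: continuous batching admits new requests into the batch the moment any running request finishes, instead of waiting for the whole batch to drain. A toy scheduler of my own, not any particular library's implementation:)

```python
from collections import deque

def continuous_batching(request_lengths, max_batch):
    """Toy continuous-batching scheduler.

    request_lengths: tokens each queued request still needs to generate.
    max_batch: number of requests decoded together per step.
    Returns the step at which each request finishes. A slot freed by a
    finished request is refilled on the very next step, so short requests
    aren't stuck behind long ones that happen to share their batch.
    """
    queue = deque(enumerate(request_lengths))
    running = []  # (request_id, tokens_remaining)
    finish = {}
    step = 0
    while queue or running:
        # Admit new requests into any free slots immediately.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # One decode step: every running request emits one token.
        step += 1
        still = []
        for rid, remaining in running:
            if remaining - 1 == 0:
                finish[rid] = step
            else:
                still.append((rid, remaining - 1))
        running = still
    return [finish[i] for i in range(len(request_lengths))]
```

With `continuous_batching([2, 8, 2], max_batch=2)`, the third request starts as soon as the first finishes and completes at step 4; under static batching it would wait out all 8 steps of the long request before even starting.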

hansonw · 3 years ago
This is the best comparison I've found that benchmarks the current OSS inference solutions: https://hamel.dev/notes/llm/inference/03_inference.html

IME the streaming API in text-generation-inference works fine in production. (Though some of the other solutions may be better). I've used it with Starcoder (15B) and the time-to-first-token / tokens per second all seem quite reasonable out of the box.

gardnr · 3 years ago
Two important takeaways on the base model:

* scored 18.9 on HumanEval (coding) where Llama2 7B scored 12.2

* was trained from the beginning with a 16k context using a modified RoPE, whereas many models are simply fine-tuned with RoPE to gain longer context windows after the base model has been trained at 4k.

Can anyone share ideas on how important the 2nd one is? Do LLMs benefit from large context windows using RoPE during pretraining?
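For anyone who hasn't looked at the mechanism: here's a minimal sketch of RoPE using complex rotations. The "modified RoPE" in the release presumably adjusts the frequency schedule; the `base` parameter below is the usual knob for that, but that's my assumption, not a claim about their specific change.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply rotary position embeddings to one head vector.

    Pairs of dimensions are treated as complex numbers and rotated by
    position-dependent angles theta_i = position * base**(-2i/d).
    Raising `base` slows the rotation of low-frequency pairs, which is
    one common way long-context variants adjust RoPE.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        z = complex(vec[i], vec[i + 1]) * complex(math.cos(theta), math.sin(theta))
        out.extend([z.real, z.imag])
    return out
```

The key property: the dot product between a rotated query at position m and a rotated key at position n depends only on the offset n - m, which is what lets attention see relative positions.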

sbierwagen · 3 years ago
phi-1 supposedly does 50.6 on HumanEval with 1.3B parameters. (Python only) https://arxiv.org/abs/2306.11644

Weights haven't been released, though.

euclaise · 3 years ago
phi-1 is a code-specific base model, with further finetuning on top of that. This is a general language base model, not really comparable.
imjonse · 3 years ago
no code or dataset either for phi-1.
swyx · 3 years ago
It's not so much about benefit as it is a design goal to have large context windows.

https://twitter.com/suchenzang/status/1699926157028897078?s=... notes some issues with directly comparing the 16k context number. The odd choice of tokenizer means it's effectively like a 10-12k model (ballpark, not calculated).

euclaise · 3 years ago
That tweet had it backwards: more tokens in the tokenizer means the 16k-token context window typically fits even longer passages than LLaMA's would at 16k.
craigacp · 3 years ago
There's a correction to that tweet: a larger vocab means fewer tokens for any given sequence (usually, assuming the extra vocab isn't just there to add other languages or character sets).
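A toy illustration of why (greedy longest-match is a crude stand-in for BPE, and the vocabularies are made up for the example):

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown characters fall back
    to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

small = set("abcdefghijklmnopqrstuvwxyz ")              # character-level vocab
large = small | {"the", "long", "context", " window"}   # plus merged pieces

text = "the long context window"
# The larger vocabulary covers the same text in fewer tokens,
# so a fixed 16k-token context budget fits more raw text.
```

With the small vocab the sentence costs one token per character; with the larger vocab it costs a handful, which is the whole correction in miniature.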
coder543 · 3 years ago
> scored 18.9 on HumanEval (coding) where Llama2 7B scored 12.2

The article claims 18.9 for the base model, but also claims 20.7 for the fine-tuned model.

thewataccount · 3 years ago
Awesome! I applaud everyone training new models and attempting different techniques!

I'm concerned about the current download's availability - it's two URLs to some object storage. I find that these go dark rather quickly for many different reasons (accidentally moving it, bandwidth limits, deleting it later, etc).

I'm curious if there's a reason it's not also hosted on huggingface? I'm not saying they're the best place, but redundancy is good, most models have entries there, they have a very good CDN, and it isn't as likely to go dark accidentally.

selfhoster11 · 3 years ago
If this model can be made to work as GGUF, TheBloke will probably have a set of quantizations in a day or two at most.
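For anyone unfamiliar with what those quantizations do: here's a toy block-wise symmetric 4-bit quantizer. The real GGUF formats (Q4_0, Q4_K, etc.) work block-wise like this but pack the nibbles and scales into specific byte layouts; this is only a conceptual sketch.

```python
def quantize_q4(weights):
    """Toy symmetric 4-bit quantization of one block of weights:
    store one float scale plus a 4-bit integer per weight."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q

def dequantize_q4(scale, q):
    """Reconstruct approximate weights from the scale and integers."""
    return [scale * v for v in q]

ws = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.05, 0.3]
scale, q = quantize_q4(ws)
approx = dequantize_q4(scale, q)
# Rounding error per weight is bounded by half the scale step,
# while storage drops from 32 bits to ~4 bits per weight.
```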
amks · 3 years ago
We're working on it!
123yawaworht456 · 3 years ago
I applaud you guys for not including any nauseating gibberish in this press release or seemingly anywhere else on your website. It's like a breath of fresh air compared to every other AI-related resource I've seen recently. Please, keep it up.
YetAnotherNick · 3 years ago
This is the least detailed foundation model release I have seen. The Llama paper offers a lot more detail, like ablations, loss curves, etc. Falcon has data preparation details. Google's model release papers, like T5's, are some of the best and include many ablations.
123yawaworht456 · 3 years ago
I mean the "I am become death, destroyer of worlds" bullshit about AI safety/ethics/etc. that is included in every press release from Google/Meta/OpenAI and even much smaller players.
solverist · 3 years ago
Why are ablations useful? Their release report seemed very informative to me without getting bogged down in jargon.
imjonse · 3 years ago
Congrats on the release! Two questions.

1) In the results table, Llama2 base is being compared to Persimmon base and finetuned, and only the latter performs better. Would a comparison to Llama2-chat be possible/fair?

2) The Llama-2 numbers for MMLU in that table seem different from those in the HF leaderboard and the Llama-2 webpage presentation. Is it the 1-shot variant that is different or are these measurements not 100% standard and reproducible?

ekelsen · 3 years ago
Llama2 chat performs worse and wasn't included for that reason.

The numbers are different because the measurement is different. The blog post explains that we sample from the models and expect answers rather than relying on perplexity measurements.
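The difference between the two measurement styles, sketched with stub functions standing in for the model (the scoring details here are my assumptions for illustration, not their harness):

```python
import re

def likelihood_eval(logprob_fn, question, choices):
    """Classic harness style: pick the choice the model assigns the
    highest log-likelihood. The model never has to *produce* an answer."""
    scores = [logprob_fn(question, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def generative_eval(sample_fn, question, choices):
    """Sampling style: actually generate text and parse out an answer
    letter. A model that can't format an answer scores zero here even
    if its likelihoods would have ranked the right choice first."""
    letters = "ABCD"[: len(choices)]
    prompt = question + "".join(f"\n{l}. {c}" for l, c in zip(letters, choices))
    match = re.search(r"\b([A-D])\b", sample_fn(prompt + "\nAnswer:"))
    return letters.index(match.group(1)) if match else None
```

The two can disagree substantially, which would explain benchmark numbers that don't line up with leaderboards using the likelihood convention.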

YetAnotherNick · 3 years ago
Could you share the results with the standard way of benchmarking (accuracy of the top selection)? While the approach you took is reasonable, it would be more informative to see how much better/worse it is with the standard benchmark.
elietoubi · 3 years ago
Really cool! Honestly I wish these releases would come with a demo (like on replicate or hugging face)
deckar01 · 3 years ago
The docker container fails installing flash-attn… but honestly a giant API container on top of a custom model generation framework loses all the benefits of Torch’s standard interfaces. It doesn’t really matter how optimized your model runtime is if it’s cemented into a synchronous monolith. The metric that should be optimized is time to first decoded token, because that is how speed is perceived by humans reading the output.
ekelsen · 3 years ago
Can you share details of the build failure on the github? We'll try to help.

The inference code is shared as a proof of concept, it is not meant to be a production ready deploy. Also worth noting that not all LLMs are used to produce text which is read by humans.

deckar01 · 3 years ago
https://github.com/persimmon-ai-labs/adept-inference/issues/...

It’s funny you say production, because all of the errors I ran into suggest the container is expecting your production architecture.

My advice is stream first then make synchronous convenience wrappers on top of that. Also, lean on community standards for PoC. I’m guessing your investors are interested in making this scale as cheaply as possible, but that is probably the least important feature for people evaluating your model’s quality locally.
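The "stream first, wrap later" shape, as a minimal sketch (with a stub model in place of any real inference backend):

```python
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    """The core API is a token stream (a stub model here)."""
    for token in ["Hello", ", ", "world", "!"]:
        yield token

def generate(prompt: str) -> str:
    """The blocking API is just a convenience wrapper over the
    stream - not the other way around."""
    return "".join(generate_stream(prompt))
```

Building in this direction means the interactive path is never an afterthought: the synchronous call degrades gracefully from the streaming one, whereas bolting streaming onto a synchronous monolith usually doesn't work.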

automatistist · 3 years ago
> The standard practice for achieving fast inference is to rewrite the entire model inference loop in C++, as in FasterTransformer, and call out to special fused kernels in CUDA. But this means that any changes to the model require painfully reimplementing every feature twice: once in Python / PyTorch in the training code and again in C++ in the inference codebase. We found this process too cumbersome and error prone to iterate quickly on the model.

I am an AI novice, but why can't they automate this with AI? I thought the whole point of these tools was to automate tasks that are error prone and require lots of attention to detail. Computers are great at that kind of stuff, so it's surprising they haven't applied AI techniques to automate parts of the AI pipeline, like converting code from Python to C++.

ironrabbit · 3 years ago
Automatic kernel fusion (compilation) is a very active field, and most major frameworks support some easy-to-use compilation (e.g. JAX's jit, or torch.compile, which IIRC uses OpenAI's Triton under the hood). Often you can still do better than the compiler by writing fused kernels yourself (either in CUDA C++ or in something like Triton, Python which compiles down to CUDA), but compilers are getting pretty good.

edit: not sure why op is getting downvotes, this is a very reasonable question imo; maybe the characterization of kernel compilation as "AI" vs. just "software"?
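What kernel fusion buys, in a toy CPU analogy (real fusion happens in compiled GPU kernels, not Python loops - this only illustrates the memory-traffic argument):

```python
def unfused(xs, a, b):
    """Three separate 'kernels': each pass reads and writes a full
    array, like launching scale, add, and relu as separate GPU kernels
    with intermediates round-tripped through memory."""
    t1 = [a * x for x in xs]
    t2 = [x + b for x in t1]
    return [max(0.0, x) for x in t2]

def fused(xs, a, b):
    """One fused 'kernel': each element is read once, pushed through
    the whole chain, and written once - no intermediate arrays."""
    return [max(0.0, a * x + b) for x in xs]
```

Both compute relu(a*x + b); the fused version does a third of the memory traffic, which is usually the bottleneck for elementwise ops on GPUs.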

loopist · 3 years ago
Both AI and compilers are just software, and right now the optimizers are written manually, which is kind of weird, because the whole point of LLMs is to generate sequences of tokens that minimize some scalar-valued loss function. In the case of compilers, the input is high-level code in Python expressing tensor operations, and the output is whatever executes on GPUs as fast as possible: a combination of kernels formally equivalent to the tensor operations expressed in Python (or whatever higher-level language is used to write the tensor specifications for the task at hand). Everything in this loop has a well-defined input, a well-defined output, an associated scalar-valued metric (execution time), and even a normalization factor (output length, with shorter sequences being "better").

The whole thing seems obviously amenable to gradient-based optimization and data augmentation with synthetic code generators. It's surprising that no one is pursuing such approaches to improving the kernel compilation/fusion/optimization pipeline, because it's just another symbol game, with much better-defined metrics than natural language models have.

Bnjoroge · 3 years ago
thanks for explaining pretty concisely w/out being rude :)
automatistredo · 3 years ago
Can someone explain the down votes? What exactly is incorrect in OPs comment?
snissn · 3 years ago
i don't know why people downvote, but writing highly performant GPU code across multiple languages is still something only a few people with a lot of the right experience can do well, and while AI can help assist those people, it's not a problem that can be fully solved by an AI at this moment. maybe in a few years, with a large feedback loop of iterating, testing, benchmarking, repeating. i guess one day, but not now.
TrueDuality · 3 years ago
I would have to guess it has something to do with that task not actually being suitable for language models at their current stage. Even if they could be trusted to perform the task, it's actually not that much work to just... write code to handle keeping this kind of thing in sync. It's really, really not that much more work. You really don't even need to do it: both training and inference can be done within PyTorch, or both in C++.

If it was necessary for some reason... running a language model to keep something like this in sync over long-term training and iteration would likely be more expensive than a developer's time, AND it would block the researcher in a verification loop whose output still probably needs to be checked by the developer (they could be the same person, which will just deepen the frustration they experience).

The use of a lot of garbage accounts in this thread and lack of model details also looks pretty shady...