Readit News
jmorgan · 3 years ago
This appears to be a web frontend with authentication for Azure's OpenAI API, which is a great choice if you can't use ChatGPT or its API at work.

If you're looking to try the "open" models like Llama 2 (or its uncensored variant, Llama 2 Uncensored), check out https://github.com/jmorganca/ollama or some of the lower-level runners like llama.cpp (which powers the aforementioned project I'm working on) or Candle, the new project by Hugging Face.

What's folks' take on this vs Llama 2, which was recently released by Facebook Research? While I haven't tested it extensively, the 70B model is supposed to rival GPT-3.5 in most areas, and there are now some new fine-tuned versions that excel at specific tasks like coding (the 'codeup' model) or the new WizardMath (https://github.com/nlpxucan/WizardLM), which claims to outperform GPT-3.5 on grade-school math problems.

ttul · 3 years ago
Llama 2 might by some measures be close to GPT 3.5, but it’s nowhere near GPT 4, nor Anthropic Claude 2 or Cohere’s model. The closed source players have the best researchers - they are being paid millions a year with tons of upside - and it’s hard to keep pace with that. My sense is that the foundation model companies have an edge for now and will probably stay a few steps ahead of the open source realm simply for economic reasons.

Over the long run, open source will eventually overtake. Chances are this will happen once the researchers who are making magic happen get their liquidity and can start working for free again out in the open.

nl · 3 years ago
> The closed source players have the best researchers - they are being paid millions a year with tons of upside - and it’s hard to keep pace with that.

Llama2 came out of Meta's AI group. Meta pays researcher salaries competitive with any other group, and their NLP team is one of the top groups in the world.

For researchers it is increasingly the most attractive industrial lab because they release the research openly.

robertnishihara · 3 years ago
> Llama 2 might by some measures be close to GPT 3.5, but it’s nowhere near GPT 4

I think you're right about this, and benchmarks we've run at Anyscale support this conclusion [1].

The caveat there (which I think will be a big boon for open models) is that techniques like fine-tuning makes a HUGE difference and can bridge the quality gap between Llama-2 and GPT-4 for many (but not all) problems.

[1] https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...

xcdzvyn · 3 years ago
> The closed source players have the best researchers

Is that definitely why? GPT 3.5 and GPT 4 are far larger than 70B, right? So if a 70B, local model like LLaMA can even remotely rival them, would that not suggest that LLaMA is fundamentally a better model?

For example, would a LLaMA model with even half of GPT 4's parameters be projected to outperform it? Is that how it works?

[I'm not super familiar with LLM tech]
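For context on the parameter question: the Chinchilla scaling analysis (Hoffmann et al., 2022) models pretraining loss as a joint function of parameter count $N$ and training tokens $D$, roughly:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where $E$, $A$, $B$, $\alpha$, $\beta$ are fitted constants. The upshot is that a smaller model trained on many more tokens can match a larger, under-trained one; Llama 2's strength at 70B owes a lot to its reported ~2T-token training run rather than to a fundamentally better architecture.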

llm_thr · 3 years ago
You seriously underestimate just how much _not_ having to tune your llm for SF sensibilities benefits performance.

As an example from the last six months: people on Tor are producing better-than-state-of-the-art Stable Diffusion results because they want porn without limitations. I haven't had the time to look at LLMs, but the degenerates who enjoy that sort of thing have said they can get the Llama 2 model to role-play their dirty fantasies and then have Stable Diffusion illustrate said fantasies. It's a brave new world, and it's not on the WWW.

jgalt212 · 3 years ago
OK, fair enough. Please give me an example of a customer-facing chatbot built on Llama 2 that is unbearable to use, and one built on GPT-4 that is a joy to use. I think at the end of the day, you still have customers dreading such interactions.
chaosbolt · 3 years ago
>but it’s nowhere near GPT 4

It will be if OpenAI keeps dumbing down GPT-4. There's no proof they're doing it, but there's no way it's as good as it was at launch. Or maybe I just got used to it and now notice the mistakes more.

ReptileMan · 3 years ago
Linux started in the same position. Sometimes the underdogs win.
rightbyte · 3 years ago
I don't think paying more will give you better researchers. Maybe better "players".
pyrophane · 3 years ago
> While I haven't tested it extensively, the 70B model is supposed to rival GPT-3.5 in most areas, and there are now some new fine-tuned versions that excel at specific tasks

That has been my experience. Having experimented with both (informally), Llama 2 is similar to GPT-3.5 for a lot of general comprehension questions.

GPT-4 is still the best amongst the closed-source, cutting edge models in terms of general conversation/reasoning, although 2 things:

1. The guardrails that OpenAI has placed on ChatGPT are too aggressive! They clamped down on it quite hard to the extent that it gets in the way of a reasonable query far too often.

2. I've gotten pretty good results with smaller models trained on specific datasets. GPT-4 is still on top in terms of general purpose conversation, but for specific tasks, you don't necessarily need it. I'd also add that for a lot of use cases, context size matters more.

scarface_74 · 3 years ago
To your first point, I was trying to use ChatGPT to generate some examples of negative interactions with customer service, to show sentiment analysis in action for a project I was working on.

I had to do all types of workarounds for it to generate something useful without running into the guardrails.

pseudosavant · 3 years ago
I’ll second the context window too. I’ve been really impressed with Claude 2 because it can address such a larger context than I could feed into GPT4.
ramraj07 · 3 years ago
Could you give examples of smaller models trained on specific datasets?
CodeCompost · 3 years ago
Can it handle other languages besides English?
jmorgan · 3 years ago
RE 2 - neat! What are some tasks you've been using smaller models (with perhaps larger context sizes) for?
sytelus · 3 years ago
Llama 2 is still quite a bit behind GPT-3.5, and this mainly gets reflected in coding and math. It's easy to beat an NLP-based benchmark but much, much harder to beat NLP + math + coding together. I think this gap reflects a gap in reasoning, but we don't have a good non-coding/non-math benchmark to measure it.
samstave · 3 years ago
I just had a crazy FN (dystopian) idea...

Scene:

The world relies on AI in every aspect.

But there are countless 'models' the tech try to call them...

There was an attempt to silo each model and provide a governance model on how/what/why they were allowed to communicate....

But there was a flaw.

It was an AI only exploitable flaw.

AIs were not allowed to talk about specific constructs or topics, people, code, etc... that were outside their silo but what they COULD do - was talk about pattern recog...

So they ultimately developed an internal AI language on scoring any inputs as being the same user... And built a DB of their own weighted userbase - and upon that built their judgement system...

So if you typed in a pattern, spoke in a pattern, posted temporally on a pattern, etc - it didn't matter which silo you were housed in, or what topics you were referencing -- the AIs can find you.... god forbid they get a keylogger on your machine...

3abiton · 3 years ago
Our company is looking into a similar solution
ajhai · 3 years ago
A lot of companies are already using projects like chatbot-ui with Azure's OpenAI for similar local deployments. Given this is as close to local ChatGPT as any other project can get, this is a huge deal for all those enterprises looking to maintain control over their data.

Shameless plug: Given the sensitivity of the data involved, we believe most companies prefer locally installed solutions to cloud-based ones, at least in the early days. To this end, we just open-sourced LLMStack (https://github.com/TryPromptly/LLMStack), which we have been working on for a few months now. LLMStack is a platform to build LLM apps and chatbots by chaining multiple LLMs and connecting them to a user's data. A quick demo: https://www.youtube.com/watch?v=-JeSavSy7GI. Still early days for the project and there are a few kinks to iron out, but we are very excited about it.

gdiamos · 3 years ago
I find it interesting to see how competitive this space got so quickly.

How do these stacks differentiate?

scrum-treats · 3 years ago
Quality and depth of particular types of training data is one difference. Another difference is inference tracking mechanisms within and between single-turn interactions (e.g., what does the human user "mean" with their prompt, what is the "correct" response, and how best can I return the "correct" response for this context; how much information do I cache from the previous turns, and how much if any of it is relevant to this current turn interaction).
lmeyerov · 3 years ago
With Louie.ai, there is a lot of work on specialization for the job, and I expect the same for others. We help with data analysis, so connecting enterprise & common data sources & DBs, hooking up data tools (GPU visuals, integrated code interpreter, ...), security controls, and the like, which is different from say a ChatGPT for lawyers or a straight up ChatGPT UI clone.

Technically, as soon as the goal is to move beyond just text2gpt2screen, like multistep data wrangling & viz in the middle of a conversation, most tools struggle. Query quality also comes up, whether it's the quality of the RAG, the fine-tune, the prompts, etc: each solves different problems.

robertnishihara · 3 years ago
> we believe most companies prefer locally installed solutions to cloud based ones

We've also seen a strong desire from businesses to manage models and compute on their own machines or in their own cloud accounts. This is often part of a hybrid strategy of using API products like OpenAI for rapid prototyping.

The majority of (though not all) businesses we've seen tend to be quite comfortable using hosted API products for rapid prototyping and for proving out an initial version of their AI functionality. But in many cases, they want to complement that with the ability to manage models and compute themselves. The motivation here is often to reduce costs by using smaller / faster / cheaper fine-tuned open models.

When we started Anyscale, customer demand led us to run training & inference workloads in our customers' cloud accounts. That way your data and code stays inside of your own cloud account.

Now with all the progress in open models and the desire to rapidly prototype, we're complementing that with a fully-managed inference API where you can do inference with the Llama-2 models [1] (like the OpenAI API but for open models).

[1] https://app.endpoints.anyscale.com/

toomuchtodo · 3 years ago
Can you plug this together with tools like api2ai to create natural language defined workflow automations that interact with external APIs?
ajhai · 3 years ago
There is a generic HTTP API processor that can be used to call APIs as part of the app flow which should help invoke tools. Currently working on improving documentation so it is easy to get started with the project. We also have some features planned around function calling that should make it easy to natively integrate tools into the app flows.
cosbgn · 3 years ago
You can use unfetch.com to make API calls via LLMs and build automations. (I'm building it)
bhanu423 · 3 years ago
Interesting project. Was trying it out and found an issue in building the image; I've opened an issue on GitHub, please take a look. Also, do you have plans to support Llama in addition to the OpenAI models?
ajhai · 3 years ago
Thanks for the issue. Will take a look. In the meantime, you can try the registry image with `cp .env.prod .env && docker compose up`

> Also, do you have plans to support Llama in addition to the OpenAI models?

Yes, we plan to support llama etc. We currently have support for models from OpenAI, Azure, Google's Vertex AI, Stability and a few others.

extr · 3 years ago
One thing I still don't understand is what _is_ the ChatGPT front end exactly? I've used other "conversational" implementations built with the API and they never work quite as well, it's obvious that you run out of context after a few conversation turns. Is ChatGPT doing some embedding lookup inside the conversation thread to make the context feel infinite? I've noticed anecdotally it definitely isn't infinite, but it's pretty good at remembering details from much earlier. Are they using other 1st party tricks to help it as well?
shubb · 3 years ago
This is one of the things that make me uncomfortable about proprietary llm.

They get task performance by doing a lot more than just feeding a prompt straight to an llm, and then we performance compare them to raw local options.

The problem is, as this secret sauce changes, your use case performance is also going to vary in ways that are impossible for you to fix. What if it can do math this month and next month the hidden component that recognizes math problems and feeds them to a real calculator is removed? Now your use case is broken.

Feels like building on sand.
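The hypothetical "hidden calculator" routing described above is easy to sketch: detect a plain arithmetic prompt and answer it deterministically, falling back to the model otherwise. (Purely illustrative; no provider has confirmed doing this.)

```python
import ast
import operator
import re

# Safe evaluator for plain arithmetic (no names, no function calls).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}


def _eval(node):
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("not plain arithmetic")


def answer(prompt: str, llm=lambda p: "(LLM answer)") -> str:
    """Route arithmetic prompts to a real calculator; everything else to the model."""
    m = re.fullmatch(r"\s*([\d\s\.\+\-\*/\(\)]+)\s*=?\s*", prompt)
    if m:
        try:
            return str(_eval(ast.parse(m.group(1), mode="eval")))
        except (ValueError, SyntaxError):
            pass
    return llm(prompt)
```

If a provider silently added or removed a router like this, downstream "LLM math" performance would shift without any model change, which is exactly the building-on-sand worry.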

BoorishBears · 3 years ago
I'm not sure you realize how proprietary LLMs are being built on.

No one is doing secret math in the backends people are building on. The OpenAI API allows you to call functions now, but even that is just a formalized way of passing tokens into the "raw LLM".

All the features in the comment you replied to only apply to the web interface, and here you're being given an open interface you can introspect.

SOLAR_FIELDS · 3 years ago
They definitely do some proprietary running summarization to rebuild the context with each chat. Probably a RAG-like approach that has had a lot of attention and work.
extr · 3 years ago
This is effectively my question. I assume there is some magic going on. But how many engineering hours worth of magic, approximately? There is a lot of speculation around GPT-4 being MoE and whatnot. But very little speculation about the magic of the ChatGPT front end specifically that makes it feel so fluid.
MaxLeiter · 3 years ago
It uses a sliding context window. Older tokens are dropped as new ones stream in.
extr · 3 years ago
I don't believe that's the whole story. Other conversational implementations use sliding context windows, and it's very noticeable as context drops off. Whereas ChatGPT seems to retain the "gist" of the conversation much longer.
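For what it's worth, the "gist" behavior is consistent with a running-summary scheme: keep the last few turns verbatim and fold evicted turns into a summary that is re-sent with each request. A toy sketch (purely speculative; ChatGPT's actual front end is not public, and a real `summarize` would itself be an LLM call):

```python
def make_context(turns, max_recent=4, summarize=None):
    """Build the prompt context: a running summary of older turns plus
    the last `max_recent` turns verbatim."""
    # Stub summarizer: truncate and join. A real one would call an LLM.
    summarize = summarize or (lambda ts: " / ".join(t[:40] for t in ts))
    old, recent = turns[:-max_recent], turns[-max_recent:]
    context = []
    if old:
        context.append("Summary of earlier conversation: " + summarize(old))
    context.extend(recent)
    return context
```

This keeps the token count roughly constant while preserving a lossy memory of early turns, which would explain "remembers the gist, but not the details" behavior.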
simonbutt · 3 years ago
Logic for Azure ChatGPT's "infinite context" summarisation is in https://github.com/microsoft/azurechatgpt/blob/main/src/feat...

*Edit: Azure ChatGPT, that is. Would be amazed/disappointed if ChatGPT itself used langchain.

furyofantares · 3 years ago
That doesn't really look right to me, it looks like that's for responding regarding uploaded documents. I see nothing related to infinite context.

Also this is the azure repo from OP, nothing to do with the actual ChatGPT front-end that was asked about. I highly doubt the official ChatGPT front-end uses langchain, for example.

Xenoamorphous · 3 years ago
This is Azure's docs to create a conversation: https://learn.microsoft.com/en-us/azure/cognitive-services/o...
qwertox · 3 years ago
I don't see anything related to an infinite context in there. There's only a reference to a server-side `summary` variable which suggests that there is a summary of previous posts which will get sent along with the question for context, as is to be expected. Nothing suggests an infinite context.
robbomacrae · 3 years ago
This is potentially a huge deal. Companies are concerned using ChatGPT might violate data privacy policies if someone puts in user data or invalidate trade secrets protections if someone uploads sections of code. I suspect many companies have been waiting for an enterprise version.
tbrownaw · 3 years ago
This is a web UI that talks to a (separate) Azure OpenAI resource that you can deploy into your subscription as a SaaS instance.
hackernewds · 3 years ago
So how is it any different?

judge2020 · 3 years ago
I imagine most companies serious about this created their own wrappers around the API or contracted it out, likely using private Azure GPUs.
Normal_gaussian · 3 years ago
Most companies are either not tech companies, or do not have the knowledge to manage such a project within reasonable cost bounds.
TuringNYC · 3 years ago
Curious if anyone has done a side-by-side analysis of this offering vs just running LLaMA?

I'm currently running a side-by-side comparison/evaluation of MSFT GPT via Cognitive Services vs LLaMA [7B/13B/70B], and I'm intrigued by the possibility of a truly air-gapped offering not limited by external compute power (nor by metered fees racking up).

Any reads on comparisons would be nice to see.

(yes, I realize we'll eventually run into the same scaling issues w/r/t GPUs)

tikkun · 3 years ago
I did one. I took a few dozen prompts from my ChatGPT history and ran them through a few LLMs.

GPT-4, Bard and Claude 2 came out on top.

Llama 2 70b chat scored similarly to GPT-3.5, though GPT-3.5 still seemed to perform a bit better overall.

My personal takeaway is I’m going to continue using GPT-4 for everything where the cost and response time are workable.

Related: a belief I have is that LLM benchmarks are all too research-oriented. That made sense when LLMs were in the lab. It doesn't make sense now that LLMs have tens of millions of DAUs — i.e. ChatGPT. The biggest use cases for LLMs so far are chat assistants and programming assistants. We need benchmarks based on the way people use LLMs in chatbots and the types of questions real users actually ask of LLM products, not hypothetical benchmarks and random academic tests.
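One minimal shape for such a usage-based benchmark: replay real user prompts through each model and score the answers side by side. (A sketch; `judge` here is a stand-in for human grading or an LLM-as-judge, and the model callables are placeholders.)

```python
def benchmark(models, prompts, judge):
    """Score each model on real user prompts.

    models:  {name: callable(prompt) -> answer}
    judge:   callable(prompt, answer) -> score in [0, 1]
    Returns {name: mean score}.
    """
    scores = {}
    for name, model in models.items():
        results = [judge(p, model(p)) for p in prompts]
        scores[name] = sum(results) / len(results)
    return scores
```

The hard part, as the reply below notes, is not this loop but designing a `judge` that people agree on.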

Q6T46nT668w6i3m · 3 years ago
I don’t know what you mean by “too research oriented.” A common complaint in LLM research is the poor quality of evaluation metrics. There’s no consensus. Everyone wants new benchmarks but designing useful metrics is very much an open problem.
TillE · 3 years ago
I think tests like "can this LLM pass an English literature exam it's never seen before" are probably useful, but yeah there's a lot of silly stuff like math tests.

I suppose the question is where are they most commercially viable. I've found them fantastic for creative brainstorming, but that's sort of hard to test and maybe not a huge market.

register · 3 years ago
How did you measure the performance?
robertnishihara · 3 years ago
We (at Anyscale) have benchmarked GPT-4 versus the Llama-2 suite of models on a few problems: functional representation, SQL generation, grade-school math question answering.

GPT-4 wins by a lot out of the box. However, surprisingly, fine-tuning makes a huge difference and allows the 7B Llama-2 model to outperform GPT-4 on some (but not all) problems.

This is really great news for open models as many applications will benefit from smaller, faster, and cheaper fine-tuned models rather than a single large, slow, general-purpose model (Llama-2-7B is something like 2% of the size of GPT-4).

GPT-4 continues to outperform even the fine-tuned 70B model on grade-school math question answering, likely due to the data Llama-2 was trained on (more data for fine-tuning helps here).

https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...

FrenchDevRemote · 3 years ago
chatgpt is obviously a LOT better, llama doesn't even understand some prompts

and since LLMs aren't even that good to begin with, it's obvious you want the SOTA to do anything useful unless maybe you're finetuning

baobabKoodaa · 3 years ago
> and since LLMs aren't even that good to begin with, it's obvious you want the SOTA to do anything useful unless maybe you're finetuning

This is overkill. First of all, ChatGPT isn't even the SOTA, so if you "want SOTA to do anything useful", then this ChatGPT offering would be as useless as LLaMA according to you. Second, there are many individual tasks where even those subpar LLaMA models are useful - even without finetuning.

londons_explore · 3 years ago
openai offers finetuning too. And it's pretty cheap to do considering.

byteknight · 3 years ago
aa-morgan · 3 years ago
If anyone needs access to the code, you just need to append /forks to the web.archive link above and download from there, i.e. https://web.archive.org/web/20230814150922/https://github.co... (the cache ID updates when you change the URL)
omarhaneef · 3 years ago
Ugh. Any clue as to why?
giulio8 · 3 years ago
I suspect they want to redirect to https://github.com/microsoft/chat-copilot with FluentUI webapp and C# webapi... And the backend stores from qdrant to chroma ... Sequential Planner...
latebird22 · 3 years ago
Does anybody know a fork with the last commit (9116afe)?
borissk · 3 years ago
I can imagine how the conversation went with the enterprise customers: "Where does this send the data our employees enter?" "Same place as if they used the free ChatGPT chat bot..."
wodenokoto · 3 years ago
No it doesn't. It sends it to an LLM hosted inside the Enterprises own Azure Subscription.
Roark66 · 3 years ago
Private and secure? I thought the main issue with the privacy and security of (not at all)OpenAI models is that by using their products you agree to let them retain all the data you send to and receive from the models, forever, for whatever they choose to use it for. Or is this just a thing for free use?

If you pay, do you get Ts&Cs that don't contain any wording like this? Still, even if there were no specific "we own everything" statement, there could be a pretty standard statement like "we'll retain data as required for the delivery and improvement of the service", which is essentially the same thing.

So any company that allows its employees to use ChatGPT for work stuff (writing emails with company secrets etc.) is definitely not engaging in "secure and private" use.

Unless there is very clear data ownership, for example the customer owns the data going in and going out, I can't see how it can be any different. The problem (not at all)OpenAI has in delivering such a service is that, in contrast to open-source models, I'm told there is a lot of "secret sauce" around the model (not just the model itself). Specifically input/output processing, result scoring and so on.

pietz · 3 years ago
The Azure SLAs state that the chats are neither stored nor used for training in any way. They are private and protected in the same way all the other sensitive data on Azure is.

On top of that, you might argue that Microsoft and Azure are easier to trust than a still rather new AI startup.

robga · 3 years ago
I agree with your points. Having said that, Microsoft removed my Azure OpenAI GPT-4 access last week without warning. I was not breaking any ToS. Oh well, pointed back at OpenAI.
agentgumshoe · 3 years ago
So what do they train it on then?
kiratp · 3 years ago
> Starting on March 1, 2023, we are making two changes to our data usage and retention policies:

> OpenAI will not use data submitted by customers via our API to train or improve our models, unless you explicitly decide to share your data with us for this purpose. You can opt-in to share data.

> Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).

https://openai.com/policies/api-data-usage-policies

actionfromafar · 3 years ago
Unless required by law… I wonder what law.
politelemon · 3 years ago
The models like gpt themselves are inherently private and secure. They make predictions based on input.

It's what happens in the interface, that is your web chat or API call, which is different per implementation. ChatGPT is an implementation that uses that model and its maker OpenAI wants to keep your history for further training.

But what Azure is doing is taking that model and putting it behind an endpoint specific to your Azure account. Businesses have been interested in GPT and have been asking for private endpoints. Amazon is doing the same with Bedrock.

homero · 3 years ago
I'm pretty sure the point of this version is not to export data hence the name
vorticalbox · 3 years ago
This only applies to the API (not ChatGPT). Their privacy policy states they will keep your requests for 30 days and not use them for training. You can also apply for zero retention.

https://openai.com/policies/api-data-usage-policies

dalbasal · 3 years ago
Privacy and security... in practice, can mean different things.

In HN-space, it is at its most abstract, idealistic, etc. At the practical level this services is aimed at... it might mean compliance, or CYA. Less cynically, it might mean something mundane. MSFT's guarantee, a responsive place to report security issues.

paxys · 3 years ago
Would it be too much to mention somewhere in the README what this repo actually contains? Just docs? Deployment files? Some application (which does..something)? The model itself?
Xenoamorphous · 3 years ago
The repo contains the UI code, not the model or anything else around ChatGPT, it just uses Azure’s ChatGPT API which doesn’t share data with OpenAI.
paxys · 3 years ago
So basically – what you really need to do to run Azure ChatGPT is go and click some buttons in the Azure portal. This repo is a sample UI that you could possibly use to talk to that instance, but really you will probably always build your own or embed it directly into your products.

So calling the repo "azurechatgpt" is misleading. It should really be "sample-chatgpt-api-frontend" or something of that sort.

wodenokoto · 3 years ago
Isn’t there also some sort of backend stuff in there? How else would it keep track of history and accept documents?

I don’t know enough TypeScript to understand where the frontend stops and the backend begins in this code.

Deleted Comment