adilhafeez (u/adilhafeez)

adilhafeez commented on Show HN: Arch-Router – 1.5B model for LLM routing by preferences, not benchmarks · Posted by u/adilhafeez

pseudosavant · 6 months ago

Not that LLMs are terribly latency sensitive (you wait on a lot of tokens), but what kind of latency impact does this have on requests that go through the proxy?

adilhafeez · 6 months ago

Short answer: the latency impact is minimal.

We use Envoy as the request handler, which forwards requests to a local service written in Rust. Envoy has a proven record of high-performance, low-latency, efficient request handling. If I had to put a number on it, the overhead would be in the single-digit milliseconds per request. I will share a more detailed benchmark in the coming days.

adilhafeez commented on The rise of intelligent infrastructure for LLM applications archgw.com/blogs/the-rise... · Posted by u/sparacha

adilhafeez · 9 months ago

thanks for sharing - this is quite insightful

adilhafeez commented on Show HN: ArchGW – An open-source intelligent proxy server for prompts github.com/katanemo/archg... · Posted by u/sparacha

adilhafeez · 9 months ago

Hi - this is Adil, the co-founder who developed archgw. We are working tirelessly to create a framework that helps developers write agentic applications without having to write all the crufty boilerplate code. At the very minimum we provide observability and logging without adding much overhead. You can simply plug arch gateway into your existing LLM application and you'd start seeing details like time-to-first-token, total latency, token count and many other observability details. I recommend starting with our getting started page [1].

For slightly more advanced use cases, I recommend looking at the llm_routing demo [2] and the currency_exchange demo [3].

We currently provide a seamless interface to major providers like OpenAI, Mistral and DeepSeek, and also support hooking up to local providers like Ollama [4].
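To make the observability metrics mentioned above concrete, here is a minimal Python sketch of how time-to-first-token, total latency and token count could be derived from a streamed response. It is only an illustration and not archgw's implementation: the `StreamMetrics` type, the `summarize_stream` function and the `(arrival_time, token_count)` chunk format are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class StreamMetrics:
    time_to_first_token_ms: float
    total_latency_ms: float
    token_count: int


def summarize_stream(
    request_sent_at: float, chunks: Iterable[Tuple[float, int]]
) -> StreamMetrics:
    """Summarize a streamed response.

    request_sent_at: time the request was sent, in seconds.
    chunks: (arrival_time_in_seconds, tokens_in_chunk) pairs in arrival order.
    Both the chunk format and the function name are assumptions for this sketch.
    """
    first_token_at = None
    last_chunk_at = request_sent_at
    token_count = 0
    for arrived_at, tokens in chunks:
        # The first chunk that actually carries tokens defines time-to-first-token.
        if first_token_at is None and tokens > 0:
            first_token_at = arrived_at
        last_chunk_at = arrived_at
        token_count += tokens
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return StreamMetrics(
        time_to_first_token_ms=(first_token_at - request_sent_at) * 1000.0,
        total_latency_ms=(last_chunk_at - request_sent_at) * 1000.0,
        token_count=token_count,
    )


# Illustrative timestamps only, not measured data.
metrics = summarize_stream(10.0, [(10.25, 1), (10.5, 3), (11.0, 2)])
print(metrics)
```

A gateway sits in a convenient place to record these numbers because it sees both the outgoing request and every streamed chunk, so the application itself needs no extra instrumentation.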

[1] - https://github.com/katanemo/archgw?tab=readme-ov-file#quicks...

[2] - https://github.com/katanemo/archgw/tree/main/demos/use_cases...

[3] - https://github.com/katanemo/archgw/tree/main/demos/samples_p...

[4] - https://github.com/katanemo/archgw/tree/main/demos/use_cases...

adilhafeez commented on Show HN: archgw: open-source, intelligent proxy for AI agents, built on Envoy github.com/katanemo/archg... · Posted by u/adilhafeez

mudassaralam · a year ago

Why didn’t you build your own gateway from the ground up, especially when the Rust runtime in Envoy is not production ready yet? From envoyproxy:

… This extension is functional but has not had substantial production burn time, use only with this caveat.

This extension has an unknown security posture and should only be used in deployments where both the downstream and upstream are trusted.

adilhafeez · a year ago

Envoy has proven itself in the industry, and we didn't want to reinvent the wheel by redoing what Envoy already does for observability, rate limits, connection management, etc. The reason for using proxy-wasm was to avoid taking a hard dependency on any particular build of Envoy. There are many other benefits too, which are listed here [1].

Regarding support for the wasm runtime in Envoy: we believe wasm support in Envoy is not going anywhere and will continue to become more stable over time. Envoy has a healthy community, and in case of a security vulnerability we expect Envoy to ship a fix quickly, as it has done in the past. See here for details of the security patch rollout [2].

[1] https://github.com/proxy-wasm/spec/blob/main/docs/WebAssembl...

[2] https://github.com/envoyproxy/envoy/blob/main/SECURITY.md

adilhafeez commented on Show HN: archgw: open-source, intelligent proxy for AI agents, built on Envoy github.com/katanemo/archg... · Posted by u/adilhafeez

Nomi21 · a year ago

This is honestly quite a detailed and thoughtfully put together post. I do have some questions and would love to hear your thoughts on those. First off, can I use just the model itself? Do you have models hosted somewhere or they run locally? If they run locally what are the system requirements? Can I build RAG based applications on arch? And how do you do intent detection in multi-turn dialogue? How does parameter gathering work, is the model capable of conversing with the user to gather parameters?

adilhafeez · a year ago

Thanks for those comments :) Those are great questions. Let me respond to them one by one,

> Can I use just the model itself?

Yes - our models are on Hugging Face. You can use them directly.

> Do you have models hosted somewhere or they run locally? If they run locally what are the system requirements?

Arch gateway does a bunch of processing locally; for example, for intent detection and hallucination detection we use an NLI model. For function calling we use a hosted version of our 1.5B function calling model [1]. We use vLLM to host our model, but vLLM is not supported on Mac. There are other issues with running the model locally on Mac too; for example, Docker doesn't support giving containers GPU access on Mac. We tried using Ollama in the past to host the model, but Ollama doesn't support exposing logprobs. We do have an open issue on this [2] and we will improve it soon.

> Can I build RAG based applications on arch?

Yes you can. You would need to host the vector DB yourself; arch doesn't host one because we wanted to keep our infra simple and clean. We do have a default target that you can use to build a RAG application; see the insurance agent demo [3] for an example. We also have an open issue on building a full RAG demo [4]; +1 it to show your support.

> How does parameter gathering work, is the model capable of conversing with the user to gather parameters?

Our model is trained to engage in dialogue if a parameter is missing, because it has seen examples of missing parameters during training. During our evals and tests we found that the model could still hallucinate; e.g. for the question "how is the weather" the model could hallucinate the city as "LA" even though LA was not specified in the query. We handle hallucination detection in arch by using an NLI model to establish entailment of parameters from the input query. BTW, we are currently working on improving that part considerably. More on that in the next release.
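As a rough illustration of the entailment check described above, the Python sketch below flags extracted parameters that the user's query does not support. It is not the code arch runs: `find_unsupported_parameters`, the scorer signature and the threshold are hypothetical, and `toy_scorer` is a deliberately crude keyword stand-in for a real NLI model so that the example stays self-contained.

```python
import re
from typing import Callable, Dict, List

# A scorer returns a value in [0, 1]: how strongly the query supports
# the claim that parameter `name` has value `value`. In a real setup this
# would wrap an NLI model; the signature here is an assumption.
Scorer = Callable[[str, str, str], float]


def toy_scorer(query: str, name: str, value: str) -> float:
    """Stand-in for an NLI model: 1.0 if the value appears as a whole word."""
    pattern = r"\b" + re.escape(value.lower()) + r"\b"
    return 1.0 if re.search(pattern, query.lower()) else 0.0


def find_unsupported_parameters(
    query: str,
    params: Dict[str, str],
    score: Scorer,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from arch
) -> List[str]:
    """Return names of parameters whose values the query does not support."""
    return [
        name
        for name, value in params.items()
        if score(query, name, value) < threshold
    ]


# The weather example from the comment: "LA" was never mentioned.
print(find_unsupported_parameters("how is the weather", {"city": "LA"}, toy_scorer))
# Here the city is stated explicitly, so nothing is flagged.
print(find_unsupported_parameters("how is the weather in LA", {"city": "LA"}, toy_scorer))
```

Any parameter that gets flagged is a candidate for a follow-up question to the user instead of being passed to the function call.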

[1] https://huggingface.co/katanemo/Arch-Function-1.5B.gguf

[2] https://github.com/katanemo/archgw/issues/286

[3] https://github.com/katanemo/archgw/blob/main/demos/insurance...

[4] https://github.com/katanemo/archgw/issues/287

adilhafeez commented on Show HN: archgw: open-source, intelligent proxy for AI agents, built on Envoy github.com/katanemo/archg... · Posted by u/adilhafeez

mikram · a year ago

Congrats Adil! Interesting idea with a lot of potential.

Do you have to use envoyproxy to use archgw? Can archgw be used for LLM routing without using envoyproxy?

adilhafeez · a year ago

Thanks! My responses inline,

> do you have to use envoyproxy to use archgw

Yes, 100%. Our gateway is implemented as a Rust filter that runs inside the Envoy process.

> Can archgw be used for LLM routing without using envoyproxy?

Unfortunately no, since we are built on top of envoyproxy.

adilhafeez commented on Show HN: archgw: open-source, intelligent proxy for AI agents, built on Envoy github.com/katanemo/archg... · Posted by u/adilhafeez

fahimulhaq · a year ago

Hey Adil, Thanks for sharing and congratulations on launch.

Can I just use arch for routing between LLMs? And what LLMs do you support? And what about key management? Do I manage access keys myself?

adilhafeez · a year ago

Thanks! Those are all good questions. Let me respond to them one by one,

> Can I just use arch for routing between LLMs

Yes, you can use the arch_config.yaml file to select between LLMs. In fact we have an llm_routing demo [1] that you can try. Here is how you can specify different LLMs in our config:

  llm_providers:
    - name: gpt-4o-mini
      access_key: $OPENAI_API_KEY
      provider: openai
      model: gpt-4o-mini
      default: true

    - name: gpt-3.5-turbo-0125
      access_key: $OPENAI_API_KEY
      provider: openai
      model: gpt-3.5-turbo-0125

    - name: gpt-4o
      access_key: $OPENAI_API_KEY
      provider: openai
      model: gpt-4o

    - name: ministral-3b
      access_key: $MISTRAL_API_KEY
      provider: mistral
      model: ministral-3b-latest
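To illustrate how a provider list like this can drive routing, here is a minimal Python sketch that picks an entry by name and falls back to the one marked `default: true`. The `select_provider` function is hypothetical, the config is mirrored as plain dicts to avoid a YAML dependency, and falling back to the default for an unknown name is an assumption of this sketch rather than documented archgw behavior.

```python
from typing import Dict, List, Optional

# The llm_providers entries from the config above, mirrored as plain dicts.
LLM_PROVIDERS: List[Dict[str, object]] = [
    {"name": "gpt-4o-mini", "access_key": "$OPENAI_API_KEY",
     "provider": "openai", "model": "gpt-4o-mini", "default": True},
    {"name": "gpt-3.5-turbo-0125", "access_key": "$OPENAI_API_KEY",
     "provider": "openai", "model": "gpt-3.5-turbo-0125"},
    {"name": "gpt-4o", "access_key": "$OPENAI_API_KEY",
     "provider": "openai", "model": "gpt-4o"},
    {"name": "ministral-3b", "access_key": "$MISTRAL_API_KEY",
     "provider": "mistral", "model": "ministral-3b-latest"},
]


def select_provider(
    providers: List[Dict[str, object]], requested: Optional[str] = None
) -> Dict[str, object]:
    """Pick the provider named `requested`, else the entry marked default.

    Falling back to the default when the name is unknown is an assumption
    made for this sketch.
    """
    if requested is not None:
        for provider in providers:
            if provider["name"] == requested:
                return provider
    for provider in providers:
        if provider.get("default"):
            return provider
    raise LookupError("no matching provider and no default configured")


print(select_provider(LLM_PROVIDERS, "ministral-3b")["model"])
print(select_provider(LLM_PROVIDERS)["name"])
```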

> And what LLMs do you support

We currently support Mistral and OpenAI, and for both of them we support the streaming interface. We expose an OpenAI-compliant v1/chat interface, so any chat UI that works with OpenAI should work with us as well. We also ship demos with a Gradio sample application.

> And what about key management? Do I manage access keys myself?

None of your clients need to manage access keys. Upon receipt of a request, our filter selects the appropriate LLM from arch_config, picks the relevant access_key, and adds it to the request before sending the request to the upstream LLM [2].
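The key handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the gateway's Rust filter: `resolve_access_key` and `attach_access_key` are hypothetical names, the `$NAME` expansion is inferred from the config shown earlier, and the sketch assumes the upstream expects a bearer token in the `Authorization` header.

```python
from typing import Dict, Mapping


def resolve_access_key(raw: str, env: Mapping[str, str]) -> str:
    """Expand a `$NAME` reference from the environment; pass literals through."""
    if raw.startswith("$"):
        name = raw[1:]
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]
    return raw


def attach_access_key(
    headers: Mapping[str, str],
    provider: Mapping[str, str],
    env: Mapping[str, str],
) -> Dict[str, str]:
    """Return a copy of the client's headers with the provider key added."""
    upstream_headers = dict(headers)
    key = resolve_access_key(provider["access_key"], env)
    # Assumption: the upstream accepts a bearer token in Authorization.
    upstream_headers["Authorization"] = f"Bearer {key}"
    return upstream_headers


# The client sends no key at all; the gateway side fills it in.
provider = {"name": "gpt-4o-mini", "access_key": "$OPENAI_API_KEY"}
fake_env = {"OPENAI_API_KEY": "sk-example"}  # placeholder value for the demo
client_headers = {"Content-Type": "application/json"}
print(attach_access_key(client_headers, provider, fake_env))
```

Passing the environment in as a mapping (for example `os.environ` in real use) keeps the function easy to test and keeps secrets out of client code.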

[1] https://github.com/katanemo/archgw/tree/main/demos/llm_routi...

[2] https://github.com/katanemo/archgw/blob/main/crates/llm_gate...