import torch

def predict_proba(model_config, model_path, X):  # wrapper name is illustrative
    model = get_model(*model_config)
    # strip the '_orig_mod.' prefix that torch.compile adds to checkpoint keys
    state_dict = torch.load(model_path, weights_only=True)
    new_state_dict = {k.replace('_orig_mod.', ''): v for k, v in state_dict.items()}
    model.load_state_dict(new_state_dict)
    model.eval()
    with torch.no_grad():
        output = model(torch.FloatTensor(X))
        # softmax over the class (last) dimension
        probabilities = torch.softmax(output, dim=-1)
    return probabilities.numpy()
Loading from disk to VRAM can be super slow, so doing it every time you spin up a new process is wasteful. Instead, if you have a daemon process that keeps multiple models' weights in pinned RAM, you can load them much more quickly (~1.5 seconds for an 8B model, as we show in the demo).
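For concreteness, here is a rough sketch of that pinned-RAM pattern in plain PyTorch (the cache and the `cache_weights` / `attach_weights` helpers are hypothetical names for illustration, not Outerport's actual API): pay the disk read once, keep the tensors in page-locked host memory, and serve later loads with fast host-to-device copies instead of hitting disk again.

import torch

# Hypothetical sketch, not Outerport's API: checkpoints read once from disk
# and held in pinned (page-locked) CPU RAM, keyed by model name.
_pinned_cache = {}

def cache_weights(name, model_path):
    state_dict = torch.load(model_path, map_location='cpu', weights_only=True)
    # pin_memory() gives DMA-friendly pages, enabling fast async host-to-device copies
    _pinned_cache[name] = {k: v.pin_memory() for k, v in state_dict.items()}

def attach_weights(name, model):
    # Copy pinned host tensors into VRAM; non_blocking overlaps the transfers
    gpu_state = {k: v.to('cuda', non_blocking=True) for k, v in _pinned_cache[name].items()}
    torch.cuda.synchronize()
    model.to('cuda')
    model.load_state_dict(gpu_state)
    return model

In this sketch the slow disk read happens only in cache_weights; every subsequent attach_weights call is bounded by PCIe bandwidth rather than disk throughput, which is where the bulk of the speedup comes from.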
You _could_ also make a single mega router process, but then there are issues like all services needing to agree on dependency versions. This has been a problem for me in the past (e.g., LAVIS requiring a certain transformers version that was not compatible with some other diffusion libraries).
(Please feel free to reach out to us too at towaki@outerport.com !)
We hope to make it easier to bridge the multi-cloud landscape by being independent and 'outer'.
Our inference stack is built using candle in Rust; how hard would it be to integrate?