llm-inference
Here are 486 public repositories matching this topic...
A programming framework for agentic AI. Discord: https://aka.ms/autogen-dc. Roadmap: https://aka.ms/autogen-roadmap
Updated Aug 2, 2024 - Jupyter Notebook
Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
Updated Aug 2, 2024 - Python
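An OpenAI-compatible endpoint means standard chat-completions clients work against the self-hosted model unchanged. A minimal sketch of the request body such an endpoint expects (the server URL and model name here are illustrative assumptions, not values from any specific project):

```python
import json

# Hypothetical local server address; any OpenAI-compatible endpoint
# accepts the same /v1/chat/completions payload shape.
BASE_URL = "http://localhost:3000/v1/chat/completions"  # assumed address

def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> str:
    """Build the JSON body for an OpenAI-style chat-completions call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return json.dumps(payload)

body = build_chat_request("llama-3.1-8b", "Say hello in one word.")
# `body` could then be POSTed to BASE_URL with any HTTP client.
```

Because the payload shape is the same one the hosted OpenAI API uses, swapping between a cloud provider and a self-hosted model is usually just a base-URL change.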
Official inference library for Mistral models
Updated Jul 24, 2024 - Jupyter Notebook
20+ high-performance LLMs with recipes to pretrain, finetune, and deploy at scale.
Updated Aug 2, 2024 - Python
This project shares the technical principles behind large language models along with hands-on practical experience.
Updated Aug 1, 2024 - HTML
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Updated Jul 15, 2024 - C++
The easiest way to serve AI/ML models in production: build model inference services, LLM APIs, multi-model inference graphs/pipelines, LLM/RAG apps, and more.
Updated Aug 2, 2024 - Python
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
Updated Aug 2, 2024 - C++
Superduper: Bring AI to your database! Integrate AI models and workflows with your database to implement custom AI applications without moving your data, including streaming inference, scalable model hosting, training, and vector search.
Updated Aug 2, 2024 - Python
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Updated Aug 2, 2024 - Python
Sparsity-aware deep learning inference runtime for CPUs
Updated Jul 19, 2024 - Python
Code examples and resources for DBRX, a large language model developed by Databricks
Updated May 1, 2024 - Python
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Updated Aug 1, 2024
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
Updated Aug 1, 2024 - Python
Medusa: a simple framework for accelerating LLM generation with multiple decoding heads.
Updated Jun 25, 2024 - Jupyter Notebook
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Updated Aug 2, 2024 - Python
Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.
Updated Mar 22, 2024 - Jupyter Notebook
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Updated Jul 31, 2024 - Jupyter Notebook
AICI: Prompts as (Wasm) Programs
Updated Jul 30, 2024 - Rust