llm-inference
Here are 486 public repositories matching this topic...
A programming framework for agentic AI. Discord: https://aka.ms/autogen-dc. Roadmap: https://aka.ms/autogen-roadmap
Updated Aug 2, 2024 - Jupyter Notebook
Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
Updated Aug 2, 2024 - Python
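An OpenAI-compatible endpoint means standard chat-completions clients work against the self-hosted model unchanged. A minimal sketch of the request body such an endpoint expects (the server URL and model name here are illustrative assumptions, not values from any specific project):

```python
import json

# Hypothetical local server address; any OpenAI-compatible endpoint
# accepts the same /v1/chat/completions payload shape.
BASE_URL = "http://localhost:3000/v1/chat/completions"  # assumed address

def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> str:
    """Build the JSON body for an OpenAI-style chat-completions call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return json.dumps(payload)

body = build_chat_request("llama-3.1-8b", "Say hello in one word.")
# `body` could then be POSTed to BASE_URL with any HTTP client.
```

Because the payload shape is the same one the hosted OpenAI API uses, swapping between a cloud provider and a self-hosted model is usually just a base-URL change.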
Official inference library for Mistral models
Updated Jul 24, 2024 - Jupyter Notebook
20+ high-performance LLMs with recipes to pretrain, finetune, and deploy at scale.
Updated Aug 2, 2024 - Python
This project shares the technical principles behind large language models along with hands-on practical experience.
Updated Aug 1, 2024 - HTML
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Updated Jul 15, 2024 - C++
The easiest way to serve AI/ML models in production: build model inference services, LLM APIs, multi-model inference graphs/pipelines, LLM/RAG apps, and more.
Updated Aug 2, 2024 - Python
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
Updated Aug 2, 2024 - C++
Superduper: Bring AI to your database! Integrate AI models and workflows with your database to implement custom AI applications without moving your data, including streaming inference, scalable model hosting, training, and vector search.
Updated Aug 2, 2024 - Python
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Updated Aug 2, 2024 - Python
Sparsity-aware deep learning inference runtime for CPUs
Updated Jul 19, 2024 - Python
Code examples and resources for DBRX, a large language model developed by Databricks
Updated May 1, 2024 - Python
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Updated Aug 1, 2024
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
Updated Aug 1, 2024 - Python
Medusa: a simple framework for accelerating LLM generation with multiple decoding heads.
Updated Jun 25, 2024 - Jupyter Notebook
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Updated Aug 2, 2024 - Python
Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.
Updated Mar 22, 2024 - Jupyter Notebook
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Updated Jul 31, 2024 - Jupyter Notebook
AICI: Prompts as (Wasm) Programs
Updated Jul 30, 2024 - Rust