
Build multimodal AI applications with cloud-native technologies


Jina lets you build multimodal AI services and pipelines that communicate via gRPC, HTTP and WebSockets, then scale them up and deploy to production. You can focus on your logic and algorithms, without worrying about the infrastructure complexity.

Jina provides a smooth Pythonic experience for serving ML models, from local deployment all the way to advanced orchestration frameworks like Docker Compose, Kubernetes, or Jina AI Cloud. Jina makes advanced solution engineering and cloud-native technologies accessible to every developer.

Wait, how is Jina different from FastAPI? Jina's value proposition may seem quite similar to that of FastAPI. However, there are several fundamental differences:

Data structure and communication protocols

  • FastAPI communication relies on Pydantic, while Jina relies on DocArray, which allows Jina to expose its services over multiple protocols. gRPC support is especially useful for data-intensive applications such as embedding services, where embeddings and tensors can be serialized more efficiently (see the short sketch below).
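
To make this concrete, here is a minimal, illustrative DocArray schema with a tensor field. The EmbeddingDoc class and its 512-dimensional embedding are assumptions made for this sketch, not part of the README's own examples:

import numpy as np
from docarray import BaseDoc
from docarray.typing import NdArray


class EmbeddingDoc(BaseDoc):
    text: str
    embedding: NdArray[512]  # tensor field; DocArray serializes it efficiently for gRPC transport


doc = EmbeddingDoc(text='hello', embedding=np.random.rand(512))
proto = doc.to_protobuf()  # protobuf representation used as the gRPC payload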

Advanced orchestration and scaling capabilities

  • Jina allows you to easily containerize and orchestrate your services and models, providing concurrency and scalability.
  • Jina lets you deploy applications formed from multiple microservices that can be containerized and scaled independently.

Journey to the cloud

  • Jina provides a smooth transition from local development (using DocArray), to local serving (using Deployment and Flow), to production-ready services that use Kubernetes to orchestrate the lifetime of containers.
  • By using Jina AI Cloud you get access to scalable and serverless deployments of your applications with one command.

Install

pip install jina

Find more install options on Apple Silicon/Windows.

Get Started

Basic Concepts

Jina has three fundamental layers:

  • Data layer: BaseDoc and DocList (from DocArray) are the input/output formats in Jina (see the short sketch after this list).
  • Serving layer: An Executor is a Python class that transforms and processes Documents. By simply wrapping your models into an Executor, you allow them to be served and scaled by Jina. The Gateway is the service that connects all Executors inside a Flow.
  • Orchestration layer: A Deployment serves a single Executor, while a Flow serves Executors chained into a pipeline.

The full glossary is explained here.
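
As a quick, illustrative sketch of the data layer on its own (the Prompt schema below is the same one used later in this README):

from docarray import BaseDoc, DocList


class Prompt(BaseDoc):
    text: str


docs = DocList[Prompt]([Prompt(text='first prompt'), Prompt(text='second prompt')])
print(docs.text)  # column-wise access across the batch: ['first prompt', 'second prompt']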

Serve AI models

Let's build a fast, reliable and scalable gRPC-based AI service. In Jina we call this an Executor. Our simple Executor will wrap the StableLM LLM from Stability AI. We'll then use a Deployment to serve it.

Note: A Deployment serves just one Executor. To combine multiple Executors into a pipeline and serve that, use a Flow.

Let's implement the service's logic:

executor.py

from jina import Executor, requests
from docarray import DocList, BaseDoc

from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            # the pipeline returns a list of candidates per prompt; keep the first generated text
            generations.append(Generation(prompt=prompt, text=output[0]['generated_text']))
        return generations

Then we deploy it with either the Python API or YAML:

Python API: deployment.py

from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()

YAML: deployment.yml

jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345

And run the YAML Deployment with the CLI: jina deployment --uses deployment.yml

Use Jina Client to make requests to the service:

from jina import Client
from docarray import DocList, BaseDoc


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


prompt = Prompt(
    text='suggest an interesting image generation prompt for a mona lisa variant'
)

client = Client(port=12345)  # use port from output above
response = client.post(on='/', inputs=[prompt], return_type=DocList[Generation])

print(response[0].text)
a steampunk version of the Mona Lisa, incorporating mechanical gears, brass elements, and Victorian era clothing details

Note: In a notebook, you can't use deployment.block() and then make requests to the client. Please refer to the Colab link above for reproducible Jupyter Notebook code snippets.
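
The same service can also be exposed over HTTP or WebSockets instead of gRPC by switching the protocol. Here is a minimal sketch, reusing the StableLM Executor from above (the protocol and port values are just an example):

from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345, protocol='http')

with dep:
    dep.block()

The Client would then be created with the matching protocol, e.g. Client(port=12345, protocol='http').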

Build a pipeline

Sometimes you want to chain microservices together into a pipeline. That's where a Flow comes in.

A Flow is a DAG pipeline composed of a set of steps. It orchestrates a set of Executors and a Gateway to offer an end-to-end service.

Note: If you just want to serve a single Executor, you can use a Deployment.

For instance, let's combine our StableLM language model with a Stable Diffusion image generation model. Chaining these services together into a Flow will give us a service that generates images based on a prompt generated by the LLM.

text_to_image.py

import numpy as np
from jina import Executor, requests
from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc


class Generation(BaseDoc):
    prompt: str
    text: str


class TextToImage(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        from diffusers import StableDiffusionPipeline
        import torch

        self.pipe = StableDiffusionPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
        ).to("cuda")

    @requests
    def generate_image(self, docs: DocList[Generation], **kwargs) -> DocList[ImageDoc]:
        # the pipeline returns images in PIL format (https://pillow.readthedocs.io/en/stable/)
        images = self.pipe(docs.text).images
        result = DocList[ImageDoc]()
        for image in images:
            result.append(ImageDoc(tensor=np.array(image)))
        return result

Build the Flow with either Python or YAML:

Python API: flow.py

from jina import Flow
from executor import StableLM
from text_to_image import TextToImage

flow = (
    Flow(port=12345)
    .add(uses=StableLM, timeout_ready=-1)
    .add(uses=TextToImage, timeout_ready=-1)
)

with flow:
    flow.block()

YAML: flow.yml

jtype: Flow
with:
  port: 12345
executors:
  - uses: StableLM
    timeout_ready: -1
    py_modules:
      - executor.py
  - uses: TextToImage
    timeout_ready: -1
    py_modules:
      - text_to_image.py

Then run the YAML Flow with the CLI: jina flow --uses flow.yml

Then, use Jina Client to make requests to the Flow:

from jina import Client
from docarray import DocList, BaseDoc
from docarray.documents import ImageDoc


class Prompt(BaseDoc):
    text: str


prompt = Prompt(
    text='suggest an interesting image generation prompt for a mona lisa variant'
)

client = Client(port=12345)  # use port from output above
response = client.post(on='/', inputs=[prompt], return_type=DocList[ImageDoc])

response[0].display()

Easy scalability and concurrency

Why not just use standard Python to build that service and pipeline? Jina accelerates your application's time to market by making it more scalable and cloud-native. Jina also handles infrastructure complexity in production and other Day-2 operations, so that you can focus on the data application itself.

Increase your application's throughput with out-of-the-box scalability features like replicas, shards and dynamic batching.

Let's scale a Stable Diffusion Executor deployment with replicas and dynamic batching:

  • Create two replicas, with a GPU assigned to each.
  • Enable dynamic batching to process incoming parallel requests together with the same model inference.
Normal Deployment:

jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py

Scaled Deployment:

jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR
  replicas: 2
  uses_dynamic_batching: # configure dynamic batching
    /default:
      preferred_batch_size: 10
      timeout: 200

Assuming your machine has two GPUs, using the scaled deployment YAML will give better throughput compared to the normal deployment.
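
If you prefer to configure scaling from Python rather than YAML, a rough sketch might look like the following. It assumes the Deployment constructor accepts the same options shown in the YAML above (replicas, env and uses_dynamic_batching); check the Deployment documentation for the exact parameter names:

from jina import Deployment
from text_to_image import TextToImage

dep = Deployment(
    uses=TextToImage,
    timeout_ready=-1,
    env={'CUDA_VISIBLE_DEVICES': 'RR'},  # round-robin GPU assignment across replicas
    replicas=2,
    uses_dynamic_batching={'/default': {'preferred_batch_size': 10, 'timeout': 200}},
)

with dep:
    dep.block()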

These features apply to both Deployment YAML and Flow YAML. Thanks to the YAML syntax, you can inject deployment configurations regardless of Executor code.

Deploy to the cloud

Containerize your Executor

In order to deploy your services to the cloud, you need to containerize them. Jina provides the Executor Hub, a tool that streamlines this process and takes much of the hassle off your hands. It also lets you share these Executors publicly or privately.

You just need to structure your Executor in a folder:

TextToImage/
├── executor.py
├── config.yml
├── requirements.txt

config.yml:

jtype: TextToImage
py_modules:
  - executor.py
metas:
  name: TextToImage
  description: Text to Image generation Executor based on StableDiffusion
  url:
  keywords: []

requirements.txt:

diffusers
accelerate
transformers

Then push the Executor to the Hub: jina hub push TextToImage

This will give you a URL that you can use in your Deployment and Flow to use the pushed Executor containers.

jtype: Flow
with:
  port: 12345
executors:
  - uses: jinai+docker://<user-id>/StableLM
  - uses: jinai+docker://<user-id>/TextToImage
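
The Python API can reference the same pushed Hub images; a short sketch (replace <user-id> with your own Executor Hub user ID):

from jina import Flow

flow = (
    Flow(port=12345)
    .add(uses='jinai+docker://<user-id>/StableLM')
    .add(uses='jinai+docker://<user-id>/TextToImage')
)

with flow:
    flow.block()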

Get on the fast lane to cloud-native

Using Kubernetes with Jina is easy:

jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s

And so is Docker Compose:

jina export docker-compose flow.yml docker-compose.yml
docker-compose up

Note: You can also export Deployment YAML to Kubernetes and Docker Compose.

That's not all. We also support OpenTelemetry, Prometheus, and Jaeger.

What cloud-native technology is still challenging to you? Tell us and we'll handle the complexity and make it easy for you.

Deploy to JCloud

You can also deploy a Flow to JCloud, where you can easily enjoy autoscaling, monitoring and more with a single command.

First, turn the flow.yml file into a JCloud-compatible YAML by specifying resource requirements and using containerized Hub Executors.

Then, use the jina cloud deploy command to deploy to the cloud:

wget https://raw.githubusercontent.com/jina-ai/jina/master/.github/getting-started/jcloud-flow.yml
jina cloud deploy jcloud-flow.yml

Warning

Make sure to delete/clean up the Flow once you are done with this tutorial to save resources and credits.

Read more about deploying Flows to JCloud.

Streaming for LLMs

Large Language Models can power a wide range of applications, from chatbots to assistants and intelligent systems. However, these models can be heavy and slow, and your users want systems that are both intelligent and fast!

Large language models work by turning your questions into tokens and then generating new tokens one at a time until they decide that generation should stop. This means you want to stream the output tokens generated by a large language model to the client. In this tutorial, we will discuss how to achieve this with Streaming Endpoints in Jina.

Service Schemas

The first step is to define the streaming service schemas, as you would do in any other service framework. The input to the service is the prompt and the maximum number of tokens to generate, while the output is the token ID along with the text generated so far:

from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str

Service initialization

Our service depends on a large language model. As an example, we will use the gpt2 model. This is how you would load such a model in your Executor:

from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')

Implement the streaming endpoint

Our streaming endpoint accepts a PromptDocument as input and streams ModelOutputDocuments. To stream a document back to the client, use the yield keyword in the endpoint implementation. Therefore, we use the model to generate up to max_tokens tokens and yield them until the generation stops:

class TokenStreamingExecutor(Executor):
    ...

    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        input = tokenizer(doc.prompt, return_tensors='pt')
        input_len = input['input_ids'].shape[1]
        for _ in range(doc.max_tokens):
            output = self.model.generate(**input, max_new_tokens=1)
            if output[0][-1] == tokenizer.eos_token_id:
                break
            yield ModelOutputDocument(
                token_id=output[0][-1],
                generated_text=tokenizer.decode(
                    output[0][input_len:], skip_special_tokens=True
                ),
            )
            input = {
                'input_ids': output,
                'attention_mask': torch.ones(1, len(output[0])),
            }

Learn more about streaming endpoints from the Executor documentation.

Serve and send requests

The final step is to serve the Executor and send requests using the client. To serve the Executor using gRPC:

from jina import Deployment

with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()

To send requests from a client:

import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
The
The capital
The capital of
The capital of France
The capital of France is
The capital of France is Paris
The capital of France is Paris.

Support

Join Us

Jina is backed by Jina AI and licensed under Apache-2.0.