
Jina logo: Build multimodal AI services via cloud native technologies

Build and deploy multimodal AI services at scale

PyPI · PyPI downloads · GitHub CD status

Jina-serve is a framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets. Scale your services from local development to production while focusing on your core logic.

## Key Features

- Native support for all major ML frameworks and data types
- High-performance service design with scaling, streaming, and dynamic batching
- LLM serving with streaming output
- Built-in Docker integration and Executor Hub
- One-click deployment to Jina AI Cloud
- Enterprise-ready with Kubernetes and Docker Compose support
### Comparison with FastAPI

Key advantages over FastAPI:

- DocArray-based data handling with native gRPC support, which serializes tensors and embeddings more efficiently than JSON
- Built-in containerization and service orchestration
- Seamless scaling of microservices
- One-command cloud deployment

## Install

```bash
pip install jina
```

See guides for [Apple Silicon](https://docs.jina.ai/get-started/install/apple-silicon-m1-m2/) and [Windows](https://docs.jina.ai/get-started/install/windows/).

## Core Concepts

Three main layers:

- **Data**: BaseDoc and DocList (from [DocArray](https://github.com/docarray/docarray)) for input/output (see the sketch below)
- **Serving**: Executors process Documents, a Gateway connects services
- **Orchestration**: Deployments serve Executors, Flows chain them into pipelines
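To get a feel for the data layer, here is a minimal sketch (names are illustrative) of defining a Document schema and batching it with DocList:

```python
from docarray import BaseDoc, DocList


class Prompt(BaseDoc):
    text: str


# DocList batches Documents; field access fans out across the batch
prompts = DocList[Prompt]([Prompt(text='hello'), Prompt(text='world')])
print(prompts.text)  # ['hello', 'world']
```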
## Build AI Services

Let's create a gRPC-based AI service that wraps the [StableLM](https://huggingface.co/stabilityai/stablelm-base-alpha-3b) LLM. Save it as `executor.py`:

```python
from jina import Executor, requests
from docarray import DocList, BaseDoc
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            # each pipeline result is a list of dicts with a 'generated_text' key
            generations.append(
                Generation(prompt=prompt, text=output[0]['generated_text'])
            )
        return generations
```
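Since an Executor is a plain Python class, you can typically give it a quick local check before serving it. A minimal sketch (assumes the StableLM weights can be downloaded and fit in memory):

```python
from docarray import DocList
from executor import Prompt, StableLM

executor = StableLM()  # instantiate directly, no server involved
docs = executor.generate(DocList[Prompt]([Prompt(text='Say hello')]))
print(docs[0].text)
```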
Deploy it with the Python API (`deployment.py`) or YAML (`deployment.yml`):

```python
from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()
```

```yaml
jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345
```

Run the YAML version with the CLI: `jina deployment --uses deployment.yml`
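Deployments speak gRPC by default. Jina-serve also supports HTTP and WebSockets; a sketch of switching protocols with the same `protocol` argument used in the streaming example later in this README:

```python
from jina import Deployment
from executor import StableLM

# protocol can also be 'http' or 'websocket' instead of the default 'grpc'
dep = Deployment(uses=StableLM, protocol='http', port=12345, timeout_ready=-1)

with dep:
    dep.block()
```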
Use the [Jina Client](https://docs.jina.ai/concepts/client/) to send requests to the service:

```python
from jina import Client
from docarray import DocList
from executor import Prompt, Generation

prompt = Prompt(text='suggest an interesting image generation prompt')

client = Client(port=12345)  # use the port from the Deployment above
response = client.post('/', inputs=[prompt], return_type=DocList[Generation])

print(response[0].text)
```

> **Note**
> In a notebook you can't call `dep.block()` and then use the client from the same process; run the Deployment and the client in separate processes or scripts.
## Build Pipelines

A [Flow](https://docs.jina.ai/concepts/orchestration/flow/) chains Executors into a DAG pipeline behind a single Gateway. For example, combine the StableLM service above with a Stable Diffusion image-generation Executor (sketched below) so the pipeline turns an LLM-written prompt into an image.
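First, the image-generation side. A minimal `text_to_image.py` sketch wrapping Stable Diffusion (assumes the `diffusers` package and a CUDA GPU, and collects each generated image into its own `ImageDoc`):

```python
# text_to_image.py
import numpy as np
from jina import Executor, requests
from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc


class Generation(BaseDoc):
    prompt: str
    text: str


class TextToImage(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        from diffusers import StableDiffusionPipeline
        import torch

        self.pipe = StableDiffusionPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
        ).to("cuda")

    @requests
    def generate_image(self, docs: DocList[Generation], **kwargs) -> DocList[ImageDoc]:
        images = self.pipe(docs.text).images  # PIL images, one per prompt
        return DocList[ImageDoc]([ImageDoc(tensor=np.array(image)) for image in images])
```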
Then chain the two services into a Flow:

```python
from jina import Flow

from executor import StableLM
from text_to_image import TextToImage

flow = (
    Flow(port=12345)
    .add(uses=StableLM, timeout_ready=-1)
    .add(uses=TextToImage, timeout_ready=-1)
)

with flow:
    flow.block()
```
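Querying the Flow works just like querying a single Deployment; the client receives the output of the last Executor. A sketch of a client for this pipeline (assumes the Executors above):

```python
from jina import Client
from docarray import DocList
from docarray.documents import ImageDoc

from executor import Prompt

client = Client(port=12345)
response = client.post(
    '/',
    inputs=[Prompt(text='suggest an interesting image generation prompt')],
    return_type=DocList[ImageDoc],
)
response[0].display()  # show the generated image, e.g. in a notebook
```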
## Scaling and Deployment

### Local Scaling

Boost throughput with built-in features:

- Replicas for parallel processing
- Shards for data partitioning
- Dynamic batching for efficient model inference
For example, scale the Stable Diffusion deployment to two GPU replicas with dynamic batching:

```yaml
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR  # assign GPUs to replicas round-robin
  replicas: 2
  uses_dynamic_batching:  # batch parallel requests into one model inference
    /default:
      preferred_batch_size: 10
      timeout: 200
```
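As before, run the scaled YAML with the CLI:

```bash
jina deployment --uses deployment.yml
```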
With two GPUs available, the scaled deployment gives noticeably better throughput than the basic one. The same options work in both Deployment YAML and Flow YAML, so you can tune scaling without touching Executor code.

### Cloud Deployment

#### Containerize Services

1. Structure your Executor:

```
TextToImage/
├── executor.py
├── config.yml
├── requirements.txt
```
2. Add the configuration and dependencies:

```yaml
# config.yml
jtype: TextToImage
py_modules:
  - executor.py
metas:
  name: TextToImage
  description: Text to Image generation Executor
```

```
# requirements.txt
diffusers
accelerate
transformers
```
3. Push to Hub:

```bash
jina hub push TextToImage
```

This returns a URI you can reference in a Deployment or Flow to run the containerized Executor, for example:

```yaml
jtype: Flow
with:
  port: 12345
executors:
  - uses: jinaai+docker://<user-id>/StableLM
  - uses: jinaai+docker://<user-id>/TextToImage
```

#### Deploy to Kubernetes

```bash
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
```

#### Use Docker Compose

```bash
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
```

#### JCloud Deployment

Turn your `flow.yml` into a JCloud-compatible YAML (resource requirements plus containerized Hub Executors), then deploy with a single command:

```bash
jina cloud deploy jcloud-flow.yml
```

> **Note**
> Delete the Flow once you no longer need it to save resources and credits.

## LLM Streaming

Enable token-by-token streaming for responsive LLM applications.

1. Define the schemas:

```python
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
```
2. Initialize the service:

```python
from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
```

3. Implement the streaming endpoint with `yield`:

```python
class TokenStreamingExecutor(Executor):
    ...

    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        input = tokenizer(doc.prompt, return_tensors='pt')
        input_len = input['input_ids'].shape[1]
        for _ in range(doc.max_tokens):
            output = self.model.generate(**input, max_new_tokens=1)
            if output[0][-1] == tokenizer.eos_token_id:
                break
            yield ModelOutputDocument(
                token_id=output[0][-1],
                generated_text=tokenizer.decode(
                    output[0][input_len:], skip_special_tokens=True
                ),
            )
            input = {
                'input_ids': output,
                'attention_mask': torch.ones(1, len(output[0])),
            }
```

4. Serve and use. Serve the Executor over gRPC:

```python
from jina import Deployment

with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()
```

Then stream the results from an async client:

```python
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
```

```text
The
The capital
The capital of
The capital of France
The capital of France is
The capital of France is Paris
The capital of France is Paris.
```

## Support

- Join our [Discord community](https://discord.jina.ai) to chat with other users about ideas and questions.
Jina-serve is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE).