Deploy in low-latency
Summary
- What is low-latency mode?
- Deployment Creation
- Monitoring Low Latency
- Preloading data for low-latency
- Parallel Executions
What is low-latency mode?
Introduction
The Craft AI platform offers two pipeline deployment modes:
- Elastic: This is the default mode. It emphasizes simplicity of use.
- Low-latency: This mode is designed to achieve faster pipeline execution times.
A pipeline deployment always has a mode and an execution rule, such as "endpoint" or "periodic". More details about deployments can be found on this page.
Before delving into how the low-latency mode operates, let's establish some key points about the deployment modes.
Note
The low-latency deployment mode does not incur any additional cost, as it runs in the same environment.
Elastic Mode
This is the default mode for deployments. In this mode, executions are stateless.
This means that executions in Elastic mode are self-contained and independent from each other, since a unique temporary container is created for each execution. In this mode, executions use all the available computing resources automatically, and no resource is used when there is no execution in progress.
Advantages:
- Automatic resource management
- No memory side effects between executions
Disadvantage:
- Slower individual execution time
Technical information
In this context, a "container" refers to a pod in Kubernetes. This mode creates a pod with the required dependencies for each execution, executes the code, and then destroys the pod. This approach enables stateless execution but contributes to an increase in execution time.
Low-latency Mode
With this mode, an execution container called a pod is initialized ahead of time, ready for executions.
As a result, the execution time in this mode is faster than in elastic mode. It is designed for use cases that require fast response times or have long initialization times.
Note
The faster execution time in this mode comes from gains in how executions are started and stopped; it does not mean that computation itself is faster. The time taken to compute the same code during an execution is the same in both modes and depends on the computation resources of the environment.
All executions for a low-latency deployment share the same pod, where memory is shared between executions.
A low-latency deployment pod consumes computation resources from the environment even when no execution is in progress, and all executions for a deployment run on the same pod. You therefore need to manage the computation resources of the environment yourself.
Advantage:
- Faster execution time
Disadvantages:
- Manual resource management
- Memory side effects between executions
Technical details
When creating a low-latency deployment, the associated pod is created before any execution can start. It starts the process in which executions will run, with the dependencies required to run your code. By default, a pod handles only one execution at a time (see Parallel Executions below to run several concurrently), but Python global variables are shared between executions.
Info
Support for multiple pods per deployment is coming soon. Multiple executions per pod are already available through parallel executions, described in the last section of this page.
Summary
For real-time response, use low-latency mode. Otherwise, keep the default mode, elastic mode.
It is important to note that selecting low-latency mode results in a shared execution context between executions in the same deployment, and in a continuously active pod, which requires monitoring resource usage throughout the deployment's lifespan.
Note
The `run_pipeline()` function does not create a deployment, but its behavior is similar to that of the elastic deployment mode.
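For reference, a minimal sketch of such a direct run (the pipeline name is a placeholder and the `inputs` parameter name is assumed; check the SDK reference for the exact signature):

# Run a pipeline once without creating a deployment
result = sdk.run_pipeline(
    pipeline_name="my-pipeline-name",
    inputs={"number1": 3, "number2": 4},
)
print(result)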
Deployment Creation
In this section, we'll look at the steps involved in creating a low-latency deployment using the Craft AI SDK.
Note
If you have already initialized your SDK with your environment and are familiar with creating and using elastic deployments, you can skip this note. Otherwise, please refer to the relevant documentation here.
To create our first low-latency deployment, we will use a basic Python script that multiplies two numbers:
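Here is a minimal sketch of what `src/multipli.py` could look like, matching the function name, inputs, and output declared in the pipeline creation below (the function body itself is illustrative):

# src/multipli.py (illustrative sketch)
def entryPipelineMultipli(number1, number2):
    # Multiply the two input numbers and return the result
    # under the output name declared in the pipeline ("resultMulti")
    return {"resultMulti": number1 * number2}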
Warning
Remember to push this source code to Git so that the platform can access the Python script for execution.
The approach for creating the pipeline remains the same, regardless of the chosen deployment mode:
# Input and Output are provided by the craft_ai_sdk package
from craft_ai_sdk.io import Input, Output

# IO creation
pipeline_input1 = Input(
    name="number1",
    data_type="number",
)
pipeline_input2 = Input(
    name="number2",
    data_type="number",
)
pipeline_output1 = Output(
    name="resultMulti",
    data_type="number",
)

# Pipeline creation
sdk.create_pipeline(
    function_path="src/multipli.py",
    function_name="entryPipelineMultipli",
    pipeline_name="multi-number-pipeline",
    container_config={
        "local_folder": "my_pipeline_folder/",
    },
    inputs=[pipeline_input1, pipeline_input2],
    outputs=[pipeline_output1],
)
The `mode` parameter is set to `low_latency` when the deployment is created. However, it takes a few tens of seconds for the deployment to become active. You can use a loop to wait for it to be ready, as shown below.
import time

# Deployment creation
endpoint = sdk.create_deployment(
    execution_rule="endpoint",
    pipeline_name="multi-number-pipeline",
    deployment_name="multi-number-endpt",
    mode="low_latency",
)

# Waiting loop until the deployment is ready
status = None
while status != "success":
    status = sdk.get_deployment("multi-number-endpt")["status"]
    if status != "success":
        print("waiting for endpoint to be ready...", status)
        time.sleep(5)

deployment_info = sdk.get_deployment("multi-number-endpt")
print(deployment_info)
After deployment, it operates like an elastic deployment. It can be triggered through the SDK or other methods such as Postman, curl, or JavaScript requests:
const ENDPOINT_TOKEN = "your-endpoint-token";

// Endpoint inputs: commented inputs are optional, uncomment them to use them.
const body = JSON.stringify({
  "number1": 3,
  "number2": 4,
});

fetch("https://your-env-name.mlops-platform.craft.ai/endpoints/multi-number-endpt", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `EndpointToken ${ENDPOINT_TOKEN}`,
  },
  body,
})
  .then((response) => {
    if (!response.ok) {
      return response.text().then((text) => {
        throw new Error(text);
      });
    }
    return response.json();
  })
  .then((data) => {
    // Handle the successful response data here
    console.log(data);
  })
  .catch((error) => {
    // Handle errors here
    console.error(error.message);
  });
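The same call can also be made from Python. Here is a minimal sketch using the `requests` library, mirroring the JavaScript example above (the environment URL and token are placeholders):

import requests

ENDPOINT_TOKEN = "your-endpoint-token"
ENDPOINT_URL = "https://your-env-name.mlops-platform.craft.ai/endpoints/multi-number-endpt"

# Trigger the endpoint with the two pipeline inputs
response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"EndpointToken {ENDPOINT_TOKEN}"},
    json={"number1": 3, "number2": 4},
)
response.raise_for_status()
print(response.json())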
Tip
As with elastic deployments, low-latency deployments linked to the pipeline can be viewed on the pipeline page of the web interface, along with the relevant information and executions.
Monitoring Low Latency
As previously mentioned, deploying with low-latency introduces additional complexity. To effectively monitor deployment activity, the platform provides several types of information:
- Status: Information on the deployment lifecycle at a given point in time.
- Logs: Historical and detailed information on the deployment lifecycle.
- Version: Information on deployment updates.
Status
Low-latency deployments have additional specific statuses:
- `status`: Represents the actual availability of the deployment. If the deployment is enabled and this status is set to `up`, the deployment is ready to receive requests. It can also be `pending`, `failed`, or `standby`.
Warning
These statuses are different from the `is_enabled` parameter, which represents the deployment availability chosen by the user.
These statuses are available in the object returned by the `get_deployment()` function:
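For example, you can read the status of the deployment created earlier:

deployment = sdk.get_deployment("multi-number-endpt")

# Actual availability of the low-latency deployment
# ("up", "pending", "failed" or "standby")
print(deployment["status"])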
Note
The pod has a specific status in addition to that of the deployment.
Deployment logs
During its lifetime, a low-latency deployment generates logs that are not specific to any execution but are linked to the deployment itself. You can use the `get_deployment_logs()` function in the SDK to retrieve them.
from datetime import datetime, timedelta

sdk.get_deployment_logs(
    deployment_name="multi-number-endpt",
    from_datetime=datetime.now() - timedelta(hours=2),
    to_datetime=datetime.now(),
    type="deployment",
    limit=None,
)
Deployment update
A low-latency deployment can be updated to reload the associated pod. To do this, you can call the SDK's `update_deployment()` function:
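A minimal sketch, assuming `update_deployment()` takes the deployment name as a parameter (check the SDK reference for the exact signature):

# Update the deployment, which reloads the associated pod
sdk.update_deployment(deployment_name="multi-number-endpt")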
Preloading data for low-latency
Warning
Configuration on demand is an incubating feature.
Concept
When using low-latency mode, it is important to note that this implies continuity between executions. This is because the pod that encapsulates the executions remains active throughout the lifetime of the deployment.
As a result, memory persists between executions: each execution runs in a different thread of the same process. While this can be advantageous, it must be used with care. It allows data to be loaded into memory (RAM and VRAM) before an execution, by using Python global variables.
How to do that
The code specified in the `function_path` parameter when the pipeline was created is imported during the creation of the low-latency deployment. This makes it possible to load variables before the first execution.
Note
A global variable can also be defined only during the first execution, by creating it inside the function rather than at module level.
Once the data has been loaded into a global variable, it can be read in function executions.
Note that this does not require any changes to the creation of platform objects (pipeline, deployment, etc.) with the SDK.
Warning
If the pod is restarted (after a standby, for example), the loaded data is reset to its state at deployment creation.
Examples
Simple example
# Import libs
from craft_ai_sdk import CraftAiSdk
import os, time

# Code run at low-latency deployment creation
count = 0
loaded_data = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
print("Init count at :", count)

# Function that will be run at each execution
def my_pipeline_source_code():
    global loaded_data, count
    count += 1
    print(count, loaded_data)
Deployment logs and logs of the first two runs:
Deployment logs:
> Pod created successfully
> Importing pipeline module
> Init count at : 0
> pipeline module imported successfully
> Execution "my-pipeline-1" started successfully
> Execution "my-pipeline-2" started successfully
"my-pipeline-1" logs :
> 1 [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
"my-pipeline-2" logs :
> 2 [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
Note
You can access deployment logs using the SDK function `sdk.get_deployment_logs()`.
Example of a pipeline with LLM preloading
import time
from vllm import LLM, SamplingParams
from craft_ai_sdk import CraftAiSdk
import os
from io import StringIO

# Code run at low-latency deployment creation: load the model into memory (RAM/VRAM)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
    dtype="half",
    max_model_len=16384,
)

# Function run at each execution: reuses the preloaded model
def persistent_data_pipeline(message: str):
    global llm
    output = llm.generate([message], sampling_params)[0]
    return {
        "results": output.outputs[0].text,
    }
Parallel Executions
Introduction
Deployments in low-latency mode can optionally have multiple executions running concurrently on the same pod, instead of waiting for one execution to complete before starting another. This can improve the efficiency and speed of executions that would benefit from parallelization.
When to Use Parallel Executions
Parallel executions can be beneficial when:
- Several executions of the same deployment sometimes run at the same time.
- You want executions to complete faster (shorter response time).
- The CPU or GPU resources are not used at 100%.
- The code in your pipeline can benefit from parallelization, for example because it involves asynchronous operations like downloading large files, relies on a computation library that can use multiple threads, or can benefit from batching like LLM inference.
Technical Explanation
When parallel executions are enabled, the pipeline's Python function is called in the pod each time a new execution starts, even if a previous execution is still ongoing, up to the number of maximum parallel executions.
Depending on how the pipeline's function was defined in the code, for each execution the function is called in a new:
- Thread, if the function was defined starting with `def`.
- Asynchronous I/O coroutine (called with `await`), if the function was defined starting with `async def`.
We recommend that you define your pipeline's Python function with `async def` if you plan to use parallel executions, as this is compatible with most recent libraries. The choice mainly depends on the libraries used by your code, as some libraries may not be compatible with multiple threads, and conversely some libraries may only work with threads. If executions fail after parallel executions were enabled, try the other way to define your function.
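For illustration, here is a minimal sketch of a pipeline function defined with `async def` (the function and input names are hypothetical). I/O-bound work is awaited so that other executions can progress on the same pod in the meantime:

import asyncio

async def my_parallel_pipeline(duration: int):
    # Simulate I/O-bound work (e.g. a download or an external API call):
    # while this execution awaits, other executions can run concurrently.
    await asyncio.sleep(duration)
    return {"waited": duration}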
Warnings
- Log Separation: Logs from parallel executions might be mixed up: a call to `sdk.get_pipeline_execution_logs` for one execution may return logs from another execution that was running at the same time. This should not occur with regular Python code, but it can occur with logs from outside libraries that use concurrency, if logs are created (e.g. with `print`) without the [context variables](https://docs.python.org/3/library/contextvars.html) that are present when the pipeline's function is called.
- Shared Memory: Executions share the same memory space, which can lead to concurrency issues and conflicts.
- Thread Safety: Ensure that the code is thread-safe to avoid race conditions and other concurrency issues.
Deployment Creation Example
To enable parallel executions, use the `create_deployment` function with the `enable_parallel_executions` and `max_parallel_executions_per_pod` parameters:
sdk.create_deployment(
    execution_rule="endpoint",
    pipeline_name="my_pipeline_name",
    deployment_name="my_deployment_name",
    mode="low_latency",
    enable_parallel_executions=True,  # Default is False
    max_parallel_executions_per_pod=10,  # Default is 6
)
Note
If the number of concurrent executions exceeds the maximum allowed (`max_parallel_executions_per_pod`), additional executions will be queued until a slot becomes available.