Choose an execution rule

A deployment is a way to run a Machine Learning pipeline in a repeatable and automated way.

For each deployment, you can configure an execution rule:

  • by endpoint (web API): the pipeline is executed by a call to a web API. If needed, this API can also receive data as input and return the result of the pipeline as output. Access to the API can be shared securely with external users.
  • by periodic trigger (CRON): a CRON schedule can be configured to trigger the pipeline periodically.

Summary

  1. Deploy with execution rule: Endpoint
  2. Deploy with execution rule: Periodic
Function name: create_deployment
Method: create_deployment(pipeline_name, deployment_name, execution_rule, mode, schedule=None, inputs_mapping=None, outputs_mapping=None, description=None, enable_parallel_executions=None, max_parallel_executions_per_pod=None, ram_request=None, gpu_request=None, timeout_s=180)
Return type: Dict
Description: Deploys a pipeline by creating a deployment, which allows a user to trigger the pipeline execution.

Deploy with execution rule: Endpoint

Definition function

To create an auto-mapping deployment, where all pipeline inputs and outputs are mapped to the endpoint (web API), use the create_deployment function. To create a deployment with manual mapping, pass the additional parameters inputs_mapping and outputs_mapping to specify precisely how each input and output is mapped to its source or destination.

CraftAiSdk.create_deployment(
    pipeline_name,
    deployment_name,
    execution_rule="endpoint",
    mode=DEPLOYMENT_MODES.ELASTIC,
    inputs_mapping=None,
    outputs_mapping=None,
    description=None,
    enable_parallel_executions=None,
    max_parallel_executions_per_pod=None,
    ram_request=None,
    gpu_request=None,
    timeout_s=180,
)

Parameters

  • deployment_name (str) -- Name of the deployment, chosen by the user; it is also used to refer to the endpoint
  • pipeline_name (str) -- Name of the pipeline that will be run by the deployment / endpoint
  • execution_rule (str) - Execution rule of the deployment. Must be "endpoint" or "periodic". For convenience, members of the enumeration DEPLOYMENT_EXECUTION_RULES can be used too.
  • mode (str) – Mode of the deployment. Can be "elastic" or "low_latency". Defaults to "elastic". For convenience, members of the enumeration DEPLOYMENT_MODES can be used. This defines how computing resources are allocated for pipeline executions:

    • elastic: Each pipeline execution runs in a new isolated container (“pod”), with its own memory (RAM, VRAM, disk). No variables or files are shared between executions, and the pod is destroyed when the execution ends. This mode is simple to use because it automatically uses computing resources for running executions, and each execution starts from an identical blank state. However, it takes time to create a new pod at the beginning of each execution (tens of seconds), and computing resources can become saturated when there are many executions.

    • low_latency: All pipeline executions for the same deployment run in a shared container (“pod”) with shared memory. The pod is created when the deployment is created, and deleted when the deployment is deleted. Shared memory means that if one execution modifies a global variable or a file, subsequent executions on the same pod will see the modified value. This mode allows executions to respond quickly (less than 0.5 seconds of overhead) because the pod is already up and running when an execution starts, and it is possible to preload or cache data. However, it requires care in the code because of possible interactions between executions. Additionally, computing resources must be managed carefully, as pods use resources continuously even when there is no ongoing execution, and the number of pods does not automatically adapt to the number of executions. During the lifetime of a deployment, a pod may be re-created by the platform for technical reasons (including if it tries to use more memory than available). This mode is not compatible with pipelines created with a container_config.dockerfile_path property in create_pipeline().

  • outputs_mapping (List of instances of OutputDestination, optional) - List of output mappings, to map pipeline outputs to different destinations. See OutputDestination for more details. For endpoint rules, if an output of the pipeline is not explicitly mapped, it will be automatically mapped to an endpoint output with the same name.
  • inputs_mapping (List, optional) - List of input mappings, to map pipeline inputs to different sources (such as constant values, endpoint inputs, data store or environment variables). See InputSource for more details. For endpoint rules, if an input of the pipeline is not explicitly mapped, it will be automatically mapped to an endpoint input with the same name.
  • description (str, optional) – Text description of the deployment, for documentation purposes only.
  • enable_parallel_executions (bool, optional) – Whether to run several executions at the same time in the same pod, if mode is "low_latency". Not applicable if mode is "elastic", where each execution always runs in a new pod. This is disabled by default, which means that for a deployment with "low_latency" mode, by default only one execution runs at a time on a pod, and other executions are pending while waiting for the running one to finish. Enabling this may be useful for inference batching on a model that takes much memory, so the model is loaded in memory only once and can be used for several inferences at the same time. If this is enabled, then global variables, GPU memory, and disk files are shared between multiple executions, so you must be mindful of potential race conditions and concurrency issues. For each execution running on a pod, the main Python function is run either as an asyncio coroutine with await if the function was defined with async def (recommended), or in a new thread if the function was defined simply with def. Environment variables are updated whenever a new execution starts on the pod. Using some libraries with async/threaded methods in your code may cause logs to be associated with the wrong running execution (logs are associated with executions through Python contextvars).
  • max_parallel_executions_per_pod (int, optional) – Only applies if enable_parallel_executions is True. The maximum number of executions that can run at the same time on a deployment's pod in "low_latency" mode: if more executions are requested at the same time, only max_parallel_executions_per_pod executions will actually run on the pod, and the others will be pending until a running execution finishes. The default is 6.
  • ram_request (str, optional) – The amount of memory (RAM) requested for the deployment, expressed in KiB, MiB or GiB. The value must be a string with a number followed by a unit, for example "512MiB" or "1GiB". This is only available for deployments in "low_latency" mode.
  • gpu_request (int, optional) – The number of GPUs requested for the deployment. This is only available for deployments in "low_latency" mode. A "low_latency" configuration using these parameters is sketched after this parameter list.
  • timeout_s (int) – Maximum time (in seconds) to wait for the deployment to be ready. The default is 3 minutes (180 seconds); the minimum is 2 minutes (120 seconds).
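
As a sketch of how these options combine, the following call creates a "low_latency" endpoint deployment with parallel executions enabled and explicit resource requests. All names and values are illustrative assumptions (not defaults): "my_pipeline" is assumed to already exist, and sdk to be an already configured CraftAiSdk instance.

# Hypothetical "low_latency" deployment for a memory-hungry model.
# Resource values are chosen for illustration only.
deployment = sdk.create_deployment(
    pipeline_name="my_pipeline",
    deployment_name="my_low_latency_deployment",
    execution_rule="endpoint",
    mode="low_latency",               # the pod is created now and kept running
    enable_parallel_executions=True,  # the pipeline's main function should preferably be async def
    max_parallel_executions_per_pod=4,
    ram_request="2GiB",
    gpu_request=1,
    timeout_s=300,                    # wait up to 5 minutes for the pod to be ready
)
print(deployment["name"])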

Returns

Information about the newly created deployment, as a Python dict containing:

  • name - Name of the deployment.
  • endpoint_token - Token of the endpoint used to trigger the deployment. Note that this token is only returned if execution_rule is “endpoint”.

Example

Example auto mapping

sdk.create_deployment(
    deployment_name="my_deployment",
    pipeline_name="my_pipeline",
    execution_rule="endpoint",
    outputs_mapping=[],
    inputs_mapping=[],
)

> {
>   'name': 'my_deployment', 
>   'endpoint_token': 'S_xZOKU ... KHs'
> }

Example manual mapping

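The objects passed to inputs_mapping and outputs_mapping below are instances of InputSource and OutputDestination. As an illustrative sketch, they could be defined as follows; the import path, keyword arguments and pipeline input/output names ("seagull", "big_whale", "salt", "prediction") are assumptions made for this example, so refer to InputSource and OutputDestination for the authoritative options.

# Hypothetical mapping objects for the manual-mapping example below.
from craft_ai_sdk.io import InputSource, OutputDestination

seagull_endpoint_input = InputSource(
    pipeline_input_name="seagull",
    endpoint_input_name="seagull",          # value supplied in the endpoint call
)
big_whale_input = InputSource(
    pipeline_input_name="big_whale",
    environment_variable_name="BIG_WHALE",  # value read from an environment variable
)
salt_constant_input = InputSource(
    pipeline_input_name="salt",
    constant_value=0.5,                     # fixed value used for every execution
)
prediction_endpoint_output = OutputDestination(
    pipeline_output_name="prediction",
    endpoint_output_name="prediction",      # returned in the endpoint response
)
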
sdk.create_deployment(
    deployment_name="my_deployment",
    pipeline_name="my_pipeline",
    execution_rule="endpoint",
    inputs_mapping=[
        seagull_endpoint_input,
        big_whale_input,
        salt_constant_input,
    ],
    outputs_mapping=[prediction_endpoint_output],
)


> {
>   'name': 'my_deployment', 
>   'endpoint_token': 'S_xZOkCI ... FIg'
> }
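
Once the deployment exists, the endpoint can be triggered over HTTP using the returned endpoint_token. Below is a minimal sketch with the requests library; the URL scheme, authorization header and payload format are assumptions made for illustration, so check your environment's endpoint documentation for the exact conventions.

import requests

# Assumed URL scheme and auth header; adapt to your environment's documentation.
environment_url = "https://my-environment.example.com"  # hypothetical environment URL
endpoint_url = f"{environment_url}/endpoints/my_deployment"
endpoint_token = "S_xZOkCI...FIg"                        # returned by create_deployment

response = requests.post(
    endpoint_url,
    headers={"Authorization": f"Bearer {endpoint_token}"},
    json={"seagull": "some_value"},                      # endpoint inputs, if any
    timeout=60,
)
response.raise_for_status()
print(response.json())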

Deploy with execution rule: Periodic

Definition function

To create a deployment that is triggered periodically, use the create_deployment function with execution_rule="periodic" and a schedule. All pipeline inputs and outputs must be mapped explicitly, using the inputs_mapping and outputs_mapping parameters to specify the mapping between each input (or output) and its source (or destination).

CraftAiSdk.create_deployment(
    pipeline_name,
    deployment_name,
    execution_rule="periodic",
    mode=DEPLOYMENT_MODES.ELASTIC,
    schedule=None,
    inputs_mapping=None,
    outputs_mapping=None,
    description=None,
)

Warning

Input and output mappings must always be explicit. Auto mapping isn't available for periodic deployments.

Parameters

  • deployment_name (str) -- Name of the deployment, chosen by the user

  • pipeline_name (str) -- Name of the pipeline that will be run by the deployment

  • execution_rule (str) - Execution rule of the deployment. Must be "endpoint" or "periodic". For convenience, members of the enumeration DEPLOYMENT_EXECUTION_RULES can be used too.

  • mode (str) – Mode of the deployment. Can be "elastic" or "low_latency". Defaults to "elastic". For convenience, members of the enumeration DEPLOYMENT_MODES can be used. This defines how computing resources are allocated for pipeline executions:

    • elastic: Each pipeline execution runs in a new isolated container (“pod”), with its own memory (RAM, VRAM, disk). No variables or files are shared between executions, and the pod is destroyed when the execution ends. This mode is simple to use because it automatically uses computing resources for running executions, and each execution starts from an identical blank state. However, it takes time to create a new pod at the beginning of each execution (tens of seconds), and computing resources can become saturated when there are many executions.

    • low_latency: All pipeline executions for the same deployment run in a shared container (“pod”) with shared memory. The pod is created when the deployment is created, and deleted when the deployment is deleted. Shared memory means that if one execution modifies a global variable or a file, subsequent executions on the same pod will see the modified value. This mode allows executions to respond quickly (less than 0.5 seconds of overhead) because the pod is already up and running when an execution starts, and it is possible to preload or cache data. However, it requires care in the code because of possible interactions between executions. Additionally, computing resources must be managed carefully, as pods use resources continuously even when there is no ongoing execution, and the number of pods does not automatically adapt to the number of executions. During the lifetime of a deployment, a pod may be re-created by the platform for technical reasons (including if it tries to use more memory than available). This mode is not compatible with pipelines created with a container_config.dockerfile_path property in create_pipeline().

  • schedule (str, optional) - Schedule of the deployment. Only required if execution_rule is "periodic". Must be a valid cron expression. The deployment will be executed periodically according to this schedule. The schedule must follow this format: <minute> <hour> <day of month> <month> <day of week>. Note that the schedule is in the UTC time zone. "*" means all possible values. Here are some examples (a small sketch for previewing a schedule is given after this parameter list):

    • "0 0 * * *" will execute the deployment every day at midnight.
    • "0 0 5 * *" will execute the deployment on the 5th day of every month at midnight.
  • inputs_mapping (List of instances of InputSource, optional) - List of input mappings, to map pipeline inputs to different sources (such as constant values, endpoint inputs, or environment variables). See InputSource for more details. For endpoint rules, if an input of the pipeline is not explicitly mapped, it will be automatically mapped to an endpoint input with the same name. For periodic rules, all inputs of the pipeline must be explicitly mapped.

  • outputs_mapping (List of instances of OutputDestination, optional) - List of output mappings, to map pipeline outputs to different destinations. See OutputDestination for more details. For endpoint execution rules, if an output of the pipeline is not explicitly mapped, it will be automatically mapped to an endpoint output with the same name. For other rules, all outputs of the pipeline must be explicitly mapped.

  • description (str, optional) – Text description of the deployment, for documentation purposes only.

  • enable_parallel_executions (bool, optional) – Whether to run several executions at the same time in the same pod, if mode is "low_latency". Not applicable if mode is "elastic", where each execution always runs in a new pod. This is disabled by default, which means that for a deployment with "low_latency" mode, by default only one execution runs at a time on a pod, and other executions are pending while waiting for the running one to finish. Enabling this may be useful for inference batching on a model that takes much memory, so the model is loaded in memory only once and can be used for several inferences at the same time. If this is enabled, then global variables, GPU memory, and disk files are shared between multiple executions, so you must be mindful of potential race conditions and concurrency issues. For each execution running on a pod, the main Python function is run either as an asyncio coroutine with await if the function was defined with async def (recommended), or in a new thread if the function was defined simply with def. Environment variables are updated whenever a new execution starts on the pod. Using some libraries with async/threaded methods in your code may cause logs to be associated with the wrong running execution (logs are associated with executions through Python contextvars).
  • max_parallel_executions_per_pod (int, optional) – Only applies if enable_parallel_executions is True. The maximum number of executions that can run at the same time on a deployment's pod in "low_latency" mode: if more executions are requested at the same time, only max_parallel_executions_per_pod executions will actually run on the pod, and the others will be pending until a running execution finishes. The default is 6.
  • ram_request (str, optional) – The amount of memory (RAM) requested for the deployment, expressed in KiB, MiB or GiB. The value must be a string with a number followed by a unit, for example "512MiB" or "1GiB". This is only available for deployments in "low_latency" mode.
  • gpu_request (int, optional) – The number of GPUs requested for the deployment. This is only available for deployments in "low_latency" mode.
  • timeout_s (int) – Maximum time (in seconds) to wait for the deployment to be ready. The default is 3 minutes (180 seconds); the minimum is 2 minutes (120 seconds).
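
To sanity-check a cron expression before creating the deployment, you can preview its next firing times. A minimal sketch using the third-party croniter package (not part of the Craft AI SDK, used here only as an illustration):

from datetime import datetime, timezone

from croniter import croniter  # third-party package: pip install croniter

schedule = "0 0 5 * *"  # midnight (UTC) on the 5th day of every month
it = croniter(schedule, datetime.now(timezone.utc))
for _ in range(3):
    # Print the next three execution times implied by the schedule.
    print(it.get_next(datetime))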

Returns

Information about the newly created deployment, as a Python dict containing:

  • name - Name of the deployment.
  • schedule - Schedule of the deployment. Note that this schedule is only returned if execution_rule is “periodic”.
  • human_readable_schedule - Human readable schedule of the deployment. Note that this schedule is only returned if execution_rule is “periodic”.

Example

Set up a deployment to be triggered automatically every day at 2:00 PM (UTC).

sdk.create_deployment(
    deployment_name="my_deployment",
    pipeline_name="my_pipeline",
    execution_rule="periodic",
    schedule="0 14 * * *",
)


> {
>   'name': 'my_deployment', 
>   'schedule': '0 14 * * *', 
>   'human_readable_schedule': 'Every day at 02:00 PM'
> }
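
Since periodic deployments require explicit mappings for every pipeline input and output, a more complete sketch could look like the following. The pipeline input/output names, the environment variable and the data store path are assumptions made for illustration; see InputSource and OutputDestination for the available options.

from craft_ai_sdk.io import InputSource, OutputDestination

sdk.create_deployment(
    deployment_name="my_periodic_deployment",
    pipeline_name="my_pipeline",
    execution_rule="periodic",
    schedule="0 0 * * *",  # every day at midnight (UTC)
    inputs_mapping=[
        # Hypothetical mappings: a constant value and an environment variable.
        InputSource(pipeline_input_name="salt", constant_value=0.5),
        InputSource(pipeline_input_name="big_whale", environment_variable_name="BIG_WHALE"),
    ],
    outputs_mapping=[
        # Hypothetical mapping: write the pipeline output to the data store.
        OutputDestination(pipeline_output_name="prediction", datastore_path="predictions/result.json"),
    ],
)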

Tips for Managing Hardware Resources

When deploying pipelines, especially with "low_latency" deployments or many simultaneous elastic executions, it's essential to monitor and manage hardware resource usage effectively.

If RAM/VRAM is Fully Used

This is indicated by errors such as "Out of Memory". You can address this by:

  • Reducing the number of parallel low-latency deployments.
  • Decreasing the number of simultaneous elastic executions.
  • Optimizing your pipeline to use less memory.

If CPU is Fully Used

This can be identified by abnormally slow execution times or observed in resource usage metrics. If this becomes an issue:

  • Reduce the number of ongoing executions.
  • Optimize your code to be more CPU-efficient.
  • Consider upgrading your hardware resources if needed.