
FlockServe

Open-source Sky Computing Inference Endpoint

FlockServe is an open-source tool that deploys AI inference endpoints, offering autoscaling, load balancing, and monitoring. It uses SkyPilot to provision nodes, allowing it to operate across different cloud services and making it easy to switch between them by changing the SkyPilot settings. The platform is compatible with any inference engine and ships with examples for the vLLM and TGI engines. FlockServe is built on FastAPI and uvicorn, which process requests asynchronously. Its design is modular, enabling various autoscaling and load-balancing strategies. The default autoscaling method estimates average queue length to adjust resources, and the default load-balancing method is Least Connection Load Balancing; both are well suited to serving large language models.
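
To make the default load-balancing strategy concrete, here is a minimal sketch of least-connection routing: each incoming request goes to the worker with the fewest in-flight requests. The Worker class and helper below are hypothetical illustrations, not FlockServe's actual API.

from typing import List

class Worker:
    """Hypothetical stand-in for a provisioned inference node."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.in_flight = 0  # requests currently being processed

def pick_worker(workers: List[Worker]) -> Worker:
    # Least connection: choose the worker with the fewest in-flight requests.
    chosen = min(workers, key=lambda w: w.in_flight)
    chosen.in_flight += 1  # account for the request about to be routed
    return chosen

workers = [Worker("skypilot-worker-0"), Worker("skypilot-worker-1")]
print(pick_worker(workers).name)  # routes to the least-loaded worker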

Features

  • Scalability: FlockServe scales inference endpoints to meet demand, using cloud resources to adjust capacity as needed so your machine-learning models can handle any load thrown at them.
  • SkyPilot integration: By integrating closely with SkyPilot, FlockServe gains dynamic and distributed computing capabilities, enabling efficient resource utilization so models are served with the lowest latency and highest throughput possible.
  • Flexible model support: Recognizing the diversity of machine-learning frameworks and model formats, FlockServe offers broad support across technologies, so developers can deploy their models without worrying about compatibility issues.
  • RESTful API: FlockServe's simple, intuitive API makes interacting with the inference endpoint a breeze. Developers can easily integrate their applications, streamlining the process of sending requests and receiving predictions.
  • Monitoring and logging: Keeping track of deployed models is crucial for performance and reliability. FlockServe provides robust monitoring and logging tools, giving you the insights needed to debug and optimize your models effectively.

Getting started

Below are the prerequisites needed to get started and a simple installation guide to set up FlockServe.

Prerequisites

Before you dive into installing FlockServe, ensure your environment meets the following requirements:

  • Python version: FlockServe requires Python 3.7 or newer but below 3.11. This version range ensures compatibility with FlockServe's dependencies and features; a quick version check is sketched below.
  • Docker (if using containerized deployment): For a containerized deployment, Docker must be installed on your system. Docker lets you encapsulate FlockServe in a container, making it easy to deploy across different environments without worrying about varying dependencies.
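
If you want to confirm your interpreter falls in the supported range before installing, a check like the following works. This is a convenience snippet, not part of FlockServe:

import sys

# FlockServe supports Python >= 3.7 and < 3.11.
if not ((3, 7) <= sys.version_info[:2] < (3, 11)):
    raise SystemExit(f"Python {sys.version.split()[0]} is outside the supported range >=3.7,<3.11")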

Installation guide

Installing FlockServe is as simple as running a single command in your terminal. Follow these steps to get FlockServe up and running:

  • Open your terminal

Start by opening a terminal window on your computer.

  • Install FlockServe using pip

Execute the following command to install FlockServe via pip, Python’s package installer. This command downloads FlockServe and its dependencies from PyPI.

pip install flockserve

After completing these steps, FlockServe will be installed on your system, and you’re ready to start configuring and using your inference endpoint.
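
As a quick sanity check, you can confirm the package imports cleanly; this only verifies the installation and makes no assumptions about FlockServe's attributes beyond the package name:

import flockserve  # succeeds only if installation completed

print("flockserve imported from", flockserve.__file__)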

How to use FlockServe?

Here’s a guide to help you use FlockServe to serve your models.

  • Import and run FlockServe

Begin by bringing FlockServe into your project; you can launch it either from the command line or from Python. From the command line:

flockserve --skypilot_task serving_tgi_cpu_openai.yaml

From Python:

from flockserve import FlockServe
fs = FlockServe(skypilot_task="serving_tgi_cpu_generate.yaml")
fs.run()
  • Configure your inference endpoint

Set up the necessary configuration for your inference endpoint, including skypilot_task, worker_capacity, and port. skypilot_task is the only mandatory argument. The other available arguments are:

Argument                         Default Value           Description
skypilot_task                    (required)              Path to the YAML file defining the SkyPilot task.
worker_capacity                  30                      Maximum concurrent tasks a worker can handle.
worker_name_prefix               'skypilot-worker'       Prefix used for naming workers.
host                             '0.0.0.0'               IP address for server binding.
port                             -1                      Port for server binding; derived from the SkyPilot task if -1.
worker_ready_path                "/health"               Path to verify worker readiness.
min_workers                      1                       Minimum number of workers to maintain.
max_workers                      2                       Maximum number of workers allowed.
autoscale_up                     7                       Load threshold for scaling up workers.
autoscale_down                   4                       Load threshold for scaling down workers.
queue_tracking_window            600                     Time window in seconds for tracking queue length.
node_control_key                 None                    Secret key for managing nodes.
metrics_id                       1                       Suffix for generated OTEL metrics; use different values for Prod and QA.
verbosity                        1                       0 for No logging, 1 for Info, 2 for Debug.
otel_collector_endpoint          http://localhost:4317   OTEL collector endpoint for telemetry export.
otel_metrics_exporter_settings   {}                      Additional OTEL metrics exporter settings in key-value format.
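
To make the threshold arguments concrete: the autoscaler compares average load per worker over the queue_tracking_window against autoscale_up and autoscale_down. The function below is an illustrative sketch of that decision rule under those assumptions, not FlockServe's internal implementation:

from statistics import mean

def desired_workers(load_samples, current, *, min_workers=1, max_workers=2,
                    autoscale_up=7, autoscale_down=4):
    # Illustrative sketch only. load_samples holds per-worker load
    # (e.g. queue length) observed over the queue_tracking_window.
    avg = mean(load_samples)
    if avg > autoscale_up and current < max_workers:
        return current + 1  # sustained high load: add a worker
    if avg < autoscale_down and current > min_workers:
        return current - 1  # sustained low load: remove a worker
    return current          # within thresholds: hold steady

print(desired_workers([8, 9, 7, 8], current=1))  # -> 2 (scale up)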

Example FlockServe configuration

# Sample FlockServe Configuration
from flockserve import FlockServe

# Define the configuration for your FlockServe inference endpoint
flockserve_config = {
    "skypilot_task": "path_to_your_skypilot_task_file.yaml",  # Mandatory: Path to the YAML file that defines the SkyPilot task.
    "worker_capacity": 30,                                   # Optional: Maximum number of tasks a worker can handle concurrently.
    "worker_name_prefix": "skypilot-worker",                 # Optional: Prefix used for naming workers.
    "host": "0.0.0.0",                                       # Optional: Host IP address for server binding.
    "port": 8000,                                            # Optional: Port number for server listening; derived from the SkyPilot task if set to -1.
    "worker_ready_path": "/health",                          # Optional: Path to check worker readiness.
    "min_workers": 1,                                        # Optional: Minimum number of workers to maintain.
    "max_workers": 5,                                        # Optional: Maximum number of workers allowed.
    "autoscale_up": 7,                                       # Optional: Load threshold to trigger scaling up of workers.
    "autoscale_down": 4,                                     # Optional: Load threshold to trigger scaling down of workers.
    "queue_tracking_window": 600,                            # Optional: Time window in seconds to track queue length for autoscaling.
    "node_control_key": "your_secret_key",                   # Optional: Secret key for node management operations.
    "metrics_id": "1",                                       # Optional: Suffix for generated OTEL metrics—set different for Prod & QA.
    "verbosity": 1,                                          # Optional: 0 for No logging, 1 for Info, 2 for Debug.
    "otel_collector_endpoint": "http://localhost:4317",      # Optional: Address of OTEL collector to export Telemetry to.
    "otel_metrics_exporter_settings": {}                     # Optional: Extra settings for OTEL metrics exporter in key-value pair format.
}

  • Initialize and run the inference endpoint

With the configuration set, initialize the FlockServe application and start the server. This step brings your inference endpoint online, ready to serve requests.

# Initialize the FlockServe application with the defined configuration
fs = FlockServe(**flockserve_config)

# Run the FlockServe application to start serving your models
fs.run()

These guidelines and the example configuration should help you set up and customize your FlockServe inference endpoint. Adjust these settings based on your deployment environment and specific requirements to optimize the performance and scalability of your machine-learning model serving.
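
Once the endpoint is running, clients reach it over plain HTTP. The exact route and payload depend on the inference engine your SkyPilot task launches; the example below assumes a TGI worker behind FlockServe on localhost port 8000 and uses TGI's /generate schema, so treat the URL and body as assumptions to adapt:

import requests  # third-party: pip install requests

# Assumes FlockServe listens on localhost:8000 in front of a TGI worker;
# the /generate route and payload schema are TGI-specific, so adjust them
# for the engine defined in your SkyPilot task.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"inputs": "What is sky computing?",
          "parameters": {"max_new_tokens": 64}},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())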

How to contribute to FlockServe?

FlockServe welcomes contributions from the community. Whether you’re fixing bugs, adding new features, or improving documentation, your help is valuable. Here’s how you can contribute:

Fork the repository

Start by forking the FlockServe repository on GitHub to your own account.

Clone the fork

Clone your fork to your local machine for development work.

Create a new branch

Make a new branch for each contribution, focusing on a single feature or bug fix.

Implement your changes

Work on your changes. Adhere to the project’s coding standards and guidelines.

Test your changes

Ensure your changes do not break any existing functionality. Add tests if you’re introducing new features.

Submit a PR (pull request)

Once satisfied with your work, push your branch to your fork on GitHub and submit a pull request to the main FlockServe repository. Please follow the detailed contribution guidelines provided in the FlockServe repository for more information.

Reach support

If you encounter any issues or have questions about FlockServe, the project encourages you to use the following resources:

  • Issue Tracker: Use the GitHub issue tracker associated with the FlockServe repository for bugs and feature requests.
  • Email: Drop an email with your concern/issue to hello@jdoodle.com.

Acknowledgements to contributors

FlockServe appreciates all of its contributors and the broader open-source community for their support and collaboration. Special thanks go to everyone who has contributed to developing, testing, and refining FlockServe, making it a valuable tool for the community.
