AI Inference Service

Introduction

The HolmesAI AI Inference platform is an advanced “turbo boost” toolkit designed to supercharge AI inference. Much like a turbocharger enhances a car engine’s speed and performance, HolmesAI accelerates AI inference with groundbreaking efficiency. It optimizes inference services along two critical dimensions: simplicity and speed.

The HolmesAI AI Inference platform is easy to use, with:

  • Integration with diverse models including DeepSeek, GPT, ChatGLM, LLaMA, Mixtral, Qwen, and Baichuan

  • Seamless compatibility with multiple model platforms such as Hugging Face, OpenCSG, and ModelScope

  • Flexible service management through autoscaling and automated deployment features

  • OpenAI-compatible API for easy integration with existing systems (see the client sketch just below this list)
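Because the API follows the OpenAI wire format, any existing OpenAI client can talk to the platform. The sketch below uses the official openai Python library; the base URL, API key, and model name are illustrative placeholders rather than real platform values.

# Minimal sketch: calling an OpenAI-compatible inference endpoint with the
# official openai Python client. Base URL, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-holmesai-endpoint.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-r1",  # placeholder: any model deployed on the platform
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)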

The HolmesAI AI Inference platform is built for speed by:

  • Optimizing CUDA kernels, including fused activation and gating functions and RMS normalization

  • Utilizing advanced quantization methods like INT8, FP8, and GPTQ for faster inference

  • Implementing an efficient attention mechanism that partitions KV caches into blocks, reducing memory usage and enabling better memory sharing during parallel sampling

  • Supporting continuous batching, allowing sequences to process without waiting for the entire batch, resulting in higher GPU utilization

Compared to OpenAI, the HolmesAI AI Inference platform delivers around 240 tokens per second for single requests, roughly three times faster. For concurrent requests, it achieves up to 3,000 tokens per second, roughly 40 times higher throughput.
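As a rough, client-side way to reproduce a single-request tokens-per-second figure, you can stream a completion and divide the number of streamed chunks (approximately one token each) by the wall-clock time. The endpoint, key, and model below are placeholders.

# Rough single-request throughput estimate against an OpenAI-compatible endpoint.
# Each streamed chunk is treated as roughly one token; endpoint and model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-holmesai-endpoint.example.com/v1", api_key="YOUR_API_KEY")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Write a 300-word story about a detective."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.time() - start
print(f"~{tokens / elapsed:.0f} tokens/s (approximate)")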

More details

Enhanced Multi-Model Compatibility

Hugging Face’s Transformers library provides thousands of pre-trained models across diverse modalities. We continually expand our support by integrating new models into our platform. For models with architectures similar to existing ones, the adaptation process is straightforward. For those introducing new operators, integration can be more complex, but our team is ready to assist. For more information, don’t hesitate to contact us.
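For example, you can quickly tell whether a new checkpoint reuses an architecture that is already served by inspecting its Transformers config. The supported-architecture set below is illustrative, not the platform’s actual list.

# Sketch: check whether a Hugging Face checkpoint declares an already-supported
# architecture. SUPPORTED_ARCHITECTURES is illustrative, not an official list.
from transformers import AutoConfig

SUPPORTED_ARCHITECTURES = {"LlamaForCausalLM", "Qwen2ForCausalLM", "MixtralForCausalLM"}

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
declared = set(config.architectures or [])

if declared & SUPPORTED_ARCHITECTURES:
    print("Architecture already supported; adaptation is straightforward.")
else:
    print(f"New architecture {declared}; custom operators may be needed, contact us.")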

Seamless Multi-Platform Deployment

We now offer seamless one-click deployment for third-party models. Simply search for your desired model on platforms like Hugging Face, OpenCSG, or ModelScope, paste the model URL, and we’ll handle the rest.
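Conceptually, a pasted model URL only needs to be resolved to a hosting platform and a repository ID before deployment; the helper below is a hypothetical illustration of that step, not the platform’s actual code.

# Hypothetical illustration: resolving a pasted model URL to (platform, repo_id).
from urllib.parse import urlparse

def resolve_model_url(url: str) -> tuple[str, str]:
    parsed = urlparse(url)
    repo_id = parsed.path.strip("/")
    if "huggingface.co" in parsed.netloc:
        return "huggingface", repo_id
    if "modelscope.cn" in parsed.netloc:
        return "modelscope", repo_id
    raise ValueError(f"Unrecognized model platform: {parsed.netloc}")

print(resolve_model_url("https://huggingface.co/Qwen/Qwen2.5-7B-Instruct"))
# -> ('huggingface', 'Qwen/Qwen2.5-7B-Instruct')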

Autoscaling

HolmesAI AI Inference platform leverages Ray’s distributed and shared memory management to seamlessly deploy services across multiple GPUs and devices. The core of HolmesAI, the AsyncEngine layer, ensures efficient, concurrent, and thread-safe processing for tasks such as encoding, tokenization, decoding, and beam search. Additionally, HolmesAI automatically handles routing and load balancing for inference services, allowing you to flexibly specify any number of GPUs and API endpoints for deployment.
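The sketch below shows the general Ray pattern of placing one inference worker per GPU and routing requests across them; InferenceWorker and its generate method are hypothetical stand-ins, not the platform’s AsyncEngine.

# Sketch of the general Ray pattern: one actor per GPU, requests routed round-robin.
# InferenceWorker is a hypothetical stand-in for the real engine; assumes >= 1 GPU.
import ray

ray.init()

@ray.remote(num_gpus=1)
class InferenceWorker:
    def __init__(self, model_name: str):
        self.model_name = model_name  # a real engine would load weights onto its GPU here

    def generate(self, prompt: str) -> str:
        return f"[{self.model_name}] completion for: {prompt}"

num_gpus = int(ray.cluster_resources().get("GPU", 0))
workers = [InferenceWorker.remote("deepseek-r1") for _ in range(num_gpus)]

prompts = ["What is continuous batching?", "What is a KV cache?"]
futures = [workers[i % len(workers)].generate.remote(p) for i, p in enumerate(prompts)]
print(ray.get(futures))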

OpenAI Compatibility

The HolmesAI AI Inference platform offers an HTTP server that is fully compatible with the OpenAI API, supporting both the Completions and Chat endpoints. You can interact with this server using the official OpenAI Python client library or any other HTTP client. Configuration options, such as the "model" parameter, are available in our console. For more details, please refer to the OpenAI API documentation. We support all parameters except for tools, tool_choice, functions, and function_call.

CUDA Optimization

PyTorch is a widely used deep learning framework that provides user-friendly tools and abstractions for building and training neural networks. However, some basic operations in PyTorch can be slower than performing them directly with CUDA, NVIDIA's parallel computing platform and programming model for leveraging the power of GPUs. For example, calculating Root Mean Square (RMS) normalization in PyTorch can be around 1.2x slower than using direct CUDA operations:

# PyTorch RMS normalization as written inside a module's forward pass;
# orig_dtype, self.variance_epsilon, and self.weight are defined earlier in the module.
variance = x.pow(2).mean(dim=-1, keepdim=True)
x = x * torch.rsqrt(variance + self.variance_epsilon)
x = x.to(orig_dtype) * self.weight

The equivalent hand-written CUDA kernel fuses the squared-sum reduction and the normalization into a single pass over the hidden dimension, using one thread block per token:

// One thread block handles one token; threads stride across the hidden dimension.
__shared__ float s_variance;
float variance = 0.0f;
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
  const float x = (float) input[blockIdx.x * hidden_size + idx];
  variance += x * x;
}
// blockReduceSum sums the per-thread partial sums across the block.
variance = blockReduceSum<float>(variance);
if (threadIdx.x == 0) {
  s_variance = rsqrtf(variance / hidden_size + epsilon);
}
__syncthreads();
// Normalize, apply the learned weight, and cast back to the model dtype.
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
  float x = (float) input[blockIdx.x * hidden_size + idx];
  out[blockIdx.x * hidden_size + idx] = ((scalar_t) (x * s_variance)) * weight[idx];
}

Quantizations

Some quantization techniques are supported on NVIDIA platforms (SM 7.0 and above). The table below outlines the compatibility details:

For instance, quantizing models to FP8 can reduce memory usage by 2x while boosting throughput by up to 1.6x, all with minimal impact on accuracy.
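The memory saving is easy to estimate from bytes per parameter: FP16 weights take two bytes each, while FP8 or INT8 weights take one, which is where the roughly 2x reduction comes from. The sketch below does that arithmetic for an illustrative 7B-parameter model.

# Back-of-the-envelope weight-memory estimate for a 7B-parameter model.
# 2 bytes/param for FP16, 1 byte/param for FP8 or INT8 (ignoring activations and KV cache).
params = 7e9
bytes_fp16 = params * 2
bytes_fp8 = params * 1

print(f"FP16 weights: {bytes_fp16 / 1e9:.1f} GB")     # ~14.0 GB
print(f"FP8/INT8 weights: {bytes_fp8 / 1e9:.1f} GB")  # ~7.0 GB
print(f"Reduction: {bytes_fp16 / bytes_fp8:.1f}x")    # 2.0x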

Attention Mechanism

In traditional models, the KV cache used during token generation must be allocated contiguously. However, our attention mechanism employs a non-contiguous approach by allocating memory in fixed-size blocks and operating on block-aligned inputs. This method ensures that buffer allocation occurs only as needed. When starting a new generation, the framework does not require a contiguous buffer of the maximum context length. Instead, each iteration allows the scheduler to dynamically determine if additional space is needed for the current generation.
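A minimal sketch of the idea, with hypothetical names: the KV cache is carved into fixed-size blocks, each sequence keeps a block table, and a new block is taken from a free list only when the current one fills up.

# Hypothetical sketch of block-based KV cache allocation (fixed-size blocks,
# allocated on demand instead of reserving max-context-length buffers).
BLOCK_SIZE = 16  # tokens per KV cache block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; the scheduler should preempt or wait")
        return self.free_blocks.pop()

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table: list[int] = []  # logical block -> physical block

    def append_token(self):
        # Only grab a new physical block when the previous one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):      # generate 40 tokens
    seq.append_token()
print(seq.block_table)   # 3 blocks used instead of a max-length reservation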

Continuous Batching

Here’s an illustration of naive batching in the context of LLM inference:

In the diagram, yellow represents prompt tokens and blue represents generated tokens. After several iterations (as shown in the right image), the sequences have reached different lengths; notice that Sequence 2 is nearly twice as long as Sequence 3. This causes underutilization, because slots belonging to already-finished sequences sit idle until every sequence in the batch completes. In practice, a chatbot service receives input sequences of varying lengths, typically ranging from 1k to 8k tokens, which can lead to significant GPU underutilization.

To address this, we can insert new sequences immediately after the completion of the previous generation, like so:

However, the reality is more complex than this simplified model. The initial first-token (prefill) phase is computationally intensive and follows a different pattern from subsequent token generation, making it challenging to batch effectively. To manage this, we use a hyperparameter that adjusts the ratio of requests waiting for their prefill phase to those still in the token-generation (decode) phase. These adjustments contribute to the benchmark figures reported above: around 240 tokens per second for single requests and up to 3,000 tokens per second under concurrency.
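A schematic of that scheduling loop, using hypothetical names: each scheduler step either admits waiting requests for their prefill pass or advances running requests by one decode step, and finished sequences free their slots immediately so new ones can join. The PREFILL_RATIO knob stands in for the hyperparameter described above.

# Hypothetical sketch of a continuous-batching scheduler loop. PREFILL_RATIO plays
# the role of the hyperparameter described above: how strongly to favor admitting
# new (prefill) requests over advancing running (decode) requests.
import random
from collections import deque

MAX_BATCH = 8
PREFILL_RATIO = 0.3

waiting = deque(f"req-{i}" for i in range(20))
running: dict[str, int] = {}  # request id -> decode steps still needed

steps = 0
while waiting or running:
    steps += 1
    admit = waiting and len(running) < MAX_BATCH and (not running or random.random() < PREFILL_RATIO)
    if admit:
        # Prefill: run the compute-heavy first-token pass for one waiting request.
        req = waiting.popleft()
        running[req] = random.randint(5, 40)  # pretend output length
    else:
        # Decode: every running request generates one token this step.
        for req in list(running):
            running[req] -= 1
            if running[req] == 0:
                del running[req]  # slot is freed immediately for a new request

print(f"Finished 20 requests in {steps} scheduler steps")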
