Our BYOC offering brings industry-leading inference infrastructure into your own cloud, giving you full control over your AI workloads.
Deploy in your own cloud: AWS, GCP, Azure, and more
Efficient resource provisioning across multiple clouds and regions
Leverage your existing cloud commitments and credits
SOC 2 certified, keeping your models and data secure
BentoML is the most flexible way to build production-grade AI systems with any open-source or custom fine-tuned model. We handle the infrastructure so you can focus on innovation.
Build inference APIs, task queues, and multi-model pipelines from your own models and code. BentoML's open-source framework provides customizable scaling, queuing, batching, and model composition to accelerate the development of production-grade AI systems.
import uuid
from typing import Annotated, AsyncGenerator, Optional

import bentoml
from annotated_types import Ge, Le
from bentovllm_openai.utils import openai_endpoints  # OpenAI-compatible endpoint helper from BentoML's vLLM examples

# MODEL_ID, MAX_TOKENS, SYSTEM_PROMPT and PROMPT_TEMPLATE are defined earlier in the project (elided here).

@openai_endpoints(
    model_id=MODEL_ID,
    default_chat_completion_parameters=dict(stop=["<|eot_id|>"]),
)
@bentoml.service(
    name="bentovllm-llama3.1-405b-instruct-awq-service",
    traffic={
        "timeout": 1200,
        "concurrency": 256,  # Matches the default max_num_seqs in the vLLM engine
    },
    resources={
        "gpu": 4,
        "gpu_type": "nvidia-a100-80gb",
    },
)
class VLLM:
    def __init__(self) -> None:
        from transformers import AutoTokenizer
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        ENGINE_ARGS = AsyncEngineArgs(
            model=MODEL_ID,
            max_model_len=MAX_TOKENS,
            enable_prefix_caching=True,
            tensor_parallel_size=4,
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self.stop_token_ids = [
            tokenizer.eos_token_id,
            tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors in plain English",
        system_prompt: Optional[str] = SYSTEM_PROMPT,
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(
            max_tokens=max_tokens,
            stop_token_ids=self.stop_token_ids,
        )
        if system_prompt is None:
            system_prompt = SYSTEM_PROMPT
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt, system_prompt=system_prompt)
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

        # Stream back only the newly generated text on each iteration.
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)
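The service above exposes a streaming inference API. For the task-queue pattern mentioned earlier, the sketch below shows the general shape, assuming BentoML's @bentoml.task decorator and a hypothetical BatchSummarizer service with a summarize endpoint (neither appears on this page):

import bentoml

@bentoml.service(resources={"cpu": "2"})
class BatchSummarizer:
    # @bentoml.task registers a background task endpoint: clients enqueue a
    # job, poll its status, and fetch the result later instead of holding
    # an open connection.
    @bentoml.task
    def summarize(self, text: str) -> str:
        # Hypothetical long-running job; replace with real model inference.
        return text[:200]

Clients submit work through the auto-generated task methods:

client = bentoml.SyncHTTPClient("http://localhost:3000")
task = client.summarize.submit(text="A very long document ...")
print(task.get_status())  # poll without blocking
print(task.get())         # wait for and fetch the result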
Move seamlessly from local prototype to a secure, scalable production deployment with a single command.
bentoml deploy .
🍱 Built bento "vllm:7ftwkpztah74bdwk"
✅ Pushed Bento "vllm:7ftwkpztah74bdwk"
✅ Created deployment "vllm:7ftwkpztah74bdwk" in cluster "gcp-us-central-1"
💻 View Dashboard: https://ss-org-1.cloud.bentoml.com/deployments/vllm-t1y6
Simplify access to your deployed AI applications through auto-generated web UIs, Python clients, and REST APIs, as the curl example and the Python client sketch below show. Token-based authorization gives client applications secure, controlled access.
curl -s -X POST \
  'https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "max_tokens": 4096,
    "prompt": "Explain superconductors in plain English",
    "system_prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don'"'"'t know the answer to a question, please don'"'"'t share false information."
  }'
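The same endpoint can also be called from the auto-generated Python client. A minimal sketch, assuming bentoml.SyncHTTPClient with a token argument carrying a BentoCloud API key (the token value here is a placeholder):

import bentoml

# Endpoint URL taken from the curl example above; the token is a placeholder.
with bentoml.SyncHTTPClient(
    "https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai",
    token="YOUR_BENTOCLOUD_API_TOKEN",
) as client:
    # generate() streams text chunks, mirroring the service's async generator.
    for chunk in client.generate(
        prompt="Explain superconductors in plain English",
        max_tokens=1024,
    ):
        print(chunk, end="", flush=True)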
Power your mission-critical AI with BentoML's optimized inference infrastructure.
Fast GPU autoscaling with minimal cold-start latency (see the deployment sketch after this list)
Low-latency, high-throughput model serving
Intelligent resource management for cost efficiency
Real-time monitoring and logging for reliable deployments
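As one illustration of how the autoscaling bounds can be pinned down in code, here is a minimal sketch using BentoML's Python deployment API; the deployment name and replica bounds are assumptions for illustration, not values from this page:

import bentoml

# Hypothetical deployment of the VLLM service above; scaling_min/scaling_max
# bound the replica count so GPU capacity tracks traffic instead of idling.
bentoml.deployment.create(
    bento=".",               # project directory containing the service
    name="vllm-autoscaled",  # hypothetical deployment name
    scaling_min=0,           # scale to zero when idle to cut GPU cost
    scaling_max=5,           # cap replicas for cost control
)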