Turbocharging LLM Inference: TGI Meets Intel Gaudi’s Power

Hugging Face is pleased to announce a native integration that brings Intel Gaudi hardware support directly into Text Generation Inference (TGI), its production-grade serving solution for large language models. This enhancement leverages Intel’s specialized AI accelerators to offer the open-source community greater deployment flexibility alongside TGI’s robust performance features.

The new integration merges Gaudi support into TGI’s main codebase, removing the need for the separate fork that was previously maintained. Thanks to TGI’s multi-backend architecture, users can now benefit from the latest TGI features without having to manage a custom repository.

This integration supports the full range of Intel Gaudi hardware, from first-generation Gaudi through Gaudi 2 and Gaudi 3, giving users several deployment options.

More details about Intel’s Gaudi hardware can be found on the Intel Gaudi product page.

The Gaudi backend for TGI offers several noteworthy benefits, including:

  • Hardware Diversity: It expands deployment options beyond conventional GPUs.
  • Cost Efficiency: Gaudi hardware frequently delivers attractive price-performance for targeted workloads.
  • Production-Ready Robustness: Enjoy dynamic batching, streamed responses, and other mature features native to TGI.
  • Extensive Model Support: Run popular models such as Llama 3.1, Mistral, Mixtral, and others on Gaudi systems.
  • Advanced Capabilities: Benefit from multi-card inference (sharding), support for vision-language models, and FP8 precision.

To get started with TGI on Gaudi, the simplest approach is to launch the official Docker image on a machine equipped with Gaudi hardware. A typical setup specifies the model to serve, mounts a shared volume so weights are not re-downloaded on every run, and passes an access token for authentication. Once the server is running, inference requests can be sent through standard API calls.
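
The snippet below is a minimal sketch of that workflow rather than the official quickstart: the Docker image tag, launcher flags, port mapping, and model name are illustrative and may differ between releases, so check the TGI Gaudi documentation for the exact command. Once the container is up, requests go to TGI’s standard /generate endpoint.

```python
# Minimal sketch of querying a TGI server running on Gaudi. It assumes the server
# was started with the official Gaudi Docker image along these lines (image tag,
# flags, and model are examples only -- consult the TGI Gaudi docs for specifics):
#
#   docker run --runtime=habana --cap-add=sys_nice --ipc=host -p 8080:80 \
#       -v $PWD/data:/data -e HF_TOKEN=<your-access-token> \
#       ghcr.io/huggingface/text-generation-inference:latest-gaudi \
#       --model-id meta-llama/Llama-3.1-8B-Instruct
#
# (Adding --num-shard <n> enables the multi-card sharding mentioned above.)
import requests

TGI_URL = "http://localhost:8080/generate"  # host port mapped in the docker run above

payload = {
    "inputs": "What is deep learning?",
    "parameters": {"max_new_tokens": 64},
}

response = requests.post(TGI_URL, json=payload, timeout=120)
response.raise_for_status()

# The /generate endpoint returns a JSON object containing the generated text.
print(response.json()["generated_text"])
```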

For users seeking detailed instructions, how-to guides, and advanced configuration options, comprehensive documentation is available in the TGI Gaudi backend documentation.

The current release has been optimized to run several models at peak performance on Intel Gaudi hardware, in both single-card and multi-card configurations. These include:

  • Llama 3.1 (8B and 70B)
  • Llama 3.3 (70B)
  • Llama 3.2 Vision (11B)
  • Mistral (7B)
  • Mixtral (8x7B)
  • CodeLlama (13B)
  • Falcon (180B)
  • Qwen2 (72B)
  • Starcoder and Starcoder2
  • Gemma (7B)
  • Llava-v1.6-Mistral-7B
  • Phi-2
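
As a small illustration (not part of the original announcement), any of these models can be queried the same way once deployed. The sketch below assumes a TGI endpoint exposed locally on port 8080 and uses the huggingface_hub client to stream tokens, one of the production features mentioned earlier.

```python
# Sketch: streaming tokens from a deployed TGI server with huggingface_hub.
# The endpoint URL is an assumption; point it at wherever your container is exposed.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# stream=True yields tokens as they are generated instead of waiting for the full reply.
for token in client.text_generation(
    "Write one sentence about AI accelerators.",
    max_new_tokens=60,
    stream=True,
):
    print(token, end="", flush=True)
print()
```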

Advanced features enabled on Gaudi include FP8 quantization through the Intel Neural Compressor, which can further boost performance. More information on this optimization process is available in the Intel Neural Compressor documentation.

Looking ahead, the team plans to expand the supported model lineup in future releases.

Hugging Face encourages the community to explore TGI on Gaudi hardware and share feedback. Detailed documentation, contribution guidelines, and avenues for providing input are accessible through the TGI Gaudi backend documentation. This continuous effort to integrate Intel Gaudi support underscores the commitment to delivering flexible, efficient, and production-ready tools for large language model deployment.
