Hugging Face has expanded its model catalog with the new generation of large language models from Meta. The latest additions, Llama 4 Maverick and Llama 4 Scout, represent a significant advance in multimodal and Mixture-of-Experts (MoE) architectures. Both are auto-regressive models designed natively to handle text and image inputs.
The Llama 4 family comprises two models. Llama 4 Maverick activates 17 billion parameters per token out of roughly 400 billion total, spread across 128 experts, while Llama 4 Scout also activates 17 billion parameters but from a total of approximately 109 billion across 16 experts. Both were trained on up to 40 trillion tokens covering 200 languages, with dedicated fine-tuning for twelve of the most widely used ones.
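To make the active-versus-total parameter distinction concrete, the sketch below implements a toy top-k routed MoE layer: a router scores the experts for each token and only the selected experts run, so most parameters stay idle on any given forward pass. This is an illustration of the general mechanism with made-up dimensions, not Meta's implementation.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy Mixture-of-Experts layer: a router picks the top-k experts per
    token, so only a fraction of the total parameters is active per input."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                   # send tokens to experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# 16 experts with 1 active per token, echoing Scout's expert count.
layer = ToyMoELayer(d_model=64, n_experts=16, top_k=1)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```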
Integration into the Hugging Face ecosystem has been achieved in multiple ways to ensure that the community can quickly begin exploring these state-of-the-art models. Key integrations include:
- Model Availability: Both models are accessible on the Hugging Face Hub under Meta’s organization, in base and instruction-tuned variants. Accessing the weights requires accepting the license terms in the model card.
- Transformers Integration: The models are fully supported by the Hugging Face Transformers library (version 4.51.0 or later), enabling straightforward loading, inference, and fine-tuning of both text-only and native multimodal inputs (see the loading sketch after this list).
- Scalable Deployment: Through Text Generation Inference (TGI), the models offer high-throughput text generation suitable for production applications. Llama 4 Scout, for instance, benefits from on-the-fly 4-bit or 8-bit quantization, allowing it to run on a single server-grade GPU (a quantized-loading sketch follows the list).
- Tensor-Parallel and Device Mapping: Built-in support for tensor parallelism and automatic device mapping makes it easier to extract good performance from a range of hardware configurations (see the tensor-parallel sketch below).
- Xet Storage: Integration with the Xet storage backend improves upload and download speeds, with deduplication rates of roughly 25-40% on derivative models, saving time and bandwidth for community projects (a download example follows the list).
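As mentioned in the Transformers item above, loading and multimodal inference follow the library's usual processor/model pattern. The sketch below uses the Scout instruct checkpoint and a placeholder image URL; verify the exact class names and chat format against the model card before relying on them.

```python
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",            # spread layers across available GPUs
    torch_dtype=torch.bfloat16,
)

# A chat turn mixing an image with a text question (the URL is a placeholder).
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/cat.png"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:]))
```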
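For the on-the-fly quantization mentioned in the deployment item, one route (an assumption here; TGI exposes its own quantization options at launch time) is to load the weights in 4-bit via bitsandbytes:

```python
import torch
from transformers import BitsAndBytesConfig, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Quantize to 4-bit NF4 while loading; compute still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Swapping `load_in_4bit` for `load_in_8bit=True` gives the 8-bit variant; the right choice depends on the GPU memory available.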
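The tensor-parallel item can be sketched with the `tp_plan` argument available in recent Transformers releases (assuming it is supported for this checkpoint); `device_map="auto"`, used above, is the simpler alternative that places whole layers on devices instead of sharding individual weight matrices.

```python
import torch
from transformers import Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

# Shard individual weight matrices across the GPUs of the job; launch with
# e.g. `torchrun --nproc-per-node=8 script.py` so each rank holds one shard.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    tp_plan="auto",               # let Transformers choose the sharding plan
    torch_dtype=torch.bfloat16,
)
```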
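Finally, the Xet backend is transparent to clients: the standard huggingface_hub download calls benefit automatically, since chunk-level deduplication means blocks shared across revisions are not transferred twice. A minimal sketch:

```python
from huggingface_hub import snapshot_download

# Download the full repository into the local Hugging Face cache; on a
# Xet-backed repo, chunks already present locally are not re-downloaded.
local_dir = snapshot_download("meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(local_dir)
```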
The models are available under a custom community license. Detailed usage instructions, covering multimodal examples, quantization, and advanced configuration options, can be found on the respective model cards hosted by Meta on the Hugging Face Hub. Hugging Face also provides straightforward guidance for integrating the models via the Transformers library, so even newcomers to the platform can get started quickly.
In terms of performance, evaluations show state-of-the-art results: benchmarks highlight gains in reasoning and knowledge tasks over previous generations, with the instruction-tuned Maverick and Scout surpassing earlier models on well-known tests such as MMLU and GPQA Diamond.
The release of Llama 4 is the result of an extensive collaborative effort involving teams across the globe. Contributors from the Transformers team, the vLLM group, and the Xet storage team played essential roles in overcoming engineering challenges, refining integrations, and optimizing the models for high-performance deployment. This close collaboration between Hugging Face, Meta, and community partners underscores a shared commitment to powering advanced AI research and applications.
For further reading and detailed technical insights, readers are encouraged to explore the Xet storage blog post and the corresponding Hub documentation. More information on Meta’s vision for multimodal intelligence can be found in their official blog post.