Intel has unveiled its AutoRound algorithm, designed to make serving Large Language Models (LLMs) more efficient on Intel CPUs and GPUs. The release arrives ahead of planned MXFP8 and MXFP4 support on the upcoming Crescent Island GPU, promising a smoother path for low-precision AI deployment.
Revolutionizing LLM Delivery with AutoRound
AutoRound, a post-training quantization (PTQ) algorithm developed by Intel, is now integrated into the LLM Compressor toolkit. The integration brings several practical benefits:
- Higher accuracy for low bit-width quantization
- Lightweight tuning (hundreds of steps, not thousands)
- Zero additional inference overhead
- Seamless compatibility with compressed-tensors and direct serving in vLLM
- Streamlined workflow: quantize and serve models with just a few lines of code
Support for broader quantization schemes and wider model coverage is planned; users are encouraged to try the integration now and help shape future development.
Decoding AutoRound’s Capabilities
AutoRound optimizes LLMs and Vision-Language Models (VLMs) by introducing a small set of trainable parameters for each quantized tensor. It processes decoder layers sequentially, using signed gradient descent to minimize the block-wise output reconstruction error. Its core strength is delivering superior accuracy at very low bit-widths. It supports multiple data formats, including W4A16, MXFP8, MXFP4, FP8, and NVFP4, with more on the way, letting users trade precision for accuracy and efficiency across a range of models.

Quantized models produced by AutoRound can be deployed in low-bit formats for faster inference on Intel Xeon processors, Intel Gaudi AI accelerators, Intel Data Center GPUs, and Intel Arc B-Series graphics, as well as CUDA-based devices.
Future Prospects with Crescent Island
Looking ahead, Intel is set to incorporate native support for FP8, MXFP8, and MXFP4 formats in their upcoming Intel Data Center GPU, codenamed Crescent Island. This strategic move ensures that models quantized with AutoRound can fully leverage the potential of these formats, providing a streamlined path from innovation to deployment across Intel’s AI hardware offerings.