
Panasonic HD develops "SparseVLM" technology that doubles the processing speed of Vision-Language Models
Figure 1: Comparison of “SparseVLM” and existing sparsification methods (quoted from the accepted paper)
Osaka, Japan, July 4, 2025 – Panasonic R&D Company of America (PRDCA) and Panasonic Holdings Co., Ltd. (Panasonic HD), in collaboration with researchers from Peking University, Fudan University, University of California, Berkeley, and Shanghai Jiao Tong University, have developed “SparseVLM,” a technology that speeds up Vision-Language Models (VLMs), AI models that understand and process both visual data, such as images and videos, and text data.
In recent years, VLMs have developed rapidly. These models process visual and textual information simultaneously and can answer questions about visual content. However, handling large amounts of data, especially high-resolution images and long videos, leads to longer inference times and higher computational complexity. “SparseVLM” takes a novel approach: it focuses solely on the visual information relevant to the input prompt (Figure 1), significantly reducing inference time and computational complexity while maintaining high accuracy in answering questions about images.
This research has been accepted for presentation at the 42nd International Conference on Machine Learning (ICML 2025), one of the premier conferences for AI and machine learning research. The conference will take place in Vancouver, Canada, from July 13 to July 19, 2025.
Figure 2: Structure and processing of “SparseVLM” (quoted from the accepted paper)
Panasonic HD and PRDCA are working on developing highly efficient generative AI in collaboration with the universities that led this research. In recent years, VLMs, which process visual and textual information simultaneously, have attracted attention. They incorporate large language models (LLMs) to leverage their reasoning and recognition capabilities. Because such VLMs integrate visual tokens extracted from images or videos with text tokens and feed them into the LLM, the amount of information the LLM must process grows, especially with high-resolution images and long videos. Visual tokens that are not required to generate an answer must also be processed, leading to longer inference times and higher computational complexity.
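For readers unfamiliar with this architecture, the following is a minimal sketch of why visual tokens dominate a VLM's workload. It is our own illustration, not code from the paper, and the token counts and dimensions are made-up examples:

```python
import torch

# Illustrative sizes only, not figures from the paper: a high-resolution
# image or long video can yield thousands of visual tokens, while the
# text prompt contributes only a few dozen.
num_visual_tokens = 2304   # grows with image resolution / video length
num_text_tokens = 32       # e.g. "What is written on this blue sign?"
hidden_dim = 4096          # LLM embedding width

visual_tokens = torch.randn(1, num_visual_tokens, hidden_dim)  # vision encoder output, projected
text_tokens = torch.randn(1, num_text_tokens, hidden_dim)      # embedded prompt tokens

# A typical VLM simply concatenates both streams and feeds the result to
# the LLM. Self-attention cost scales roughly quadratically with the
# combined sequence length, so redundant visual tokens dominate inference.
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 2336, 4096])
```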
Several approaches have been proposed to speed up VLMs by exploiting the redundancy of unnecessary visual tokens. However, these existing methods typically select visual tokens based solely on the image and perform sparsification without considering their relevance to the input text prompt. Consequently, they remain inefficient because they still process visual tokens irrelevant to the prompt, leaving room for improvement.
To overcome this challenge, we propose a lightweight VLM method called “SparseVLM,” which processes only the visual tokens relevant to the input prompt and ignores the rest. Specifically, SparseVLM identifies the words in the prompt that are relevant to the image or video input, then processes only the visual tokens associated with those words (Figure 2). This enables efficient selection and processing of only the visual tokens needed to answer a question such as “What is written on this blue sign?” by focusing on the sign area in the upper-right corner of the image (Q-1 in Figure 1).
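The sketch below illustrates this idea in simplified form. It is our own hypothetical rendering, not the paper's exact algorithm: the function name, the scoring rule (mean text-to-visual attention), and the keep ratio are all assumptions made for illustration:

```python
import torch

def select_relevant_visual_tokens(visual_tokens, text_tokens, keep_ratio=0.5):
    """Hypothetical helper (not the paper's exact procedure): score each
    visual token by the attention it receives from the prompt's text
    tokens, then keep only the highest-scoring fraction. The scores reuse
    attention-style products the model computes anyway, so no additional
    training data or fine-tuning is required."""
    d = visual_tokens.shape[-1]
    # Cross-attention weights: how strongly each text token attends to each visual token.
    attn = torch.softmax(text_tokens @ visual_tokens.transpose(-2, -1) / d**0.5, dim=-1)
    relevance = attn.mean(dim=-2)  # one relevance score per visual token: (batch, num_visual)
    k = max(1, int(keep_ratio * visual_tokens.shape[1]))
    keep = relevance.topk(k, dim=-1).indices.sort(dim=-1).values  # preserve original token order
    index = keep.unsqueeze(-1).expand(-1, -1, d)
    return torch.gather(visual_tokens, 1, index)

# Example: halve the visual tokens before they enter the LLM.
vis = torch.randn(1, 2304, 4096)
txt = torch.randn(1, 32, 4096)
print(select_relevant_visual_tokens(vis, txt).shape)  # torch.Size([1, 1152, 4096])
```

Because the relevance scores are derived from attention the model already computes, this style of pruning needs no extra supervision, which is the advantage described next.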
Furthermore, most existing methods require additional data and training to select relevant tokens, whereas SparseVLM identifies and selects the tokens relevant to the prompt without any additional training, offering a significant advantage in flexibility and ease of use.
Figure 3: Evaluation results (quoted from the accepted paper)
In the evaluation, we compared the performance of the proposed method against existing approaches across eight visual question answering benchmarks. Relative to the baseline model, our method achieves a 48.3% reduction in latency and a 71.9% decrease in FLOPs while retaining 89.3% of the baseline's average accuracy, confirming its superior performance over existing methods (Figure 3).
The newly developed “SparseVLM” approximately doubles processing speed (a 48.3% reduction in latency corresponds to roughly 1.9× faster inference) while maintaining visual question answering accuracy, because it takes the input prompt into account, which previous VLM speedup methods did not. It is expected to be applicable in various fields that require rapid recognition and verbalization of the user's state and surrounding environment based on visual information.
Going forward, Panasonic HD will continue to accelerate the social implementation of AI and to promote research and development of AI technologies that benefit our customers' lives and workplaces.
“SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference”
This research was developed within the framework of the BAIR Open Research Commons (*1) led by UC Berkeley, and is the result of collaboration among Yuan Zhang, Chun-Kai Fan and Shanghang Zhang from Peking University, Junpeng Ma from Fudan University, Wenzhao Zheng and Kurt Keutzer from UC Berkeley, Denis Gudovskiy from PRDCA, and Tomoyuki Okuno and Yohei Nakata from Panasonic HD.
arXiv: https://arxiv.org/abs/2410.04417
Panasonic×AI website: https://tech-ai.panasonic.com/
*1 The BAIR Open Research Commons was established as a collaborative platform for top researchers worldwide, bringing together industry and academia. As of 2025, it has 10 participating companies, including Panasonic HD, Google, and Meta.