
Nov 27, 2025
Company / Press Releases
Panasonic R&D Company of America (hereinafter PRDCA) and Panasonic Holdings Corporation (hereinafter Panasonic HD), in collaboration with researchers at UCLA, have developed LaViDa, a diffusion-based multimodal AI*1.
In recent years, multimodal AI has improved performance by scaling up models, following the success of large language models; however, conventional autoregressive text generation has faced the problem that generation time increases as the amount of text grows. LaViDa instead generates text with a diffusion model, which allows the generation speed to be adjusted, achieving roughly a twofold speedup while maintaining accuracy comparable to existing autoregressive methods.
The novelty of this technology has been internationally recognized, and it has been accepted for presentation at NeurIPS 2025, a leading AI/ML conference, to be held in San Diego, USA, from December 3 to December 5, 2025.
Panasonic HD and PRDCA are conducting research on multimodal AI. Recently, in the field of large language models, approaches that generate text, programming code, and other outputs using diffusion models have emerged and attracted much attention because they can produce text faster than conventional autoregressive models. Moreover, autoregressive methods, which generate one token at a time from the start of a sentence, cannot fully account for the overall structure of a text, so they struggle with tasks that require producing text under specific format constraints, such as poetry; this makes diffusion models a promising alternative. However, in the multimodal AI domain, which handles both images and language, autoregressive methods have remained dominant, and diffusion-based approaches had not yet been introduced.
Figure 1: The mechanism of a diffusion language model
The mechanism for generating discrete data such as text using diffusion models consists of a masking process, in which each token*2 that makes up a sentence is randomly replaced with a mask token, and an unmasking process, in which the original tokens are restored from the masked tokens (Figure 1). However, simply replacing an autoregressive multimodal AI model with a diffusion model gives rise to two problems: (i) the attention*3 computations required for generating tokens each time become heavy, and (ii) there is a high risk that words important for image captions will not be learned (i.e., will not be masked).
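The masking and unmasking processes described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual LaViDa implementation; the `MASK` token, `mask_tokens`, and the toy `predict` callback are our own illustrative names, and a real model would predict masked tokens with a learned denoiser rather than a lookup.

```python
import random

# Illustrative mask token; real tokenizers use a reserved vocabulary entry.
MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio, rng):
    """Forward (noising) process: randomly replace a fraction of tokens with MASK."""
    masked = list(tokens)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in rng.sample(range(len(tokens)), n_mask):
        masked[i] = MASK
    return masked

def unmask_step(masked, predict):
    """One reverse (denoising) step: fill each MASK with the model's prediction.
    `predict(seq, i)` stands in for the learned denoiser."""
    return [predict(masked, i) if tok == MASK else tok
            for i, tok in enumerate(masked)]

sentence = ["a", "cat", "sits", "on", "the", "mat"]
noisy = mask_tokens(sentence, 0.5, random.Random(0))
# A toy "denoiser" that recovers the original token by position:
restored = unmask_step(noisy, lambda seq, i: sentence[i])
```

During training, the model sees sentences noised at various mask ratios and learns to invert the masking; at inference, it starts from a fully masked answer and unmasks several tokens per step.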
The problem in (i)—the heavy attention computations—stems from the fact that, in an autoregressive model, the tokens to be predicted only need to attend to previously generated past tokens, whereas in a diffusion model the attention computations must always reference all tokens in the entire sentence.
To reduce computation, we introduced a technique that excludes the answer tokens from the attention calculations between the input image tokens and the question tokens. The attention mask shown in Figure 2 highlights (in green) the query–key pairs over which attention is computed. Rows represent queries and columns represent keys; I denotes the image-token sequence, P the question-token sequence, and Xt the answer-token sequence at step t. In the autoregressive case (left), the attention map only needs to consider past tokens, so the white areas in the figure require no computation. In the diffusion model (center), attention must be computed over all tokens. In our proposed Prefix-DLM (right), we improve efficiency by removing the answer tokens from the attention calculations involving image and question tokens (the white areas).
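The three blockwise attention patterns in Figure 2 can be expressed as boolean masks over the concatenated [image | question | answer] sequence. The sketch below is our own illustration of the idea (the function and mode names are not from the paper): `True` marks query–key pairs where attention is computed, matching the green regions described above.

```python
import numpy as np

def attention_mask(n_img, n_q, n_ans, mode):
    """Boolean attention mask: rows are queries, columns are keys.
    Sequence order is [image tokens | question tokens | answer tokens]."""
    n = n_img + n_q + n_ans
    if mode == "autoregressive":
        # Causal: each token attends only to itself and past tokens.
        return np.tril(np.ones((n, n), dtype=bool))
    if mode == "diffusion":
        # Vanilla diffusion: full bidirectional attention over all tokens.
        return np.ones((n, n), dtype=bool)
    if mode == "prefix_dlm":
        # Prefix-DLM: image and question queries do not attend to answer keys,
        # so the prefix's attention can be computed once and reused across steps.
        m = np.ones((n, n), dtype=bool)
        m[: n_img + n_q, n_img + n_q:] = False
        return m
    raise ValueError(f"unknown mode: {mode}")
```

Because the image–question prefix never attends to the (changing) answer tokens, its key/value activations stay constant across diffusion steps and can be cached, which is the source of the computational savings.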
To address issue (ii), during training we prepare two complementary masking patterns for the same sentence so that the tokens masked in one pattern are not masked in the other, and we train both unmasking processes. This ensures that every token in the sentence is learned.
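Complementary masking can be sketched as follows. This is a simplified illustration under our own naming (the real training recipe operates on token IDs inside the loss computation): the positions are split into two sets, and each view masks exactly one of the sets, so every token is masked, and therefore learned, in exactly one of the two views.

```python
import random

MASK = "[MASK]"  # illustrative mask token

def complementary_masks(tokens, rng):
    """Produce two masked views of the same sentence such that each
    position is masked in exactly one view."""
    idx = list(range(len(tokens)))
    rng.shuffle(idx)
    half = set(idx[: len(idx) // 2])
    view_a = [MASK if i in half else t for i, t in enumerate(tokens)]
    view_b = [t if i in half else MASK for i, t in enumerate(tokens)]
    return view_a, view_b

tokens = ["the", "red", "bus", "stops", "here"]
view_a, view_b = complementary_masks(tokens, random.Random(0))
```

Training the unmasking process on both views guarantees coverage: a caption word that happens to escape masking in one view is necessarily masked in the other.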
Figure 2: Attention masks of each method
Figure 3: Complementary masking
*1: Multimodal AI: AI that can handle multiple types of information, such as images and text, simultaneously.
*2: Tokens: the smallest units (words or subwords) into which text or other data is divided to make processing easier.
*3: Attention: a mechanism in transformers that computes the relevance between parts of an input sequence and selects and integrates information according to its importance.
Figure 4: Comparison of performance and efficiency
In evaluation experiments, we validated effectiveness on a variety of datasets, ranging from natural-image Q&A tasks to mathematical and scientific proof problems and document-understanding tasks containing many charts and graphs. Figure 4 shows (left) performance comparisons with existing methods on each evaluation dataset and (right) a plot with generation time on the x-axis and generation performance on the y-axis. In the right-hand plot, NFE (number of function evaluations) indicates the number of diffusion steps relative to the sequence length. A lower NFE means more tokens are generated per step, which reduces the number of steps required and thus speeds up generation, but performance tends to decrease, creating a trade-off.
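The NFE trade-off reduces to simple arithmetic, sketched below under our own naming; the numbers are illustrative and ignore per-step cost differences between the two architectures.

```python
import math

def steps_and_speedup(seq_len, nfe_fraction):
    """Illustrative NFE arithmetic: with NFE = nfe_fraction * seq_len diffusion
    steps, each step unmasks seq_len / steps tokens on average, while an
    autoregressive model needs seq_len steps (one token per step)."""
    steps = max(1, math.ceil(seq_len * nfe_fraction))
    tokens_per_step = seq_len / steps
    speedup_vs_autoregressive = seq_len / steps
    return steps, tokens_per_step, speedup_vs_autoregressive

# e.g. a 64-token answer at NFE = 50% takes 32 steps at 2 tokens per step,
# consistent with the roughly twofold speedup reported for LaViDa.
result = steps_and_speedup(64, 0.5)
```

Lowering the NFE fraction further would unmask more tokens per step and raise the nominal speedup, at the cost of the accuracy degradation shown on the right of Figure 4.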
LaViDa achieved higher performance than existing autoregressive methods on all datasets, and we also confirmed that it can generate text faster than existing autoregressive approaches.
Figure 5: Generation of text in structured formats
The left figure shows generation results for a poetry task, comparing LaViDa with an autoregressive model.
The right figure shows generation outputs that represent image understanding in JSON format.
Figure 5 shows examples of generating image content in structured formats such as poetry and JSON. With conventional autoregressive methods, it was necessary to specify detailed rules in the prompt, which often caused misinterpretation; with LaViDa, however, the unmasking mechanism enables these formats to be generated naturally.
LaViDa is the first multimodal AI to generate text using diffusion models. The Panasonic Group is currently promoting the deployment of AI agents across the company to improve operational efficiency and automation, but the various document and material formats found in the field are not always easy for AI agents to understand. By using this method to standardize those documents into a unified format, we aim to accelerate the adoption of AI agents.
Going forward, Panasonic HD will continue to accelerate the social implementation of AI and promote research and development of AI technologies that contribute to customers' daily lives and workplaces.
arXiv link https://arxiv.org/abs/2505.16839
The content in this website is accurate at the time of publication but may be subject to change without notice.
Please note therefore that these documents may not always contain the most up-to-date information.