Jul 16, 2024
Company / Press Release
Developing an AI multi-agent system that can simultaneously handle visual and language information
Tokyo, Japan – Panasonic Connect Co., Ltd. (https://connect.panasonic.com/en/) participated in the "Ego4D EgoSchema Challenge*1" competition held at CVPR 2024, a top-tier conference for image recognition. The company tackled the Video Question Answering (VideoQA) task, which involves watching a three-minute video and selecting the most appropriate answer from five choices to questions such as "Based on the actions of the person in the video, infer what the person is trying to do." The generative AI multi-agent system it developed used AI recognition technology to answer these questions with a 71% accuracy rate, earning second place in the world.
Unlike recognizing a still image, understanding the content of a video and answering questions about it requires recognizing and judging important actions, and their objectives, as they change over time across a long video, while simultaneously interpreting the questions and answer choices expressed in language. This demands advanced technology for the mutual understanding of video and language.
Building on its research and development experience in action anticipation tasks on video, including an award at the same conference in 2021*2, Panasonic Connect took on the Video Question Answering task, which lies at the intersection of video and natural language processing.
In approaching this competition, we focused on leveraging large language models (LLMs), which can generate appropriate answers to a wide variety of questions. Recent LLM research has explored ways to improve answer accuracy by having multiple AI agents debate one another*3 and by adopting a multi-persona approach in which AI agents take on several roles*4. Building on these studies, we developed a novel Video Question Answering system, VDMA (Video Question Answering with Dynamically Generated Multi-Agents), which analyzes videos using multiple automatically generated agents.
In this approach, an AI analyzes the question and the video and dynamically generates multiple expert AI agents. Just as one would consult a human expert in, say, a finance department about matters one does not understand, the AI considers which experts should be consulted for the problem at hand and dynamically generates the corresponding expert AIs to solve it. The detailed approach we took in this development is as follows.
In this competition, the AI first analyzes the question and the video, generates multiple specialists suited to answering the question, and produces instructions for each specialist. When these specialists are generated, a supervising agent AI (organizer) is also dynamically created to summarize the specialists' opinions. For example, three agents might be generated: a supervisor (1), an expert in object detection (2), and an expert in handcrafts (3). The specialist agent AIs, such as those for object detection and handcrafts, analyze the content of the video and respond to the question prompt following the instructions they were given. Finally, the supervisor weighs the responses from each specialist agent AI and selects the most appropriate answer from the multiple choices.
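To make this flow concrete, the sketch below outlines how such a dynamically generated multi-agent pipeline could be wired together. It is a minimal illustration under assumptions, not the published VDMA source code*6: the `ask_llm` callable, the prompts, and the `ExpertAgent` structure are hypothetical, and the video is represented by a plain text summary rather than actual frame analysis.

```python
# Hedged sketch of a dynamically generated multi-agent VideoQA pipeline.
# Illustrative only; names, prompts, and data structures are assumptions,
# not the published VDMA implementation.
from dataclasses import dataclass
from typing import Callable, List

# ask_llm(prompt) -> str is assumed to wrap whatever LLM backend is used.
AskLLM = Callable[[str], str]


@dataclass
class ExpertAgent:
    role: str          # e.g. "object detection expert", "handcrafts expert"
    instructions: str  # role-specific instructions produced in the planning step


def generate_agents(question: str, video_summary: str, ask_llm: AskLLM) -> List[ExpertAgent]:
    """Step 1: ask the LLM which specialists are needed and how to brief them."""
    plan = ask_llm(
        "Given this question about a first-person video, list the expert roles "
        "needed to answer it, one per line as 'role: instructions'.\n"
        f"Question: {question}\nVideo summary: {video_summary}"
    )
    agents = []
    for line in plan.splitlines():
        if ":" in line:
            role, instructions = line.split(":", 1)
            agents.append(ExpertAgent(role.strip(), instructions.strip()))
    return agents


def run_expert(agent: ExpertAgent, question: str, video_summary: str,
               choices: List[str], ask_llm: AskLLM) -> str:
    """Step 2: each specialist analyzes the video evidence from its own viewpoint."""
    return ask_llm(
        f"You are a {agent.role}. {agent.instructions}\n"
        f"Video summary: {video_summary}\nQuestion: {question}\n"
        f"Choices: {choices}\nGive your reasoning and preferred choice."
    )


def supervise(question: str, choices: List[str],
              expert_opinions: List[str], ask_llm: AskLLM) -> str:
    """Step 3: the organizer weighs the specialists' opinions and picks one choice."""
    joined = "\n---\n".join(expert_opinions)
    return ask_llm(
        "You are the supervising organizer. Considering the expert opinions below, "
        "answer the question by returning exactly one of the choices.\n"
        f"Question: {question}\nChoices: {choices}\nOpinions:\n{joined}"
    )


def answer_video_question(question: str, video_summary: str,
                          choices: List[str], ask_llm: AskLLM) -> str:
    agents = generate_agents(question, video_summary, ask_llm)
    opinions = [run_expert(a, question, video_summary, choices, ask_llm) for a in agents]
    return supervise(question, choices, opinions, ask_llm)
```

In practice, the video evidence would come from vision or captioning models that the specialist agents can query; the text summary stands in for that here to keep the sketch short.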
This method achieved an accuracy rate approaching the roughly 76% attained by human respondents*5. The AI multi-agent system is a proprietary technology developed by Panasonic Connect and is currently patent pending. The source code for this technology has been made available on GitHub, and a technical report has been published on arXiv*6.
Until now, image recognition technology has been limited to understanding events at a specific time and place, such as whether people are present, what objects exist, or whether someone is holding an item. With the development of the AI multi-agent system, it has become possible to understand and infer the context of long video content spanning events from the past. By understanding why a person is there, what they are holding, and what they are trying to do, we believe we can recommend the person's next actions and extract actions that need correction or improvement from the series of past actions leading up to the present.
For instance, this development could be used to identify inefficient work processes within the production stages of a factory and enable process reevaluation, contributing to improvements on the manufacturing floor. Looking ahead, we will focus on further development aimed at leveraging the AI multi-agent system in areas such as supply chain management, manufacturing, logistics, and distribution, the business domains to which Panasonic Connect is committed.
Embracing the purpose of "Change Work, Advance Society, Connect to Tomorrow," we will continue to drive innovation on the front lines in order to realize a sustainable society where diverse people can live happily.
*1: This is a competition in which first-person perspective videos, questions regarding the videos, and five answer choices for the questions are provided. The aim is to select the most fitting answer for the question about the video content (Video Question Answering: VideoQA task).
*2: This was one of the workshops at CVPR 2021, the world's premier international conference on computer vision. Panasonic achieved second place in the action anticipation track of the "EPIC-KITCHENS-100 2021 Challenges" contest, announced on July 9, 2021 (Japanese only).
https://news.panasonic.com/jp/press/jn210709-1
Egocentric Live 4D Perception (Ego4D)
https://ego4d-data.org/
*3: Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv, abs/2305.14325, 2023.
*4: Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. arXiv preprint arXiv:2307.05300, 2023.
*5: Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36, 2024.
*6: Source code: https://github.com/PanasonicConnect/VDMA
Technical report: https://arxiv.org/abs/2407.03610
Database: A large-scale first-person video database, supporting research in multi-modal machine perception for daily life activity. Ego4D Consortium 2020.
https://ego4d-data.org/
About Panasonic Connect
Panasonic Connect Co., Ltd. (https://connect.panasonic.com/en/) was established on April 1, 2022 as part of the Panasonic Group’s (https://www.panasonic.com/global/) switch to an operating company system. With roughly 28,300 employees worldwide and annual sales of JPY 1,202.8 billion, the company plays a central role in the growth of the Panasonic Group’s B2B solutions business and provides new value to its customers by combining advanced hardware, intelligent software solutions, and a wealth of knowledge in industrial engineering accumulated in its over 100-year history. The company’s purpose is to “Change Work, Advance Society, Connect to Tomorrow.” By driving innovation in the supply chain, public services, infrastructure, and entertainment sectors, Panasonic Connect aims to contribute to the realization of a sustainable society and to ensure well-being for all.
The content in this website is accurate at the time of publication but may be subject to change without notice.
Please note therefore that these documents may not always contain the most up-to-date information.