7 May 2024 | Zhixuan Chu, Lei Zhang, Yichen Sun, Siqiao Xue, Zhibo Wang, Zhan Qin, Kui Ren
SoraDetector is a unified framework for detecting hallucinations in large text-to-video (T2V) models, including the Sora model. The framework analyzes hallucination phenomena by categorizing them according to how they manifest in video content, and it addresses the challenges of T2V hallucination detection by combining keyframe extraction, object detection, knowledge graph construction, and multimodal large language models. It first evaluates the consistency between the extracted video content and the textual prompt, then constructs static and dynamic knowledge graphs to detect hallucinations both within single frames and across frames, yielding a robust and quantifiable measure of prompt consistency and of static and dynamic hallucinations.

Additionally, the framework introduces the Sora Detector Agent, which automates the detection process and generates video quality reports, along with the T2VHaluBench benchmark for evaluating advances in T2V hallucination detection. Experiments on videos generated by Sora and other T2V models demonstrate that the framework detects hallucinations accurately. The code and dataset are available at https://github.com/TruthAI-Lab/SoraDetector.

SoraDetector defines three types of hallucinations: prompt consistency hallucinations, static hallucinations, and dynamic hallucinations. Static hallucinations involve unrealistic or inconsistent objects, textures, or scenes within individual frames, while dynamic hallucinations involve temporal inconsistencies and abnormalities in object motion and behavior across frames. A detailed frame extraction scheme and a dynamic knowledge graph construction process further enhance detection accuracy, and the results of static hallucination detection feed into dynamic hallucination detection to improve both efficiency and accuracy. Finally, the framework aggregates the results of consistency, static, and dynamic hallucination detection into a comprehensive set of hallucination issues. The sketches below illustrate how such a pipeline might look in code.
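The abstract mentions a detailed frame extraction scheme but does not specify it. As a minimal sketch, assuming a simple color-histogram difference criterion with OpenCV (the threshold value and the function name are illustrative, not the paper's method):

```python
import cv2
import numpy as np

def extract_keyframes(video_path: str, diff_threshold: float = 0.3) -> list[np.ndarray]:
    """Select keyframes whose color-histogram distance to the previous
    keyframe exceeds a threshold. Illustrative only; the paper's actual
    frame extraction scheme may differ."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Coarse 8x8x8 BGR histogram as a cheap frame signature.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        # Bhattacharyya distance: 0 = identical, 1 = maximally different.
        if prev_hist is None or \
           cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > diff_threshold:
            keyframes.append(frame)
            prev_hist = hist
    cap.release()
    return keyframes
```

A production scheme would likely also cap the number of keyframes or sample adaptively; the Bhattacharyya distance is just one of several histogram comparison metrics OpenCV exposes.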
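For the consistency check between prompt and video content, one plausible reading is that both are reduced to knowledge-graph triples and compared. The triple-overlap score below is an assumed stand-in for the paper's actual metric, and the example triples are hypothetical; in practice they would come from parsing the prompt and from a multimodal LLM's structured description of each keyframe.

```python
Triple = tuple[str, str, str]  # (subject, relation, object)

def consistency_score(prompt_triples: set[Triple],
                      frame_triples: set[Triple]) -> float:
    """Fraction of prompt triples grounded in the frame's knowledge graph.
    A low score flags a potential prompt consistency hallucination."""
    if not prompt_triples:
        return 1.0
    return len(prompt_triples & frame_triples) / len(prompt_triples)

# Hypothetical usage with hand-written triples:
prompt_kg = {("dog", "runs_on", "beach"), ("dog", "color", "brown")}
frame_kg = {("dog", "runs_on", "beach"), ("dog", "color", "black"),
            ("sky", "contains", "sun")}
print(consistency_score(prompt_kg, frame_kg))  # 0.5 -> possible hallucination
```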
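For cross-frame (dynamic) detection and the final aggregation step, a coarse sketch might compare the object sets of consecutive keyframe graphs and merge the three stages' outputs into one report. Both the anomaly rule and the report structure here are assumptions, not the paper's implementation.

```python
Triple = tuple[str, str, str]  # (subject, relation, object)

def dynamic_anomalies(frame_graphs: list[set[Triple]]) -> list[dict]:
    """Flag abrupt changes between consecutive keyframe graphs, e.g. an
    object present in one keyframe that vanishes in the next. A coarse
    stand-in for the paper's dynamic knowledge graph analysis."""
    issues = []
    for i in range(len(frame_graphs) - 1):
        objs_now = {s for s, _, _ in frame_graphs[i]}
        objs_next = {s for s, _, _ in frame_graphs[i + 1]}
        for obj in objs_now - objs_next:
            issues.append({"frame": i, "type": "dynamic",
                           "detail": f"'{obj}' disappears between keyframes {i} and {i + 1}"})
        for obj in objs_next - objs_now:
            issues.append({"frame": i + 1, "type": "dynamic",
                           "detail": f"'{obj}' appears abruptly at keyframe {i + 1}"})
    return issues

def aggregate_report(consistency: float, static_issues: list[dict],
                     dynamic_issues: list[dict]) -> dict:
    """Merge the three detection stages into one report, mirroring the
    framework's final aggregation step (structure assumed)."""
    return {"prompt_consistency": consistency,
            "static": static_issues,
            "dynamic": dynamic_issues}
```

Reusing the static-stage graphs as the input to `dynamic_anomalies` reflects the abstract's point that static detection results feed into dynamic detection, since the per-frame graphs only need to be built once.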