An Architecture-Led Hybrid Report on the Body Language Detection Project
By: Thomson Tong, Diba Darooneh
Potential Business Impact:
Lets computers understand videos and describe what's happening.
This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates the output structure against a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculating about internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid yet semantically incorrect, schema validation checks structure rather than geometric correctness, person identifiers are frame-local under the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
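To make the structural-versus-semantic distinction concrete, the sketch below shows a minimal per-frame validator in Python. The field names (person_id, bbox, emotion) and the flat JSON layout are illustrative assumptions, not the repository's exact schema; the point is that a check like this confirms shape and types while saying nothing about whether a box actually encloses a person or whether the emotion label is right.

```python
import json
from typing import Any

# Illustrative per-frame contract (not the repository's exact schema):
# a JSON list of detections, each with a frame-local person_id, a
# pixel-space bounding box [x1, y1, x2, y2], and a prompt-conditioned
# attribute (emotion by default).
REQUIRED_PERSON_KEYS = {"person_id", "bbox", "emotion"}


def validate_frame_output(raw: str) -> list[dict[str, Any]]:
    """Structurally validate one frame's VLM output.

    Checks that the text parses as JSON and that each detection carries
    the expected keys with sanely ordered box coordinates. It deliberately
    does not check whether the box actually encloses a person or whether
    the emotion label is correct: output can pass this check while being
    semantically wrong.
    """
    people = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(people, list):
        raise ValueError("expected a JSON list of detections")
    for person in people:
        missing = REQUIRED_PERSON_KEYS - person.keys()
        if missing:
            raise ValueError(f"detection is missing keys: {missing}")
        x1, y1, x2, y2 = person["bbox"]  # pixel coordinates
        if not (x1 < x2 and y1 < y2):
            raise ValueError("bounding box coordinates are not ordered")
    return people


# person_id is frame-local: id 1 here need not refer to the same person
# as id 1 in any other frame under the current prompting contract.
frame_json = '[{"person_id": 1, "bbox": [120, 80, 340, 620], "emotion": "neutral"}]'
detections = validate_frame_output(frame_json)
print(detections)
```

Geometric and semantic checks, such as comparing predicted boxes and labels against human annotations, would have to come from a separate evaluation step, which is why the report treats schema validation and accuracy evaluation as distinct concerns.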
Similar Papers
Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions
CV and Pattern Recognition
Helps teachers see if students are confused.
Evaluation of Vision-LLMs in Surveillance Video
CV and Pattern Recognition
Helps computers spot unusual things in videos.
HMVLM: Human Motion-Vision-Language Model via MoE LoRA
CV and Pattern Recognition
Teaches computers to understand and create human movement.