For our Computer Vision (CIS 5810) final project, we created Pose Ninja, an interactive, webcam-based game that encourages physical movement through body tracking.
Special thanks to my teammates, Jaime and Jay, for their collaboration on this project.
Overview
This project is an interactive, webcam-based game that uses real-time body tracking to encourage movement and fast reactions. Players tap randomly appearing targets with specific body parts while avoiding incoming bombs, testing their accuracy and speed. To keep the game widely accessible, we designed it to run efficiently on CPU-only hardware. The game supports both single-player and two-player modes.
Audience
The game targets casual gamers, students, and anyone interested in motion-based interactive experiences. It also appeals to those who enjoy light physical activity and to tech enthusiasts curious about computer vision and body tracking. The multiplayer mode adds an engaging, competitive option for friends and families.
Pipeline
The system uses pre-trained MediaPipe models to detect pose, face, and hand landmarks, along with a YOLO segmentation model to isolate players from the background. Webcam frames are processed in real time to extract 3D body landmarks, which are rendered using OpenCV, while segmentation runs in parallel to minimize latency. These outputs are combined to create a tracked, segmented player model.
Gameplay generates random targets tied to specific body parts, and players score by moving the correct body part within a distance threshold of the target. In multiplayer mode, each player is color-coded, tracked consistently across frames, and given an individual score to support competitive play.
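As a rough illustration of the per-frame tracking step, here is a minimal sketch using MediaPipe's Pose solution and OpenCV; the parameters and drawing style are illustrative rather than our exact implementation.

```python
import cv2
import mediapipe as mp

# Minimal per-frame sketch: read webcam frames, extract pose landmarks, draw them.
mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_pose.Pose(model_complexity=0) as pose:  # lightest model for CPU-only use
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV captures BGR.
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            mp_draw.draw_landmarks(frame, results.pose_landmarks,
                                   mp_pose.POSE_CONNECTIONS)
        cv2.imshow("Pose Ninja", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```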
Evolution of a Game
The Foundation (MediaPipe and OpenCV)
We started with MediaPipe’s BlazePose model, which gave us 33 anatomical landmarks in real time. By calculating the Euclidean distance between a player’s joint and the target point, we had our core gameplay loop.
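The hit test itself is just a distance check. A minimal sketch, assuming normalized landmark coordinates and an illustrative threshold rather than our tuned value:

```python
import numpy as np

HIT_RADIUS = 0.08  # normalized image units; illustrative, not our tuned value

def is_hit(landmark, target_xy):
    """Return True when the tracked joint is within HIT_RADIUS of the target."""
    joint = np.array([landmark.x, landmark.y])
    return np.linalg.norm(joint - np.array(target_xy)) < HIT_RADIUS
```

In the game this check runs every frame for whichever landmark the current target demands.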
The Hands-Free Menu
We kicked things off by building a menu screen that players could navigate without touching a mouse. By using the MediaPipe Hands model to track 21 points on a single hand, we were able to calculate whether a player’s fist was open or closed based on the distance between their wrist and fingertips. This allowed players to select "Single Player" or "Two Player" modes just by hovering and "clicking" in mid-air.
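A sketch of the idea is below; the landmark indices come from MediaPipe's hand model, while the threshold value is an illustrative guess rather than the one we shipped.

```python
import numpy as np

FINGERTIPS = [4, 8, 12, 16, 20]   # thumb through pinky tips in MediaPipe's hand model
WRIST = 0

def fist_closed(hand_landmarks, threshold=0.25):
    """Treat the hand as a closed fist when the fingertips sit close to the wrist.

    `threshold` is an illustrative value in normalized image coordinates.
    """
    wrist = np.array([hand_landmarks.landmark[WRIST].x,
                      hand_landmarks.landmark[WRIST].y])
    tips = np.array([[hand_landmarks.landmark[i].x,
                      hand_landmarks.landmark[i].y] for i in FINGERTIPS])
    mean_dist = np.linalg.norm(tips - wrist, axis=1).mean()
    return mean_dist < threshold
```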
The UI Pivot
We initially attempted to build the game's interface using React to create a modern, web-based visual experience. While this looked great in theory, we discovered that the communication bridge between our Python backend and the React frontend created significant latency, even when using WebSockets. Because the video feed wasn't fast enough to support a reaction-based game, we scrapped the web approach and moved to the PyQt5 library, which provided much smoother visuals and lower overhead.
3D Depth
We tried to add a third dimension to the gameplay by utilizing the z-coordinate provided by MediaPipe landmarks. This allowed us to simulate depth, requiring players to not only align horizontally and vertically but also reach forward or backward to touch targets. Despite the added realism, we eventually removed the depth dimension because the z-axis values were often inconsistent, leading to "ghost misses" where the system failed to register a hit. We decided that reliability and fair scoring were more important than 3D complexity.
From the debug output we can see that both the targets and the body landmarks have (x, y, z) positions.
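Including z in the hit test is a one-line change, which is also why the noise hurt so much: jitter in z alone could push an otherwise correct touch outside the hit radius. A minimal sketch, with an illustrative radius:

```python
import numpy as np

DEPTH_HIT_RADIUS = 0.15  # illustrative; z is noisier, so the radius has to be looser

def is_hit_3d(landmark, target_xyz):
    """3D variant of the hit test using MediaPipe's relative z estimate."""
    joint = np.array([landmark.x, landmark.y, landmark.z])
    return np.linalg.norm(joint - np.array(target_xyz)) < DEPTH_HIT_RADIUS
```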
Multiplayer
To support multiple players, we first tried tracking people using centroid-based matching and later upgraded to the InsightFace library for face recognition. While face detection successfully helped the system "remember" players who left and re-entered the frame, the required 90-frame "calibration phase" felt slow and clunky for a casual game. We ultimately moved away from these methods in favor of a faster, more responsive tracking system that prioritized performance over persistent long-term identification.
When we leave the screen and come back, we still have the same player numbers.
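For reference, here is a simplified sketch of the centroid-based matching we started from; the greedy nearest-neighbor assignment and ID handling are illustrative, not our exact code.

```python
import numpy as np

def match_players(prev_centroids, new_centroids):
    """Greedy centroid matching: assign each detection to the nearest known player.

    prev_centroids maps player_id -> (x, y) from the last frame; new_centroids is
    a list of (x, y) detections in the current frame. Unmatched detections get new IDs.
    """
    assignments, used = {}, set()
    for pid, prev in prev_centroids.items():
        dists = [np.inf if i in used else np.linalg.norm(np.array(prev) - np.array(c))
                 for i, c in enumerate(new_centroids)]
        if dists and np.isfinite(min(dists)):
            best = int(np.argmin(dists))
            assignments[pid] = new_centroids[best]
            used.add(best)
    next_id = max(prev_centroids, default=-1) + 1
    for i, c in enumerate(new_centroids):
        if i not in used:
            assignments[next_id] = c
            next_id += 1
    return assignments
```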
RCNN
We experimented with using RCNN to generate high-quality, smooth silhouettes for multiple players simultaneously. While the silhouettes looked professional, the model was far too heavy to run on a standard CPU, causing the visuals to lag significantly behind the players' actual movements. Since we wanted to make the game accessible on standard laptops without a dedicated GPU, we dropped RCNN and went back to finding a more lightweight segmentation method.
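For context, a silhouette pass along these lines could look like the sketch below, using torchvision's pretrained Mask R-CNN as a stand-in; our exact model and settings may have differed.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Stand-in model: torchvision's pretrained Mask R-CNN (COCO); the person class has label 1.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def person_silhouettes(frame_rgb, score_thresh=0.7):
    """Return binary person masks for one RGB frame; far too slow on a plain CPU."""
    with torch.no_grad():
        out = model([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)
    return (out["masks"][keep, 0] > 0.5).cpu().numpy()
```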
Bounding Boxes vs. Precision
This iteration introduced specialized MediaPipe models for hands and faces to create "bounding boxes" for scoring. Instead of needing to hit a target with a single point (like a wrist), players could score with any part of their hand or face. However, this drastically reduced the game's difficulty and made scoring feel less rewarding. We decided to disable this technique for the final version to keep the "ninja" precision requirement intact and to save on the processing power required to run the extra models.
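The bounding-box check itself is trivial, which is part of why it made scoring too easy. A sketch, with the box in the same normalized coordinates as the target:

```python
def bbox_hit(target_xy, bbox):
    """Bounding-box scoring: a hit if the target falls anywhere inside the box.

    `bbox` is (x_min, y_min, x_max, y_max). Compare with the single-point
    distance check used in the final game, which only accepts one landmark.
    """
    x, y = target_xy
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max
```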
Final Multiplayer Version
To solve the performance issues we faced in previous stages, we implemented a multithreaded pipeline that runs YOLO segmentation and MediaPipe tracking in parallel. This allowed us to keep the cool player silhouettes and fast-paced gameplay running at a high frame rate on a standard CPU, creating the most exciting and stable version of Pose Ninja.
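A simplified sketch of that structure is below, assuming the ultralytics YOLO API and Python's threading and queue modules; the hand-off and drawing details are illustrative rather than our exact code.

```python
import threading
import queue
import cv2
import mediapipe as mp
from ultralytics import YOLO

# Sketch: segmentation runs in a worker thread so pose tracking never waits on it.
frames_in = queue.Queue(maxsize=1)
masks_out = queue.Queue()

def segmentation_worker():
    seg_model = YOLO("yolov8n-seg.pt")            # lightweight segmentation checkpoint
    while True:
        frame = frames_in.get()
        result = seg_model(frame, verbose=False)[0]
        masks = result.masks.data.cpu().numpy() if result.masks is not None else None
        masks_out.put(masks)

threading.Thread(target=segmentation_worker, daemon=True).start()

cap = cv2.VideoCapture(0)
latest_masks = None
with mp.solutions.pose.Pose(model_complexity=0) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if frames_in.empty():
            frames_in.put(frame.copy())           # hand a frame to the worker without blocking
        landmarks = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        while not masks_out.empty():
            latest_masks = masks_out.get()        # keep only the freshest segmentation result
        # ... draw silhouettes from latest_masks and targets/landmarks from `landmarks` here ...
        cv2.imshow("Pose Ninja", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```

The main loop never blocks on segmentation: it always tracks and renders at full speed, and simply reuses the most recent silhouette until a newer one arrives.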
Final Polish
To bring everything together, we polished the game by adding background music, sound effects, and "bombs" that players must dodge to protect their score. These finishing touches transformed Pose Ninja from a technical demo into an engaging, fast-paced arcade game.
Results
Out of 43 teams, our project was awarded Most Engaging Project.
You can download the game and try it for yourself here.