Our final project for our Introduction to Robotics (MEAM 5200) class involved programming a Franka Panda robotic arm to pick up blocks from a workspace and stack them on a target platform, with scoring based on block type (static or dynamic) and final stack height. We built a robust computer-vision–based pick-and-place pipeline optimized for static blocks, while also exploring both hard-coded and vision-driven methods for interacting with dynamic blocks.
Special thanks to my teammates, Arush, Dhruv, and Elizabeth, for their collaboration on this project.
System Overview
Our pipeline was built on four pillars: Perception, Transformation, Planning, and Execution. We used AprilTags for vision, Jacobian pseudo-inverse methods for Inverse Kinematics (IK), and a custom "look-and-grab" state machine to manage the workflow.
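As a rough sketch of the IK update, the core Jacobian pseudo-inverse step looks like the following (the `jacobian_fn` helper and step size are illustrative, not our exact implementation):

```python
import numpy as np

def ik_step(q, pose_error, jacobian_fn, step_scale=0.5):
    """One Jacobian pseudo-inverse IK update toward a target pose.

    q           : current joint angles, shape (7,) for the Panda
    pose_error  : 6-vector [linear; angular] error to the target pose
    jacobian_fn : hypothetical helper mapping q -> 6x7 geometric Jacobian
    """
    J = jacobian_fn(q)                    # 6x7 geometric Jacobian
    dq = np.linalg.pinv(J) @ pose_error   # least-squares joint update
    return q + step_scale * dq            # small step; iterate to convergence
```

Iterating this update until the pose error falls below a tolerance yields the joint targets for each motion.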
Perception and the Coordinate Pipeline
To interact with a block, the robot first needs to know exactly where it is in 3D space. Using the Franka Panda’s end-effector camera, we detected AprilTags to get the block's pose in the camera frame. However, the robot moves based on its base frame. We performed a series of homogeneous transformations to bridge this gap.
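Concretely, the chain of frames looks like the sketch below (the transform names are illustrative; `T_ee_camera` comes from a one-time camera-mount calibration):

```python
import numpy as np

def block_pose_in_base(T_base_ee, T_ee_camera, T_camera_tag):
    """Chain homogeneous transforms: base <- end effector <- camera <- tag.

    T_base_ee   : 4x4 end-effector pose from forward kinematics
    T_ee_camera : 4x4 fixed camera-mount offset (calibrated once)
    T_camera_tag: 4x4 AprilTag pose reported by the detector
    """
    return T_base_ee @ T_ee_camera @ T_camera_tag
```

The block's position in the base frame is then the translation column `T[:3, 3]` of the result.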
Static Blocks Pipeline
Static blocks were our primary scoring focus. To ensure a clean grasp every time, we developed a Grasp Orientation Selection heuristic.
Instead of letting the vision system dictate a noisy orientation, we:
Fixed the gripper axis to be perfectly vertical.
Projected the block's axes onto the table plane.
Selected the orientation that required the smallest rotation from the robot’s current state.
This approach minimized "wrist flip" and unnecessary motion, leading to a much more stable pick-up.
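In code, the heuristic reduces to picking the nearest of four equivalent yaws for a vertical gripper (a minimal sketch, assuming the tag detection gives the block's rotation matrix in the base frame):

```python
import numpy as np

def select_grasp_yaw(R_base_block, current_yaw):
    """Pick the vertical-gripper yaw closest to the current wrist angle.

    R_base_block: 3x3 block rotation estimated from the AprilTag
    current_yaw : current gripper yaw about the world z-axis (radians)
    """
    # Project the block's x-axis onto the table (z = 0) plane to get its yaw.
    bx = R_base_block[:, 0]
    block_yaw = np.arctan2(bx[1], bx[0])
    # A cube offers four equivalent side grasps, 90 degrees apart.
    candidates = block_yaw + np.arange(4) * np.pi / 2
    # Wrap the differences to [-pi, pi) and keep the smallest rotation.
    diffs = (candidates - current_yaw + np.pi) % (2 * np.pi) - np.pi
    return current_yaw + diffs[np.argmin(np.abs(diffs))]
```

Because the gripper stays vertical and the block faces are square, only yaw matters, which is what makes the four candidates interchangeable.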
Video 1: Static Blocks in Simulation
Video 2: Static Blocks with Physical Robot Arm. Video at 10x speed.
Dynamic Blocks: Vision Method
To handle dynamic blocks on a rotating turntable, we developed a vision-based interception strategy that predicts a block’s future grasp pose based on its observed position and the table’s known constant rotation speed. Rather than continuously tracking the block, the robot senses the block at a predefined “look” angle and uses timing to infer where and how to grasp it at a later angle.
We iterated through several designs: sensing when the block crossed the y-axis and waiting a full rotation proved too slow, while sensing at ±45° caused joint-limit and singularity issues. The final approach senses the block at −22.5° and grasps it at +22.5°, balancing reachability and timing constraints.
Diagram of block movement and orientation from −22.5° to +22.5°
A precomputed look pose is used to detect when the block enters a narrow x-position threshold, triggering a single grasp-pose calculation that flips the block’s position appropriately and rotates its orientation by 45°. The gripper is constrained to approach from the x-direction, allowing the block to slide cleanly into the grasp.
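A minimal sketch of the prediction step, assuming the turntable center and spin direction are known in the base frame (flip the sign of `sweep` for the opposite spin):

```python
import numpy as np

def rotz(theta):
    """Rotation matrix about the world z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def predict_grasp_position(p_block, p_center, sweep=np.deg2rad(45.0)):
    """Rotate the sensed block position forward by the turntable sweep.

    p_block : block position (3,) in the base frame at the -22.5 deg look pose
    p_center: turntable center (3,) in the base frame (assumed calibrated)
    sweep   : angle traveled between look (-22.5 deg) and grasp (+22.5 deg)
    """
    return p_center + rotz(sweep) @ (p_block - p_center)
```

The grasp yaw advances by the same 45° sweep, which is where the orientation rotation above comes from.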
Dynamic Blocks: Waiting Method
Ultimately, we found that continuous tracking was too sensitive to latency and detection dropout. For the final version, we implemented a Deterministic "Waiting" Strategy: the robot moved to a precomputed "intercept zone," waited for the block to come within reach, and executed a timed grasp. This proved far more robust against the sim-to-real gap.
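One plausible shape for that loop, with a hypothetical `arm`/`detector` interface; a purely time-based variant would replace the polling loop with a fixed wait:

```python
import time

def wait_and_grab(arm, detector, x_threshold, grasp_delay):
    """Hold at the intercept pose, then execute a timed grasp.

    arm        : hypothetical interface with open_gripper()/close_gripper()
    detector   : hypothetical interface reporting the block's x position
    x_threshold: x position at which the block enters the intercept zone
    grasp_delay: seconds from threshold crossing to closing the gripper,
                 derived from the turntable's known angular speed
    """
    arm.open_gripper()
    while True:
        x = detector.block_x()
        if x is not None and x < x_threshold:
            break                 # block has entered the intercept zone
        time.sleep(0.01)          # poll without hammering the detector
    time.sleep(grasp_delay)       # let the block slide into the jaws
    arm.close_gripper()
```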
Video 3: Dynamic Blocks in Simulation
Video 4: Dynamic Blocks with Physical Robot Arm
Reliable Stacking via Precomputation
A key lesson we learned was that solving IK live for every stack level introduced too much variance and was slow. If the IK solver failed near a joint limit, the whole run could end.
To solve this, we hardcoded joint configurations for four stack levels. By moving the arm to a "stack staging" position directly above the goal and then descending through precomputed joint positions, we ensured the arm always moved vertically. This method was also much faster than computing IK for every block.
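Structurally, the placement routine reduced to a small lookup table (the joint vectors below are placeholders, not our calibrated values, and `arm.move_to` is a hypothetical interface):

```python
# Placeholder joint vectors (radians); the real values were calibrated
# on the robot and are not reproduced here.
STAGING_Q = [0.0, -0.6, 0.0, -2.0, 0.0, 1.6, 0.8]
STACK_LEVEL_Q = [
    [0.0, -0.4, 0.0, -2.2, 0.0, 1.8, 0.8],  # level 1 (platform height)
    [0.0, -0.5, 0.0, -2.1, 0.0, 1.7, 0.8],  # level 2
    [0.0, -0.6, 0.0, -2.0, 0.0, 1.6, 0.8],  # level 3
    [0.0, -0.7, 0.0, -1.9, 0.0, 1.5, 0.8],  # level 4
]

def place_block(arm, level):
    """Descend vertically from the staging pose using precomputed joints."""
    arm.move_to(STAGING_Q)              # hover directly above the goal
    arm.move_to(STACK_LEVEL_Q[level])   # vertical descent to this level
    arm.open_gripper()                  # release the block
    arm.move_to(STAGING_Q)              # retreat before the next pick
```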
Results and Performance Analysis
Our system proved highly reliable at grasping, though because our stacking used precomputed heights, if one block failed, the rest of the stack followed. We accepted this risk because computing IK for every block was simply too slow.
Simulation Performance Summary
The maximum score we achieved used 4 static blocks and 1 dynamic block, for a total of
10(25 + 75 + 125 + 175) + 20(225) = 8500 points
On hardware, offsets had to be recalibrated each time a different robot arm was used. As a result, we did not collect a large number of trials on the same robot, making it difficult to draw definitive conclusions about pickup and stacking success rates. Nevertheless, we provide two representative videos demonstrating successful execution of the system on hardware for both static and dynamic blocks. (Video 2 and Video 4 above)
Lessons Learned
Overall, we found that reliability improved most from simplifying execution, such as precomputing place poses and constraining orientations, though this came with tradeoffs between speed and robustness. Vision-based orientation estimation proved fragile on hardware due to lighting and detection noise, making strong priors, like the table plane and a limited set of grasp orientations, essential. For dynamic blocks, interception-based strategies were significantly more robust than continuous tracking, which was highly sensitive to latency under competition constraints. Despite frequent sim-to-real discrepancies caused by sensor noise and environmental mismatch, simulation remained a critical tool for developing and validating effective grasping strategies.