Pose, tracking, and optional depth → kinematic trajectories & instrument state-space skeletons
We present a configurable pipeline for surgical instrument kinematic reconstruction from laparoscopic video. The system estimates multi-keypoint poses with persistent track IDs, exports a tidy track table for analysis, and can optionally lift tracks to a 3D state space using configurable depth inference. Tracked keypoints can further be converted into a normalized state space, yielding a compact skeleton trajectory representation suitable for applications such as robotic policy training and behavior analytics.
This site showcases demonstration clips, method diagrams, and utility analytics (e.g., trajectory clusters). Depth is optional and currently under development; the state-space skeleton export is a toggle that runs on top of the generated instrument trajectories, with or without depth. The goal is a research-friendly toolchain that is fast to test, transparent to tune, and easy to integrate into larger surgical autonomy stacks.
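To make the export concrete, here is a hypothetical sketch of what the tidy track table can look like when loaded with pandas. The column names (frame, track_id, cls, kpt, x, y, conf, z) and values are illustrative assumptions rather than the pipeline's fixed schema, and the z column is only meaningful when depth inference is enabled.

# Hypothetical layout of the "tidy track table" described above: one row per
# (frame, track, keypoint). Column names and values are illustrative assumptions.
import pandas as pd

tracks = pd.DataFrame(
    [
        {"frame": 0, "track_id": 3, "cls": "needle_driver", "kpt": "jaw_tip", "x": 412.1, "y": 233.8, "conf": 0.91, "z": 0.084},
        {"frame": 0, "track_id": 3, "cls": "needle_driver", "kpt": "wrist",   "x": 455.6, "y": 268.0, "conf": 0.88, "z": 0.091},
        {"frame": 1, "track_id": 3, "cls": "needle_driver", "kpt": "jaw_tip", "x": 414.9, "y": 231.2, "conf": 0.90, "z": 0.083},
    ]
)

# Typical downstream use: pivot a single track into a per-frame trajectory table.
traj = (
    tracks[tracks.track_id == 3]
    .pivot_table(index="frame", columns="kpt", values=["x", "y", "z"])
    .sort_index()
)
print(traj.head())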
Legend: Depth · Skeleton (badges indicate which optional stages are enabled in each demo clip).
Inputs: Laparoscopic video (monocular or stereo).
Pose & Tracking: an instrument detector and multi-keypoint tracker yield per-frame keypoints and stable track IDs.
Optional Depth: stereo depth (NVLabs FoundationStereo) or fine-tuned monocular depth (e.g., Metric3D, Depth-Anything).
Optional Segmentation Mask Constraint: applies a segmentation mask, prompted by the detection bounding boxes, to restrict keypoint depth estimation to instrument regions. SAM2 fine-tuning is in progress.
State-Space Transform: converts the tracked pose skeleton into a normalized kinematic state space (2D/3D), exported as JSON alongside the tidy track table (a minimal sketch follows this list).
Applications: kinematic clips → windowed clustering, with exemplar frames and visual summaries (a clustering sketch follows Fig. 2).
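The sketch below illustrates one way the state-space transform could be implemented. The root-relative, image-size-normalized convention, the keypoint names, the depth_scale parameter, and the JSON layout are assumptions for illustration, not the pipeline's exact transform.

# Minimal sketch of the state-space transform under assumed conventions:
# coordinates are expressed relative to a root keypoint ("wrist" here) and
# normalized by image size (and an assumed depth scale when z is present).
import json
import numpy as np

def to_state_space(keypoints, img_w, img_h, root="wrist", depth_scale=None):
    """keypoints: dict name -> (x, y) or (x, y, z) in pixels / metric depth."""
    root_pt = np.asarray(keypoints[root], dtype=float)
    state = {}
    for name, pt in keypoints.items():
        p = np.asarray(pt, dtype=float) - root_pt      # root-relative coordinates
        p[0] /= img_w                                  # normalize x by image width
        p[1] /= img_h                                  # normalize y by image height
        if p.shape[0] == 3 and depth_scale:            # normalize z if depth is available
            p[2] /= depth_scale
        state[name] = p.tolist()
    return state

frame_state = to_state_space(
    {"wrist": (455.6, 268.0, 0.091), "jaw_tip": (412.1, 233.8, 0.084)},
    img_w=1280, img_h=720, depth_scale=0.25,
)
print(json.dumps({"frame": 0, "track_id": 3, "state": frame_state}, indent=2))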
Fig. 1 - Pipeline overview.
Fig. 2 - Demo kinematic analytics (trajectory clusters).
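As one way to read Fig. 2, the following sketch shows windowed clustering over keypoint trajectories. The window length, stride, velocity features, and cluster count are illustrative assumptions, not the settings used to produce the figure.

# Hedged sketch of windowed trajectory clustering: slide fixed-length windows
# over a track, featurize each window with flattened keypoint velocities, and
# group windows with k-means. All hyperparameters below are placeholders.
import numpy as np
from sklearn.cluster import KMeans

def window_features(traj, win=30, stride=15):
    """traj: (T, K, 2) array of per-frame keypoint positions for one track."""
    vel = np.diff(traj, axis=0)                       # per-frame keypoint velocities
    feats = []
    for start in range(0, len(vel) - win + 1, stride):
        feats.append(vel[start:start + win].reshape(-1))   # flatten each window
    return np.stack(feats) if feats else np.empty((0, win * traj.shape[1] * 2))

rng = np.random.default_rng(0)
traj = rng.normal(size=(300, 6, 2)).cumsum(axis=0)    # stand-in for a real track
X = window_features(traj)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(X.shape, np.bincount(labels))                   # windows assigned per motion cluster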
Our pipeline leverages fine-tuned YOLOv11 models for pose estimation and Metric3D for monocular depth prediction (NVLabs FoundationStereo can also be used for depth inference). The YOLOv11 detection-pose models are trained on ~5k manually annotated instruments across 25 SurgVU videos, using data augmentations including random rotation, scaling, brightness/contrast jitter, and horizontal flipping. The Metric3D monocular depth model is fine-tuned on 25 Hamlyn stereoscopic videos, using depth maps generated with NVLabs FoundationStereo as annotations. Below are training reports showing convergence and held-out validation metrics.
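For reference, a fine-tuning run with these augmentation families can be expressed through the Ultralytics training API roughly as follows. The dataset YAML path, epoch count, and augmentation values are placeholders, not the exact configuration used for the reported models.

# Hedged sketch of pose fine-tuning with the augmentation families listed above.
# All paths and hyperparameter values here are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("yolo11s-pose.pt")          # small pose checkpoint as a starting point
model.train(
    data="surgvu_pose.yaml",             # hypothetical dataset config (6 instrument classes)
    epochs=100,
    imgsz=640,
    degrees=15.0,                        # random rotation
    scale=0.5,                           # random scaling
    hsv_v=0.4, hsv_s=0.5,                # brightness/contrast-style jitter via HSV
    fliplr=0.5,                          # horizontal flipping
)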
Pose Model Classes: Six instruments with varying annotation scales
Pose Model Training Report: YOLOv11-small
Depth Model Training Report
@article{surgai_kinematics_2025,
author = {McHugh, Liam and Chen, Xudong and Turkan, Mehmet K. and Ballo, Mattia and Godbole, Aditya Amit and Morais, Maria and Kostic, Zoran and Filicori, Filippo},
title = {Kinematic Analytics for Minimally-Invasive Surgery: Depth-Optional State-Space Reconstruction},
journal = {preprint},
year = {2025},
}