End-to-End Instrument Kinematics Reconstruction Pipeline for Minimally-Invasive Surgery

Pose, tracking, and optional depth → kinematic trajectories & instrument state-space skeletons

Northwell Endoscopic Video → instrument pose & tracks.

Abstract

We present a configurable pipeline for reconstructing surgical instrument kinematics from laparoscopic video. Our system estimates multi-keypoint poses and persistent track IDs, exports a tidy track table for analysis, and can optionally lift trajectories into a 3D state space via configurable depth inference. Tracked keypoints are optionally converted into a normalized state space to produce a compact skeleton-trajectory representation suitable for applications such as robotic policy training and behavior analytics.

This site showcases demonstration clips, method diagrams, and utility analytics (e.g., trajectory clusters). Depth is optional and currently a work in progress; the state-space skeleton export is a toggle that runs on top of the generated instrument trajectories, with or without depth. The goal is a research-friendly toolchain that is fast to test, transparent to tune, and easy to integrate into larger surgical autonomy stacks.

Demos

Legend: Depth · Skeleton

Method

Inputs: Laparoscopic video (monocular or stereo).

Pose & Tracking: an instrument detector plus multi-keypoint tracker yields per-frame keypoints and stable track IDs (a minimal usage sketch follows this list).

Optional Depth: stereo depth (NVLabs FoundationStereo) or fine-tuned monocular depth (e.g., Metric3D, Depth-Anything).

Optional Segmentation Mask Constraint: applies segmentation masks, prompted by detection bounding boxes, to restrict keypoint depth estimation to instrument pixels. SAM2 fine-tuning is in progress.

State-Space Transform: converts tracked pose skeletons into a normalized kinematic state space (2D or 3D), exported as JSON alongside the tidy track table (a back-projection and normalization sketch follows Fig. 1).

Applications: kinematic clips → windowed clustering, with exemplar frames and visual summaries.
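
As a concrete example of the pose-and-tracking stage, the sketch below runs an Ultralytics YOLOv11 pose checkpoint with built-in tracking and flattens the results into a tidy track table. The weight path, video path, and column names are placeholders rather than our exact export schema.

```python
import pandas as pd
from ultralytics import YOLO

# Placeholder path to a fine-tuned YOLOv11 pose checkpoint (not a released weight).
model = YOLO("weights/yolo11s-pose-surgical.pt")

rows = []
# Ultralytics tracking keeps IDs persistent across frames when persist=True.
results = model.track(source="demo_clip.mp4", persist=True, stream=True, verbose=False)
for frame_idx, result in enumerate(results):
    if result.boxes.id is None:  # no instrument tracked in this frame
        continue
    track_ids = result.boxes.id.int().tolist()
    classes = result.boxes.cls.int().tolist()
    keypoints = result.keypoints.xy.cpu().numpy()  # (instruments, keypoints, 2)
    for tid, cls, kps in zip(track_ids, classes, keypoints):
        for kp_idx, (x, y) in enumerate(kps):
            rows.append({"frame": frame_idx, "track_id": tid, "class": cls,
                         "keypoint": kp_idx, "x": float(x), "y": float(y)})

# One row per (frame, track, keypoint) observation: the "tidy" track table.
pd.DataFrame(rows).to_csv("tracks.csv", index=False)
```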

Pipeline diagram: input video → pose/tracks → (depth) → state-space skeleton.

Fig. 1 - Pipeline overview.
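
When depth is available, the 2D keypoints can be lifted into camera-frame 3D before the state-space normalization. The sketch below is a minimal illustration assuming pinhole intrinsics; the intrinsic values, root-keypoint centering, and fixed scene scale are stand-ins for whatever calibration and normalization convention a given deployment uses.

```python
import json
import numpy as np

def backproject(kps_2d: np.ndarray, depth: np.ndarray, fx, fy, cx, cy) -> np.ndarray:
    """Lift (K, 2) pixel keypoints to (K, 3) camera-frame points using a depth map."""
    u = np.clip(kps_2d[:, 0].round().astype(int), 0, depth.shape[1] - 1)
    v = np.clip(kps_2d[:, 1].round().astype(int), 0, depth.shape[0] - 1)
    z = depth[v, u]                              # metric depth sampled at each keypoint
    x = (kps_2d[:, 0] - cx) * z / fx
    y = (kps_2d[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def to_state_space(kps_3d: np.ndarray, scale: float) -> np.ndarray:
    """Normalize a skeleton: translate to its root keypoint and divide by a scene scale."""
    return (kps_3d - kps_3d[0]) / scale

# Single-frame, single-track example; all numbers are placeholders.
kps_2d = np.array([[612.0, 340.0], [598.0, 371.0], [579.0, 410.0]])
depth = np.full((720, 1280), 0.08)               # e.g. a uniform 8 cm depth map
kps_3d = backproject(kps_2d, depth, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
state = to_state_space(kps_3d, scale=0.05)

with open("skeleton_state.json", "w") as f:
    json.dump({"track_id": 3, "frame": 120, "state": state.tolist()}, f)
```

Without depth, the same normalization can be applied directly to the 2D keypoints, which is why the skeleton export remains a toggle that works with or without depth.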

UMAP clustering of windowed kinematic features.

Fig. 2 - Demo kinematic analytics (trajectory clusters).
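
The trajectory clusters in Fig. 2 come from windowing the kinematic sequences and embedding per-window features. Below is a rough sketch with umap-learn and scikit-learn; the window length, feature summary, and cluster count are illustrative assumptions rather than the settings used for the demo.

```python
import numpy as np
import umap
from sklearn.cluster import KMeans

def window_features(states: np.ndarray, win: int = 30, hop: int = 15) -> np.ndarray:
    """Slice a (T, D) state-space trajectory into overlapping windows and summarize
    each window with simple kinematic statistics."""
    feats = []
    for start in range(0, len(states) - win + 1, hop):
        w = states[start:start + win]
        vel = np.diff(w, axis=0)
        feats.append(np.concatenate([
            vel.mean(axis=0),                      # mean velocity per state dimension
            [np.linalg.norm(vel, axis=1).sum()],   # path length within the window
            w.std(axis=0),                         # positional spread
        ]))
    return np.array(feats)

# Placeholder trajectory: (T, D) flattened skeleton states for one track.
trajectory = np.random.rand(600, 6)
X = window_features(trajectory)

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)  # for plotting
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)            # cluster windows
```

Exemplar frames for each cluster can then be pulled from the windows nearest each cluster centroid.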

Training

Our pipeline leverages fine-tuned YOLOv11 models for pose estimation and Metric3D for monocular depth prediction (NVLabs FoundationStereo can also be used for depth inference). The YOLOv11 detection-pose models are trained on ~5k manually annotated instruments across 25 SurgVU videos, with data augmentations including random rotation, scaling, brightness/contrast jitter, and horizontal flipping. The Metric3D monocular depth model is fine-tuned on 25 Hamlyn stereoscopic videos using depth maps generated by NVLabs FoundationStereo as annotations. Below are training reports showing convergence and held-out validation metrics.
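
A hedged sketch of the pose fine-tuning setup via the Ultralytics training API: the dataset YAML, epoch count, and exact augmentation magnitudes below are placeholders, not the reported configuration (rotation, scaling, brightness/contrast jitter, and horizontal flipping map roughly onto the degrees, scale, hsv_v, and fliplr arguments).

```python
from ultralytics import YOLO

# Start from the small YOLOv11 pose checkpoint and fine-tune on the SurgVU annotations.
model = YOLO("yolo11s-pose.pt")

model.train(
    data="surgvu_pose.yaml",  # placeholder dataset YAML: 6 instrument classes + keypoint layout
    epochs=100,
    imgsz=640,
    degrees=15.0,   # random rotation
    scale=0.3,      # random scaling
    hsv_v=0.3,      # brightness jitter
    fliplr=0.5,     # horizontal flipping probability
)

metrics = model.val()  # held-out validation metrics, as summarized in the reports below
```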

Skeletal Pose Structures.

Pose Model Classes: Six Instruments with varying annotation scales.

Pose annotation distributions.

Pose training curves: loss and keypoint accuracy over epochs.

Pose Model Training Report: YOLOv11-small

Depth training curves: loss and depth error metrics over epochs.

Depth Model Training Report

Depth Comparisons: NVLabs FoundationStereo & Fine-Tuned Metric3D Inference.

BibTeX

@article{surgai_kinematics_2025,
  author    = {McHugh, Liam and Chen, Xudong and Turkan, Mehmet K. and Ballo, Mattia and Godbole, Aditya Amit and Morais, Maria and Kostic, Zoran and Filicori, Filippo},
  title     = {Kinematic Analytics for Minimally-Invasive Surgery: Depth-Optional State-Space Reconstruction},
  journal   = {preprint},
  year      = {2025},
}