We present an algorithm for reconstructing dense, geometrically consistent depth for all pixels in a monocular video. We leverage a conventional structure-from-motion reconstruction to establish geometric constraints on pixels in the video. Unlike the ad-hoc priors used in classical reconstruction, we use a learning-based prior, i.e., a convolutional neural network trained for single-image depth estimation. At test time, we fine-tune this network to satisfy the geometric constraints of a particular input video, while retaining its ability to synthesize plausible depth details in parts of the video that are less constrained. We show through quantitative validation that our method achieves higher accuracy and a higher degree of geometric consistency than previous monocular reconstruction methods. Visually, our results appear more stable. Our algorithm is able to handle challenging hand-held captured input videos with a moderate amount of dynamic motion. The improved quality of the reconstruction enables several applications, such as scene reconstruction and advanced video-based visual effects.
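The core of the method is test-time training: the pretrained single-image depth network is fine-tuned on the input video itself, using pairwise geometric consistency losses built from the structure-from-motion poses and flow correspondences. Below is a minimal, self-contained sketch of that idea only; it is not the repository's actual training code, and DepthNet, the toy data, and the loss terms are illustrative placeholders (see main.py and the loss modules in the repository for the real implementation).

# Minimal sketch (NOT the repository's training code) of test-time fine-tuning:
# a single-image depth network is optimized so that depths of corresponding
# pixels in two frames agree after reprojection through the known poses.
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Stand-in for a pretrained single-image depth CNN (e.g. Mannequin Challenge)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())
    def forward(self, x):
        return self.net(x)  # (B, 1, H, W), positive depth

def reproject(depth, K, K_inv, R, t, uv):
    """Lift pixels uv (N, 2) of frame 1 to 3D using depth, then map them into frame 2."""
    d = depth[0, 0, uv[:, 1], uv[:, 0]].unsqueeze(1)             # (N, 1) depths at uv
    pix_h = torch.cat([uv.float(), torch.ones(len(uv), 1)], 1)   # homogeneous pixels
    pts = (K_inv @ pix_h.T).T * d                                # 3D points in camera 1
    pts2 = (R @ pts.T).T + t                                     # 3D points in camera 2
    proj = (K @ pts2.T).T
    return proj[:, :2] / proj[:, 2:3], pts2[:, 2:3]              # reprojected uv, depth in cam 2

def consistency_loss(model, frame1, frame2, uv1, uv2, K, K_inv, R, t):
    d1, d2 = model(frame1), model(frame2)
    uv1_in_2, z1_in_2 = reproject(d1, K, K_inv, R, t, uv1)
    spatial = (uv1_in_2 - uv2.float()).norm(dim=1).mean()        # reprojection error
    disparity = (1.0 / z1_in_2.clamp(min=1e-3)
                 - 1.0 / d2[0, 0, uv2[:, 1], uv2[:, 0]].unsqueeze(1).clamp(min=1e-3)).abs().mean()
    return spatial + disparity

# Toy data standing in for two video frames, flow correspondences and relative pose.
H = W = 64
frame1, frame2 = torch.rand(1, 3, H, W), torch.rand(1, 3, H, W)
uv1 = torch.randint(0, W, (128, 2)); uv2 = (uv1 + 1).clamp(max=W - 1)
K = torch.tensor([[50., 0., 32.], [0., 50., 32.], [0., 0., 1.]]); K_inv = torch.inverse(K)
R, t = torch.eye(3), torch.tensor([0.05, 0.0, 0.0])

model = DepthNet()
opt = torch.optim.Adam(model.parameters(), lr=4e-4)
for step in range(20):                                           # test-time training on this one video
    opt.zero_grad()
    loss = consistency_loss(model, frame1, frame2, uv1, uv2, K, K_inv, R, t)
    loss.backward()
    opt.step()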
Pull the third-party packages:
git submodule update --init --recursive
Install the Python packages:
conda create -n consistent_depth python=3.6
conda activate consistent_depth
./scripts/install.sh
[Optional] Install COLMAP (see its official installation instructions); on Ubuntu you can try:
./scripts/install_colmap_ubuntu.sh
You can run the demo below without installing COLMAP. The demo takes 37 minutes when tested on one NVIDIA GeForce RTX 2080 GPU.
Download the models and the demo video together with its precomputed results:
./scripts/download_model.sh
./scripts/download_demo.sh results/ayush
Run:
python main.py --video_file data/videos/ayush.mp4 --path results/ayush \
--camera_params "1671.770118, 540, 960" --camera_model "SIMPLE_PINHOLE" \
--make_video
Here 1671.770118, 540, 960 are the camera intrinsics (f, cx, cy) and SIMPLE_PINHOLE is the camera model.
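For reference, here is a small sketch (not part of the repository) of how these parameter conventions map to a 3x3 pinhole intrinsic matrix; the values below are the demo values from the command above.

import numpy as np

def intrinsic_matrix(model, params):
    """Build a 3x3 pinhole intrinsic matrix from COLMAP-style camera parameters."""
    if model == "SIMPLE_PINHOLE":        # params = (f, cx, cy)
        f, cx, cy = params
        fx = fy = f
    elif model == "PINHOLE":             # params = (fx, fy, cx, cy)
        fx, fy, cx, cy = params
    else:
        raise ValueError(f"unsupported camera model: {model}")
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

K = intrinsic_matrix("SIMPLE_PINHOLE", (1671.770118, 540, 960))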
You can inspect the test-time training process with:
tensorboard --logdir results/ayush/R_hierarchical2_mc/B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam/tensorboard/
You can find your results as below.
results/ayush/R_hierarchical2_mc
    videos/
        color_depth_mc_depth_colmap_dense_B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam.mp4  # comparison of disparity maps from Mannequin Challenge, COLMAP and ours
    B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam/
        depth/                   # final disparity maps
        checkpoints/0020.pth     # final checkpoint
        eval/                    # disparity maps and losses after each epoch of training
For quick demonstration and ease of installation, the demo runs everything (flow estimation, test-time training, etc.) except the COLMAP part. To also test the COLMAP part, delete results/ayush/colmap_dense and results/ayush/depth_colmap_dense, and then run the python command above again.
Please refer to params.py or run python main.py --help for the full list of parameters. Below are some examples of common ways to use the system.
Place your video file at $video_file_path.
[Optional] Calibrate the camera using the PINHOLE (fx, fy, cx, cy) or SIMPLE_PINHOLE (f, cx, cy) model. Intrinsic calibration is optional, but recommended for more accurate and faster camera registration. We usually calibrate by capturing a video of a textured plane with very slow camera motion while trying to let the target features cover the full field of view, selecting the non-blurry frames, and running COLMAP on those images.
Run without camera calibration:
python main.py --video_file $video_file_path --path $output_path --make_video
Run with camera calibration. For example, use the following command when the camera is calibrated with the PINHOLE model and fx, fy, cx, cy = 1660.161322, 1600, 540, 960:
python main.py --video_file $video_file_path --path $output_path \
--camera_model "PINHOLE" --camera_params "1660.161322, 1600, 540, 960" \
--make_video
To run with a specific monocular depth model, set --model_type accordingly:
python main.py --video_file $video_file_path --path $output_path \
--camera_model "PINHOLE" --camera_params "1660.161322, 1600, 540, 960" \
--make_video --model_type "${model_type}"
The supported model types are mc (Mannequin Challenge by Zhang et al., 2019), midas2 (MiDaS by Ranftl et al., 2019) and monodepth2 (Monodepth2 by Godard et al., 2019).
We rely on COLMAP for camera pose registration. If you have precomputed camera poses, you can instead provide them to the system in the folder $path as follows. (See here for an example file structure of $path.)
Extract the frames and save them as color_full/frame_%06d.png under $path. Save the frame timestamps in a frames.txt file with the following format (see an example here; a small helper sketch follows the listing):
number_of_frames
width
height
frame_000000_timestamp_in_seconds
frame_000001_timestamp_in_seconds
...
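To illustrate the format above, here is a hypothetical helper (not part of the repository) that writes such a file for a video using OpenCV; the exact timestamp formatting and output path are assumptions following the result structure described in this post.

# Hypothetical helper (not part of the repository): write a frames.txt
# in the format shown above, using OpenCV to read the frame count,
# resolution and per-frame timestamps of a video.
import cv2

def write_frames_txt(video_file, out_path):
    cap = cv2.VideoCapture(video_file)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    with open(out_path, "w") as f:
        f.write(f"{n}\n{w}\n{h}\n")
        for i in range(n):
            f.write(f"{i / fps:.6f}\n")   # timestamp of frame i in seconds

write_frames_txt("data/videos/ayush.mp4", "results/ayush/frames.txt")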
Put your camera poses in COLMAP sparse reconstruction format, i.e., images.txt, cameras.txt and points3D.txt (or the .bin equivalents), under colmap_dense/pose_init/. Note that the POINTS2D entries in images.txt and the points3D.txt file can be empty. Then run:
python main.py --path $path --initialize_pose
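As an illustration, here is a hedged sketch (not from the repository) of writing such minimal COLMAP text files from known poses. The output path, intrinsics, image names and pose values are placeholders; quaternions are in COLMAP's world-to-camera convention (QW, QX, QY, QZ) followed by the translation (TX, TY, TZ).

# Sketch: write minimal COLMAP text files for precomputed poses under
# $path/colmap_dense/pose_init/. All concrete values below are placeholders.
import os

def write_pose_init(out_dir, width, height, f, cx, cy, poses):
    """poses maps image name -> (qw, qx, qy, qz, tx, ty, tz)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "cameras.txt"), "w") as fp:
        # CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]
        fp.write(f"1 SIMPLE_PINHOLE {width} {height} {f} {cx} {cy}\n")
    with open(os.path.join(out_dir, "images.txt"), "w") as fp:
        for i, (name, (qw, qx, qy, qz, tx, ty, tz)) in enumerate(sorted(poses.items()), 1):
            # IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
            fp.write(f"{i} {qw} {qx} {qy} {qz} {tx} {ty} {tz} 1 {name}\n")
            fp.write("\n")  # POINTS2D line, allowed to be empty here
    open(os.path.join(out_dir, "points3D.txt"), "w").close()  # may also be empty

# Placeholder example: two frames with identity rotation and a small x-translation.
write_pose_init("results/myvideo/colmap_dense/pose_init",
                width=1080, height=1920, f=1671.770118, cx=540, cy=960,
                poses={"frame_000000.png": (1, 0, 0, 0, 0.0, 0.0, 0.0),
                       "frame_000001.png": (1, 0, 0, 0, 0.05, 0.0, 0.0)})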
To get better poses on dynamic scenes, you can mask out dynamic objects when extracting features with COLMAP. Note that COLMAP >= 3.6 is required for extracting features with masked regions.
Extract the frames:
python main.py --video_file $video_file_path --path $output_path --op extract_frames
Run your favorite segmentation method (e.g., Mask R-CNN) on the images in $output_path/color_full to extract binary masks of dynamic objects (e.g., humans). No features will be extracted in regions where the mask image is black (pixel intensity 0 in grayscale). Following the COLMAP file-naming convention, save the mask for frame $output_path/color_full/frame_000010.png, for example, at $output_path/mask/frame_000010.png.png.
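For illustration only (not part of the repository), here is a sketch using torchvision's pretrained Mask R-CNN to black out people in such masks; the paths, score threshold and COCO class id 1 ("person") are assumptions you may need to adjust.

# Illustrative sketch: build COLMAP-style masks that black out people.
# Pixels with value 0 (black) get no features; everything else stays white.
import glob, os
import torch
from PIL import Image
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = maskrcnn_resnet50_fpn(pretrained=True).eval()
frame_dir, mask_dir = "results/myvideo/color_full", "results/myvideo/mask"
os.makedirs(mask_dir, exist_ok=True)

for path in sorted(glob.glob(os.path.join(frame_dir, "frame_*.png"))):
    img = to_tensor(Image.open(path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]
    keep = (out["labels"] == 1) & (out["scores"] > 0.5)           # person detections
    mask = torch.ones(img.shape[1:], dtype=torch.uint8) * 255     # white = keep features
    for m in out["masks"][keep]:                                  # (1, H, W) soft masks
        mask[m[0] > 0.5] = 0                                      # black out dynamic object
    # COLMAP convention: mask for frame_000010.png is saved as frame_000010.png.png
    Image.fromarray(mask.numpy(), mode="L").save(
        os.path.join(mask_dir, os.path.basename(path) + ".png"))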
Run the rest of the pipeline:
python main.py --path $output_path --mask_path $output_path/mask \
--camera_model "${camera_model}" --camera_params "${camera_intrinsics}" \
--make_video
The result folder has the following structure. Many of the files are only for debugging purposes.
frames.txt              # metadata about the number of frames, image resolution and timestamps for each frame
color_full/             # extracted frames in the original resolution
color_down/             # extracted frames in the resolution for disparity estimation
color_down_png/
color_flow/             # extracted frames in the resolution for flow estimation
flow_list.json          # indices of frame pairs to finetune the model with
flow/                   # optical flow
mask/                   # masks of consistent flow estimation between frame pairs
vis_flow/               # optical flow visualization. Green regions contain inconsistent flow.
vis_flow_warped/        # visualizing flow accuracy by warping one frame to another using the estimated flow. E.g., frame_000000_000032_warped.png warps frame_000032 to frame_000000.
colmap_dense/           # COLMAP results
    metadata.npz        # camera intrinsics and extrinsics converted from the COLMAP sparse reconstruction
    sparse/             # COLMAP sparse reconstruction
    dense/              # COLMAP dense reconstruction
depth_colmap_dense/     # COLMAP dense depth maps converted to disparity maps in .raw format
depth_${model_type}/    # initial disparity estimation using the original monocular depth model before test-time training
R_hierarchical2_${model_type}/
    flow_list_0.20.json           # indices of frame pairs passing the overlap ratio test with threshold 0.2. Same content as ../flow_list.json.
    metadata_scaled.npz           # camera intrinsics and extrinsics after scale calibration; these are the camera parameters used during test-time training
    scales.csv                    # frame indices and corresponding scales between the initial monocular disparity estimation and the COLMAP dense disparity maps
    depth_scaled_by_colmap_dense/ # monocular disparity estimation scaled to match the COLMAP disparity results
    vis_calibration_dense/        # for debugging scale calibration. frame_000000_warped_to_000029.png warps frame_000000 to frame_000029 using the scaled camera translations and the disparity maps from the initial monocular depth estimation.
    videos/                       # video visualization of the results
    B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam/
        checkpoints/              # checkpoint after each epoch
        depth/                    # final disparity map results after finishing test-time training
        eval/                     # intermediate losses and disparity maps after each epoch
        tensorboard/              # tensorboard log of the test-time training process
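Finally, a small sketch of how one might inspect the final disparity maps. I am assuming here that utils/image_io.py in the repository exposes a load_raw_float32_image helper for the .raw files listed above; if the actual reader has a different name, adapt the import accordingly.

# Sketch for inspecting final disparity maps. ASSUMPTION: the repository's
# utils/image_io.py provides a reader for the .raw float32 images (assumed
# here to be load_raw_float32_image); verify the actual function name.
import glob
import matplotlib.pyplot as plt
from utils.image_io import load_raw_float32_image  # assumed helper from this repo

depth_dir = "results/ayush/R_hierarchical2_mc/B0.1_R1.0_PL1-0_LR0.0004_BS4_Oadam/depth"
for path in sorted(glob.glob(f"{depth_dir}/**/*.raw", recursive=True))[:3]:
    disparity = load_raw_float32_image(path)        # H x W float32 disparity map
    plt.imshow(disparity, cmap="magma")
    plt.title(path)
    plt.colorbar(label="disparity (1 / depth, up to scale)")
    plt.show()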