TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks

1King Abdullah University of Science and Technology,   2Visual Geometry Group, University of Oxford,   3Snap Inc.  
ECCV 2024
arXiv | Code

Abstract

Neural radiance fields (NeRFs) generally require many images with accurate poses for novel view synthesis, which does not reflect realistic setups where views can be sparse and poses can be noisy. Previous solutions for learning NeRFs with sparse views and noisy poses only enforce local geometry consistency between pairs of views. Closely following bundle adjustment in Structure-from-Motion (SfM), we introduce TrackNeRF for more globally consistent geometry reconstruction and more accurate pose optimization. TrackNeRF introduces feature tracks, i.e., connected pixel trajectories across all visible views that correspond to the same 3D points. By enforcing reprojection consistency along feature tracks, TrackNeRF explicitly encourages holistic 3D consistency. Through extensive experiments, TrackNeRF sets a new benchmark for reconstruction from sparse and noisy views. In particular, TrackNeRF improves over the state-of-the-art BARF and SPARF by about 8 and 1 PSNR, respectively, on DTU under various sparse and noisy view setups.

Method


TrackNeRF minimizes the reprojection loss over all visible views for each feature track corresponding to the same landmark, thereby guaranteeing globally consistent correspondences across multiple views.
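The core idea above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's implementation: it assumes a shared pinhole intrinsic matrix `K`, world-to-camera poses `(R, t)`, and measures a track's consistency as the mean reprojection error of one landmark across every view in which it is observed.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D point X into a camera with intrinsics K and pose (R, t)."""
    x_cam = R @ X + t            # transform to camera coordinates
    x = K @ x_cam                # apply pinhole intrinsics
    return x[:2] / x[2]          # perspective division -> pixel coordinates

def track_reprojection_loss(K, poses, observations, X):
    """Mean reprojection error of landmark X over all views of its feature track.

    poses:        list of (R, t) pairs, one per view in the track
    observations: list of observed 2D pixel locations of the landmark
    """
    errors = [np.linalg.norm(project(K, R, t, X) - uv)
              for (R, t), uv in zip(poses, observations)]
    return float(np.mean(errors))
```

Because the loss sums over every view in the track rather than over a single pair, a wrong landmark position or camera pose is penalized by all views simultaneously, which is the "holistic" consistency the method targets.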

Novel View Synthesis Visualization and Comparison

Comparison of Reconstruction Quality between BARF and TrackNeRF


View Synthesis on DTU scan45 from 3 Input Views with camera poses perturbed by 15% noise.



View Synthesis on LLFF leaves from 3 Input Views with camera poses perturbed by 15% noise.



View Synthesis on DTU scan21 from 3 Input Views with camera poses perturbed by 35% noise.



Tables

We benchmark novel view synthesis and camera pose estimation on DTU with noisy initial poses and different numbers of input views (3, 6, or 9). We simulate noisy poses by adding 15% Gaussian noise to the ground-truth poses. Rotation errors are reported in degrees, and translation errors are multiplied by 100. Results in (·) are computed with the background masked. Our TrackNeRF achieves the best performance in all setups.
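The pose-perturbation setup can be sketched as follows. This is a hypothetical sketch of one common convention (not necessarily the paper's exact protocol): the rotation is perturbed by a random axis-angle vector and the translation by additive Gaussian noise, both drawn with standard deviation `noise_level`.

```python
import numpy as np

def perturb_pose(R, t, noise_level=0.15, rng=None):
    """Perturb a ground-truth pose (R, t) with Gaussian noise (illustrative only).

    A random axis-angle vector w ~ N(0, noise_level^2) is converted to a
    rotation via the Rodrigues formula and composed with R; the translation
    receives additive Gaussian noise of the same scale.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(scale=noise_level, size=3)   # axis-angle perturbation
    theta = np.linalg.norm(w)
    if theta > 0:
        k = w / theta                           # unit rotation axis
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])        # cross-product matrix
        dR = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    else:
        dR = np.eye(3)
    return dR @ R, t + rng.normal(scale=noise_level, size=3)
```

The Rodrigues construction keeps the perturbed rotation a valid element of SO(3), so the noisy poses remain legitimate camera poses that a bundle-adjusting method can then try to correct.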

Visualization of Camera Poses under 15% of Noise

While BARF cannot recover the camera poses well, both SPARF and our TrackNeRF recover near-perfect camera poses under 15% noise on most DTU scenes.

Visualization of Camera Poses under 35% of Noise

Only our TrackNeRF recovers near-perfect camera poses under 35% noise on most DTU scenes.

Citation

Kudos to the great works PixSfM and SPARF, whose code we borrow and build TrackNeRF upon:

If you find the code and paper helpful, a citation would be appreciated:

Acknowledgements

The research reported in this publication was supported by funding from the KAUST Center of Excellence on GenAI under award number 5940, as well as the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI). Additional support came from the KAUST Ibn Rushd Postdoc Fellowship program.

The website template was borrowed from SPARF, ReconFusion, Michaël Gharbi and Ref-NeRF.