---
pipeline_tag: feature-extraction
license: cc-by-nc-4.0
---
<div align="center">
<img src="demo/nerf-mae_teaser.png" width="85%">
<img src="demo/nerf-mae_teaser.jpeg" width="85%">
</div>
<br>
<div align="center">
[Paper](https://arxiv.org/abs/2404.01300) | [Project Page](https://nerf-mae.github.io) | [PyTorch](https://pytorch.org/) | [Citation](https://github.com/zubair-irshad/NeRF-MAE?tab=readme-ov-file#citation) | [Video](https://youtu.be/D60hlhmeuJI?si=d4RfHAwBJgLJXdKj)
</div>
---
<a href="https://www.tri.global/" target="_blank">
<img align="right" src="demo/GeorgiaTech_RGB.png" width="18%"/>
</a>
<a href="https://www.tri.global/" target="_blank">
<img align="right" src="demo/tri-logo.png" width="17%"/>
</a>
### [Project Page](https://nerf-mae.github.io/) | [arXiv](https://arxiv.org/abs/2404.01300) | [PDF](https://arxiv.org/pdf/2404.01300.pdf)
**NeRF-MAE : Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields**
<a href="https://zubairirshad.com"><strong>Muhammad Zubair Irshad</strong></a>
·
<a href="https://zakharos.github.io/"><strong>Sergey Zakharov</strong></a>
·
<a href="https://www.linkedin.com/in/vitorguizilini"><strong>Vitor Guizilini</strong></a>
·
<a href="https://adriengaidon.com/"><strong>Adrien Gaidon</strong></a>
·
<a href="https://faculty.cc.gatech.edu/~zk15/"><strong>Zsolt Kira</strong></a>
·
<a href="https://www.tri.global/about-us/dr-rares-ambrus"><strong>Rares Ambrus</strong></a>
<br> **European Conference on Computer Vision, ECCV 2024**<br>
<b> Toyota Research Institute | Georgia Institute of Technology</b>
## 💡 Highlights
- **NeRF-MAE**: The first large-scale pretraining utilizing Neural Radiance Fields (NeRF) as an input modality. We pretrain a single Transformer model on thousands of NeRFs for 3D representation learning.
- **NeRF-MAE Dataset**: A large-scale NeRF pretraining and downstream task finetuning dataset.
## 🏷️ TODO 🚀
- [x] Release large-scale pretraining code 🚀
- [x] Release NeRF-MAE dataset comprising radiance and density grids 🚀
- [x] Release 3D object detection finetuning and eval code 🚀
- [x] Pretrained NeRF-MAE checkpoints and out-of-the-box model usage 🚀
## NeRF-MAE Model Architecture
<p align="center">
<img src="demo/nerf-mae_architecture.jpg" width="90%">
</p>
## Citation
If you find this repository or our dataset useful, please star ⭐ this repository and consider citing 📝:
```bibtex
@inproceedings{irshad2024nerfmae,
title={NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields},
author={Muhammad Zubair Irshad and Sergey Zakharov and Vitor Guizilini and Adrien Gaidon and Zsolt Kira and Rares Ambrus},
booktitle={European Conference on Computer Vision (ECCV)},
year={2024}
}
```
### Contents
- [🌇 Environment](#-environment)
- [⛳ Model Usage and Checkpoints](#-model-usage-and-checkpoints)
- [🗂️ Dataset](#-dataset)
## 🌇 Environment
Create a Python 3.9 conda environment and install the requirements:
```bash
cd NeRF-MAE  # repository root
conda create -n nerf-mae python=3.9
conda activate nerf-mae
pip install --upgrade pip
pip install -r requirements.txt
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
```
The code was built and tested with **CUDA 11.3**.
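As a quick sanity check (not part of the original setup instructions), you can confirm that the intended PyTorch/CUDA combination is active before compiling the extensions:

```python
import torch

# Verify the environment set up above (torch 1.12.1 + CUDA 11.3).
print("torch:", torch.__version__)                # expected: 1.12.1+cu113
print("CUDA available:", torch.cuda.is_available())
print("CUDA build version:", torch.version.cuda)  # expected: 11.3
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```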
To run downstream task finetuning, compile the CUDA extension as described in [NeRF-RPN](https://github.com/lyclyc52/NeRF_RPN):
```bash
cd NeRF-MAE  # repository root
cd nerf_rpn/model/rotated_iou/cuda_op
python setup.py install
cd ../../../..
```
## ⛳ Model Usage and Checkpoints
- [Hugging Face repo to download pretrained and finetuned checkpoints](https://huggingface.co/mirshad7/NeRF-MAE)
NeRF-MAE is structured to provide easy access to pretrained NeRF-MAE models (and reproductions) for use in various downstream tasks. If you don't have the resources for large-scale pretraining, you can still extract strong visual features from NeRFs: our pretraining provides an easy-to-access embedding of any NeRF scene that can be used for a variety of downstream tasks in a straightforward way.
We have released pretrained and finetuned checkpoints so you can use our codebase out of the box. There are two main usages: (1) the most common is to use the extracted features directly in a downstream task, such as an FPN head for 3D object detection, and (2) reconstructing the original grid to enforce losses such as a masked reconstruction loss. Below is sample usage of our model with spelled-out comments.
1. Get the features to be used in a downstream task
```python
import torch
from huggingface_hub import hf_hub_download

# SwinTransformer_MAE3D_New is defined in the NeRF-MAE codebase; import it from there.
# Define Swin Transformer configurations
swin_config = {
    "swin_t": {"embed_dim": 96, "depths": [2, 2, 6, 2], "num_heads": [3, 6, 12, 24]},
    "swin_s": {"embed_dim": 96, "depths": [2, 2, 18, 2], "num_heads": [3, 6, 12, 24]},
    "swin_b": {"embed_dim": 128, "depths": [2, 2, 18, 2], "num_heads": [3, 6, 12, 24]},
    "swin_l": {"embed_dim": 192, "depths": [2, 2, 18, 2], "num_heads": [6, 12, 24, 48]},
}
# Set the desired backbone type and the input grid resolution (160^3 RGB+density grid)
backbone_type = "swin_s"
resolution = 160
config = swin_config[backbone_type]
# Initialize Swin Transformer model
model = SwinTransformer_MAE3D_New(
    patch_size=[4, 4, 4],
    embed_dim=config["embed_dim"],
    depths=config["depths"],
    num_heads=config["num_heads"],
    window_size=[4, 4, 4],
    stochastic_depth_prob=0.1,
    expand_dim=True,
    resolution=resolution,
)
# Load checkpoint and remove unused layers
checkpoint_path = hf_hub_download(repo_id="mirshad7/NeRF-MAE", filename="nerf_mae_pretrained.pt")
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
for attr in ["decoder4", "decoder3", "decoder2", "decoder1", "out", "mask_token"]:
    delattr(model, attr)
# Extract features using the Swin Transformer backbone.
# input_grid is the NeRF RGB+density grid; here a random tensor with sample shape (1, 4, 160, 160, 160).
input_grid = torch.randn((1, 4, resolution, resolution, resolution))
features = []
input_grid = model.patch_partition(input_grid) + model.pos_embed.type_as(input_grid).to(input_grid.device).clone().detach()
for stage in model.stages:
    input_grid = stage(input_grid)
    features.append(torch.permute(input_grid, [0, 4, 1, 2, 3]).contiguous())  # Format: [N, C, H, W, D]
# Multi-scale feature shapes: [torch.Size([1, 96, 40, 40, 40]), torch.Size([1, 192, 20, 20, 20]), torch.Size([1, 384, 10, 10, 10]), torch.Size([1, 768, 5, 5, 5])]
# These multi-scale features can now be processed through an FPN head (see the sketch below).
```
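As an illustration of the final comment above, here is a minimal, hypothetical 3D FPN-style neck over the four feature levels. The channel dimensions (96, 192, 384, 768) match the `swin_s` shapes listed in the snippet, but the layer layout is only a sketch and is not the NeRF-RPN FPN head used in our finetuning code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Simple3DFPN(nn.Module):
    """Toy FPN-style neck over the four Swin stages (channel dims assume swin_s)."""

    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv3d(c, out_channels, kernel_size=1) for c in in_channels)
        self.output = nn.ModuleList(nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1) for _ in in_channels)

    def forward(self, feats):
        # feats: list of [N, C, H, W, D] tensors, finest resolution first.
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample each coarser level and add it to the next finer one.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[2:], mode="trilinear", align_corners=False
            )
        return [out(lat) for out, lat in zip(self.output, laterals)]

fpn = Simple3DFPN()
pyramid = fpn(features)  # 'features' comes from the snippet above
print([p.shape for p in pyramid])  # each level now has 256 channels
```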
2. Get the Original Grid Output
```python
import torch

# load_data and build_model are helper functions from this repository's run scripts;
# args holds the usual command-line arguments (e.g., resolution and checkpoint path).
# Load data from the specified folder and filename with the given resolution.
res, rgbsigma = load_data(folder_name, filename, resolution=args.resolution)
# rgbsigma has sample shape torch.randn((1, 4, 160, 160, 160))
# Build the model using provided arguments.
model = build_model(args)
# Load checkpoint if provided.
if args.checkpoint:
    model.load_state_dict(torch.load(args.checkpoint, map_location="cpu")["state_dict"])
model.eval()  # Set model to evaluation mode.
# Run inference to obtain the reconstructed grid for downstream usage.
with torch.no_grad():
    pred = model([rgbsigma], is_eval=True)[3]  # Extract only predictions.
```
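If you use the reconstructed grid to enforce a loss (usage 2 above), a masked reconstruction loss can be computed roughly as in the sketch below. This is a simplified stand-in that assumes a binary voxel mask; it is not the exact patch masking and opacity weighting used in our pretraining code.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(pred, target, mask):
    """MSE between predicted and ground-truth RGB+density grids, averaged over masked voxels only.

    pred, target: [N, 4, H, W, D] grids; mask: [N, 1, H, W, D], 1 where voxels were masked out.
    """
    per_voxel = F.mse_loss(pred, target, reduction="none").mean(dim=1, keepdim=True)
    return (per_voxel * mask).sum() / mask.sum().clamp(min=1)

# Example with random stand-ins for the prediction above and the loaded rgbsigma grid.
target = torch.randn(1, 4, 160, 160, 160)
prediction = torch.randn_like(target)
mask = (torch.rand(1, 1, 160, 160, 160) > 0.25).float()  # ~75% of voxels masked
loss = masked_reconstruction_loss(prediction, target, mask)
```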
### 1. How to plug these features into downstream 3D bounding box detection from NeRFs (i.e. plug-and-play with a [NeRF-RPN](https://github.com/lyclyc52/NeRF_RPN) OBB prediction head)
Please also see the section on [Finetuning](#-finetuning). Our released finetuned checkpoint achieves state-of-the-art 3D object detection in NeRFs. To run evaluation with our finetuned checkpoint on the dataset provided by NeRF-RPN, run the script below after updating the checkpoint path (i.e. ```--checkpoint```) and ```DATA_ROOT```, depending on whether you are evaluating on ```Front3D``` or ```Scannet```:
```bash
bash test_fcos_pretrained.sh
```
Also see the corresponding run file, ```run_fcos_pretrained.py```, and our model adaptation, ```SwinTransformer_FPN_Pretrained_Skip```. This is a minimal adaptation to plug our pretrained weights into a NeRF-RPN architecture and achieve a significant boost in performance.
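Conceptually, plugging in the pretrained weights amounts to copying the matching encoder parameters into the detection backbone while the detection head stays randomly initialized. The sketch below is a hypothetical illustration of that idea (the helper function is ours, not part of the codebase; only the ```state_dict``` checkpoint key follows the loading code above).

```python
import torch
import torch.nn as nn

def load_pretrained_encoder(detector: nn.Module, checkpoint_path: str) -> None:
    """Copy matching encoder weights from a NeRF-MAE checkpoint into a detector backbone.

    Parameters whose names or shapes do not match (e.g. decoder layers, detection
    heads) are simply skipped, so the head stays randomly initialized.
    """
    pretrained_state = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    detector_state = detector.state_dict()
    matched = {
        k: v for k, v in pretrained_state.items()
        if k in detector_state and v.shape == detector_state[k].shape
    }
    missing, _unexpected = detector.load_state_dict(matched, strict=False)
    print(f"Loaded {len(matched)} tensors; {len(missing)} parameters left at their initial values.")

# Usage (inside run_fcos_pretrained.py or your own script):
# detector = SwinTransformer_FPN_Pretrained_Skip(...)
# load_pretrained_encoder(detector, "nerf_mae_pretrained.pt")
```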
## 🗂️ Dataset
Download the preprocessed datasets below.
- Pretraining dataset (comprising NeRF radiance and density grids). [Download link](https://s3.amazonaws.com/tri-ml-public.s3.amazonaws.com/github/nerfmae/NeRF-MAE_pretrain.tar.gz)
- Finetuning dataset (comprising NeRF radiance and density grids with bounding box and semantic label annotations). [3D Object Detection (Provided by NeRF-RPN)](https://drive.google.com/drive/folders/1q2wwLi6tSXu1hbEkMyfAKKdEEGQKT6pj), [3D Semantic Segmentation (Coming Soon)](), [Voxel Super-Resolution (Coming Soon)]()
Extract the pretraining and finetuning datasets under ```NeRF-MAE/datasets```. The directory structure should look like this:
```
NeRF-MAE
├── pretrain
│ ├── features
│ └── nerfmae_split.npz
└── finetune
└── front3d_rpn_data
├── features
├── aabb
└── obb
```
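As a rough guide to consuming these files, the sketch below loads one scene's grid from ```features``` into the ```[1, 4, H, W, D]``` tensor format used in the model usage section. The ```.npz``` layout and the ```rgbsigma``` key are assumptions based on the NeRF-RPN-style export; inspect your downloaded files to confirm the actual keys.

```python
import numpy as np
import torch

def load_feature_grid(npz_path: str) -> torch.Tensor:
    """Load one scene's radiance/density grid as a [1, 4, H, W, D] float tensor."""
    data = np.load(npz_path)
    rgbsigma = data["rgbsigma"]  # assumed key; check data.files for the actual layout
    grid = torch.from_numpy(rgbsigma).float()
    if grid.ndim == 4 and grid.shape[-1] == 4:  # channels-last -> channels-first
        grid = grid.permute(3, 0, 1, 2)
    return grid.unsqueeze(0)

# grid = load_feature_grid("datasets/pretrain/features/<scene>.npz")  # hypothetical path
```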
Note: The above datasets are all you need to train and evaluate our method. Bonus: we will also be releasing our multi-view rendered, posed RGB images from FRONT3D, HM3D and Hypersim, as well as Instant-NGP trained checkpoints, soon (these comprise 1M+ images and 3k+ NeRF checkpoints).
Please note that our dataset was generated using the instructions from [NeRF-RPN](https://github.com/lyclyc52/NeRF_RPN) and [3D-CLR](https://vis-www.cs.umass.edu/3d-clr/). Please consider citing our work, NeRF-RPN, and 3D-CLR if you find this dataset useful in your research.
Please also note that our dataset builds on [Front3D](https://arxiv.org/abs/2011.09127), [Habitat-Matterport3D](https://arxiv.org/abs/2109.08238), [HyperSim](https://github.com/apple/ml-hypersim) and [ScanNet](https://www.scan-net.org/), i.e. we train a NeRF per scene and extract radiance and density grids as well as aligned NeRF-grid 3D annotations. Please read the terms of use for each dataset if you want to utilize their posed multi-view images.
### For more details, please check out our [paper](https://arxiv.org/abs/2404.01300), [GitHub repository](https://github.com/zubair-irshad/NeRF-MAE), and [project page](https://nerf-mae.github.io/)!