This is a preliminary version of MAE that only covers the pretraining model; the finetune and linear-probing parts are coming soon.

Note: my ViT code is not fully based on timm or BEiT, so the results may be slightly lower than theirs.

This repo is an MAE-ViT model implemented in PyTorch without reference to any existing code, so it is a non-official version. Because of limited time and machines, I only pretrained ViT-Tiny and ViT-Base/16.
```python
img_size = 224,
patch_size = 16,
```
### Encoder Config
| Encoder | dims | depth | heads | mask ratio |
|:---:|:---:|:---:|:---:|:---:|
| ViT-Tiny/16 | 192 | 12 | 3 | 0.75 |
| ViT-Base/16 | 768 | 12 | 12 | 0.75 |
### Decoder Config
| Decoder | dims | depth | heads | mask ratio |
|:---:|:---:|:---:|:---:|:---:|
| ViT-Tiny/16 | 512 | 8 | 16 | 0.75 |
| ViT-Base/16 | 512 | 8 | 16 | 0.75 |
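For orientation, here is a minimal sketch of how these settings fit together. The `MAEConfig` dataclass below is illustrative only, not the repo's actual config class:

```python
from dataclasses import dataclass

@dataclass
class MAEConfig:
    # illustrative container, not the repo's actual config class
    img_size: int = 224
    patch_size: int = 16
    mask_ratio: float = 0.75
    # encoder (ViT-Tiny/16 values from the table above)
    enc_dim: int = 192
    enc_depth: int = 12
    enc_heads: int = 3
    # decoder (shared by Tiny and Base in the table above)
    dec_dim: int = 512
    dec_depth: int = 8
    dec_heads: int = 16

    @property
    def num_patches(self) -> int:
        # 224 / 16 = 14 patches per side -> 196 patches total
        return (self.img_size // self.patch_size) ** 2

    @property
    def num_visible(self) -> int:
        # patches kept by the encoder after 75% masking: 196 * 0.25 = 49
        return int(self.num_patches * (1 - self.mask_ratio))

vit_tiny = MAEConfig()
vit_base = MAEConfig(enc_dim=768, enc_heads=12)
print(vit_tiny.num_patches, vit_tiny.num_visible)  # 196 49
```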
### Mask

Results are pending.
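In the meantime, this is roughly what the 0.75 mask ratio means in an MAE-style pipeline: patch indices are shuffled per sample and only the first 25% are fed to the encoder. The sketch below is illustrative and not taken from this repo's code:

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (B, N, D) patch embeddings. Returns visible patches, binary mask, restore ids."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)   # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)                # random permutation of patch indices
    ids_restore = ids_shuffle.argsort(dim=1)          # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]              # indices of visible patches
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=patches.device)    # 1 = masked, 0 = visible
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)         # unshuffle back to original patch order
    return visible, mask, ids_restore
```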
TODO:
- [ ] Finetune training
- [ ] Linear training
Pretraining reconstruction results on the ImageNet val set: left is the masked image, middle is the reconstruction, and right is the original image.
Large models work significantly better than small models.
## Weights

### Pretrain
ViT-Tiny/16 pretrained model is here.

ViT-Base/16 pretrained model is here.

Training from scratch: training the raw ViT from scratch follows the config in Kaiming's paper, but without EMA for ViT-Base, and with a fixed sin-cos position embedding in place of the learnable one. The ViT-Base/16 from-scratch model is here; its top-1 accuracy is 81.182%, versus 82.3% in the paper (with EMA).
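For reference, a fixed sinusoidal position embedding can be built as below. This is a generic 1D sin-cos sketch in the standard Transformer formulation; the repo may instead build a 2D grid version:

```python
import numpy as np

def sincos_pos_embed(num_positions: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal position embedding of shape (num_positions, dim)."""
    assert dim % 2 == 0
    positions = np.arange(num_positions)[:, None]                 # (N, 1)
    omega = 1.0 / 10000 ** (np.arange(dim // 2) / (dim / 2.0))    # (dim/2,) frequencies
    angles = positions * omega[None, :]                           # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# e.g. 14 x 14 = 196 patch positions with a 768-dim ViT-Base embedding
pos_embed = sincos_pos_embed(14 * 14, 768)
print(pos_embed.shape)  # (196, 768)
```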
### Finetune
The finetune result is 81.5%, but the checkpoint was lost to an accidental `rm -rf`. This is higher than training from scratch.

You can download a checkpoint to test the reconstruction results. Put the ckpt in the `weights` folder.
The train and val files (`$train_file` / `$val_file` in the scripts below) list one `image_path, label` pair per line:
```
/data/home/imagenet/xxx.jpeg, 0
/data/home/imagenet/xxx.jpeg, 1
...
/data/home/imagenet/xxx.jpeg, 999
```
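A minimal sketch of a dataset that parses this `path, label` list format (the class name and parsing details are illustrative; the repo's actual loader may differ):

```python
from PIL import Image
from torch.utils.data import Dataset

class ImageListDataset(Dataset):
    """Reads lines of the form '/path/to/img.jpeg, label'."""

    def __init__(self, list_file: str, transform=None):
        self.samples = []
        with open(list_file) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                path, label = line.rsplit(",", 1)
                self.samples.append((path.strip(), int(label)))
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = Image.open(path).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, label
```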
## Training

### Pretrain
```bash
#!/bin/bash
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
export OMP_NUM_THREADS
export MKL_NUM_THREADS
cd MAE-Pytorch;
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -W ignore -m torch.distributed.launch --nproc_per_node 8 train_mae.py \
--batch_size 256 \
--num_workers 32 \
--lr 1.5e-4 \
--optimizer_name "adamw" \
--cosine 1 \
--max_epochs 300 \
--warmup_epochs 40 \
--num-classes 1000 \
--crop_size 224 \
--patch_size 16 \
--color_prob 0.0 \
--calculate_val 0 \
--weight_decay 5e-2 \
--finetune 0 \
--lars 0 \
--mixup 0.0 \
--smoothing 0.0 \
--train_file $train_file \
--val_file $val_file \
--checkpoints-path $ckpt_folder \
--log-dir $log_folder
```
A second pretrain config with a higher learning rate (1.2e-3) and a longer schedule (400 epochs):

```bash
#!/bin/bash
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
export OMP_NUM_THREADS
export MKL_NUM_THREADS
cd MAE-Pytorch;
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -W ignore -m torch.distributed.launch --nproc_per_node 8 train_mae.py \
--batch_size 256 \
--num_workers 32 \
--lr 1.2e-3 \
--optimizer_name "adamw" \
--cosine 1 \
--max_epochs 400 \
--warmup_epochs 40 \
--num-classes 1000 \
--crop_size 224 \
--patch_size 16 \
--color_prob 0.0 \
--calculate_val 0 \
--weight_decay 5e-2 \
--finetune 0 \
--lars 0 \
--mixup 0.0 \
--smoothing 0.0 \
--train_file $train_file \
--val_file $val_file \
--checkpoints-path $ckpt_folder \
--log-dir $log_folder
```
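The two learning rates above are consistent with the linear scaling rule from the MAE paper, lr = base_lr * effective_batch / 256. Assuming `--batch_size 256` is per GPU and 8 GPUs are used, the effective batch is 2048, so a base lr of 1.5e-4 scales to 1.2e-3; whether `train_mae.py` applies this scaling internally is not verified here.

```python
# linear lr scaling rule from the MAE paper (an assumption, not verified against train_mae.py)
base_lr = 1.5e-4
per_gpu_batch = 256
num_gpus = 8

effective_batch = per_gpu_batch * num_gpus      # 2048 images per optimizer step
scaled_lr = base_lr * effective_batch / 256     # 1.5e-4 * 8 = 1.2e-3
print(scaled_lr)                                # 0.0012
```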
### Finetune (TODO)
```bash
#!/bin/bash
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
export OMP_NUM_THREADS
export MKL_NUM_THREADS
cd MAE-Pytorch;
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -W ignore -m torch.distributed.launch --nproc_per_node 8 train_mae.py \
--batch_size 256 \
--num_workers 32 \
--lr 1.2e-3 \
--optimizer_name "adamw" \
--opt_betas 0.9 0.999 \
--cosine 1 \
--finetune 1 \
--max_epochs 100 \
--warmup_epochs 5 \
--num-classes 1000 \
--crop_size 224 \
--patch_size 16 \
--color_prob 0.0 \
--calculate_val 0 \
--weight_decay 5e-2 \
--lars 0 \
--mixup 0.8 \
--cutmix 1.0 \
--smoothing 0.1 \
--train_file $train_file \
--val_file $val_file \
--checkpoints-path $ckpt_folder \
--log-dir $log_folder
```
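The `--mixup 0.8 --cutmix 1.0 --smoothing 0.1` finetune settings correspond to what timm's `Mixup` helper implements. A minimal sketch of wiring them up, assuming timm is available (this repo may handle the augmentation differently):

```python
import torch
from timm.data import Mixup

# values taken from the finetune command above (--mixup 0.8, --cutmix 1.0, --smoothing 0.1, 1000 classes)
mixup_fn = Mixup(
    mixup_alpha=0.8,
    cutmix_alpha=1.0,
    label_smoothing=0.1,
    num_classes=1000,
)

images = torch.randn(8, 3, 224, 224)               # dummy batch (batch size must be even)
targets = torch.randint(0, 1000, (8,))
images, soft_targets = mixup_fn(images, targets)   # soft_targets: (8, 1000) mixed soft labels
```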
## Inference
Test the reconstruction on a single image:
```bash
python mae_test.py --test_image xxx.jpg --ckpt weights/weights.pth
```
Run inference over the ImageNet validation list:
```bash
python inference.py --test_file val_imagenet.log --ckpt weights/weights.pth
```
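If you want to compute top-1 accuracy yourself from such a validation list, a generic evaluation loop looks like this (a sketch, not the actual logic of `inference.py`):

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda"):
    """Compute top-1 accuracy over a DataLoader yielding (image, label) batches."""
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```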
There may be some problems with the implementation; discussion and code contributions are welcome.