Visual Question Answering (VQA) is a task where, given an image and a question about the image, a model is expected to produce a correct answer.
For example, given an image like this:
The question is: What color is the girl’s necklace?
Our model would generate the answer ‘white’.
MindSpore is a new AI framework developed by Huawei.
NaiveVQA: MindSpore & PyTorch Implementations of a Strong VQA Baseline
This repository contains a naive VQA model, which is our final project (MindSpore implementation) for the course DL4NLP at ZJU. It's a reimplementation of the paper Show, Ask, Attend, and Answer: A Strong Baseline for Visual Question Answering.
Check out the pytorch branch for our PyTorch implementation:
```
git checkout pytorch
```
Per Question Type Accuracy (MindSpore)
Per Question Type Accuracy (PyTorch)
- data/ – the dataset directory:
  - annotations/ – annotations data (ignored)
  - images/ – images data (ignored)
  - questions/ – questions data (ignored)
  - results/ – contains evaluation results produced when you evaluate a model with evaluate.ipynb
  - clean.py – a script to clean up the dataset
  - align.py – a script to sort and align the annotations and questions
- resnet/ – ResNet directory, cloned from pytorch-resnet
- logs/ – should contain saved models (.ckpt) and training records (.json)
- config.py – global configuration file
- view-log.py – a tool for visualizing an accuracy/epoch figure
- val_acc.png – a demo of the accuracy/epoch figure
- model.py – the major model
- preprocess-image.py – preprocesses the images, using ResNet-152 to extract features for further use
- preprocess-image-test.py – extracts features from images in the test set
- preprocess-vocab.py – preprocesses the questions and annotations to get their vocabularies for further use
- data.py – dataset, dataloader, and data processing code
- utils.py – helper code
- evaluate.ipynb – evaluates a model and visualizes the result
- cover_rate.ipynb – calculates the selected answers' coverage
- PythonHelperTools/ (currently not used)
  - vqaDemo.py – a demo of the VQA dataset APIs
- PythonEvaluationTools/ (currently not used)
  - vqaEvalDemo.py – a demo of VQA evaluation
- Free disk space of at least 60GB
- Nvidia GPU / Ascend Platform
Notice: We have successfully tested our code with MindSpore 1.2.1 on an Nvidia RTX 2080 Ti, so we strongly suggest you use the MindSpore 1.2.1 GPU version. Since MindSpore is not yet stable across releases, any version other than 1.2.1 might cause failures.
Also, due to incompatibilities among different versions of MindSpore, we still can't manage to run the code on Ascend for now. Fortunately, people are more likely to have an Nvidia GPU than an Ascend chip :)
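You can confirm the installed version before running anything (a quick check; `__version__` is a standard MindSpore attribute):
```
python -c "import mindspore; print(mindspore.__version__)"  # should print 1.2.1
```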
Get and Prepare the Dataset
Get our VQA dataset (a small subset of VQA 2.0) from here. Unzip the file and move the subdirectories into the data/ directory of the repository.
Prepare your dataset with:
```
# Only run the following commands once!
cd data
# Save the original json files
cp annotations/train.json annotations/train_backup.json
cp questions/train.json questions/train_backup.json
cp annotations/val.json annotations/val_backup.json
cp questions/val.json questions/val_backup.json
cp annotations/test.json annotations/test_backup.json
cp questions/test.json questions/test_backup.json
python clean.py  # run the clean-up script
mv annotations/train_cleaned.json annotations/train.json
mv questions/train_cleaned.json questions/train.json
python align.py  # run the aligning script
mv annotations/train_cleaned.json annotations/train.json
mv annotations/val_cleaned.json annotations/val.json
mv annotations/test_cleaned.json annotations/test.json
mv questions/train_cleaned.json questions/train.json
mv questions/val_cleaned.json questions/val.json
mv questions/test_cleaned.json questions/test.json
```
The scripts above will:
- clean up your dataset (some image ids are referenced in the annotation and question files while the images themselves don't exist!)
- align the question ids for convenience during training
You actually don't have to preprocess the images yourself. We have prepared the preprocessed features file for you; feel free to download it from here (the passcode is 'dl4nlp'). You should download the resnet-14x14.h5 (42GB) file and place it at the repository root directory. Once you've done that, skip this chapter!
Preprocess the images with:
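A minimal invocation, assuming the script reads all of its settings from config.py (check the script itself for any extra arguments):
```
python preprocess-image.py
```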
- If you want to accelerate it, tune up the feature-extraction batch size in config.py
- If you run out of CUDA memory, tune down the batch size
The output should be resnet-14x14.h5.
The vocabulary depends only on the train set, as well as on the config.max_answers (the number of selected candidate answers) you choose.
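For reference, this is the kind of setting involved; the value below is only an illustrative assumption, not the repository's actual default:
```python
# config.py (sketch; the value 3000 is illustrative, not the actual default)
max_answers = 3000  # number of candidate answers kept in the answer vocabulary
```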
Preprocess the questions and annotations to get their vocabularies with:
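Again a minimal invocation, assuming no arguments beyond what is set in config.py:
```
python preprocess-vocab.py
```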
The output should be a vocabulary JSON file (see config.py for its exact path).
Now, you can train the model with:
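The training entry point is not named in the file listing above, so the script name below is an assumption; check the repository root for the actual one:
```
python train.py  # script name is an assumption; see the repository root
```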
During training, a .ckpt file and a .json file will be saved under logs/. The .ckpt file contains the parameters of your model and can be reloaded. The .json file contains training meta-info records.
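For reference, a saved .ckpt can be reloaded along these lines; a minimal sketch, assuming a model class exported from model.py (the class name NaiveVQA here is hypothetical):
```python
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from model import NaiveVQA  # hypothetical class name; see model.py for the actual one

net = NaiveVQA()                                   # construct the network
param_dict = load_checkpoint('logs/example.ckpt')  # illustrative checkpoint path
load_param_into_net(net, param_dict)               # load the saved parameters
```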
View the training process with:
```
python view-log.py <path to .json train record>
```
val_acc.png should look like these:
(a real training run of the PyTorch implementation)
(a real training run of the MindSpore implementation)
To continue training from a pretrained model, set pretrained to True in config.py (and make sure it points at the correct saved checkpoint).
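A sketch of the relevant lines in config.py; pretrained comes from the text above, while the checkpoint-path field name is hypothetical:
```python
# config.py (sketch)
pretrained = True                            # resume from a saved model
pretrained_model_path = 'logs/example.ckpt'  # hypothetical field name, illustrative path
```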
Test Your Model
Likewise, you need to preprocess the test set's images before testing. Run preprocess-image-test.py (see below) to extract features from test/images. The output should be resnet-14x14-test.h5.
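Assuming the script needs no extra arguments beyond config.py:
```
python preprocess-image-test.py
```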
Likewise, we have prepared the resnet-14x14-test.h5 file for you. Download it here (the passcode is 'dl4nlp') and place it at the repository root directory.
Use evaluate.ipynb to test/evaluate the model. Open the notebook and set the correct eval_config, and you're good to go! Just run the cells one by one, and you should be able to visualize the performance of your trained model.
- To calculate the selected answers' cover rate (determined by config.max_answers), use cover_rate.ipynb
The current version of the code is translated from the pytorch branch, where some code is borrowed from the repository pytorch-vqa.