Review for "BOLD5000: A public fMRI dataset of 5000 images"

Completed on 28 Sep 2018 by Krzysztof Jacek Gorgolewski.


The manuscript “BOLD5000: A public fMRI dataset of 5000 images” describes one of the most exciting fMRI datasets published in recent years. It includes data from 4 participants, each viewing (and reacting to) ~5k different images. The stimuli were selected very carefully from established collections of images used in computer vision research, providing a bridge between neuroscience and machine learning. I am confident that we will see many fascinating reuses of this dataset in the upcoming years.

Comments to author

- Consider tweaking the title of the manuscript. The problem is that the word “image” could refer either to an MR scan or to a photograph used as a stimulus. Perhaps “BOLD5000: a public dataset of human brain activation during viewing of 5000 images”.

- Figures 2, 3, and 4 would benefit from using normalized histograms with unified bin sizes, displaying both compared distributions on the same set of axes (see *NyGPyuSF9enQDCJGYGiI9A.png for an example).

- Plotting the estimated ROIs on top of the anatomy of each participant would also benefit the manuscript.

- Figure 5: Adding a plot of the distribution of framewise displacement would help your readers assess how much motion to expect in the data.

- Figure 5: It is unclear why a different number of sessions would justify plotting data from one of the participants on a different set of axes.

- Figure 5: Please make the labels for outliers larger (so that they are readable) and make them correspond to participant, session, and run labels.

- Figure 6: it would be good to add a plot from an ROI where you do not expect a response - as a sanity check - for example, the motor cortex.

- Figure 7: bar plots should be replaced with a visualization that depicts the spread of each distribution (whisker plots or violin plots).

- Group-level reports are missing from the MRIQC results, which makes it hard to diagnose the outliers in the QC metrics.

- The BIDS version of the dataset deposited on is missing some data which makes future automatic processing harder. Mainly:

* Missing participants.tsv file with demographic data (Section 8.9 of the BIDS Spec)

* Missing _sessions.tsv file with post-session questionnaire answers (Section 9.1 of the BIDS Spec)

* _events.json data dictionaries do not include a description of column names (Section 4.2 of the BIDS Spec)

* Acquisition datetimes (“Begin” in _events.json) should be anonymized and moved to _scans.tsv files (Section 8.8 of the BIDS Spec)

* Missing stimuli files (cropped images displayed to users) and stim_file columns in the _events.tsv files (Section 8.5 of the BIDS Spec)

* Lack of data dictionaries (_events.json) for localizer events files (Section 4.2 of the BIDS Spec)

* Lack of physiological data (Section 8.6 of the BIDS Spec)
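
To illustrate the histogram suggestion for Figures 2, 3, and 4, here is a minimal sketch of how two distributions can be binned with one shared set of bin edges and normalized so that each integrates to 1. It is only an illustration under assumptions: the function names and the sample values are hypothetical and not taken from the manuscript, and the resulting densities could be drawn on a single set of axes with any plotting library.

```python
def shared_bin_edges(a, b, n_bins=20):
    """Return n_bins + 1 equally spaced edges spanning both samples."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if hi == lo:  # degenerate case: widen the range so bins have nonzero width
        hi = lo + 1.0
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def normalized_histogram(values, edges):
    """Return per-bin densities that integrate to 1 over the given edges."""
    n_bins = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / (len(values) * width) for c in counts]


# Hypothetical values standing in for the two compared distributions.
dataset_a = [0.1, 0.2, 0.2, 0.4, 0.9]
dataset_b = [0.3, 0.5, 0.5, 0.6, 1.0, 1.1]

edges = shared_bin_edges(dataset_a, dataset_b, n_bins=5)
density_a = normalized_histogram(dataset_a, edges)
density_b = normalized_histogram(dataset_b, edges)
```

Because both samples share the same edges and are expressed as densities, differences in sample size no longer distort the visual comparison.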
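
For the framewise displacement plot suggested for Figure 5, the sketch below follows the commonly used definition of Power et al. (2012): the sum of the absolute volume-to-volume changes in the six realignment parameters, with rotations converted to millimeters on a sphere of 50 mm radius. It assumes that each row holds three translations in mm followed by three rotations in radians; column order and units differ between motion correction tools, so this would need to be checked against the actual output. The function name and example values are hypothetical.

```python
def framewise_displacement(motion, radius_mm=50.0):
    """Compute framewise displacement (in mm) for each volume.

    motion: list of rows [tx, ty, tz, rx, ry, rz], with translations in mm
            and rotations in radians (an assumption; check your tool's output).
    Returns one value per volume; the first volume is assigned 0.0.
    """
    if not motion:
        return []
    fd = [0.0]
    for prev, curr in zip(motion, motion[1:]):
        # Absolute change in the three translation parameters.
        trans = sum(abs(c - p) for c, p in zip(curr[:3], prev[:3]))
        # Rotational change converted to arc length on a sphere.
        rot = sum(abs(c - p) * radius_mm for c, p in zip(curr[3:6], prev[3:6]))
        fd.append(trans + rot)
    return fd


# Hypothetical realignment parameters for three volumes.
example_motion = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.002],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.002],
]
example_fd = framewise_displacement(example_motion)
```

The per-volume values from all runs could then be pooled per participant and shown as a distribution.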
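
As a rough illustration of two of the missing BIDS files listed above, the sketch below writes a participants.tsv file and an _events.json data dictionary. It is only a sketch under assumptions: the helper names, the participant label, the demographic columns, the task name in the file name, and the column descriptions are all hypothetical placeholders, and "n/a" stands in for values that only the authors can supply.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_participants_tsv(bids_root, rows):
    """Write a tab-separated participants.tsv at the root of the dataset."""
    path = Path(bids_root) / "participants.tsv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=list(rows[0].keys()), delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
    return path


def write_events_sidecar(path, column_descriptions):
    """Write a JSON data dictionary describing each events.tsv column."""
    sidecar = {
        name: {"Description": text} for name, text in column_descriptions.items()
    }
    Path(path).write_text(json.dumps(sidecar, indent=2) + "\n")
    return sidecar


# Placeholder content; real labels and values must come from the authors.
bids_root = tempfile.mkdtemp()
participants_path = write_participants_tsv(
    bids_root,
    [{"participant_id": "sub-01", "age": "n/a", "sex": "n/a"}],
)
sidecar = write_events_sidecar(
    Path(bids_root) / "task-example_events.json",  # hypothetical task name
    {
        "onset": "Onset of the stimulus in seconds from the start of the run",
        "duration": "Duration of the stimulus presentation in seconds",
        "stim_file": "Path of the presented image relative to the stimuli folder",
    },
)
```

Running the BIDS Validator on the updated dataset would confirm whether the added files are picked up correctly.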