The Brain Imaging Data Structure (BIDS) standard

In the previous section, we pointed out that Nipype can be used to create reproducible analysis pipelines that can be applied across different datasets. This is true, in principle, but in practice, it also relies on one more idea: that of a data standard. This is because to be truly transferable, an analysis pipeline needs to know where to find the data and metadata that it uses for analysis. Thus, in this section, we will shift our focus to talking about how entire neuroimaging projects are (or should be) laid out. Until recently, datasets were usually organized in an idiosyncratic way. Each researcher had to decide on their own what data organization made sense to them. This made data sharing and reuse quite difficult because if you wanted to use someone else’s data, there was a very good chance you’d first have to spend a few days just figuring out what you were looking at and how you could go about reading the data into whatever environment you were comfortable with.

The BIDS specification

Fortunately, things have improved dramatically in recent years. Recognizing that working with neuroimaging data would be considerably easier if everyone adopted a common data representation standard, a group of (mostly) fMRI researchers convened in 2016 to create something now known as the Brain Imaging Data Standard, or BIDS. BIDS wasn’t the first data standard proposed in fMRI, but it has become by far the most widely adopted. Much of the success of BIDS can be traced to its simplicity: the standard deliberately insists not only on machine readability but also on human readability, which means that a machine can ingest a dataset and do all kinds of machine processing with it, but a human looking at the files can also make sense of the dataset, understanding what kinds of data were collected and what experiments were conducted. While there are some nuances and complexities to BIDS, the core of the specification consists of a relatively simple set of rules a human with some domain knowledge can readily understand and implement.We won’t spend much time describing the details of the BIDS specification in this book, as there’s already excellent documentation for that on the project’s website. Instead, we’ll just touch on a couple of core principles. The easiest way to understand what BIDS is about is to dive right into an example. Here’s a sample directory structure we’ve borrowed from the BIDS documentation. It shows a valid BIDS dataset that contains just a single subject:

project/
    sub-control01/
        anat/
            sub-control01_T1w.nii.gz
            sub-control01_T1w.json
            sub-control01_T2w.nii.gz
            sub-control01_T2w.json
        func/
            sub-control01_task-nback_bold.nii.gz
            sub-control01_task-nback_bold.json
            sub-control01_task-nback_events.tsv
            sub-control01_task-nback_physio.tsv.gz
            sub-control01_task-nback_physio.json
            sub-control01_task-nback_sbref.nii.gz
        dwi/
            sub-control01_dwi.nii.gz
            sub-control01_dwi.bval
            sub-control01_dwi.bvec
        fmap/
            sub-control01_phasediff.nii.gz
            sub-control01_phasediff.json
            sub-control01_magnitude1.nii.gz
            sub-control01_scans.tsv
    code/
        deface.py
    derivatives/
    README
    participants.tsv
    dataset_description.json
    CHANGES

There are two important points to note here. First, the BIDS specification imposes restrictions on how files are organized within a BIDS project directory. For example, every subject’s data goes inside a sub-[id] folder below the project root —- where the sub- prefix is required, and the [id] is a researcher-selected string uniquely identifying that subject within the project ("control01" in the example). And similarly, inside each subject directory, we find subdirectories containing data of different modalities: anat for anatomical images; func for functional images; dwi for diffusion-weighted images; and so on. When there are multiple data collection sessions for each subject, an extra level is introduced to the hierarchy, so that functional data from the first session acquired from subject control01 would be stored inside a folder like sub-control01/ses-01/func.Second, valid BIDS files must follow particular naming conventions. The precise naming structure of each file depends on what kind of file it is, but the central idea is that a BIDS filename is always made up of (1) a sequence of key-value pairs, where each key is separated from its corresponding value by a dash, and pairs are separated by underscores; (2) a “suffix” that directly precedes the file extension and describes the type of data contained in the file (this comes from a controlled vocabulary, meaning that it can only be one of a few accepted values, such as "bold" or "dwi"); and (3) an extension that defines the file format.For example, if we take a file like sub-control01/func/sub-control01_task-nback_bold.nii.gz and examine its constituent chunks, we can infer from the filename that the file is a Nifti image (.nii.gz extension) that contains BOLD fMRI data (bold suffix) for task nback acquired from subject control01.Besides these conventions, there are several other key elements of the BIDS specification. We won’t discuss them in detail, but it’s good to at least be aware of them:

- Every data file should be accompanied by a JSON “sidecar” containing metadata describing that file. For example, a BOLD data file might be accompanied by a side-car file that describes acquisition parameters, such as repetition time.
- BIDS follows an “inheritance” principle —- meaning that JSON metadata files higher up in the hierarchy automatically apply to relevant files lower in the hierarchy unless explicitly overridden. For example, if all of the BOLD data in a single dataset was acquired using the same protocol, this metadata need not be replicated in each subject’s data folder.
- Every project is required to have a dataset_description.json file at the root level that contains basic information about the project (e.g., the name of the dataset and a description of its constituents, as well as citation information).
- BIDS doesn’t actively prohibit you from including non-BIDS-compliant files in a BIDS project -— so you don’t have to just throw out files that you can’t easily shoehorn into the BIDS format. The downside of including non-compliant files is just that most BIDS tools and/or human users won’t know what to do with them, so your dataset might not be quite as useful as it otherwise would be.

BIDS Derivatives

The BIDS specification was originally created with static representations of neuroimaging datasets in mind. But it quickly became clear that it would also be beneficial for the standard to handle derivatives of datasets -— that is, new BIDS datasets generated by applying some transformation to one or more existing BIDS datasets. For example, suppose we have a BIDS dataset containing raw fMRI images. Typically, we’ll want to preprocess our images (for example, to remove artifacts, apply motion correction, temporally filter the signal, etc.) before submitting them to analysis. It’s great if our preprocessing pipeline can take BIDS datasets as inputs, but what should it then do with the output? A naive approach would be to just construct a new BIDS dataset that’s very similar to the original one, but replace the original (raw) fMRI images with new (preprocessed) ones. But that’s likely to confuse: a user could easily end up with many different versions of the same BIDS dataset, yet have no formal way to determine the relationship between them. To address this problem, the BIDS-Derivatives extension introduces some additional metadata and file naming conventions that make it easier to chain BIDS-aware tools (see the next section) without chaos taking hold.

The BIDS Ecosystem

At this point, you might be wondering: what is BIDS good for? Surely the point of introducing a new data standard isn’t just to inconvenience people by forcing them to spend their time organizing their data a certain way? There must be some benefits to individual researchers — and ideally, the community as a whole -— spending precious time making datasets and workflows BIDS-compliant, right? Well, yes, there are! The benefits of buying into the BIDS ecosystem are quite substantial. Let’s look at a few.

Easier data sharing and reuse

One obvious benefit we alluded to above is that sharing and re-using neuroimaging data becomes much easier once many people agree to organize their data the same way. As a trivial example, once you know that BIDS organizes data according to a fixed hierarchy (i.e., subject –> session –> run), it’s easy to understand other people’s datasets. There’s no chance of finding time-course images belonging to subject 1 in, say, /imaging/old/NbackTask/rawData/niftis/controlgroup/1/. But the benefits of BIDS for sharing and reuse come into full view once we consider the impact on public data repositories. While neuroimaging repositories have been around for a long time (for an early review, see {cite}van2001functional), their utility was long hampered by the prevalence of idiosyncratic file formats and project organizations. By supporting the BIDS standard, data repositories open the door to a wide range of powerful capabilities.

To illustrate, consider OpenNeuro — currently the largest and most widely-used repository of brain MRI data. OpenNeuro requires uploaded datasets to be in BIDS format (though datasets do not have to be fully compliant). As a result, the platform can automatically extract, display, and index important metadata. For example, the number of subjects, sessions, and runs in each dataset; the data modalities and experimental tasks present; a standardized description of the dataset; and so on. Integration with free analysis platforms like BrainLife is possible, as is structured querying over datasets via OpenNeuro’s GraphQL API endpoint.

Perhaps most importantly, the incremental effort required by users to make their BIDS-compliant datasets publicly available and immediately usable by others is minimal: in most cases, users have only to click an Upload button and locate the project they wish to share (there is also a command-line interface, for users who prefer to interact with OpenNeuro programmatically).

BIDS-Apps

A second benefit to representing neuroimaging data in BIDS is that one immediately gains access to a large, and rapidly growing, ecosystem of BIDS-compatible tools. If you’ve used different neuroimaging tools in your research -— for example, perhaps you’ve tried out both FSL and SPM (the two most widely used fMRI data analysis suites) —- you’ll have probably had to do some work to get your data into a somewhat different format for each tool. In a world without standards, tool developers can’t be reasonably expected to know how to read your particular dataset, so the onus falls on you to get your data into a compatible format. In the worst case, this means that every time you want to use a new tool, you have to do some more work.

By contrast, for tools that natively support BIDS, life is simpler. Once we know that fMRIPrep -— a very popular preprocessing pipeline for fMRI data {cite}Esteban2019-md -— takes valid BIDS datasets as inputs, the odds are very high that we’ll be able to apply fMRIPrep to our own valid BIDS datasets with little or no additional work. To facilitate the development and use of these kinds of tools, BIDS developed a lightweight standard for “BIDS Apps” {cite}Gorgolewski2017-mb. A BIDS App is an application that takes one or more BIDS datasets as input. There is no restriction on what a BIDS App can do, or what it’s allowed to output (though many BIDS Apps output BIDS-Derivatives datasets); the only requirement is that a BIDS App is containerized (using Docker or Singularity; see {numref}docker), and accept valid BIDS datasets as input. New BIDS Apps are continuously being developed, and as of this writing, the BIDS Apps website lists a few dozen apps.

What’s particularly nice about this approach is that it doesn’t necessarily require the developers of existing tools to do a lot of extra work themselves to support BIDS. In principle, anyone can write a BIDS-App “wrapper” that mediates between the BIDS format and whatever format a tool natively expects to receive data in. So, for example, the BIDS-Apps registry already contains BIDS-Apps for packages or pipelines like SPM, CPAC, Freesurfer, and the Human Connectome Project Pipelines. Some of these apps are still fairly rudimentary and don’t cover all of the functionality provided by the original tools, but others support much or most of the respective native tool’s functionality. And of course, many BIDS-Apps aren’t wrappers around other tools at all; they’re entirely new tools designed from the very beginning to support only BIDS datasets. We’ve already mentioned fMRIPrep, which has very quickly become arguably the de facto preprocessing pipeline in fMRI; another widely-used BIDS-App is MRIQC {cite}Esteban2017-tu, a tool for automated quality control and quality assessment of structural and functional MRI scans, which we will see in action in {numref}nibabel. Although the BIDS-Apps ecosystem is still in its infancy, the latter two tools already represent something close to killer applications for many researchers.

To demonstrate this statement, consider how easy it is to run fMRIPrep once your data is organized in the BIDS format. After installing the software and its dependencies, running the software is as simple as issue this command in the terminal:

fmriprep data/bids_root/ out/ participant -w work/

where, data/bids_root points to a directory that contains a BIDS-organized dataset that includes fMRI data, out points to the directory into which the outputs (the BIDS derivatives) will be saved and work is a directory that will store some of the intermediate products that will be generated along the way. Looking at this, it might not be immediately apparent how important BIDS is for this to be so simple, but consider what software would need to do to find all of the fMRI data inside of a complex dataset of raw MRI data, that might contain other data types, other files and so forth. Consider also the complexity that arises from the fact that fMRI data can be collected using many different acquisition protocols, and the fact that fMRI processing sometimes uses other information (for example, measurements of the field map, or anatomical T1-weighted or T2-weighted scans). The fact that the data complies with BIDS allows fMRIPrep to locate everything that it needs with the dataset and to make use of all the information to perform the preprocessing to the best of its ability given the provided data.

Utility libraries

Lastly, the widespread adoption of BIDS has also spawned a large number of utility libraries designed to help developers (rather than end users) build their analysis pipelines and tools more efficiently. Suppose I’m writing a script to automate my lab’s typical fMRI analysis workflow. It’s a safe bet that, at multiple points in my script, I’ll need to interact with the input datasets in fairly stereotyped and repetitive ways. For instance, I might need to search through the project directory for all files containing information about event timing, but only for a particular experimental task. Or, I might need to extract some metadata containing key parameters for each time-series image I want to analyze (e.g., the repetition time, or TR). Such tasks are usually not very complicated, but they’re tedious and can slow down development considerably. Worse, at a community level, they introduce massive inefficiency, because each person working on their analysis script ends up writing their own code to solve what are usually very similar problems.

A good utility library abstracts away a lot of this kind of low-level work and allows researchers and analysts to focus most of their attention on high-level objectives. By standardizing the representation of neuroimaging data, BIDS makes it much easier to write good libraries of this sort. Probably the best example so far is a package called PyBIDS, which provides a convenient Python interface for basic querying and manipulation of BIDS datasets. To give you a sense of how much work PyBIDS can save you when you’re writing neuroimaging analysis code in Python, let’s take a look at some of the things the package can do.

We start by importing an object called BIDSLayout, which we will use to manage and query the layout of files on disk. We also import a function that knows how to locate some test data that was installed on our computer together with the PyBIDS software library.

In [2]:

from bids import BIDSLayout
from bids.tests import get_test_data_path

One of the datasets that we have within the test data path is data set number 5 from OpenNeuro. Note that the software has not actually installed into our hard-drive a bunch of neuroimaging data — that would be too large! Instead, the software installed a bunch of files that have the right names and are organized in the right way, but are mostly empty. This allows us to demonstrate the way that the software works, but don’t try reading the neuroimaging data from any of the files in that directory. We’ll work with a more manageable BIDS dataset, including the files in it in {numref}nibabel.

For now, we initialize a BIDSLayout object, by pointing to the location of the dataset in our disk. When we do that, the software scans through that part of our file-system, validates that it is a properly organized BIDS dataset, and finds all of the files that are arranged according to the specification. This allows the object to already infer some things about the dataset. For example, the dataset has 16 subjects and 48 total runs (here a “run” is an individual fMRI scan). The person who organized this dataset decided not to include a session folder for each subject. Presumably, because each subject participated in just one session in this experiment, that information is not useful.

In [3]:

layout = BIDSLayout(get_test_data_path() + "/ds005")
print(layout)

BIDS Layout: ...packages/bids/tests/data/ds005 | Subjects: 16 | Sessions: 0 | Runs: 48

The layout object now has a method called get(), which we can use to gain access to various parts of the dataset. For example, we can ask to give us a list of the filenames of all of the anatomical ("T1w") scans that were collected for subjects sub-01 and sub-02

In [4]:

layout.get(subject=['01', '02'],  suffix="T1w", return_type='filename')

Out[4]:

['/home/runner/.local/lib/python3.10/site-packages/bids/tests/data/ds005/sub-01/anat/sub-01_T1w.nii.gz',
 '/home/runner/.local/lib/python3.10/site-packages/bids/tests/data/ds005/sub-02/anat/sub-02_T1w.nii.gz']

Or, using a slightly different logic, all of the functional ("bold") scans collected for subject sub-03

In [5]:

layout.get(subject='03', suffix="bold", return_type='filename')

Out[5]:

['/home/runner/.local/lib/python3.10/site-packages/bids/tests/data/ds005/sub-03/func/sub-03_task-mixedgamblestask_run-01_bold.nii.gz',
 '/home/runner/.local/lib/python3.10/site-packages/bids/tests/data/ds005/sub-03/func/sub-03_task-mixedgamblestask_run-02_bold.nii.gz',
 '/home/runner/.local/lib/python3.10/site-packages/bids/tests/data/ds005/sub-03/func/sub-03_task-mixedgamblestask_run-03_bold.nii.gz']

In these examples, we asked the BIDSLayout object to give us the 'filename' return type. This is because if we don’t explicitly ask for a return type, we will get back a list of BIDSImageFile objects. For example, selecting the first one of these for sub-03‘s fMRI scans:

In [6]:

bids_files = layout.get(subject="03", suffix="bold")
bids_image = bids_files[0]

This object is quite useful, of course. For example, it knows how to parse the file name into meaningful entities, using the get_entities() method, which returns a dictionary with entities such as subject and task that can be used to keep track of things in analysis scripts.

In [7]:

bids_image.get_entities()

Out[7]:

{'datatype': 'func',
 'extension': '.nii.gz',
 'run': 1,
 'subject': '03',
 'suffix': 'bold',
 'task': 'mixedgamblestask'}

In most cases, you can also get direct access to the imaging data using the BIDSImageFile object. This object has a get_image method, which would usually return a nibabel Nifti1Image object. As you will see in {numref}nibabel this object lets you extract metadata, or even read the data from a file into memory as a Numpy array. However, in this case, calling the get_image method would raise an error, because, as we mentioned above, the files do not contain any data. So, let’s look at another kind of file that you can read directly in this case. In addition to the neuroimaging data, BIDS provides instructions on how to organize files that record the behavioral events that occurred during an experiment. These are stored as tab-separated-values (‘.tsv’) files, and there is one for each run in the experiment. For example, for this dataset, we can query for the events that happened during subject sub-03‘s 3rd run:

In [8]:

events = layout.get(subject='03', extension=".tsv", task="mixedgamblestask", run="03")
tsv_file = events[0]
print(tsv_file)

<BIDSDataFile filename='/home/runner/.local/lib/python3.10/site-packages/bids/tests/data/ds005/sub-03/func/sub-03_task-mixedgamblestask_run-03_events.tsv'>

Instead of a BIDSImageFile, the variable tsv_file is now a BIDSDataFile object, and this kind of object has a get_df method, which returns a Pandas DataFrame object

In [9]:

bids_df = tsv_file.get_df()
bids_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   onset                       85 non-null     float64
 1   duration                    85 non-null     int64  
 2   trial_type                  85 non-null     object 
 3   distance from indifference  0 non-null      float64
 4   parametric gain             85 non-null     float64
 5   parametric loss             0 non-null      float64
 6   gain                        85 non-null     int64  
 7   loss                        85 non-null     int64  
 8   PTval                       85 non-null     float64
 9   respnum                     85 non-null     int64  
 10  respcat                     85 non-null     int64  
 11  RT                          85 non-null     float64
dtypes: float64(6), int64(5), object(1)
memory usage: 8.1+ KB

This kind of functionality is useful if you are planning to automate your analysis over large datasets that can include heterogeneous acquisitions between subjects and within subjects. At the very least, we hope that the examples have conveyed to you the power inherent in organizing your data according to a standard, as a starting point to use, and maybe also develop, analysis pipelines that expect data in this format. We will see more examples of this in practice in the next section.

Exercises

BIDS has a set of example datasets available in a GitHub repository at https://github.com/bids-standard/bids-examples. Clone the repository and use pyBIDS to explore the dataset called “ds011”. Using only pyBIDS code, can you figure out how many subjects participated in this study? What are the values of TR that were used in fMRI acquisitions in this dataset?

Additional resources

There are many places to learn more about the NiPy community. The documentation of each of the software libraries that were mentioned here includes fully worked-out examples of data analysis, in addition to links to tutorials, videos, and so forth. For Nipype, in particular, we recommend Michael Notter’s Nipype tutorial as a good starting point.

BIDS is not only a data standard but also a community of developers and users that support the use and further development of the standard. For example, over the time since the standard was first proposed, the community has added instructions to support the organization and sharing of new data modalities (e.g., intracranial EEG) and derivatives of processing. The strength of such a community is that, like open-source software, it draws on the individual strengths of many individuals and many groups who are willing to spend time and effort evolving and improving it. One of the resources that the community has developed to help people who are new to the standard learn more about it is the BIDS Starter Kit. It is a website that includes materials (videos, tutorials, explanations, and examples) to help you get started learning about and eventually using the BIDS standard.

In addition to these relatively static resources, users of both NiPy software and BIDS can interact with other members of the community through a dedicated questions and answers website called Neurostars. On this website, anyone can ask questions about neuroscience software and get answers from experts who frequent the website. In particular, the developers of many of the projects that we mentioned in this chapter, and many of the people who work on the BIDS standard often answer questions about these projects through this website.

Reference

Rokem A, Yarkoni T. Data Science for Neuroimaging: An Introduction. 2023.

cpc.knu.ac.kr

Computational Psychiatry Center

The Brain Imaging Data Structure (BIDS) standard