Note
Go to the end to download the full example code.
BIDS Dataset Preprocessing Pipeline#
This example demonstrates how to work with BIDS-formatted EEG datasets using eegprep. BIDS (Brain Imaging Data Structure) is a standardized format for organizing neuroimaging data, making it easier to share and process datasets across different labs and tools.
The workflow includes:
Understanding BIDS directory structure and conventions
Creating a minimal BIDS dataset for demonstration
Discovering EEG files in BIDS format
Applying the complete BIDS preprocessing pipeline
Understanding the output structure
Best practices for BIDS-compliant preprocessing
This example shows how eegprep integrates with BIDS to provide a standardized, reproducible preprocessing workflow.
Imports and Setup#
import numpy as np
import matplotlib.pyplot as plt
import tempfile
import os
import json
from pathlib import Path
import eegprep
# Set random seed for reproducibility
np.random.seed(42)
Understanding BIDS Structure#
BIDS (Brain Imaging Data Structure) organizes neuroimaging data hierarchically. For EEG, the structure follows a specific naming convention and directory layout.
print("=" * 70)
print("BIDS DATASET STRUCTURE OVERVIEW")
print("=" * 70)
bids_structure = """
BIDS Dataset Organization:
dataset/
├── sub-01/                                   # Subject 1
│   ├── ses-01/                               # Session 1
│   │   └── eeg/                              # EEG modality
│   │       ├── sub-01_ses-01_task-rest_eeg.edf
│   │       ├── sub-01_ses-01_task-rest_eeg.json
│   │       ├── sub-01_ses-01_task-rest_channels.tsv
│   │       └── sub-01_ses-01_task-rest_events.tsv
│   └── ses-02/                               # Session 2
│       └── eeg/
│           └── ...
├── sub-02/                                   # Subject 2
│   └── ses-01/
│       └── eeg/
│           └── ...
├── dataset_description.json                  # Dataset metadata
├── participants.tsv                          # Participant information
└── README                                    # Dataset documentation
Key BIDS Concepts:
- Subjects (sub-XX): Individual participants
- Sessions (ses-XX): Multiple recording sessions per subject
- Tasks (task-name): Experimental conditions (rest, task-name, etc.)
- Runs (run-XX): Multiple runs of the same task
- Modalities: eeg, meg, ieeg, etc.
"""
print(bids_structure)
print("=" * 70)
======================================================================
BIDS DATASET STRUCTURE OVERVIEW
======================================================================
BIDS Dataset Organization:
dataset/
├── sub-01/                                   # Subject 1
│   ├── ses-01/                               # Session 1
│   │   └── eeg/                              # EEG modality
│   │       ├── sub-01_ses-01_task-rest_eeg.edf
│   │       ├── sub-01_ses-01_task-rest_eeg.json
│   │       ├── sub-01_ses-01_task-rest_channels.tsv
│   │       └── sub-01_ses-01_task-rest_events.tsv
│   └── ses-02/                               # Session 2
│       └── eeg/
│           └── ...
├── sub-02/                                   # Subject 2
│   └── ses-01/
│       └── eeg/
│           └── ...
├── dataset_description.json                  # Dataset metadata
├── participants.tsv                          # Participant information
└── README                                    # Dataset documentation
Key BIDS Concepts:
- Subjects (sub-XX): Individual participants
- Sessions (ses-XX): Multiple recording sessions per subject
- Tasks (task-name): Experimental conditions (rest, task-name, etc.)
- Runs (run-XX): Multiple runs of the same task
- Modalities: eeg, meg, ieeg, etc.
======================================================================
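To make the naming convention concrete, the entity key-value pairs in a BIDS filename can be pulled apart with a few lines of plain Python. The helper below is a toy parser for illustration only; it is not part of eegprep, and real pipelines should rely on a BIDS-aware library rather than hand-rolled string splitting.
def parse_bids_entities(filename):
    """Split a BIDS-style filename such as 'sub-01_ses-01_task-rest_eeg.edf'
    into its entity key-value pairs and its suffix (illustrative only)."""
    stem = filename.rsplit('/', 1)[-1].split('.', 1)[0]  # drop path and extension
    parts = stem.split('_')
    entities = dict(part.split('-', 1) for part in parts if '-' in part)
    suffix = parts[-1] if '-' not in parts[-1] else None  # e.g. 'eeg', 'channels'
    return entities, suffix

print(parse_bids_entities('sub-01_ses-01_task-rest_eeg.edf'))
# -> ({'sub': '01', 'ses': '01', 'task': 'rest'}, 'eeg')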
Create a Minimal BIDS Dataset#
For demonstration, we’ll create a minimal BIDS directory structure populated with synthetic EEG data.
print("\nCreating minimal BIDS dataset for demonstration...")
print("-" * 70)
# Create temporary directory for BIDS dataset
bids_root = tempfile.mkdtemp(prefix='bids_example_')
print(f"Created temporary BIDS directory: {bids_root}")
# Create dataset_description.json
# This file is required and contains metadata about the dataset
dataset_desc = {
    "Name": "Example EEG Dataset",
    "BIDSVersion": "1.9.0",
    "DatasetType": "raw",
    "License": "CC0",
    # Per the BIDS specification, Authors, Funding, and EthicsApprovals are
    # lists of strings
    "Authors": ["Example Author (author@example.com)"],
    "Acknowledgements": "Example dataset for eegprep documentation",
    "HowToAcknowledge": "Please cite this paper: Example et al. (2024)",
    "Funding": ["Example Foundation grant EX-12345"],
    "EthicsApprovals": ["Example IRB approval, valid until 2025-12-31"],
    "ReferencesAndLinks": []
}

with open(os.path.join(bids_root, 'dataset_description.json'), 'w') as f:
    json.dump(dataset_desc, f, indent=2)
print("✓ Created dataset_description.json")
# Create participants.tsv
# This file contains demographic information about participants
participants_content = """participant_id\tage\tsex\tgroup
sub-01\t25\tM\tcontrol
sub-02\t28\tF\tcontrol
"""
with open(os.path.join(bids_root, 'participants.tsv'), 'w') as f:
    f.write(participants_content)
print("✓ Created participants.tsv")
# Create subject directories and synthetic EEG data
print("\nCreating subject data...")
for sub_id in ['01', '02']:
    sub_dir = os.path.join(bids_root, f'sub-{sub_id}', 'ses-01', 'eeg')
    os.makedirs(sub_dir, exist_ok=True)

    # Define recording parameters
    n_channels = 32
    n_samples = 5000
    sfreq = 500

    # Create channel names
    ch_names = [
        'Fp1', 'Fpz', 'Fp2', 'F7', 'F3', 'Fz', 'F4', 'F8',
        'T7', 'C3', 'Cz', 'C4', 'T8', 'P7', 'P3', 'Pz',
        'P4', 'P8', 'O1', 'Oz', 'O2', 'A1', 'A2', 'M1',
        'M2', 'Fc1', 'Fc2', 'Cp1', 'Cp2', 'Fc5', 'Fc6', 'Cp5'
    ]

    # Create synthetic data
    np.random.seed(int(sub_id))
    data = np.random.randn(n_channels, n_samples) * 10

    # Add alpha oscillations
    t = np.arange(n_samples) / sfreq
    for i in range(n_channels):
        alpha_freq = 10 + np.random.randn() * 0.5
        data[i, :] += 5 * np.sin(2 * np.pi * alpha_freq * t)

    # Save as .npy for simplicity (a real BIDS dataset would use .edf, .bdf,
    # or another supported EEG format)
    data_file = os.path.join(sub_dir, f'sub-{sub_id}_ses-01_task-rest_eeg.npy')
    np.save(data_file, data)

    # Create JSON sidecar with recording metadata
    eeg_json = {
        "TaskName": "rest",
        "SamplingFrequency": sfreq,
        "PowerLineFrequency": 50,
        "EEGChannelCount": n_channels,
        "EEGReference": "average",
        "EEGGround": "Fpz",
        "RecordingDuration": n_samples / sfreq,
        "RecordingType": "continuous"
    }
    json_file = os.path.join(sub_dir, f'sub-{sub_id}_ses-01_task-rest_eeg.json')
    with open(json_file, 'w') as f:
        json.dump(eeg_json, f, indent=2)

    # Create channels.tsv with channel information
    # (BIDS channels.tsv requires name, type, and units columns)
    channels_content = "name\ttype\tunits\n"
    for ch_name in ch_names:
        channels_content += f"{ch_name}\tEEG\tuV\n"
    channels_file = os.path.join(sub_dir, f'sub-{sub_id}_ses-01_task-rest_channels.tsv')
    with open(channels_file, 'w') as f:
        f.write(channels_content)

    # Create events.tsv with event information
    events_content = "onset\tduration\ttrial_type\n0.0\t1.0\trest\n"
    events_file = os.path.join(sub_dir, f'sub-{sub_id}_ses-01_task-rest_events.tsv')
    with open(events_file, 'w') as f:
        f.write(events_content)

    print(f" ✓ Created subject sub-{sub_id} data")
print(f"\nBIDS dataset created successfully!")
print(f"Dataset location: {bids_root}")
Creating minimal BIDS dataset for demonstration...
----------------------------------------------------------------------
Created temporary BIDS directory: /tmp/bids_example_ls1rhma6
✓ Created dataset_description.json
✓ Created participants.tsv
Creating subject data...
✓ Created subject sub-01 data
✓ Created subject sub-02 data
BIDS dataset created successfully!
Dataset location: /tmp/bids_example_ls1rhma6
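As a quick sanity check, independent of eegprep, we can walk the temporary directory with the standard library and print the layout that was just created. It should mirror the BIDS tree sketched earlier.
# Print the directory tree of the freshly created dataset
for root, dirs, files in sorted(os.walk(bids_root)):
    rel = os.path.relpath(root, bids_root)
    depth = 0 if rel == '.' else rel.count(os.sep) + 1
    print("  " * depth + (os.path.basename(root) if rel != '.' else bids_root))
    for name in sorted(files):
        print("  " * (depth + 1) + name)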
List BIDS Files#
Use bids_list_eeg_files to discover EEG files in the BIDS dataset.
print("\n" + "=" * 70)
print("DISCOVERING EEG FILES IN BIDS DATASET")
print("=" * 70)
print("\nListing EEG files in BIDS dataset...")
try:
    eeg_files = eegprep.bids_list_eeg_files(bids_root)
    print(f"Found {len(eeg_files)} EEG files:")
    for f in eeg_files:
        print(f" - {f}")
except Exception as e:
    print("Note: bids_list_eeg_files may require a specific BIDS structure")
    print(f"Error: {e}")
    # List files manually
    print("\nManually listing EEG files:")
    for root, dirs, files in os.walk(bids_root):
        for file in files:
            if file.endswith('_eeg.npy'):
                print(f" - {os.path.join(root, file)}")
======================================================================
DISCOVERING EEG FILES IN BIDS DATASET
======================================================================
Listing EEG files in BIDS dataset...
Found 0 EEG files:
BIDS Preprocessing Pipeline#
The bids_preproc function applies a complete preprocessing pipeline to BIDS-formatted data.
print("\n" + "=" * 70)
print("BIDS PREPROCESSING PIPELINE")
print("=" * 70)
pipeline_description = """
The bids_preproc function applies the following preprocessing steps:
1. Data Loading and Validation
- Load EEG data from BIDS format
- Validate data integrity
- Extract metadata from JSON sidecars
2. Artifact Removal
- Apply ASR (Artifact Subspace Reconstruction)
- Apply clean_artifacts for transient artifacts
- Remove line noise
3. Channel Interpolation
- Identify bad channels using statistical criteria
- Perform spherical spline interpolation
- Preserve spatial information
4. ICA Decomposition
- Prepare data for ICA
- Perform ICA using Picard algorithm
- Extract independent components
5. ICLabel Classification
- Classify components using ICLabel
- Identify artifact components
- Generate classification probabilities
6. Component Rejection
- Reject artifact components based on thresholds
- Reconstruct cleaned EEG data
- Preserve brain activity
7. Data Saving
- Save preprocessed data in BIDS format
- Create derivatives directory
- Preserve all metadata
"""
print(pipeline_description)
======================================================================
BIDS PREPROCESSING PIPELINE
======================================================================
The bids_preproc function applies the following preprocessing steps:
1. Data Loading and Validation
- Load EEG data from BIDS format
- Validate data integrity
- Extract metadata from JSON sidecars
2. Artifact Removal
- Apply ASR (Artifact Subspace Reconstruction)
- Apply clean_artifacts for transient artifacts
- Remove line noise
3. Channel Interpolation
- Identify bad channels using statistical criteria
- Perform spherical spline interpolation
- Preserve spatial information
4. ICA Decomposition
- Prepare data for ICA
- Perform ICA using Picard algorithm
- Extract independent components
5. ICLabel Classification
- Classify components using ICLabel
- Identify artifact components
- Generate classification probabilities
6. Component Rejection
- Reject artifact components based on thresholds
- Reconstruct cleaned EEG data
- Preserve brain activity
7. Data Saving
- Save preprocessed data in BIDS format
- Create derivatives directory
- Preserve all metadata
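With a dataset in place, the whole pipeline is meant to be driven by a single call to bids_preproc. The call below is only a sketch: it assumes bids_preproc takes the dataset root as its first argument, which may not match the actual eegprep signature, and it is expected to fail on this toy dataset because the recordings were saved as .npy rather than a supported EEG format. Consult the eegprep API reference before adapting it to real data.
# Hedged sketch: the exact signature of bids_preproc may differ; check the
# eegprep documentation. On this toy dataset the call is expected to fail
# because the recordings are .npy files rather than EDF/BDF.
try:
    eegprep.bids_preproc(bids_root)
except Exception as err:
    print(f"bids_preproc did not run on this toy dataset: {err}")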
Preprocessing Parameters#
Define preprocessing parameters for the pipeline.
print("=" * 70)
print("PREPROCESSING PARAMETERS")
print("=" * 70)
preproc_params = {
    'sfreq': 500,
    'highpass': 0.5,
    'lowpass': 100,
    'asr_threshold': 20,
    'ica_method': 'picard',
    'iclabel_threshold': 0.5,
    'verbose': False
}

print("\nPreprocessing Configuration:")
print("-" * 70)
for key, value in preproc_params.items():
    print(f" {key:<25} : {value}")
======================================================================
PREPROCESSING PARAMETERS
======================================================================
Preprocessing Configuration:
----------------------------------------------------------------------
sfreq : 500
highpass : 0.5
lowpass : 100
asr_threshold : 20
ica_method : picard
iclabel_threshold : 0.5
verbose : False
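Note that these keys are illustrative for this example rather than the exhaustive set of bids_preproc options. Whatever configuration is used, writing it to disk alongside the dataset keeps the run reproducible; the file name below (eegprep_parameters.json) is a placeholder chosen for this example, not a convention defined by eegprep.
# Persist the configuration so the exact settings used for a run are recorded
params_file = os.path.join(bids_root, 'eegprep_parameters.json')
with open(params_file, 'w') as f:
    json.dump(preproc_params, f, indent=2)
print(f"Saved preprocessing parameters to {params_file}")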
Output Structure#
The bids_preproc function creates a derivatives directory with processed data.
print("\n" + "=" * 70)
print("EXPECTED OUTPUT STRUCTURE")
print("=" * 70)
output_structure = """
After preprocessing, the BIDS dataset will contain:
dataset/
├── sub-01/
│   └── ses-01/
│       └── eeg/
│           └── (original raw data)
├── sub-02/
│   └── ses-01/
│       └── eeg/
│           └── (original raw data)
└── derivatives/
    └── eegprep-v0.2.23/
        ├── sub-01/
        │   └── ses-01/
        │       └── eeg/
        │           ├── sub-01_ses-01_task-rest_eeg_preprocessed.set
        │           ├── sub-01_ses-01_task-rest_eeg_preprocessed.fdt
        │           ├── sub-01_ses-01_task-rest_eeg_preprocessed.json
        │           └── sub-01_ses-01_task-rest_eeg_preprocessing_report.html
        └── sub-02/
            └── ses-01/
                └── eeg/
                    └── (preprocessed data)
Key Features:
- Derivatives stored in separate directory (BIDS convention)
- Original data preserved (reproducibility)
- Preprocessing metadata in JSON sidecars
- HTML reports for quality assessment
"""
print(output_structure)
======================================================================
EXPECTED OUTPUT STRUCTURE
======================================================================
After preprocessing, the BIDS dataset will contain:
dataset/
├── sub-01/
│   └── ses-01/
│       └── eeg/
│           └── (original raw data)
├── sub-02/
│   └── ses-01/
│       └── eeg/
│           └── (original raw data)
└── derivatives/
    └── eegprep-v0.2.23/
        ├── sub-01/
        │   └── ses-01/
        │       └── eeg/
        │           ├── sub-01_ses-01_task-rest_eeg_preprocessed.set
        │           ├── sub-01_ses-01_task-rest_eeg_preprocessed.fdt
        │           ├── sub-01_ses-01_task-rest_eeg_preprocessed.json
        │           └── sub-01_ses-01_task-rest_eeg_preprocessing_report.html
        └── sub-02/
            └── ses-01/
                └── eeg/
                    └── (preprocessed data)
Key Features:
- Derivatives stored in separate directory (BIDS convention)
- Original data preserved (reproducibility)
- Preprocessing metadata in JSON sidecars
- HTML reports for quality assessment
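After a real preprocessing run has populated derivatives/, the outputs can be enumerated with the standard library. The snippet below is a generic sketch; in this toy example no derivatives exist, so it simply reports that nothing was found.
# Look for preprocessed outputs under derivatives/ (none exist in this example)
deriv_root = os.path.join(bids_root, 'derivatives')
if os.path.isdir(deriv_root):
    for root, dirs, files in os.walk(deriv_root):
        for name in files:
            if '_preprocessed' in name or name.endswith('_report.html'):
                print(f" - {os.path.join(root, name)}")
else:
    print("No derivatives directory yet - run the preprocessing pipeline first.")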
BIDS Best Practices#
Key recommendations for BIDS-compliant preprocessing.
print("=" * 70)
print("BIDS BEST PRACTICES")
print("=" * 70)
best_practices = """
1. Data Organization
✓ Follow BIDS naming conventions strictly
✓ Use consistent directory structure
✓ Include all required metadata files
2. Metadata Management
✓ Complete JSON sidecars with recording parameters
✓ Document all preprocessing steps
✓ Include participant information in participants.tsv
3. Preprocessing Documentation
✓ Record all preprocessing parameters
✓ Save preprocessing reports
✓ Document which channels were interpolated
✓ Document which components were rejected
4. Derivatives
✓ Store in derivatives/ directory
✓ Include version information
✓ Preserve original data
✓ Document preprocessing pipeline
5. Reproducibility
✓ Use fixed random seeds
✓ Document software versions
✓ Include parameter files
✓ Enable full audit trail
6. Sharing and Validation
✓ Validate BIDS compliance with bids-validator
✓ Include README with dataset description
✓ Document ethical approvals
✓ Include data sharing agreements
"""
print(best_practices)
======================================================================
BIDS BEST PRACTICES
======================================================================
1. Data Organization
✓ Follow BIDS naming conventions strictly
✓ Use consistent directory structure
✓ Include all required metadata files
2. Metadata Management
✓ Complete JSON sidecars with recording parameters
✓ Document all preprocessing steps
✓ Include participant information in participants.tsv
3. Preprocessing Documentation
✓ Record all preprocessing parameters
✓ Save preprocessing reports
✓ Document which channels were interpolated
✓ Document which components were rejected
4. Derivatives
✓ Store in derivatives/ directory
✓ Include version information
✓ Preserve original data
✓ Document preprocessing pipeline
5. Reproducibility
✓ Use fixed random seeds
✓ Document software versions
✓ Include parameter files
✓ Enable full audit trail
6. Sharing and Validation
✓ Validate BIDS compliance with bids-validator
✓ Include README with dataset description
✓ Document ethical approvals
✓ Include data sharing agreements
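The validation recommendation above is best handled by the official bids-validator tool. As a lightweight stand-in (not a substitute for the full validator), the check below only confirms that the top-level metadata files created earlier are present.
# Minimal presence check for top-level BIDS metadata files; for real datasets
# run the official validator, e.g. `bids-validator <dataset>` on the command line.
for name in ['dataset_description.json', 'participants.tsv']:
    status = "found" if os.path.exists(os.path.join(bids_root, name)) else "missing"
    print(f" {name:<30} {status}")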
Summary#
print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
summary = """
Key Points About BIDS Preprocessing with eegprep:
1. BIDS provides standardized data organization
- Facilitates data sharing and collaboration
- Enables automated processing pipelines
- Improves reproducibility
2. pop_load_frombids loads BIDS-formatted EEG data
- Automatically extracts metadata
- Handles multiple subjects and sessions
- Validates BIDS compliance
3. bids_preproc applies complete preprocessing pipeline
- Artifact removal and channel interpolation
- ICA decomposition and component classification
- Automatic component rejection
4. Derivatives are saved in BIDS-compatible format
- Separate derivatives/ directory
- Preserves original data
- Includes preprocessing metadata
5. Preprocessing parameters are configurable
- Adapt to your specific needs
- Document all parameter choices
- Enable reproducible analysis
6. All metadata is preserved in JSON sidecars
- Recording parameters
- Preprocessing steps
- Quality metrics
"""
print(summary)
print("=" * 70)
# Clean up temporary directory
import shutil
shutil.rmtree(bids_root)
print(f"\nCleaned up temporary BIDS directory")
======================================================================
SUMMARY
======================================================================
Key Points About BIDS Preprocessing with eegprep:
1. BIDS provides standardized data organization
- Facilitates data sharing and collaboration
- Enables automated processing pipelines
- Improves reproducibility
2. pop_load_frombids loads BIDS-formatted EEG data
- Automatically extracts metadata
- Handles multiple subjects and sessions
- Validates BIDS compliance
3. bids_preproc applies complete preprocessing pipeline
- Artifact removal and channel interpolation
- ICA decomposition and component classification
- Automatic component rejection
4. Derivatives are saved in BIDS-compatible format
- Separate derivatives/ directory
- Preserves original data
- Includes preprocessing metadata
5. Preprocessing parameters are configurable
- Adapt to your specific needs
- Document all parameter choices
- Enable reproducible analysis
6. All metadata is preserved in JSON sidecars
- Recording parameters
- Preprocessing steps
- Quality metrics
======================================================================
Cleaned up temporary BIDS directory
Key Takeaways#
This example demonstrates:
BIDS Structure: Understanding standardized data organization
Data Discovery: Finding and listing BIDS-formatted files
Preprocessing Pipeline: Applying complete preprocessing workflow
Metadata Management: Handling recording parameters and metadata
Reproducibility: Ensuring consistent, documented processing
For real BIDS datasets:
Validate with bids-validator before processing
Use actual EEG file formats (EDF, BDF, etc.)
Include complete participant information
Document all preprocessing decisions
Share derivatives with original data
Total running time of the script: (0 minutes 0.335 seconds)