Neurofibromatosis type 1 (NF1) 3D organoid image-based profiling pipeline

Patients living with neurofibromatosis type 1 (NF1) often develop neurofibromas (nfs), which are complex and benign tumors that can transform into malignant disease. However, there are only two FDA-approved therapies for NF1-associated inoperable plexiform neurofibromas (pnfs): mirdametinib and selumetinib.thus, we urgently need more therapeutic options for neurofibromas.

To address this, we have developed a 3D patient-derived tumor organoid model of NF1. We developed a modified 3D cell painting protocol to generate high-content imaging data from these organoids. This repository contains the code and documentation for a comprehensive analysis pipeline to process and analyze these 3D organoid models of NF1.

This pipeline was developed specifically for the NF1 3D organoid dataset, but the modular design allows for adaptation to other 3D microscopy datasets.

Example data

Raw channels

405 Channel	488 channel	555 channel	640 channel

Organoid, nuclei, cell, and cytoplasm segmentations

Organoid	nuclei	cell	cytoplasm

Pipeline architecture

The pipeline follows a hierarchical processing structure:

Execution strategy:

SLURM-based HPC scheduling for parallel processing
Conditional execution based on file existence
Automatic job submission throttling (max 990 concurrent jobs) We present a full workflow to profile 3-dimensional images of organoids. Our end-to-end system processes raw 3D microscopy data through illumination correction, segmentation, feature extraction, quality control, and image-based profiling.

        flowchart TD
   A[Raw microscopy images] --> B[Stage 0: data preprocessing]

   B --> C1[Z-stack creation]
   B --> C2["(optional) deconvolution"]

   C1 --> D[Deconvolved z-stack images]
   C2 --> D

   D --> E[Stage 1: image quality control]

   E --> F1[Blur detection]
   E --> F2[Saturation detection]

   F1 --> G[QC flags & reports]
   F2 --> G

   G --> H[Stage 2: Image segmentation]

   H --> I1[Nuclei segmentation]
   H --> I2[Cell segmentation]
   H --> I3[Organoid segmentation]

   I1 --> J[Segmentation refinement]
   I2 --> J
   I3 --> J

   J --> K[3D segmentation masks]

   K --> L[Stage 3: feature extraction]

   L --> M1[Area + size]
   L --> M2[Intensity]
   L --> M3[Texture]
   L --> M4[Colocalization]
   L --> M5[Neighbors]
   L --> M6[Deep learning features]

   M1 --> N[feature matrices]
   M2 --> N
   M3 --> N
   M4 --> N
   M5 --> N
   M6 --> N

   N --> O[Stage 4: image-based profiling]

   style B fill:#40BA40,stroke:#000,stroke-width:2px
   style E fill:#40BA40,stroke:#000,stroke-width:2px
   style H fill:#40BA40,stroke:#000,stroke-width:2px
   style L fill:#40BA40,stroke:#000,stroke-width:2px
   style O fill:#40BA40,stroke:#000,stroke-width:2px

Stage 0: data preprocessing

Directory: 0.preprocessing_data/

Purpose: Transform raw microscopy data into standardized 3D z-stack images ready for analysis.

Inputs:

Raw 2D TIFF images from microscope (5 channels × N z-slices × M wells)
Metadata files (experiment design, plate layouts)

Outputs:

3D z-stack TIFF files organized by patient/well/FOV (optionally deconvolved)
File structure: data/{patient}/zstack_images/{well_fov}/{channel}.tif

Data preprocessing steps

Patient-specific preprocessing
- We optimize formatting of raw images we receive fresh from our collaborator’s microscope.
- We organize raw image files by patient ID and create patient-specific directory structures.
- We validate file naming conventions.
Updating file structure
- We standardize the directory hierarchy across patients.
- We rename files to a consistent naming scheme (to account for cross-batch inconsistencies).
Creating z-stacks
- We combine 2D image slices into 3D z-stacks, built to accommodate different microscope formats (e.g., CQ1, echo)
- This process maintain metadata and propagates channel information.
- We track z-spacing between images (typically 1μm) as well as z-stack depth (between 50-100 slices [~50-100μm total])
Detecting image corruption
- We validate TIFF file integrity removing corrupted images, incomplete z-stacks, and other errors introduced during image acquisition.
- We flag problematic datasets by added QC flags to be used in downstream filtering (if necessary).
Preprocessing for deconvolution (optional)
- We prepare images for huygens deconvolution (generating parameter files and organizing batch processing structure)
Post-processing for deconvolution (optional)
- We import deconvolved images, verify output quality, and update file paths and metadata for downstream processing

Execution:

# Example for a specific patient (NF0014)
cd 0.preprocessing_data
python scripts/1.make_zstack_and_copy_over.py --patient NF0014_T1

# Process for the CQ1 microscope
python scripts/1z.make_zstack_and_copy_over_CQ1.py --patient NF0014_T1

Stage 1: image quality control

Directory: 1.image_quality_control/

Inputs:

z-stack images from stage 0 (deconvolution optional)

Outputs:

QC flags file: data/{patient}/qc_flags.csv
QC reports: HTML/PDF summaries with plots
Flagged well list for exclusion from downstream analysis

Purpose: Assess image quality and flag problematic well fovs before segmentation.

Image QC steps

Cellprofiler QC pipeline
- Extract whole-image QC metrics using cellprofiler.
- Export QC metrics to CSV.
  - These will be used in all downstream QC filtering steps.
Blur evaluation
- Calculate laplacian variance for focus detection.
- Identify out-of-focus z-slices.
- Set thresholds for acceptable sharpness.
Saturation analysis
- Detect overexposed pixels per channel.
- Calculate percentage of saturated voxels.
- Flag wells with excessive saturation (>5%).
QC report generation
- Create visualizations with ggplot2 (R).
- Generate per-plate and per-patient summaries.
- Produce pass/fail flags for each well FOV.

Quality metrics:

Blur: laplacian variance, focus score
Saturation: percentage of pixels clipped at the maximum value
Signal-to-noise: mean signal / background standard deviation
Illumination: uniformity of pixel intensity across each FOV

Execution:

cd 1.image_quality_control
jupyter nbconvert --to notebook --execute notebooks/*.ipynb

Stage 2: image segmentation

Directory: 2.segment_images/

Purpose: Generate 3D masks for nuclei, cells, cytoplasm, and whole organoids. We refer to each mask category as a “compartment”.

Image segmentation steps

Nuclei segmentation
- Apply cellpose 4.0 on the DNA channel (405 nm).
- Uses a custom function for 2.5D segmentation to handle z-stacks with varying slice counts.
Organoid segmentation
- Use cellpose 3.x using a custom size invariant search algorithm.
Cell segmentation
- Segment individual cells using F-actin and AGP channel (555 nm).
- Expand from nuclear seeds.
- Use 3D watershed for cell boundary detection.
Cytoplasm derivation
- Subtract nuclear masks from cell masks
- Generate cytoplasmic compartment masks
Mask refinement
- Stitch 2D masks into 3D volumes
- Match objects to retain the same ids across z-slices.
- Assign nuclei to parent cells.
- Assign cells to parent organoids.

Execution:

cd 2.segment_images
sbatch grand_parent_segmentation.sh

Nuclei segmentation workflow diagram

Segmentation workflow diagram

Cell segmentation workflow diagram

        flowchart TD
A[raw image] --> B[thresholding]
A --> C[butterworth low-pass filter]


B --> D1[globular & cluster]
B --> D2[small & dissociated]
B --> D3[elongated]

D1 --> E1[gaussian smoothing with sigma=2.5]
D2 --> E2[gaussian smoothing with sigma=3.0]
D3 --> E3[gaussian smoothing with sigma=4.0]

E1 --> F1[otsu's thresholding]
E2 --> F1
E3 --> F1

F1 --> I1[binary mask of organoid]
I1 --> J1[no dilation of the mask]
I1 --> J2[dilation ball radius=1]
I1 --> J3[dilation ball radius=10]
J1 --> H1[3D seeded watershed segmentation with nuclei masks as seeds]
J2 --> H1
J3 --> H1



C --> Z1[globular & cluster]
C --> Z2[small & dissociated]
C --> Z3[elongated]
Z1 --> Y1[no gaussian smoothing]
Z2 --> Y2[gaussian smoothing with sigma=1.0]
Z3 --> Y3[gaussian smoothing with sigma=1.0]

Y1 --> X1[sobel filter]
Y2 --> X1
Y3 --> X1

X1 --> W1[connectivity=1; compactness=1]
X1 --> W2[connectivity=1; compactness=0]
X1 --> W3[connectivity=0; compactness=0]

W1 --> H1
W2 --> H1
W3 --> H1

Organoid segmentation

Organoid segmentation was performed by using CellPose SAM on guassian smoothed (singma=10.0) AGP channel images with a size constraing of 200 pixel diameter.

Stage 3: feature extraction

Directory: 3.cellprofiling/

Purpose: Extract all morphology (hand-drawn) features (e.g., shape, Intensity, texture, etc.) from segmented objects.

Feature extraction steps

To maximize parallelization and processing speed, our featurization strategy follows a three-level hierarchical job submission structure.

Level 1: all fovs for a patient, per well (grandparent process) (Run_featurization_grandparent.sh)

Submits parent jobs for each FOV (level 2).

Level 2: all feature categories for each FOV (parent process) (Run_featurization_parent.sh)
- Loops through all combinations of feature types × compartments × channels.
- Submits child jobs for each feature combination (level 3).
Level 3: compute specific feature categories (child process) (Individual feature extraction scripts)
- Calculates specific features based on the hierarchical combination specified in levels 1 and 2.
- Saves individual feature calculation outputs as a parquet file within a folder to be combined later.

Feature types: For more details on feature types and extraction methods, refer to the Features/ documentation.

Feature extraction strategy

We extract hand-drawn features across multiple categories (e.g., shape,intensity, texture) for each compartment and channel combination. In addition, we extract deep learning-based features using the sammmed3d model to capture complex morphological phenotypes that may not be described by hand-crafted features. Sammed3d features are extracted as 384-dimensional embeddings per channel per object using the CLS token output from the vit encoder on the whole volume. Additionally, we take a nucleocentric feature extraction approach where we extract features from a cropped volumes centered around each nucleus. We use sammed3d to extract features from these nucleocentric volumes, and we z-maximally project the nucleocentric volumes to extract 2D features using CHAMMI-75 features. CHAMMI-75 features are extracted as 384-dimensional embeddings per channel per object using the CLS token output from a separate vit encoder.

Feature extraction categories

Feature type	description
Area + size	object size, area, perimeter, etc.
Colocalization	overlap of signals between channels (e.g., pearson correlation)
Intensity	mean, median, max, min, etc. intensity values per object
Granularity	“granularity of pixel intensities”
Neighbors	number of neighboring objects, distance to neighbors
Texture	haralick features, gabor filters, etc.
Deep learning features	sammmed3d vit-based embeddings
CHAMMI-75 features	vit-based embeddings from nucleocentric 2D projections

Stage 4: image-based profiling

Directory: 4.processing_image_based_profiles/

Purpose: Merge, normalize, and aggregate features across wells and patients, preparing data for downstream analyses.

Inputs:

Feature parquet files from stage 3
Metadata: plate maps, treatment info, QC flags, etc.

Outputs:

Profiles are generated at multiple levels with profiles being generated for each profile type:

Single-cell hand-drawn features one row per nucleus/cell/cytoplasm
Single-cell sammed3d features one row per nucleus/cell/cytoplasm
Organoid hand-drawn features one row per organoid
Organoid sammed3d features one row per organoid
Nucleocentric sammed3d features one row per nucleus-centered volume
Nucleocentric CHAMMI-75 features one row per nucleus-centered volume

Profile types:

Object-level standard scalar normalized profiles (one row per object with all features)
Object-level feature-selected profiles (one row per object with selected features)
Well-level aggregated profiles (one row per well with aggregated Features)
Consensus profiles aggregated profiles with selected features (one row per treatment with selected features)

With the combination of 6 profile types and 4 profile levels, we generate a total of 24 different profile outputs. Each profile output is saved as a parquet file in the data/{patient}/image_based_profiles/.

Profile file information flow diagram

Feature merging
- Combine all feature csvs per well FOV
- Use DuckDB for sqlite → parquet conversion
- Create single-cell (sc) and organoid-level profiles
Annotation
- Add treatment metadata from plate maps
- Link drug names, targets, concentrations
- Add patient genotype information
QC filtering
- Apply image QC flags from stage 1
- Remove outlier objects (z-score > 3)
- Filter low-quality wells
Normalization
- Z-score normalization per plate
- Applied to the entire plate based on the control wells (DMSO treated wells)
- Standardize features: (x - μ) / σ
- Handle batch effects
Feature selection
- Remove low-variance features
- Remove correlated features (correlation > 0.9)
- Remove blocklisted features
- Remove features based on frequency cutoff categorical features
Aggregation
- Calculate well-level statistics (mean, median, std)
- Generate organoid-parent aggregations
- Compute patient-level summaries
Form consensus profiles
- Merge sc and organoid aggregations
- Export final consensus profiles

        flowchart TD
    A1[cellpainting images and segmentations]


    A1 -->|featurization| B[nuclei features ]
    A1 -->|featurization| C[cell features ]
    A1 -->|featurization| D[cytoplasm features ]
    A1 -->|featurization| E[organoid features ]
    A1 -->|featurization| F[nucleocentric features ]


    B --> |merging| G[single-cell features ]
    C --> |merging| G[single-cell features ]
    D --> |merging| G[single-cell features ]
    G --> |annotation| G1[single-cell features ]

    E --> |annotation| H[organoid features ]
    F --> |annotation| I[nucleocentric features ]
    G1 --> J[single-cell hand-drawn features ]
    G1 --> K[single-cell deep learning features ]
    H --> L[organoid hand-drawn features]
    H --> M[organoid deep learning features]
    I --> N[nucleocentric volumetric features]
    I --> O[nucleocentric flat features]
    J --> |QC| P1[QC profiles]
    K --> |QC| P2[QC profiles]
    L --> |QC| P3[QC profiles]
    M --> |QC| P4[QC profiles]
    N --> |QC| P5[QC profiles]
    O --> |QC| P6[QC profiles]
    P1 --> |normalization| S1[normalized profiles]
    P2 --> |normalization| S2[normalized profiles]
    P3 --> |normalization| S3[normalized profiles]
    P4 --> |normalization| S4[normalized profiles]
    P5 --> |normalization| S5[normalized profiles]
    P6 --> |normalization| S6[normalized profiles]
    S1 --> |feature selection| T1[selected features]
    S2 --> |feature selection| T2[selected features]
    S3 --> |feature selection| T3[selected features]
    S4 --> |feature selection| T4[selected features]
    S5 --> |feature selection| T5[selected features]
    S6 --> |feature selection| T6[selected features]
    T1 --> U1[aggregated profiles]
    T2 --> U2[aggregated profiles]
    T3 --> U3[aggregated profiles]
    T4 --> U4[aggregated profiles]
    T5 --> U5[aggregated profiles]
    T6 --> U6[aggregated profiles]
    T1 --> V1[consensus profiles]
    T2 --> V2[consensus profiles]
    T3 --> V3[consensus profiles]
    T4 --> V4[consensus profiles]
    T5 --> V5[consensus profiles]
    T6 --> V6[consensus profiles]

Separately from the above workflow, we also generate cross-patient consensus profiles by merging profiles across patients and applying global feature selection to identify a common set of features that are robust across patients. Create all analysis ready output files

        flowchart LR
    A[all patient normalized profiles] --> B[combined profiles]
    B --> C[global feature selection]
    C --> D[cross-patient aggregated profiles]
    D --> E[cross-patient consensus profiles]

Pycytominer feature selection parameters:

Correlation threshold: 0.9
Variance threshold: 0.01
NA cutoff: 5%
Frequency cut: 0.1
Unique cut: 0.1

Execution:

cd 4.processing_image_based_profiles
sbatch merge_features_grand_parent.sh

Data organization

Directory structure

The pipeline expects data organized in this hierarchy:

NF1_3D_organoid_profiling_pipeline/
├── data/
│   ├── patient_IDs.txt
│   ├── NF0014_T1/
│   │   ├── zstack_images/
│   │   │   ├── C4-2/
│   │   │   │   ├── 405.Tif        # DNA channel
│   │   │   │   ├── 488.Tif        # ER channel
│   │   │   │   ├── 555.Tif        # golgi channel
│   │   │   │   ├── 568.Tif        # F-actin channel
│   │   │   │   └── 640.Tif        # mito channel
│   │   │   └── ... (Other well fovs)
│   │   ├── segmentation_masks/
│   │   │   ├── C4-2/
│   │   │   │   ├── Organoid_mask.tif
│   │   │   │   ├── Nuclei_mask.tif
│   │   │   │   ├── Cell_mask.tif
│   │   │   │   └── Cytoplasm_derived.tif
│   │   │   └── ... (Other well fovs)
│   │   ├── extracted_features/
│   │   │   ├── C4-2/
│   │   │   │   ├── AreaSizeShape_Nuclei_DNA_CPU.parquet
│   │   │   │   ├── Intensity_Cell_488_GPU.parquet
│   │   │   │   ├── Texture_Cytoplasm_640_CPU.parquet
│   │   │   │   └── ... (125-189 Files)
│   │   │   └── ...
│   │   ├── image_based_profiles/
│   │   │   ├── 0.converted_profiles/
│   │   │   │   ├── C4-2/
│   │   │   │   │   ├── sc_related.parquet
│   │   │   │   │   └── organoid_related.parquet
│   │   │   ├── 1.combined_profiles/
│   │   │   │   ├── sc.parquet
│   │   │   │   └── organoid.parquet
│   │   │   ├── 2.annotated_profiles/
│   │   │   ├── 3.normalized_profiles/
│   │   │   ├── 4.feature_selected_profiles/
│   │   │   └── 5.aggregated_profiles/
│   │   └── qc_flags.parquet
│   ├── NF0016_T1/
│   │   └── ... (Same structure)
│   └── all_patient_profiles/
│       ├── sc_consensus.parquet
│       ├── organoid_consensus.parquet
│       ├── well_aggregated.parquet
│       └── patient_aggregated.parquet
├── models/
│   └── sam-med3d-turbo.pth
├── environments/
│   ├── GFF_preprocessing.yml
│   ├── GFF_segmentation.yml
│   └── ... (Conda environments)
└── ... (Code directories 0-6)

File naming conventions

Z-stack images:

Format: {channel}.tif where channel ∈ {405, 488, 555, 640}
Dimensions: (Z, Y, X)
Data type: uint16

Segmentation masks:

Format: {compartment}_mask.tif
Compartments: {organoid, nuclei, cell, cytoplasm}
Label encoding: integer object ids (0=background, 1-N=objects)

Feature files:

Format: {feature}_{compartment}_{channel}_{processor}_features.parquet
Example: Intensity_Nuclei_405_GPU_features.parquet

Profile files:

Format: parquet (compressed columnar storage)
Naming: {level}_{aggregation}.parquet
Example: sc_consensus.parquet

Channel information

The pipeline processes four fluorescent imaging channels: Note that while these are the channels we have used for our NF1 3D organoid dataset, the pipeline is designed to be flexible and adaptable to other channel configurations as needed. the channel information is stored in the metadata and used throughout the pipeline to ensure correct processing and feature extraction for each channel.

Name	fluorophore	ex(nm)	em(nm)	dichroic	target	organelle
405	Hoechst 33342	361	486	405	DNA	nucleus
488	Cona alexa fluor 488	495	519	488	ER	ER
555	WGA alexa fluor 555	555	580	555	membranes	golgi/plasma memb
640	Mitotracker deep red	644	665	640	mitochondria	mitochondria

Imaging parameters:

Objective: 60x/1.35 NA oil immersion
Oil RI: 1.518
Voxel size: 0.108 μm (XY) × 1 μm (Z)
Bit depth: 16-bit
Dynamic range: 0-65535

Computational specifications

Hardware

Local

CPU: 24 cores @ 2.5 ghz
RAM: 128 GB
Storage: 20 TB free space
GPU: NVIDIA geforce 3090Ti with 24 GB VRAM for acceleration

HPC (SLURM)

Nodes: 100s of CPU compute nodes
Partition: amilan (CPU), aa100 (GPU)
QOS: normal (24h), long (7 days)
Max concurrent jobs: 990 per user

Environment setup

We recommend using uv, mamba or conda to create the required environments. we have written a makefile to help with conda environment creation and management.

cd environments || exit
make --always-make
cd .. || exit

For uv users, you can also create the environments with:

source uv_setup.sh

Python utilities (monorepo layout)

The utilities under utils/src/ are now structured as installable packages. For local development, install them in editable mode:

cd utils
pip install -e .

Note that the utilites should be imported into compute environments. see the Environments module for installing the utils. there is a makefile in the Environments module that installs the environments with utils.

System requirements

Linux-based OS
HPC/SLURM environment recommended for large-scale runs
At least multiple tbs of storage for raw and processed images
Sufficient RAM and CPU/GPU resources depending on dataset size
- We recommend at least 128GB RAM and multiple CPU cores for image Processing steps
- Though we have been able to get RAM usage under 8GB per well_fov by distributing the compute.
  - Please note that this RAM usage is highly dependent on the number of z-slices, image dimensions, and number of channels.
  - Here we generally have 30-50 z-slices, ~1500x1500 pixel images, and 4 channels. we rarely exceed 100 z-slices. additionally, scaling in z-slices will require more compute time and RAM.
Optional: GPU resources for segmentation and deep learning based feature extraction
- We have found that a NVIDIA 3090 TI (24GB VRAM) is more than enough for our segmentation tasks.
- It is important to note that part of the advantage of using 2.5D segmentation is that it greatly reduces the GPU VRAM requirements compared to full 3D segmentation - especially as z-slice count scales up.

Storage requirements:

Raw images: 250-500 MB/well FOV
Z-stacks: 250-500 MB/well FOV
Masks: 250-500 MB/well FOV
Features: 5-10 MB/well FOV
Profiles: 1-5 MB/well FOV
Total: ~1-2 GB/well FOV

Number of fovs per well varies between 7-25 with typically 60 wells per patient. Per patient well fovs can range from 420 to 1500 depending on the experiment design.

Storage estimates (per patient):

Well fovs	storage (TB)
400	~0.4-0.8
500	~0.5-1.0
1000	~1.0-2.0
1500	~1.5-3.0

Data availability

The raw and processed imaging data are not quite publicly available at this time. We will have data available at some point on the NF data portal via Synapse.

Associated repositories

For more information on the NF1 organoid profiling project, please see the Following associated repositories:

NF1 3D organoid profiling Pipeline
- This repository (code and documentation for the pipeline)
NF1 2D organoid profiling Pipeline
- Pipeline for 2D organoid profiling
NF1 organoid profile Analysis
- Downstream analysis of the generated profiles

Citation

If you use this pipeline in your research, please cite it using the metadata in CITATION.cff.