COSIE.data_preprocessing.load_data

load_data(data_dict, n_comps=50, hvg_num=3000, target_sum=None, use_harmony=True, metacell=False)

Process input spatial multi-modal data, returning processed feature matrices and spatial coordinates.

Shared modalities that appear in multiple sections are concatenated and jointly processed. Unique modalities (only present in one section) are processed independently. Each section’s feature matrix is stored as a PyTorch tensor for downstream modeling. Spatial coordinates are checked for consistency across modalities; if inconsistent within a section, an error is raised.
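The shared/unique distinction can be illustrated with a minimal sketch. It is not COSIE's implementation: `split_modalities` is a hypothetical helper, and plain strings stand in for AnnData objects. It assumes only the `data_dict` layout documented under Parameters (one list entry per section, with None marking a missing modality).

```python
def split_modalities(data_dict):
    """Split modality names into shared (2+ sections) and unique (1 section).

    Hypothetical helper for illustration; not part of the COSIE API.
    """
    shared, unique = [], []
    for modality, sections in data_dict.items():
        # Count the sections in which this modality is actually present.
        n_present = sum(entry is not None for entry in sections)
        if n_present >= 2:
            shared.append(modality)
        elif n_present == 1:
            unique.append(modality)
    return shared, unique


# Three sections: RNA is measured in all of them, Protein only in the first.
example = {
    "RNA": ["rna_s1", "rna_s2", "rna_s3"],   # stand-ins for AnnData objects
    "Protein": ["protein_s1", None, None],   # None = modality missing
}
shared, unique = split_modalities(example)
print(shared)  # ['RNA']
print(unique)  # ['Protein']
```

Under this reading, 'RNA' would be concatenated across sections and processed jointly, while 'Protein' would be processed on its own.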

Parameters

data_dict : dict

A dictionary mapping each modality name (e.g., 'RNA', 'Protein') to a list of AnnData objects, one per tissue section. Each AnnData should contain .X, .obs, .var, and .obsm['spatial']. If a modality is missing from a section, use None as a placeholder in the list.

n_comps : int, optional

Number of PCA components to compute. Default is 50.

hvg_num : int, optional

Number of highly variable features to select. If the feature dimension is smaller than hvg_num, HVG selection is skipped. Default is 3000.

target_sum : float or None, optional

Target sum for total-count normalization (passed to scanpy.pp.normalize_total). If None, the Scanpy default is used: each cell is scaled to the median of the per-cell total counts before normalization. Default is None.

use_harmony : bool, optional

Whether to perform Harmony integration across sections for shared modalities. If False, only joint PCA is applied. Default is True.

metacell : bool, optional

Whether to merge each 2×2 spatial grid of cells into a “metacell” to reduce memory usage and improve speed. Applies to all modalities. Default is False.
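One plausible reading of the metacell option is sketched below. This is an illustration under stated assumptions, not COSIE's actual procedure: `merge_metacells` is a hypothetical helper, cells are assumed to sit on an integer grid, and features and coordinates are averaged within each 2×2 tile (the library may aggregate differently, for example by summing counts).

```python
from collections import defaultdict


def merge_metacells(coords, features, block=2):
    """Average cells that fall in the same block x block grid tile.

    Hypothetical helper for illustration; assumes integer grid coordinates
    and mean aggregation, neither of which is confirmed by the documentation.
    """
    # Group cell indices by the tile their (x, y) position falls into.
    groups = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        groups[(x // block, y // block)].append(i)

    n_feat = len(features[0])
    meta_coords, meta_features = [], []
    for tile in sorted(groups):
        idx = groups[tile]
        n = len(idx)
        # Metacell position: mean of the member cells' coordinates.
        meta_coords.append((
            sum(coords[i][0] for i in idx) / n,
            sum(coords[i][1] for i in idx) / n,
        ))
        # Metacell features: mean of the member cells' feature vectors.
        meta_features.append(
            [sum(features[i][j] for i in idx) / n for j in range(n_feat)]
        )
    return meta_coords, meta_features


# A 4 x 2 grid of cells with one feature each collapses to two metacells.
coords = [(x, y) for x in range(4) for y in range(2)]
features = [[float(i)] for i in range(len(coords))]
meta_coords, meta_features = merge_metacells(coords, features)
print(meta_coords)    # [(0.5, 0.5), (2.5, 0.5)]
print(meta_features)  # [[1.5], [5.5]]
```

Each full tile contributes four cells, so on a complete grid the number of observations drops by roughly a factor of four, which is where the memory and speed savings come from.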

Returns

feature_dict : dict

A dictionary mapping each section name (e.g., 's1', 's2') to a sub-dictionary of processed feature tensors for each modality. Each feature is a torch.FloatTensor of shape (n_cells, n_comps).

spatial_loc_dict : dict

A dictionary mapping each section name to a 2D NumPy array of spatial coordinates, extracted from .obsm['spatial']. Shape is (n_cells, 2).

data_dict : dict

The updated input dictionary. Each AnnData object is modified to include reduced features (e.g., PCA or Harmony output) in .obsm.
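The per-section coordinate check mentioned in the description, and the shape of spatial_loc_dict, can be sketched as follows. This is a minimal illustration rather than the library's code: `collect_spatial` is a hypothetical helper, coordinates are plain lists of (x, y) tuples instead of NumPy arrays, exact equality is assumed as the consistency criterion, and the 's1', 's2' naming follows the examples given for feature_dict.

```python
def collect_spatial(coords_by_modality, n_sections):
    """Return one coordinate list per section, checking modalities agree.

    Hypothetical helper for illustration; not part of the COSIE API.
    coords_by_modality maps modality name -> list with one entry per
    section (a list of (x, y) tuples, or None if the modality is missing).
    """
    spatial_loc = {}
    for s in range(n_sections):
        name = f"s{s + 1}"
        reference = None
        for modality, per_section in coords_by_modality.items():
            coords = per_section[s]
            if coords is None:
                continue  # modality absent from this section
            if reference is None:
                reference = coords  # first modality present sets the reference
            elif coords != reference:
                raise ValueError(
                    f"Inconsistent spatial coordinates in section {name} "
                    f"(modality {modality!r})"
                )
        spatial_loc[name] = reference
    return spatial_loc


consistent = {
    "RNA": [[(0, 0), (0, 1)], [(5, 5)]],
    "Protein": [[(0, 0), (0, 1)], None],
}
print(collect_spatial(consistent, n_sections=2))
# {'s1': [(0, 0), (0, 1)], 's2': [(5, 5)]}
```

If the 'Protein' coordinates for the first section differed from the 'RNA' ones, this sketch would raise ValueError, mirroring the documented behavior of raising an error on inconsistency within a section.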