COSIE.data_preprocessing.preprocess_adata

preprocess_adata(adata_raw, modality, hvg_num=3000, n_comps=50, target_sum=None)

Preprocess an AnnData object based on the specified modality. The pipeline includes highly variable feature selection, normalization, log-transformation, scaling, and PCA.

This function supports preprocessing of epigenomic, RNA, protein, metabolite, and histology embedding (HE) modalities.

Parameters

adata_raw : AnnData

The raw AnnData object to be processed.

modality : str

The modality type. Must be one of:

  • 'RNA', 'RNA_panel2': RNA count matrix. Different panels within the same RNA modality are supported.

  • 'H3K27me3', 'H3K27ac', 'ATAC', 'H3K4me3': Epigenomic signals. We recommend first converting raw epigenomics data to gene scores before using this function. Gene score generation scripts are available in the spatial-Mux-seq repository.

  • 'Protein': Protein abundance matrix; CLR normalization will be applied. For COMET protein data, we recommend using arcsinh normalization.

  • 'Metabolite': Metabolite expression matrix.

  • 'HE': Histology image embeddings; PCA will be applied directly without normalization.
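The two protein normalizations mentioned above can be illustrated with a minimal, standard-library sketch. It is not COSIE's implementation: the helper names, the use of log1p to tolerate zero counts, centering across the features of a single cell, and the arcsinh cofactor of 5 are all assumptions made for illustration.

```python
import math


def clr_normalize(counts):
    """Centered log-ratio (CLR) transform of one cell's protein counts.

    Assumption: log1p is used so that zero counts are allowed, and the
    centering is done across the features of a single cell.
    """
    logs = [math.log1p(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log value centers the transformed values at zero.
    return [v - mean_log for v in logs]


def arcsinh_normalize(values, cofactor=5.0):
    """arcsinh transform; the cofactor of 5 is an illustrative assumption."""
    return [math.asinh(v / cofactor) for v in values]


# Hypothetical protein counts for a single cell.
cell = [0, 10, 100, 1000]
clr_values = clr_normalize(cell)
asinh_values = arcsinh_normalize(cell)
```

Both transforms compress large values while preserving their order. CLR additionally centers each cell's values, whereas arcsinh leaves zero counts at zero.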

hvg_num : int, optional

Number of highly variable features to select. If None, HVG selection is skipped. Default is 3000.
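Together with the note in the Returns section, the rule for when HVG selection runs can be written as a small predicate. This is a sketch of the documented condition only; `should_select_hvg` is a hypothetical helper, not part of COSIE, and it reads "exceeds" as strictly greater than.

```python
def should_select_hvg(n_features, hvg_num=3000):
    """Return True when HVG selection would run under the documented rule.

    Selection is skipped if hvg_num is None or if the input does not have
    more features than hvg_num.
    """
    return hvg_num is not None and n_features > hvg_num


many_features = should_select_hvg(20000)        # more features than hvg_num
few_features = should_select_hvg(2000)          # fewer features than hvg_num
disabled = should_select_hvg(20000, hvg_num=None)
```

Under this reading, a small targeted panel with fewer than 3000 features passes through without feature selection at the default setting.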

n_comps : int, optional

Number of PCA components to compute. Default is 50.

target_sum : float or None, optional

Target sum for total-count normalization (used in normalize_total). If None, the Scanpy default is used. Default is None.
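To make the effect of target_sum concrete, the following standard-library sketch mimics total-count normalization on a list of per-cell count rows. It is an illustration, not Scanpy's code: `normalize_total_sketch` is a hypothetical name, and the None branch follows Scanpy's documented default of rescaling every cell to the median of the pre-normalization totals (taken here over cells with nonzero counts).

```python
from statistics import median


def normalize_total_sketch(rows, target_sum=None):
    """Scale each row (cell) so that its counts sum to target_sum.

    If target_sum is None, the median total count of the cells with
    nonzero counts is used as the target.
    """
    totals = [sum(row) for row in rows]
    if target_sum is None:
        target_sum = median(t for t in totals if t > 0)
    # Cells with zero total counts are left as all zeros.
    return [
        [x * target_sum / total if total > 0 else 0.0 for x in row]
        for row, total in zip(rows, totals)
    ]


# Three hypothetical cells with total counts 4, 4, and 20.
counts = [[1, 3], [2, 2], [10, 10]]
by_median = normalize_total_sketch(counts)             # target is the median, 4
by_fixed = normalize_total_sketch(counts, target_sum=1e4)
```

A fixed target_sum makes normalized values comparable across datasets, while the median default keeps values on a scale close to the original counts.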

Returns

adata : AnnData

A preprocessed AnnData object with normalized, log-transformed, scaled, and PCA-reduced .X. For the protein modality, CLR normalization is used; for HE, PCA is applied directly without normalization. HVG selection is applied only if hvg_num is not None and the number of input features exceeds hvg_num.
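Because modality must be one of a fixed set of strings, it can be useful to validate it before calling preprocess_adata. The lookup below is a hypothetical helper, not part of COSIE: the family labels are illustrative, and exact, case-sensitive matching is an assumption.

```python
# Modality strings listed under Parameters, grouped by family.
# The family labels on the right are illustrative, not COSIE identifiers.
MODALITY_FAMILIES = {
    "RNA": "rna",
    "RNA_panel2": "rna",
    "H3K27me3": "epigenomic",
    "H3K27ac": "epigenomic",
    "ATAC": "epigenomic",
    "H3K4me3": "epigenomic",
    "Protein": "protein",
    "Metabolite": "metabolite",
    "HE": "histology",
}


def modality_family(modality):
    """Return the family of a modality string, or raise ValueError."""
    try:
        return MODALITY_FAMILIES[modality]
    except KeyError:
        allowed = ", ".join(sorted(MODALITY_FAMILIES))
        raise ValueError(
            f"Unknown modality {modality!r}; expected one of: {allowed}"
        ) from None


family = modality_family("ATAC")
```

A check like this surfaces a misspelled modality (for example 'rna' instead of 'RNA') with a clear message before any preprocessing starts.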