COSIE.data_preprocessing.preprocess_adata
- preprocess_adata(adata_raw, modality, hvg_num=3000, n_comps=50, target_sum=None)
Preprocess an AnnData object based on the specified modality. The pipeline includes highly variable feature selection, normalization, log-transformation, scaling, and PCA.
This function supports preprocessing of epigenomic, RNA, protein, metabolite, and histology embedding (HE) modalities.
Parameters
- adata_raw : AnnData
The raw AnnData object to be processed.
- modality : str
The modality type. Must be one of:
'RNA', 'RNA_panel2': RNA count matrix. Both names belong to the RNA modality and allow different panels to be handled within it.
'H3K27me3', 'H3K27ac', 'ATAC', 'H3K4me3': Epigenomic signals. We recommend first converting raw epigenomics data to gene scores before using this function. Gene score generation scripts are available in the spatial-Mux-seq repository.
'Protein': Protein abundance matrix; CLR (centered log-ratio) normalization is applied. For COMET protein data, we recommend using arcsinh normalization.
'Metabolite': Metabolite expression matrix.
'HE': Histology image embeddings; PCA is applied directly without normalization.
- hvg_num : int, optional
Number of highly variable features to select. If None, HVG selection is skipped. Default is 3000.
- n_comps : int, optional
Number of PCA components to compute. Default is 50.
- target_sum : float or None, optional
Target sum for total-count normalization (used in normalize_total). If None, the Scanpy default is used. Default is None.
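For readers unfamiliar with the CLR normalization mentioned for the 'Protein' modality, the following is a minimal pure-Python sketch of one common centered log-ratio variant, applied to a single cell. The helper name `clr_per_cell` and the use of `log1p` to handle zero counts are assumptions made for illustration; the exact formula COSIE uses is not stated here and may differ.

```python
import math


def clr_per_cell(counts):
    """Centered log-ratio transform of one cell's protein counts.

    Hypothetical helper: log1p keeps zero counts defined. This is one
    common CLR variant, not necessarily the one COSIE implements.
    """
    logs = [math.log1p(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centers the values around zero.
    return [v - mean_log for v in logs]


cell = [0, 9, 99]  # made-up counts for three proteins in one cell
print(clr_per_cell(cell))
```

After the transform, the values for each cell sum to approximately zero, so they describe each protein's abundance relative to the others in that cell instead of its absolute count.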
Returns
- adata : AnnData
A preprocessed AnnData object with normalized, log-transformed, scaled, and PCA-reduced .X. For protein modality, CLR normalization is used. HVG selection is only applied if hvg_num is provided and the number of input features exceeds this threshold.
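The HVG condition described above can be sketched as a small predicate. The function name `should_select_hvg` is hypothetical and not part of COSIE, and reading "exceeds" as a strict greater-than comparison is an interpretation of the wording, not something confirmed against the source code.

```python
def should_select_hvg(n_features, hvg_num=3000):
    """Return True when HVG selection would run under the documented rule.

    Hypothetical helper: selection is skipped when hvg_num is None or when
    the input does not have more features than hvg_num (assumed strict).
    """
    return hvg_num is not None and n_features > hvg_num


print(should_select_hvg(20000))        # many features, default hvg_num
print(should_select_hvg(2000))         # fewer features than hvg_num
print(should_select_hvg(20000, None))  # hvg_num=None skips selection
```

Under this reading, a small targeted panel with fewer features than `hvg_num` passes through with all of its features retained.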