COSIE.downstream_analysis.perform_prediction

perform_prediction(data_dict, final_embeddings, target_section, target_modality, K_num=50, source_sections=None, target_molecules='All', block_size=None, metric='euclidean', accelerate=False, n_trees=10)

Perform KNN-based prediction for a specific modality in a target tissue section.

Predicted values are computed by finding, for each cell in the target section, its K_num nearest neighbors in the embedding space among cells from the source sections, then averaging those neighbors' expression values for the specified molecules. If accelerate=True, PCA is applied to the embeddings first and prediction is performed in the PCA space instead.
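
The averaging step described above can be sketched in a few lines of plain Python. This is only an illustration, not COSIE's implementation: it uses an exact brute-force Euclidean search where the library uses an approximate Annoy index, and the helper name knn_average and the toy data are hypothetical.

```python
import math

def knn_average(target_emb, source_emb, source_values, k):
    """Predict one row of values per target cell by averaging
    the values of its k nearest source cells (hypothetical helper)."""
    n_features = len(source_values[0])
    predictions = []
    for t in target_emb:
        # Exact Euclidean distance from this target cell to every source cell.
        dists = [math.dist(t, s) for s in source_emb]
        # Indices of the k closest source cells.
        nearest = sorted(range(len(source_emb)), key=dists.__getitem__)[:k]
        # Feature-wise mean over the selected neighbors.
        predictions.append(
            [sum(source_values[i][j] for i in nearest) / len(nearest)
             for j in range(n_features)]
        )
    return predictions

# Toy data: three source cells with 2-D embeddings and two measured features.
source_emb = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
source_values = [[1.0, 2.0], [3.0, 4.0], [100.0, 200.0]]
target_emb = [[0.0, 0.4]]

pred = knn_average(target_emb, source_emb, source_values, k=2)
print(pred)  # [[2.0, 3.0]]
```

The distant third source cell does not contribute because only the two closest cells are averaged.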

Parameters

data_dict : dict

A dictionary where each key is a modality (e.g., ‘RNA’, ‘Protein’) and each value is a list of AnnData objects, one per tissue section. If a modality is missing in a section, use None as a placeholder.

final_embeddings : dict

A dictionary mapping section names (e.g., ‘s1’, ‘s2’, …) to 2D NumPy arrays of shape (n_cells, latent_dim), representing cell embeddings for each section.

target_section : str

The name of the section to predict (e.g., ‘s1’).

target_modality : str

The modality to predict (e.g., ‘RNA’, ‘Protein’, ‘Metabolite’).

K_num : int, optional

Number of nearest neighbors used for prediction. Default is 50.

source_sections : list of str or None, optional

A list of section names to serve as the source data. If None, all sections with the target modality will be used. Default is None.
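
As a rough illustration of the default, the sketch below lists the sections that have data for a given modality, skipping the None placeholders. It rests on two assumptions that this page does not confirm: that section names follow list position ('s1' for index 0, 's2' for index 1, and so on) and that COSIE selects sections this way. The helper name sections_with_modality is hypothetical, and plain strings stand in for AnnData objects.

```python
def sections_with_modality(data_dict, modality):
    """Return the names of sections whose entry for `modality` is not None.

    Assumes section i (0-based) is named f"s{i + 1}" (an assumption).
    """
    return [
        f"s{i + 1}"
        for i, adata in enumerate(data_dict[modality])
        if adata is not None
    ]

# Strings stand in for AnnData objects; None marks a missing modality.
data_dict = {
    "RNA": ["rna_s1", "rna_s2", "rna_s3"],
    "Protein": [None, "protein_s2", "protein_s3"],
}

protein_sources = sections_with_modality(data_dict, "Protein")
print(protein_sources)  # ['s2', 's3']
```

Here 's1' lacks protein data, so it would be the natural target section and 's2' and 's3' the candidate sources.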

target_molecules : str or list, optional

Features (genes, proteins, metabolites, etc.) to predict:

  • ‘All’: Predict every feature that is shared by all source sections.

  • list: A specific list of features to predict (e.g., [‘CD4’, ‘CD68’]).

Default is ‘All’.
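
What "shared by all source sections" means can be illustrated with a small set intersection. This is a sketch only: the helper name shared_features and the feature lists are hypothetical, and this page does not specify how COSIE orders the resulting features.

```python
def shared_features(feature_lists):
    """Return features present in every list, in the order of the first list."""
    if not feature_lists:
        return []
    common = set(feature_lists[0]).intersection(*feature_lists[1:])
    return [f for f in feature_lists[0] if f in common]

# Hypothetical feature names measured in two source sections.
s2_features = ["CD4", "CD68", "CD8A", "MS4A1"]
s3_features = ["CD68", "CD4", "PTPRC"]

all_shared = shared_features([s2_features, s3_features])
print(all_shared)  # ['CD4', 'CD68']
```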

block_size : int or None, optional

If set, perform block-wise prediction across features to reduce memory usage. Each block contains up to block_size features. Default is None.
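
Block-wise processing amounts to splitting the feature axis into consecutive index ranges and handling one range at a time, so that only one block of the predicted matrix needs to be held in memory at once. The generator below is a minimal sketch of that partitioning; the name feature_blocks is hypothetical and the library's internal logic may differ.

```python
def feature_blocks(n_features, block_size=None):
    """Yield (start, stop) index pairs that cover range(n_features).

    With block_size=None, a single block spans all features.
    """
    if block_size is None:
        yield (0, n_features)
        return
    if block_size <= 0:
        raise ValueError("block_size must be a positive integer")
    for start in range(0, n_features, block_size):
        # The last block may be shorter than block_size.
        yield (start, min(start + block_size, n_features))

blocks = list(feature_blocks(10, block_size=4))
print(blocks)  # [(0, 4), (4, 8), (8, 10)]
```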

metric : str, optional

Distance metric used in approximate nearest neighbor search (via Annoy). Must be one of: {‘euclidean’, ‘manhattan’, ‘angular’, ‘hamming’, ‘dot’}. Default is ‘euclidean’.

accelerate : bool, optional

Whether to perform joint PCA dimensionality reduction (to 50 dimensions) on all embeddings before prediction. This can accelerate nearest neighbor search and reduce memory usage. Default is False.

n_trees : int, optional

Number of trees used to build the Annoy index. Larger values increase accuracy at the cost of indexing time. Default is 10.

Returns

new_adata : AnnData

A new AnnData object containing the predicted data. Includes:

  • .X: Predicted expression matrix as a dense NumPy array.

  • .obs: Metadata copied from the target section’s reference AnnData.

  • .var: Feature names associated with the predicted modality.

  • .obsm[‘spatial’]: Copied spatial coordinates (if present).