COSIE.downstream_analysis.perform_prediction
- perform_prediction(data_dict, final_embeddings, target_section, target_modality, K_num=50, source_sections=None, target_molecules='All', block_size=None, metric='euclidean', accelerate=False, n_trees=10)
Perform KNN-based prediction for a specific modality in a target tissue section.
Predicted values are computed by finding, for each cell in the target section, its K nearest neighbors in the embedding space among cells from the source sections, then averaging those neighbors' expression values for the specified molecules. If accelerate=True, the embeddings are first reduced with PCA and the neighbor search runs on the PCA embedding instead.
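The averaging step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the COSIE implementation: it uses an exact brute-force neighbor search rather than Annoy, and the helper names `joint_pca` and `knn_average` are hypothetical.

```python
import numpy as np

def joint_pca(embeddings, n_components):
    """Project all sections onto shared principal components (PCA via SVD)."""
    names = list(embeddings)
    stacked = np.vstack([embeddings[n] for n in names])
    mean = stacked.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    components = vt[:n_components].T
    return {n: (embeddings[n] - mean) @ components for n in names}

def knn_average(target_emb, source_emb, source_values, k):
    """Average the values of the k nearest source cells for each target cell."""
    # Pairwise squared Euclidean distances, shape (n_target, n_source)
    diff = target_emb[:, None, :] - source_emb[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Indices of the k smallest distances per target cell (unordered within the k)
    nn_idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    # source_values[nn_idx] has shape (n_target, k, n_features)
    return source_values[nn_idx].mean(axis=1)

# Toy data: 's1' is the target section, 's2' the source section
rng = np.random.default_rng(0)
embeddings = {"s1": rng.normal(size=(5, 8)), "s2": rng.normal(size=(20, 8))}
values_s2 = rng.random((20, 3))  # 20 source cells, 3 features

reduced = joint_pca(embeddings, n_components=4)  # analogous to accelerate=True
pred = knn_average(reduced["s1"], reduced["s2"], values_s2, k=3)
print(pred.shape)  # (5, 3)
```

The brute-force distance matrix costs memory proportional to n_target × n_source × latent_dim, which is one reason an approximate index and an optional PCA step are attractive for large sections.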
Parameters
- data_dict : dict
A dictionary where each key is a modality (e.g., ‘RNA’, ‘Protein’) and each value is a list of AnnData objects, one per tissue section. If a modality is missing in a section, use None as a placeholder.
- final_embeddings : dict
A dictionary mapping section names (e.g., ‘s1’, ‘s2’, …) to 2D NumPy arrays of shape (n_cells, latent_dim), representing cell embeddings for each section.
- target_section : str
The name of the section to predict (e.g., ‘s1’).
- target_modality : str
The modality to predict (e.g., ‘RNA’, ‘Protein’, ‘Metabolite’).
- K_num : int, optional
Number of nearest neighbors used for prediction. Default is 50.
- source_sections : list of str or None, optional
A list of section names to serve as the source data. If None, all sections with the target modality will be used. Default is None.
- target_molecules : str or list, optional
Features (genes, proteins, metabolites, etc.) to predict:
‘All’: Predict every feature shared by all source sections (the intersection of their feature sets).
list: Predict only the listed features (e.g., [‘CD4’, ‘CD68’]).
Default is ‘All’.
- block_size : int or None, optional
If set, perform block-wise prediction across features to reduce memory usage. Each block contains up to block_size features. Default is None.
- metric : str, optional
Distance metric used in approximate nearest neighbor search (via Annoy). Must be one of: {‘euclidean’, ‘manhattan’, ‘angular’, ‘hamming’, ‘dot’}. Default is ‘euclidean’.
- accelerate : bool, optional
Whether to perform joint PCA dimensionality reduction (to 50 dimensions) on all embeddings before prediction. This can accelerate nearest neighbor search and reduce memory usage. Default is False.
- n_trees : int, optional
Number of trees used to build the Annoy index. Larger values increase accuracy at the cost of indexing time. Default is 10.
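To make the feature-related parameters concrete, the sketch below shows one way the shared feature set for target_molecules='All' could be derived and then split into blocks of at most block_size features. The helper names `shared_features` and `feature_blocks` are hypothetical, and preserving the first section's feature order is an assumption; the ordering COSIE actually uses is not stated here.

```python
def shared_features(feature_lists):
    """Features present in every source section, in the first section's order."""
    common = set(feature_lists[0]).intersection(*feature_lists[1:])
    return [f for f in feature_lists[0] if f in common]

def feature_blocks(features, block_size=None):
    """Yield consecutive blocks of at most block_size features."""
    if block_size is None:
        # No blocking requested: process all features at once
        yield list(features)
        return
    for start in range(0, len(features), block_size):
        yield features[start:start + block_size]

# Feature names of two hypothetical source sections
source_features = [
    ["CD4", "CD68", "CD3E", "MS4A1"],
    ["CD68", "CD4", "MS4A1", "EPCAM"],
]

features = shared_features(source_features)
blocks = list(feature_blocks(features, block_size=2))
print(features)  # ['CD4', 'CD68', 'MS4A1']
print(blocks)    # [['CD4', 'CD68'], ['MS4A1']]
```

Processing one block at a time bounds the size of the intermediate neighbor-value array to n_cells × K × block_size instead of n_cells × K × n_features.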
Returns
- new_adata : AnnData
A new AnnData object containing the predicted data. Includes:
.X: Predicted expression matrix as a dense NumPy array.
.obs: Metadata copied from the target section’s reference AnnData.
.var: Feature names associated with the predicted modality.
.obsm[‘spatial’]: Copied spatial coordinates (if present).