Core functions

pycytominer.aggregate module

Aggregate profiles based on given grouping variables.

pycytominer.aggregate.aggregate(population_df: DataFrame, strata: list[str] = ['Metadata_Plate', 'Metadata_Well'], features: list[str] | str = 'infer', operation: str = 'median', output_file: str | None = None, output_type: str | None = 'csv', compute_object_count: bool = False, object_feature: str = 'Metadata_ObjectNumber', subset_data_df: DataFrame | None = None, compression_options: str | dict[str, Any] | None = None, float_format: str | None = None) → DataFrame | str | None

Combine population dataframe variables by strata groups using given operation.

Parameters:
  • population_df (pandas.core.frame.DataFrame) – DataFrame to group and aggregate.

  • strata (list of str, default ["Metadata_Plate", "Metadata_Well"]) – Columns to groupby and aggregate.

  • features (list of str, default "infer") – List of features that should be aggregated.

  • operation (str, default "median") – How the data is aggregated. Currently only supports one of [‘mean’, ‘median’].

  • output_file (str or file handle, optional) – If provided, will write aggregated profiles to file. If not specified, will return the aggregated profiles. We recommend naming the file based on the plate name.

  • output_type (str, optional) – If provided, will write aggregated profiles as the specified file type (either CSV or parquet). If not specified and output_file is provided, the file is written as CSV by default.

  • compute_object_count (bool, default False) – Whether or not to compute object counts.

  • object_feature (str, default "Metadata_ObjectNumber") – Object number feature. Only used if compute_object_count=True.

  • subset_data_df (pandas.core.frame.DataFrame, optional) – How to subset the input.

  • compression_options (str or dict, optional) – Contains compression options as input to pd.DataFrame.to_csv(compression=compression_options). pandas version >= 1.2.

  • float_format (str, optional) – Decimal precision to use when writing the output file, passed as input to pd.DataFrame.to_csv(float_format=float_format). For example, use “%.3g” for 3 significant digits.

Returns:

population_df – DataFrame of aggregated features. If output_file=None, then return the DataFrame. If you specify output_file, then write to file and do not return data.

Return type:

pandas.core.frame.DataFrame, optional
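
As a rough illustration of what operation="median" computes, the sketch below groups single-cell rows by the strata columns and takes the per-feature median within each group. It is written in plain Python so that it runs without pycytominer or pandas; the helper name median_by_strata and the sample rows are hypothetical, and this is not the library's implementation.

```python
from collections import defaultdict
from statistics import median


def median_by_strata(rows, strata, features):
    """Group rows (dicts) by the strata columns and take per-feature medians.

    Hypothetical helper for illustration only; not part of pycytominer.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in strata)
        groups[key].append(row)

    aggregated = []
    for key, members in sorted(groups.items()):
        out = dict(zip(strata, key))
        for feature in features:
            out[feature] = median(m[feature] for m in members)
        aggregated.append(out)
    return aggregated


# Hypothetical single-cell measurements from two wells of one plate.
cells = [
    {"Metadata_Plate": "P1", "Metadata_Well": "A01", "Cells_x": 1.0},
    {"Metadata_Plate": "P1", "Metadata_Well": "A01", "Cells_x": 3.0},
    {"Metadata_Plate": "P1", "Metadata_Well": "A01", "Cells_x": 8.0},
    {"Metadata_Plate": "P1", "Metadata_Well": "B01", "Cells_x": 2.0},
    {"Metadata_Plate": "P1", "Metadata_Well": "B01", "Cells_x": 4.0},
]

# One output row per (plate, well) pair.
profiles = median_by_strata(cells, ["Metadata_Plate", "Metadata_Well"], ["Cells_x"])
```

The result has one row per unique combination of strata values, which is the shape the aggregated profiles are described as having.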

pycytominer.annotate module

Annotates profiles with metadata information

pycytominer.annotate.annotate(profiles, platemap, join_on=['Metadata_well_position', 'Metadata_Well'], output_file=None, output_type='csv', add_metadata_id_to_platemap=True, format_broad_cmap=False, clean_cellprofiler=True, external_metadata=None, external_join_left=None, external_join_right=None, compression_options=None, float_format=None, cmap_args={}, **kwargs)

Add metadata to aggregated profiles.

Parameters:
  • profiles (pandas.core.frame.DataFrame or file) – DataFrame or file path of profiles.

  • platemap (pandas.core.frame.DataFrame or file) – Dataframe or file path of platemap metadata.

  • join_on (list or str, default: ["Metadata_well_position", "Metadata_Well"]) – Which variables to use when merging profiles and platemap. The first element indicates the variable(s) in platemap and the second element indicates the variable(s) in profiles to merge on. Note the setting of add_metadata_id_to_platemap.

  • output_file (str, optional) – If not specified, will return the annotated profiles. We recommend that this output file be suffixed with “_augmented.csv”.

  • output_type (str, optional) – If provided, will write annotated profiles as the specified file type (either CSV or parquet). If not specified and output_file is provided, the file is written as CSV by default.

  • add_metadata_id_to_platemap (bool, default True) – Whether the platemap variables may need “Metadata” prepended.

  • format_broad_cmap (bool, default False) – Whether we need to add columns to make compatible with Broad CMAP naming conventions.

  • clean_cellprofiler (bool, default True) – Clean specific CellProfiler feature names.

  • external_metadata (str, optional) – File with additional metadata information.

  • external_join_left (str, optional) – Merge column in the profile metadata.

  • external_join_right (str, optional) – Merge column in the external metadata.

  • compression_options (str or dict, optional) – Contains compression options as input to pd.DataFrame.to_csv(compression=compression_options). pandas version >= 1.2.

  • float_format (str, optional) – Decimal precision to use when writing the output file, passed as input to pd.DataFrame.to_csv(float_format=float_format). For example, use “%.3g” for 3 significant digits.

  • cmap_args (dict, default {}) – Potential keyword arguments for annotate_cmap(). See cyto_utils/annotate_custom.py for more details.

Returns:

annotated – DataFrame of annotated features. If output_file=None, then return the DataFrame. If you specify output_file, then write to file and do not return data.

Return type:

pandas.core.frame.DataFrame, optional
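
The interaction between join_on and add_metadata_id_to_platemap can be sketched in plain Python. The example assumes that the prefix is only added to platemap columns that lack it, that the merge behaves like an inner join, and that the platemap join column is dropped afterwards; these are assumptions for illustration, and add_metadata_prefix and annotate_rows are hypothetical helpers rather than pycytominer functions.

```python
def add_metadata_prefix(row, prefix="Metadata_"):
    """Prepend the prefix to keys that do not already carry it (assumed rule)."""
    return {(k if k.startswith(prefix) else prefix + k): v for k, v in row.items()}


def annotate_rows(profiles, platemap, join_on):
    """Attach platemap metadata to each profile row.

    join_on[0] names the platemap column, join_on[1] the profiles column.
    """
    platemap_col, profile_col = join_on
    by_well = {row[platemap_col]: row for row in platemap}

    annotated = []
    for profile in profiles:
        meta = by_well.get(profile[profile_col])
        if meta is None:
            continue  # assumption: unmatched profiles are dropped (inner join)
        # Assumption: the platemap join column is redundant and left out.
        merged = {k: v for k, v in meta.items() if k != platemap_col}
        merged.update(profile)
        annotated.append(merged)
    return annotated


# Hypothetical platemap whose columns lack the "Metadata_" prefix.
raw_platemap = [
    {"well_position": "A01", "gene_name": "EMPTY"},
    {"well_position": "A02", "gene_name": "TP53"},
]
platemap = [add_metadata_prefix(row) for row in raw_platemap]

profiles = [
    {"Metadata_Well": "A01", "Cells_x": 0.1},
    {"Metadata_Well": "A02", "Cells_x": 0.7},
]

annotated = annotate_rows(
    profiles, platemap, ["Metadata_well_position", "Metadata_Well"]
)
```

Because the platemap column is renamed before the merge, the first element of join_on refers to the prefixed name ("Metadata_well_position") in this sketch.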

pycytominer.normalize module

Normalize observation features based on specified normalization method

pycytominer.normalize.normalize(profiles, features='infer', image_features=False, meta_features='infer', samples='all', method='standardize', output_file=None, output_type='csv', compression_options=None, float_format=None, mad_robustize_epsilon=1e-18, spherize_center=True, spherize_method='ZCA-cor', spherize_epsilon=1e-06)

Normalize profiling features

Parameters:
  • profiles (pandas.core.frame.DataFrame or path) – Either a pandas DataFrame or a file that stores profile data

  • features (list) – A list of strings corresponding to feature measurement column names in the profiles DataFrame. All features listed must be found in profiles. Defaults to “infer”. If “infer”, then assume features are from CellProfiler output and prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.

  • image_features (bool, default False) – Whether the profiles contain image features.

  • meta_features (list) – A list of strings corresponding to metadata column names in the profiles DataFrame. All features listed must be found in profiles. Defaults to “infer”. If “infer”, then assume CellProfiler metadata features, identified by column names that begin with the Metadata_ prefix.

  • samples (str) – The metadata column values to use as a normalization reference. We often use control samples. The function uses pd.DataFrame.query(), so you should structure samples as a query string. An example is “Metadata_treatment == ‘control’” (include all quotes). Defaults to “all”.

  • method (str) – How to normalize the dataframe. Defaults to “standardize”. Check avail_methods for available normalization methods.

  • output_file (str, optional) – If provided, will write normalized profiles to file. If not specified, will return the normalized profiles as output. We recommend that this output file be suffixed with “_normalized.csv”.

  • output_type (str, optional) – If provided, will write normalized profiles as the specified file type (either CSV or parquet). If not specified and output_file is provided, the file is written as CSV by default.

  • compression_options (str or dict, optional) – Contains compression options as input to pd.DataFrame.to_csv(compression=compression_options). pandas version >= 1.2.

  • float_format (str, optional) – Decimal precision to use when writing the output file, passed as input to pd.DataFrame.to_csv(float_format=float_format). For example, use “%.3g” for 3 significant digits.

  • mad_robustize_epsilon (float, optional) – The mad_robustize fudge factor parameter. The function only uses this variable if method = “mad_robustize”. Set this to 0 if mad_robustize generates features with large values.

  • spherize_center (bool) – If the function should center data before sphering (aka whitening). The function only uses this variable if method = “spherize”. Defaults to True.

  • spherize_method (str) – The sphering (aka whitening) normalization selection. The function only uses this variable if method = “spherize”. Defaults to “ZCA-cor”. See pycytominer.operations.transform() for available spherize methods.

  • spherize_epsilon (float, default 1e-6) – The sphering (aka whitening) fudge factor parameter. The function only uses this variable if method = “spherize”.

Returns:

normalized – The normalized profile DataFrame. If output_file=None, then return the DataFrame. If you specify output_file, then write to file and do not return data.

Return type:

pandas.core.frame.DataFrame, optional

Examples

import pandas as pd
from pycytominer import normalize

data_df = pd.DataFrame(
    {
        "Metadata_plate": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "Metadata_treatment": [
            "drug", "drug", "control", "control",
            "drug", "drug", "control", "control",
        ],
        "x": [1, 2, 8, 2, 5, 5, 5, 1],
        "y": [3, 1, 7, 4, 5, 9, 6, 1],
        "z": [1, 8, 2, 5, 6, 22, 2, 2],
        "zz": [14, 46, 1, 6, 30, 100, 2, 2],
    }
).reset_index(drop=True)

normalized_df = normalize(
    profiles=data_df,
    features=["x", "y", "z", "zz"],
    meta_features=["Metadata_plate", "Metadata_treatment"],
    samples="Metadata_treatment == 'control'",
    method="standardize",
)
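
To make the role of samples more concrete, here is a minimal plain-Python sketch of standardizing one feature against a reference subset: the center and scale are estimated from the reference samples only and then applied to every sample. It assumes a z-score with the population standard deviation; standardize_to_reference and the four-sample feature are hypothetical and do not reproduce pycytominer's internals.

```python
from statistics import fmean, pstdev


def standardize_to_reference(values, is_reference):
    """Z-score all values using the mean and std of the reference subset.

    Hypothetical helper; assumes the population standard deviation.
    """
    reference = [v for v, ref in zip(values, is_reference) if ref]
    center = fmean(reference)
    scale = pstdev(reference)
    return [(v - center) / scale for v in values]


# Hypothetical feature measured on two treated and two control samples.
treatment = ["drug", "drug", "control", "control"]
x = [1, 2, 8, 2]

# The controls (8 and 2) define the center and scale for all four samples.
normalized_x = standardize_to_reference(x, [t == "control" for t in treatment])
```

Here the control values have mean 5 and population standard deviation 3, so the controls map to 1 and -1 and the treated samples are expressed relative to that reference distribution.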

pycytominer.feature_select module

Select features to use in downstream analysis based on specified selection method

pycytominer.feature_select.feature_select(profiles, features='infer', image_features=False, samples='all', operation='variance_threshold', output_file=None, output_type='csv', na_cutoff=0.05, corr_threshold=0.9, corr_method='pearson', freq_cut=0.05, unique_cut=0.01, compression_options=None, float_format=None, blocklist_file=None, outlier_cutoff=500, noise_removal_perturb_groups=None, noise_removal_stdev_cutoff=None)

Performs feature selection based on the given operation.

Parameters:
  • profiles (pandas.core.frame.DataFrame or file) – DataFrame or file of profiles.

  • features (list, default "infer") – A list of strings corresponding to feature measurement column names in the profiles DataFrame. All features listed must be found in profiles. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.

  • image_features (bool, default False) – Whether the profiles contain image features.

  • samples (list or str, default "all") – Samples to provide operation on.

  • operation (list of str or str, default "variance_threshold") – Operations to perform on the input profiles.

  • output_file (str, optional) – If provided, will write feature selected profiles to file. If not specified, will return the feature selected profiles as output. We recommend that this output file be suffixed with “_normalized_variable_selected.csv”.

  • output_type (str, optional) – If provided, will write feature selected profiles as the specified file type (either CSV or parquet). If not specified and output_file is provided, the file is written as CSV by default.

  • na_cutoff (float, default 0.05) – Proportion of missing values in a column to tolerate before removing.

  • corr_threshold (float, default 0.9) – Value in the range (0, 1). Features are excluded if any two features are correlated above this threshold.

  • corr_method (str, default "pearson") – Correlation type to compute. Allowed methods are “spearman”, “kendall” and “pearson”.

  • freq_cut (float, default 0.05) – Ratio (2nd most common feature val / most common). Must range between 0 and 1. Remove features lower than freq_cut. A low freq_cut will remove features that have large difference between the most common feature and second most common feature. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, …])

  • unique_cut (float, default 0.01) – Ratio (num unique features / num samples). Must range between 0 and 1. Remove features less than unique cut. A low unique_cut will remove features that have very few different measurements compared to the number of samples.

  • compression_options (str or dict, optional) – Contains compression options as input to pd.DataFrame.to_csv(compression=compression_options). pandas version >= 1.2.

  • float_format (str, optional) – Decimal precision to use when writing the output file, passed as input to pd.DataFrame.to_csv(float_format=float_format). For example, use “%.3g” for 3 significant digits.

  • blocklist_file (str, optional) – File location of a DataFrame with features to exclude. Note that if “blocklist” is in operation, the standard blocklist is removed.

  • outlier_cutoff (float, default 500) – The threshold at which the maximum or minimum value of a feature across a full experiment is excluded. Note that this procedure is typically applied after normalization.

  • noise_removal_perturb_groups (str or list of str, optional) – Perturbation groups corresponding to rows in profiles, or the name of the metadata column containing this information.

  • noise_removal_stdev_cutoff (float, optional) – Maximum mean feature standard deviation to be kept for noise removal, grouped by the identity of the perturbation from perturb_list. The data must already be normalized so that this cutoff can apply to all columns.

Returns:

selected_df – The feature selected profile DataFrame. If output_file=None, then return the DataFrame. If you specify output_file, then write to file and do not return data.

Return type:

pandas.core.frame.DataFrame, optional
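
The freq_cut and unique_cut descriptions above can be read as two ratios computed per feature. The following plain-Python sketch follows those descriptions literally (second most common count divided by most common count, and number of unique values divided by number of samples). The helper names, the treatment of constant features, and the sample data are assumptions for illustration, not the library's code.

```python
from collections import Counter


def frequency_ratio(values):
    """Count of the 2nd most common value divided by count of the most common."""
    counts = Counter(values).most_common(2)
    if len(counts) < 2:
        return 0.0  # assumption: a constant feature gets ratio 0
    return counts[1][1] / counts[0][1]


def unique_ratio(values):
    """Number of distinct values divided by the number of samples."""
    return len(set(values)) / len(values)


def passes_variance_threshold(values, freq_cut=0.05, unique_cut=0.01):
    """Keep a feature only if neither ratio falls below its cutoff."""
    return frequency_ratio(values) >= freq_cut and unique_ratio(values) >= unique_cut


# Hypothetical near-constant feature: one dominant value, one rare value.
near_constant = [1] * 99 + [0.01]

# Hypothetical feature with a broader spread of values.
varied = [0.1, 0.4, 0.4, 0.9, 1.3]

keep_near_constant = passes_variance_threshold(near_constant)
keep_varied = passes_variance_threshold(varied)
```

With the default cutoffs, the near-constant feature has a frequency ratio of 1/99 (below 0.05) and would be removed, while the varied feature passes both checks.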

pycytominer.consensus module

Acquire consensus signatures for input samples

pycytominer.consensus.consensus(profiles, replicate_columns=['Metadata_Plate', 'Metadata_Well'], operation='median', features='infer', output_file=None, output_type='csv', compression_options=None, float_format=None, modz_args={'method': 'spearman'})

Form level 5 consensus profile data.

Parameters:
  • profiles (pandas.core.frame.DataFrame or file) – DataFrame or file of profiles.

  • replicate_columns (list, defaults to ["Metadata_Plate", "Metadata_Well"]) – Metadata columns indicating which replicates to collapse.

  • operation (str, defaults to "median") – The method used to form consensus profiles.

  • features (list) – A list of strings corresponding to feature measurement column names in the profiles DataFrame. All features listed must be found in profiles. Defaults to “infer”. If “infer”, then assume features are from CellProfiler output and prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.

  • output_file (str, optional) – If provided, will write consensus profiles to file. If not specified, will return the consensus profiles as output.

  • output_type (str, optional) – If provided, will write consensus profiles as the specified file type (either CSV or parquet). If not specified and output_file is provided, the file is written as CSV by default.

  • compression_options (str or dict, optional) – Contains compression options as input to pd.DataFrame.to_csv(compression=compression_options). pandas version >= 1.2.

  • float_format (str, optional) – Decimal precision to use when writing the output file, passed as input to pd.DataFrame.to_csv(float_format=float_format). For example, use “%.3g” for 3 significant digits.

  • modz_args (dict, optional) – Additional custom arguments passed as kwargs if operation=”modz”. See pycytominer.cyto_utils.modz for more details.

Returns:

consensus_df – The consensus profile DataFrame. If output_file=None, then return the DataFrame. If you specify output_file, then write to file and do not return data.

Return type:

pandas.core.frame.DataFrame, optional

Examples

import pandas as pd
from pycytominer import consensus

data_df = pd.concat(
    [
        pd.DataFrame(
            {
                "Metadata_Plate": "X",
                "Metadata_Well": "a",
                "Cells_x": [0.1, 0.3, 0.8],
                "Nuclei_y": [0.5, 0.3, 0.1],
            }
        ),
        pd.DataFrame(
            {
                "Metadata_Plate": "X",
                "Metadata_Well": "b",
                "Cells_x": [0.4, 0.2, -0.5],
                "Nuclei_y": [-0.8, 1.2, -0.5],
            }
        ),
    ]
).reset_index(drop=True)

consensus_df = consensus(
    profiles=data_df,
    replicate_columns=["Metadata_Plate", "Metadata_Well"],
    operation="median",
    features="infer",
    output_file=None,
)
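
The float_format argument shared by the functions above takes a printf-style format string. As a small illustration using only Python's built-in % formatting (the sample values are hypothetical), “%.3g” limits each number to three significant digits and switches to exponent notation for large magnitudes:

```python
# Hypothetical values spanning several orders of magnitude.
values = [1234.5678, 12.3456, 0.000123456]

# "%.3g" keeps three significant digits, not three decimal places.
formatted = ["%.3g" % v for v in values]
# formatted == ['1.23e+03', '12.3', '0.000123']
```

If a fixed number of decimal places is wanted instead, a format such as “%.3f” is the usual printf-style alternative.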

Helper functions

Module contents