oktoberfest.re.generate_features

oktoberfest.re.generate_features(library, search_type, output_file, additional_columns=None, all_features=False, xl=False, cms2=False, regression_method='spline', add_neutral_loss_features=False, remove_miss_cleavage_features=False, task='default', featured_ions=None)

Generate features to be used for percolator or mokapot target-decoy separation.

The function calculates a range of metrics and features on the provided library for the chosen FDR estimation method, then writes them as a tab-separated input file to the chosen output location.

Parameters:

library (Spectra) – the library to perform feature generation on
search_type (str) – Either “original” or “rescore”; determines which features are generated
output_file (str | Path) – the path of the generated tab file to be used for percolator / mokapot
additional_columns (str | list | None) – additional columns supplied in the search results to be used as features (either a list or “all”)
all_features (bool) – whether to use all features or only the standard set
xl (bool) – crosslinked or linear peptide
cms2 (bool) – cleavable or non-cleavable crosslinker
regression_method (str) – The regression method to use for iRT alignment
add_neutral_loss_features (bool) – whether to add neutral loss features to the percolator input
remove_miss_cleavage_features (bool) – whether to remove missed-cleavage features from the percolator input
task (str) – Flag to indicate whether to use multifrag features or not
featured_ions (Optional[list]) – The ion series to use for calculating percolator features

Example:

>>> from oktoberfest import rescore as re
>>> from oktoberfest import predict as pr
>>> from oktoberfest.data import Spectra, FragmentType
>>> import pandas as pd
>>> import numpy as np
>>> # Required columns: RAW_FILE, MODIFIED_SEQUENCE, SEQUENCE, CALCULATED_MASS, SCAN_NUMBER,
>>> # COLLISION_ENERGY, PRECURSOR_CHARGE, REVERSE and SCORE
>>> meta_df = pd.DataFrame({"RAW_FILE": ["File1","File1"],
>>>                         "MODIFIED_SEQUENCE": ["AAAC[UNIMOD:4]RFVQ","RM[UNIMOD:35]PC[UNIMOD:4]HKPYL"],
>>>                         "SEQUENCE": ["AAACRFVQ","RMPCHKPYL"],
>>>                         "CALCULATED_MASS": [1000,4000],
>>>                         "SCAN_NUMBER": [1,2],
>>>                         "COLLISION_ENERGY": [30,35],
>>>                         "PRECURSOR_CHARGE": [1,2],
>>>                         "FRAGMENTATION": ["HCD","HCD"],
>>>                         "REVERSE": [False,False],
>>>                         "SCORE": [0,0]})
>>> var = Spectra._gen_vars_df()
>>> library = Spectra(obs=meta_df, var=var)
>>> raw_intensities = np.random.rand(2,174)
>>> mzs = np.random.rand(2,174)*1000
>>> annotation = np.array([var.index,var.index])
>>> library.add_intensities(raw_intensities, annotation, FragmentType.RAW)
>>> library.add_mzs(mzs, FragmentType.MZ)
>>> library.strings_to_categoricals()
>>> intensity_predictor = pr.Predictor.from_koina(
>>>                         model_name="Prosit_2020_intensity_HCD",
>>>                         server_url="koina.wilhelmlab.org:443",
>>>                         ssl=True)
>>> intensity_predictor.predict_intensities(data=library)
>>> re.generate_features(library=library,
>>>                         search_type="original",
>>>                         regression_method="spline",
>>>                         output_file="./tests/doctests/output/original.tab")
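Once the tab file has been written, it can be useful to check it before handing it to percolator or mokapot. The following is a minimal sketch of such a check using pandas. It does not call oktoberfest: a small mock file stands in for the real output, and the column names (`SpecId`, `Label`, `ScanNr`, `example_feature`, `Peptide`, `Proteins`) and all values are illustrative assumptions based on the general Percolator tab layout, where a `Label` of 1 marks a target and -1 marks a decoy. The actual columns written by `generate_features` may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Mock stand-in for a generated tab file. Column names and values are
# illustrative assumptions, not the exact output of generate_features.
mock_tab = pd.DataFrame(
    {
        "SpecId": ["File1-1-AAACRFVQ", "File1-2-RMPCHKPYL", "File1-3-QVFRCAAA"],
        "Label": [1, 1, -1],  # Percolator convention: 1 = target, -1 = decoy
        "ScanNr": [1, 2, 3],
        "example_feature": [0.91, 0.78, 0.12],  # hypothetical feature column
        "Peptide": ["_.AAACRFVQ._", "_.RMPCHKPYL._", "_.QVFRCAAA._"],
        "Proteins": ["P1", "P2", "REV_P1"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the mock file the same way a tab-separated input file is laid out.
    tab_path = Path(tmp_dir) / "original.tab"
    mock_tab.to_csv(tab_path, sep="\t", index=False)

    # Read it back as one would read a real output file.
    features = pd.read_csv(tab_path, sep="\t")

# Basic sanity checks: both targets and decoys present, no missing values.
n_targets = int((features["Label"] == 1).sum())
n_decoys = int((features["Label"] == -1).sum())
has_missing = bool(features.isna().any().any())

print(f"targets: {n_targets}, decoys: {n_decoys}, missing values: {has_missing}")
```

To run the same checks on a real file, replace the mock section with a single `pd.read_csv(output_file, sep="\t")` call on the path passed as `output_file`. A file without any decoys (as in the example above, where `REVERSE` is `False` for every row) cannot be used for target-decoy separation.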