oktoberfest.pp.convert_search

oktoberfest.pp.convert_search(input_path, search_engine, tmt_label='', custom_mods=None, output_file=None, ptm_unimod_id=0, ptm_sites=None)

Convert search results to Oktoberfest format.

Given a path to a file or directory containing search results from a supported search engine, the function parses the results, converts them to the internal format used by Oktoberfest, and returns them as a dataframe. If a path to an output file is provided, the converted results are also written to that location. The specification of the internal file format can be found at Custom search results.

Parameters:

input_path (Union[str, Path]) – Path to the directory or file containing the search results.
search_engine (str) – The search engine used to produce the search results. Currently supported are “Maxquant”, “Mascot” and “MSFragger”.
tmt_label (str) – Optional TMT label to consider when processing peptides. If given, the corresponding fixed modification for the N-terminus and lysine will be added.
custom_mods (Optional[dict[str, int]]) – Optional dictionary of custom modifications, used when the input is not in the internal Oktoberfest format. The keys are the static and variable modifications; the values are the integer UniMod identifiers of the respective modifications.
output_file (Union[str, Path, None]) – Optional path to the location where the converted search results should be written. If this is omitted, the results are not stored.
ptm_unimod_id (Optional[int]) – UniMod ID of the PTM used for site localization.
ptm_sites (Optional[list]) – Possible sites at which the PTM can occur.
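
As an illustration of how the optional parameters fit together, the following sketch assembles keyword arguments for a conversion with custom modifications and PTM site localization. The UniMod accessions (35 for oxidation, 21 for phosphorylation) come from the public UniMod database. Everything else is an assumption made for the sake of the example: the key notation in custom_mods, the one-letter residue codes in ptm_sites, and both paths are hypothetical placeholders, not values confirmed by this page.

```python
from pathlib import Path

# Parameter names taken from the documented signature of convert_search.
DOCUMENTED_PARAMS = {
    "input_path", "search_engine", "tmt_label", "custom_mods",
    "output_file", "ptm_unimod_id", "ptm_sites",
}

# UniMod accessions: 35 = Oxidation, 21 = Phospho.
# The key strings are hypothetical; the notation actually expected for a
# given search engine is not specified in this section.
custom_mods = {
    "M(Oxidation)": 35,
    "S(Phospho)": 21,
}

convert_kwargs = {
    "input_path": Path("./search_results"),        # hypothetical input directory
    "search_engine": "maxquant",
    "custom_mods": custom_mods,
    "output_file": Path("./converted/psms.csv"),   # hypothetical output location
    "ptm_unimod_id": 21,
    "ptm_sites": ["S", "T", "Y"],                  # assumed one-letter residue codes
}

# Catch misspelled keyword names before they reach the function.
unknown = set(convert_kwargs) - DOCUMENTED_PARAMS
if unknown:
    raise TypeError(f"Unexpected keyword arguments: {sorted(unknown)}")
```

With Oktoberfest installed, the dictionary could then be passed on as pp.convert_search(**convert_kwargs).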

Raises:

ValueError – If an unsupported search engine is given.

Return type:

DataFrame

Returns:

A dataframe containing the converted results.

Example:

>>> from oktoberfest import preprocessing as pp
>>> import pandas as pd
>>> msms = pd.DataFrame({'Raw file': ['GN20170722_SK_HLA_G0103_R1_01', 'GN20170722_SK_HLA_G0103_R2_02'],
...     'Scan number': [21329, 20501],
...     'Scan index': [18847, 17998],
...     'Sequence': ['AAAAVVSGPKRGRKKP', 'AAAAVVSGPKRGRKKP'],
...     'Length': [16, 16],
...     'Missed cleavages': ['', ''],
...     'Modifications': ['Unmodified', 'Unmodified'],
...     'Modified sequence': ['_AAAAVVSGPKRGRKKP_', '_AAAAVVSGPKRGRKKP_'],
...     'Oxidation (M) Probabilities': ['', ''],
...     'Oxidation (M) Score Diffs': ['', ''],
...     'Oxidation (M)': [0, 0],
...     'Proteins': ['', ''],
...     'Charge': [3, 3],
...     'Fragmentation': ['HCD', 'HCD'],
...     'Mass analyzer': ['FTMS', 'FTMS'],
...     'Type': ['MULTI-SECPEP', 'MULTI-SECPEP'],
...     'Scan event number': [9, 5],
...     'Isotope index': [2, 2],
...     'm/z': [531.66176, 531.66176],
...     'Mass': [1591.9634, 1591.9634],
...     'Mass Error [ppm]': [-2.1109999999999998, -1.1018],
...     'Simple Mass Error [ppm]': [1259.2803, 1259.2803],
...     'Retention time': [46.272, 46.388000000000005],
...     'PEP': [0.57389, 0.57389],
...     'Score': [7.9138, 4.7582],
...     'Delta score': [3.5652, 1.4401],
...     'Score diff': ['', ''],
...     'Localization prob': [1, 1],
...     'Combinatorics': [0, 0],
...     'PIF': [0, 0],
...     'Fraction of total spectrum': [0, 0],
...     'Base peak fraction': [0, 0],
...     'Precursor Full ScanNumber': [-1, -1],
...     'Precursor Intensity': [0, 0],
...     'Precursor Apex Fraction': [0, 0],
...     'Precursor Apex Offset': [0, 0],
...     'Precursor Apex Offset Time': [0, 0],
...     'Matches Intensities': ['y5;y10;y5-NH3;a2;b12(2+)', 'y5;y5-NH3;b12(2+)'],
...     'Mass Deviations [Da]': ['34666.4;2191.7;88570.6;2148.7;89073.6', '10544.1;36224.8;73327.7'],
...     'Mass Deviations [ppm]': ['0.008335659;-0.01799215;-0.002397317;-0.0004952438;-0.004926575',
...                               '0.009286639;-0.004650567;-0.002822918'],
...     'Masses': ['14.23987;-16.19888;-4.217963;-4.303209;-9.237617', '15.86446;-8.182414;-5.293158'],
...     'Number of Matches': [5, 3],
...     'Intensity coverage': [0.1016966, 0.1564349],
...     'Peak coverage': [0.04166667, 0.04477612],
...     'Neutral loss level': ['None', 'None'],
...     'ETD identification type': ['Unknown', 'Unknown'],
...     'Reverse': ['Unknown +', 'Unknown +'],
...     'All scores': ['7.913836;4.348669;4.097387', '4.758178;3.318045;2.968256'],
...     'All sequences': ['AAAAVVSGPKRGRKKP;GVVAKGALTPKLSPVVG;GVVPSLKPTLAGKAVVG',
...                       'AAAAVVSGPKRGRKKP;VMKLLRHDKLVQL;QEILRKILPLGELA'],
...     'All modified sequences': ['_AAAAVVSGPKRGRKKP_;_GVVAKGALTPKLSPVVG_;_GVVPSLKPTLAGKAVVG_',
...                                '_AAAAVVSGPKRGRKKP_;_VMKLLRHDKLVQL_;_QEILRKILPLGELA_'],
...     'id': [1378, 1379],
...     'Protein group IDs': ['42625', '42625'],
...     'Peptide ID': [533, 533],
...     'Mod. peptide ID': [537, 537],
...     'Evidence ID': [1075, 1076],
...     'Oxidation (M) site IDs': ['', '']})
>>> msms.to_csv("./tests/doctests/input/msms.txt", sep='\t', index=False)
>>> converted_results = pp.convert_search(input_path="./tests/doctests/input/", search_engine="maxquant")
>>> print(converted_results)