oktoberfest.pp.split_search
- oktoberfest.pp.split_search(search_results, output_dir, filenames=None)
Split search results by spectrum file.
Given search results and, optionally, a list of spectrum file names, this function splits the search results by spectrum file and writes one csv file per spectrum file to the provided output directory. The provided file names must match the spectrum file identifiers in the “RAW_FILE” column of the search results. The search results must be provided in internal format (see Custom search results). If no list of file names is provided, all spectrum file identifiers are considered; otherwise, only the identifiers found in the list are used when writing the individual csv files. The output file names follow the convention <filename>.rescore. If a file name is not found in the search results, it is ignored and a warning is printed. The function returns the list of file names for which search results are available; if a list of file names was provided, the ignored names are excluded.
- Parameters:
search_results (DataFrame) – search results in internal format
output_dir (Union[str, Path]) – directory in which to store the individual csv files containing the search results for individual filenames
filenames (Optional[list[str]]) – optional list of spectrum filenames that should be considered. If not provided, all spectrum file identifiers in the search results are considered.
- Return type:
- Returns:
list of file names for which search results could be found
- Example:
>>> from oktoberfest import preprocessing as pp
>>> import pandas as pd
>>> search_results = pd.DataFrame({"RAW_FILE": ["File1","File2"],
...                                "SCAN_NUMBER": [5123,4012],
...                                "MODIFIED_SEQUENCE": ["AAAC[UNIMOD:4]RFVQ","RM[UNIMOD:35]PC[UNIMOD:4]HKPYL"],
...                                "PRECURSOR_CHARGE": [1,2],
...                                "SCAN_EVENT_NUMBER": [4,10],
...                                "MASS": [1000.41,1589.1],
...                                "SCORE": [3.64,5.45],
...                                "REVERSE": [False,False],
...                                "SEQUENCE": ["AAACRFVQ","RMPCHKPYL"],
...                                "PEPTIDE_LENGTH": [8,9]})
>>> pp.split_search(search_results=search_results, output_dir="./tests/doctests/output/")
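
The splitting behaviour described above can be illustrated with a small standard-library sketch. This is not the oktoberfest implementation: the helper name `split_rows_by_raw_file` is hypothetical, rows are plain dictionaries rather than a DataFrame, and the exact warning text is an assumption. It only mirrors the documented rules: group by “RAW_FILE”, write one `<filename>.rescore` csv per group, skip unknown names with a warning, and return the names that were written.

```python
import csv
import tempfile
import warnings
from collections import defaultdict
from pathlib import Path


def split_rows_by_raw_file(rows, output_dir, filenames=None):
    """Hypothetical helper mirroring the documented behaviour (not oktoberfest code)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Collect rows per spectrum file identifier found in the RAW_FILE column.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["RAW_FILE"]].append(row)

    # Without an explicit list, every identifier in the results is considered.
    if filenames is None:
        filenames = list(grouped)

    available = []
    for name in filenames:
        if name not in grouped:
            # Unknown names are skipped with a warning instead of raising.
            warnings.warn(f"No search results found for {name}; skipping.")
            continue
        with open(output_dir / f"{name}.rescore", "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(grouped[name][0].keys()))
            writer.writeheader()
            writer.writerows(grouped[name])
        available.append(name)
    return available


# Illustrative data: "File3" is requested but absent from the results.
rows = [
    {"RAW_FILE": "File1", "SCAN_NUMBER": 5123, "SCORE": 3.64},
    {"RAW_FILE": "File2", "SCAN_NUMBER": 4012, "SCORE": 5.45},
]
with tempfile.TemporaryDirectory() as tmp:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        found = split_rows_by_raw_file(rows, tmp, filenames=["File1", "File3"])
    written = sorted(p.name for p in Path(tmp).iterdir())
```

Here only `File1` is both requested and present, so a single file is written and one warning is raised for `File3`. With a DataFrame, the grouping step would typically be expressed with `groupby("RAW_FILE")` instead of the manual loop.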