oktoberfest.pp.list_spectra

oktoberfest.pp.list_spectra(input_dir, input_format)

Return a list of all spectra files of a given format.

Given an input directory, the function searches it for files containing spectra and returns a list of paths to those files. A file is included if its extension matches the provided format (case-insensitive). If the input path is a file rather than a directory, the function checks whether it matches the format and returns it wrapped in a list. If the format is “d” and the input directory itself ends with “.d”, the function returns the input directory wrapped in a list.
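
The selection rules described above can be sketched in plain Python. This is an illustrative re-implementation using only pathlib, not the library's code: list_spectra_sketch and SUPPORTED_FORMATS are hypothetical names, and the non-recursive scan, the order of the checks, and the error raised for a mismatching single file are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Union

# Formats named on this page; the real library may define them differently.
SUPPORTED_FORMATS = {"mzml", "raw", "hdf", "d"}


def list_spectra_sketch(input_dir: Union[str, Path], input_format: str) -> list[Path]:
    """Hypothetical sketch of the documented lookup rules (not the library code)."""
    path = Path(input_dir)
    fmt = input_format.lower()
    suffix = f".{fmt}"

    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {input_format}")
    if not path.exists():
        raise NotADirectoryError(f"{path} does not exist")

    # A single file, or a directory that itself ends with ".d" when the format
    # is "d", is returned wrapped in a one-element list.
    if path.is_file() or (fmt == "d" and path.suffix.lower() == ".d"):
        # Assumption: a single file with the wrong extension raises AssertionError.
        if path.suffix.lower() != suffix:
            raise AssertionError(f"{path} does not match format {fmt}")
        return [path]

    # Assumption: only direct children are scanned (no recursion).
    matches = sorted(p for p in path.iterdir() if p.suffix.lower() == suffix)
    if not matches:
        raise AssertionError(f"No {fmt} files found in {path}")
    return matches


# Usage: extension matching ignores case, and unrelated files are skipped.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "File1.mzML").touch()
    (Path(tmp) / "notes.txt").touch()
    found_names = [p.name for p in list_spectra_sketch(tmp, "mzml")]
    print(found_names)
```

Comparing lowercased suffixes keeps the match case-insensitive, so File1.mzML and File1.mzml are treated the same.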

Parameters:
  • input_dir (Union[str, Path]) – Path to the directory to scan for spectra files. A path to a single spectra file or to a .d directory is also accepted.

  • input_format (str) – Format of the input for the provided directory. This must match the file extension (mzml, raw, hdf) or directory extension (d). Matching is case-insensitive.

Raises:
  • NotADirectoryError – if the specified input directory does not exist

  • ValueError – if the specified file format is not supported

  • AssertionError – if the provided input directory (d) does not match the provided format or if none of the files within the provided input directory (mzml, raw, hdf) match the provided format
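
The three exception types listed above can be handled separately by a caller. The following is a minimal sketch of one way to do that, not a recommended pattern from the library: collect_spectra is a hypothetical helper, and lister stands for oktoberfest.pp.list_spectra, passed in as an argument so that the snippet runs without the package installed. Returning an empty list on failure is a design choice made for this illustration only.

```python
from pathlib import Path
from typing import Callable, Union


def collect_spectra(
    lister: Callable[[Union[str, Path], str], list],
    input_dir: Union[str, Path],
    input_format: str,
) -> list:
    """Hypothetical wrapper: report each documented error and return an empty list."""
    try:
        return lister(input_dir, input_format)
    except NotADirectoryError:
        # Documented: the input directory does not exist.
        print(f"Input path not found: {input_dir}")
    except ValueError:
        # Documented: the file format is not supported.
        print(f"Unsupported spectra format: {input_format}")
    except AssertionError:
        # Documented: nothing in the input matches the requested format.
        print(f"No spectra of format '{input_format}' found under {input_dir}")
    return []
```

With the package installed, the call would look like collect_spectra(pp.list_spectra, "./tests/doctests/input/spectra/", "mzml").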

Return type:

list[Path]

Returns:

A list of paths to all spectra files found in the given directory

Example:

>>> from oktoberfest import preprocessing as pp
>>> import os
>>> # create a minimal example .mzml file
>>> filecontent = '''<?xml version="1.0" encoding="UTF-8"?>
... <mzML xmlns="http://example" version="1.1.0">
...   <cvList count="2">
...     <cv id="MS" fullName="Mass Spectrometry Ontology" version="4.1.0" URI="https://example"/>
...     <cv id="UO" fullName="Unit Ontology" version="1.23" URI="http://example"/>
...   </cvList>
...   <fileDescription>
...     <fileContent>
...       <cvParam cvRef="MS" accession="MS:1000579" name="MS1 spectrum"/>
...     </fileContent>
...   </fileDescription>
...   <referenceableParamGroupList count="1">
...     <referenceableParamGroup id="commonInstrumentParams">
...       <cvParam cvRef="MS" accession="MS:1000031" name="instrument model" value="Example Instrument"/>
...     </referenceableParamGroup>
...   </referenceableParamGroupList>
...   <run id="run1" defaultInstrumentConfigurationRef="IC1">
...     <spectrumList count="1">
...       <spectrum index="0" id="scan=1" defaultArrayLength="5">
...         <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
...         <binaryDataArrayList count="2">
...           <binaryDataArray encodedLength="20">
...             <cvParam cvRef="MS" accession="4" name="m/z" unitCvRef="MS" unitAccession="0" unitName="m/z"/>
...             <binary>...</binary>
...           </binaryDataArray>
...           <binaryDataArray encodedLength="20">
...             <cvParam cvRef="MS" accession="5" name="i" unitCvRef="MS" unitAccession="1" unitName="c"/>
...             <binary>...</binary>
...           </binaryDataArray>
...         </binaryDataArrayList>
...       </spectrum>
...     </spectrumList>
...   </run>
... </mzML>'''
>>> # write the file into a directory that list_spectra can scan
>>> os.makedirs("./tests/doctests/input/spectra", exist_ok=True)
>>> with open("./tests/doctests/input/spectra/File1.mzml", "w") as f:
...     f.write(filecontent)
>>> paths = pp.list_spectra(input_dir="./tests/doctests/input/spectra/", input_format="mzml")
>>> print(paths)