oktoberfest.pp.digest

oktoberfest.pp.digest(fasta, digestion, missed_cleavages, db, enzyme, special_aas, min_length, max_length)

Digest a given fasta file with specific settings.

This function performs an in-silico digestion of a fasta file based on the provided settings. It returns a dictionary that maps peptides to the list of associated protein IDs.
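The idea of an in-silico digestion can be sketched in plain Python. This is a minimal illustration, not Oktoberfest's implementation: it assumes a simplified trypsin rule (cleave after every K or R, ignoring the proline exception), treats the length bounds as inclusive, and uses the hypothetical helper names `tryptic_peptides` and `digest_proteins` with made-up sequences.

```python
from collections import defaultdict


def tryptic_peptides(sequence, missed_cleavages=0, min_length=7, max_length=60):
    """Simplified full digestion: cleave after every K or R (assumption)."""
    # Cut the sequence into fragments with no missed cleavages.
    fragments = []
    start = 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])

    # Join up to `missed_cleavages` + 1 neighbouring fragments and
    # keep only peptides within the (inclusive) length bounds.
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if min_length <= len(peptide) <= max_length:
                peptides.add(peptide)
    return peptides


def digest_proteins(proteins, **kwargs):
    """Map each peptide to the list of protein IDs it occurs in."""
    peptide_to_proteins = defaultdict(list)
    for protein_id, sequence in proteins.items():
        for peptide in sorted(tryptic_peptides(sequence, **kwargs)):
            peptide_to_proteins[peptide].append(protein_id)
    return dict(peptide_to_proteins)


# Hypothetical toy proteins, chosen so that one peptide is shared.
proteins = {"P1": "AAAKCCCRDDD", "P2": "CCCRGGG"}
result = digest_proteins(proteins, missed_cleavages=1, min_length=4, max_length=60)
print(result)
```

In this toy input the peptide "CCCR" occurs in both proteins, so it maps to two protein IDs, while "DDD" and "GGG" are dropped by the minimum length filter.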

Parameters:
  • fasta (Union[str, Path]) – Path to fasta file containing sequences to digest

  • digestion (str) – The type of digestion, one of “full”, “semi”, “none”

  • missed_cleavages (int) – The number of allowed missed cleavages

  • db (str) – The desired database to produce: target, decoy, or both (the example below passes “concat” to request a combined database)

  • enzyme (str) – The protease to use for digestion, e.g. “trypsin” (the list of available proteases is not yet documented here)

  • special_aas (str) – String of amino acids to be swapped with the preceding amino acid in reversed sequences. This mimics the behaviour of MaxQuant when creating decoys.

  • min_length (int) – Minimal length of digested peptides

  • max_length (int) – Maximal length of digested peptides
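The decoy rule described for special_aas can be illustrated as follows. This is a sketch of one plausible reading of that description (reverse the sequence, then swap each special amino acid with the residue directly before it, scanning left to right); the helper name `reverse_with_swap` is hypothetical and the actual implementation may differ in detail.

```python
def reverse_with_swap(sequence, special_aas="KR"):
    """Reverse a sequence, then swap special residues with their predecessor."""
    residues = list(sequence[::-1])
    # Start at index 1: the first residue has no preceding residue to swap with.
    for i in range(1, len(residues)):
        if residues[i] in special_aas:
            residues[i - 1], residues[i] = residues[i], residues[i - 1]
    return "".join(residues)


# "AKCDR" reversed is "RDCKA"; the K at index 3 is then swapped with the C before it.
decoy = reverse_with_swap("AKCDR")
print(decoy)  # RDKCA
```

The swap keeps the reversed sequence from becoming a plain mirror image of the target, since it shifts where the K and R cleavage sites fall relative to their neighbouring residues.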

Return type:

dict[str, list[str]]

Returns:

A dictionary that maps peptides (keys) to a list of protein IDs (values).
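A dictionary of this shape can be post-processed with ordinary dict operations. The following sketch uses a hand-written, hypothetical `digest_dict` (not real output of this function) to separate peptides that belong to a single protein from those shared between proteins, and to invert the mapping.

```python
from collections import defaultdict

# Hypothetical mapping with the documented shape: peptide -> list of protein IDs.
digest_dict = {
    "AAAKCCCR": ["P1"],
    "CCCR": ["P1", "P2"],
    "CCCRGGG": ["P2"],
}

# Peptides that identify exactly one protein.
unique_peptides = {pep: ids[0] for pep, ids in digest_dict.items() if len(ids) == 1}

# Peptides that occur in more than one protein.
shared_peptides = {pep: ids for pep, ids in digest_dict.items() if len(ids) > 1}

# Inverted view: protein ID -> list of its peptides.
protein_to_peptides = defaultdict(list)
for pep, ids in digest_dict.items():
    for protein_id in ids:
        protein_to_peptides[protein_id].append(pep)

print(unique_peptides)
print(shared_peptides)
print(dict(protein_to_peptides))
```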

Example:

>>> from oktoberfest import preprocessing as pp
>>> peptides = [
...     (">Peptide1 Example peptide 1", "MKTIIALSYIFCLVFAD"),
...     (">Peptide2 Example peptide 2", "GILGFVFRTLTVPS"),
...     (">Peptide3 Example peptide 3", "LLGATCMFV")
... ]
>>> with open("./tests/doctests/input/peptides.fasta", "w") as file:
...     for header, sequence in peptides:
...         file.write(f"{header}\n")
...         file.write(f"{sequence}\n")
>>> digest_dict = pp.digest(fasta="./tests/doctests/input/peptides.fasta",
...                         digestion="full",
...                         missed_cleavages=2,
...                         db="concat",
...                         enzyme="trypsin",
...                         special_aas="KR",
...                         min_length=7,
...                         max_length=60)
>>> print(digest_dict)