For several reasons, the dark proteome (DP) is orders of magnitude more likely to be biologically valuable than the unmodified, library-matched proteins that conventional searches report.
First, the DP contains the unsearched post-translational modifications (PTMs), such as
* phosphorylation,
* citrullination,
* methylation,
* acetylation,
* ubiquitination, and
* hundreds of other already known PTMs (e.g., those listed in UniProt).
As an example of this importance, many PTMs are critical either as markers of disease or as primary disease-causing agents. For example, the recently approved biomarker for Alzheimer's refers to a very specific set of phosphorylations (pT181, pT217, pT231) on the tau protein. (But how would a DIA-MS scientist know a priori which of the hundreds of possible PTMs to search for, when it is often hard to search for even ~5 potential PTMs, let alone hundreds?) Further, the above list still excludes the hundreds of different glycoforms, which is deeply unfortunate, as glycoforms have been known for decades to be disproportionately biologically valuable (e.g., >90% of FDA-approved drugs target glycoproteins). But the MS is an unbiased instrument: the peptides are present in the MS data regardless of whether, say, a tau protein carries pT181 or pT217, or perhaps some unusual acetylation (e.g., K174) or ubiquitination. So why discard all that MS data, which is arguably far more valuable than the unmodified protein sequences found in generic FASTA files?
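To make the "even ~5 PTMs is hard" point concrete, here is a minimal sketch of how the candidate search space grows with the number of PTM types considered. It is a toy upper bound, not how any particular DIA-MS search engine counts candidates: the function name `count_modified_forms`, the choice of 6 modifiable residues, and the cap of 3 modifications per peptide are all illustrative assumptions, and the model unrealistically lets every PTM type sit on every site.

```python
from math import comb


def count_modified_forms(n_sites, n_ptm_types, max_mods=3):
    """Upper bound on the distinct modified forms of a single peptide.

    Toy model (assumption): any of the n_ptm_types modifications may occupy
    any of the n_sites residues, and at most max_mods residues are modified.
    Choosing j sites gives comb(n_sites, j) placements, each with
    n_ptm_types ** j ways to assign a modification type.
    """
    return sum(comb(n_sites, j) * n_ptm_types ** j for j in range(max_mods + 1))


if __name__ == "__main__":
    # Hypothetical peptide with 6 modifiable residues.
    for n_types in (1, 5, 100):
        forms = count_modified_forms(n_sites=6, n_ptm_types=n_types)
        print(f"{n_types:>3} PTM types -> {forms:,} candidate forms per peptide")
```

Under these assumptions, one PTM type gives 42 candidate forms per peptide, five give 2,906, and one hundred give roughly 20 million. Real search engines restrict each PTM to its plausible residues, so actual numbers are smaller, but the growth is still steep.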
Second, the DP contains those proteins that are not present in the protein library (FASTA) files. This can happen for two reasons. (a) The FASTA file is generated by computationally predicting protein sequences from genomics data, but that prediction is far from exhaustive, and so many biologically critical proteins, such as microproteins, are known to be missing (even proteins now known to be biomarkers for key diseases, if not the causal agents of disease). (b) The vast majority of researchers use a generic FASTA file retrieved from a public library, such as UniProt, instead of one specific to the study conditions, and so proteins that may be specific to the disease, drug, and/or demographics are not present. (In theory, one could create a patient-specific FASTA file for every patient in one's study, but that requires a very expensive RNA-seq experiment, is complex to compute, and would still be non-exhaustive, since the precise relationship between RNA-seq data and computationally predicted proteins is still not fully known.)
Third, the DP contains unexpected sequence variants (e.g., SNPs, which are possible but hard to predict conclusively or exhaustively) as well as proteins with unexpected proteolytic cleavages (which are not currently predictable). For example, the second FDA-approved set of biomarkers for Alzheimer's disease is the AB42 vs AB40 proteins. But these two proteins do not exist as unique entries in FASTA files, because they are small (i.e., <= 42 amino acids) fragments resulting from proteolytic cleavage of a much larger protein called APP (695 or more amino acids in its major isoforms). In other words, even if the tail peptides of AB42 or AB40 were present in the MS data, even at high intensities, existing DIA-MS algorithms would simply never report these two critical proteins in any output results, which is a scientific and ultimately patient-health tragedy.
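The AB42/AB40 case can be sketched with a simple in-silico tryptic digest. The snippet below is illustrative only. The digestion rule (cleave after K or R, not before P, no missed cleavages) is a common simplification and not the behaviour of any specific DIA-MS tool. The helper names are hypothetical. The short APP fragment is typed from the commonly cited sequence around the amyloid-beta region and should be verified against the UniProt APP entry before any real use.

```python
import re

# Amyloid-beta 1-42; AB40 is its first 40 residues.
AB42 = "DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA"
AB40 = AB42[:40]

# Assumption: a few flanking residues of human APP on each side of AB42.
# Verify against the UniProt APP entry before relying on this.
APP_FRAGMENT = "EVKM" + AB42 + "TVIVITLVMLKK"


def tryptic_digest(sequence):
    """Simplified trypsin rule: cleave after K or R unless followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def missing_from_library(proteoform, library_protein):
    """Peptides of the cleaved proteoform that a digest of the library
    protein never produces, so a library-based search cannot match them."""
    library_peptides = set(tryptic_digest(library_protein))
    return [p for p in tryptic_digest(proteoform) if p not in library_peptides]


if __name__ == "__main__":
    for name, seq in (("AB42", AB42), ("AB40", AB40)):
        print(name, "peptides absent from the APP-derived library:",
              missing_from_library(seq, APP_FRAGMENT))
```

In this sketch the internal peptides are shared with full-length APP, so they cannot tell the cleaved forms apart. The terminal peptides that would distinguish AB42 from AB40 end at non-tryptic cleavage sites and therefore never appear in a library built from the FASTA entry.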
Lastly, as a coda to the above paragraphs, there was recently a lively discussion on Bluesky among leading scientists on the value of PTMs, and the summary by Dr. MacCoss is worth reprinting here: "We know that most disease markers are a specific proteoform. For example, we care a lot about measuring c-peptide, BNP, proteoforms of troponin I and T, amyloid-beta, pTau-217, hemoglobin-A1C, etc..."