Enumerating the Value of the Dark Proteome

The dark proteome (DP) is biologically valuable for at least three reasons. One can read the enumerated reasons below; alternatively, one can simply scroll down to the last sentence of the last paragraph on this page (highlighted in orange) for a pithy summary by Dr. MacCoss.

Firstly, the DP contains the unsearched post-translational modifications (PTMs), such as

* phosphorylation,
* citrullination,
* methylation,
* acetylation,
* ubiquitination, and
* hundreds of other already known PTMs (e.g., from UniProt).

Many PTMs are critical either as markers of disease or as primary disease-causing agents. For example, the recently FDA-approved biomarker for Alzheimer's refers to a very specific set of phosphorylations (pT181, pT217, pT231) on the tau protein. Further, the above list still excludes the hundreds of known glycoforms, which is deeply unfortunate, as glycoforms have been known for decades to be disproportionately biologically valuable (e.g., >90% of FDA-approved drugs target glycoproteins).
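To make concrete why a PTM that is not searched stays invisible, here is a minimal sketch. It assumes a simplified search that matches a peptide by precursor mass within a tolerance window; real DIA-MS pipelines are considerably more involved. The mass shifts are standard monoisotopic values, while the 10 ppm tolerance, the peptide mass, and the function names are illustrative assumptions.

```python
# Monoisotopic mass shifts in daltons for the PTMs listed above
# (standard values, rounded to five decimals).
PTM_DELTA_DA = {
    "phosphorylation": 79.96633,     # +HPO3
    "citrullination": 0.98402,       # arginine converted to citrulline
    "methylation": 14.01565,         # +CH2
    "acetylation": 42.01057,         # +C2H2O
    "ubiquitination_GG": 114.04293,  # GlyGly remnant left on lysine after trypsin
}


def modified_mass(unmodified_mass, ptms):
    """Return the peptide mass after adding the shift of each named PTM."""
    return unmodified_mass + sum(PTM_DELTA_DA[name] for name in ptms)


def matches(observed_mass, expected_mass, tol_ppm=10.0):
    """True if the observed mass lies within tol_ppm of the expected mass."""
    return abs(observed_mass - expected_mass) / expected_mass * 1e6 <= tol_ppm


# Hypothetical unmodified peptide mass, chosen only for illustration.
base = 1500.0

for name in PTM_DELTA_DA:
    observed = modified_mass(base, [name])
    # Each modified form falls outside the window around the unmodified mass.
    print(name, round(observed, 5), matches(observed, base))
```

Under these assumptions, every modified form misses the window around the unmodified mass, so a search that does not list the PTM has nothing to match it against.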

Secondly, the DP contains those proteins that are not present in the protein library (FASTA) files. This situation can happen for two reasons. (a) The FASTA file is generated through a computational prediction of protein sequences from genomics data, but that prediction is far from exhaustive, so many biologically critical proteins, such as microproteins, are known to be missing; some of these are now known to be biomarkers for key diseases, if not the causal agents of disease. (b) The vast majority of researchers use a generic FASTA file retrieved from a public library, such as UniProt, instead of one specific to the study conditions, so proteins that may be specific to the studied disease, drug, and/or demographics are not present. In theory, one could create a patient-specific FASTA file for every patient in one's study, but that requires a very expensive RNA-seq experiment, is complex to compute, and would still be non-exhaustive, since the precise relationship between RNA-seq and computationally predicted proteins is still not fully known.

Thirdly, the DP contains unexpected sequence variants (e.g., SNPs, which are possible but hard to predict conclusively or exhaustively) as well as proteins with unexpected proteolytic cleavages, which are not currently predictable. For example, the second FDA-approved set of biomarkers for Alzheimer's disease is the AB42 vs AB40 proteins. But these two proteins do not exist as unique entries in FASTA files because they are small (i.e., <= 42 amino acid) fragments resulting from proteolytic cleavage of a >= 695 amino acid protein called APP. In other words, even if the tail peptides of AB42 or AB40 were present in the MS data, even at high intensities, the existing DIA-MS algorithms would simply never report these two critical proteins in any output results, which is a scientific and ultimately patient-health tragedy.
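The tail-peptide problem can be illustrated with a small sketch. It assumes a search library built by fully tryptic in-silico digestion of the FASTA entries (cut after K or R, not before P, no missed cleavages), which is a common but not universal setup. The sequences below are hypothetical toy strings, not the real APP or amyloid-beta sequences, and the function names are illustrative.

```python
import re


def tryptic_peptides(protein):
    """Fully tryptic peptides with no missed cleavages.

    Cuts after K or R unless the next residue is P.
    Splitting on an empty match requires Python 3.7 or later.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {piece for piece in pieces if piece}


# Hypothetical toy sequences, NOT the real APP / AB42 / AB40 sequences.
LONG_FRAGMENT = "DAEFRHDSGKLVFFAEDVGSNKGAIIGLMVGGVVIA"
SHORT_FRAGMENT = LONG_FRAGMENT[:-2]  # same fragment, two residues shorter
PARENT = "MAGSLK" + LONG_FRAGMENT + "TVIVITLKQYTSR"

# The library only knows the parent entry, as a generic FASTA file would.
library = tryptic_peptides(PARENT)


def missing_from_library(fragment):
    """Peptides of the cleaved fragment that the library cannot contain."""
    return tryptic_peptides(fragment) - library


print(missing_from_library(LONG_FRAGMENT))
print(missing_from_library(SHORT_FRAGMENT))
```

In this toy setup, the internal peptides of each fragment are shared with the parent protein, and only the C-terminal tail peptide distinguishes the two fragments. That tail ends at a non-tryptic site, so it never appears in the digested library and cannot be reported.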

Lastly, as a coda to the above paragraphs, there was recently a discussion on Bluesky about the value of PTMs and sequences not in FASTA files (i.e., proteoforms), and the summary by Dr. MacCoss is worth reprinting here: "We know that most disease markers are a specific proteoform. For example, we care a lot about measuring c-peptide, BNP, proteoforms of troponin I and T, amyloid-beta, pTau-217, hemoglobin-A1C, etc..."
