ORVAL is the first web bioinformatics platform for the exploration of predicted candidate disease-causing variant combinations, aiming to aid in uncovering the causes of oligogenic diseases (i.e. diseases caused by variants in a small number of genes). This tool integrates innovative machine learning methods for combinatorial variant pathogenicity prediction, further external annotations and interactive and exploratory visualisation techniques.
What can you do with ORVAL?
PREDICT CANDIDATE DISEASE-CAUSING VARIANT COMBINATIONS
NOTE: The main results of this platform are based on predictive tools.
They are provided for research, educational and informational purposes only and the pathogenicity predictions should be subject to further scientific and clinical investigation.
It is not in any way intended to be used as a substitute for professional medical advice, diagnosis, treatment or care.
The input data
ORVAL accepts a list of variants from a single individual only, because it creates all possible variant combinations between gene pairs on the assumption that all variants belong to the same individual.
You can provide either Single Nucleotide Variants (SNVs) or small insertions/deletions (indels).
NOTE: At the moment, ORVAL is not recommended for the analysis of complete patient exomes, due to the amount of False Positives that will be obtained.
It is highly recommended to choose the suggested Variant Filtering options that are provided in the Submission page and to restrict your analysis to relevant gene panels using the Gene Filtering option (a limit of 300 genes is recommended).
If your VCF contains the complete exome of an individual, you can still upload it in ORVAL and then specify your filtering options before submitting your data.
Types of input files
There are two different types of variant input that you can use to upload your data: either a tab-delimited variant list or a VCF file. After uploading your data, you can start the analysis by clicking on the button.
Tab-delimited variant list
In the left panel of the Submission page you can copy-paste a variant list. Each line should contain tab-delimited information for one variant, in the following order: chromosome, position, reference allele, alternative allele, zygosity.
No headers are needed.
The zygosity values should be either Heterozygous or Homozygous. During the analysis, ORVAL automatically converts X-linked variants in males to Hemizygous.
You can also manually insert a variant by filling in the chr, position, reference allele, alternative allele and Zygosity fields and pressing the button.
NOTE: using the variant list panel, you can upload up to 80000 variants.
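The tab-delimited format described above can be checked locally before submission. The following is a minimal sketch, not ORVAL's own parser: the `Variant` type, the `parse_variant_list` function and the `sex` argument are hypothetical names, and the handling of the X chromosome (every X variant in a male becomes Hemizygous) is a simplifying assumption.

```python
from typing import List, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    zygosity: str


VALID_ZYGOSITY = {"Heterozygous", "Homozygous"}


def parse_variant_list(text: str, sex: Optional[str] = None) -> List[Variant]:
    """Parse header-less lines of: chromosome, position, ref, alt, zygosity."""
    variants = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 tab-delimited fields")
        chrom, pos, ref, alt, zygosity = fields
        if zygosity not in VALID_ZYGOSITY:
            raise ValueError(f"line {line_no}: invalid zygosity {zygosity!r}")
        # Assumption: any X-linked variant in a male is relabelled Hemizygous.
        if sex == "male" and chrom.upper().replace("CHR", "") == "X":
            zygosity = "Hemizygous"
        variants.append(Variant(chrom, int(pos), ref, alt, zygosity))
    return variants
```

For example, `parse_variant_list("16\t3254468\tCTT\t-\tHeterozygous")` returns a single `Variant` with the position converted to an integer.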
Alternatively, you can submit a VCF file (version 4.2) with your variants at the right panel of the submission page.
ORVAL requires at minimum the presence of:
- the header line: the #CHROM POS ID REF ALT etc. line
- the columns CHROM, POS, ID, REF, ALT, FORMAT, SAMPLE_NAME (patient information column containing values corresponding to the FORMAT field).
- the genotype (GT) field for each variant at the FORMAT and SAMPLE_NAME columns. In case variants with GT: 0/0 or 0|0 are present, these are discarded from the analysis.
Any other meta-information lines on the top of the file or any extra columns and fields (e.g. QUAL, INFO, etc.) can be present, but ORVAL will ignore them.
NOTE: if your VCF contains information for several individuals, you should separate the information of each individual in different VCF files and run them individually in ORVAL.
NOTE: if a row lists several alternative alleles, only the first one is considered in the analysis.
NOTE: you can upload a VCF file of size up to 50 MB. The file can also be compressed either with zip, gzip, bzip2 or xz.
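The minimum VCF requirements and the notes above (single sample, GT field, 0/0 genotypes discarded, first alternative allele only) can be sketched as a small reader. This is an illustration under assumptions, not the server's code: `read_vcf_genotypes` is a hypothetical name, the input is an iterable of already decompressed text lines, and the sample is assumed to be the column right after FORMAT.

```python
def read_vcf_genotypes(lines):
    """Return (chrom, pos, ref, first_alt, gt) tuples from single-sample VCF lines."""
    header = None
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # meta-information lines are ignored
        if line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")
            continue
        if header is None:
            raise ValueError("missing #CHROM header line")
        row = dict(zip(header, line.split("\t")))
        # Assumption: the single sample column directly follows FORMAT.
        sample_col = header[header.index("FORMAT") + 1]
        genotype = dict(zip(row["FORMAT"].split(":"), row[sample_col].split(":")))
        gt = genotype.get("GT")
        if gt is None:
            raise ValueError("GT field missing for " + row["CHROM"] + ":" + row["POS"])
        if gt in ("0/0", "0|0"):
            continue  # homozygous reference genotypes are discarded
        first_alt = row["ALT"].split(",")[0]  # only the first alternative allele
        records.append((row["CHROM"], int(row["POS"]), row["REF"], first_alt, gt))
    return records
```

Extra columns such as QUAL or INFO are read into `row` but never used, mirroring the statement that they may be present and are ignored.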
In case you want to create your own VCF file, you can download and take a look at the example VCFs that are present at the VCF submission panel and/or consult the Samtools specification page on how to construct a proper VCF file.
You can either submit Single Nucleotide Variants (SNVs) or small insertions/deletions (indels). Other types of variants (e.g. CNVs) can be present in your list, but they will not be included in the analysis.
Specifically for indels, you can submit your variants in either one of the two different ways that are shown for a particular variant example (the VCF file can contain more columns).
| ||Tab-delimited list||VCF file|
|Example with dashes||16 3254468 CTT - Heterozygous||16 3254468 . CTT - PASS GT 1/0|
|Example without dashes||16 3254467 CCTT C Heterozygous||16 3254467 . CCTT C PASS GT 1/0|
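The two deletion notations in the table describe the same event: the form without dashes carries one shared leading base, and the form with dashes drops it and shifts the position by one. A minimal sketch of that conversion follows; `deletion_to_dash_form` is a hypothetical helper, and it deliberately handles deletions only, since the table does not show how insertions are written with dashes.

```python
def deletion_to_dash_form(pos, ref, alt):
    """Convert an anchored deletion (e.g. CCTT > C) to dash notation (CTT > -)."""
    is_anchored_deletion = len(alt) == 1 and len(ref) > 1 and ref[0] == alt
    if is_anchored_deletion:
        # Drop the shared anchor base and move the position one base right.
        return pos + 1, ref[1:], "-"
    return pos, ref, alt  # anything else is returned unchanged
```

Applied to the table's example, `deletion_to_dash_form(3254467, "CCTT", "C")` gives position 3254468 with reference `CTT` and alternative `-`.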
At the moment ORVAL accepts and annotates variants using the GRCh37/hg19 human genome assembly.
We do not make conversions of genomic coordinates from different genome versions. In case you need to convert your variants, you are encouraged to use tools like the UCSC, Ensembl and NCBI assembly converters.
Apart from the variant list, you should also provide (if available) the sex of the patient, i.e. whether the person is male or female.
ORVAL handles X-linked variants in males (hemizygous variants) differently from those in females, so this information is important for providing better predictions.
Example input files
You can try ORVAL with the two example VCF files that are present in the VCF file section of the variant submission page. These files give you the opportunity to test ORVAL on a small or large number of variants and see what the webserver has to offer.
- the Example_VCF_1 file contains 25 variants and its running time (with filtering) is 15 seconds.
- the Example_VCF_2 file contains 1800 variants and its running time (with filtering) is 18 seconds.
Every time you submit your data, you will first get directed to the Submitted ORVAL Job page where you can follow the status of your submission.
On this page you will also receive a Job Id, which you can use to re-access the results of that specific submission or to report errors. That Job Id is also present in the URL of the Results page, in the format: orval.ibsquare.be/results?id=YourJobID.
You can re-access your results by:
- saving the URL of the Job or the Results page
- typing on your browser https://orval.ibsquare.be/results?id= followed by the Job ID
Do you receive error or warning messages during your data submission? You can consult the Frequently Asked Questions (FAQ) section for detailed explanations on how to handle them.
NOTE: all result information is automatically deleted 7 days after the submission at 04.00 am (GMT+2). After this period, you have to re-submit your data and you will receive a new Job Id.
NOTE: for server monitoring purposes we allow every user (based on their IP address) to run up to 5 different Submission Jobs at the same time. In case you exceed this number, you will have to wait until at least one of the running Jobs is finished to launch a new one.
Data filtering and annotation
In the submission page, ORVAL offers a recommended variant and gene filtering procedure that will automatically run when you submit your data. This procedure is highly recommended, as it will limit the amount of variant combinations to be tested and will restrict the analysis to the most relevant variants.
The variant filtering procedure ensures that your analysis contains only relevant variants, in accordance with the variant types used to train the predictive methods (VarCoPP and Digenic Effect predictor) integrated in ORVAL: exonic and splicing variants with a MAF lower than or equal to 3% in protein-coding genes.
The three different filtering options are already pre-selected in the Variant Filtering panel of the submission page. You can unselect a filtering option, by clicking on its corresponding check-box.
Select the maximum ExAC MAF allowed for the variants. A MAF of ≤ 0.03 was used to train VarCoPP and is the recommended threshold.
Removes variants that are not inside the defined gene coordinates, based on the human assembly GRCh37/hg19.
Remove intronic and synonymous
- all intronic variants that lie more than 13 nucleotides away from every exon edge, based on the exon coordinates of the canonical transcript of the gene.
- all synonymous variants that lie more than 7 nucleotides away from every exon edge, based on the exon coordinates of the canonical transcript of the gene.
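The three filtering options can be summarised as a single predicate over an already annotated variant. This is a sketch under assumptions rather than ORVAL's implementation: the `AnnotatedVariant` fields and the `passes_filters` function are hypothetical, the consequence labels are illustrative, and keeping variants that have no ExAC frequency is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedVariant:
    maf: Optional[float]     # ExAC MAF, None when the variant is absent from ExAC
    in_gene: bool            # inside the defined gene coordinates (GRCh37/hg19)
    consequence: str         # illustrative labels: "intronic", "synonymous", "missense"
    dist_to_exon_edge: int   # nucleotides to the nearest exon edge, canonical transcript


def passes_filters(v, max_maf=0.03, filter_maf=True, filter_gene=True,
                   filter_intronic_synonymous=True):
    """Return True when the variant survives every selected filtering option."""
    # Assumption: variants without an ExAC frequency are treated as rare and kept.
    if filter_maf and v.maf is not None and v.maf > max_maf:
        return False
    if filter_gene and not v.in_gene:
        return False
    if filter_intronic_synonymous:
        if v.consequence == "intronic" and v.dist_to_exon_edge > 13:
            return False
        if v.consequence == "synonymous" and v.dist_to_exon_edge > 7:
            return False
    return True
```

Each keyword argument corresponds to one check-box of the Variant Filtering panel, so unselecting an option amounts to passing `False` for it.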
NOTE: apart from the requested filtering steps, ORVAL may also exclude some extra variants during the data annotation process. You can consult the complete list of variant exclusion cases during that process here.
The gene filtering option restricts the analysis to a specified list of relevant genes that can be present in your data. This procedure is highly recommended in case your VCF contains the complete exome of an individual, as it can dramatically limit the amount of False Positives that can be obtained.
To run your analysis only with a subset of genes, you can simply upload a .txt file with the gene symbols you are interested to include, each gene being in a different line. After submission, ORVAL will use this list to filter the genes that will be used in the analysis.
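As a rough illustration of the gene list format (one gene symbol per line), the sketch below loads such a list and keeps only variants in the listed genes. The function names and the `"gene"` key are hypothetical, and exact, case-sensitive symbol matching is an assumption.

```python
def load_gene_panel(text):
    """Read one gene symbol per line, ignoring blank lines and surrounding spaces."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def filter_by_genes(variants, panel):
    """Keep variants whose (hypothetical) 'gene' key is in the panel."""
    return [v for v in variants if v["gene"] in panel]
```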
After you submit your data, ORVAL:
- automatically annotates them with the biological information needed for the integrated predictive methods (VarCoPP and the Digenic Effect predictor)
- creates all possible variant combinations between any pair of genes present in your variant input and
- orders the variants and genes inside each combination.
Below, you can find some important parameters for each process.
VarCoPP first annotates each variant using the Ensembl GRCh37/hg19 database, considering only the canonical transcript of each gene as defined by Ensembl.
For small insertions and deletions, we also obtain protein sequences from UniProt, starting from the canonical Ensembl transcript identifiers, as these sequences are needed to calculate some of the features of our predictive methods. We also check whether the reference amino acid is indeed present at the correct position in the canonical protein sequence.
In some situations during the data annotation process ORVAL excludes variants from the analysis and you will not find them in the results:
- Variant not in database
If the variant is not present in the Ensembl database, it will not be included in the analysis.
- Variant not exonic in canonical transcript
There may be some cases where a variant is exonic for some alternative transcripts of the corresponding gene, but not for the canonical transcript that ORVAL is using. In this case, if you apply a filtering procedure for intronic variants, this variant will be excluded.
- Variant with invalid zygosity
Variants with GT:0/0 or GT:0|0 in a VCF file are considered invalid and are excluded from the analysis.
- Alternative variant
In case multiple alternative variants are present in a row in a VCF file, we only take into account the first alternative variant. The rest of the variants are excluded from the analysis.
- CADD score not available
ORVAL also annotates variants with a CADD score. As this feature is important for VarCoPP, a variant without an available CADD score is excluded from the analysis, as a missing value may severely alter the results.
- Variants only in one gene
As ORVAL creates combinations between gene pairs, if your input includes only variants from one gene, you will not get any results at the end of the annotation.
- The variant is a CNV
ORVAL analyses only SNVs and small insertions and deletions. Any other variant type in your data is automatically excluded from the analysis.
VarCoPP annotates each gene name using the Ensembl GRCh37/hg19 database, considering only the canonical transcript of each gene as defined by Ensembl.
It then annotates the genes with the required features for VarCoPP and the Digenic Effect predictor. The gene recessiveness and haploinsufficiency probabilities, essentiality in mouse and pathway features for the predictive methods are obtained using the dbNSFP database.
Another feature that ORVAL uses to annotate genes is the Gene Damage Index (GDI), a metric that shows the susceptibility of a gene to disease. Lower values of GDI indicate greater susceptibility of a gene to candidate disease-causing mutations.
NOTE: ORVAL uses the GDI to order the appearance of genes inside each digenic variant combination, with gene A being always the gene with the lower GDI value. You can find more information about this procedure in the Creating digenic combinations section of the Documentation page.
Gene pair annotation
At the gene pair level, ORVAL annotates the genes of a pair with pathway information from Reactome and with their Biological Distance, a metric of biological relatedness between any two genes, based on protein-protein interaction information.
Creating digenic combinations
After annotation, VarCoPP creates all possible variant combinations between any gene pair present in your input, taking into consideration any filtering options you have included during your variant submission.
You can find below a list of details and constraints that take place during this procedure.
Number of variants per combination
For any gene pair, ORVAL creates variant combinations that can be:
- bi-allelic (i.e. one mutated allele at each gene)
e.g.: one heterozygous variant per gene
- tri-allelic (i.e. three mutated alleles in total)
e.g.: a homozygous variant in gene A and a heterozygous variant in gene B
- tetra-allelic (i.e. four mutated alleles in total)
e.g.: one homozygous variant per gene
In the tri-allelic and tetra-allelic cases, a digenic combination can also include heterozygous compound variants (i.e. two different mutated alleles in the same gene), along with the presence of variant(s) in another gene.
NOTE: Tetra-allelic variant combinations with heterozygous compound variants in BOTH genes are not created.
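One way to read the rules above is as a product of per-gene "configurations". The sketch below enumerates them under stated assumptions: a gene contributes either a single variant (one allele, or two if homozygous) or a pair of different heterozygous variants, and any two heterozygous variants in the same gene are treated as a possible compound pair because no phasing is available. The function names are hypothetical and this is not ORVAL's actual code.

```python
from itertools import combinations, product


def gene_configurations(variants):
    """variants: list of (variant_id, zygosity). Returns (ids, n_alleles, is_compound)."""
    configs = []
    for variant_id, zygosity in variants:
        # Assumption: anything that is not Homozygous contributes one allele.
        n_alleles = 2 if zygosity == "Homozygous" else 1
        configs.append(((variant_id,), n_alleles, False))
    heterozygous = [vid for vid, zyg in variants if zyg == "Heterozygous"]
    for first, second in combinations(heterozygous, 2):
        configs.append(((first, second), 2, True))  # heterozygous compound pair
    return configs


def digenic_combinations(gene_a_variants, gene_b_variants):
    """Enumerate bi-, tri- and tetra-allelic combinations for one gene pair."""
    result = []
    for (ids_a, n_a, comp_a), (ids_b, n_b, comp_b) in product(
        gene_configurations(gene_a_variants), gene_configurations(gene_b_variants)
    ):
        if comp_a and comp_b:
            continue  # compound variants in BOTH genes are not created
        result.append({"gene_a": ids_a, "gene_b": ids_b, "alleles": n_a + n_b})
    return result
```

With two heterozygous variants in each gene, every gene has three configurations, so the product yields nine pairs and eight combinations remain once the double-compound case is removed.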
Order of variant alleles inside the gene
In case of two different mutated alleles in the same gene (heterozygous compound cases), the variant allele 1 is always the variant allele with the highest CADD score.
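The two ordering rules (gene A is the gene with the lower GDI, and variant allele 1 is the allele with the highest CADD score) can be sketched as follows. The dictionary layout, the key names and the `order_combination` function are hypothetical, and tie handling is not described in the text, so ties here simply keep the input order.

```python
def order_combination(gene_x, gene_y):
    """Return (gene_a, gene_b): lower GDI first, alleles sorted by descending CADD."""
    ordered = []
    for gene in sorted((gene_x, gene_y), key=lambda g: g["gdi"]):
        alleles = sorted(gene["alleles"], key=lambda a: a["cadd"], reverse=True)
        ordered.append(dict(gene, alleles=alleles))  # copy, inputs stay untouched
    return tuple(ordered)
```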
A graphical representation of a digenic combination
The predictive methods of ORVAL
VarCoPP: the variant combination pathogenicity predictor
VarCoPP stands for Variant Combination Pathogenicity Predictor. It is a machine-learning method that predicts the pathogenicity of any bi-locus variant combination (i.e. a combination of two to four variant alleles between two genes).
Based on VarCoPP, a bi-locus variant combination can either be candidate disease-causing or neutral.
Structure of VarCoPP
VarCoPP is an ensemble predictor that consists of 500 individual predictors, and more specifically, 500 classification Random Forest (RF) algorithms.
Each predictor of VarCoPP has been trained on the pathogenic variant combinations present in the Digenic Diseases Database (DIDA) against a different subset, each time, of variant data derived from control individuals of the 1000 Genomes Project (1KGP).
The variant types that were used for training were the same for both DIDA and 1KGP: exonic and splicing variants of up to 3% MAF, while all genes were protein coding genes.
When a bi-locus variant combination is tested with VarCoPP, each individual RF provides a probability on that combination to be candidate disease-causing. If the probability is above 0.489, then the RF predicts that this combination is candidate disease-causing. The final prediction is based on a majority vote: if 50% or more of the RFs agree that a bi-locus combination is candidate disease-causing, then the final prediction is that it belongs to the candidate disease-causing class.
Therefore, in general, a bi-locus combination is predicted as candidate disease-causing if ≥50% of the predictors agree that it is candidate disease-causing and the median probability for this prediction among all predictors will be, consequently, ≥0.489.
A graphical representation of the structure of VarCoPP
VarCoPP uses biological features at the variant, gene and gene pair level to make its predictions.
|Feature||Feature abbreviation||Gene / Variant allele|
|CADD raw score|| ||Gene A / Variant allele 1, Gene A / Variant allele 2, Gene B / Variant allele 1, Gene B / Variant allele 2|
|Amino acid hydrophobicity difference||Hydr1||Gene A / Variant allele 1|
|Amino acid flexibility difference||Flex1||Gene A / Variant allele 1|
|Gene haploinsufficiency probability|| || |
|Gene recessiveness probability|| || |
|Biol_Dist||Gene pair AB|
For each bi-locus combination VarCoPP provides two prediction scores, based on the way it makes the predictions. These scores are also used to rank the bi-locus combinations in the output files.
Support score (SS)
The Support score (SS) of a bi-locus combination indicates the percentage of RFs that agree that the combination is candidate disease-causing. It can therefore take values from 0 (no RF predicted that the combination is pathogenic) to 100 (all RFs predicted that the combination is pathogenic).
For candidate disease-causing combinations, SS is always equal to or larger than 50.0.
Classification score (CS)
The classification score (CS) of a bi-locus variant combination is defined as the median probability of that combination being disease-causing among all RFs. It can take values between 0 and 1.
For candidate disease-causing combinations, CS is always larger than 0.489.
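Given the per-predictor probabilities for one combination, both scores and the majority-vote class follow directly from the definitions above. The sketch below assumes the probabilities are available as a plain list; `varcopp_scores` is a hypothetical name and this is not the VarCoPP code itself.

```python
from statistics import median

RF_THRESHOLD = 0.489  # a single RF votes "candidate disease-causing" above this value


def varcopp_scores(rf_probabilities):
    """Return (support_score, classification_score, predicted_class)."""
    votes = sum(p > RF_THRESHOLD for p in rf_probabilities)
    support_score = 100.0 * votes / len(rf_probabilities)   # percentage of agreeing RFs
    classification_score = median(rf_probabilities)         # median probability
    if support_score >= 50.0:
        predicted_class = "candidate disease-causing"
    else:
        predicted_class = "neutral"
    return support_score, classification_score, predicted_class
```

In VarCoPP the list would hold 500 values, one per Random Forest, but the function works for any non-empty list.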
In general, the higher these scores are, the more confident VarCoPP is in the disease-causing class. These scores can be used to prioritise candidate disease-causing variant combinations; you can consult our tutorial for more details.
95% and 99% confidence zones
With VarCoPP we have defined 95% and 99% confidence zones, delimited by minimal Classification scores (CS) and Support scores (SS), which indicate the probability that a particular combination predicted as candidate disease-causing is actually a True Positive (TP) result. This indication can be useful for further evaluation and filtering of the predictions.
These confidence zones were created by testing neutral bi-locus combinations from the 1000 Genomes Project and obtaining the minimal CS and SS scores that gave 5% and 1% False Positives. If a combination falls into either one of the two zones, a coloured indication will appear in the summary results.
95% confidence zone
Requires CS≥0.55 and SS≥75. If a digenic combination falls inside this zone, it has 95% probability of being a TP result.
99% confidence zone
Requires CS≥0.74 and SS=100. If a digenic combination falls inside this zone, it has 99% probability of being a TP result.
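The two sets of thresholds translate into a short lookup. In this sketch, `confidence_zone` is a hypothetical helper that returns the strictest zone a combination falls into, or `None` when it is outside both.

```python
def confidence_zone(cs, ss):
    """Map a Classification score and Support score to '99%', '95%' or None."""
    if cs >= 0.74 and ss == 100:
        return "99%"
    if cs >= 0.55 and ss >= 75:
        return "95%"
    return None
```

The 99% check comes first because every combination that satisfies it also satisfies the 95% thresholds.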
The Digenic Effect Predictor
The Digenic Effect predictor is a machine-learning method that predicts the type, i.e. the digenic effect, of a pathogenic digenic variant combination. This information can be useful when no pedigree information or parent genotypes are available, as it gives a predictive indication of the effect of a variant combination predicted as pathogenic. As this is again a machine-learning approach, the user should manually investigate the assigned digenic effect class in order to confirm or reject it.
The Digenic Effect predictor has been published in the Artificial Intelligence in Medicine journal: https://doi.org/10.1016/j.artmed.2019.06.006 and the Nucleic Acids Research journal: https://doi.org/10.1093/nar/gkx557. See also the Cite us section in the About page, for a list of all relevant citations.
The Digenic Effect predictor can distinguish between three classes of pathogenic variant combinations:
True Digenic
Variants at both genes are needed to show the disease phenotype.
Monogenic + Modifier
The variant at the first gene acts as the major monogenic variant that can trigger disease symptoms, while the second variant acts as a modifier of symptoms severity or age of onset.
Dual Molecular Diagnosis
Conjunction of variants that trigger two independent monogenic disorders that occur simultaneously within a single patient.
The three types of digenic effects.
Combination a, a True Digenic combination, where the simultaneous presence of a pathogenic allele in each gene is necessary for the individual to express the disease phenotype.
Combination b, a Monogenic plus Modifier combination, where a variant on the major gene induces a disease phenotype, while a mutation in the modifier gene modifies it, either by rendering it more severe or producing an early onset.
Combination c, a Dual Molecular Diagnosis combination, where both loci are responsible for either distinct or overlapping phenotypes for two different diseases.
The structure of the Digenic Effect predictor
The Digenic Effect predictor is a classification Random Forest (RF) algorithm.
The Digenic Effect predictor was trained on 240 pathogenic variant combinations.
More specifically, it has been trained on 90 True Digenic and 75 Monogenic+Modifier variant combinations present in the Digenic Diseases Database (DIDA) and 75 Dual Molecular Diagnosis combinations derived from the work of Posey et al.
The variant types were single nucleotide variations and small insertions/deletions.
The Digenic Effect predictor provides probabilities (from 0 to 1) for all three digenic effect classes for a variant combination.
The final digenic effect class is the class with the highest probability among the three.
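Selecting the final class is an arg-max over the three class probabilities. A minimal sketch, with a hypothetical `digenic_effect_class` function; how ties are resolved is not stated in the text, so here the first class listed wins.

```python
def digenic_effect_class(probabilities):
    """probabilities: dict mapping class name to probability. Returns the top class."""
    return max(probabilities, key=probabilities.get)
```

The probabilities in any call, e.g. `{"True Digenic": 0.6, "Monogenic + Modifier": 0.3, "Dual Molecular Diagnosis": 0.1}`, are illustrative values only.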
The Digenic Effect predictor uses biological features at the variant, gene and gene pair level to make its predictions.
|Feature||Feature abbreviation||Gene / Variant allele|
|CADD raw score|| ||Gene A / Variant allele 1, Gene A / Variant allele 2, Gene B / Variant allele 1, Gene B / Variant allele 2|
|Gene recessiveness probability|| || |
|Essential in mouse|| || |
|Pathway||Gene pair AB|
We provide here some tutorials concerning certain aspects of the interpretation of results. In case you would like to see a tutorial regarding a specific topic in ORVAL, you can contact us.
Prioritisation of digenic variant combinations
Depending on the size of the data you are analysing, you may end up with many digenic variant combinations predicted as candidate disease-causing. However, based on our machine learning methodology and previous statistical analysis, it is possible to limit your analysis to those combinations that could potentially be more interesting for your research. We would like to stress that ORVAL cannot prioritise single variants; it provides a way to prioritise digenic combinations.
All combinations in the Summary Table of the Digenic Combinations Overview are ranked based on their Classification Score (CS) and Support Score (SS), with those at the top having the highest scores. You can consult the VarCoPP evaluation scores section for a detailed explanation. In general, the higher the CS and SS assigned to a digenic combination, the more confident VarCoPP is for the disease-causing class.
You can find below a way to prioritise the results of the digenic combinations, starting from the more general to the stricter criteria.
- Combinations predicted as candidate disease-causing
As combinations predicted as candidate disease-causing have a CS of at least 0.489 and an SS of at least 50, you could first focus on all combinations that pass these thresholds for further inspection (coloured orange, red or dark red). Combinations predicted as candidate disease-causing that do not fall inside a confidence zone are depicted in orange.
- Combinations predicted as candidate disease-causing with 95% confidence
For a stricter analysis, you could focus on the combinations predicted as candidate disease-causing with 95% confidence, meaning that they have 95% probability of being a True Positive result. These have CS≥0.55 and SS≥75 and are depicted with a red colour.
- Combinations predicted as candidate disease-causing with 99% confidence
As the strictest criterion, you could finally focus on the combinations predicted as candidate disease-causing with 99% confidence, meaning that they have 99% probability of being a True Positive result. These have CS≥0.74 and SS=100 and are depicted with a dark red colour.
Please note that the stricter the criteria you use, the lower the chance of keeping False Positive results, but the higher the chance of eliminating potentially interesting results (False Negatives).
Example of a variant combination prioritisation
In this example, we see the first page of a Summary Table after an analysis with ORVAL.
The colours immediately help to discern the different categories of predicted digenic combinations. Those in blue are predicted as neutral, the 5 in red are predicted as candidate disease-causing with 95% confidence and the 2 in dark red are predicted as candidate disease-causing with 99% confidence. In this example, there are no combinations predicted as candidate disease-causing outside a confidence zone.
If you would like to apply very strict criteria, you could first focus on the 2 top combinations that have a 99% confidence of being a True Positive result and further explore them with ORVAL (using the oligogenic navigation section) and by clicking on them to get directed to their specialised digenic page.
On the other hand, if you would like to relax the strictness of your criteria, you can also inspect the variant combinations present in the 95% confidence zone.
You can find below which browsers are suitable for ORVAL based on your operating system:
Frequently Asked Questions
If answers to your questions are not provided in this section and no information about your question is mentioned in the Documentation page, you can contact us.