Summary

ORVAL is the first web bioinformatics platform for the exploration of predicted candidate disease-causing variant combinations, aiming to aid in uncovering the causes of oligogenic diseases (i.e. diseases caused by variants in a small number of genes). This tool integrates innovative machine learning methods for combinatorial variant pathogenicity prediction, further external annotations and interactive and exploratory visualisation techniques.

What can you do with ORVAL?

SUBMIT AND FILTER YOUR VARIANTS

You can submit the variants of a single individual either as a tab-delimited file or a VCF file.
You can also filter your variants based on their Minor Allele Frequency (MAF), their position in the gene and/or based on a specific gene panel of your choice.

PREDICT CANDIDATE DISEASE-CAUSING VARIANT COMBINATIONS

With ORVAL you can predict candidate pathogenic variant combinations in any gene pair present in your data with VarCoPP and further predict their digenic effect (True Digenic, Monogenic with a Modifier variant or Dual Diagnosis) with the Digenic Effect Predictor.

EXPLORE POTENTIAL OLIGOGENIC SIGNATURES

You can investigate potential oligogenic disease signatures by exploring the interactive gene networks that are created based on the predictions and examine them in the context of their protein-protein interactions, cellular locations and pathways.


The input data


ORVAL accepts a list of variants from a single individual only, as it creates all possible variant combinations between pairs assuming that these belong to the same individual.

You can provide either Single Nucleotide Variants (SNVs) or small insertions/deletions (indels).


Types of input files


There are two different types of variant input that you can use to upload your data: either a tab-delimited variant list or a VCF file. After uploading your data, you can start the analysis by clicking on the button.

Tab-delimited variant list

tab delimited list example

In the left panel of the Submission page you can copy and paste a variant list. Each line should contain tab-delimited information for one variant, in the following order: chromosome, position, reference allele, alternative allele, zygosity.

No headers are needed.

The zygosity values should be either Heterozygous or Homozygous. During the analysis, ORVAL automatically converts X-linked variants in males to Hemizygous.

You can also manually insert a variant by typing information on the corresponding chr, position, reference allele, alternative allele, Zygosity column fields and pressing the button.
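If you prepare your list with a script, you may want to check it before pasting it. The following is a minimal sketch of such a check, assuming only the five-column layout described above. The names `Variant`, `parse_variant_line` and `parse_variant_list` are hypothetical and are not part of ORVAL.

```python
from typing import List, NamedTuple

VALID_ZYGOSITIES = {"Heterozygous", "Homozygous"}


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    zygosity: str


def parse_variant_line(line: str) -> Variant:
    """Parse one line: chromosome, position, reference allele, alternative allele, zygosity."""
    fields = [f.strip() for f in line.rstrip("\r\n").split("\t")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-delimited fields, got {len(fields)}")
    chrom, pos, ref, alt, zygosity = fields
    if not pos.isdigit():
        raise ValueError(f"position is not a non-negative integer: {pos!r}")
    if zygosity not in VALID_ZYGOSITIES:
        raise ValueError(f"zygosity must be Heterozygous or Homozygous, got {zygosity!r}")
    return Variant(chrom, int(pos), ref, alt, zygosity)


def parse_variant_list(text: str) -> List[Variant]:
    """Parse a whole list, skipping blank lines (no header line is expected)."""
    return [parse_variant_line(line) for line in text.splitlines() if line.strip()]
```

A line that does not have exactly five fields, or that uses another zygosity label, raises a `ValueError`, so you can correct it before submission.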

VCF file

VCF file example

Alternatively, you can submit a VCF file (version 4.2) with your variants in the right panel of the Submission page.

At a minimum, ORVAL requires the presence of:

  • the #Header Line: #CHROM POS ID REF ALT etc... line

  • the columns CHROM, POS, ID, REF, ALT, FORMAT, SAMPLE_NAME (patient information column containing values corresponding to the FORMAT field).

  • a genotype (GT) field for each variant in the FORMAT and SAMPLE_NAME columns. Variants with GT 0/0 or 0|0 are discarded from the analysis.

Any other meta-information lines on the top of the file or any extra columns and fields (e.g. QUAL, INFO, etc.) can be present, but ORVAL will ignore them.
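To illustrate these requirements, here is a minimal sketch of how a single-sample VCF could be read. It is not ORVAL's own parser: the function name `read_vcf_variants` is hypothetical, and the sketch assumes that the sample column is the one directly after FORMAT. It skips the meta-information lines, discards 0/0 and 0|0 genotypes, and keeps only the first alternative allele of a record.

```python
def read_vcf_variants(lines):
    """Yield (chrom, pos, ref, alt, genotype) for every non-reference record."""
    columns = None
    sample_col = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # meta-information lines are ignored
        if line.startswith("#CHROM"):
            columns = line[1:].split("\t")
            # Assumption: the patient column directly follows FORMAT.
            sample_col = columns[columns.index("FORMAT") + 1]
            continue
        if columns is None:
            raise ValueError("the #CHROM header line must precede the variant records")
        record = dict(zip(columns, line.split("\t")))
        sample = dict(zip(record["FORMAT"].split(":"), record[sample_col].split(":")))
        genotype = sample.get("GT")
        if genotype is None:
            raise ValueError(f"no GT field at {record['CHROM']}:{record['POS']}")
        if genotype in ("0/0", "0|0"):
            continue  # homozygous reference: discarded
        first_alt = record["ALT"].split(",")[0]  # only the first alternative allele is kept
        yield record["CHROM"], int(record["POS"]), record["REF"], first_alt, genotype
```

Extra columns such as QUAL or INFO are read but never used, which mirrors the behaviour described above.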

In case you want to create your own VCF file, you can download and take a look at the example VCFs that are present at the VCF submission panel and/or consult the Samtools specification page on how to construct a proper VCF file.

Variant types

You can either submit Single Nucleotide Variants (SNVs) or small insertions/deletions (indels). Other types of variants (e.g. CNVs) can be present in your list, but they will not be included in the analysis.

Specifically for indels, you can submit your variants in either one of the two different ways that are shown for a particular variant example (the VCF file can contain more columns).


                       | Tab-delimited list               | VCF file
Example with dashes    | 16 3254468 CTT - Heterozygous    | 16 3254468 . CTT - PASS GT 1/0
Example without dashes | 16 3254467 CCTT C Heterozygous   | 16 3254467 . CCTT C PASS GT 1/0

Genome version

At the moment ORVAL accepts and annotates variants using the GRCh37/hg19 human genome assembly.

We do not make conversions of genomic coordinates from different genome versions. In case you need to convert your variants, you are encouraged to use tools like the UCSC, Ensembl and NCBI assembly converters.

Patient information

Apart from the variant list, you should also provide (if available) the sex of the patient, i.e. whether the person is male or female.

ORVAL handles X-linked variants in males (hemizygous variants) differently from those in females, so this information is important for better predictions.

patient information picture

Example input files

You can try ORVAL with the two example VCF files that are present in the VCF file section of the variant submission page. These files give you the opportunity to test ORVAL on a small or large number of variants and see what the webserver has to offer.

  • the Example_VCF_1 file contains 25 variants and its running time (with filtering) is 15 seconds.
  • the Example_VCF_2 file contains 1800 variants and its running time (with filtering) is 18 seconds.

Job submission

Every time you submit your data, you will first get directed to the Submitted ORVAL Job page where you can follow the status of your submission.

On this page you will also receive a Job ID, which you can use to re-access the results of that specific submission or to report errors. The Job ID is also present in the URL of the Results page, in the format: orval.ibsquare.be/results?id=YourJobID.

You can re-access your results by:

  • saving the URL of the Job or the Results page
  • typing https://orval.ibsquare.be/results?id= followed by the Job ID in your browser
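If you keep track of several submissions in a script, the Results address can be built from the Job ID. This is a small sketch based only on the URL format given above; the function name `results_url` is hypothetical.

```python
from urllib.parse import urlencode

RESULTS_BASE_URL = "https://orval.ibsquare.be/results"


def results_url(job_id: str) -> str:
    """Return the Results page address for a given Job ID."""
    return f"{RESULTS_BASE_URL}?{urlencode({'id': job_id})}"
```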

Do you receive error or warning messages during your data submission? You can consult the Frequently Asked Questions (FAQ) section for detailed explanations on how to handle them.


Data filtering and annotation

Data filtering

On the Submission page, ORVAL offers a recommended variant and gene filtering procedure that runs automatically when you submit your data. This procedure is highly recommended, as it limits the number of variant combinations to be tested and restricts the analysis to the most relevant variants.

There are two types of filtering offered by ORVAL: a variant filtering and a gene filtering procedure.

Variant filtering

The variant filtering procedure ensures that your analysis contains relevant variants that match the variant types used to train the predictive methods integrated in ORVAL (VarCoPP and the Digenic Effect predictor): exonic and splicing variants with a MAF lower than or equal to 3% in protein-coding genes.

The three different filtering options are already pre-selected in the Variant Filtering panel of the submission page. You can unselect a filtering option, by clicking on its corresponding check-box.

variant filtering tab
Filtering options

ExAC MAF

Select the maximum ExAC MAF allowed for the variants. A MAF of ≤ 0.03 was used to train VarCoPP and is the recommended threshold.

Remove Intergenic

Removes variants that are not inside the defined gene coordinates, based on the human assembly GRCh37/hg19.




Remove intronic and synonymous

Removes:

  • all intronic variants that lie more than 13 nucleotides from the nearest exon edge, based on the exon coordinates of the canonical transcript of the gene.

  • all synonymous variants that lie more than 7 nucleotides from the nearest exon edge, based on the exon coordinates of the canonical transcript of the gene.
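The three filtering options can be summarised as a single decision per variant. The sketch below is illustrative only: the `AnnotatedVariant` fields and the `passes_filters` function are hypothetical names, and keeping variants without an ExAC MAF value is an assumption of this sketch, not documented ORVAL behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedVariant:
    maf: Optional[float]         # ExAC MAF; None when no frequency is available (assumption)
    region: str                  # "exonic", "intronic" or "intergenic"
    is_synonymous: bool
    distance_to_exon_edge: int   # nucleotides to the nearest exon edge (canonical transcript)


def passes_filters(v, max_maf=0.03, remove_intergenic=True,
                   remove_intronic_and_synonymous=True):
    """Return True if the variant survives the selected filtering options."""
    if v.maf is not None and v.maf > max_maf:
        return False
    if remove_intergenic and v.region == "intergenic":
        return False
    if remove_intronic_and_synonymous:
        if v.region == "intronic" and v.distance_to_exon_edge > 13:
            return False
        if v.is_synonymous and v.distance_to_exon_edge > 7:
            return False
    return True
```

Unselecting a check-box on the Submission page corresponds to setting the matching argument to `False` in this sketch.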

Gene filtering

The gene filtering option restricts the analysis to a specified list of genes that are present in your variant list. This option is useful if you are interested in analysing only a subset of the genes present in your submitted data.

To run your analysis with only a subset of genes, simply upload a .txt file with the gene symbols you want to include, one gene per line.
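The effect of a gene panel can be pictured as follows. This is a sketch with hypothetical helper names (`load_gene_panel`, `restrict_to_panel`), and it assumes exact, case-sensitive matching of gene symbols.

```python
def load_gene_panel(lines):
    """Read gene symbols, one per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def restrict_to_panel(variant_genes, panel):
    """Keep only (variant_id, gene_symbol) pairs whose gene is in the panel."""
    return [(variant, gene) for variant, gene in variant_genes if gene in panel]
```

With a file on disk, `load_gene_panel(open("panel.txt"))` would read the panel, where `panel.txt` is a placeholder file name.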

gene filtering tab

Data annotation

After you submit your data, ORVAL:

  1. automatically annotates them with the biological information needed for the integrated predictive methods (VarCoPP and the Digenic Effect predictor)
  2. creates all possible variant combinations between any pair of genes present in your variant input and
  3. orders the variants and genes inside each combination.

Below, you can find some important parameters for each process.

Variant annotation

VarCoPP first annotates each variant using the Ensembl database for the GRCh37/hg19 genome version, considering only the canonical transcript of each gene, as defined by Ensembl.

For small insertions and deletions, we also obtain protein sequences from UniProt, starting from the canonical Ensembl transcript identifiers, as these sequences are needed to calculate some of the features of our predictive methods. We also check whether the reference amino acid is indeed present at the correct position in the canonical protein sequence.

Then variants are annotated with the required features for VarCoPP and the Digenic Effect predictor.

Variant exclusion

In some situations during the data annotation process ORVAL excludes variants from the analysis and you will not find them in the results:

  • Variant not in database
    If the variant is not present in the Ensembl database, it will not be included in the analysis.
  • Variant not exonic in canonical transcript
    There may be some cases where a variant is exonic for some alternative transcripts of the corresponding gene, but not for the canonical transcript that ORVAL is using. In this case, if you apply a filtering procedure for intronic variants, this variant will be excluded.
  • Variant with invalid zygosity
    Variants with GT:0/0 or GT:0|0 in a VCF file are considered invalid and are excluded from the analysis.
  • Alternative variant
    In case multiple alternative variants are present in a row in a VCF file, we only take into account the first alternative variant. The rest of the variants are excluded from the analysis.
  • CADD score not available
    ORVAL also annotates variants with a CADD score. As this feature is important for VarCoPP, a variant without an available CADD score is excluded from the analysis, as a missing value may severely alter the results.
  • Variants only in one gene
    As ORVAL creates combinations between gene pairs, if your input includes only variants from one gene, you will not get any results at the end of the annotation.
  • The variant is a CNV
    ORVAL analyses only SNVs and small insertions and deletions. Any other variant type in your data is automatically excluded from the analysis.

Gene annotation

VarCoPP annotates each gene name using the Ensembl database for the GRCh37/hg19 genome version, considering only the canonical transcript of each gene, as defined by Ensembl.

It then annotates the genes with the required features for VarCoPP and the Digenic Effect predictor. The gene recessiveness and haploinsufficiency probabilities, essentiality in mouse and pathway features for the predictive methods are obtained using the dbNSFP database.

Another feature that ORVAL uses to annotate genes is the Gene Damage Index (GDI), a metric that shows the susceptibility of a gene to disease. Lower values of GDI indicate greater susceptibility of a gene to candidate disease-causing mutations.

Gene pair annotation

At the gene pair level, ORVAL annotates the genes of a pair with pathway information from Reactome and with their Biological Distance, a metric of biological relatedness between any two genes, based on protein-protein interaction information.

Creating digenic combinations

After annotation, VarCoPP creates all possible variant combinations between any gene pair present in your input, taking into consideration any filtering options you have included during your variant submission.

Below you can find the details and constraints that apply during this procedure.









Number of variants per combination

For any gene pair, ORVAL creates variant combinations that can be:

  • bi-allelic (i.e. one mutated allele at each gene)
    e.g.: one heterozygous variant per gene

  • tri-allelic (i.e. three mutated alleles in total)
    e.g.: a homozygous variant in gene A and a heterozygous variant in gene B

  • tetra-allelic (i.e. four mutated alleles in total)
    e.g.: one homozygous variant per gene

In the tri-allelic and tetra-allelic cases, a digenic combination can also include compound heterozygous variants (i.e. two different mutated alleles in the same gene), along with the presence of variant(s) in another gene.
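The allele counting above can be made concrete with a short enumeration. The sketch below is a simplification and not ORVAL's implementation: it assumes that any two heterozygous variants in the same gene can form a compound heterozygous pair (phase is not checked), and the names `gene_choices` and `digenic_combinations` are hypothetical.

```python
from itertools import combinations, product

ALLELES = {"Heterozygous": 1, "Homozygous": 2}
LABELS = {2: "bi-allelic", 3: "tri-allelic", 4: "tetra-allelic"}


def gene_choices(variants):
    """List the ways one gene can contribute to a combination.

    variants: list of (variant_id, zygosity) tuples for a single gene.
    Returns single variants plus pairs of distinct heterozygous variants.
    """
    singles = [(v,) for v in variants]
    hets = [v for v in variants if v[1] == "Heterozygous"]
    return singles + list(combinations(hets, 2))


def digenic_combinations(variants_by_gene):
    """Yield (gene_1, variants_1, gene_2, variants_2, label) for every gene pair."""
    for g1, g2 in combinations(sorted(variants_by_gene), 2):
        for side1, side2 in product(gene_choices(variants_by_gene[g1]),
                                    gene_choices(variants_by_gene[g2])):
            n_alleles = sum(ALLELES[zyg] for _, zyg in side1 + side2)
            yield g1, side1, g2, side2, LABELS[n_alleles]
```

Each gene contributes one or two mutated alleles, so every combination carries between two and four alleles in total.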


Order of genes

For each digenic variant combination, gene A is always the gene with the lowest Gene Damage Index (GDI) (see also the Gene Annotation section) and, thus, the one with the higher probability of being associated with a disease.

Order of variant alleles inside the gene

In the case of two different mutated alleles in the same gene (compound heterozygous cases), variant allele 1 is always the variant allele with the highest CADD score.
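Both ordering rules can be expressed as two sorts. This is a sketch with a hypothetical function name (`order_combination`); the GDI and CADD values are passed in as plain dictionaries and would come from the annotation step.

```python
def order_combination(gene_x, gene_y, gdi, cadd):
    """Order a digenic combination.

    gene_x, gene_y: (gene_symbol, [variant_id, ...]) tuples
    gdi:  dict mapping gene symbol -> Gene Damage Index
    cadd: dict mapping variant id -> CADD raw score
    Returns (gene A, gene B), each with its variant alleles ordered.
    """
    # Gene A is the gene with the lowest GDI.
    gene_a, gene_b = sorted([gene_x, gene_y], key=lambda gene: gdi[gene[0]])

    def order_alleles(gene):
        symbol, variants = gene
        # Variant allele 1 is the allele with the highest CADD score.
        return symbol, sorted(variants, key=lambda v: cadd[v], reverse=True)

    return order_alleles(gene_a), order_alleles(gene_b)
```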

A graphical representation of a digenic combination

digenic combination example

The predictive methods of ORVAL

VarCoPP: the variant combination pathogenicity predictor

VarCoPP stands for Variant Combination Pathogenicity Predictor. It is a machine-learning method that predicts the pathogenicity of any bi-locus variant combination (i.e. a combination of two to four variant alleles between two genes).

Based on VarCoPP, a bi-locus variant combination can either be candidate disease-causing or neutral.

You can find below a general description of the method, in order to understand and interpret its results. You can further consult the corresponding manuscript on VarCoPP: https://doi.org/10.1073/pnas.1815601116.

Structure of VarCoPP


ALGORITHM

VarCoPP is an ensemble predictor that consists of 500 individual predictors, and more specifically, 500 classification Random Forest (RF) algorithms.




TRAINING DATA

Each predictor of VarCoPP has been trained on the pathogenic variant combinations present in the Digenic Diseases Database (DIDA) against a different subset, each time, of variant data derived from control individuals of the 1000 Genomes Project (1KGP).

The variant types that were used for training were the same for both DIDA and 1KGP: exonic and splicing variants of up to 3% MAF, while all genes were protein coding genes.




RESULT CALCULATION

When a bi-locus variant combination is tested with VarCoPP, each individual RF provides a probability on that combination to be candidate disease-causing. If the probability is above 0.489, then the RF predicts that this combination is candidate disease-causing. The final prediction is based on a majority vote: if 50% or more of the RFs agree that a bi-locus combination is candidate disease-causing, then the final prediction is that it belongs to the candidate disease-causing class.

Therefore, in general, a bi-locus combination is predicted as candidate disease-causing if ≥50% of the predictors agree that it is candidate disease-causing and the median probability for this prediction among all predictors will be, consequently, ≥0.489.

A graphical representation of the structure of VarCoPP
summary of varcopp structure

Prediction features

VarCoPP uses biological features at the variant, gene and gene-pair level to make its predictions.

Feature                                                | Feature abbreviation | Gene / Variant allele
CADD raw score (PMID: 24487276)                        | CADD1                | Gene A / Variant allele 1
CADD raw score (PMID: 24487276)                        | CADD2                | Gene A / Variant allele 2
CADD raw score (PMID: 24487276)                        | CADD3                | Gene B / Variant allele 1
CADD raw score (PMID: 24487276)                        | CADD4                | Gene B / Variant allele 2
Amino acid hydrophobicity difference (PMID: 8836100)   | Hydr1                | Gene A / Variant allele 1
Amino acid flexibility difference                      | Flex1                | Gene A / Variant allele 1
Gene haploinsufficiency probability (PMID: 20976243)   | HI_A                 | Gene A
Gene haploinsufficiency probability (PMID: 20976243)   | HI_B                 | Gene B
Gene recessiveness probability (PMID: 22344438)        | RecA                 | Gene A
Gene recessiveness probability (PMID: 22344438)        | RecB                 | Gene B
Biological distance (PMID: 24694260)                   | Biol_Dist            | Gene pair AB

Evaluation scores

For each bi-locus combination VarCoPP provides two prediction scores, based on the way it makes the predictions. These scores are also used to rank the bi-locus combinations in the output files.



Support score (SS)

The Support score (SS) of a bi-locus combination indicates the percentage of RFs that agree that the combination is candidate disease-causing. It can therefore take values from 0 (no RF predicted that the combination is pathogenic) to 100 (all RFs predicted that the combination is pathogenic).

For candidate disease-causing combinations, SS is always equal to or larger than 50.0.


Classification score (CS)

The classification score (CS) of a bi-locus variant combination is defined as the median probability of that combination being disease-causing among all RFs. It can take values between 0 and 1.

For candidate disease-causing combinations, CS is always larger than 0.489.

In general, the higher these scores are, the more confident VarCoPP is in the disease-causing class. These scores can be used to prioritise candidate disease-causing variant combinations; for more details, you can consult our tutorial.
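Given the probabilities returned by the individual RFs, the two scores and the final class follow directly from the definitions above. The sketch below only restates that arithmetic; the function name `varcopp_scores` is hypothetical, and the list of probabilities stands in for the output of the 500 trained predictors.

```python
from statistics import median

RF_THRESHOLD = 0.489  # probability above which a single RF votes "candidate disease-causing"


def varcopp_scores(rf_probabilities):
    """Return (Support score, Classification score, predicted class)."""
    votes = sum(1 for p in rf_probabilities if p > RF_THRESHOLD)
    support_score = 100.0 * votes / len(rf_probabilities)   # percentage of agreeing RFs
    classification_score = median(rf_probabilities)         # median probability over all RFs
    if support_score >= 50.0:
        label = "candidate disease-causing"
    else:
        label = "neutral"
    return support_score, classification_score, label
```

For example, if 300 of the 500 RFs return a probability of 0.9 and the other 200 return 0.1, the combination gets SS = 60.0 and CS = 0.9 and is predicted as candidate disease-causing.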

95% and 99% confidence zones

With VarCoPP we have defined 95%- and 99%-confidence zones, delimited by minimal Classification scores (CS) and Support scores (SS), which provide the probability that a particular combination predicted as candidate disease-causing is actually a True Positive (TP) result. This indication can be useful for further evaluation and filtering of the predictions.

These confidence zones were created by testing neutral bi-locus combinations from the 1000 Genomes Project and obtaining the minimal CS and SS scores that gave 5% and 1% False Positives. If a combination falls into either one of the two zones, a coloured indication will appear in the summary results.


95%-confidence zone

Requires CS≥0.55 and SS≥75. If a digenic combination falls inside this zone, it has a 95% probability of being a TP result.


99%-confidence zone

Requires CS≥0.74 and SS=100. If a digenic combination falls inside this zone, it has a 99% probability of being a TP result.
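If you post-process a downloaded results table, the two zones can be assigned from the thresholds above. This is a small sketch; the function name `confidence_zone` is hypothetical.

```python
def confidence_zone(cs, ss):
    """Return "99%", "95%" or None for a given Classification and Support score."""
    if cs >= 0.74 and ss == 100:
        return "99%"
    if cs >= 0.55 and ss >= 75:
        return "95%"
    return None
```

The 99% zone is tested first because every combination inside it also satisfies the 95% thresholds.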

The Digenic Effect Predictor


The Digenic Effect predictor is a machine-learning method that predicts the type, i.e. the digenic effect, of a pathogenic digenic variant combination. This information can be useful when no pedigree information or parental genotypes are available, as it gives a predictive indication of the effect of a variant combination predicted as pathogenic. As this is, again, a machine-learning approach, a manual investigation by the user can confirm or reject the assigned digenic effect class.

The Digenic Effect predictor can distinguish between three classes of pathogenic variant combinations:

True Digenic

Variants at both genes are needed to show the disease phenotype.

Monogenic + Modifier

The variant in the first gene acts as the major monogenic variant that can trigger disease symptoms, while the second variant acts as a modifier of symptom severity or age of onset.

Dual Molecular Diagnosis

Conjunction of variants that trigger two independent monogenic disorders that occur simultaneously within a single patient.

navigation bar of Results page

The three types of digenic effects.
Combination a, a True Digenic combination, where the simultaneous presence of a pathogenic allele in each gene is necessary for the individual to express the disease phenotype.
Combination b, a Monogenic plus Modifier combination, where a variant on the major gene induces a disease phenotype, while a mutation in the modifier gene modifies it, either by rendering it more severe or producing an early onset.
Combination c, a Dual Molecular Diagnosis combination, where both loci are responsible for either distinct or overlapping phenotypes for two different diseases.

The structure of the Digenic Effect predictor

ALGORITHM

The Digenic Effect predictor is a classification Random Forest (RF) algorithm.



TRAINING DATA

The Digenic Effect predictor was trained on 240 pathogenic variant combinations.

More specifically, it has been trained on 90 True Digenic and 75 Monogenic+Modifier variant combinations present in the Digenic Diseases Database (DIDA) and 75 Dual Molecular Diagnosis combinations derived from the work of Posey et al.

The variant types were single nucleotide variations and small insertions/deletions.


RESULT CALCULATION

The Digenic Effect predictor provides probabilities (from 0 to 1) for all three digenic effect classes for a variant combination.

The final digenic effect class is the class with the highest probability among the three.
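Selecting the final class is a simple maximum over the three probabilities. In this sketch the function name `digenic_effect_class` is hypothetical, and ties are resolved in favour of the class listed first, which is an assumption rather than documented behaviour.

```python
def digenic_effect_class(probabilities):
    """Return the digenic effect class with the highest probability.

    probabilities: dict mapping class name -> probability (0 to 1).
    """
    return max(probabilities, key=probabilities.get)
```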

Prediction features

The Digenic Effect predictor uses biological features at the variant, gene and gene-pair level to make its predictions.

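Feature                                           | Feature abbreviation | Gene / Variant allele
CADD raw score (PMID: 24487276)                   | CADD1                | Gene A / Variant allele 1
CADD raw score (PMID: 24487276)                   | CADD2                | Gene A / Variant allele 2
CADD raw score (PMID: 24487276)                   | CADD3                | Gene B / Variant allele 1
CADD raw score (PMID: 24487276)                   | CADD4                | Gene B / Variant allele 2
Gene recessiveness probability (PMID: 22344438)   | RecA                 | Gene A
Gene recessiveness probability (PMID: 22344438)   | RecB                 | Gene B
Essential in mouse (PMID: 23675308)               | EssA                 | Gene A
Essential in mouse (PMID: 23675308)               | EssB                 | Gene B
Same pathway (SOURCE: Reactome)                   | Pathway              | Gene pair AB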

Navigation of the ORVAL results

After the analysis is finished, you will be directed to the Results page, where you can explore the oligogenic network created from the VarCoPP predictions, the ranking of your gene pairs based on their content of predicted candidate disease-causing combinations, and the detailed digenic pathogenicity probabilities and scores for your input.

You can access each section by clicking on the corresponding tab at the navigation bar at the top of the page.

navigation bar of Results page

Oligogenic exploration

This section provides the space for the exploration of potentially oligogenic signatures. The information is guided by the predictions of VarCoPP, which predicts the pathogenicity of variant combinations between gene pairs.

The oligogenic information is mainly shown in the form of a gene network, whose nodes represent genes and whose edges connect two genes only if there exists at least one variant combination between them that has been predicted as candidate disease-causing with VarCoPP. The users can explore and filter the network, as well as investigate the protein-protein interactions and the involved pathways of the genes that belong in the same module.

Oligogenic combination network

The first panel of the main Results page contains the predicted candidate disease-causing oligogenic combination network.

Network description

Like every network, the oligogenic combination network contains nodes and edges.

example network

Node

Each node represents a gene present in your data.



Edge

Connects two genes only if there exists at least one candidate disease-causing variant combination predicted by VarCoPP between them.

The colour of the edge represents the highest pathogenicity score for that pair, and more specifically, the highest Classification Score (CS) computed for a variant combination of that pair (see the VarCoPP scores section for a detailed explanation of the score).
This score is represented in a colour range from yellow (low pathogenicity score) to dark red (high pathogenicity score), representing the CS values from 0.489 to 1.0, respectively.

You can:

  • Move a node
    You can select and move a node to arrange it in the network.
  • Click on a node
    By clicking on a node, this node appears with a purple border and a module panel appears automatically on the right of the panel with more information about the module the gene belongs to. At the top of that module panel, the Click here to further explore this gene module link directs you to the specialised page for your selected oligogenic module.
  • Download the network
    By clicking on the download button at the bottom right of the panel, you can download the network in its current state, including the filtering options you have selected, in the GraphML format. This file format can be imported into various graph drawing tools, e.g. yEd, or into network analysis tools, such as Cytoscape and Gephi.

Gene selection

gene selection table

The Gene selection table on the left of the panel contains all genes present in the oligogenic network. The gene table changes automatically according to the filters you select either on the table itself or on the Filtering section below.

At the beginning, all genes are automatically selected and shown in the network.

You can:

  • remove a gene from the network by clicking on its corresponding check box and unselecting it.

  • order the appearance of genes based on their centrality in the network, and this centrality can be based either on the:
    • degree of the node: the number of edges connected with that node
    • closeness of the node: based on the sum of the lengths of the shortest paths between the node and all other nodes in the graph, i.e. how close the node is to the other nodes of the network

  • click on a gene to show the module panel with more information about the gene module it belongs to.

  • search the table based on a gene name.

  • download the table in its current state with the button.
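For readers who want to reproduce the two centrality measures on a downloaded network, here is a minimal sketch on an undirected graph stored as a dictionary of neighbour sets. The closeness shown is the commonly used normalised form (number of reachable nodes divided by the sum of shortest-path lengths); whether ORVAL applies exactly this normalisation is an assumption. The function names are hypothetical.

```python
from collections import deque


def degree(graph, node):
    """Number of edges connected to the node."""
    return len(graph[node])


def closeness(graph, node):
    """Reachable nodes divided by the sum of shortest-path lengths from the node."""
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in an unweighted graph
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


# Hypothetical three-gene network: GENE_A is linked to GENE_B and GENE_C.
network = {"GENE_A": {"GENE_B", "GENE_C"}, "GENE_B": {"GENE_A"}, "GENE_C": {"GENE_A"}}
```

In this toy network GENE_A has the highest degree and the highest closeness, so it would appear first under either ordering.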

Network filtering

gene selection table

The network filtering option allows you to remove edges from the network by adjusting the thresholds of two metrics:

  • the pathogenicity score: the threshold for the highest pathogenicity score for a combination of a gene pair, which is based on the Classification Score provided by VarCoPP (see the VarCoPP scores section for more details)

  • the centrality: the centrality threshold of a gene in the network

Oligogenic gene module

In this section you can further explore and filter the genes of your selected gene module, with the oligogenic gene module network on the right and the module gene selection table on the left of the panel.

The oligogenic gene module network description

This is the selected gene sub-network, shown exactly as it appears in your main oligogenic network. The nodes and edges of the network represent, again, the genes and the highest Classification Scores of the gene pairs, respectively (see the Oligogenic Network section for a description).

example of a selected gene module from the oligogenic network

You can:

  • Move a node
    You can select and move a node to arrange it in the network.
  • Download the module
    By clicking on the download button at the bottom right of the panel, you can download the gene module in its current state, in the GraphML format. This file format can be imported into various graph drawing tools, e.g. yEd, or into network analysis tools, such as Cytoscape and Gephi.

Module gene selection

module gene selection table

The Module gene selection table on the left of the page contains all the genes present in your selected oligogenic module.

You can:

  • search a gene on the table based on its name or external ID (e.g. Ensembl or Uniprot ID).

  • order the appearance of the genes based on their Gene Damage Index (GDI), with genes having a lower GDI being more likely to carry pathogenic mutations.

  • click on a gene name to be directed to its corresponding HGNC page.

  • click on an external ID to be directed to its corresponding source page.

  • download the table in its current state with the button.

Protein-protein interaction information

In this section you can explore any existing direct and indirect protein-protein interactions (PPIs) present in your selected module and get information about the position of the proteins in the cell.

All required information is extracted from the ComPPI database.

Protein-protein interaction network

On the left panel of this section you can see a protein-protein interaction network that contains nodes and edges.

Example of a PPI network





Node

Each node represents a protein.

There are two types of nodes in this network:

  • Purple nodes: the proteins of your selected module
  • Grey nodes: external proteins that are present in the network only if they directly interact with two proteins of the selected module. These proteins are useful to show indirect physical interactions of your selected proteins.



Edge

Connects two nodes (proteins) if they directly physically interact.

There are two types of edges in this network:

  • Purple edges: direct interactions between the proteins of your selected module.
  • Grey edges: direct interactions between a protein of your selected module and an external protein.

You can:

  • Click and move a node
    You can select and move a node to arrange it in the network.
  • Hover on a node
    By hovering over a node, a box appears with further information about the corresponding gene name, the UniProt Accession ID and the cellular location of the protein.

    Example of hovering on a PPI network node
  • Download the PPI network
    By clicking on the download button just above the network module, you can download it in its current state, in the GraphML format. This file format can be imported into various graph drawing tools, e.g. yEd, or into network analysis tools, such as Cytoscape and Gephi.

Cellular information

In the right panel of this section you can explore the cellular location of all proteins present in the PPI network with the interactive cellular location pie chart. Each part of the chart corresponds to a different cellular location.

Example of the cellular pie chart

You can:

  • Hover over a cellular location
    By hovering on a particular cellular location you can get further statistics inside the plot for the:

    • Ratio of the location: number of protein-cellular location links among all protein-cellular location links
    • Overlap ratio of the location: number of proteins present in the cellular location among all proteins of the network

    All proteins of the PPI network that belong to this location will be automatically coloured as well.
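The two statistics can be computed from a mapping of proteins to their cellular locations. The sketch below follows the definitions given above; the function name `location_statistics` and the example protein names are hypothetical.

```python
def location_statistics(locations_by_protein):
    """Compute the ratio and overlap ratio of every cellular location.

    locations_by_protein: dict mapping protein -> set of cellular locations.
    """
    total_links = sum(len(locs) for locs in locations_by_protein.values())
    n_proteins = len(locations_by_protein)
    counts = {}
    for locs in locations_by_protein.values():
        for loc in locs:
            counts[loc] = counts.get(loc, 0) + 1
    return {
        loc: {
            "ratio": n / total_links,          # share of all protein-location links
            "overlap_ratio": n / n_proteins,   # share of all proteins in the network
        }
        for loc, n in counts.items()
    }


example = {"P1": {"cytosol", "nucleus"}, "P2": {"nucleus"}, "P3": {"membrane"}}
stats = location_statistics(example)
```

Here the nucleus accounts for 2 of the 4 protein-location links (ratio 0.5) and contains 2 of the 3 proteins (overlap ratio of about 0.67). Because one protein can have several locations, the overlap ratios do not need to sum to 1.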

Pathway information

In this section you can explore the cellular pathways in which the genes of your selected module are involved, with the summary pathway treemap on the left panel and the detailed pathway table on the right panel.

All required information is extracted from Reactome.

Pathway treemap

pathway treemap example

The treemap on the left panel of this section shows the summary of the different pathway categories of the genes present in your selected gene module.

Each main pathway category is enclosed in a box surrounded by a black stroke and contains nested pathway subcategories, based on the information from Reactome, descending from the more general to the more specific ones. The last sub-category is the most detailed pathway mapping of the gene.

The ordering from the more general to the more specific pathway categories is shown with a transition from:

  • bigger to smaller text font
  • lighter to darker colour gradient

The size of each main pathway category is determined by the number of genes of the selected module that it contains.

Pathway table

pathway table example

The pathway table shows more details about all pathway categories (general and specific) of your gene module.

You can:

  • order the appearance of the pathways based on the number of your module genes they contain.

  • click on each pathway to get further information from its corresponding page in Reactome.

  • search/filter the table based on a pathway or gene name(s). You can provide multiple gene names, separated by spaces.

  • download the table with the button.

Gene pair ranking exploration

In this section you can explore the gene pairs that are present in your data and rank them based on their content of candidate disease-causing variant combinations predicted with VarCoPP.

S-plot example

This information is shown in a gene pair table that provides statistics on all gene pairs present in your data. The table contains, for each pair, the percentage and number of pathogenic variant combinations, as well as the median pathogenicity scores provided by VarCoPP (i.e. the Support Score and the Classification Score) among the combinations of that pair, to give an idea of their severity.
For further explanations on how these pathogenicity scores are calculated, you can consult the VarCoPP Prediction Scores section on this Documentation page.

The table is initially ranked based on the following columns in descending order of importance:

  1. percentage of pathogenic combinations
  2. median VarCoPP Classification Score
  3. median VarCoPP Support Score
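
The initial ranking above behaves like a multi-key sort. The sketch below is only an illustration under that assumption: the gene names, the field names and the values are hypothetical, and ORVAL's own implementation may differ.

```python
def rank_gene_pairs(rows):
    """Sort gene pair rows by the three columns, in descending order of importance."""
    return sorted(
        rows,
        key=lambda r: (r["pct_pathogenic"], r["median_cs"], r["median_ss"]),
        reverse=True,  # highest values first for every key
    )


# Hypothetical gene pair statistics.
rows = [
    {"pair": ("GENE_A", "GENE_B"), "pct_pathogenic": 50.0, "median_cs": 0.60, "median_ss": 80.0},
    {"pair": ("GENE_C", "GENE_D"), "pct_pathogenic": 75.0, "median_cs": 0.52, "median_ss": 60.0},
    {"pair": ("GENE_E", "GENE_F"), "pct_pathogenic": 50.0, "median_cs": 0.70, "median_ss": 90.0},
]
ranked = rank_gene_pairs(rows)
for row in ranked:
    print(row["pair"], row["pct_pathogenic"], row["median_cs"], row["median_ss"])
```

Ties on the percentage of pathogenic combinations are broken by the median Classification Score, and remaining ties by the median Support Score.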

You can:

  • Rank the table based on a column:
    You can rank your table based on a column by clicking on the arrows on the column name.
  • Search/filter the table based on gene(s):
    You can search for a gene by typing the gene name in the search area.
    You can also search for a gene pair by typing the two genes you are searching, separated with a space.
  • Download the table:
    You can download the current table by clicking on the button. If you have first filtered your table based on a gene, the downloaded table will only contain that selection.

Digenic combinations exploration

With this section you can explore the results of the digenic pathogenicity predictions of VarCoPP for all digenic variant combinations of your data.

You can get a visual overview of the results with the interactive S-plot and inspect and download all results with the Summary table. By clicking on each digenic combination in the table you can get more details about its pathogenicity prediction, its pathogenic digenic effect and get access to useful variant, gene and gene-pair annotations.

Digenic results overview: S-plot

The S-plot gives an interactive visual overview of the VarCoPP predictions for all digenic variant combinations present in your data.

S-plot example

All combinations are plotted based on the two prediction scores provided by VarCoPP:

y-axis

Support Score: the percentage of individual VarCoPP predictors agreeing that the digenic combination is candidate disease-causing

x-axis

Classification Score: the median probability among all individual predictors of VarCoPP that the combination is candidate disease-causing
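
To make the two axis definitions concrete, here is a minimal sketch that computes both scores from the outputs of the individual predictors. It assumes that each predictor provides a probability and a yes/no vote for the disease-causing class; the function name and the example values are hypothetical.

```python
from statistics import median


def support_and_classification_score(probabilities, votes):
    """Return (Support Score, Classification Score) for one digenic combination.

    probabilities: per-predictor probability of the disease-causing class.
    votes: per-predictor boolean, True if the predictor votes disease-causing.
    """
    if not votes or len(probabilities) != len(votes):
        raise ValueError("need one probability and one vote per predictor")
    support_score = 100.0 * sum(votes) / len(votes)  # percentage of agreeing predictors
    classification_score = median(probabilities)     # median probability
    return support_score, classification_score


# Hypothetical outputs of four predictors.
probs = [0.62, 0.48, 0.71, 0.55]
votes = [True, False, True, True]
ss, cs = support_and_classification_score(probs, votes)
print(ss, cs)
```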

The colour of each digenic combination represents the prediction and the pathogenicity confidence that is provided with VarCoPP for that combination (for details on how this confidence is calculated, you can consult the VarCoPP confidence zones section in the Documentation).

dark red

the variant combination is predicted as candidate disease-causing with 99% confidence

red

the variant combination is predicted as candidate disease-causing with 95% confidence

orange

the variant combination is predicted as candidate disease-causing without falling into one of the two confidence zones

blue

the variant combination is predicted as neutral

grey

a previously tested neutral combination to serve as validation background

You can interact with the plot in several ways:

  • Hover on a combination:
    By hovering on a combination in the plot, a box appears with information about the gene pair, the VarCoPP pathogenicity scores and further links.
  • Zoom-in:
    You can zoom-in on the plot by selecting a rectangular area with your mouse. The Summary Table on the right panel is updated automatically to include only the variant combinations present in the plot at that specific time.
  • Re-initialize the plot:
    You can re-initialize the S-plot after zooming in by clicking on the button.
  • Remove the 10K neutral background combinations:
    You can remove the tested background neutral variant combinations from the plot by clicking on the button.
  • Download the plot:
    You can download the plot by clicking on the button. Note that if you have zoomed-in on the plot, the plot will be downloaded in the zoom-in mode.

Digenic results overview: Summary table

The summary table on the right panel shows further details for each digenic combination. That table is automatically updated based on the filters you choose on the table itself or on the S-plot on the left.

The combinations are ranked based on their Support Score, with those having the highest score being first.

You can:

  • change the ranking by clicking on either the Support Score or Classification Score columns.

  • click on each digenic combination to get more details about its pathogenicity prediction, its pathogenic digenic effect and get access to useful variant, gene and gene-pair annotations.

  • search/filter the table based on a variant or gene name(s). You can use multiple variants or genes by separating them with a space.

  • download the table in its current state by clicking on the button.

example of the VarCoPP summary table

The colour of each digenic combination in the table represents the pathogenicity confidence of the combination (for details on how this confidence is calculated, you can consult the VarCoPP confidence zones section in the Documentation).

dark red

the variant combination is predicted as candidate disease-causing with 99% confidence

red

the variant combination is predicted as candidate disease-causing with 95% confidence

orange

the variant combination is predicted as candidate disease-causing without falling into one of the two confidence zones

blue

the variant combination is predicted as neutral

Pathogenicity prediction information

In this section you can explore the results of VarCoPP, which predicts the pathogenicity of a digenic variant combination as either candidate disease-causing or neutral. You can also see how each biological feature votes for either the disease-causing or the neutral class. This is an important step that can aid in understanding and evaluating the results obtained by our predictive methods.

In ORVAL we use the tree-interpreter Python module, a method that allows us to see, for every variant combination, the preference each feature shows for either the neutral or the disease-causing class inside each individual predictor of VarCoPP. Based on this method, we get specific preference values for each feature that range from negative to positive.

We visualise these class preference values per feature by using box plots that reveal both the median and variance of class preferences among the individual predictors in VarCoPP.

Feature in red color

The feature has a positive median preference value among all predictors of VarCoPP and votes in favor of the disease-causing class. The higher the value, the stronger the vote for the disease-causing class is.

Feature in blue color

The feature has a negative median preference value among all predictors of VarCoPP and votes in favor of the neutral class. The lower the value, the stronger the vote for the neutral class is.

For a detailed description of the biological features used for predictions, you can consult the VarCoPP features section.

An example of feature interpretation for a prediction

The following box-plot corresponds to a bi-locus combination that was predicted as candidate disease-causing with a Support Score of 89.4. This can already tell us that probably some of the features were conflicting among the predictors, as we do not have a clear consensus.

example of tree interpreter plot

In the boxplot, we see the preference of each feature for either the disease-causing or the neutral class among all individual predictors of VarCoPP, for that particular variant combination.

In this case, we can see that CADD1 and CADD2 (the CADD scores of the 1st and 2nd variant alleles of gene A, see also the Feature Description section) contribute strongly to the disease-causing class vote, as they have the highest positive median contribution of all features. This probably means that the CADD scores of those variant alleles are quite high (here, we are most probably dealing with a homozygous or compound heterozygous variant in gene A, where the 2nd variant allele is not wild-type), something that we can verify by looking at the annotation of the digenic combination in the Digenic Results page.
On the other hand, CADD3 (the CADD score of the 1st variant allele of gene B) drives the prediction towards the neutral class.

We can also see that although RecA (the recessiveness probability of gene A) also has a positive median preference value, as it is coloured in red, its preference values in some of the predictors spread below zero, meaning that this feature was conflicting between the disease-causing and the neutral class.
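
The reading of the box plot described above can be sketched in a few lines: the sign of the median preference value gives the colour, and a feature is conflicting when its values fall on both sides of zero. This is only an illustration; the preference values below are hypothetical, and the handling of a median of exactly zero is an assumption, as it is not specified here.

```python
from statistics import median


def summarise_feature(values):
    """Summarise the preference values of one feature across all predictors."""
    med = median(values)
    if med > 0:
        colour = "red"    # votes for the disease-causing class
    elif med < 0:
        colour = "blue"   # votes for the neutral class
    else:
        colour = "none"   # assumption: a zero median is left uncoloured
    conflicting = min(values) < 0 < max(values)  # values spread across zero
    return {"median": med, "colour": colour, "conflicting": conflicting}


# Hypothetical preference values of three features across five predictors.
preferences = {
    "CADD1": [0.10, 0.12, 0.08, 0.15, 0.11],
    "CADD3": [-0.05, -0.07, -0.04, -0.06, -0.03],
    "RecA": [0.03, 0.02, -0.01, 0.04, -0.02],
}
summary = {name: summarise_feature(vals) for name, vals in preferences.items()}
for name, info in summary.items():
    print(name, info)
```

With these values, RecA is coloured red because its median is positive, but it is flagged as conflicting because some predictors give it a negative value.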

Digenic Effect prediction

In this section you can explore the results of the Digenic Effect (DE) predictor, which predicts the digenic effect of a pathogenic variant combination (i.e. whether it is a True Digenic, Monogenic + Modifier or Dual Molecular Diagnosis case). If a variant combination has been predicted as candidate disease-causing, you will find further information about its digenic effect in this section.

This information is presented on the left with a table that provides the probabilities for each digenic effect class and the predicted Digenic Effect (i.e. the class with the highest probability), while the radar plot on the right panel provides a visual representation of the prediction results.

An example of a Digenic Effect prediction

The following image corresponds to a pathogenic variant combination whose digenic effect is predicted to be True Digenic.

example of a digenic effect prediction for a combination

The table on the left provides the probabilities for all three possible digenic effect classes. We can see that the True Digenic class has the highest probability (0.756) compared to the rest and, therefore, this is the final Digenic Effect class that is predicted for that combination.

The radar plot on the right simply shows a summary visualisation of the results shown in the table. Each line represents a Digenic Effect class and can take probability values from 0 (center of the plot) to 1 (edge of the plot). Three dots fall on the corresponding probability value of each class, forming a triangular shape. With this shape we can get a quick visual idea of which class is preferred and whether this preference is strong or not (depending on the skewness of the triangular shape). In this case, we can clearly see that the prediction falls to the True Digenic class based on the skewness of the triangular shape towards this class.
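
The class selection in this example amounts to picking the class with the highest probability. In the sketch below only the True Digenic probability (0.756) comes from the example; the other two values are hypothetical placeholders chosen so that the three probabilities sum to 1.

```python
def predicted_digenic_effect(probabilities):
    """Return the Digenic Effect class with the highest probability."""
    return max(probabilities, key=probabilities.get)


probabilities = {
    "True Digenic": 0.756,              # value from the example above
    "Monogenic + Modifier": 0.150,      # hypothetical value
    "Dual Molecular Diagnosis": 0.094,  # hypothetical value
}
effect = predicted_digenic_effect(probabilities)
print(effect)
```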

Exception messages

In some cases you will not be able to get information about the Digenic Effect of a variant combination and you will see one of the following exception messages instead:

  • The variant combination is not pathogenic
    As the Digenic Effect predictor can only work on pathogenic variant combinations, if the combination you are exploring is predicted as neutral by VarCoPP, you will see the following message:

    example of a missing digenic effect prediction because of a neutral combination
  • There is some missing annotation for a variant combination
    The Digenic Effect predictor cannot make a prediction for a variant combination when some annotations that are used for the prediction are missing for that combination. In this case, you will see the following message:

    example of missing annotation for a variant combination

Tutorials

We provide here some tutorials concerning certain aspects of the interpretation of results. In case you would like to see a tutorial regarding a specific topic in ORVAL, you can contact us.

Prioritisation of digenic variant combinations

Depending on the size of the data you are analysing, you may end up with many digenic variant combinations predicted as candidate disease-causing. However, based on our machine learning methodology and previous statistical analyses, it is possible to limit your analysis to those combinations that could potentially be more interesting for your research. We would like to stress that ORVAL cannot provide a prioritisation of single variants, but rather a way to prioritise digenic combinations.

All combinations in the Summary Table of the Digenic Combinations Overview are ranked based on their Classification Score (CS) and Support Score (SS), with those at the top having the highest scores. You can consult the VarCoPP evaluation scores section for a detailed explanation. In general, the higher the CS and SS assigned to a digenic combination, the more confident VarCoPP is for the disease-causing class.

You can find below a way to prioritise the results of the digenic combinations, starting from the more general to the stricter criteria.

  • Combinations predicted as candidate disease-causing
    As combinations predicted as candidate disease-causing have a CS of at least 0.489 and an SS of at least 50, you could first focus on all combinations that pass these thresholds for further inspection (coloured orange, red or dark red). Combinations predicted as candidate disease-causing without falling into a confidence zone are depicted in orange.
  • Combinations predicted as candidate disease-causing with 95% confidence
    For a stricter analysis, you could focus on the combinations predicted as candidate disease-causing with 95% confidence, meaning that they have a 95% probability of being a True Positive result. These have CS≥0.55 and SS≥75 and are depicted in red.
  • Combinations predicted as candidate disease-causing with 99% confidence
    As the strictest criterion, you could finally focus on the combinations predicted as candidate disease-causing with 99% confidence, meaning that they have a 99% probability of being a True Positive result. These have CS≥0.74 and SS=100 and are depicted in dark red.

Please note that the stricter the criteria you use, the lower the chance of keeping False Positive results but, on the other hand, the higher the chance of eliminating potentially interesting results (False Negatives).
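
If you have downloaded the Summary Table, the three prioritisation levels above can be reproduced with a small helper such as the one below. This is a sketch only: it assumes that the CS and SS values quoted above can be applied jointly as cut-offs, and that anything below the first level is shown as neutral (blue). The function name is hypothetical.

```python
def confidence_colour(cs, ss):
    """Map a Classification Score (0-1) and Support Score (0-100) to a colour."""
    if cs >= 0.74 and ss >= 100:
        return "dark red"  # candidate disease-causing, 99% confidence zone
    if cs >= 0.55 and ss >= 75:
        return "red"       # candidate disease-causing, 95% confidence zone
    if cs >= 0.489 and ss >= 50:
        return "orange"    # candidate disease-causing, outside the confidence zones
    return "blue"          # assumption: treated as neutral


for cs, ss in [(0.80, 100), (0.60, 80), (0.50, 55), (0.30, 20)]:
    print(cs, ss, confidence_colour(cs, ss))
```

Keeping only the rows for which the helper returns "dark red" corresponds to the strictest criterion, while keeping "red" and "dark red" corresponds to the 95% confidence level.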

Example of a variant combination prioritisation

example of a digenic combinations table

In this example, we see the first page of a Summary Table after an analysis with ORVAL.

The colours immediately help to discern the different categories of predicted digenic combinations. Those in blue are predicted as neutral, the 5 in red are predicted as candidate disease-causing with 95% confidence and the 2 in dark red are predicted as candidate disease-causing with 99% confidence. In this example, there are no combinations predicted as candidate disease-causing outside a confidence zone.

If you would like to apply very strict criteria, you could first focus on the 2 top combinations that have a 99% confidence of being a True Positive result and further explore them with ORVAL (using the oligogenic navigation section) and by clicking on them to get directed to their specialised digenic page.

On the other hand, if you would like to relax the strictness of your criteria, you can also inspect the variant combinations present in the 95% confidence zone.


Browser compatibility

You can find below which browsers are suitable for ORVAL based on your operating system:

OS        Version       Chrome    Firefox   Microsoft Edge   Safari
Linux     Ubuntu 18     71.0.35   64.0      N/A              N/A
MacOS     High Sierra   71.0.35   64.0      N/A              12.0.1
Windows   10            71.0.35   64.0      44.17            N/A

Frequently Asked Questions

If answers to your questions are not provided in this section and no information about your question is mentioned in the Documentation page, you can contact us.

Is there a limit on the number of variants I can upload?
In general, we highly recommend the use of variants from up to 300 genes, as well as the application of the variant filtering procedure that is provided with ORVAL, in order to limit the number of non-relevant combinations that will be tested.

Based on our server tests, you can either copy-paste a variant list of up to 80000 variants or upload a VCF file of up to 50 MB. You can consult the Input Data section of our Documentation page for more details.
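
Before submitting, you can check your input against these limits locally. The sketch below is an illustration, not part of ORVAL: the function names are hypothetical, and interpreting 50 MB as 50 × 1024 × 1024 bytes is an assumption.

```python
MAX_PASTED_VARIANTS = 80000
MAX_VCF_BYTES = 50 * 1024 * 1024  # assumption: 1 MB taken as 1024 * 1024 bytes


def pasted_list_within_limit(text):
    """True if the pasted variant list has at most 80000 non-empty lines."""
    n_variants = sum(1 for line in text.splitlines() if line.strip())
    return n_variants <= MAX_PASTED_VARIANTS


def vcf_size_within_limit(size_in_bytes):
    """True if a VCF file of the given size is within the 50 MB limit."""
    return size_in_bytes <= MAX_VCF_BYTES


print(pasted_list_within_limit("1\t12345\tA\tG\tHeterozygous\n"))
print(vcf_size_within_limit(10 * 1024 * 1024))
```

The size of a file on disk can be obtained with os.path.getsize from the Python standard library and passed to vcf_size_within_limit.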

Can I include variants from multiple patients in my input?
No, the analysis should be restricted to a single individual only, as ORVAL creates all possible variant combinations from your variant list assuming that they belong to the same person.
If you want to analyse multiple patients, you should separate their variants in different files and explore them individually with ORVAL.

Is there a specific input format for the insertions and deletions?
ORVAL accepts different types of variant format for the insertions and deletions, involving dashes or not.
You can consult the Variant Types section in the Documentation page for a detailed explanation.

I see that the Job status of my variant submission is FAILURE. What should I do?

If there is no specific error message that explains the problem, you can follow these steps:

  • Check if your internet connection is running smoothly.
  • Re-try your data submission.
  • Check if your input data is correctly formatted. You can consult the Input Data section in our documentation page for a detailed explanation on the correct format for your variant submission.

If you have checked the previous steps and you still experience issues, you can send us an email, providing also the Job Id you have obtained during the submission.

During my data submission I see a message that I have exceeded the 5 Job submissions. What should I do?

For server monitoring purposes we allow every user (based on their IP address) to run a maximum of 5 different data analyses at the same time. If you exceed this number, you have to wait until at least one of the running jobs has finished before launching a new one. You can consult the Job Submission section of the Documentation page for more details.

During my data submission I see a message that my uploaded file is using an unsupported format. What should I do?

The VCF file you have uploaded is not correctly formatted and ORVAL cannot parse it. Make sure that your file contains the header line (#CHROM POS REF ALT etc...) and that its columns are tab-delimited and include CHROM, POS, ID, REF, ALT, FORMAT and SAMPLE_NAME.
You can consult the VCF file specification section of the Documentation page for a detailed description of how to properly format your VCF file.
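
A quick local check of the header line can help to spot this problem before uploading. The sketch below only tests for the columns listed above and for at least one sample column after FORMAT; it is not ORVAL's parser, and the function name and the example header are hypothetical.

```python
REQUIRED_VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "FORMAT"]


def check_vcf_header(lines):
    """Return a list of problems found in the #CHROM header line (empty if none)."""
    header = next((line for line in lines if line.startswith("#CHROM")), None)
    if header is None:
        return ["missing #CHROM header line"]
    # Columns must be separated by tabs, not spaces.
    columns = header.lstrip("#").rstrip("\r\n").split("\t")
    problems = [f"missing column {c}" for c in REQUIRED_VCF_COLUMNS if c not in columns]
    if "FORMAT" in columns and columns.index("FORMAT") == len(columns) - 1:
        problems.append("missing sample column after FORMAT")
    return problems


good = ["##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n"]
space_delimited = ["#CHROM POS ID REF ALT FORMAT SAMPLE1\n"]
print(check_vcf_header(good))
print(check_vcf_header(space_delimited))
```

A header that uses spaces instead of tabs is reported as missing every required column, which is a common cause of this error.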

What types of digenic combinations does ORVAL create?
ORVAL creates all possible di-allelic, tri-allelic and tetra-allelic variant combinations between any gene pair present in your data, including compound heterozygous variants in the same gene, except for tetra-allelic combinations with four heterozygous variants (two in each gene).
For a detailed explanation, you can consult the Creating digenic combinations section of our Documentation page.

I receive an error message that there are no variant combinations left to do the analysis. What should I do?

First of all, if you have submitted a very small number of variants, it is possible that none of them is present in our database. Nevertheless, you can go through the following steps to explore other possible solutions:

  • Check whether the genome version you are using for the variants is correct. ORVAL annotates variants using the GRCh37/hg19 genome version, see the Genome version section of the Documentation page for more details.
  • If you have uploaded a VCF file, please check whether the format of the variants and variant fields is correct. You can consult the VCF file specifications for ORVAL in our Documentation page.
  • If you copy-pasted a variant list, please make sure that the column order is correct, the columns are tab-delimited and the zygosity values are not misspelled (Homozygous or Heterozygous zygosity values are accepted). You can find further information on the section about the submission of a tab-delimited variant list in the Documentation page.
  • Try to relax your variant or gene filtering options, especially the option for removing intronic and synonymous variants.
  • Check that you have more than one gene present in your data. ORVAL makes variant combinations between gene pairs, so it requires the presence of at least two different genes in your data.
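
For a copy-pasted variant list, the formatting checks above can be sketched as follows. The column order (chromosome, position, reference allele, alternative allele, zygosity) and the accepted zygosity values come from this Documentation page; the function name is hypothetical and the exact, case-sensitive matching of the zygosity value is an assumption.

```python
ACCEPTED_ZYGOSITY = {"Homozygous", "Heterozygous"}


def check_variant_line(line):
    """Return None if the line looks valid, otherwise a short problem description."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        return f"expected 5 tab-delimited fields, found {len(fields)}"
    chrom, pos, ref, alt, zygosity = fields
    if not pos.isdigit():
        return f"position is not an integer: {pos!r}"
    if zygosity not in ACCEPTED_ZYGOSITY:  # assumption: case-sensitive match
        return f"unrecognised zygosity: {zygosity!r}"
    return None


print(check_variant_line("1\t12345\tA\tG\tHeterozygous"))
print(check_variant_line("1 12345 A G Heterozygous"))
print(check_variant_line("1\t12345\tA\tG\tHeterozygus"))
```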

If you have checked the previous steps and you still experience issues, you can send us an email, providing also the Job Id you have obtained during the submission.

SOME variants of my initial submission are missing from the results

  • Please check the variant and/or gene filtering options you have selected during your submission, as these play an important role in the absence of some of your variants from the analysis. The variant filtering options offered by ORVAL are automatically pre-selected during the submission, unless you unselect them.
  • Make sure that the variant information and format you provided is correct for all variants. In case you copy-pasted a variant list using the box panel, make sure that the zygosity values are not misspelled (Heterozygous or Homozygous zygosity values are accepted), see also the tab-delimited variant list section in the Documentation page.
  • Otherwise, some variants may have been excluded from your analysis during the data annotation process. You can consult the Variant Exclusion section of our Documentation page for a detailed description of such cases.


For more information regarding our filtering options and the data annotation process, you can consult the Data Filtering and Annotation section of our Documentation page.

I cannot see a network in the Results page
You can see a network in the Results page only if there is at least one candidate disease-causing variant combination predicted with VarCoPP. For more details, see the Network navigation section in the Documentation page.
In any case, you can still explore all variant combinations in the Digenic predictions of the Results page.

Why do we see only SOME external proteins in the PPI network?
You see an external protein in the PPI network only if it links two proteins of your selected module with a direct protein-protein interaction. External proteins that are connected with your module proteins with higher degrees of interactions are not shown. You can consult the PPI network exploration section in the Documentation page.
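
The rule for showing external proteins can be sketched as a simple neighbour count: an external protein is kept only if it interacts directly with at least two different proteins of the selected module. The function name, the protein names and the interaction list below are hypothetical.

```python
def linking_external_proteins(interactions, module):
    """Return external proteins that directly interact with two or more module proteins.

    interactions: iterable of (protein_a, protein_b) pairs (undirected).
    module: iterable of protein names in the selected module.
    """
    module = set(module)
    module_neighbours = {}
    for a, b in interactions:
        # Consider both orientations of the undirected interaction.
        for external, member in ((a, b), (b, a)):
            if external not in module and member in module:
                module_neighbours.setdefault(external, set()).add(member)
    return {ext for ext, members in module_neighbours.items() if len(members) >= 2}


# X links M1 and M2 directly; Y and Z only connect them through a longer path.
interactions = [("M1", "X"), ("X", "M2"), ("M1", "Y"), ("Y", "Z"), ("Z", "M2")]
shown = linking_external_proteins(interactions, {"M1", "M2"})
print(shown)
```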

Where can I find the Digenic Effect prediction for a variant combination?
You can see the Digenic Effect prediction of a variant combination on its corresponding Digenic Combination page, which you can reach by clicking on the combination either in the Digenic Combinations Overview table or on the S-plot.
Please note that you will only see a Digenic Effect prediction if the variant combination is predicted as pathogenic with VarCoPP.

Do you store any of my data?
We store the results of your data submission for 7 days, so that you can re-access the corresponding Result pages. After this period, all data is deleted.
We do not store email addresses that you may have provided to us during the data submission. However, we track general user traffic information (e.g. IP addresses) for job monitoring purposes (e.g. restricting the number of parallel submissions from the same IP address).
For a detailed explanation of our Data Privacy procedures, you can consult the Data privacy section of the About page.