

Small insertion and deletion polymorphisms (Indels) are the second most common mutations in the human genome, after Single Nucleotide Polymorphisms (SNPs). Recent studies have shown that they have a significant influence on genetic variation by altering human traits and can cause multiple human diseases. In particular, many Indels that occur in protein-coding regions are known to impact the structure or function of the protein. A major challenge is to predict the effects of these Indels and to distinguish between deleterious and neutral variants. When an Indel occurs within a coding region, it can be either frameshifting (FS) or non-frameshifting (NFS). FS-Indels either modify the complete C-terminal region of the protein or result in premature termination of translation. NFS-Indels insert or delete multiples of three nucleotides, leading to the insertion or deletion of one or more amino acids. In order to study the relationships between NFS-Indels and Mendelian diseases, we characterized NFS-Indels according to numerous structural, functional and evolutionary parameters. We then used these parameters to identify specific characteristics of disease-causing and neutral NFS-Indels. Finally, we developed a new machine learning approach, KD4i, that can be used to predict the phenotypic effects of NFS-Indels. We demonstrate in a large-scale evaluation that the accuracy of KD4i is comparable to existing state-of-the-art methods. However, a major advantage of our approach is that we also provide the reasons for the predictions, in the form of a set of rules. The rules are interpretable by non-expert humans and they thus represent new knowledge about the relationships between the genotype and phenotypes of NFS-Indels and the causative molecular perturbations that result in the disease.

A major goal in human genetics is to understand the links between the presence of genetic variations, including Single Nucleotide Polymorphisms (SNPs), insertions/deletions (Indels), Copy Number Variants (CNV), recombination events, etc., and individual or population characteristics, risk of disease or response to the environment. This requires the characterization and analysis of the type and distribution of the variations in human populations and/or each individual to understand how a specific genetic landscape can influence human health and behavior. Recently, with the development of Next Generation Sequencing (NGS) techniques, the available information related to human genetic variation has grown rapidly, resulting in overwhelming volumes of data. For example, the dbSNP database (build 138) contains about 62 million SNPs and 11 million Indels. Half a million of these observed SNPs are found within an exon and modify a single amino acid. These are known as non-synonymous SNPs (nsSNPs) and are the most frequent cause of Mendelian diseases since they alter the protein’s function.
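
The distinction between frameshifting and non-frameshifting Indels described above comes down to whether the length change is a multiple of three nucleotides. The following minimal Python sketch illustrates that check. It assumes the Indel is given as VCF-style reference and alternate allele strings and lies entirely within a coding region. The function name `classify_indel` is hypothetical and is not part of KD4i. The sketch looks only at length and ignores other effects, such as a newly introduced stop codon or a disrupted splice site.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding-region Indel as 'FS' or 'NFS'.

    Illustrative assumption: ref and alt are VCF-style allele strings,
    so the net number of inserted or deleted nucleotides is the
    difference between their lengths.
    """
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        raise ValueError("ref and alt have the same length: not an Indel")
    # A multiple of three preserves the reading frame (non-frameshifting);
    # any other length shifts the frame of every downstream codon.
    return "NFS" if length_change % 3 == 0 else "FS"


if __name__ == "__main__":
    # Insertion of three nucleotides: one amino acid is inserted.
    print(classify_indel("A", "ACTG"))
    # Deletion of two nucleotides: the reading frame is shifted.
    print(classify_indel("ACT", "A"))
```

Under these assumptions, an Indel that adds or removes 3, 6, 9, ... nucleotides is labelled NFS and any other length is labelled FS.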
