Prioritising genetic findings for drug target identification and validation

The decreasing costs of high-throughput genetic sequencing and increasing abundance of sequenced genome data have paved the way for the use of genetic data in identifying and validating potential drug targets. However, the number of identified potential drug targets is often prohibitively large to experimentally evaluate in wet lab experiments, highlighting the need for systematic approaches for target prioritisation. In this review


Introduction
Target-based drug development is a paradigm that aims to identify druggable targets whose function or concentration can be modified by compounds (drugs) to mitigate the effects of a disease. While around 90 % of current drugs target a protein [1], drug targets may include other biomolecules, such as nucleic acids, including RNA [2]. Historically, 90 % of clinical drug development programs have failed [3], with the majority of late-stage clinical failures driven by compound-related toxicity or by lack of efficacy of the target protein, that is, the drug target is not observed to be causally related to the disease. This high rate of failure indicates the poor ability of pre-clinical experiments, conducted in animals, cell lines, and tissues, to appropriately anticipate effects of target perturbation in human diseases [4]. One promising approach that could help in lowering rates of clinical trial failure is the use of genetic data in drug target identification and validation.
The sequencing of the human genome in 2003 paved the way for human genetic evidence to be used in the drug development process. However, until relatively recently, only ad-hoc drug development has been initiated directly from human genomic evidence, largely through the discovery of rare Mendelian variation to causally model the effects of the drug target. For example, candidate gene studies conducted in the mid-1990s identified the central role of the CCR5 gene in HIV progression [5], and family- and population-based genetic studies in the early 2000s found associations between the PCSK9 gene and LDL-C concentration, subsequently leading to the successful development of PCSK9-inhibiting drugs for the treatment of hypercholesterolaemia [6]. More recently, with the increase in large cohort studies covering multiple diseases and biomarkers, and the fall in costs of whole genome sequencing and genotype arrays, there is a growing body of publicly accessible genomic data that can be used in drug target identification. In particular, genome-wide association studies (GWAS) present an opportunity to exploit genomics for drug target identification and validation. The use of GWAS has proven successful in identifying de novo targets in well-studied diseases. For example, analyses of the IL6R locus have anticipated that interleukin 6 receptor (IL-6R) inhibition through tocilizumab (originally indicated for rheumatoid arthritis) might be repurposed for treatment of coronary heart disease (CHD) [7-9]. This hypothesis is now supported by the CANTOS trial, which confirmed that using a monoclonal antibody to target interleukin 1-beta (IL-1β), a protein that drives the IL-6 signalling pathway, reduces the rate of cardiovascular events in patients with a history of CHD [10]. Additionally, trials of IL-6R blockade with tocilizumab are currently underway in patients with myocardial infarction (MI), with some positive results at early trial stages [11-13]. GWAS have additionally proven successful in identifying established drug-target/disease combinations (GWAS rediscoveries) [7,14-16], and systematic evaluation of historical drug development programs has found that compounds related to disease-target combinations with genetic support were substantially more likely to receive regulatory approval than those without (odds ratio (OR) 2.0; 95 % confidence interval (CI) 1.60, 2.40), clearly highlighting the potential of genetic evidence in drug development [17].
As showcased by industry investment into projects such as FinnGen [18], UK Biobank (UKB) [19] and other large population-scale genomic resources, pharmaceutical companies are increasingly considering genomics-first approaches. Here, information from the human genome is leveraged to identify and validate potential new drug targets. In modern, high-powered genetic studies leveraging large-scale biobanks such as UKB [19], FinnGen [18], the Estonian Biobank [20], Biobank Japan [21] and the China Kadoorie Biobank [22], it is not uncommon to identify anywhere from 10 to more than 100 genetic loci for a given disease. This number of loci is typically too large, and mechanistically too diverse, to systematically evaluate in wet lab experiments. Hence, further prioritisation is required to identify a subset of tractable targets for confirmatory analyses. For this purpose, many biomedical resources are available, providing orthogonal evidence to aid in prioritising genomics findings for drug development.
In this review, we will provide an overview of a subset of biomedical databases, and integrated software tools, relevant to prioritising findings from genetically guided drug development for subsequent wet-lab validation and eventually clinical testing. We will first discuss the most common types of genetic studies relevant for drug development: loss of function analyses, GWAS, colocalization and Mendelian randomisation. Subsequently, we will discuss the utility of biomedical databases in 1) ontologies mapping genes, proteins and disease, 2) identifying druggable proteins, 3) clinical effects profiles to identify repurposing candidates, 4) tissue and cell expression, and 5) key biological pathways including protein-protein interactions. We will illustrate the utility of human genetics and integration of biomedical databases by identifying and prioritising plasma proteins with an anticipated effect on non-alcoholic fatty liver disease (NAFLD).
When reviewing the examples illustrated below, it is worth considering the distinction between a geneticist's definition of a drug target, and the types of targets employed in clinical practice. Of the protein targets, it is important to note the type of target that we are interrogating when using genetic data. Almost exclusively, genetic data assays single protein targets, which make up 80 % of known clinically used target types in ChEMBL (v33). However, other highly represented target types include protein families, which represent ~10 % of all protein targets, and protein complexes, which make up ~7 % of targets [23]. Whilst individual proteins within these groups are tested using genetic data, entire protein complexes or families are not, and it could be the case that the efficacy of some drugs relies on engagement of multiple protein targets.

Loss of function analysis
Predicted loss-of-function (pLoF) variants are rare genetic variants that are predicted to severely disrupt or inactivate the function of a protein, based on the changes that they encode in the protein sequence. Identifying pLoF variants and associating them with disease provides a 'natural' experiment for drug target discovery. pLoF variants usually occur within coding regions of a gene, are very rare and not typically correlated with other variants in the genome, and so are assumed to be causal. This has an advantage over other study designs evaluating common variation, as traversing from the associated genetic variant to the causal gene and protein is implicit in the study design. Given the known functional consequences of pLoF variants, their effect direction on disease provides a valuable indication of the mechanism of action a drug compound should have. If a pLoF variant is associated with reduced disease risk, it suggests that the encoded protein should be the target of an inhibitor. Conversely, if the variant is associated with increased disease risk, the encoded protein should be targeted by an activator. Historically, the identification of these variants focussed on rare, monogenic diseases, driven by family-based linkage analyses, with genetic signals determined in families with the disease, and subsequently confirmed by sequencing to identify disease-causing alleles. More recently, the development of high-throughput sequencing technologies has allowed larger, more phenotypically diverse cohorts to be genotyped, and more computational techniques to be used to predict LoF variants [24]. The products of genes harbouring disease-associated LoF variants are now the targets of drugs, both approved and in ongoing trials. For example, genetic studies finding that LoF variants in the angiopoietin-like 3 gene (ANGPTL3) were associated with decreased concentrations of triglycerides and cholesterol led to clinical trials, and subsequently to the approval of the drug evinacumab targeting ANGPTL3 as a lipid-lowering therapy [25]. The increase in DNA sequence data from large populations has also given rise to numerous computational models predicting the consequences of altered protein function, including LoF as well as missense variants. Examples include SIFT [26], PolyPhen [27], and most recently, AlphaMissense [28].
As discussed by Minikel et al., the sample size required to identify LoF variants using whole genome or whole exome sequencing (WGS/WES) in unselected populations is generally prohibitively large. For example, they estimate this would require up to 1,000 times the worldwide available number of genotyped individuals [29]. Genetic studies of isolated populations, where the frequency of rare alleles has genetically drifted upwards, or populations that have a historic propensity for consanguinity, may provide an opportunity to identify LoF variants in a more realistic sample size setting. However, this approach also raises significant cost and ethical concerns that may limit routine use. For example, the discovery of a deleterious genetic variant can have implications for participants as well as their family members, especially if the latter did not provide consent for the study. In addition, even if LoF variants are found, their importance may not be clear. Not all pLoF variants occur in disease-coding regions, and, even amongst those that do, many of them are in fact 'benign' and have no clear association with a disease or phenotype [30].

Genome-wide association studies
The common disease-common variant hypothesis proposes that for common diseases in a population, genetic variations associated with the disease will also be widespread within the population [24]. One study type that supports and exploits this hypothesis is the genome-wide association study (GWAS). GWAS are high-throughput techniques which genotype large numbers of common genetic markers across the genome of a population and test for the association of each one with a phenotype of interest. GWAS can be used to study dichotomous traits, such as the diagnosis of coronary artery disease, as well as quantitative traits, such as body mass index (BMI) or metabolite or protein concentrations. Due to the relatively low costs of GWAS chips, GWAS currently cover a vast range of phenotypes, in large sample sizes, relevant for drug development. These include analyses of disease onset (e.g., CHD), biomedical traits (e.g., glucose or lipid concentrations), imaging traits (e.g., abdominal MRIs), as well as consideration of high-throughput proteomics. Analyses are typically conducted by considering biallelic variants, and comparing the difference in average phenotype across both alleles [4,31].
N. Hukerikar et al.
Despite the advantages of using GWAS, several limitations mean that further downstream analyses and drug target validation must be carried out on identified targets. Unlike LoF variants, variants identified by GWAS tend to be common, and often reflect non-coding variants located near protein-coding genes. In addition, common variants tend to occur in groups of highly correlated alleles at a population level, a concept referred to as linkage disequilibrium (LD). Therefore, it is often difficult to discern the precise causal variant and gene driving the association signal. This issue has been pervasive in GWAS since the first studies conducted in 2007; however, in the latest causal gene prediction models, distance appears to be a key feature in identifying the causal gene, suggesting that in many cases it is the closest gene that drives the association [32,33]. In addition, most GWAS variants are thought to be acting in a regulatory capacity, affecting either transcript or protein concentration rather than protein function itself. Therefore, even though most GWAS identify protein-coding genes, there can be no a priori assumption that this is the case for a causal gene driving a GWAS signal. Additionally, given that the effect direction of a GWAS reflects the arbitrary choice of effect allele, inference on the required mechanism of a developed drug is not immediate and requires additional information, for example anchoring genetic associations on CHD by their LDL-C effect. The relevance of GWAS findings for drug development is often improved by conducting additional analyses such as Mendelian randomisation (MR), which natively account for these two sources of information.

Mendelian randomisation
Mendelian randomisation (MR) is a type of instrumental variable (IV) analysis which leverages genetic variants as instruments to identify causal associations between (modifiable) exposures and outcomes. For this purpose, MR leverages genetic variants associated with an exposure and subsequently determines whether there is a dose-response relationship with the genetic variant effect on an outcome (Fig. 2), where the estimated slope provides an indication of the anticipated effect direction of the exposure-outcome association. The original IV methodology used in MR has been adapted to a 'two-sample' paradigm, where GWAS summary statistics are used rather than individual-level data, allowing the use of non-identifiable genetic data from different exposure and outcome datasets, maximising the available sample size compared to traditional cohort studies.
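The dose-response slope described above can be sketched with a fixed-effect inverse-variance weighted (IVW) estimator, the estimator named in the caption of Fig. 2. The following is a minimal illustration rather than the analysis code used in this review: the function name and the toy effect sizes are hypothetical, and the sketch assumes uncorrelated variants, whereas cis-MR analyses generally need to account for LD between instruments.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate: weighted regression through the origin of
    variant-outcome effects on variant-exposure effects, with weights equal
    to the precision (1 / se^2) of the outcome estimates.

    Assumes uncorrelated variants (an assumption of this sketch).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if not beta_exposure:
        raise ValueError("at least one variant is required")

    weights = [1.0 / (se ** 2) for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))

    slope = numerator / denominator              # effect of exposure on outcome
    standard_error = math.sqrt(1.0 / denominator)
    return slope, standard_error


# Hypothetical toy data: per-variant effects on protein level (SD units)
# and on the outcome (log odds ratio), with outcome standard errors.
toy_beta_exposure = [0.10, 0.20, 0.30]
toy_beta_outcome = [0.05, 0.10, 0.15]
toy_se_outcome = [0.02, 0.02, 0.02]

slope, standard_error = ivw_estimate(
    toy_beta_exposure, toy_beta_outcome, toy_se_outcome
)
print(f"IVW log(OR) per SD of exposure: {slope:.3f} (SE {standard_error:.3f})")
```

A positive slope suggests that raising the exposure raises disease risk, so an inhibitor would be the anticipated mechanism; a negative slope suggests the opposite.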
MR is based on three key assumptions: 1) that the genetic variants are strongly associated with the potential drug target; 2) that the genetic variants do not share any common causes with the exposure and/or outcome; and 3) that there are no horizontal pleiotropy pathways through which the genetic variants might affect disease risk without influencing the exposure of interest [4]. By selecting GWAS hits as the variants to study, we can have confidence that the first assumption is met. While the second assumption is hard to formally prove, it largely holds true by nature of the experiment. As genetic variation in the population is fixed at gamete formation, the probability of confounding, whilst not zero, is greatly reduced. The validity of the third assumption is more difficult to ascertain, but the influence of potential horizontal pleiotropy can be reduced analytically, for which a myriad of pleiotropy-robust estimators have been derived [4,34-36].

Table 1
Access details of bioinformatics resources for annotation of genes and proteins. Summary and URLs for data sources described in all sections of this review.
MR has predominantly been used to establish the causal effects of 'traditional' biomarkers such as blood pressure, LDL-C and BMI, using GWAS associations from throughout the genome as instrumental variables. However, due to the increasing abundance of available protein-quantitative trait loci (pQTLs), MR has been adapted to validate potential protein drug targets. MR studies on proteins typically only leverage genetic variants from within, or very close to, the encoding gene and are termed cis-MR or drug target MR.
MR for drug target identification and evaluation typically, but not exclusively, sources genetic instruments from within and around a small cis region of the protein-encoding gene. MR has produced successful results in a range of settings in drug target validation for cardiovascular disease (CVD), CHD, and multiple other disease groups. In CHD and CVD, for example, MR studies have shown that on-target inhibition of cholesteryl ester transfer protein (CETP) is likely to reduce the risk of CHD and heart failure, and that previous failed trials were likely compound related rather than target related [15]; an MR study of HMG-coenzyme A reductase (HMGCR), the licensed drug target of statins, has shown that inhibition of the protein may also have adverse on-target effects such as an increased risk of type 2 diabetes [14]; another MR study showed an association between interleukin 6 receptor (IL-6R) and the risk of ischemic stroke and coronary artery disease (CAD), presenting the protein as a viable therapeutic target for these diseases [37]. Aside from this, MR studies found increased interleukin 18 (IL18) to be associated with an increased risk of inflammatory bowel disease (IBD) [38]. This highlighted the potential to repurpose IL18 inhibitors which were previously evaluated in clinical trials for treatment of diabetes.

Colocalization
Colocalization is a method which estimates whether two or more distinct GWAS signals in fact reflect the same underlying causal variant. Colocalization of GWAS disease and biomarker associations with expression-quantitative trait loci (eQTL) and pQTL signals has been used to attempt to locate the causal gene for a GWAS, where colocalizing signals are taken as indicators that the gene encoding a pQTL protein is also responsible for the GWAS association.
For drug target validation, colocalization has generally been used post-MR as a prioritisation step to ensure that the identified signal is attributed to the correct exposure. In this context, if it is found that an exposure and the outcome are in fact associated with distinct causal variants, then it is possible that the GWAS associations are distinct from those in the pQTL, and a pleiotropic pathway to the outcome may exist through, for example, a neighbouring gene [39].
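To make the distinction between shared and distinct causal variants concrete, the sketch below follows one widely used Bayesian formulation of colocalization, based on approximate Bayes factors and five hypotheses (H0: no association; H1 and H2: association with one trait only; H3: distinct causal variants; H4: a shared causal variant). It is an illustration under stated assumptions, not the method used in this review: the prior probabilities, the prior standard deviation, the function names and the toy z-scores are all assumptions chosen for the example, and the model assumes at most one causal variant per trait in the region.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff_exp(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))


def log_abf(beta, se, prior_sd):
    """Log approximate Bayes factor for a single variant."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-variant log ABFs.

    p1, p2 and p12 are prior probabilities that a variant is causal for
    trait 1 only, trait 2 only, or both (assumed values).
    """
    sum1 = log_sum_exp(labf1)
    sum2 = log_sum_exp(labf2)
    sum12 = log_sum_exp([a + b for a, b in zip(labf1, labf2)])
    log_h = [
        0.0,                                                       # H0
        math.log(p1) + sum1,                                       # H1
        math.log(p2) + sum2,                                       # H2
        math.log(p1) + math.log(p2) + log_diff_exp(sum1 + sum2, sum12),  # H3
        math.log(p12) + sum12,                                     # H4
    ]
    total = log_sum_exp(log_h)
    return [math.exp(h - total) for h in log_h]


# Hypothetical region of three variants; all standard errors set to 0.05.
SE, PRIOR_SD = 0.05, 0.15


def labfs(z_scores):
    return [log_abf(z * SE, SE, PRIOR_SD) for z in z_scores]


# Both traits peak at the second variant: evidence for a shared variant.
pp_shared = coloc_posteriors(labfs([1.0, 6.0, 1.0]), labfs([0.5, 5.0, 1.0]))
# The traits peak at different variants: evidence for distinct variants.
pp_distinct = coloc_posteriors(labfs([6.0, 1.0, 1.0]), labfs([1.0, 5.0, 1.0]))

print("shared peak:   PP.H3=%.3f PP.H4=%.3f" % (pp_shared[3], pp_shared[4]))
print("distinct peak: PP.H3=%.3f PP.H4=%.3f" % (pp_distinct[3], pp_distinct[4]))
```

In this framing, a high posterior probability for H4 supports attributing the MR signal to the protein of interest, whereas a high posterior probability for H3 points to the scenario of distinct causal variants described above.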

Biomedical databases to prioritise genetic findings for drug development
Methods using genomic data, including MR and colocalization, can provide robust evidence for associations between numerous potential drug targets and clinically relevant outcomes. However, as previously mentioned, the number of proteins will often be too large, and mechanistically too diverse, to evaluate each finding in confirmatory wet lab experiments. Enriching the results of genetically-based drug target identification and validation studies with a range of additional data sources, incorporating important biomedical context, can help in reducing this set of proteins, and prioritising those which are more likely to be clinically relevant.

Mapping gene, protein and disease identifiers
When using genetic data to prioritise protein drug targets, we assume a trivial one-to-one mapping of gene to protein. However, genes are not labelled with the same unique identifiers as their encoded proteins across datasets, where notably both proteins and genes may have more than one abbreviation or name. For example, the gene PCSK9 (written in italic font) has 3 gene synonyms (FH3, HCHOLA3, NARC-1), whereas the protein PCSK9 (written in roman font) has a single synonym, NARC1.
Fig. 2. Dose-response curve between genetic variants associating with plasma concentration of CYB5A, and their effects on non-alcoholic fatty liver disease. Effect sizes represent mean differences in standard deviation change of protein CYB5A (x-axis), and the log(odds ratio) on non-alcoholic fatty liver disease (y-axis). Each point represents a variant effect, and the gradient of the line is the estimated beta coefficient effect size of the protein on the outcome, weighted by the precision of the y-axis estimates (using an inverse variance weighted Mendelian randomisation estimator [61]). The underlying data are available from Supplementary Table S1.
A widely-used identification system for genes is Ensembl [40], a genome browser in which each gene is assigned a unique identifier. Ensembl incorporates gene annotations from a range of different sources such as dbSNP [41] for variant information, and the Database of Genotypes and Phenotypes (dbGaP) [42] for phenotype data. Other gene identification systems include the Entrez Gene [43] database for gene-specific information, and the HUGO Gene Nomenclature Committee (HGNC) [44], which maintains unique symbols and names for human loci. An analogue for proteins is the UniProt Knowledgebase (UniProtKB) [45], which contains data on protein sequences and function, and in which each protein is assigned a unique UniProt accession ID. UniProt provides functionality to map between different identifiers, including Ensembl IDs and UniProt accession IDs.
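The synonym problem described above can be handled locally with a simple alias index once synonyms have been exported from a resource such as HGNC or UniProt. The sketch below is a hypothetical helper (the function names and normalisation rule are assumptions of this example, not part of any database API), populated only with the PCSK9 synonyms quoted in the text.

```python
def normalise(name):
    """Normalise a gene or protein label for matching (case, hyphens, spaces)."""
    return name.upper().replace("-", "").replace(" ", "")


def build_alias_index(records):
    """Map every normalised symbol or synonym to its canonical symbol.

    `records` maps a canonical symbol to an iterable of synonyms.
    Raises if one alias would point to two different canonical symbols.
    """
    index = {}
    for canonical, synonyms in records.items():
        for name in [canonical, *synonyms]:
            key = normalise(name)
            if index.get(key, canonical) != canonical:
                raise ValueError(f"ambiguous alias: {name}")
            index[key] = canonical
    return index


def resolve(index, name):
    """Return the canonical symbol for `name`, or None if it is unknown."""
    return index.get(normalise(name))


# Synonyms for PCSK9 as listed in the text (gene and protein synonyms combined).
alias_index = build_alias_index({"PCSK9": ["FH3", "HCHOLA3", "NARC-1", "NARC1"]})

print(resolve(alias_index, "narc-1"))   # PCSK9
print(resolve(alias_index, "HCHOLA3"))  # PCSK9
print(resolve(alias_index, "UNKNOWN"))  # None
```

Raising on ambiguous aliases is a deliberate choice in this sketch: a synonym shared by two genes should be resolved manually, or via stable identifiers such as Ensembl or UniProt accession IDs, rather than silently mapped.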
A common naming convention is also required to identify the diseases associated with the drug targets. Medical Subject Headings (MeSH) [46] are terms defined by the National Library of Medicine, and act as a standardised thesaurus for diseases and medical conditions which can be used to index PubMed. In some data sources, such as the Chemical Biology Database (ChEMBL) [23], diseases and outcomes are identified by MeSH terms. Other sources use different vocabularies, in which case a metathesaurus such as the Unified Medical Language System (UMLS) [47] can be used to map synonymous disease terms.

Determining the druggability of potential drug targets
Even if a protein is causally related to a disease, for it to be modifiable, it must be a viable drug target, or 'druggable'. Not all genes encode druggable proteins and as such it is important to determine if this is the case for any of the candidate drug targets. By definition, targets of existing drugs must be druggable; however, these represent fewer than 1,000 proteins out of the entire proteome, estimated to cover over 20,000 proteins [1,48]. The question arises, how do we determine if a currently undrugged protein is indeed druggable?
To first identify disease-associated proteins which are already targeted by a drug compound, databases such as ChEMBL [23] can be consulted. ChEMBL is an open-source database which provides information on bioactive molecules and their interactions with biological targets, and contains data on over 2.4 million drug compounds and their effects on biological systems. ChEMBL data is manually retrieved from a variety of sources, including drug product labels for marketed drugs, published literature, and ClinicalTrials.gov, which publishes information from clinical trials around the world. From ChEMBL, a range of data can be extracted, including the clinical trial phase of a drug (i.e., was the drug licensed or did it fail at an earlier trial stage), the disease indication, the mechanism of action of the drug, and potential adverse effects. Pre-clinical compounds, that is, compounds that are bioactive but have not yet been clinically trialled, are also included in the database [23].
For cases where proteins have not yet been targeted by approved drugs, there are various definitions of 'druggable' which can be consulted. The work by Finan et al. [49] combines protein data from both the British National Formulary (BNF) [50] and ChEMBL, in addition to genes encoding secreted or plasma membrane proteins that are not included in these databases, to produce a list of 4,479 druggable proteins. These additional proteins, whilst not already targeted by compounds, possess biological characteristics such as location, size and membership in 'highly druggable' protein families which provide strong evidence that they could be targeted by monoclonal antibodies. Open Targets [51], a platform developed specifically for the identification and prioritisation of drug targets, integrates data from Finan et al. with a range of resources including ChEMBL, UniProt and the Human Protein Atlas (HPA) [48], to provide details on the tractability of a protein based on its structure, existing clinical trials, and other relevant features.
It is important to note that definitions of druggability are not static and are constantly evolving. Traditionally these definitions focus on proteins that can be activated or inhibited by small molecules. However, with the development of new targeting modalities such as Proteolysis-Targeting Chimeras (PROTACs) [52], which target specific proteins for degradation, or the use of peptide therapeutics rather than small molecules [53], it is likely that the number of druggable proteins will continue to increase.

Tissue and cell-specific expression of drug targets
Most diseases, at least initially, affect a single or a limited number of tissues. For example, asthma specifically affects the lung, and neurological diseases such as schizophrenia affect the tissue in the brain. It is therefore important to consider in which tissue a genetically identified drug target is expressed, and how likely it is that tissue expression is related to disease onset. Furthermore, tissue expression already provides some indication of which therapeutic modality might need to be pursued to ensure the drug can access the target [4]. For example, targets expressed in tissues of privileged organs such as the brain or eye require consideration of how a drug might traverse the blood-brain barrier, which is designed to regulate and limit movement between plasma and the brain. Alternatively, one may consider whether drugs acting in tissues such as blood plasma may indirectly affect processes in more privileged areas, for example through active or passive transport. Further insight can be gained by considering single-cell expression data, which can be used to measure the differential expression of a gene across specific cell types. Taking this into consideration can aid in anticipating the efficacy of modulating a potential drug target. For example, if the gene encoding a protein associated with a cancer is found to be expressed in healthy cells but not cancer cells, this could be an indication that targeting the protein may not be effective against the disease.
The Genotype-Tissue Expression (GTEx) [54] project aims to provide tissue-level information on how genetic variation influences gene expression across different tissues. The tissue data is obtained from donors, either post-mortem, or during organ and tissue transplantation surgery, and RNA-sequencing is conducted on the samples. GTEx publishes a range of data based on these analyses, including gene expression at the tissue level across 54 tissues in the human body and expression-quantitative trait loci (eQTL) which capture genetic associations with gene expression levels across many tissues [55].
The Human Protein Atlas (HPA) [56] is a fully open-access resource which aims to map all human proteins in cells, tissues and organs by integrating results from a range of different technologies including RNA sequencing and tissue imaging. The integral part of the HPA is the tissue-level data, which focusses on the expression of genes at the mRNA and protein level in human tissues. Here, data from GTEx is combined with internal HPA data and data from the FANTOM5 consortium [57] to provide a consensus classification of gene specificity (a measure of whether a gene is broadly expressed or tissue-specific) and details on gene expression profiles across tissues. The HPA additionally collates single-cell data, which measures the expression profile of genes across cell types, and tissue cell-type data, which measures cell type specificity of genes within given tissues.

Biological pathways and protein-protein interaction
In almost all cases, candidate drug targets will not act independently in determining disease onset but will rather form part of a complex network of interrelated pathways. Often, the failure of a drug trial is due to lack of efficacy, or adverse side effects of modulating the drug target. Adopting a more systems-based approach to drug target prioritisation, and identifying pathways that are implicated in disease onset and progression, has multiple benefits in this regard. Understanding pathways affected by protein perturbation could help in identifying downstream effects, both beneficial and potentially adverse, of a drug compound. This can be investigated on a more granular level, by observing the direct interactions between the candidate target and other proteins in either the same or different pathways. Furthermore, if a protein identified by GWAS is not druggable, it is possible that other proteins in shared pathways may be, and could be alternative candidates for targeting.
The Gene Ontology (GO) [58] project is a standardised model which organises and classifies gene products to annotate and analyse the role of different genes in biological processes. GO describes the gene products in three distinct domains: Molecular Function, which describes activity solely at the molecular level; Biological Process, which describes larger processes accomplished by multiple molecular activities; and Cellular Component, which describes locations in which gene products perform functions. These human-curated annotations cover over 20,000 individual genes, as well as providing the ontology itself, allowing analysis of gene function at different granularities. A key use of GO is enrichment analysis: given a set of genes, the set of GO terms that are over- or under-represented can be ascertained.
The Reactome knowledgebase [59] is a comprehensive human pathway database where data is obtained from literature, verified manually by biological experts before being published, and cross-referenced to other sources including GO, Ensembl and UniProt. Reactome is built as a network of reactions, defined as any molecular event between molecules, including proteins and small molecules, where pathways are built as a series of connected reactions and are organised hierarchically [59]. Alongside publishing these curated pathways, Reactome provides a number of tools for subsequent analysis, including analysing gene lists for over-represented pathways. In addition, Reactome can query IntAct [60], a database of protein-protein interactions, to obtain lists of protein-protein interactions.
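The over-representation analyses mentioned for GO terms and Reactome pathways are commonly framed as a one-sided hypergeometric test: given a background of N genes of which K carry an annotation, how surprising is it to see at least x annotated genes in a query set of size n? The sketch below is a generic illustration using only the Python standard library; it is not the implementation used by GO or Reactome tools, and the counts in the usage example are hypothetical apart from the background of 1,978 proteins and the query set of five proteins taken from the illustrative example in this review.

```python
from math import comb


def enrichment_p_value(background_size, annotated_in_background,
                       query_size, annotated_in_query):
    """One-sided hypergeometric p-value P(X >= annotated_in_query).

    X is the number of annotated genes in a query set drawn without
    replacement from the background.
    """
    if not 0 <= annotated_in_query <= min(query_size, annotated_in_background):
        raise ValueError("annotated_in_query is outside the possible range")
    if query_size > background_size or annotated_in_background > background_size:
        raise ValueError("query or annotation counts exceed the background")

    upper = min(query_size, annotated_in_background)
    total = comb(background_size, query_size)
    tail = sum(
        comb(annotated_in_background, k)
        * comb(background_size - annotated_in_background, query_size - k)
        for k in range(annotated_in_query, upper + 1)
    )
    return tail / total


# Hypothetical example: 100 of 1,978 background proteins belong to a pathway,
# and 3 of the 5 prioritised proteins fall in it.
p_value = enrichment_p_value(1978, 100, 5, 3)
print(f"over-representation p-value: {p_value:.2e}")
```

In practice the test is repeated for every term or pathway, so the resulting p-values need a multiple-testing correction, and the choice of background (here, all proteins that were measured) strongly influences the result.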
A summary of all mentioned data sources and how they may be accessed can be found in Table 1, and a graphical representation of the approach described in this review can be found in Fig. 1.

Illustrative example: identifying and prioritising proteins associated with non-alcoholic fatty liver disease
As an illustrative example, we identify and prioritise plasma proteins for involvement with non-alcoholic fatty liver disease (NAFLD), representing a range of conditions caused by the build-up of fat in the liver. NAFLD is the most common form of chronic liver disease, with an estimated prevalence of ~25 % globally [62], and is associated with an increased risk of all-cause mortality, predominantly through an increased risk of CVD [63]. The aetiology of NAFLD is not yet clearly understood, and currently, no drugs exist for the treatment of NAFLD. Therapeutic strategies are instead aimed at symptom management, focusing on interventions such as improved diet and weight loss, and controlling the cardiometabolic risk factors associated with the disease [64].
In this illustrative example (see Supplementary Methods), we carry out Mendelian randomisation and colocalization analyses to identify a subset of plasma proteins associated with NAFLD. For this, we use the deCODE plasma pQTL (sample size 35,559) [65] and the Anstee et al. GWAS of NAFLD (with 1,483 biopsy-confirmed cases and 17,781 controls). We subsequently demonstrate how a subset of biomedical data resources can be leveraged to validate and prioritise these proteins as targets for drug development.
MR identified 91 plasma proteins which were significantly associated with NAFLD after accounting for multiple testing (Fig. 3; see Supplementary Methods and Supplementary Table S2). Colocalization analysis between the plasma protein and NAFLD GWAS signals found evidence for shared variants at 40 loci (see Supplementary Methods and Supplementary Table S3), including five proteins with an MR association with NAFLD: NCAN, DAPK2, CYB5A, TGFBI and NT5C; see Fig. 2 for the individual instruments for CYB5A.
The HPA database was used to identify the tissues in which these five proteins were expressed, particularly focusing on any potential overexpression (i.e., above-average expression) in liver, adipose, or granulocyte tissue, which are of particular relevance to NAFLD [66] (Supplementary Methods). Each of the five proteins was found to be expressed in both liver and adipose tissue, with CYB5A over-expressed in the liver (Supplementary Table S4).
We next consulted ChEMBL and the druggable genome definition to determine whether any of these proteins have been drugged by existing compounds, by a developmental compound, or would require completely de novo drug development. According to ChEMBL, none of the five proteins have been targeted by a compound or drug in clinical phase testing. ChEMBL included compounds with activity against DAPK2 and TGFBI, indicating these proteins are druggable and may be considered for NAFLD drug development.
Finally, we queried the Reactome pathway knowledgebase to identify any pathways which were enriched for the five NAFLD-associated proteins in comparison to all proteins available in the deCODE GWAS. Enrichment analysis identified Reactome pathway R-HSA-1430728, reflecting cellular energy metabolism, including mitochondrial lipid metabolism, which is strongly implicated in NAFLD [67,68] (Supplementary Table S5).
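The prioritisation steps in this example can be summarised as a simple evidence tally. The sketch below encodes only the findings reported above (MR and colocalization support for all five proteins, liver over-expression of CYB5A, and ChEMBL compounds with activity against DAPK2 and TGFBI); the data structure, the equal weighting of evidence types and the alphabetical tie-breaking are assumptions made for illustration, not a scoring scheme used in this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Evidence flags for one candidate protein (hypothetical structure)."""
    name: str
    mr_significant: bool
    colocalizes: bool
    overexpressed_in_liver: bool
    chembl_bioactive_compound: bool

    def evidence_count(self) -> int:
        # Equal weighting of evidence types is an assumption of this sketch.
        return sum([
            self.mr_significant,
            self.colocalizes,
            self.overexpressed_in_liver,
            self.chembl_bioactive_compound,
        ])


candidates = [
    Candidate("NCAN", True, True, False, False),
    Candidate("DAPK2", True, True, False, True),
    Candidate("CYB5A", True, True, True, False),
    Candidate("TGFBI", True, True, False, True),
    Candidate("NT5C", True, True, False, False),
]

# Rank by number of supporting evidence lines, breaking ties alphabetically.
ranked = sorted(candidates, key=lambda c: (-c.evidence_count(), c.name))

for candidate in ranked:
    print(f"{candidate.name}: {candidate.evidence_count()} lines of evidence")
```

A real prioritisation would likely weight evidence types differently, for example giving more weight to druggability when the aim is repurposing, and would carry effect direction alongside each flag.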

Conclusion
In this review, we have discussed the benefits of using genetic data to guide drug target validation, discussing common methods used to identify drug targets associated with disease endpoints. We particularly focussed on leveraging information from biomedical datasets to annotate and prioritise candidate drug targets based on information on compound affinity, tissue expression, and biological pathway membership. Finally, we demonstrated how a combination of these datasets could be used to prioritise proteins associated with NAFLD.

Fig. 1. Graphical abstract. Stages of genetically guided drug development as explained in this review: 1) identifying potential protein drug targets from genetic data using appropriate methods, 2) annotating potential targets using available biomedical datasets, 3) prioritising a subset of the annotated drug targets.

Fig. 3. The Mendelian randomisation estimates of proteins on non-alcoholic fatty liver disease. (A) (left panel) Effect sizes and statistical significance of each protein on non-alcoholic fatty liver disease. Each point represents a protein; effect estimates are represented in log(odds ratio) (x-axis) and statistical significance in log(p-value) (y-axis). Coloured points represent proteins passing the Bonferroni multiplicity-corrected p-value of 3.20 × 10⁻⁵ based on the number of proteins (1,978). (B) (right panel) Mendelian randomisation estimates of five prioritised proteins with an effect on non-alcoholic fatty liver disease. Effect estimates are reported as odds ratios with 95 % confidence intervals (95 % CI). Proteins are annotated according to their druggability based on information from the British National Formulary and ChEMBL. Proteins are referred to by their Ensembl gene names. The underlying data are available from Supplementary Table S2.