
Data-driven multivariate population subgrouping via lipoprotein phenotypes versus apolipoprotein B in the risk assessment of coronary heart disease

  • Pauli Ohukainen
    Affiliations
    Computational Medicine, Faculty of Medicine, University of Oulu, Oulu, Finland

    Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu, Finland

    Biocenter Oulu, University of Oulu, Oulu, Finland
  • Sanna Kuusisto
    Affiliations
    Computational Medicine, Faculty of Medicine, University of Oulu, Oulu, Finland

    Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu, Finland

    Biocenter Oulu, University of Oulu, Oulu, Finland

    NMR Metabolomics Laboratory, School of Pharmacy, University of Eastern Finland, Kuopio, Finland
  • Johannes Kettunen
    Affiliations
    Computational Medicine, Faculty of Medicine, University of Oulu, Oulu, Finland

    Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu, Finland

    Biocenter Oulu, University of Oulu, Oulu, Finland

    National Institute for Health and Welfare, Helsinki, Finland
  • Markus Perola
    Affiliations
    National Institute for Health and Welfare, Helsinki, Finland

    Diabetes and Obesity Research Program, University of Helsinki, Helsinki, Finland

    Estonian Genome Center, University of Tartu, Tartu, Estonia
  • Marjo-Riitta Järvelin
    Affiliations
    Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu, Finland

    Biocenter Oulu, University of Oulu, Oulu, Finland

    Unit of Primary Health Care, Oulu University Hospital, OYS, Oulu, Finland

    Department of Epidemiology and Biostatistics, MRC-PHE Centre for Environment and Health, School of Public Health, Imperial College London, London, UK

    Department of Life Sciences, College of Health and Life Sciences, Brunel University London, UK
  • Ville-Petteri Mäkinen
    Affiliations
    Computational and Systems Biology Program, Precision Medicine Theme, South Australian Health and Medical Research Institute, Australia

    Hopwood Centre for Neurobiology, Lifelong Health Theme, SAHMRI, Australia
  • Mika Ala-Korpela
    Correspondence
    Corresponding author. Computational Medicine, Faculty of Medicine University of Oulu, Oulu, Finland.
    Affiliations
    Computational Medicine, Faculty of Medicine, University of Oulu, Oulu, Finland

    Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu, Finland

    Biocenter Oulu, University of Oulu, Oulu, Finland

    NMR Metabolomics Laboratory, School of Pharmacy, University of Eastern Finland, Kuopio, Finland
Open Access. Published: December 12, 2019. DOI: https://doi.org/10.1016/j.atherosclerosis.2019.12.009

      Highlights

      • Data-driven subgrouping algorithm was trained by multivariate lipoprotein data.
      • Four coherent subgroups were identified in two large-scale population-based cohorts.
      • Subgroups had characteristic lipoprotein profiles and risk for CHD.
      • Apolipoprotein B quartiles stratified CHD risk better than multivariate subgroups.
      • Caution on multivariate data-driven subgrouping in risk assessment is warranted.

      Abstract

      Background and aims

      Population subgrouping has been suggested as a means to improve coronary heart disease (CHD) risk assessment. We explored here how unsupervised data-driven metabolic subgrouping, based on comprehensive lipoprotein subclass data, would work in large-scale population cohorts.

      Methods

      We applied a self-organizing map (SOM) artificial intelligence methodology to define subgroups based on detailed lipoprotein profiles in a population-based cohort (n = 5789) and utilised the trained SOM in an independent cohort (n = 7607). We identified four SOM-based subgroups of individuals with distinct lipoprotein profiles and CHD risk and compared those to univariate subgrouping by apolipoprotein B quartiles.

      Results

      The SOM-based subgroup with highest concentrations for non-HDL measures had the highest, and the subgroup with lowest concentrations, the lowest risk for CHD. However, apolipoprotein B quartiles produced better resolution of risk than the SOM-based subgroups and also striking dose-response behaviour.

      Conclusions

      These results suggest that the majority of lipoprotein-mediated CHD risk is explained by apolipoprotein B-containing lipoprotein particles. Therefore, even advanced multivariate subgrouping, with comprehensive data on lipoprotein metabolism, may not advance CHD risk assessment.

      1. Introduction

      Increasing amounts of data available in epidemiology and medicine have generated interest in more detailed stratification of disease risk. Data-driven subgroup analyses have revealed new metabolic characteristics of complex diseases and uncovered subgroup-specific risk factors that could potentially improve risk assessment [
      • Ala-Korpela M.
      Data-driven subgrouping in epidemiology and medicine.
      ]. Recent studies have investigated this approach in type 1 [
      • Lithovius R.
      • Toppila I.
      • Harjutsalo V.
      • Forsblom C.
      • Groop P.H.
      • et al.
      Data-driven metabolic subtypes predict future adverse events in individuals with type 1 diabetes.
      ] and type 2 diabetes [
      • Ahlqvist E.
      • Storm P.
      • Käräjämäki A.
      • Martinell M.
      • Dorkhan M.
      • et al.
      Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables.
      ] as well as in sepsis [
      • Seymour C.W.
      • Kennedy J.N.
      • Wang S.
      • Chang C.H.
      • Elliott C.F.
      • et al.
      Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis.
      ]. Various algorithms can be utilised for subgrouping [
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ], but the core principle is the same: to characterise a heterogeneous population so that individuals with shared metabolic, genetic and/or clinical characteristics are grouped together. In addition to identifying subgroup-specific risk, this approach can be useful in understanding complex multivariate phenotypes and in finding metabolically, and possibly also genetically, characteristic subgroups of individuals [
      • Ala-Korpela M.
      Data-driven subgrouping in epidemiology and medicine.
      ].
      Here we applied a statistical artificial intelligence framework – a so-called self-organizing map (SOM) – that clusters individuals without explicit boundaries between groups or cut-off values for variables [
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ,
      • Mäkinen V.P.
      • Forsblom C.
      • Thorn L.M.
      • Wadén J.
      • Gordin D.
      • et al.
      Metabolic phenotypes, vascular complications, and premature deaths in a population of 4,197 patients with type 1 diabetes.
      ,
      • Kumpula L.S.
      • Mäkelä S.M.
      • Mäkinen V.P.
      • Karjalainen A.
      • Liinamaa J.M.
      • et al.
      Characterization of metabolic interrelationships and in silico phenotyping of lipoprotein particles using self-organizing maps.
      ]. SOM analyses have a long-term track record in biomedical applications [
      • Ala-Korpela M.
      Data-driven subgrouping in epidemiology and medicine.
      ,
      • Lithovius R.
      • Toppila I.
      • Harjutsalo V.
      • Forsblom C.
      • Groop P.H.
      • et al.
      Data-driven metabolic subtypes predict future adverse events in individuals with type 1 diabetes.
      ,
      • Mäkinen V.P.
      • Forsblom C.
      • Thorn L.M.
      • Wadén J.
      • Gordin D.
      • et al.
      Metabolic phenotypes, vascular complications, and premature deaths in a population of 4,197 patients with type 1 diabetes.
      ,
      • Kumpula L.S.
      • Mäkelä S.M.
      • Mäkinen V.P.
      • Karjalainen A.
      • Liinamaa J.M.
      • et al.
      Characterization of metabolic interrelationships and in silico phenotyping of lipoprotein particles using self-organizing maps.
      ] and recently open-source software, aimed at large-scale epidemiological data, was published as an R library [
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ]. Conceptually, the SOM is a projection of multi-dimensional data onto a two-dimensional map [
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ,
      • Kumpula L.S.
      • Mäkelä S.M.
      • Mäkinen V.P.
      • Karjalainen A.
      • Liinamaa J.M.
      • et al.
      Characterization of metabolic interrelationships and in silico phenotyping of lipoprotein particles using self-organizing maps.
      ]. For example, in this study each participant is assigned a location on the map based on a pre-selected set of lipoprotein-related variables: people within the same map area share a similar overall lipoprotein profile, while people far apart have different profiles. Therefore, comparisons between map areas are analogous to comparisons between subgroups of individuals.
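      The projection idea described above can be illustrated with a minimal sketch. This is not the Numero implementation used in the study; it is a generic sequential SOM written with NumPy, and the function names, grid size, learning-rate schedule and neighbourhood schedule are all assumptions chosen for illustration.

```python
import numpy as np

def train_som(data, rows=6, cols=8, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small rectangular SOM (illustrative settings, not those of the study).

    Returns the unit prototypes, shape (rows*cols, n_features), and the
    two-dimensional grid coordinates of the units, shape (rows*cols, 2).
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise prototypes from randomly chosen samples.
    protos = data[rng.choice(n, size=rows * cols, replace=True)].copy()
    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)               # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.5   # neighbourhood radius shrinks
            # Best-matching unit: the prototype closest to this sample.
            bmu = np.argmin(((protos - data[i]) ** 2).sum(axis=1))
            # Gaussian neighbourhood on the map grid around the best-matching unit.
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            protos += lr * h[:, None] * (data[i] - protos)
            step += 1
    return protos, grid

def map_positions(data, protos, grid):
    """Assign each sample the grid location of its best-matching unit."""
    d2 = ((data[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return grid[np.argmin(d2, axis=1)]
```

      Once trained, the same prototypes can be reused to place new samples on the map with `map_positions`, which mirrors the idea of training in one cohort and classifying participants of another with an identical set of input variables.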
      Detailed quantitative molecular data are becoming increasingly common for large-scale studies in epidemiology via quantitative high-throughput metabolomics [
      • Ala-Korpela M.
      • Davey Smith G.
      Metabolic profiling-multitude of technologies with great research potential, but (when) will translation emerge?.
      ,
      • Würtz P.
      • Kangas A.J.
      • Soininen P.
      • Lawlor D.A.
      • Davey Smith G.
      • et al.
      Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies.
      ]. A nuclear magnetic resonance (NMR) spectroscopy-based platform has been broadly applied in epidemiology and genetics over the last few years; this platform is particularly advantageous in detailed lipoprotein profiling [
      • Ala-Korpela M.
      • Davey Smith G.
      Metabolic profiling-multitude of technologies with great research potential, but (when) will translation emerge?.
      ,
      • Würtz P.
      • Kangas A.J.
      • Soininen P.
      • Lawlor D.A.
      • Davey Smith G.
      • et al.
      Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies.
      ,
      • Kettunen J.
      • Demirkan A.
      • Würtz P.
      • Draisma H.H.M.
      • Haller T.
      • et al.
      Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA.
      ,
      • Locke A.E.
      • Steinberg K.M.
      • Chiang C.W.K.
      • Service S.K.
      • Havulinna A.S.
      • et al.
      Exome sequencing of Finnish isolates enhances rare-variant association power.
      ,
      • Tukiainen T.
      • Kettunen J.
      • Kangas A.J.
      • Lyytikäinen L.P.
      • Soininen P.
      • et al.
      Detailed metabolic and genetic characterization reveals new associations for 30 known lipid loci.
      ]. These metabolic data are typically continuous and do not naturally form subgroups; rather, they follow continuous, heavily overlapping distributions. Nevertheless, it is intuitive to expect that elaborate use of extensive multivariate data would lead not only to a better understanding of the complexities but also to better translational opportunities and improved disease risk assessment.
      We investigated here if unsupervised data-driven metabolic subgrouping, with SOM-based artificial intelligence and comprehensive NMR-based lipoprotein subclass data in large-scale population cohorts, could provide new insight on coronary heart disease (CHD) risk assessment. We demonstrate the resulting metabolic characteristics of the SOM-based subgroups and compare their risk assessment abilities to those of univariate subgroups based on a well-known causal CHD biomarker, apolipoprotein B (apoB) [
      • Ference B.A.
      • Kastelein J.J.P.
      • Ray K.K.
      • Ginsberg H.N.
      • Chapman M.J.
      • et al.
      Association of triglyceride-lowering LPL variants and LDL-C–lowering LDLR variants with risk of coronary heart disease.
      ,
      • Sniderman A.D.
      • Pencina M.
      • Thanassoulis G.
      ApoB: the power of physiology to transform the prevention of cardiovascular disease.
      ,
      • Ala-Korpela M.
      The culprit is the carrier, not the loads: cholesterol, triglycerides and apolipoprotein B in atherosclerosis and coronary heart disease.
      ,
      • Borén J.
      • Williams K.J.
      The central role of arterial retention of cholesterol-rich apolipoprotein-B-containing lipoproteins in the pathogenesis of atherosclerosis: a triumph of simplicity.
      ].

      2. Materials and methods

      2.1 Population cohorts

      The Northern Finland Birth Cohort 1966 (NFBC66) was set up in the two northernmost provinces of Finland to study factors associated with preterm birth and morbidity during follow-up (www.oulu.fi/nfbc). Originally, a total of 12,058 children (96% of all births in 1966 in the region) were born into the cohort. For this study, we utilised data from the 46-year sample collection, which was attended by a representative 52% of the original cohort [
      • Järvelin M.R.
      • Sovio U.
      • King V.
      • Lauren L.
      • Xu B.
      • et al.
      Early life factors and blood pressure at age 31 years in the 1966 Northern Finland birth cohort.
      ]. NMR-based lipoprotein data (96% fasting samples) were available from 5789 participants.
      FINRISK 1997 (FINRISK97) is a nationally representative cohort, established by the Finnish National Institute for Health and Welfare to monitor middle-aged population health outcomes and risk factors [
      • Borodulin K.
      • Vartiainen E.
      • Peltonen M.
      • Jousilahti P.
      • Juolevi A.
      • et al.
      Forty-year trends in cardiovascular risk factors in Finland.
      ]. Originally 8444 participants aged 25–74 years were recruited, and 15-year follow-up data were available. NMR-based lipoprotein profiling was performed on semi-fasted (minimum 4 h of fasting before blood was drawn) serum samples from 7607 participants (mean age 48 ± 13 years).

      2.2 Apolipoprotein, lipid and lipoprotein subclass analyses

      We used an NMR spectroscopy-based methodology that is widely applied in large-scale epidemiology and genetics [
      • Ala-Korpela M.
      • Davey Smith G.
      Metabolic profiling-multitude of technologies with great research potential, but (when) will translation emerge?.
      ,
      • Würtz P.
      • Kangas A.J.
      • Soininen P.
      • Lawlor D.A.
      • Davey Smith G.
      • et al.
      Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies.
      ,
      • Kettunen J.
      • Demirkan A.
      • Würtz P.
      • Draisma H.H.M.
      • Haller T.
      • et al.
      Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA.
      ,
      • Locke A.E.
      • Steinberg K.M.
      • Chiang C.W.K.
      • Service S.K.
      • Havulinna A.S.
      • et al.
      Exome sequencing of Finnish isolates enhances rare-variant association power.
      ,
      • Tukiainen T.
      • Kettunen J.
      • Kangas A.J.
      • Lyytikäinen L.P.
      • Soininen P.
      • et al.
      Detailed metabolic and genetic characterization reveals new associations for 30 known lipid loci.
      ]. This platform is powerful in lipoprotein subclass analysis and its large-scale epidemiological applications have recently been reviewed [
      • Würtz P.
      • Kangas A.J.
      • Soininen P.
      • Lawlor D.A.
      • Davey Smith G.
      • et al.
      Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies.
      ]. Briefly, the method provides apolipoprotein A-I (apoA-I) and B concentrations and standard clinical lipids: low-density lipoprotein (LDL) and high-density lipoprotein (HDL) cholesterol as well as total cholesterol and triglycerides. In addition, quantitative data on lipoprotein particle concentrations and their main lipid constituents (phospholipids, triglycerides, cholesteryl esters and free cholesterol molecules) for 14 lipoprotein subclasses are obtained. The lipoprotein subclasses are characterised by particle size as follows: the very-low-density lipoprotein (VLDL) fraction consists of extremely large (average diameter >75 nm), very large (64 nm), large (53.6 nm), medium (44.5 nm), small (36.8 nm) and very small (31.3 nm) particles. Intermediate-density lipoprotein (IDL) particles are on average 28.6 nm in diameter. LDL particles are divided into three subclasses: large (25.5 nm), medium (23.0 nm) and small (18.7 nm). The HDL fraction consists of four subclasses: very large (14.3 nm), large (12.1 nm), medium (10.9 nm) and small (8.7 nm).
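      As a bookkeeping aid, the subclass definitions above can be written as a small lookup table. The sketch below is only illustrative: the labels follow the size abbreviations used in the figure captions, the variable-name pattern is hypothetical, and the three measures per subclass are those listed for the SOM input variables in Section 2.4.

```python
# Average particle diameters (nm) as given in the text. None marks the extremely
# large VLDL subclass, for which only a lower bound (>75 nm) is reported.
SUBCLASS_DIAMETER_NM = {
    "XXL-VLDL": None, "XL-VLDL": 64.0, "L-VLDL": 53.6,
    "M-VLDL": 44.5, "S-VLDL": 36.8, "XS-VLDL": 31.3,
    "IDL": 28.6,
    "L-LDL": 25.5, "M-LDL": 23.0, "S-LDL": 18.7,
    "XL-HDL": 14.3, "L-HDL": 12.1, "M-HDL": 10.9, "S-HDL": 8.7,
}

# Three measures per subclass plus two apolipoproteins (hypothetical names).
MEASURES = ("particles", "triglycerides", "total_cholesterol")
APOLIPOPROTEINS = ("apoB", "apoA-I")

input_variables = [
    f"{subclass}_{measure}"
    for subclass in SUBCLASS_DIAMETER_NM
    for measure in MEASURES
] + list(APOLIPOPROTEINS)

print(len(SUBCLASS_DIAMETER_NM), len(input_variables))
```

      The count works out as 14 subclasses × 3 measures + 2 apolipoproteins = 44 variables.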

      2.3 Univariate subgrouping – apolipoprotein B quartiles

      Apolipoprotein B quartiles were calculated and used in the survival analysis.
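      A quartile assignment of this kind can be sketched as follows. The paper does not state how the cut points were computed or how ties were handled, so the interpolation rule (NumPy's default) and the convention that values equal to a cut point fall into the upper quartile are assumptions, as is the function name.

```python
import numpy as np

def quartile_labels(values):
    """Label each value 1..4 using the sample's own 25th, 50th and 75th percentiles."""
    values = np.asarray(values, dtype=float)
    cuts = np.quantile(values, [0.25, 0.5, 0.75])
    # side="right": a value exactly equal to a cut point goes to the upper quartile.
    return np.searchsorted(cuts, values, side="right") + 1
```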

      2.4 Multivariate subgrouping – self-organizing map analysis

      The SOM analyses were undertaken with the Numero software package [
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ] in the R environment. Details and practicalities of SOM analysis are described elsewhere [
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ]. All analyses here were based on a total of 44 lipoprotein subclass measures (collectively referred to as input variables): particle, triglyceride and total cholesterol (cholesteryl esters and free cholesterol summed together in each subclass particle) concentration for six VLDL subclasses, IDL, three LDL subclasses, four HDL subclasses, apoB and apoA-I. Input variables were pre-processed with previously published tools within the Numero R package [
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ] (rank-based transform for men and women separately and normalized to the range between −1 and 1). The SOM analysis was first performed in NFBC66 to identify the apparent subgroups based on the extensive lipoprotein data. The resulting (trained) SOM and the pre-defined subgroups were then applied directly to classify the participants in the FINRISK97 cohort using an identical set of input variables. To visualise the overall lipid profile in each subgroup, z-scores of log-transformed measures were calculated as (subgroup mean – all data mean)/all data SD.
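      The two transformations described above can be sketched in NumPy. This is not the Numero code: the function names are hypothetical, ties are given average ranks, and the sample standard deviation (ddof=1) is used for the z-score, none of which is specified in the text.

```python
import numpy as np

def rank_scale(x):
    """Replace values by their ranks, rescaled linearly to the range [-1, 1]."""
    x = np.asarray(x, dtype=float)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    # 0-based average rank of each distinct value (tied values share a rank).
    avg_rank = np.cumsum(counts) - (counts + 1) / 2.0
    ranks = avg_rank[inverse]
    if len(x) < 2:
        return np.zeros_like(x)
    return 2.0 * ranks / (len(x) - 1) - 1.0

def rank_scale_by_group(x, group):
    """Apply rank_scale separately within each group (e.g. men and women)."""
    x = np.asarray(x, dtype=float)
    group = np.asarray(group)
    out = np.empty_like(x)
    for g in np.unique(group):
        mask = group == g
        out[mask] = rank_scale(x[mask])
    return out

def subgroup_z(values, in_subgroup):
    """(subgroup mean - all data mean) / all data SD, on log-transformed values."""
    logv = np.log(np.asarray(values, dtype=float))
    mask = np.asarray(in_subgroup, dtype=bool)
    return (logv[mask].mean() - logv.mean()) / logv.std(ddof=1)
```

      Ranking within each sex removes systematic sex differences in the raw concentrations before the map is trained, so that map position reflects the relative lipoprotein profile rather than sex.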
      As a sensitivity analysis, fully independent SOM training, subgrouping and survival analysis were performed solely on the basis of the FINRISK97 data; the characteristics of the resulting SOM subgroups as well as the Kaplan-Meier curves were very similar. Excluding participants with prevalent CHD at baseline (n = 199) or not including apoB as an input variable in the SOM analysis yielded essentially the same results. The interpretation and conclusions regarding the comparison between univariate and multivariate subgrouping were identical.

      2.5 Survival analyses

      Participants in FINRISK97 with prevalent CHD (n = 199) and those with missing data or outliers (n = 7) were removed. Final analyses for the 15-year follow-up had 7306 participants with 575 incident CHD events (defined as fatal or nonfatal myocardial infarction, cardiac revascularization, or unstable angina). Kaplan-Meier curves for each SOM-based subgroup were calculated for incident first CHD events. Identical analyses were performed for the apoB quartiles. Analyses were performed in the R statistical language.
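      For readers unfamiliar with the Kaplan-Meier estimator, a minimal sketch is given below. The study itself used R; this NumPy version with a hypothetical function name only illustrates the product-limit calculation that would be applied to each subgroup or quartile separately.

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit estimate of the survival function.

    time  : follow-up time for each participant
    event : True if the participant had an event at `time`, False if censored
    Returns the distinct event times and the survival probability after each.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    survival = np.empty(len(event_times))
    s = 1.0
    for k, t in enumerate(event_times):
        at_risk = np.sum(time >= t)               # still under observation at t
        n_events = np.sum((time == t) & event)    # events occurring exactly at t
        s *= 1.0 - n_events / at_risk
        survival[k] = s
    return event_times, survival
```

      The cumulative event risk shown in a Kaplan-Meier plot of incident events is one minus the survival probability returned here.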

      3. Results

      3.1 Data-driven population subgrouping – SOM-based analysis

      Results from the SOM-based classification of the participants in the FINRISK97 cohort are illustrated in Fig. 1. All 44 lipoprotein subclass measures (input variables) were used in the SOM, but component planes (colourings of the SOM for individual variables) are shown only for the particle concentrations of the 14 lipoprotein subclasses as well as for apoA-I and apoB. The component planes for total cholesterol and triglyceride concentrations were very similar to the corresponding particle concentration planes shown. The component planes in Fig. 1 demonstrate strong regional patterns, particularly for the apoB-containing lipoprotein fractions (VLDL, IDL and LDL). The four subgroups are labelled from I to IV in ascending order of the subgroup mean apoB concentration. However, it is important to note that the circulating apoB concentration distributions overlap between all four subgroups, and heavily between the adjacent subgroups, as illustrated in Fig. 2.
      Fig. 1. Statistical colourings (component planes) of circulating lipoprotein particle, apolipoprotein A-I and apolipoprotein B concentrations on the self-organizing map.
      Each component plane shows colouring on the same SOM. The SOM illustrated is for the FINRISK97 cohort data for 7607 participants using 44 input variables (particle, triglyceride and total cholesterol concentration for 14 lipoprotein subclasses as well as apoA-I and apoB concentrations). The SOM organisation and the areas depicting the four population subgroups are from an independent training of the SOM for 5789 participants in the NFBC66. Subgroup I is characterised by the lowest, and subgroup IV the highest, mean apoB and related triglyceride and cholesterol concentrations. Subgroups II and III have intermediate concentrations of apoB but have elevated triglycerides and elevated cholesterol, respectively (see Fig. 2 for further details). The colour scale indicates deviation from the population mean with respect to random fluctuations that could be expected by chance; red refers to higher and blue to lower concentrations. The numbers on selected units give the local mean value for that particular region in the original measurement unit (the values for VLDL are 10⁻¹⁰ mol/l, for IDL and LDL 10⁻⁸ mol/l, for HDL 10⁻⁷ mol/l, and for apoA-I and apoB g/l). The SOM is a two-dimensional organisation of the participants based on multi-dimensional input data, in this case 44 variables describing lipoprotein metabolism. The position on the map is unique and dependent on the input variable profile; thus each individual is always in the same place on each component plane. The P value below each component plane indicates the probability of observing equivalent regional variability for random data. Abbreviations: SOM, self-organizing map; VLDL, very-low-density lipoprotein; IDL, intermediate-density lipoprotein; LDL, low-density lipoprotein; HDL, high-density lipoprotein; XXL, extremely large; XL, very large; L, large; M, medium; S, small; XS, very small; apoA-I, apolipoprotein A-I; apoB, apolipoprotein B.
      Fig. 2. Characteristics of the lipoprotein subclass profiles for the SOM-based population subgroups.
      Analysis details and abbreviations are as explained in the caption for Fig. 1. Subgroup I is characterised by the lowest concentrations of apoB and coherently low concentrations for all apoB-containing lipoprotein subclasses. The opposite is the case for subgroup IV. Subgroups II and III have intermediate and rather similar concentrations of apoB. However, subgroup II is characterised by rather high concentrations of VLDL subclasses and rather low concentrations of IDL and LDL subclasses. The situation is the opposite for subgroup III. The XS-VLDL subclass shows intermediary behaviour between VLDL and LDL subclasses in subgroups II and III. The values shown in the histograms are z-scores of log-transformed measures ((subgroup mean – all data mean)/all data standard deviation). Note that while the mean apoB concentrations differ between the subgroups, their distributions overlap heavily, particularly between the adjacent subgroups. The subgroup interpretations are therefore characteristic of the entire group and not necessarily of a single individual within the group, exactly as anticipated in population epidemiology [
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ,
      • Mäkinen V.P.
      • Forsblom C.
      • Thorn L.M.
      • Wadén J.
      • Gordin D.
      • et al.
      Metabolic phenotypes, vascular complications, and premature deaths in a population of 4,197 patients with type 1 diabetes.
      ,
      • Kumpula L.S.
      • Mäkelä S.M.
      • Mäkinen V.P.
      • Karjalainen A.
      • Liinamaa J.M.
      • et al.
      Characterization of metabolic interrelationships and in silico phenotyping of lipoprotein particles using self-organizing maps.
      ,
      • Ala-Korpela M.
      • Davey Smith G.
      Metabolic profiling-multitude of technologies with great research potential, but (when) will translation emerge?.
      ].
      Subgroup I is characterised by the lowest, and subgroup IV the highest, mean apoB concentration and the related triglyceride and cholesterol concentrations. The key separators for the adjacent subgroups II and III are VLDL and LDL particle concentrations, respectively. Thus, subgroup II is characterised by elevated triglycerides and rather low cholesterol; the situation is reversed for subgroup III. The lipoprotein subclass particle concentration histograms together with apoB and apoA-I concentrations for the four subgroups are illustrated in Fig. 2. The characteristics and behaviour of the VLDL, IDL and LDL subclasses are systematic, though not identical, within and between the subgroups. The HDL subclasses behave in a more heterogeneous manner, although subgroup I is characterised by the highest and subgroup IV by the lowest HDL particle concentrations.

      3.2 Risk of coronary heart disease by multivariate and univariate subgroups

      Results from the survival analyses in FINRISK97 for the SOM-based population subgroups are presented in Fig. 3A and for the apoB quartiles in Fig. 3B. The Kaplan-Meier curves based on the quartiles of circulating apoB concentrations show striking dose-response behaviour, in contrast to the subgroups from the multivariate SOM analysis. Despite substantially different lipoprotein subclass concentration profiles for SOM-based subgroups II and III (Fig. 2), their curves for incident CHD overlap heavily (Fig. 3A).
      Fig. 3. Survival analysis of multivariate and univariate population subgroups.
      In part A the subgroups are based on the SOM analyses illustrated and detailed in Figs. 1 and 2, and in part B the subgroups represent apolipoprotein B quartiles. The Kaplan-Meier curves demonstrate the 15-year cumulative event risk for incident first coronary heart disease events (7306 individuals with 575 events including fatal or nonfatal myocardial infarction, cardiac revascularization or unstable angina). Individuals with baseline CHD were excluded from the analysis. The apolipoprotein B quartiles show excellent dose-response behaviour and a better description of the overall risk than the participant subgroups from the multivariate SOM analysis.

      4. Discussion

      In this study, we present a novel application of an artificial intelligence algorithm, the self-organizing map, to define subgroups based on detailed lipoprotein profiles in large population-based cohorts. Four distinct subgroups in relation to apolipoprotein B and A-I as well as to lipoprotein subclasses were characterised, and the subgroup-specific risk for incident coronary heart disease was evaluated. An instinctive expectation might be that utilisation of comprehensive multivariate data on lipoprotein metabolism would lead to better understanding and also to better estimation of coronary heart disease risk. However, the results did not fully support this expectation: a univariate analysis, using only apolipoprotein B quartiles, led to a better and more logical estimation of incident disease risk. In spite of that, the data-driven SOM analysis gave invaluable detailed information on how lipoprotein metabolism, at the subclass level, relates to the risk of CHD. This kind of information cannot be obtained via univariate analysis and has biological and potentially translational value, though the results show that data-driven analysis is not optimal for outcome risk assessment [
      • van Smeden M.
      • Harrell Jr., F.E.
      • Dahly D.L.
      Novel diabetes subgroups.
      ].
      The subgroup with the highest concentrations for non-HDL particles (VLDL, IDL and LDL) and for circulating apolipoprotein B represented the highest risk for CHD. Conversely, the subgroup characterised by the lowest values for these measures represented the lowest risk. The intermediate subgroups with elevated triglycerides and with elevated cholesterol had substantially different VLDL and LDL subclass profiles, yet a comparable concentration of apoB and overlapping event curves. Thus, even though multidimensional and comprehensive data on lipoprotein subclass profiles were used, the apoB concentrations in the population subgroups appeared to be directly related to the CHD risk, even in the presence of major variation in cholesterol and triglyceride concentrations.
      In this context it is worth noting that the circulating apoB concentrations overlap between all four SOM-based subgroups. This is obviously in contrast to the apoB quartiles, which by definition are separate. Our interpretation is that in the multivariate SOM analysis, with an equally weighted set of lipoprotein measures, the inclusion of some variables that might not relate directly (or not as well as apoB) to incident CHD events, though they vary markedly between individuals, is likely to diminish the predictive value of the subgroups. While the multivariate data on lipoprotein subclasses are noteworthy for understanding the details of lipoprotein metabolism, a single good (in this case causal) biomarker is likely to be more useful from the predictive perspective.
      Even though the results regarding the role of apoB as a single predictive biomarker may not be instinctive from the data analysis point of view, they are not surprising from the biological perspective and add to the burgeoning evidence for the fundamental role of apoB-containing lipoprotein particles in the development of atherosclerosis and in defining the risk for CHD [
      • Ference B.A.
      • Kastelein J.J.P.
      • Ray K.K.
      • Ginsberg H.N.
      • Chapman M.J.
      • et al.
      Association of triglyceride-lowering LPL variants and LDL-C–lowering LDLR variants with risk of coronary heart disease.
      ,
      • Sniderman A.D.
      • Pencina M.
      • Thanassoulis G.
      ApoB: the power of physiology to transform the prevention of cardiovascular disease.
      ,
      • Ala-Korpela M.
      The culprit is the carrier, not the loads: cholesterol, triglycerides and apolipoprotein B in atherosclerosis and coronary heart disease.
      ,
      • Borén J.
      • Williams K.J.
      The central role of arterial retention of cholesterol-rich apolipoprotein-B-containing lipoproteins in the pathogenesis of atherosclerosis: a triumph of simplicity.
      ,
      • Skålén K.
      • Gustafsson M.
      • Rydberg E.K.
      • Hultén L.M.
      • Wiklund O.
      • et al.
      Subendothelial retention of atherogenic lipoproteins in early atherosclerosis.
      ,
      • Tabas I.
      • Williams K.J.
      • Borén J.
      Subendothelial lipoprotein retention as the initiating process in atherosclerosis: update and therapeutic implications.
      ,
      • Proctor S.D.
      • Vine D.F.
      • Mamo J.C.
      Arterial retention of apolipoprotein B(48)- and B(100)-containing lipoproteins in atherogenesis.
      ,
      • Goldstein J.L.
      • Brown M.S.
      A century of cholesterol and coronaries: from plaques to genes to statins.
      ].
      Mendelian randomization analyses in particular, using genetic instruments in large-scale studies, have played a crucial role in increasing our knowledge of the key causal molecular players in CHD [
      • Holmes M.V.
      • Ala-Korpela M.
      • Smith G.D.
      Mendelian randomization in cardiometabolic disease: challenges in evaluating causality.
      ]. A recent extensive study by Ference et al. [
      • Ference B.A.
      • Kastelein J.J.P.
      • Ray K.K.
      • Ginsberg H.N.
      • Chapman M.J.
      • et al.
      Association of triglyceride-lowering LPL variants and LDL-C–lowering LDLR variants with risk of coronary heart disease.
      ], comparing the effects of genetic variants that lower triglycerides with those that lower LDL cholesterol, convincingly showed that the clinical benefit of lowering triglycerides as well as LDL cholesterol is proportional to the absolute change in the circulating apoB concentration. Thus, the apoB-containing lipoprotein particles appear to be the key factor, not the lipids per se transported in these particles. However, the apoB protein molecule does not circulate without lipids, so if there is an apoB molecule, there are also lipid molecules, but apoB seems to be the biological component that defines the way [
      • Ference B.A.
      • Kastelein J.J.P.
      • Ray K.K.
      • Ginsberg H.N.
      • Chapman M.J.
      • et al.
      Association of triglyceride-lowering LPL variants and LDL-C–lowering LDLR variants with risk of coronary heart disease.
      ,
      • Sniderman A.D.
      • Pencina M.
      • Thanassoulis G.
      ApoB: the power of physiology to transform the prevention of cardiovascular disease.
      ,
      • Ala-Korpela M.
      The culprit is the carrier, not the loads: cholesterol, triglycerides and apolipoprotein B in atherosclerosis and coronary heart disease.
      ,
      • Borén J.
      • Williams K.J.
      The central role of arterial retention of cholesterol-rich apolipoprotein-B-containing lipoproteins in the pathogenesis of atherosclerosis: a triumph of simplicity.
      ,
      • Skålén K.
      • Gustafsson M.
      • Rydberg E.K.
      • Hultén L.M.
      • Wiklund O.
      • et al.
      Subendothelial retention of atherogenic lipoproteins in early atherosclerosis.
      ,
      • Tabas I.
      • Williams K.J.
      • Borén J.
      Subendothelial lipoprotein retention as the initiating process in atherosclerosis: update and therapeutic implications.
      ,
      • Proctor S.D.
      • Vine D.F.
      • Mamo J.C.
      Arterial retention of apolipoprotein B(48)- and B(100)-containing lipoproteins in atherogenesis.
      ,
      • Goldstein J.L.
      • Brown M.S.
      A century of cholesterol and coronaries: from plaques to genes to statins.
      ].
      These results have general implications for data-driven subgrouping in epidemiology and for potential translational applications. SOMs have been successfully used to identify metabolically different subgroups in patients with type 1 diabetes and thus to gain a deeper understanding of population diversity and multi-morbidity [
      • Ala-Korpela M.
      Data-driven subgrouping in epidemiology and medicine.
      ,
      • Lithovius R.
      • Toppila I.
      • Harjutsalo V.
      • Forsblom C.
      • Groop P.H.
      • et al.
      Data-driven metabolic subtypes predict future adverse events in individuals with type 1 diabetes.
      ,
      • Gao S.
      • Mutter S.
      • Casey A.
      • Mäkinen V.-P.
      Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
      ,
      • Mäkinen V.P.
      • Forsblom C.
      • Thorn L.M.
      • Wadén J.
      • Gordin D.
      • et al.
      Metabolic phenotypes, vascular complications, and premature deaths in a population of 4,197 patients with type 1 diabetes.
      ]. However, the risk prediction for specific clinical endpoints is a separate issue calling for careful and detailed analysis [
      • van Smeden M.
      • Harrell Jr., F.E.
      • Dahly D.L.
      Novel diabetes subgroups.
      ] and should not be conflated with exploratory studies. Even though individuals can be clustered with several different methods and on the basis of various metabolic data, these measures might not be optimal from the risk assessment perspective, and the subgrouping may thus not provide general clinical utility. Theoretically, unsupervised clustering is likely to suffer from a large number of variables that carry information not related to the outcome. However, all lipoprotein data used in the SOM analyses in this work were associated with CHD, and thus we consider this a negligible phenomenon in this study.
      Our results are conceptually consistent with a recent study in which data-driven cluster analysis was applied in patients with newly diagnosed type 2 diabetes [
      • Dennis J.M.
      • Shields B.M.
      • Henley W.E.
      • Jones A.G.
      • Hattersley A.T.
      Disease progression and treatment response in data-driven subgroups of type 2 diabetes compared with models based on simple clinical features: an analysis using clinical trial data.
      ]. In two trials related to type 2 diabetes, the authors applied a data-driven population subgrouping that had previously been presented optimistically [
      • Ahlqvist E.
      • Storm P.
      • Käräjämäki A.
      • Martinell M.
      • Dorkhan M.
      • et al.
      Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables.
      ] and compared the clinical utility of this subgroup-based approach to predicting patient outcomes with an alternative strategy of developing models for each outcome using simple patient characteristics. Their conclusion was that for the best clinical utility, approaches using specific phenotypic measures to predict specific outcomes would most likely perform better than assigning patients to subgroups [
      • Dennis J.M.
      • Shields B.M.
      • Henley W.E.
      • Jones A.G.
      • Hattersley A.T.
      Disease progression and treatment response in data-driven subgroups of type 2 diabetes compared with models based on simple clinical features: an analysis using clinical trial data.
      ]. This independent finding supports our results and interpretation in relation to the data-driven population subgrouping versus apolipoprotein B in assessing the risk of CHD.

      4.1 Conclusion

      The results presented here provide evidence to temper some of the enthusiasm towards multivariable stratification and risk profiling in epidemiology and potential translational applications. Population subgroups with distinct metabolic and risk profiles can be identified, but any incremental clinical utility should be specifically determined by rigorous testing against the most validated existing markers.

      Financial support

      PO is supported by the Emil Aaltonen Foundation. JK and MAK are supported by a research grant from the Sigrid Juselius Foundation, Finland. The cohorts and this work have also been supported by funding from the Academy of Finland, Novo Nordisk Foundation and EU.

      Author contributions

      All listed authors meet the requirements for authorship. Concept and design: PO, MAK. Clinical data: MP, MRJ, MAK. Lipoprotein analyses: MAK. Analysis plan and interpretations: PO, SK, JK, VPM and MAK. Statistical analyses: PO and SK. Draft manuscript: PO, MAK. All authors commented on the manuscript and agreed to its content. Overall responsibility: PO, MAK.

      Declaration of competing interest

      The authors declared they do not have anything to disclose regarding conflict of interest with respect to this manuscript.

      References

        • Ala-Korpela M.
        Data-driven subgrouping in epidemiology and medicine.
        Int. J. Epidemiol. 2019; 48: 374-376 https://doi.org/10.1093/ije/dyz040
        • Lithovius R.
        • Toppila I.
        • Harjutsalo V.
        • Forsblom C.
        • Groop P.H.
        • et al.
        Data-driven metabolic subtypes predict future adverse events in individuals with type 1 diabetes.
        Diabetologia. 2017; 60: 1234-1243 https://doi.org/10.1007/s00125-017-4273-8
        • Ahlqvist E.
        • Storm P.
        • Käräjämäki A.
        • Martinell M.
        • Dorkhan M.
        • et al.
        Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables.
        Lancet Diabetes Endocrinol. 2018; 6: 361-369 https://doi.org/10.1016/S2213-8587(18)30051-2
        • Seymour C.W.
        • Kennedy J.N.
        • Wang S.
        • Chang C.H.
        • Elliott C.F.
        • et al.
        Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis.
        J. Am. Med. Assoc. 2019; 321: 2003-2017 https://doi.org/10.1001/jama.2019.5791
        • Gao S.
        • Mutter S.
        • Casey A.
        • Mäkinen V.-P.
        Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.
        Int. J. Epidemiol. 2019; 48: 369-374 https://doi.org/10.1093/ije/dyy113
        • Mäkinen V.P.
        • Forsblom C.
        • Thorn L.M.
        • Wadén J.
        • Gordin D.
        • et al.
        Metabolic phenotypes, vascular complications, and premature deaths in a population of 4,197 patients with type 1 diabetes.
        Diabetes. 2008; 57: 2480-2487 https://doi.org/10.2337/db08-0332
        • Kumpula L.S.
        • Mäkelä S.M.
        • Mäkinen V.P.
        • Karjalainen A.
        • Liinamaa J.M.
        • et al.
        Characterization of metabolic interrelationships and in silico phenotyping of lipoprotein particles using self-organizing maps.
        J. Lipid Res. 2010; 51: 431-439 https://doi.org/10.1194/jlr.D000760
        • Ala-Korpela M.
        • Davey Smith G.
        Metabolic profiling-multitude of technologies with great research potential, but (when) will translation emerge?.
        Int. J. Epidemiol. 2016; 45: 1311-1318 https://doi.org/10.1093/ije/dyw305
        • Würtz P.
        • Kangas A.J.
        • Soininen P.
        • Lawlor D.A.
        • Davey Smith G.
        • et al.
        Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies.
        Am. J. Epidemiol. 2017; 186: 1084-1096 https://doi.org/10.1093/aje/kwx016
        • Kettunen J.
        • Demirkan A.
        • Würtz P.
        • Draisma H.H.M.
        • Haller T.
        • et al.
        Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA.
        Nat. Commun. 2016; 7: 11122 https://doi.org/10.1038/ncomms11122
        • Locke A.E.
        • Steinberg K.M.
        • Chiang C.W.K.
        • Service S.K.
        • Havulinna A.S.
        • et al.
        Exome sequencing of Finnish isolates enhances rare-variant association power.
        Nature. 2019; 572: 323-328 https://doi.org/10.1038/s41586-019-1457-z
        • Tukiainen T.
        • Kettunen J.
        • Kangas A.J.
        • Lyytikäinen L.P.
        • Soininen P.
        • et al.
        Detailed metabolic and genetic characterization reveals new associations for 30 known lipid loci.
        Hum. Mol. Genet. 2012; 21: 1444-1455 https://doi.org/10.1093/hmg/ddr581
        • Ference B.A.
        • Kastelein J.J.P.
        • Ray K.K.
        • Ginsberg H.N.
        • Chapman M.J.
        • et al.
        Association of triglyceride-lowering LPL variants and LDL-C–lowering LDLR variants with risk of coronary heart disease.
        J. Am. Med. Assoc. 2019; 321: 364-373 https://doi.org/10.1001/jama.2018.20045
        • Sniderman A.D.
        • Pencina M.
        • Thanassoulis G.
        ApoB: the power of physiology to transform the prevention of cardiovascular disease.
        Circ. Res. 2019; 124: 1425-1427 https://doi.org/10.1161/CIRCRESAHA.119.315019
        • Ala-Korpela M.
        The culprit is the carrier, not the loads: cholesterol, triglycerides and apolipoprotein B in atherosclerosis and coronary heart disease.
        Int. J. Epidemiol. 2019; https://doi.org/10.1093/ije/dyz068
        • Borén J.
        • Williams K.J.
        The central role of arterial retention of cholesterol-rich apolipoprotein-B-containing lipoproteins in the pathogenesis of atherosclerosis: a triumph of simplicity.
        Curr. Opin. Lipidol. 2016; 27: 473-483 https://doi.org/10.1097/MOL.0000000000000330
        • Järvelin M.R.
        • Sovio U.
        • King V.
        • Lauren L.
        • Xu B.
        • et al.
        Early life factors and blood pressure at age 31 years in the 1966 Northern Finland birth cohort.
        Hypertension. 2004; 44 (ee): 838-846 https://doi.org/10.1161/01.HYP.0000148304.33869
        • Borodulin K.
        • Vartiainen E.
        • Peltonen M.
        • Jousilahti P.
        • Juolevi A.
        • et al.
        Forty-year trends in cardiovascular risk factors in Finland.
        Eur. J. Public Health. 2015; 25: 539-546 https://doi.org/10.1093/eurpub/cku174
        • van Smeden M.
        • Harrell Jr., F.E.
        • Dahly D.L.
        Novel diabetes subgroups.
        Lancet Diabetes Endocrinol. 2018; 6: 439-440 https://doi.org/10.1016/S2213-8587(18)30124-4
        • Skålén K.
        • Gustafsson M.
        • Rydberg E.K.
        • Hultén L.M.
        • Wiklund O.
        • et al.
        Subendothelial retention of atherogenic lipoproteins in early atherosclerosis.
        Nature. 2002; 417: 750-754 https://doi.org/10.1038/nature00804
        • Tabas I.
        • Williams K.J.
        • Borén J.
        Subendothelial lipoprotein retention as the initiating process in atherosclerosis: update and therapeutic implications.
        Circulation. 2007; 116: 1832-1844 https://doi.org/10.1161/CIRCULATIONAHA.106.676890
        • Proctor S.D.
        • Vine D.F.
        • Mamo J.C.
        Arterial retention of apolipoprotein B(48)- and B(100)-containing lipoproteins in atherogenesis.
        Curr. Opin. Lipidol. 2002; 13: 461-470 https://doi.org/10.1097/00041433-200210000-00001
        • Goldstein J.L.
        • Brown M.S.
        A century of cholesterol and coronaries: from plaques to genes to statins.
        Cell. 2015; 161: 161-172 https://doi.org/10.1016/j.cell.2015.01.036
        • Holmes M.V.
        • Ala-Korpela M.
        • Smith G.D.
        Mendelian randomization in cardiometabolic disease: challenges in evaluating causality.
        Nat. Rev. Cardiol. 2017; 14: 577-590 https://doi.org/10.1038/nrcardio.2017.78
        • Dennis J.M.
        • Shields B.M.
        • Henley W.E.
        • Jones A.G.
        • Hattersley A.T.
        Disease progression and treatment response in data-driven subgroups of type 2 diabetes compared with models based on simple clinical features: an analysis using clinical trial data.
        Lancet Diabetes Endocrinol. 2019; 7: 442-451 https://doi.org/10.1016/S2213-8587(19)30087-7