History of creating the FinnGen endpoints

Concept, definitions and format, register data processing and actual endpoint algorithms

Dr. Aki Havulinna, MD Tuomo Kiiskinen, Dr. Susanna Lemmelä, Sami Koskelainen, Dr. Tero Hiekkalinna, Dr. Elisa Lahtela, Prof. Hannele Laivuori

The vision by Dr. Havulinna: create a comprehensive set of harmonized endpoints on diseases and health-related conditions, covering the whole ICD-10; provide tools to harmonize and preprocess the data; and plan and implement the algorithm that creates the actual endpoints from the definitions and register data. This work has been and will be openly available to benefit the whole Finnish and international clinical and medical research community, not only the FinnGen project.

Prior to FinnGen

The root of the endpoint concept lies in the work of Dr. Havulinna and Prof. Veikko Salomaa since 2006. We needed some harmonized, multiregister-based cardiometabolic endpoints for our research work with the FINRISK data (N=30 000). The registers included were the health care register (hospital discharges, special care outpatient visits, surgical operations and procedures), the causes-of-death register, the KELA registers of medicine purchases and drug reimbursements, and the cancer register. These registers cover almost half a century of data, during which, e.g., the Finnish-specific ICD versions 8, 9 and 10 have been used. Therefore, harmonization of the data and endpoint definitions was required.

The few cardiometabolic endpoints were soon expanded to a dozen. Havulinna created a set of SAS macros in which the rules for each endpoint were programmed manually, and which could be applied to different data sets (e.g., separate FINRISK survey years). Soon, even this approach became too tedious for continuously adding new endpoints.

Havulinna decided to create a systematic approach where endpoint definitions would be entered in a simple structured document and processed automatically by an R-script to create the actual register-based endpoints. This was the beginning of the endpoint definition Excel file and the original FINRISK endpoint scripts. Around 2016 these concepts and scripts were used in the FIMM/THL/Pharma collaboration project, which was a pilot for and precursor to the FinnGen project.

Some 200 endpoints were drafted by Prof. Hannele Laivuori and Prof. Markus Perola, based on suggestions from the participating pharma companies. Havulinna formalized the endpoint definitions in the Excel format, and together with Dr. Mervi Kinnunen and bioinformatician Elina Kilpeläinen we ran the R-scripts to create the endpoints. Havulinna substantially improved and modified the scripts to gain speed and cope with various data-related issues. The endpoint concept and scripts were now ready for a major new challenge.

FinnGen

In FinnGen the endpoint goals were set as follows (by Havulinna and Laivuori as leaders of the Clinical team):

1. The primary endpoints: Each pharmaceutical company was asked to provide a list of ~10 endpoints of main interest to them.

  • a. We divided the listed endpoints into disease categories (ICD-10 chapters)

  • b. For each disease category we established a clinical expert group of Finnish medical scholars and pharma representatives. The expert groups are listed at https://www.finngen.fi/en/clinical_expert_groups, but the structure of the groups has since changed. In the original format, each group had about a dozen members besides the lead and secretary: experts representing all Finnish university hospitals, and representatives of pharma companies with an interest in the diseases in question. Besides the six original expert groups (Neurology, Gastroenterology, Rheumatology, Pulmonary diseases, Cardiometabolic diseases, Oncology), several new groups have emerged later. Original member lists can be seen in the list of collaborators in earlier FinnGen publications.

  • c. The expert groups helped in creating and fine-tuning the endpoints of interest, with the amount of contribution varying by group.

2. PheWAS approach: This constitutes the bulk of the endpoints. Given the large amount of data to be collected, and the overrepresentation of diseased individuals (because FinnGen samples are to a large extent hospital biobank samples), we wanted to create as wide a range of disease endpoints as possible, e.g., for a hypothesis-free study of genetic associations of diseases.

Next, doctoral researcher Tuomo Kiiskinen, MD, joined the Clinical team. This was the beginning of the huge job of creating the FinnGen endpoint library for PheWAS. We proceeded by adding one ICD-10 chapter at a time, prioritizing the chapters more important to FinnGen. Kiiskinen's work on one chapter took 3-4 weeks, after which Havulinna made initial semi-automated checks to ensure the consistency of the hierarchical structure and to catch obvious errors in diagnosis codes and elsewhere. The original approach (by Kiiskinen) for each chapter was as follows:

  1. Roughly follow the tree-like structure of ICD-10 (https://icd.who.int/browse10/2019/en#/) to the level of detail that would still make sense in the context of genetic analyses. For example, .8 is usually "other specified" and .9 is "unspecified", so they were always combined into Other / unspecified (= non aliter specificatus, NAS): if the "other" is not specified further, it is in effect equal to unspecified.

  2. Manually match the ICD-10 codes with the Finnish ICD-9 and ICD-8 codes. A 1:1:1 match is usually impossible, so for every endpoint a decision was needed: either modify the ICD-10 structure, usually by combining codes, or, when the earlier versions were too outdated compared to ICD-10, drop the ICD-8/9 codes into the NAS category.

  3. For every endpoint this was NOT straightforward; besides medical knowledge it required studying the diseases (Terveysportti, Wikipedia, literature, etc.) to make the best decisions, so it really took a lot of time.

  4. See if there are any specific drug reimbursement codes that would cover these endpoints

  5. See if there are any disease-specific drug purchase ATC codes that would match these endpoints

  6. Check if the clinical groups had any specific requests, and either a) modify the already made endpoints or b) create the requested endpoints as additional “custom/design” endpoints (= composite endpoints)

  7. Submit the work to Havulinna for initial checks, potential corrections, and for running the actual endpoints in the FINRISK data, to see that everything works and what the endpoint case distributions look like

  8. Present the work (Kiiskinen, Havulinna) to the primary clinical expert group and to others.
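The rule in step 1 for merging the .8 and .9 subcodes can be illustrated with a minimal sketch. It assumes ICD-10 codes given as plain strings and looks only at the fourth character; the function name and the "_NAS" suffix are hypothetical and are not taken from the actual endpoint definition files. In the real library such merges were decided case by case.

```python
def collapse_subcode(icd10_code: str) -> str:
    """Map an ICD-10 code to a grouping key, merging .8 and .9 into one bucket.

    Hypothetical helper: the "_NAS" suffix is an illustrative naming choice.
    """
    # Normalize: drop the dot and surrounding whitespace, use upper case.
    code = icd10_code.strip().upper().replace(".", "")
    # Three-character categories (e.g. "I10") have no subcode to collapse.
    if len(code) < 4:
        return code
    category, subcode = code[:3], code[3]
    # "other specified" (.8) and "unspecified" (.9) share one combined bucket.
    if subcode in ("8", "9"):
        return f"{category}_NAS"
    return f"{category}{subcode}"


if __name__ == "__main__":
    for raw in ["K50.0", "K50.8", "K50.9", "I10"]:
        print(raw, "->", collapse_subcode(raw))
```

Grouping on the returned key puts K50.8 and K50.9 in the same category while keeping K50.0 separate.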

At this point we did not have the Finnish ICD-8 or ICD-9 in an electronic format, which would have helped a lot. We only had PDF copies of the original books, scanned and processed by Havulinna:

ICD-9: http://urn.fi/URN:NBN:fi-fe201701261356

ICD-8: http://urn.fi/URN:NBN:fi-fe201710058910

Our first release of the endpoint library (January 2018), for FinnGen DF1, contained 2057 endpoints covering ICD-10 Chapters 1-14. The endpoint algorithm already had the “INCLUDE” and “CONDITION” rules, and the possibility to create sex-specific endpoints. For each endpoint we also provided the control exclusion/eligibility rules, which Havulinna determined mainly algorithmically, to exclude closely resembling diseases from the controls.
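The idea of control exclusion can be sketched as a three-way classification: individuals with the endpoint are cases, individuals with a closely resembling disease are left out of the analysis, and everyone else is a control. The sketch below is only an illustration under stated assumptions: the function name, the use of three-character ICD-10 categories, and the example code sets are hypothetical and do not reproduce any actual FinnGen endpoint definition or the INCLUDE/CONDITION rule syntax.

```python
def assign_status(person_codes, case_codes, control_exclusion_codes):
    """Return 'case', 'excluded' or 'control' for one individual.

    All three arguments are iterables of diagnosis categories (assumed to be
    three-character ICD-10 categories in this illustration).
    """
    codes = set(person_codes)
    # Any diagnosis in the case definition makes the person a case.
    if codes & set(case_codes):
        return "case"
    # A closely resembling disease disqualifies the person as a control.
    if codes & set(control_exclusion_codes):
        return "excluded"
    return "control"


if __name__ == "__main__":
    # Illustrative code sets only, not an actual endpoint definition.
    case_codes = {"E11"}
    exclusion_codes = {"E10", "E13", "E14"}
    for person in (["E11", "I10"], ["E10"], ["I10"]):
        print(person, "->", assign_status(person, case_codes, exclusion_codes))
```

Checking the case definition first means a person with both a case code and an exclusion code stays a case; whether that is the right order for a given endpoint is a design decision this sketch does not settle.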

Kiiskinen and Havulinna, with support from Laivuori, provided a unique combination of expertise, without which the FinnGen endpoint library would not exist.

FinnGen, further developments

For each consecutive data freeze (DF) we refined existing endpoints where problems were found, and added new endpoints based on suggestions from the FinnGen community. After the first FinnGen DFs, we improved some endpoints based on GWAS experience, e.g., by excluding T1D from T2D cases. The overlap was detected because of a clear HLA signal in the T2D GWAS; HLA associations are known to be specific to T1D because of its autoimmune nature. A similar thing happened with UC vs Crohn’s disease.

For each DF Havulinna also improved his endpoint algorithm, written as R-scripts, introduced some new concepts there, and usually also created the actual endpoints and preprocessed the register data. The Register team lead at that time, Dr. Kati Kristiansson, ran the endpoint scripts for some DFs, with Havulinna doing the debugging. Dr. Susanna Lemmelä joined the Clinical team and the Register team in Feb. 2019, and from DF3 onwards she took over much of the register data processing and the actual endpoint data creation.

The original FinnGen study permission was, for several DFs, restricted so that we could only release the derived endpoints based on several registers, and not any original register data, which was kept within the FinnGen register team at THL. Since DF2 we have released endpoint longitudinal data, containing all events and source register information, in addition to the original first-ever event release.

The original endpoint scripts used the original register data in the wide format, largely unchanged. A few things, such as EVENT_AGE, were added in the preprocessing. In 2019 the register permission became more liberal, so that the diagnosis codes could also be released to the FinnGen sandbox (a secure computing environment that allows researchers from all over the world to do research and analyses on the enriched data of the FinnGen project). We then discussed a new, harmonized longitudinal register data format which would contain only the essentials (e.g., source register, age at diagnosis, diagnosis codes) to allow easy browsing of the events even without endpoint scripts. Drs. Andrea Ganna and Juha Karjalainen participated in formulating the detailed longitudinal data concept, and Lemmelä quickly prepared the necessary detailed longitudinal data scripts with help from Havulinna. Detailed longitudinal data has been released using these scripts since mid-2019 (DF4). It took some time before the longitudinal data format was adopted as the basis for endpoint creation, as this required a complete rewrite of the original endpoint scripts.
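The difference between the wide register format and a longitudinal format with only the essentials can be shown with a small sketch. Apart from EVENT_AGE, every name here is an assumption made for illustration (the record keys person_id, birth_date, event_date and dg1..dgN, the output keys, and the 365.25-day year); this is not the actual FinnGen preprocessing.

```python
from datetime import date


def event_age(birth: date, event: date) -> float:
    """Age in years at the event, using 365.25-day years, rounded to 2 decimals."""
    return round((event - birth).days / 365.25, 2)


def wide_to_long(record: dict, source: str) -> list[dict]:
    """Emit one long-format row per non-empty diagnosis column of a wide record.

    `record` is a hypothetical wide-format row; diagnosis columns are assumed
    to be named dg1, dg2, ...
    """
    age = event_age(record["birth_date"], record["event_date"])
    rows = []
    for column, value in record.items():
        # Keep only diagnosis columns that actually carry a code.
        if column.startswith("dg") and value:
            rows.append({
                "person_id": record["person_id"],
                "source": source,
                "EVENT_AGE": age,
                "code": value,
            })
    return rows


if __name__ == "__main__":
    wide = {
        "person_id": "P1",
        "birth_date": date(1960, 1, 1),
        "event_date": date(2000, 1, 1),
        "dg1": "I21",
        "dg2": "",
        "dg3": "E11",
    }
    for row in wide_to_long(wide, source="INPAT"):
        print(row)
```

Each output row stands alone (who, which register, at what age, which code), which is what makes the events easy to browse without endpoint scripts.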

The Covid era from 2020 onwards led to several changes for the FinnGen endpoints as well. Kiiskinen had left after DF5 to focus on his PhD. Havulinna did the DF6 endpoint definition update and added the missing ICD-10 chapters, 15-22. Chapter 15 (Pregnancy, childbirth and the puerperium) was harmonized by Laivuori and Havulinna, while chapters 16-22 still remain unharmonized.

Also during DF6, Dr. Tero Hiekkalinna started programming the Endpointter, an alternative, Python implementation of Havulinna’s endpoint algorithm. Endpointter was written from the perspective of the detailed longitudinal source data, whereas the original R-scripts were written for the wide-format register data as received from the original register authorities. Havulinna has also prepared a new version of the R-scripts, adapted for the detailed longitudinal register data. Starting in Jan 2021 (DF7), the endpoints have been created using the Endpointter scripts. Starting with DF6v3, bioinformatician Sami Koskelainen took over the endpoint data creation process, running the endpoint scripts.

Dr. Elisa Lahtela joined FinnGen and the Clinical team in April 2020, and in DF7 took over the endpoint definition updating tasks from Havulinna, who (along with Laivuori) switched mainly to an advisory role when FinnGen2 started (August 2020). Endpoint changes were for the most part frozen, and during DF7-8 we removed redundant endpoints from the endpoint set and decided on a core set of endpoints to avoid running expensive GWAS for closely correlated endpoints.

Since DF3 we have also committed to improved QC, in collaboration with the register and analysis teams. Lemmelä created quality control R-scripts for the endpoint and endpoint longitudinal data and ran case/control correlations, Jaccard indices and clusters for the FinnGen endpoints. Since DF7, Havulinna has written a set of R-scripts for automatically updating the endpoints from a structured Excel file containing the changes; for checking the consistency of the endpoints, including the hierarchy; and for creating an endpoint definition change log. The register team now follows and updates a rigorous QC procedure to ensure the quality of the register data, the endpoint definitions, and the endpoints created from the register data. Starting in DF9-10, Koskelainen, with help from Lemmelä, who is a product owner of the FinnGen Clinical endpoints, has taken the main responsibility for the whole FinnGen endpoint data creation process.
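The Jaccard index mentioned above measures the overlap of two endpoints as the number of shared cases divided by the number of individuals who are a case in either endpoint. The following is a minimal sketch, not the QC scripts themselves (those were written in R); the function names and the example endpoint names are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(endpoint_cases: dict) -> dict:
    """Jaccard index for every unordered pair of endpoints.

    `endpoint_cases` maps an endpoint name to the set of case identifiers.
    """
    return {
        (e1, e2): jaccard(endpoint_cases[e1], endpoint_cases[e2])
        for e1, e2 in combinations(sorted(endpoint_cases), 2)
    }


if __name__ == "__main__":
    # Hypothetical endpoint names and case identifiers.
    cases = {
        "ENDPOINT_A": {"P1", "P2", "P3"},
        "ENDPOINT_B": {"P2", "P3", "P4"},
        "ENDPOINT_C": {"P9"},
    }
    for pair, value in pairwise_jaccard(cases).items():
        print(pair, value)
```

Pairs with an index close to 1 share almost all of their cases, which is the kind of redundancy that makes running a separate GWAS for each of them wasteful.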
