Endpoint and endpoint longitudinal data
This page was last updated for R12.
Sandbox directory
The endpoint and endpoint longitudinal data files are available in the following Sandbox directory:
/finngen/library-red/finngen_R[RELEASE]/phenotype_1.0/
Data files
The endpoint data is available in the following file:
data/finngen_R[RELEASE]_endpoint_1.0.txt.gz
The endpoint longitudinal data is available in the following file:
data/finngen_R[RELEASE]_endpoint_longitudinal_1.0.txt.gz
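As a minimal sketch, the full file paths for a given release can be assembled from the directory and file-name patterns above. It assumes that the `data/` paths are relative to the Sandbox directory and that the release placeholder is replaced by the bare release number (for example 12); the function name `endpoint_paths` is hypothetical.

```python
from pathlib import PurePosixPath

# Directory and file-name patterns copied from this page.
# Assumption: the release placeholder is the bare release number, e.g. 12.
SANDBOX_DIR = "/finngen/library-red/finngen_R{release}/phenotype_1.0/"
ENDPOINT_FILE = "data/finngen_R{release}_endpoint_1.0.txt.gz"
LONGITUDINAL_FILE = "data/finngen_R{release}_endpoint_longitudinal_1.0.txt.gz"


def endpoint_paths(release: int) -> dict:
    """Return the full paths of both endpoint files for one release."""
    base = PurePosixPath(SANDBOX_DIR.format(release=release))
    return {
        "endpoint": str(base / ENDPOINT_FILE.format(release=release)),
        "longitudinal": str(base / LONGITUDINAL_FILE.format(release=release)),
    }


paths = endpoint_paths(12)
print(paths["endpoint"])
print(paths["longitudinal"])
```

Only the path strings are built here; nothing is opened, so the sketch also runs outside the Sandbox.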
Endpoints
FinnGen endpoints have been created to comprehensively cover the spectrum of medical diagnoses from all medical areas. They follow the treelike subtyping of the ICD-10 classification, which starts from the anatomy-based upper-level chapters and then divides step by step into more detailed subtypes.
In FinnGen, this treelike subtyping was followed to a level of detail that was seen as reasonable in the context of GWAS and PheWAS.
The endpoints are named starting with the first character of the ICD-10 diagnosis code, followed by the number of the ICD-10 chapter and ending with an endpoint name that is:
as short as possible
explanatory enough to enable one to reason what the endpoint is
For example, Parkinson's disease:
ICD-10 chapter: 6th, Diseases of the nervous system
ICD-10 code: G20
Endpoint name: G6_PARKINSON
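The naming rule above can be illustrated with a small parser that splits a name into the ICD-10 letter, the chapter number, and the short name. This is only a sketch: the regular expression and the `EndpointName` type are assumptions made for illustration, and names that do not follow the ICD-derived pattern simply return `None`.

```python
import re
from typing import NamedTuple, Optional


class EndpointName(NamedTuple):
    icd_letter: str   # first character of the ICD-10 diagnosis code
    chapter: int      # number of the ICD-10 chapter
    short_name: str   # short explanatory name


# Assumed pattern: one capital letter, chapter digits, underscore, short name.
_PATTERN = re.compile(r"^([A-Z])(\d+)_(\w+)$")


def parse_endpoint_name(name: str) -> Optional[EndpointName]:
    """Split an ICD-derived endpoint name; return None if it does not match."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return EndpointName(match.group(1), int(match.group(2)), match.group(3))


print(parse_endpoint_name("G6_PARKINSON"))
```

For `G6_PARKINSON` this yields the letter `G`, chapter 6, and the short name `PARKINSON`, matching the example above.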
In addition to endpoints named after the ICD classification as explained above, we also created endpoints according to specific requests, for example definitions specific to the age of onset, or combinatory endpoints covering entities with subtypes from different chapters of the ICD system, such as autoimmune diseases or alcohol-related disorders. The final set of medical specialty-specific endpoint definitions was approved by the FinnGen clinical expert groups, made up of leading experts in their respective medical fields. These clinical groups are formed of Finnish physicians and in some cases also include subject experts from pharma.
The ICD-10 codes were then matched to the Finnish versions of ICD-9 and ICD-8 when possible. When ICD-9 and ICD-8 did not reach the level of detail of ICD-10, but the level of these older systems was detailed enough, the subtyping of the endpoints was left at that level. When the more detailed, unmatched ICD-10 level was justified, the endpoint was created according to ICD-10 and left unharmonized with ICD-9 and/or ICD-8.
In most cases, the same ICD codes were used for the hospital and cause-of-death registries, but when the differences were well justified, different classifications were used for these registries.
In addition to the ICD codes, ATC codes, KELA drug reimbursement codes, and operation codes were used in the endpoint definitions. These were used only when the codes were specific to the endpoint: medications or operations were not used to define an endpoint if they also have other indications.
For cancers, to maximize specificity, we used the cancer registry-specific ICD-O-3 system, and ICD-10 codes were only used in the cause-of-death registry.
In general, we prefer specificity over sensitivity in the endpoint definition parameters. GWAS in this setting has highly unbalanced case-control ratios, with more controls than cases, and we do not want to dilute the control groups with lower-probability cases.
If needed, age- or sex-specific “pre-condition” parameters were used in the endpoint definition. In addition, “condition” parameters, which exclude participants with or without other specific endpoints, were used when appropriate. For example, all type 1 diabetes cases were excluded from the type 2 diabetes cases and vice versa, and likewise for ulcerative colitis and Crohn's disease.
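The mutual exclusion in the example above can be pictured as a set operation. The sketch below is a simplified illustration, not the actual endpoint pipeline: it assumes each endpoint starts from a set of candidate case IDs and that "vice versa" means individuals in both sets are dropped from both. The function name and the IDs are hypothetical.

```python
def apply_mutual_exclusion(cases_a: set, cases_b: set) -> tuple:
    """Remove individuals who are candidate cases for both endpoints."""
    overlap = cases_a & cases_b
    return cases_a - overlap, cases_b - overlap


# Hypothetical candidate case IDs for two mutually exclusive endpoints.
t1d_candidates = {"ID1", "ID2", "ID3"}
t2d_candidates = {"ID3", "ID4"}

t1d_cases, t2d_cases = apply_mutual_exclusion(t1d_candidates, t2d_candidates)
print(sorted(t1d_cases), sorted(t2d_cases))
```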
Finally, the treelike form of the endpoint catalog was created by listing all subtypes of an upper-level endpoint in its INCLUDE column, so that cases of the included lower-level subtype endpoints are also cases of the upper-level endpoint.
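How cases move up the tree through INCLUDE can be sketched as a recursive union. The representation is an assumption made for illustration: INCLUDE is modeled as a dictionary from an endpoint to the list of its included subtypes, and the endpoint names and IDs are hypothetical.

```python
def expand_cases(own_cases: dict, include: dict) -> dict:
    """Return cases per endpoint after adding the cases of all included subtypes."""
    resolved = {}

    def resolve(endpoint, visiting=()):
        if endpoint in resolved:
            return resolved[endpoint]
        if endpoint in visiting:
            raise ValueError(f"cycle in INCLUDE at {endpoint}")
        cases = set(own_cases.get(endpoint, ()))
        # Cases of every included subtype also count as cases here.
        for child in include.get(endpoint, ()):
            cases |= resolve(child, visiting + (endpoint,))
        resolved[endpoint] = cases
        return cases

    for endpoint in set(own_cases) | set(include):
        resolve(endpoint)
    return resolved


# Hypothetical three-level tree: UPPER includes SUB_A, which includes SUB_B.
own = {"UPPER": {"ID1"}, "SUB_A": {"ID2"}, "SUB_B": {"ID3"}}
inc = {"UPPER": ["SUB_A"], "SUB_A": ["SUB_B"]}
print(sorted(expand_cases(own, inc)["UPPER"]))
```

The cycle check is a defensive choice for the sketch; a tree-shaped catalog should never trigger it.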
Creating the endpoint catalog is an ongoing process, and new endpoints can be added to or removed from the catalog.
Endpoint data file
ENDPOINT_X_AGE (starting from DF12: ENDPOINT_X_FU_AGE)
cases: the individual's age at the first event for the ENDPOINT
controls: age at the end of the follow-up, which is one of:
age at the time when register follow-up ends in the registers
age of death if the individual has died
age of emigration if the person has moved abroad
ENDPOINT_X_NEVT is the number of events for the individual for the ENDPOINT X.
ENDPOINT_X_EXALLC and ENDPOINT_X_EXMORE. Some endpoints have a stricter delimitation of the controls, where controls are removed based on certain criteria. For example, in the _EXALLC endpoints, all cancers have been removed from the controls, and in the _EXMORE endpoints, a stricter control cut-off has been used. Excluded individuals will be “NA” in the data.
FU_END_AGE / AGE_AT_DEATH_OR_END_OF_FOLLOWUP / DEATH_AGE is the individual's age at the end of the follow-up. It is the age of death if the individual has died, the age of emigration if the person has moved abroad, or the age at the time when register follow-up ends in most of the registers.
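The rule for the age at the end of the follow-up can be sketched as follows. The page lists the three possible end points but not how they are prioritized, so the sketch assumes that whichever applies earliest ends the follow-up. The function and parameter names are hypothetical.

```python
from typing import Optional


def follow_up_end_age(register_end_age: float,
                      death_age: Optional[float] = None,
                      emigration_age: Optional[float] = None) -> float:
    """Age at end of follow-up: death, emigration, or end of register follow-up.

    Assumption: the earliest applicable age ends the follow-up.
    """
    candidates = [register_end_age]
    if death_age is not None:
        candidates.append(death_age)
    if emigration_age is not None:
        candidates.append(emigration_age)
    return min(candidates)


print(follow_up_end_age(80.0))                       # still followed in registers
print(follow_up_end_age(80.0, death_age=72.5))       # died during follow-up
print(follow_up_end_age(80.0, emigration_age=40.2))  # moved abroad
```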
The endpoint data file contains the following columns:
Column | Description |
---|---|
| FinnGen ID |
| Year of DNA sample collection |
| Age at DNA sample collection |
FU_END_AGE* | Age at the end of the follow-up |
| Gender (male/female/NA) |
| Endpoint name |
ENDPOINT_X_AGE | Age at the first event (cases) or at the end of the follow-up (controls); ENDPOINT_X_FU_AGE starting from DF12 |
| Year of onset |
ENDPOINT_X_NEVT | Number of events |
ENDPOINT_X_EXMORE | Endpoint-specific control definition, only for selected endpoints: More stringent control definition |
ENDPOINT_X_EXALLC | Endpoint-specific control definition, only for selected endpoints: All cancer cases have been excluded from controls |
*In the DF12 endpoint data, the columns AGE_AT_DEATH_OR_END_OF_FOLLOW_UP and DEATH_FU_AGE replaced the column FU_END_AGE and contain the same information.
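Because the follow-up age column was renamed in DF12, code that must work across releases may want to look the column up by name. The sketch below uses the column names as written in the note above. The order of preference between the two DF12 columns is an arbitrary assumption, and the function name is hypothetical.

```python
# Column names as written on this page; the preference order is an assumption.
FOLLOW_UP_AGE_CANDIDATES = (
    "FU_END_AGE",
    "AGE_AT_DEATH_OR_END_OF_FOLLOW_UP",
    "DEATH_FU_AGE",
)


def find_follow_up_age_column(header) -> str:
    """Return the first follow-up age column present in a list of column names."""
    for name in FOLLOW_UP_AGE_CANDIDATES:
        if name in header:
            return name
    raise KeyError("no follow-up age column found in header")


print(find_follow_up_age_column(["ENDPOINT_X_NEVT", "DEATH_FU_AGE"]))
```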
Endpoint longitudinal data file
In the endpoint data file, only the first recorded event for each endpoint is given. That is, the event age is the age at the first event, e.g., the first ICD-10 code of the endpoint definition.
In the endpoint longitudinal data, all events of the endpoint are given instead, meaning that all event ages and source registries of the individual codes are recorded.
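The relation between the two files can be sketched by collapsing longitudinal events into a first-event age and an event count per individual and endpoint. This is a toy illustration on hypothetical tuples, not a reader for the real file, and whether the count here matches the number of events reported in the endpoint data file is not confirmed by this page.

```python
from collections import defaultdict


def summarize_events(events):
    """Collapse (person_id, endpoint, event_age) tuples per person and endpoint.

    Returns two dicts keyed by (person_id, endpoint):
    the earliest event age and the number of events.
    """
    first_age = {}
    n_events = defaultdict(int)
    for person_id, endpoint, age in events:
        key = (person_id, endpoint)
        n_events[key] += 1
        if key not in first_age or age < first_age[key]:
            first_age[key] = age
    return first_age, dict(n_events)


# Hypothetical longitudinal rows: two events for ID1, one for ID2.
rows = [
    ("ID1", "G6_PARKINSON", 67.3),
    ("ID1", "G6_PARKINSON", 65.1),
    ("ID2", "G6_PARKINSON", 70.0),
]
first, counts = summarize_events(rows)
print(first[("ID1", "G6_PARKINSON")], counts[("ID1", "G6_PARKINSON")])
```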
All columns in the endpoint longitudinal file are described below.
Field | Description |
---|---|
| FinnGen ID |
| Source register |
| Age at the event |
| Year at the event |
| ICD version |
| Endpoint name |
In the endpoint longitudinal data, you may see the following register abbreviations (the equivalent abbreviation in the detailed longitudinal data file is shown in parentheses):
AVO = Primary health care outpatient visits register (PRIM_OUT)
ERIK_AVO = Specialist outpatient Hilmo register (OUTPAT)
HILMO = Inpatient Hilmo register (INPAT)
ERIK_OPER = Outpatient Hilmo register - Operations (OPER_OUT)
OPER = Inpatient Hilmo register - Operations (OPER_IN)
KELAEK = Kela drug reimbursement register (REIMB)
LAAKE_ATC = Kela drug purchase register (PURCH)
CANCER = Cancer register (CANC)
DEATH = Cause of death register (DEATH)
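When combining the endpoint longitudinal data with the detailed longitudinal data, the abbreviation pairs listed above can be kept in a lookup table. The sketch below copies the pairs from the list; the helper name `to_detailed` is hypothetical.

```python
# Endpoint longitudinal abbreviation -> detailed longitudinal abbreviation,
# copied from the list on this page.
REGISTER_TO_DETAILED = {
    "AVO": "PRIM_OUT",
    "ERIK_AVO": "OUTPAT",
    "HILMO": "INPAT",
    "ERIK_OPER": "OPER_OUT",
    "OPER": "OPER_IN",
    "KELAEK": "REIMB",
    "LAAKE_ATC": "PURCH",
    "CANCER": "CANC",
    "DEATH": "DEATH",
}


def to_detailed(source: str) -> str:
    """Translate a register abbreviation; raises KeyError for unknown values."""
    return REGISTER_TO_DETAILED[source.strip().upper()]


print(to_detailed("AVO"))
print(to_detailed("LAAKE_ATC"))
```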