Chapter 6: Data Sources for Registries

1. Introduction

Identification and evaluation of suitable data sources should be completed within the context of the registry purpose and availability of the data of interest. A single registry may have multiple purposes and integrate data from various sources. (See Case Example 14.) While some data in a registry are collected directly for registry purposes (primary data collection), the push to use real-world data and real-world evidence in decision making has resulted in an increased focus on the incorporation of data collected primarily for other purposes. Examples include demographic information from a hospital admission, discharge, and transfer system; medication use from a pharmacy database; disease and treatment information, such as details of diagnostic and therapeutic procedures or treatment plans from ancillary clinical systems (e.g., cardiology information systems, radiology information systems, surgical systems, oncology systems, etc.); electronic health records (EHRs); and medical claims databases. In addition, observational studies can generate as many hypotheses as they test, and other sources of data can be merged with the primary data collection to allow for analyses of questions that emerge during the course of the registry.

The burden of registry participation can be significantly reduced with broader use of these sources. However, high standards for quality, including documentation of transformations and traceability of data in the registry to the source, are important considerations. This chapter will review the various sources of data, comment on their strengths and weaknesses, and provide some examples of how data collected from different sources can be integrated to help answer important questions. Information on the technical aspects of linking or integrating existing data sources into registries can be found in the supplemental eBook on Tools and Technologies for Registry Interoperability.¹

2. Intended Uses for Data Elements

The types of data to be collected are guided by the registry purpose, design, and data collection methods. The form, organization, and timing of required data are important components in determining appropriate data sources. Data elements can be grouped into categories that identify the specific variable or construct they are intended to describe. One framework for grouping data elements into categories follows:

Identify patients—Rather than incorporate all possible data of interest, many registries use patient identifiers to link data from secondary sources in order to support a specific analysis. In these registries, data elements are linked to the specific patient through a unique patient identifier or registry identification number. However, the potential for mismatch errors and duplications must be managed. The use of patient identifiers may not be possible in all registries due to the additional legal requirements that usually apply to the use and disclosure of such data. (See Chapter 7.)
Determine eligibility—The eligibility criteria in a registry protocol or study plan determine the group that will be included in the registry. These criteria may be very broad or restrictive, depending on the purpose. Criteria often include demographics (e.g., target age group), a disease diagnosis, a treatment, or diagnostic procedures or laboratory tests. Healthcare provider, healthcare facility or system, and insurance criteria may also be included in certain types of registries (e.g., following care patterns of specific conditions at large medical centers compared with small private clinics).
Describe treatments and tests—Treatments and tests are necessary to describe the natural history of patients. Treatments can include pharmaceutical, biological, or device therapies, or procedures such as surgery or radiation. Evaluation of the treatment itself is often a primary focus of registries (e.g., treatment safety and effectiveness over 5 years). Results of laboratory testing or diagnostic procedures may be included as registry outcomes and may also be used in defining a diagnosis or condition of interest.
Understand confounders—Confounders are elements or factors that have an independent association with the outcomes of interest. These are particularly important because patients are typically not randomized to therapies in registries. Confounders such as comorbidities (disease diagnoses and conditions) can confuse analysis results and interpretation of causality. Information on the healthcare provider, treatment facility, concomitant therapies, or insurance may also be considered. Unknown confounders, or those not recorded in the registry, pose particular challenges for the analysis of patient outcomes. If external or linked data sources can provide values for confounder variables that are not included in the registry, they could ultimately help reduce bias in the analysis and interpretation of patient outcomes.
Measure outcomes—The focus of this document is on patient outcomes. Outcomes are end results and must be defined for each condition. Outcomes may include patient-reported outcomes (PROs). In some registries, surrogate markers, such as biomarkers, or other intermediate outcomes (e.g., hemoglobin A1c levels in diabetes) that are highly reflective of the longer-term end results are used.

Within this framing, a given type of data may be present in multiple categories. Consider, for example, diagnosis. One diagnosis (e.g., diabetes) may be used to determine eligibility for enrollment into a registry, while other diagnoses (e.g., heart failure, atrial fibrillation) may be captured as potential confounding variables. While both pieces of information may be present in the same secondary source, different quality requirements (e.g., more stringent requirements to verify eligibility for enrollment) may mean that the source does not satisfy both purposes.

3. Types of Data

Before considering the potential sources for registry data, it is important to understand the types of data that may be collected in a registry. Several types of data that may be gathered from other sources in some registries are described below. A given data source may contain data from more than one of these categories.

Patient identifiers—Depending on the data sources required, some registries may use certain personal identifiers for patients in order to locate them in other databases and link the data. For example, Social Security numbers (SSNs) in combination with other personal identifiers can be used to identify individuals in the National Death Index (NDI). Combinations of variables (e.g., gender, date of birth, last 4 digits of SSN) are also used as the input to many hashing algorithms.
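As an illustration of hashing a combination of variables, the following is a minimal sketch rather than a description of any particular registry's method. It assumes a keyed SHA-256 digest (HMAC) over gender, date of birth, and the last 4 digits of the SSN; the function names, the normalization rule, and the secret key are hypothetical choices made for this example.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Strip accents, punctuation, and whitespace, then uppercase,
    so that trivial formatting differences do not change the hash."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).upper()


def linkage_token(gender: str, date_of_birth: str, ssn_last4: str,
                  secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of the combined identifiers.

    The key is an assumption of this sketch: both data holders would
    need the same key for their tokens to be comparable."""
    message = "|".join(normalize(v) for v in (gender, date_of_birth, ssn_last4))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


KEY = b"hypothetical-shared-secret"  # placeholder, not a real key

# The same person recorded with different formatting yields the same token.
token_a = linkage_token("F", "1960-02-29", "1234", KEY)
token_b = linkage_token(" f ", "1960/02/29", "1234", KEY)

# A one-day difference in date of birth yields a different token.
token_c = linkage_token("F", "1960-02-28", "1234", KEY)
```

Because any difference that survives normalization changes the token completely, a scheme of this kind only matches records whose identifiers agree exactly, which is one reason mismatch errors still have to be managed separately.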

Patient contact information, such as address and phone numbers, may be collected to support tracking of participants over time. Information for additional contacts (e.g., family members) may be collected to support followup in cases where the patient cannot be reached. In many registries, patient informed consent and appropriate privacy authorizations are required so that personal identifiers can be used for registry purposes. In some registries, the use of personal identifiers may not be possible. Chapter 7 discusses the legal requirements for including patient identifiers. Systems and processes must be in place to manage security and confidentiality of these data. Confidentiality can be enhanced by assigning a registry-specific identifier via a crosswalk algorithm, as discussed below. Demographics, such as date of birth (to calculate age at any time point), gender, and ethnicity, are typically collected and may be used to stratify the registry population.

Disease/condition—Disease or condition data include those related to the disease or condition of focus for the registry and may incorporate comorbidities. Elements of interest related to the confirmation of a diagnosis or condition could include the date of diagnosis and the specific diagnostic results that were used to make the diagnosis, depending on the purpose of the registry. Disease or condition is often a primary eligibility or outcome variable in registries, whether the intent is to answer specified treatment questions (e.g., measure effectiveness or safety) or to describe the natural history. This information may also be collected in constructing a medical history for a patient. In addition to “yes” or “no” to indicate presence or absence of the diagnosis, it may be important to capture responses such as “missing” or “unknown.”

Treatment/therapy—Treatment or therapy data include specific identifying information for the primary treatment (e.g., drug name or code, biologic, device product or component parts, or surgical intervention, such as organ transplant or coronary artery bypass graft) and may include information on concomitant treatments. Dosage (or parameters for devices), route of administration, and prescribed exposure time (such as daily or 3 times weekly for 4 weeks) should be collected. Pharmacy data may include dispensing information, such as the primary date of dispensation and subsequent refill dates. Data in device registries can include the initial date of dispensation or implantation and subsequent dates and specifics of required evaluations or modifications as well as the Unique Device Identifier (UDI). Compliance data may also be collected if pharmacy representatives or clinic personnel are engaged to conduct and report pill counts or volume measurements on refill visits, or return visits for device evaluations and modifications.

Anthropometrics/vital signs—Measurements about the registry participant, such as height, weight, body mass index (BMI), pulse, temperature, and oxygen saturation can be important data in a registry. While these data may be obtained through primary data collection, they are increasingly available through secondary sources such as the electronic health record (EHR) or from patient monitors. If incorporating via a secondary source, it is important to understand whether all measurements or a subset of measurements (median, mean, maximum, minimum) are required, as the vital signs documented within some secondary sources can be both sparse (e.g., height not recorded in an adult if they had a visit within the last two weeks) and overly abundant (e.g., oxygen saturation levels recorded every five seconds on a patient monitor).
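The choice between keeping every measurement and keeping a subset can be made concrete with a short sketch. The example below assumes the registry stores only summary statistics (count, minimum, maximum, mean, median) for an abundant series and records an explicit missing value for a sparse one; the function name and the sample readings are hypothetical.

```python
from statistics import mean, median


def summarize(readings):
    """Collapse a list of numeric readings into summary statistics.

    Returns None when no readings exist, so that a sparse measurement
    is stored as missing rather than as a misleading zero."""
    if not readings:
        return None
    return {
        "n": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
        "median": median(readings),
    }


# Abundant data: oxygen saturation values from a patient monitor (made up).
spo2 = [97, 96, 98, 95, 94]
summary = summarize(spo2)

# Sparse data: no height recorded at this visit.
height_summary = summarize([])
```

Keeping the count alongside the other statistics lets an analyst judge how much raw data stands behind each summary value.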

Laboratory/procedures—Laboratory and ancillary data include a broad range of testing, such as blood, tissue, catheterization, and radiology. Specific test results, units of measure, and laboratory reference ranges or parameters are typically collected. National efforts towards interoperability mean that laboratory data are becoming increasingly standardized, making electronic transfer more feasible. A few specialized types of laboratory testing (e.g., imaging, genomics) are called out below. Diagnostic testing or evaluation may include procedures such as psychological or behavioral assessments. Results of these procedures and clinician examinations may be difficult to obtain through data sources other than the patient medical record.

Imaging—The result of a given imaging test (e.g., echocardiogram, x-ray, CT scan) may be an important element for a registry, but a distinction should be made as to whether the actual images are needed or simply the interpretations or variables derived from the images themselves (e.g., left ventricular volume from an echocardiogram). A registry that seeks to develop more sophisticated image processing or automated interpretation algorithms may need the former, while the latter is likely sufficient for a registry that uses imaging results as part of their inclusion criteria.

Biosamples/Genomic results—The increased collection, testing, and storage of biological specimens as part of a registry (or independently as a potential secondary data source such as those described further below) provides another source of information. Biorepositories are employed to store information about the specimens themselves, while genomic repositories handle the sequencing results, which may range from a handful of genetic variants or specific genotypes all the way to entire individual genomes. Increasingly with genomic data, a distinction is made between the underlying “raw” results and those specific, limited findings that might be reported back as part of a given genomic test. Operationally, a registry may need the results of the test itself, for instance, to stratify patients based on a specific genotype, but a given research analysis may need access to all of the unreported raw results. Due to the size and specialized nature of these data, it is often more feasible to link to these datasets as part of a specific analysis, as opposed to incorporating all of the “raw” data directly into the registry. See Chapters 7 and 8 for more information on the regulatory and ethical issues related to the use of biosamples and genomic results in registries.

Survey/questionnaire data—Surveys or questionnaires can be administered directly to patients, families/caregivers, or providers to collect data for the registry. In some cases, surveys or questionnaires are used to capture patient-reported outcomes (PROs). Surveys or questionnaires can be administered in many ways (e.g., paper forms, online surveys, mobile applications). Registries often use standardized, validated instruments that may also include computer-adaptive testing (CAT) to minimize the number of questions that are presented to the respondent.

Patient-generated data—Data from devices recording a range of data from patients can be incorporated into the registry. This can range from medical devices such as internet-connected scales or continuous glucose monitors, to consumer devices like activity trackers or readings from sensors contained within smartphones or smartwatches. As with patient monitors used in the medical setting, it is important to decide whether to include the raw values (which may be numerous) or some type of summary or derivation.

Healthcare provider characteristics—Information on the healthcare provider (e.g., physician, nurse, or pharmacist) may be collected, depending on the purpose of the registry. Training, education, or specialization may account for differences in care patterns. Geographic location has also been used as an indicator of differences in care or medical practice.

Hospital/clinic/health plan encounter details—System interactions include office visits, outpatient clinic visits, emergency room visits, inpatient hospitalizations, procedures, and pharmacy visits, as well as associated dates. The events that occurred within a given encounter are covered in the sections above (e.g., therapy/treatment, laboratory/procedure, disease/condition, etc.), but descriptive information related to the encounter itself may be useful in capturing differences in care patterns and can also be used to track patterns of referral (e.g., outpatient clinic, inpatient hospital, academic center, emergency room, pharmacy).

Cost/resource utilization—Cost and/or resource utilization data may be necessary to examine the cost-effectiveness of a treatment. Resource utilization data reflect the resources consumed (both services and products), while cost data reflect a monetary value assigned to those resources. Examples include the actual cost of the treatment (e.g., medication, screening, procedure) and the associated costs of the intervention (e.g., treatment of side effects, expenses incurred traveling to and from clinicians’ appointments). Costs that are avoided due to the treatment (e.g., the cost to treat the avoided disease) and costs related to lost workdays may also be important to collect, depending on the objectives of the study. Registries that collect cost data over long periods of time (i.e., many years) may need to adjust costs for inflation during the analysis phase of the study. The types of data elements included in this framework are further described in Chapter 5 and above with respect to their source or the utility of the data for linking to other sources. Many of these may be available through data sources outside of the registry system.
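The inflation adjustment mentioned above for registries that collect cost data over many years can be sketched as a simple ratio of price index values. This is only one possible approach. The index values below are invented for illustration, and a real analysis would take them from a published price index chosen by the study team.

```python
# Hypothetical price index by year (illustrative values only).
PRICE_INDEX = {2015: 100.0, 2016: 102.0, 2017: 105.0, 2018: 110.0}


def adjust_cost(cost, cost_year, target_year, index=PRICE_INDEX):
    """Express a cost incurred in cost_year in target_year currency:
    adjusted = cost * index[target_year] / index[cost_year]."""
    return cost * index[target_year] / index[cost_year]


# A 500.00 procedure cost recorded in 2015, expressed in 2018 terms.
adjusted = adjust_cost(500.0, 2015, 2018)

# A cost already in the target year is unchanged.
unchanged = adjust_cost(500.0, 2018, 2018)
```

Storing the original cost and its year, and applying the adjustment at analysis time, keeps the registry data traceable to the source values.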

Insurance—The insurance system or payer claims data can provide useful information on interactions with the healthcare systems, including visits, procedures, inpatient stays, and costs associated with these events. When using these data, it is important to understand what services were covered under the various insurance plans at the time the data were collected, as this may affect utilization patterns, but it can be reasonable to assume that these data may represent the complete capture for all reimbursed health outcomes or exposures of interest.

Environmental factors/social determinants of health—The social or environmental factors related to a patient’s community are increasingly being recognized as important drivers of health disparities and a cause of variations in patient outcomes. Social determinants of health can be collected directly from patients but may also be available through secondary sources via proxy measures such as socioeconomic status, pollution levels, or community characteristics. These measures are typically assigned to the patient by geocoding their address information and linking at the appropriate geo-spatial level (neighborhood, census tract, zip code, etc.). Since address information is usually considered a patient identifier, additional regulatory approvals may be required to obtain these data.
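Linking area-level measures to patients can be sketched as a lookup keyed on a geographic unit. The example below assumes that geocoding has already reduced each address to a zip code; the area measures, field names, and values are hypothetical. The zip code is dropped from the output because address information is usually treated as a patient identifier.

```python
# Hypothetical area-level measures keyed by zip code (illustrative values).
AREA_MEASURES = {
    "27701": {"median_income": 48000, "pm25": 8.1},
    "27705": {"median_income": 61000, "pm25": 7.4},
}


def attach_area_measures(patients, area_measures):
    """Return one row per patient with area-level measures attached.

    The zip code itself is not copied into the output; an area_matched
    flag records whether a measure was found for the patient's area."""
    linked = []
    for patient in patients:
        measures = area_measures.get(patient["zip"])
        row = {
            "registry_id": patient["registry_id"],
            "area_matched": measures is not None,
        }
        row.update(measures or {})
        linked.append(row)
    return linked


patients = [
    {"registry_id": "R001", "zip": "27701"},
    {"registry_id": "R002", "zip": "99999"},  # no area data available
]
linked_rows = attach_area_measures(patients, AREA_MEASURES)
```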

Social media—An emerging type of data is information related to a patient’s social media activity. This could include the content of the posts themselves, or simply metadata about the time and date of the posting, who viewed or commented on it, etc. These data can be synthesized into measures of community engagement. Some social media companies have restrictions on how a member’s data may be obtained, so it is important to understand the potential terms of use.

4. Data Sources

Data sources are classified as primary or secondary based on the relationship of the data to the registry purpose. Primary data sources incorporate data collected for direct purposes of the registry (i.e., primarily for the registry). Primary data sources are typically used when the data of interest are not available elsewhere or, if available, are unlikely to be of sufficient accuracy and reliability for the planned analyses and uses. Primary data collection increases the probability of completeness, validity, and reliability because the registry drives the methods of measurement and data collection (See Chapter 5). Primary data collection can occur via patients/caregivers or clinicians. These data are prospectively planned and collected under the direction of a protocol or study plan, using common procedures and the same format across all registry sites and patients. The data are readily integrated for tracking and analyses. Since the data entered can be traced to the individual who collected them, primary data sources are more readily reviewed through automated checks or followup queries from a data manager than is possible with many secondary data sources (See Chapter 11).

Secondary data sources consist of data collected for purposes other than or in addition to the registry under consideration (e.g., routine medical care, insurance claims processing). Data that are collected as primary data for one registry are considered secondary data from the perspective of a second registry when the two are linked. These data are often stored in electronic format and may be available for use with appropriate permissions. Data from secondary sources may be used in two ways: (1) the data may be transferred and imported into the registry, becoming part of the registry database, or (2) the secondary data and the registry data may be linked to create a new, larger dataset for analysis. (See Case Example 15.) This chapter primarily focuses on describing commonly used secondary sources. Chapter 11 discusses strategies for transferring secondary data into a registry (abstraction with double-data entry, direct import, transformation, algorithmic derivation).

When considering secondary data sources, it is important to note that health professionals are accustomed to entering data for defined purposes. Data in secondary sources are not constrained by a data collection protocol and therefore represent the diversity observed in real-world practice. Thus, there may be increased probability of errors and underreporting because of inconsistencies in measurement, reporting, and collection. Staff changes can further complicate data collection and may affect data quality. There may also be increased costs for linking the data from the secondary source to the primary source and dealing with any potential duplicate or unmatched patients.
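The handling of duplicate or unmatched patients during linkage can be illustrated with a small sketch. It assumes both sources share a single identifier field, here given the hypothetical name patient_id, and it sets aside ambiguous cases for review rather than resolving them automatically.

```python
from collections import defaultdict


def link_records(registry_rows, secondary_rows, key="patient_id"):
    """Link registry rows to secondary rows on a shared identifier.

    Returns three lists: linked rows (exactly one match), identifiers
    with no match, and identifiers with more than one match."""
    by_key = defaultdict(list)
    for row in secondary_rows:
        by_key[row[key]].append(row)

    linked, unmatched, duplicated = [], [], []
    for reg in registry_rows:
        matches = by_key.get(reg[key], [])
        if not matches:
            unmatched.append(reg[key])
        elif len(matches) > 1:
            duplicated.append(reg[key])
        else:
            # Combine the registry fields with the single matching record.
            linked.append({**reg, **matches[0]})
    return linked, unmatched, duplicated


registry = [
    {"patient_id": "P1", "enrolled": "2016-03-01"},
    {"patient_id": "P2", "enrolled": "2016-04-12"},
    {"patient_id": "P3", "enrolled": "2016-05-20"},
]
claims = [
    {"patient_id": "P1", "inpatient_days": 3},
    {"patient_id": "P3", "inpatient_days": 1},
    {"patient_id": "P3", "inpatient_days": 4},  # duplicate in the source
]
linked, unmatched, duplicated = link_records(registry, claims)
```

Reporting the unmatched and duplicated counts alongside the linked dataset gives reviewers a direct view of how much of the registry population the secondary source actually covers.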

The potential for data completeness, variation, and specificity must be evaluated in the context of the registry purpose and intended use of secondary data. It is crucial to have a solid understanding of the original purpose of the secondary data collection, including the processes for collection and submission, and any verification and validation practices. Questions to ask include:

Is data collection passive or active?

Are standard definitions or codes used in reporting data?

Are standard measurement criteria or instruments used (e.g., diagnoses, symptoms, quality of life)?

The existence and completeness of claims data, for example, will depend on insurance company coverage policies. One company may cover many preventive services, whereas another may have more restricted coverage. One company may cover a treatment without restriction, while another may require prior authorization or require that the patient first fail a previous, less expensive treatment. Also, coverage policies can change over time. These variations must be known and carefully documented to prevent misinterpretation of use rates. Additionally, secondary data may not all be collected in the format (e.g., units of measure) required for registry purposes and may require transformation for integration and analyses.
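A unit-of-measure transformation that remains traceable to the source can be sketched as follows. The example assumes the registry stores body weight in kilograms while a secondary source may report pounds; the field names are hypothetical, and the original value and unit are retained with the converted value so the transformation is documented.

```python
LB_TO_KG = 0.45359237  # the international pound is defined as exactly this many kilograms


def to_kilograms(value, unit):
    """Convert a weight to kilograms and keep the source value for traceability.

    Raises ValueError for an unrecognized unit instead of guessing."""
    unit = unit.strip().lower()
    if unit == "kg":
        factor = 1.0
    elif unit in ("lb", "lbs"):
        factor = LB_TO_KG
    else:
        raise ValueError(f"unrecognized weight unit: {unit!r}")
    return {
        "value_kg": round(value * factor, 2),
        "source_value": value,
        "source_unit": unit,
        "conversion_factor": factor,
    }


converted = to_kilograms(150, "lb")
already_metric = to_kilograms(68.0, " KG ")
```

Rejecting unknown units, rather than passing values through unchanged, prevents a silently mixed column of pounds and kilograms from reaching the analysis dataset.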

An overview of some secondary data sources that may be used for registries is given below. Table 6-1 identifies some key strengths and limitations of these data sources.

Table 6-1. Key data sources—strengths and limitations

Data Source Category	Data Source	Strength and Uses	Limitations
Primary Sources	Patient-reported data/patient-generated data	Patient and/or caregiver patient-reported outcomes. Unique perspective. Obtaining information on treatments not necessarily prescribed by clinicians (e.g., over-the-counter drugs, herbal medications). Obtaining adherence information. Useful when timing of followup may not be concordant with timing of clinical encounter. Obtaining information about the patient not available otherwise (e.g., device data).	Literacy, language, or other barriers that may lead to under-enrollment of some subgroups. Validated data collection instruments may need to be developed. Loss to followup or refusal to continue participation. Non-response.
	Clinician-reported data	More specific information than available from coded data or medical record.	Limited confidence in reporting clinical information and utilization information. May not be usable in its raw form; may be necessary to compute a summary metric. Clinicians are highly sensitive to burden. Missing data. Consistency in capture of patient signs, symptoms, use of nonprescribed therapy varies.
Secondary Sources	Electronic health records (EHRs)	Information on routine medical care and practice, with more clinical context than coded claims. Potential for comprehensive view of patient medical and clinical history within a given health system, or from multiple health systems, if obtaining EHR data directly from patient. Efficient access to medical and clinical data. Use of data transfer and coding standards (including handling of missing data) will increase the quality of data incorporated into the registry.	The underlying information is not collected in a systematic way. For example, a diagnosis of bacterial pneumonia by one physician may be based on a physical exam and patient report of symptoms, while another physician may record the diagnosis only in the presence of a confirmed laboratory test. The movement to standardized value sets for electronic medical records addresses this issue, but such sets are not yet generally adopted. It is difficult to interpret missing data. For example, absence of a specific symptom in the visit record may indicate that the symptom was not present or that the physician did not actively inquire about this specific symptom or set of symptoms. Consistency of data quality and breadth of data collected varies across sites. Difficult to handle information that has been uploaded into the EHRs (e.g., scanned clinician reports) vs. direct entry into data fields. Historical data capture may require manual chart abstraction prior to implementation date of medical records system. Complete medical and clinical history may not be available (e.g., new patient to clinic). EHR systems vary widely. If data come from multiple systems, the registry should plan to work with each one individually to understand the quality of the underlying information and its suitability for use.
	Ancillary clinical information systems	May include more comprehensive information on laboratory results, diagnostic evaluations or treatments than what is available in the EHR.	Important to be knowledgeable about coding systems used in entering data into the original systems. The use of ancillary clinical information systems varies by health system. The registry should plan to work with each system individually to understand the quality of the underlying information and its suitability for use.
	Clinical data warehouses (CDWs) or integrated data repositories (IDRs)	Harmonized information from the EHR and potentially other ancillary clinical systems. May include legacy clinical information not present in the EHR. Potential resource utilization (e.g., days in hospital). May incorporate cost data (e.g., billed and/or paid amounts from insurance claims submissions).	Important to be knowledgeable about the underlying data model, the coding systems used in the original source system(s) and the transformation processes used to populate the repository. The use of CDWs and IDRs varies widely by institution. The registry should plan to work with each system individually to understand the quality of the underlying information and its suitability for use.
	Administrative (claims) databases	Useful for tracking healthcare resource utilization and cost-related information. Range of data includes anything that is reimbursed by health insurance, generally including visits to physicians and allied health providers, most prescription drugs, many devices, hospitalization(s), if a lab test was performed, and in some cases, actual lab test results for selected tests (e.g., blood test results for cholesterol, diabetes). In some cases, demographic information (e.g., gender, date of birth from billing files) can be obtained. Potential for efficient capture of large populations.	Represents clinical cost drivers rather than complete clinical diagnostic and treatment information. Important to be knowledgeable about the process and standards used in claims submission. For example, only primary diagnosis may be coded and secondary diagnoses not captured. Important to be knowledgeable about data handling and coding systems used when incorporating the claims data into the administrative systems. Can be difficult to gain the cooperation of partner groups, particularly in regard to receiving the submissions in a timely manner. May require that data be purchased.
	Death indexes	Completeness—death reporting is mandated by law in the United States. Strong backup source for mortality tracking (e.g., patient lost to followup). National Death Index (NDI)—centralized database of death records from State vital statistics offices; database updated annually. NDI causes of death relatively reliable (93–96%) compared with State death certificates. Social Security Administration’s (SSA) Death Master File—database of deaths reported to SSA; database updated weekly.	Time delay—indexes depend on information from other data sources (e.g., State vital statistics offices), with delays of 12 to 18 months or longer (NDI). It is important to understand the frequency of updates of specific indexes that may be used. Absence of information in death indexes does not necessarily indicate “alive” status at a given point in time. Most data sources are country specific and thus do not include deaths that occurred outside of the country. As of November 2011, Death Master File no longer includes protected State records. Lack of complete patient identifier may pose challenge linking with data from other data sources.
	Aggregate/non-patient level databases (e.g., U.S. Census Bureau, Healthcare Cost and Utilization Project, Area Health Resources File)	Each database targets population estimates or socio-economic characteristics of a given area or region (U.S. Census Bureau databases). Provide additional details on providers or medical facilities. Allow additional understanding of target registry population.	Does not provide subject-level data. Estimates vary across different survey sampling methodologies. May not be linkable with the registry database.
	Existing registries	Can be merged with another data source to answer additional questions not considered in the original registry protocol or plan. May include specific data not generally collected in routine medical practice. Can provide historical comparison data. Reduces data collection burden for sites, thereby encouraging participation.	Important to understand the existing registry protocol or plan to evaluate data collected for element definitions, timing, and format, as it may not be possible to merge data unless many of these aspects are similar. Creates a reliance on the other registry. Other registry may end. Other registry may change data elements (which highlights the need for regular communication). Some sites may not participate in both. Must rely on the data quality of the other registry.
	Distributed research networks	May have EHR and/or claims data available for large populations of patients in a standardized common data model (CDM). Can be used to augment registry data without having to work with each individual health system on data transformations.	Important to understand the processes used by the network to transform data into the CDM and to assess the quality of the underlying source data. Creates a reliance on another entity. Networks may change their underlying data model, which can affect the availability/quality of certain data. Some sites may not participate in both. Regulatory/legal requirements for data linkage.

Medical chart abstraction

When secondary sources are unstructured (e.g., notes) or registry variables require human interpretation to complete (for example, when several areas of the record must be consulted to make a determination), abstraction may be used. Inter-rater reliability measurements of abstractors can assist in understanding the quality of the abstraction. Abstraction may be done manually or by using computational methods, referred to as natural language processing (NLP), to extract information from free text that is stored electronically. Chapter 11 discusses abstraction methods in more detail.
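One commonly used inter-rater reliability measure is Cohen's kappa, which compares the agreement observed between two abstractors with the agreement expected by chance. The chapter does not prescribe a specific statistic, so the sketch below is offered only as one option; the function name and the sample abstraction results are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classified the same items.

    kappa = (observed - expected) / (1 - expected), where expected is
    the chance agreement implied by each rater's category frequencies."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must classify the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Two abstractors record whether a diagnosis is documented in 10 charts.
abstractor_1 = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"]
abstractor_2 = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "Y"]
kappa = cohens_kappa(abstractor_1, abstractor_2)
```

In this example the abstractors agree on 8 of 10 charts and chance agreement is 0.5, so kappa is approximately 0.6.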

Electronic health records

Electronic health records (EHRs) are computer systems that are used to document and manage patient care within and across health systems. The last decade has seen a tremendous increase in the adoption and use of EHRs, driven by the EHR Incentive Program (“Meaningful Use” incentives) that was included in the Health Information Technology for Economic and Clinical Health (HITECH) Act (part of the American Recovery and Reinvestment Act legislation passed by the United States Congress in 2009), as well as by electronic reporting requirements for quality measures from the Centers for Medicare and Medicaid Services (CMS).²⁻⁴ These programs resulted in the creation of a certification program that is used to test the functionality of various EHR components. Over 75 percent of physicians (2015 data) and 96 percent of non-federal acute care hospitals (2016 data) within the United States report the adoption of a certified EHR.⁵ Whereas health systems would previously adopt “best of breed” clinical information systems to handle different components of the care process (ambulatory, inpatient, e-prescribing, laboratory results, etc.), the burden of attempting to integrate different systems has caused the industry to move towards enterprise solutions provided by a single vendor.

EHRs can be used to capture many different types of data – vital signs, patient history, diagnoses/conditions, treatments and therapies, laboratory results, surveys and questionnaires, etc.⁶ As such, they contain a wealth of potentially relevant information for a registry. Data in the EHR reflect the practice of medicine or healthcare within a health system and specialty. The use of standard medical practice data can be useful when looking at treatments and outcomes in the real world, including all of the confounders that affect the measurement of effectiveness (as distinguished from efficacy) and safety outside of the controlled conditions of a clinical trial. Documentation within the EHR is variable, and patients who are seen at multiple health systems will have multiple records. While there have been efforts to promote the interoperability of EHRs, there is still a wide variation in coding practices. In addition, while EHRs support the capture of structured text or coded fields, a large percentage of documentation still occurs as free text, which limits reuse without additional processing.

It is worth noting that, within the registry context, an EHR may function as both a primary and a secondary source. An EHR system may include condition-specific data collection forms that can be used to capture standard-of-care data elements that are equivalent to those collected in a patient registry.⁷ Completion of this form would constitute primary data collection, while electronically transferring laboratory results from the EHR to the registry would constitute use of secondary data.

Ancillary clinical information systems

Even with the widespread adoption of EHRs, many health systems still use ancillary clinical information systems to manage specialized workflows. Examples include radiology or other imaging systems, genomic repositories, pharmacy systems, and patient monitors. These systems may have an interface with the EHR, but they typically only transmit a small fraction of the information that they collect (e.g., interpretation of an echocardiogram vs. all of the data generated during the procedure). Due to their specialized nature, these systems may not be used for reporting or analysis to the same degree as enterprise systems like the EHR, making it more difficult to obtain the underlying source data.

Clinical data warehouses or integrated data repositories

Institutions or health systems also typically maintain one or more integrated data repositories that pull together data from the EHR and other systems into a common, standardized data model. Such systems may also be called a data warehouse or a data lake, depending on the level of standardization and harmonization. Institutions or health systems may develop their own data model, purchase one from a vendor, or adopt one of several common data models (CDMs) that have been developed to support clinical, observational, and comparative effectiveness research. These models include i2b2, Sentinel, PCORnet, HCSRN, and OMOP/OHDSI. While it can be appealing to obtain data that have been standardized into a common model, particularly if that model is utilized by many health systems, it is necessary to understand how the model relates to the way the information was captured in the source system and whether that representation changes the meaning of the data for the purpose of the registry. For instance, an EHR may have an encounter type of “Social Work,” while a CDM may only allow a handful of values for encounter type (Ambulatory visit, ED encounter, Inpatient stay, Other). Collapsing these values can make it difficult to separate out the relevant information, so additional steps must be taken in order to ensure that the information incorporated into the registry is correct. More information on CDMs and data repositories can be found in the eBook on Tools and Technologies for Registry Interoperability.¹
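
One possible "additional step" is to carry the original source value alongside the collapsed CDM value, so that the registry can still separate out encounters that the CDM groups together. The sketch below is illustrative only: the mapping table, field names, and source encounter types other than "Social Work" are hypothetical and do not reflect any particular CDM specification.

```python
# Hypothetical mapping from source EHR encounter types to the small CDM value set
# mentioned in the text (Ambulatory visit, ED encounter, Inpatient stay, Other).
SOURCE_TO_CDM = {
    "Office Visit": "Ambulatory visit",
    "Emergency": "ED encounter",
    "Hospital Admission": "Inpatient stay",
    "Social Work": "Other",
}


def to_cdm_encounter(source_type):
    """Return the collapsed CDM value together with the untouched source value."""
    cdm_type = SOURCE_TO_CDM.get(source_type, "Other")  # unmapped types fall into Other
    return {"cdm_encounter_type": cdm_type, "source_encounter_type": source_type}


def select_by_source_type(records, wanted_source_type):
    """Filter on the preserved source value rather than the collapsed CDM value."""
    return [r for r in records if r["source_encounter_type"] == wanted_source_type]


records = [
    to_cdm_encounter(t)
    for t in ["Office Visit", "Social Work", "Telephone", "Social Work"]
]
social_work = select_by_source_type(records, "Social Work")
```

Filtering on the CDM value alone would return the "Telephone" encounter together with the two "Social Work" encounters, since all three collapse to "Other".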

Administrative (claims) databases

Private and public medical insurers collect a wealth of information in the process of tracking healthcare, evaluating coverage, and managing billing and payment. Information in the databases includes patient-specific information (e.g., insurance coverage and copays; identifiers such as name, demographics, SSN or plan number, and date of birth) and healthcare provider descriptive data (e.g., identifiers, specialty characteristics, locations). Typically, private insurance companies organize healthcare data by physician care (e.g., physician office visits) and hospital care (e.g., emergency room visits, hospital stays). Data include procedures and associated dates, as well as costs charged by the provider and paid by the insurers. Amounts paid by insurers are often considered proprietary and unavailable. Standard coding conventions are used in the reporting of diagnoses, procedures, and other information. Coding conventions include the Current Procedural Terminology (CPT) for physician services and International Classification of Diseases (ICD) for diagnoses and hospital inpatient procedures. The databases serve the primary function of managing and implementing insurance coverage, processing, and payment. (See Case Example 13.)

Medicare and Medicaid claims files are commonly used administrative databases in the United States. Together, the programs cover nearly 133 million people in the United States. The Medicare program covers some 59 million individuals ages 65 and older, as well as younger individuals with end-stage renal disease or who qualify for Social Security Disability.⁸ Medicaid and Children’s Health Insurance Program (CHIP) together cover an additional 73.8 million individuals.⁹ Both programs are administered by the Centers for Medicare and Medicaid Services (CMS). Claim files for these programs can be obtained for inpatient and outpatient visits, skilled nursing facility stays, durable medical equipment, hospital services, and prescription drugs. These data, which are subject to privacy rules and regulations, can be linked to other databases with appropriate permissions. The Research Data Assistance Center (ResDAC) is a CMS contractor that supports researchers interested in using Medicare and/or Medicaid data for research purposes.¹⁰

Death and birth records

Death indexes are national databases tracking population death data (e.g., the National Death Index [NDI]¹¹ and the Death Master File [DMF] of the Social Security Administration [SSA]¹²). Data include patient identifiers, date of death, and attributed causes of death. These indexes are populated through a variety of sources. For example, the DMF includes death information on individuals who had an SSN and whose death was reported to the SSA. Reports may come in to the SSA by different paths, including from survivors or family members requesting benefits or from funeral homes. Because of the importance of tracking Social Security benefits, all States, nursing homes, and mortuaries are required to report all deaths to the SSA. Prior to 2011, the DMF contained virtually 100-percent complete mortality ascertainment for those eligible for SSA benefits. As of November 2011, however, the DMF no longer includes protected State death records. In practical terms, this means that approximately 4.2 million records were removed from the historical public DMF (which contained 89 million records), and some 1 million fewer records will be added to the DMF each year.¹³ The NDI can be used to provide both fact of death and cause of death, as recorded on the death certificate. Cause-of-death data in the NDI are relatively reliable (93–96 percent) compared with death certificates.¹⁴,¹⁵ Time delays in death reporting should be considered when using these sources, and vital status should not be assumed to be “alive” by the absence of information at a recent point in time. These indexes are valuable sources of data for death tracking. Of course, mortality data can be accessed directly through queries of State vital statistics offices and health departments when targeting information on a specific patient or within a State. Likewise, birth certificates are available through State departments and may be useful in registries of children or births.

Aggregate/non-patient-level databases

Databases that provide aggregate, non-patient-level statistics may be valuable resources to augment an existing registry. These databases may contain area or population-level statistics, details about providers or medical facilities, or deidentified encounter details. The frequency with which these databases are updated varies by source. Depending on the level of aggregation, it may be possible to link these data to a registry database (e.g., by generating neighborhood-level socioeconomic information via geocoding). Two sources of area-level data are the U.S. Census and the Area Health Resources Files (AHRF). The U.S. Census Bureau databases¹⁶ provide population-level data utilizing survey sampling methodology. The Census Bureau conducts many different surveys, the main one being the population census. The primary use of the data is to determine the number of seats assigned to each State in the House of Representatives, although the data are used for many other purposes. These surveys calculate estimates through statistical processing of the sampled data. Estimates can be provided with a broad range of granularity, from population numbers for large regions (e.g., specific States), to ZIP Codes, all the way down to a household level (e.g., neighborhoods identified by street addresses). Information collected includes demographic, gender, age, education, economic, housing, and work data. The data are not collected at an individual level but may serve other registry purposes, such as understanding population numbers in a specific region or by specific demographics. The AHRF is maintained by the Health Resources and Services Administration, which is part of the Department of Health and Human Services. The AHRF includes county-level data on health facilities, health professions, measures of resource scarcity, health status, economic activity, health training programs, and socioeconomic and environmental characteristics.¹⁷ The Environmental Protection Agency maintains datasets of air quality and other measures and has developed a number of methods for estimating exposure.¹⁸,¹⁹

Data on medical facilities and physicians may be important for categorizing registry data or conducting subanalyses. Two sources of such data are the American Hospital Association’s Annual Survey Data and the American Medical Association’s Physician Masterfile Data Collection. The Annual Survey Data is a longitudinal database that collects 700 data elements, covering organizational structure, personnel, hospital facilities and services, and financial performance, from more than 6,000 hospitals in the United States.²⁰ Each hospital in the database has a unique ID, allowing the data to be linked to other sources; however, there is a data lag of about 2 years, and the data may not provide enough nuanced detail to support some analyses of cost or quality of care. The Physician Masterfile Data Collection contains current and historic data on nearly one million physicians and residents in the United States. Data on physician professional medical activities, hospital and group affiliations, and practice specialties are collected each year. The National Plan and Provider Enumeration System (NPPES) also contains information about healthcare providers and can be used to provide additional details if the registry captures National Provider Identifiers (NPIs).²¹

Databases of individual patient encounters (e.g., physician office visits, emergency department visits, hospital inpatient stays) generally do not contain individual patient identifiers and thus may not be linkable to patient registries, but they nevertheless provide valuable insight into the makeup of the registry’s target population. This is particularly true for data from nationally representative surveys, such as the Nationwide Inpatient Sample (NIS) from AHRQ’s Healthcare Cost and Utilization Project (HCUP), and the suite of surveys by the Centers for Disease Control and Prevention (CDC) and the National Center for Health Statistics (NCHS), including the National Ambulatory Medical Care Survey (NAMCS), the National Hospital Ambulatory Medical Care Survey (NHAMCS), and the National Hospital Discharge Survey (NHDS).

Existing registries and other databases

There are numerous national and regional registries and other databases that may be leveraged for incorporation into other registries (e.g., disease-specific registries managed by nonprofit organizations, professional societies, or other entities). An example is the National Marrow Donor Program (NMDP),²² a global database of cord blood units and volunteers who have consented to donate marrow and blood cells. Databases maintained by the NMDP include identifiers and locators in addition to information on the transplants, such as samples from the donor and recipient, histocompatibility, and outcomes. NMDP actively encourages research and utilization of registry data through a data application process and submission of research proposals. Other registries may also be valuable sources of data. Resources such as ClinicalTrials.gov and HSRProj are useful for searching for and identifying relevant registries to contact about data sharing or research collaborations.

Distributed research networks

Distributed research networks (DRNs) may be a possible way to obtain EHR or claims data on a registry population. A number of established DRNs exist in the United States, including the Sentinel Initiative (formerly Mini-Sentinel),²³,²⁴ the Health Care Systems Research Network (HCSRN) (formerly the Health Maintenance Organization Research Network [HMORN]),²⁵⁻²⁷ and the National Patient-Centered Clinical Research Network (PCORnet).²⁸,²⁹ These networks are used to support a range of research activities, including pharmacovigilance studies, pragmatic clinical trials, and studies of treatment effectiveness.³⁰⁻³⁴

Within a DRN, partners (sites) typically standardize their data into a CDM, with the data refreshed at a specified frequency (e.g., quarterly). After each refresh, partners will usually execute a data curation package to assess the underlying data quality.³⁵ Partners whose data pass the required checks can then respond to network queries. Data in a DRN typically remain at the local level (behind the network partner’s firewall), with analyses done at the local level and only results, in the form of aggregate counts or summary statistics, returned to the requestor. However, many DRNs have provisions to allow the exchange of patient-level data in some contexts.³⁶,³⁷ Registries that maintain patient identifiers may be able to link to DRN data to obtain greater detail on their population than can reasonably be collected within the registry itself, but technical and governance issues must be resolved before any linkage can actually occur.
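
The distributed query pattern described above can be sketched as follows. This is a simplified illustration rather than the implementation used by any actual network: the record layout, function names, and diagnosis codes are hypothetical. The point is that each partner runs the query locally and only an aggregate count leaves the site.

```python
def run_local_query(patients, min_age, diagnosis_code):
    """Runs behind one partner's firewall and returns only an aggregate count."""
    count = sum(
        1
        for p in patients
        if p["age"] >= min_age and diagnosis_code in p["diagnoses"]
    )
    return {"count": count}  # no patient-level rows are returned


def combine_results(site_results):
    """The requestor sees only the per-site aggregates and their total."""
    return sum(result["count"] for result in site_results)


# Hypothetical patient-level data that never leaves each site.
site_a = [
    {"age": 70, "diagnoses": {"E11"}},
    {"age": 45, "diagnoses": {"E11"}},
]
site_b = [
    {"age": 66, "diagnoses": {"E11", "I10"}},
    {"age": 80, "diagnoses": {"I10"}},
]

results = [run_local_query(site, 65, "E11") for site in (site_a, site_b)]
total = combine_results(results)
```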

5. Other Considerations for Secondary Data Sources

The discussion below focuses on logistical and data issues to consider when incorporating data from other sources. Chapter 11 fully explores data collection, management, and quality assurance for registries.

In accessing data from one registry for the purposes of another, it is important to recognize that data may have changed during the course of the source registry, and this may or may not have been well documented by the providers of the data. For example, in the United States Renal Data System (USRDS),³⁸ a vital part of personal identification is CMS 2728, an enrollment form that identifies the incident data for each patient as well as other pertinent information, such as the cause of renal failure, initial therapy, and comorbid conditions. Originally created in 1973, this form is in its third version, having been revised in 1995 and again in 2005. Consequently, there are data elements that exist in some versions and not others. In addition, the coding for some variables has changed over time. For example, race has been redefined to correspond with Office of Management and Budget directives and Census Bureau categories. Furthermore, form CMS 2728 was optional in the early years of the registry, so until 1983 it was filled out for only about one-half of the subjects. Since 1995, it has been mandatory for all people with end-stage renal disease. These changes in form content, data coding, and completeness would not be evident to most researchers trying to access the data.

Before incorporating a secondary data source into a registry, it is critical to consider the potential impact of the data quality of the secondary data source on the overall data quality of the registry. The potential impact of quality issues in the secondary data sources depends on how the data are used in the primary registry. For example, quality would be significant for secondary data that are intended to be populated throughout the registry (i.e., used to populate specific data elements in the entire registry over time), particularly if these populated data elements are critical to determining a primary outcome. Quality of the secondary data will have less effect on overall registry quality if the secondary data are to be linked to registry data only for a specific analytic study. For more information on data quality, see Chapter 11.

The importance of patient identifiers for linking to secondary data sources cannot be overstated. Multiple patient identifiers should be used, and primary data for these identifiers should not be entered into the registry unless the identifying information is complete and clear. While an SSN is very useful, high-quality probabilistic linkages can be made to secondary data sources using various combinations of such information as name (last, middle initial, and first), date of birth, and gender. For example, the NDI will make possible matches when at least one of seven matching conditions is met (e.g., one matching condition is “exact month and day of birth, first name, and last name”). However, the degree of success in such probabilistic and deterministic matching generally is enhanced by having many identifiers to facilitate matching. As noted earlier, the various types of data (e.g., personal history, adverse events, hospitalization, and drug use) have to be linked through a common identifier.
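
To make the idea of matching conditions concrete, the sketch below flags a possible match when at least one of a set of conditions is met. The first condition follows the example quoted above (exact month and day of birth, first name, and last name); the second condition, the record layout, and the name normalization are hypothetical and are not the NDI's actual matching criteria.

```python
from datetime import date


def normalize(name):
    """Lowercase and drop punctuation and spaces so O'Brien matches OBRIEN."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def cond_birth_month_day_and_names(a, b):
    """Condition from the text: exact month and day of birth, first and last name."""
    return (
        a["dob"].month == b["dob"].month
        and a["dob"].day == b["dob"].day
        and normalize(a["first"]) == normalize(b["first"])
        and normalize(a["last"]) == normalize(b["last"])
    )


def cond_full_dob_last_name_sex(a, b):
    """Hypothetical second condition, included for illustration only."""
    return (
        a["dob"] == b["dob"]
        and normalize(a["last"]) == normalize(b["last"])
        and a["sex"] == b["sex"]
    )


CONDITIONS = [cond_birth_month_day_and_names, cond_full_dob_last_name_sex]


def possible_match(a, b):
    """A pair is a possible match when at least one condition is met."""
    return any(cond(a, b) for cond in CONDITIONS)


registry_rec = {"first": "Mary", "last": "O'Brien", "dob": date(1950, 3, 14), "sex": "F"}
death_rec_1 = {"first": "MARY", "last": "OBRIEN", "dob": date(1951, 3, 14), "sex": "F"}
death_rec_2 = {"first": "Maria", "last": "O'Brien", "dob": date(1950, 3, 14), "sex": "F"}
death_rec_3 = {"first": "Mary", "last": "Smith", "dob": date(1950, 3, 14), "sex": "F"}

matches = [possible_match(registry_rec, d) for d in (death_rec_1, death_rec_2, death_rec_3)]
```

Each additional identifier captured by the registry makes more conditions available, which is why the first two records are still flagged despite a discrepant birth year and a variant first name, respectively.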

The best identifier is one that is not only unique but has no embedded personal identification, unless that information is scrambled and the key for unscrambling it is stored remotely and securely. The group operating the registry should have a process by which each new entry to the registry is assigned a unique code and there is a crosswalk file to enable the system to append this identifier to all new data as they are accrued. The crosswalk file should not be accessible by people or entities outside the management group.
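
A minimal sketch of such a process is shown below, assuming that a random token is an acceptable registry code and that the crosswalk is held in memory for illustration; in practice it would be stored securely and separately from the registry data. The class, function, and field names are hypothetical.

```python
import secrets


class Crosswalk:
    """Maps an identifying key to a random registry code with no embedded identity."""

    def __init__(self):
        self._by_identity = {}

    def registry_id_for(self, identity_key):
        # Assign a new random code the first time an identity is seen;
        # return the same code on every later occurrence.
        if identity_key not in self._by_identity:
            self._by_identity[identity_key] = secrets.token_hex(8)
        return self._by_identity[identity_key]


def append_registry_id(crosswalk, incoming_rows, identity_field):
    """Swap the identifying field for the registry code on newly accrued data."""
    linked = []
    for row in incoming_rows:
        clean = {k: v for k, v in row.items() if k != identity_field}
        clean["registry_id"] = crosswalk.registry_id_for(row[identity_field])
        linked.append(clean)
    return linked


crosswalk = Crosswalk()
rows = [
    {"mrn": "A1", "lab": 5.1},
    {"mrn": "B2", "lab": 6.3},
    {"mrn": "A1", "lab": 5.4},
]
linked = append_registry_id(crosswalk, rows, "mrn")
```

Because the code is random rather than derived from the identifier, it cannot be reversed without access to the crosswalk itself.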

In addition, consideration should be given to the fact that a registry may need to accept and link datasets from more than one outside organization. Each institution contributing data to the registry will have unique requirements for patient data, access, privacy, and duration of use. While having identical agreements with all institutions would be ideal, this may not always be possible from a practical perspective. Yet all registries have resource constraints, and decisions about including certain institutions have to be determined based on the resources available in order to negotiate specialized agreements or to maintain specialized requirements. Agreements should be coordinated as much as possible so that the function of the registry is not greatly impaired by variability among agreements. All organizations participating in the registry should have a common understanding of the rules regarding access to the data. Although exceptions can be made, it should be agreed that access to data will be based on independent assessment of research protocols and that participating organizations will not have individual veto power over access.

When data from secondary sources are used, agreements should specify ownership of the source data and clearly permit data use by the recipient registry. The agreements should also specify the roles of each institution, its legal responsibilities, and any oversight issues. It is critical that these issues and agreements be put in place before data are transferred so that there are no ambiguities or unforeseen restrictions on the recipient registry later on.

Some registries may wish to incorporate data from more than one country. In these cases, it is important to ensure that the data are being collected in the same manner in each country or to plan for any necessary conversion. For example, height and weight data collected from sites in Europe will likely be in different units than height and weight data collected from sites in the United States. Laboratory test results may also be reported in different units, and there may be variations in the types of pharmaceutical products and medical devices that are approved for use in the participating countries. Understanding these issues prior to incorporating secondary data sources from other countries is extremely important to maintain the integrity and usefulness of the registry database.
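
A planned conversion for height and weight might look like the following sketch. The record layout and unit labels are hypothetical; the conversion factors are the standard definitions of the inch and the pound.

```python
CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237


def to_metric(record):
    """Return height in cm and weight in kg regardless of the units submitted."""
    height, height_unit = record["height"], record["height_unit"]
    weight, weight_unit = record["weight"], record["weight_unit"]

    if height_unit == "in":
        height = height * CM_PER_INCH
    elif height_unit != "cm":
        # Refuse to guess: an unknown unit should be queried with the site.
        raise ValueError(f"unrecognized height unit: {height_unit}")

    if weight_unit == "lb":
        weight = weight * KG_PER_POUND
    elif weight_unit != "kg":
        raise ValueError(f"unrecognized weight unit: {weight_unit}")

    return {"height_cm": round(height, 1), "weight_kg": round(weight, 1)}


us_site = {"height": 70, "height_unit": "in", "weight": 150, "weight_unit": "lb"}
eu_site = {"height": 178, "height_unit": "cm", "weight": 68, "weight_unit": "kg"}
converted = [to_metric(us_site), to_metric(eu_site)]
```

Requiring an explicit unit on every value, and rejecting unrecognized units rather than assuming one, keeps a silent unit mismatch from entering the registry database.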

When incorporating other data sources, consideration should also be given to the registry update schedule. A mature registry will usually have a mix of data update schedules. The registry may receive an annual update of large amounts of data, or there could be monthly, weekly, or even daily transfers of data. Regardless of the schedule of data transfer, routine data checks should be in place to ensure proper transfer of data. These should include simple counts of records as well as predefined distributions of key variables. Conference calls or even routine meetings to go over recent transfers will help avoid mistakes that might not otherwise be picked up until much later.
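
The routine checks described above can be sketched as a small function that compares a transfer against a predefined record count and a predefined distribution of one key variable. The function name, the tolerance, and the example variable and codes are hypothetical and would need to be agreed upon with the data provider.

```python
from collections import Counter


def check_transfer(expected_count, rows, key_variable, expected_shares, tolerance=0.10):
    """Return a list of problems found in one data transfer (empty if none)."""
    problems = []

    # Simple count of records against what the sender reported.
    if len(rows) != expected_count:
        problems.append(f"record count {len(rows)} != expected {expected_count}")

    # Predefined distribution of a key variable.
    observed = Counter(row[key_variable] for row in rows)
    total = sum(observed.values())
    for value, expected_share in expected_shares.items():
        share = observed.get(value, 0) / total if total else 0.0
        if abs(share - expected_share) > tolerance:
            problems.append(
                f"{key_variable}={value!r}: share {share:.2f}, expected {expected_share:.2f}"
            )

    # Values that were never seen before often signal a coding change at the source.
    for value in observed:
        if value not in expected_shares:
            problems.append(f"{key_variable}: unexpected value {value!r}")

    return problems


expected = {"living": 0.4, "deceased": 0.6}
good_transfer = [{"donor_type": "living"}] * 4 + [{"donor_type": "deceased"}] * 6
recoded_transfer = [{"donor_type": "L"}] * 4 + [{"donor_type": "deceased"}] * 6

good_problems = check_transfer(10, good_transfer, "donor_type", expected)
recoded_problems = check_transfer(10, recoded_transfer, "donor_type", expected)
```

In the second transfer the record count is correct, so a count check alone would pass; only the distribution and unexpected-value checks reveal that one category has been recoded.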

An example of the need for regular communication is a situation that arose with the United States Renal Data System a few years ago. The United Network for Organ Sharing (UNOS) changed the coding for donor type in their transplant records. This resulted in an apparent 100-percent loss of living donors in a calendar year. The change was not conveyed to USRDS and was not detected by USRDS staff. After USRDS learned about the change, standard analysis files that had been sent to researchers with the errors had to be replaced.

6. Summary

In summary, a registry is not a static enterprise. The management of registry data sources requires attention to detail, constant feedback to all participants, and a willingness to make adjustments to the operation as dictated by changing times and needs.

References for Chapter 6

1. Gliklich RE, Leavy MB, Dreyer NA, (eds). Tools and Technologies for Registry Interoperability, Registries for Evaluating Patient Outcomes: A User’s Guide. 3rd Edition, Addendum II. (Prepared by L&M Policy Research, LLC under Contract No. 290-2014-00004-C.) AHRQ Publication No. 17(18)-EHC017-EF. Rockville, MD: Agency for Healthcare Research and Quality; June 2019. https://effectivehealthcare.ahrq.gov/.
2. Blumenthal D. Launching HITECH. N Engl J Med. 2010;362(5):382-5. PMID: 20042745. DOI: 10.1056/NEJMp0912825.
3. Blumenthal D, Tavenner M. The “meaningful use” regulation for electronic health records. N Engl J Med. 2010;363(6):501-4. PMID: 20647183. DOI: 10.1056/NEJMp1006114.
4. Centers for Medicare and Medicaid Services. Physician Quality Reporting System (PQRS). https://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/PQRS/Downloads/PQRS_OverviewFactSheet_2013_08_06.pdf. Accessed June 20, 2019.
5. The Office of National Coordinator of Health Information Technology. U.S. Department of Health and Human Services. Health IT Dashboard. https://dashboard.healthit.gov/index.php. Accessed June 10, 2019.
6. Marsolo K, Spooner SA. Clinical genomics in the world of the electronic health record. Genet Med. 2013;15(10):786-91. PMID: 23846403. DOI: 10.1038/gim.2013.88.
7. Marsolo K, Margolis PA, Forrest CB, et al. A Digital Architecture for a Network-Based Learning Health System: Integrating Chronic Care Management, Quality Improvement, and Research. EGEMS (Wash DC). 2015;3(1):1168. PMID: 26357665. DOI: 10.13063/2327-9214.1168.
8. Centers for Medicare & Medicaid Services. Medicare Enrollment Dashboard. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Dashboard/Medicare-Enrollment/Enrollment%20Dashboard.html. Accessed June 10, 2019.
9. Centers for Medicare and Medicaid Services. March 2019 Medicaid & CHIP Enrollment Data Highlights. https://www.medicaid.gov/medicaid/program-information/medicaid-and-chip-enrollment-data/report-highlights/index.html. Accessed June 10, 2019.
10. Research Data Assistance Center (ResDAC). https://www.resdac.org/cms-data. Accessed June 10, 2019.
11. Centers for Disease Control and Prevention. National Death Index. https://www.cdc.gov/nchs/ndi/index.htm. Accessed June 20, 2019.
12. Social Security Administration. Death Master File. National Technical Information Service. https://dmf.ntis.gov/. Accessed June 20, 2019.
13. National Technical Information Service. Important Notice: Change in Public Death Master File Records. https://classic.ntis.gov/assets/pdf/import-change-dmf.pdf. Accessed June 20, 2019.
14. Doody MM, Hayes HM, Bilgrad R. Comparability of national death index plus and standard procedures for determining causes of death in epidemiologic studies. Ann Epidemiol. 2001;11(1):46-50. PMID: 11164119.
15. Sathiakumar N, Delzell E, Abdalla O. Using the National Death Index to obtain underlying cause of death codes. J Occup Environ Med. 1998;40(9):808-13. PMID: 9777565.
16. U.S. Bureau of the Census. http://www.census.gov. Accessed June 20, 2019.
17. Health Resources and Services Administration. Area Health Resources Files (AHRF). https://data.hrsa.gov/topics/health-workforce/ahrf. Accessed June 20, 2019.
18. U.S. Environmental Protection Agency. RSIG Data Inventory. https://www.epa.gov/hesc/rsig-data-inventory. Accessed June 20, 2019.
19. McMillan NJ, Holland DM, Morara M, et al. Combining numerical model output and particulate data using Bayesian space–time modeling. Environmetrics. 2010;21(1):48-65. DOI: 10.1002/env.984.
20. American Hospital Association. AHA Data and Directories. https://www.aha.org/other-resources/2018-01-08-aha-data-and-directories. Accessed June 20, 2019.
21. U.S. Department of Health and Human Services. Centers for Medicare & Medicaid Services. Data Dissemination. https://www.cms.gov/Regulations-and-Guidance/Administrative-Simplification/NationalProvIdentStand/DataDissemination.html. Accessed June 20, 2019.
22. National Marrow Donor Program. http://www.marrow.org. Accessed June 20, 2019.
23. Carnahan RM, Moores KG. Mini-Sentinel’s systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned. Pharmacoepidemiol Drug Saf. 2012;21 Suppl 1:82-9. PMID: 22262596. DOI: 10.1002/pds.2321.
24. Curtis LH, Weiner MG, Boudreau DM, et al. Design considerations, architecture, and use of the Mini-Sentinel distributed data system. Pharmacoepidemiol Drug Saf. 2012;21 Suppl 1:23-31. PMID: 22262590. DOI: 10.1002/pds.2336.
25. Ross TR, Ng D, Brown JS, et al. The HMO Research Network Virtual Data Warehouse: A Public Data Model to Support Collaboration. EGEMS (Wash DC). 2014;2(1):1049. PMID: 25848584. DOI: 10.13063/2327-9214.1049.
26. Steiner JF, Paolino AR, Thompson EE, et al. Sustaining Research Networks: the Twenty-Year Experience of the HMO Research Network. EGEMS (Wash DC). 2014;2(2):1067. PMID: 25848605.
27. Vogt TM, Lafata JE, Tolsma DD, et al. The Role of Research in Integrated Health Care Systems: The HMO Research Network. Perm J. 2004;8(4):10-7. PMID: 26705313.
28. Califf RM. The Patient-Centered Outcomes Research Network: a national infrastructure for comparative effectiveness research. N C Med J. 2014;75(3):204-10. PMID: 24830497.
29. Fleurence RL, Curtis LH, Califf RM, et al. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 2014;21(4):578-82. PMID: 24821743. DOI: 10.1136/amiajnl-2014-002747.
30. Toh S, Platt R, Steiner JF, et al. Comparative-effectiveness research in distributed health data networks. Clin Pharmacol Ther. 2011;90(6):883-7. PMID: 22030567. DOI: 10.1038/clpt.2011.236.
31. Hernandez AF, Fleurence RL, Rothman RL. The ADAPTABLE Trial and PCORnet: Shining Light on a New Research Paradigm. Ann Intern Med. 2015;163(8):635-6. PMID: 26301537. DOI: 10.7326/m15-1460.
32. Johnston A, Jones WS, Hernandez AF. The ADAPTABLE Trial and Aspirin Dosing in Secondary Prevention for Patients with Coronary Artery Disease. Curr Cardiol Rep. 2016;18(8):81. PMID: 27423939. DOI: 10.1007/s11886-016-0749-2.
33. Block JP, Bailey LC, Gillman MW, et al. PCORnet Antibiotics and Childhood Growth Study: Process for Cohort Creation and Cohort Description. Acad Pediatr. 2018;18(5):569-76. PMID: 29477481. DOI: 10.1016/j.acap.2018.02.008.
34. Toh S, Rasmussen-Torvik LJ, Harmata EE, et al. The National Patient-Centered Clinical Research Network (PCORnet) Bariatric Study Cohort: Rationale, Methods, and Baseline Characteristics. JMIR Res Protoc. 2017;6(12):e222. PMID: 29208590. DOI: 10.2196/resprot.8323.
35. Qualls LG, Phillips TA, Hammill BG, et al. Evaluating Foundational Data Quality in the National Patient-Centered Clinical Research Network (PCORnet®). EGEMS (Wash DC). 2018;6(1):3. PMID: 29881761. DOI: 10.5334/egems.199.
36. Brown JS, Holmes JH, Shah K, et al. Distributed health data networks: a practical and preferred approach to multi-institutional evaluations of comparative effectiveness, safety, and quality of care. Med Care. 2010;48(6 Suppl):S45-51. PMID: 20473204. DOI: 10.1097/MLR.0b013e3181d9919f.
37. Maro JC, Platt R, Holmes JH, et al. Design of a national distributed health data network. Ann Intern Med. 2009;151(5):341-4. PMID: 19638403.
38. United States Renal Database. https://www.usrds.org/. Accessed June 20, 2019.

Case Examples for Chapter 6

Case Example 13. Using Claims Data Along With Patient-Reported Data To Identify Patients

Description	The National Amyotrophic Lateral Sclerosis (ALS) Registry is a rare disease registry created by the Agency for Toxic Substances and Disease Registry (ATSDR) within the U.S. Centers for Disease Control and Prevention (CDC), Department of Health and Human Services (HHS). The purpose of the registry is to quantify the incidence and prevalence of ALS in the United States, describe the demographics of people with ALS, and examine potential risk factors for the disease.
Sponsor	U.S. Department of Health and Human Services and Agency for Toxic Substances and Disease Registry, through funding from the “ALS Registry Act” (U.S. Congress Public Law 110-373).
Year Started	2010
Year Ended	Ongoing
No. of Sites	All 50 States, including U.S. territories; data from national administrative databases are combined with patient self-enrollment data.
No. of Patients	16,583 as of 2015; prevalence estimates are released annually for the successive calendar year

Challenge

Amyotrophic lateral sclerosis (ALS) is a progressive, fatal neurodegenerative disorder of both the upper and lower motor neurons. Many knowledge gaps exist in the understanding of ALS, including uncertainty about the disease’s incidence and prevalence, misdiagnosis of ALS in patients with other motor neuron disorders, and the role of environmental exposures in the etiology of ALS. Because ALS is a non-reportable disease in the United States (except for the Commonwealth of Massachusetts), previous attempts to estimate ALS incidence and prevalence using nonspecific mortality data have faced many challenges and at best overestimated disease frequency. Identifying patients through site recruitment for research purposes poses additional challenges, as access to patient medical records can be limited, costly, and time-consuming to obtain. Patient recruitment issues are compounded by the complexities of this rare disease, in which the average timeframe from diagnosis to death is 2–5 years. U.S. governmental agencies acknowledged that a national, structured data collection program for ALS was greatly needed, and that alternative data sources and recruitment strategies would need to be identified.

Proposed Solution

In 2008, President Bush signed the ALS Registry Act into law, allowing ATSDR to create the National ALS Registry. The registry is the only Congressionally mandated population-based ALS registry in the United States. As a first step in developing the registry, a workshop of international experts in neurological and autoimmune conditions was convened to discuss approaches to creating a national database. Based on feedback from these experts, the registry uses a two-pronged approach to identify all U.S. cases of ALS. The first approach uses national administrative databases, including those of Medicare, Medicaid, the Veterans Health Administration, and the Veterans Benefits Administration, to identify prevalent cases based on an algorithm developed through pilot projects. These administrative databases cover approximately 90 million Americans, and the algorithm identifies 80 to 85 percent of all true ALS cases when applied to these databases. The second approach uses a secure Web portal to allow patients to self-enroll voluntarily. Data from the two approaches are combined into the registry database, and duplicate patients are identified and removed so that each person with ALS is counted only once in the registry.
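
The combine-and-deduplicate step described above can be illustrated with a minimal sketch. It is not the registry's actual method: the record fields, the source labels, and the exact-match key (normalized last name, birth date, and a partial identifier) are all assumptions chosen for illustration, and a production registry would likely need more tolerant matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    source: str       # illustrative label, e.g. "medicare" or "web_portal"
    last_name: str
    birth_date: str   # ISO date string, e.g. "1950-03-14"
    id_last4: str     # hypothetical partial identifier


def dedup_key(rec):
    """Build a normalized matching key; the choice of fields is an assumption."""
    return (rec.last_name.strip().lower(), rec.birth_date, rec.id_last4)


def merge_sources(*sources):
    """Combine records from all sources, keeping one entry per person.

    Returns a dict mapping each key to the sorted list of sources
    in which that person was found.
    """
    merged = {}
    for records in sources:
        for rec in records:
            merged.setdefault(dedup_key(rec), set()).add(rec.source)
    return {key: sorted(found_in) for key, found_in in merged.items()}


# Made-up records: one person appears in both an administrative
# database and the self-enrollment portal.
admin = [
    CaseRecord("medicare", "Smith", "1950-03-14", "1234"),
    CaseRecord("va", "Lopez", "1948-11-02", "9876"),
]
portal = [
    CaseRecord("web_portal", " smith ", "1950-03-14", "1234"),
    CaseRecord("web_portal", "Chen", "1961-07-30", "5555"),
]

registry = merge_sources(admin, portal)
unique_case_count = len(registry)
```

Keeping the list of contributing sources for each person, rather than discarding the duplicate, preserves a record of which source identified each case.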

Results

The National ALS Registry has funded over 15 research projects, such as evaluating environmental risk factors and possible etiologies for ALS. In addition, the Registry has published 65 articles and more than 50 abstracts in peer-reviewed publications. The Web portal for self-enrolled participants contains 17 brief surveys that collect information on potential risk factors, such as socio-demographic characteristics, occupational history, military history, cigarette smoking, alcohol consumption, physical activity, family history of neurodegenerative diseases, and disease progression. To date, over 80,000 surveys have been completed by Registry enrollees. ATSDR also funded active surveillance projects that allowed population-based case estimates of ALS in certain smaller geographic areas (i.e., at the State and metropolitan levels) to help ATSDR evaluate the completeness of the registry. In addition, ATSDR has developed a system to inform people with ALS about new research (e.g., clinical trials, epidemiological studies) for which they may be eligible. To date, the Registry has helped to recruit for over 45 research projects. Lastly, the Registry now includes a national biorepository that is designed to help researchers better understand the disease by pairing biospecimens (e.g., blood, brain tissue) with existing risk-factor data from patients. Thousands of biospecimens are currently available to researchers for analysis.

Key Point

Combining multiple data sources, such as administrative databases and patient-reported information, is a novel and effective way to identify patients with a rare disease and to better understand the prevalence, incidence, and etiology of the disease. However, using alternative approaches requires a strong understanding of the nuances of the individual data sources; pilot testing also helps to identify potential issues with data sources prior to registry launch.

For More Information

http://wwwn.cdc.gov/als

Mehta P, Kaye W, Raymond J, et al. Prevalence of Amyotrophic Lateral Sclerosis — United States, 2015. MMWR Morb Mortal Wkly Rep 2018;67:1285–1289. DOI: http://dx.doi.org/10.15585/mmwr.mm6746a1.

Mehta P, Horton DK, Kasarskis EJ, et al. CDC Grand Rounds: National Amyotrophic Lateral Sclerosis (ALS) Registry Impact, Challenges, and Future Directions. MMWR Morb Mortal Wkly Rep 2017;66:1379–1382. DOI: http://dx.doi.org/10.15585/mmwr.mm6650a3.

Kaye W, Wagner L, Wu R, Mehta P. Evaluating the completeness of the national ALS registry, United States. Amyotroph Lateral Scler Frontotemporal Degener. 2018 Feb;19(1-2):112-117. PMID: 29020837. DOI: 10.1080/21678421.2017.1384021.

Horton DK, Kaye W, Wagner L. Integrating a Biorepository into the National Amyotrophic Lateral Sclerosis Registry. J Environ Health. 2016 Nov;79(4):38-40. PMID: 28935999.

Horton DK, Graham S, Punjani R, et al. A spatial analysis of amyotrophic lateral sclerosis (ALS) cases in the United States and their proximity to multidisciplinary ALS clinics. Amyotroph Lateral Scler Frontotemporal Degener. 2018 Feb;19(1-2):126-133. PMID: 29262737. DOI: 10.1080/21678421.2017.1406953.

Case Example 14. Using a Patient-Centered Study Design To Collect Informed Consent, Maximize Recruitment and Retention, and Provide Meaningful Clinical Data

Description	Function and Outcomes Research for Comparative Effectiveness in Total Joint Replacement (FORCE-TJR) is a prospective research registry tracking and studying long-term outcomes of elective total joint replacement (TJR) surgery. The registry seeks to understand patient-reported and clinical outcomes by collecting data on baseline patient attributes, procedure approach and technology, inpatient hospital stay, surgeon and institutional characteristics, longitudinal patient pain and function, and post-procedure complications and revisions. A diverse patient cohort allows the generation of aggregate severity-adjusted national and regional data against which participating surgeons can compare their own practice.
Sponsor	Funded in part by grants from the Agency for Healthcare Research and Quality and the Patient-Centered Outcomes Research Institute to the University of Massachusetts Medical School.
Year Started	2011
Year Ended	Ongoing
No. of Sites	Over 200 orthopedic surgeons in 28 states
No. of Patients	>50,000

Challenge

Total joint replacement (TJR) is a common procedure, with more than 700,000 primary hip and knee replacements performed in the United States each year. Although TJR can result in significant pain relief, physical function and activity levels can vary widely after surgery. FORCE-TJR collects data to track patient, provider, and site characteristics in order to evaluate their contributions to patient-reported and clinical outcomes of TJR over time.

TJR patients often have limited contact with their surgeons immediately after making the decision to have surgery, instead interacting with office and hospital staff to complete insurance or anesthesia pre-operative paperwork. Administrative site staff often do not have the time or training to effectively inform patients about the risks and benefits of participating in patient-centered studies. Further, clinical information that may contain important data about adverse events resides in various, disconnected points of care. Patients may return to the hospital in which TJR was performed or they may present at another hospital, urgent care center, or doctor’s office. Often these disparate sites of care are not linked with the same electronic medical record, making data difficult to collect. Collecting informed consent, patient-reported outcomes, and other followup data from TJR patients can therefore be challenging and requires an innovative approach.

Proposed Solution

Successful approaches to maximizing patient participation in research are based in creating a relationship with each patient and minimizing the burden on site staff. Patients who schedule a TJR are asked by administrative staff at the participating site to sign a short study contact form, giving permission for registry staff to contact them. Site staff give the patient an informational packet and send the signed contact forms to the registry. To collect informed consent, registry research staff contact patients at their convenience via telephone to review the study procedures, informed consent form, and medical release forms in the informational packet. At this point, patients have the opportunity to ask questions of registry staff and discuss with them any concerns, facilitating a deep understanding of the registry and their role in its success. Patients return the signed informed consent and medical release forms to registry staff via U.S. mail. At the same time, they complete a standardized patient-reported outcome (PRO) measure to quantify pain and function before surgery. PROs are repeated at annual intervals after surgery to quantify pain relief and functional gain. Patients also answer brief questions to screen for post-surgical events, including revision surgery or other complications. Collecting clinical data that do not reside in a single medical record also relies upon patient engagement. Upon enrollment in the registry, patients are asked to authorize release of their medical records; at each contact following surgery, patients are asked if they sought medical care since their last contact with the registry. If registry staff determine the medical care could be related to the TJR, the related medical records are obtained and analyzed.

Results

The model described above uses registry staff to enroll patients, obtain informed consent, and deliver longitudinal information and motivation, enhancing participant enrollment and commitment over the long term. This procedure facilitates the longitudinal collection of patient-reported outcomes and medical records data, thus enabling more precise severity adjustment than reliance on administrative data would allow. Sites report high satisfaction with the model, contributing to an 80 percent overall patient recruitment ratio in the registry.

Key Point

Registries and other patient-centered research can benefit from a study design that engages patients at enrollment, thereby increasing their participation over the life of the study. For registries that require clinical data from patients who may not access all their care within one system, or that require patient-reported outcome measures, an approach that follows the patient across settings can be beneficial. Contacting patients at their convenience rather than in a healthcare setting can allow them more time to have their questions answered, increasing patient commitment.

For More Information

https://force-tjr.org/

Case Example 15. Linking a Procedure-Based Registry With Claims Data To Study Long-Term Outcomes

Description	The CathPCI Registry measures the quality of care delivered to patients receiving diagnostic cardiac catheterizations and percutaneous coronary interventions (PCI) in both inpatient and outpatient settings. The primary outcomes evaluated by the registry include the quality of care delivered, outcome evaluation, comparative effectiveness, and postmarketing surveillance.
Sponsor	American College of Cardiology Foundation through the National Cardiovascular Data Registry. Funded by participation dues from catheterization laboratories.
Year Started	1998
Year Ended	Ongoing
No. of Sites	1,698 catheterization laboratories
No. of Patients	23.3 million patient records; 9.6 million PCI procedures

Challenge

The registry sponsor was interested in studying long-term patient outcomes for diagnostic cardiac catheterizations and PCI, but longitudinal data were not collected as part of the registry. Rather than create an additional registry, it was determined that the most feasible option was linking the registry data with available third-party databases such as Medicare.

Before the linkage could occur, however, several legal questions needed to be addressed, including what identifiers could be used for the linkage and whether institutional review board (IRB) approval was necessary.

Proposed Solution

The registry developers explored potential issues relating to the use of protected health information (under the Federal HIPAA [Health Insurance Portability and Accountability Act] law) to perform the linkage; the applicability of the Common Rule (protection of human subjects) to the linkage; and the contractual obligations of the individual legal agreement with each participating hospital with regard to patient privacy. The registry gathers existing data, including direct patient identifiers collected as part of routine healthcare activities. Informed consent is not required. The registry sponsor has business associate agreements in place with participating catheterization laboratories for which the registry conducts the outcomes evaluations.

After additional consultation with legal counsel, the registry sponsor concluded that the linkage of data could occur under two conditions: (1) that the datasets used in the merging process must be in the form of a limited dataset (see Chapter 7), and (2) that an IRB must evaluate such linkage. The decision to implement the linkage was based on two key factors. First, the registry participant agreement includes a data use agreement, which permits the registry sponsor to perform research on a limited dataset but also requires that no attempt be made to identify the patient. Second, since there was uncertainty as to whether the proposed data linkage would meet the definition of research on human subjects, the registry sponsor chose to seek IRB approval, along with a waiver of informed consent. The registry sponsor has a policy that requires that all registry research be conducted consistent with the requirements of the Common Rule.

Results

The registry data were linked with Medicare data, using probabilistic matching techniques to link the limited datasets. A research protocol describing the need for linkage, the linking techniques, and the research questions to be addressed was approved by an IRB. Researchers must reapply for IRB approval for any new research questions they wish to study in the linked data.
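
The case example does not specify which probabilistic matching technique was used. As a hedged illustration of the general idea, the sketch below scores candidate record pairs with Fellegi-Sunter style agreement weights. The field names, the m and u probabilities, and the threshold are all hypothetical; the fields were chosen as the kind of indirect identifiers (dates, sex, facility) that might remain in a limited dataset.

```python
import math

# Hypothetical per-field probabilities (not taken from the registry):
#   m = P(field agrees | records belong to the same patient)
#   u = P(field agrees | records belong to different patients)
FIELD_PARAMS = {
    "birth_date": (0.98, 0.001),
    "sex": (0.99, 0.5),
    "admit_date": (0.95, 0.003),
    "hospital_id": (0.97, 0.01),
}


def match_weight(rec_a, rec_b):
    """Sum log2 likelihood ratios over fields: agreement adds weight,
    disagreement subtracts it."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def link(registry_rows, claims_rows, threshold=15.0):
    """For each registry row, keep the best-scoring claims row if its
    weight meets the (illustrative) threshold."""
    links = []
    if not claims_rows:
        return links
    for i, reg in enumerate(registry_rows):
        scored = [(match_weight(reg, clm), j) for j, clm in enumerate(claims_rows)]
        best_weight, best_j = max(scored)
        if best_weight >= threshold:
            links.append((i, best_j, best_weight))
    return links


# Made-up rows for illustration only.
registry_rows = [
    {"birth_date": "1940-05-01", "sex": "F", "admit_date": "2008-02-10", "hospital_id": "H01"},
    {"birth_date": "1938-09-17", "sex": "M", "admit_date": "2008-03-22", "hospital_id": "H02"},
]
claims_rows = [
    {"birth_date": "1938-09-17", "sex": "M", "admit_date": "2008-03-25", "hospital_id": "H02"},
    {"birth_date": "1940-05-01", "sex": "F", "admit_date": "2008-02-10", "hospital_id": "H01"},
]

links = link(registry_rows, claims_rows)
```

In this sketch the first registry row links to the second claims row because all four fields agree, while the second registry row is left unlinked because the admission dates differ and the total weight falls below the threshold. In practice the probabilities and threshold would need to be estimated and validated against the data being linked.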

Results of the linkage analyses were used to develop a new measure, “Readmission following PCI,” for the Centers for Medicare & Medicaid Services’ hospital inpatient quality pay-for-reporting program, and researchers have used the linked data to address other questions.

Key Point

There are many possible interpretations of the legal requirements for linking registry data with other data sources. The interpretation of legal requirements should include careful consideration of the unique aspects of the registry, its data, and its participants. In addition, clear documentation of the way the interpretation occurred and the reasoning behind it will help to educate others about such decisions and may allay anxieties among participating institutions.

For More Information

https://cvquality.acc.org/NCDR-Home/registries/hospital-registries/cathpci-registry