Data Integrity: Data Flow Diagrams, Part 1

As we noted in our previous wrap-up on changes in ICH GCP E6 R3, the revised guidance highlights data integrity, which is also front and center in the revised EMA Guideline on Computerized Systems and Electronic Data in Clinical Trials.  “Integrity” sounds like an ineffable concept, but it is in fact quite concrete.  Although R3 does not provide us with a definition, the EMA Guideline tells us that data integrity is achieved when data are collected, accessed, and maintained so that the ALCOA++ principles (attributability, legibility, contemporaneousness, originality, accuracy, completeness, consistency, endurance, availability, and traceability) are upheld.

If integrity is the end result, then controls on systems and processes are the enablers. The regulatory authorities agree that the controls should be risk-based, but it’s up to the sponsor to assess the risk and then determine the appropriate controls.  A logical first step in the risk assessment is a data flow diagram or table. The recently updated FDA Bioresearch Monitoring Guideline for sponsor and CRO inspections now states an expectation that the inspector will determine the data flow “from initial source generation to reporting in clinical study reports, as applicable.”  This is how an inspector will identify data integrity risks after submission, so it makes sense for sponsors to map the data flow prospectively, identifying and mitigating risks before the study starts.

A data flow diagram provides a useful visual tool for the study team.  We recommend that the diagram use a swim lane for each type of data that is handled in a distinct way.  The diagram should identify how the source is generated, starting with the observation, procedure, or verbal report, and then trace the data through every transcription and transformation.  For example, the first part of the data flow diagram that traces the flow of inclusion criteria data might look like this:

[Figure: first part of a data flow diagram for inclusion criteria data]

To assess risk, we ask the following questions:

  • How reliable is each data source?
  • How easy would it be for a bad actor to falsify a data source?
  • Where data are being transcribed, transformed, or transferred, how easy would it be to introduce inaccuracies or to lose data?
  • Given the answers to our questions, what mitigations should we put in place to guard against risks to data integrity?

Let’s start with the reliability of the source data.  We have two sources of data on this diagram:  the participant’s verbal report and a genetic test. This diagram calls attention to the fact that 9 out of 10 inclusion criteria depend on the participant’s verbal report. Depending on the safety of the product, the severity of the indication, and the ability to independently verify the criterion, this may or may not be appropriate. For example, if a participant is required to have had 3 bone fractures in the last 5 years to be eligible for the study, a faulty memory could result in the enrollment of an ineligible participant; the study team might decide to require corroboration via hospital records as a mitigating step. If a participant is required to have had migraine headaches on 5 of the last 30 days to be eligible, there is unlikely to be corroborating documentation; however, the team might decide to train the site to ask prospective participants an open-ended question (“On how many of the past 30 days did you suffer from a migraine headache?”) rather than a leading one (“Have you had a migraine headache on 5 of the last 30 days?”).

How reliable is the genetic test? Here we would have to obtain data on the rate of false positives and false negatives as well as the consistency of results across laboratories, if we’re planning on using local labs.  If we decide to mitigate the risk by using a central lab, then we need to consider backup methods if the lab is unable to process results during the screening period or if a sample is lost during shipment.

Could either of these sources be falsified? Verbal reports are the easiest sources to falsify, because an unscrupulous investigator could simply record the desired report rather than the actual report. If the study participant is a long-standing patient of the investigator, then the site monitor might be able to corroborate the verbal report in the participant’s medical record.  If the site is recruiting participants off the street, there is more opportunity for falsification; for inclusion criteria that are critical to safety or to robustness of the analysis, the study team may decide to require corroborating documentation as a mitigating step.  Genetic test results could not be easily falsified if a central laboratory were used, particularly if results were available on a portal, but if local labs are planned, the risk increases. In this case, site monitors might be trained to compare reports from the same lab and look for signs of editing.

How easy would it be to introduce inaccuracies during transcription? Most inclusion criteria will be captured in the electronic medical record (EMR), which is the source, and then transcribed twice:  once into the eligibility checklist, and a second time into the EDC system. Each transcription increases the risk of inaccuracy; if we can’t eliminate the eligibility checklist as an interim transcription, we may require site monitors to source-verify it.  For the genetic test, the paper report from the laboratory is the source, so the site monitor must verify the source against three transcriptions:  EMR, eligibility checklist, and EDC. We should not assume that site monitors will implicitly understand this; requirements for verification against source and against interim transcriptions must be spelled out in the monitoring plan.
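For teams that keep the data flow as a table rather than a diagram, the two transcription chains described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the names `Lane`, `LANES`, and `verification_targets` are hypothetical and do not come from any guidance or system, and the lanes simply restate the example in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    """One swim lane: a type of data that is handled in a distinct way."""
    name: str
    source: str            # where the data are first recorded
    transcriptions: tuple  # every later copy of the data, in order


# Hypothetical lanes restating the inclusion criteria example in this post.
LANES = [
    Lane(
        name="Inclusion criteria from verbal report",
        source="EMR",
        transcriptions=("Eligibility checklist", "EDC"),
    ),
    Lane(
        name="Genetic test",
        source="Paper lab report",
        transcriptions=("EMR", "Eligibility checklist", "EDC"),
    ),
]


def verification_targets(lane):
    """List each source-versus-copy comparison a monitoring plan could spell out."""
    return [f"{lane.source} vs {copy}" for copy in lane.transcriptions]


for lane in LANES:
    print(f"{lane.name}: {len(lane.transcriptions)} transcription(s)")
    for target in verification_targets(lane):
        print(f"  verify {target}")
```

Listing the comparisons explicitly makes it harder to overlook an interim transcription such as the eligibility checklist when the monitoring plan is drafted.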

We also need to consider whether particular features of the data make them more likely to be transcribed inaccurately.  For example, if the information captured about the genetic test is Yes or No (participant does or does not have the condition), the risk of inaccuracy is low. If the database requires capture of specific genetic typing data that must be pulled out of a lengthy report, the risk increases.  In this case we may decide to implement additional training for site staff, site monitors, or both, or even to add a QC review by an expert third party for the first few participants at each site.

In our next blog post, we’ll look at the role of “external data” in the data flow.
