Data Integrity: Data Flow Diagrams Part 3, The Back End

In our third and last post on data flow diagrams (see Part 1 on risk assessment and Part 2 on lab data), we consider the “back end” – what happens after data leave the data management team for analysis and reporting.

Regulatory authorities are showing increased interest in these back-end data movements. FDA’s updated Bioresearch Monitoring Compliance Program manual for inspections of sponsors and CROs includes new requirements to “determine the data flow from initial source generation to reporting in clinical study reports.” EMA’s updated guideline on computerized systems includes a section on data transfer stating that all transfers should be pre-specified and validated. MHRA’s newly published (albeit not exactly fresh) inspection metrics from 2019 and early 2020 include examples of critical data integrity findings: failure to document and explain post-lock decisions; back-end data changes made without confirmation from the Principal Investigator; and repeated statistical analyses that produced different p-values from the same dataset, with no change to the analysis or the data.

Sponsors typically exert control over data while they are being generated and cleaned, but that control sometimes flags after the data leave Data Management’s sphere. Each time regulated data move from one location to another, we need to ensure one of the following:

  1. Records are transmitted from one adequately controlled system to a second adequately controlled system via a validated, secure method, OR
  2. Records are transmitted from one adequately controlled system to a second adequately controlled system via a method that includes manual or programmatic checks to verify that the record that was sent is identical to the record that was received.

For example, if a CRO moves SAS datasets from its validated statistical computing environment to a validated sFTP server, and the sponsor's biostatistician downloads those datasets directly into their own validated statistical computing environment, then Condition 1 is fulfilled.  If the CRO downloads SDTM datasets onto a CD (do they still make those?), hashes the data, and couriers the CD over to the sponsor, and the sponsor hashes the data and compares the result to the CRO's hash value to verify that the datasets are identical, then Condition 2 is fulfilled.
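
To illustrate Condition 2 in practice, here is a minimal sketch of the hash check the receiving party might run, assuming the CRO supplies SHA-256 values for each file. The file names and manifest format below are hypothetical illustrations, not anything prescribed by the guidance above:

    # Sketch of a Condition 2 check: hash each received dataset and compare it
    # to the value the sender recorded before transmission. File names and the
    # manifest format are hypothetical illustrations.
    import hashlib
    from pathlib import Path

    def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
        """Return the SHA-256 hex digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Expected values supplied by the CRO (e.g., in a transmittal memo).
    sender_manifest = {
        "dm.sas7bdat": "<sha-256 value supplied by the CRO>",
        "ae.sas7bdat": "<sha-256 value supplied by the CRO>",
    }

    for name, expected in sender_manifest.items():
        received = sha256_of(Path("received_datasets") / name)
        status = "OK" if received == expected else "MISMATCH - investigate before use"
        print(f"{name}: {status}")

The same check works whatever the transport (CD, sFTP, or otherwise), as long as the receiving party verifies the hashes independently and documents the result.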

If a CRO attaches SAS datasets to an email, password-protects the file, and sends it to the sponsor with the password in a separate email, and the sponsor then downloads the datasets into their statistical computing environment, neither condition is met.  The password reduces the risk of a bad actor grabbing the data, but it doesn't protect the data from corruption.  Email is not adequately controlled for data exchange, because emails are easily lost, falsified, or corrupted, so other controls would be needed.

Data are frequently transferred to a second party who transforms them in some way.  Ideally, we should capture these transformations on our data flow diagram as well. Each time data are transformed, we need to ensure the following:

  1. The transformation is pre-specified.
  2. Execution of the transformation is documented.
  3. It is possible to compare the data pre-transformation to the data post-transformation.

For example, if a biostatistician develops and approves a Statistical Analysis Plan that specifies how SAS datasets will be analyzed, then Condition 1 is fulfilled.  If the statistical programmer maintains a log of all SDTM datasets generated and quality control activities performed, that fulfills Condition 2.  If the statistical programmer maintains a read-only copy of each SAS dataset before transforming it to SDTM, that fulfills Condition 3.
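
One way a programmer might keep that pre-transformation copy is to archive each incoming dataset as a read-only file before any SDTM work begins. The sketch below is only an illustration under assumed folder names; file-system permissions are just one of several ways to protect the copies:

    # Sketch: archive a read-only copy of each incoming SAS dataset before it
    # is transformed to SDTM, so the pre- and post-transformation data can be
    # compared later. Folder names are hypothetical.
    import shutil
    import stat
    from pathlib import Path

    incoming = Path("incoming")             # datasets as received
    snapshots = Path("snapshots/pre_sdtm")  # untouched, read-only copies
    snapshots.mkdir(parents=True, exist_ok=True)

    for dataset in incoming.glob("*.sas7bdat"):
        copy = snapshots / dataset.name
        shutil.copy2(dataset, copy)  # preserves contents and timestamps
        # Remove write permission so the snapshot can't be altered by accident.
        copy.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        print(f"Archived read-only copy: {copy}")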

Another example: Let's imagine a biostatistical team decides to "hard code" changes to data after receiving SAS datasets.  In many cases, teams do this because a lack of clarity in the protocol or CRF completion guidelines led to inconsistent data collection, but the operations team feels it would take too much time to follow up with the sites to make the change. If the team documents the intended process to be followed before executing the change, that fulfills Condition 1. (Per the inspection finding cited above, this process should include confirmation from the PIs that the change can be made.)  If the team generates a signed memo at the time the change is made, confirming that the pre-defined process was followed, that fulfills Condition 2.  If the team retains both the original SAS datasets and the hard-coded datasets AND performs a programmatic comparison to verify that ONLY the intended changes were made, that fulfills Condition 3.
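
As a hedged sketch of what that programmatic comparison might look like, assuming the datasets can be read into pandas and keyed on subject and sequence number (the file names, key variables, and expected change below are all hypothetical):

    # Sketch of Condition 3 for the hard-coding example: compare the original
    # dataset to the hard-coded dataset and confirm that the only differences
    # are the ones pre-specified in the approved memo. Names are hypothetical.
    import pandas as pd

    key = ["USUBJID", "AESEQ"]
    original = pd.read_sas("original/ae.sas7bdat", encoding="latin-1").set_index(key).sort_index()
    hardcoded = pd.read_sas("hardcoded/ae.sas7bdat", encoding="latin-1").set_index(key).sort_index()

    # DataFrame.compare returns only the cells whose values differ.
    diff = original.compare(hardcoded)

    # Collect (record key, variable) pairs that actually changed.
    observed = {
        (idx, col)
        for idx in diff.index
        for col in diff.columns.get_level_values(0).unique()
        if not diff.loc[[idx], col].isna().all().all()
    }

    # The change pre-approved in the memo, e.g. one AE term recoded for one record.
    expected = {(("01-1001", 2.0), "AETERM")}

    if observed == expected:
        print("Verified: only the intended changes were made.")
    else:
        print("Unexpected differences:", observed.symmetric_difference(expected))

If the comparison output matches the approved memo exactly, the team has objective evidence that nothing beyond the intended hard-coded changes crept into the analysis datasets.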

Most sponsors and CROs have controls in place for standard analysis deliverables.  We sometimes see gaps, however, in niche deliverables: Excel spreadsheets from a PK lab, data from a central imaging reader, or transfers from a small ePRO vendor.  Sometimes the sponsor serves as a conduit between one vendor and another but doesn't store datasets in a protected manner in the interim. Hard-coding is also a common cause of integrity issues.
