Ready Room Blog

Synclinical's TMF Reference Model Transformer Generates AI Embeddings

Pete Lacey

Last year, Synclinical published the TMF Reference Model Transformer under an open-source license. The Transformer is a tool to convert the CDISC Trial Master File (TMF) Reference Model from an Excel spreadsheet to JavaScript Object Notation (JSON), a more machine-readable format.

A lot can happen in a year.

A year ago, very few of us were talking about AI, ChatGPT, large language models, or embeddings. A year later, these technologies are fundamentally changing how businesses process information. Today, Synclinical is announcing its first incremental step into artificial intelligence with an update to the TMF Reference Model Transformer that adds support for additional glossary elements, additional document aliases, and, most importantly, OpenAI embeddings.

Embeddings allow users to search for information that’s semantically close to other information. For instance, embeddings make it possible to know that the sentence, “My courage always rises at every attempt to intimidate me,” is much the same as, “Don't make me angry. You wouldn't like me when I'm angry,” but very different from, “You’re killin’ me, Smalls.”
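To make "semantically close" a little more concrete: an embedding is a list of numbers, and closeness is commonly measured with cosine similarity. Here is a minimal sketch. The three-number vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the variable names are ours, not anything from the Transformer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for the embeddings of the three quotes above.
courage = [0.9, 0.1, 0.2]   # "My courage always rises..."
angry = [0.8, 0.2, 0.3]     # "Don't make me angry..."
smalls = [0.1, 0.9, 0.1]    # "You're killin' me, Smalls."

# With these toy values, the first two quotes score as closer to each other
# than the first quote does to the third.
closer = cosine_similarity(courage, angry) > cosine_similarity(courage, smalls)
print(closer)
```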

Using the embeddings generated by the transformer, you could conceivably build a system that takes natural language input such as, “I need the list of all safety letters sent to site with evidence of sending and receipt,” and have the system return where in the TMF to look. (Which is exactly what we did! But that’s next week’s announcement. Stay tuned.)
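A system like that boils down to a nearest-neighbor lookup: embed the question, then rank the artifacts by how similar their embeddings are to it. The sketch below is only an assumption about how such a lookup might be written, not how Ready Room does it. The vectors are toy values, and "artifact-b" and "artifact-c" are placeholder ids.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def top_matches(query_vec, artifact_vecs, k=2):
    """Return the ids of the k artifacts most similar to the query, best first."""
    q = normalize(query_vec)
    scored = []
    for artifact_id, vec in artifact_vecs.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, artifact_id))
    return [artifact_id for _, artifact_id in heapq.nlargest(k, scored)]

# Hypothetical toy index: artifact id -> made-up embedding vector.
index = {
    "05.04.09": [0.9, 0.1, 0.1],
    "artifact-b": [0.1, 0.9, 0.2],
    "artifact-c": [0.2, 0.2, 0.9],
}

# Stand-in for the embedding of the user's natural language question.
query = [0.8, 0.2, 0.1]
best = top_matches(query, index, k=1)
print(best)
```

In a real system the query vector would come from the same embedding model that produced the artifact vectors; mixing models would make the scores meaningless.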

Generating these embeddings is non-trivial (and costly), so Synclinical has decided to not only release the code that generates the embeddings, but also the final embeddings themselves. You can find them here: https://github.com/synclinical/tmf_reference_model_transform/releases/tag/v0.3.0

When integrating this work into Ready Room, we found that while the search results were better than traditional search, they still weren’t great. The problem was that the embeddings were generated against a language model that “knows” very little about clinical trials. That is, the LLM places “TORO” a lot closer to sushi and lawn mowers than it does to regulatory obligations. To address this, we have added two additional features to the Transformer.

First, we added 70 additional acronyms to the 50 or so glossary entries in the reference model. These include common abbreviations such as TORO (Transfer of Regulatory Obligations) and SNL (Safety Notification Letter).

Next, we enhanced the list of possible document names for an artifact to include many common synonyms. For instance, the reference model lists “Evidence of Safety Information Distribution” and “Notification to Investigators of Safety Information” as sub-documents associated with artifact “05.04.09 - Notification to Investigators of Safety Information.” To this list the transformer adds "Safety Letter", "Safety Letter Receipt", "SNL", and "Safety Letter Notification."
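The alias step can be pictured as a simple list merge. This is a sketch under our own assumptions, not the Transformer's actual code: the function name is hypothetical, and the case-insensitive duplicate check is an illustrative choice.

```python
def merge_document_names(reference_names, extra_aliases):
    """Append aliases to the reference names, skipping any alias that
    duplicates a name already in the list (compared case-insensitively)."""
    merged = list(reference_names)
    seen = {name.casefold() for name in merged}
    for alias in extra_aliases:
        key = alias.casefold()
        if key not in seen:
            merged.append(alias)
            seen.add(key)
    return merged

# Names from the reference model for artifact 05.04.09, as quoted above.
reference_names = [
    "Evidence of Safety Information Distribution",
    "Notification to Investigators of Safety Information",
]
# Synonyms the transformer adds for the same artifact.
extra_aliases = [
    "Safety Letter",
    "Safety Letter Receipt",
    "SNL",
    "Safety Letter Notification",
]

names = merge_document_names(reference_names, extra_aliases)
print(names)
```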

Then, when generating the embedding for this artifact, we send the following source to OpenAI:

Document names: Evidence of Safety Information Distribution, Notification to Investigators of Safety Information, Safety Letter, Safety Letter Receipt, SNL, Safety Letter Notification. Zone: Site Management.

(We send the zone information because, annoyingly, the reference model has some artifacts that are identical to other artifacts but in a different zone.)
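Assembling that source string is straightforward. The helper below is a hypothetical illustration of the format shown above, not a copy of the Transformer's code; the function and parameter names are ours.

```python
def build_embedding_source(document_names, zone):
    """Build the text to embed: a comma-separated list of document names
    followed by the zone the artifact belongs to."""
    return f"Document names: {', '.join(document_names)}. Zone: {zone}."

source = build_embedding_source(
    [
        "Evidence of Safety Information Distribution",
        "Notification to Investigators of Safety Information",
        "Safety Letter",
        "Safety Letter Receipt",
        "SNL",
        "Safety Letter Notification",
    ],
    "Site Management",
)
print(source)
```

The resulting string is what would be passed to the embedding model, once per artifact.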

It took some amount of tinkering to land on that source format. Earlier iterations included the artifact definition and more English-like sentences, but those too gave sub-optimal results. If your requirements differ, please feel free to fork the code and edit as needed.

Next week, we’ll show how reference model embeddings will be surfaced in Ready Room. We’re pretty sure you’ll be pleased with our first minor foray into artificial intelligence. In the meantime, the transformer code can be found here: https://github.com/synclinical/tmf_reference_model_transform.

(By the way, the image accompanying this article was generated by ChatGPT using the prompt, "Generate an image that expresses the concept of text embeddings." Our Ready Room blog frequently employs AI-generated images, such as the girl writing a Christmas list, a Banksy-inspired stopwatch, a woodcut coffee cup, and - our favorite - the Greek statue with a laptop.)
