Cloud Object Storage
In this day and age, any application managing significant amounts of unstructured data is likely storing it in a cloud-based “object store,” such as AWS S3, Google Cloud Storage (GCS), or Azure Blob Storage. These object stores have a number of interesting properties. First and foremost, they provide essentially infinite storage. You can throw terabytes of data at the things, and they will just suck it up. Furthermore, they are extremely durable; once a file is committed to an object store, it is not likely to be lost due to hardware failure (human error, of course, is a whole other thing). In fact, Google touts “99.999999999% annual durability.” That’s 11 nines, which means that if you store 10 million objects in GCS, you can expect to lose one of them every 10,000 years!

Since Ready Room was already running on Google Compute Engine, we opted to go with Google Cloud Storage.

A naive implementation of GCS-backed file management in Ready Room would be to simply modify the backend code to shoot the file over to the object store instead of writing it to the database. That would, indeed, be simple, but it suffers from two big problems. One, it almost doubles the time it takes to upload a file, because the file first has to go from the user to Ready Room and then from Ready Room to GCS. And two, it consumes gobs of system resources (RAM, CPU, and bandwidth) to process a file that the system is just going to turn around and jettison. Those resources cost money, and while they are tied up they are not available to other users. All of this is true when retrieving a file as well. No, we needed a way to get these files to GCS without proxying them through Ready Room’s servers.
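As a quick sanity check on the durability figure quoted earlier in this section, the arithmetic can be sketched in a few lines of Python. The sketch assumes that “11 nines” means each object independently has a 1-in-10^11 chance of being lost in a given year, which is a simplification for illustration and not a statement of how Google models durability.

```python
from fractions import Fraction

# Assumption: "99.999999999% annual durability" means each object has a
# 1 in 10**11 chance of being lost in any given year.
NINES = 11
annual_loss_probability = Fraction(1, 10**NINES)

objects_stored = 10_000_000

# Expected number of objects lost per year across the whole collection.
expected_losses_per_year = objects_stored * annual_loss_probability

# Average number of years between losses.
years_per_lost_object = 1 / expected_losses_per_year

print(years_per_lost_object)  # 10000
```

Exact fractions are used instead of floats so that the tiny probability is not distorted by rounding.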
Authentication via Cryptographically Signed URLs
On request, cloud vendors can provide their customers with a private key: a long string of seemingly random bytes that can be used to gain access to the object store. When that key is generated, the vendor will also create a corresponding public key that they hold on to. This public/private key pair has an interesting mathematical property: data encrypted using the private key can be decrypted with the public key, so anyone holding the public key can confirm that the data came from the holder of the private key.

In computing there is also the notion of a secure hash function. A secure hash function can generate a fingerprint that is, for all practical purposes, unique to any piece of data. Here, for instance, is a fingerprint for the phrase “Ready Room”:

347058ed03730a16153a7526df80eea0fa3f5cdc419af569108c475c23d7edef

The interesting thing about secure hash algorithms is that they are extremely sensitive to the original input. Change just one bit and you get an entirely different hash. Here’s the fingerprint of “ReadyRoom” (no space):

85fb501e73859297043608c1d96a3d14a89515906469789ea4593a96cdbcd517

With these two concepts, public key cryptography and secure hash algorithms, we can solve our problem of authenticating to GCS from a client browser. When a Ready Room user wants to store a file, information about the inspection, the request, and the file is sent to the backend. Ready Room uses this information to construct a URL, such as:

https://storage.googleapis.com/6c398d6f-345b-462c-8726-a96558bb99eb/23/somefile.pdf

Then it hashes that URL (and a few other choice pieces of information) to generate a fingerprint and encrypts that fingerprint with the private key. The resulting “signed” URL can now be safely sent back to the browser.
It looks something like this:

https://storage.googleapis.com/6c398d6f-345b-462c-8726-a96558bb99eb/23/somefile.pdf?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=user%40project.iam.gserviceaccount.com%2F20200609%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200614T115359Z&X-Goog-Expires=900&X-Goog-SignedHeaders=host&X-Goog-Signature=15fbdd7d7b2dd60ddcae409169f85243da2e201c4ff5d7bd1fcdb199d06c29bad8ee81c9baa4da2310aae16eee436fa54055e7d8ab4c5fdc11c84c5ea65075c8a6bfb6383df18b4f27f6ddc1a9c2774a4d4de4a84cb18e3d954fa311c6e2c494243725a9890f6291d84269fe6aea1555194790020049492c2006dce44a674fb35857433c0111857f8ff6d3b88c77118500daeb16b9274f3d10ecc36f8eed695376e0280b00c772ab0d2c6753314acc80dad6a077956231ba313c6ad214adcfe14db6f217bdaa410b26c5ceace2021b2e4af9ac9bb58e95f83af7c391cb937fa67aff8e07f0d4fe5e98ac130a7ba5ed4a302faca743e73d3643318dc565c06fa9

That string of seemingly random characters at the end is the signature; the magnetic ink, if you will. It is the encrypted and encoded fingerprint of the URL components that precede it.

Don’t worry, that URL won’t actually resolve to a file. Not only did I monkey with the URL components, thus invalidating the fingerprint, but these signed URLs can also be set to expire. In this case, as you can see if you squint, it expires after 900 seconds, i.e., 15 minutes. It’s also a “write” URL and cannot be used to read a file.
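To make the hash-then-sign idea concrete, here is a minimal toy sketch in Python using only the standard library. It is not Google's actual signing scheme and not Ready Room's code. The tiny RSA key built from two Mersenne primes, the X-Toy-* parameter names, the example-bucket URL, and the helpers sign_url and verify_url are all assumptions made up for illustration. Real keys are far larger and real schemes add padding, so in practice you would lean on the vendor's client library rather than anything hand-rolled.

```python
import hashlib
from urllib.parse import urlencode

# Toy RSA key pair (assumption: illustration only, far too small to be secure).
P = 2**61 - 1                          # a known Mersenne prime
Q = 2**89 - 1                          # another known Mersenne prime
N = P * Q                              # modulus shared by both keys
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)


def fingerprint(text: str) -> int:
    """SHA-256 fingerprint of the text, reduced to fit the toy modulus."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N


def sign_url(base_url: str, date: str, expires: int) -> str:
    """Append hypothetical X-Toy-* parameters and a signature covering them."""
    query = urlencode({"X-Toy-Date": date, "X-Toy-Expires": expires})
    unsigned = f"{base_url}?{query}"
    # "Encrypt" the fingerprint with the private exponent.
    signature = pow(fingerprint(unsigned), D, N)
    return f"{unsigned}&X-Toy-Signature={signature:x}"


def verify_url(signed_url: str) -> bool:
    """Decrypt the signature with the public key and compare fingerprints."""
    unsigned, _, sig_hex = signed_url.rpartition("&X-Toy-Signature=")
    try:
        signature = int(sig_hex, 16)
    except ValueError:
        return False
    return pow(signature, E, N) == fingerprint(unsigned)


# Hypothetical bucket and object names, for illustration only.
url = sign_url(
    "https://storage.googleapis.com/example-bucket/23/somefile.pdf",
    "20200614T115359Z",
    900,
)
tampered_name = url.replace("somefile.pdf", "otherfile.pdf")
tampered_expiry = url.replace("X-Toy-Expires=900", "X-Toy-Expires=9000")
```

Here verify_url(url) succeeds, while both tampered variants fail: changing the file name or stretching the expiry changes the fingerprint, so it no longer matches the decrypted signature. A real object store would additionally compare the signed date and expiry against the current time before honoring the request.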
Cutting Out the Middleman