Taking the Limits Off An S3 REST API
Uploading Large Arbitrary Files in Chunks Into An S3 Object Store via RESTful API
Our hybrid cloud presents an S3-like storage solution, but does not furnish an S3 CLI like AWS does. To make use of the buckets, we needed a bound app (one that handled credentials, auth, etc.). We wrote a simple gateway API in Kotlin (called 3direct, because I like bad wordplay) and complemented it with a Java JAR that uploads the files (we called it Pitching Machine, because the whole project has baseball-themed names). It works pretty well, but the initial setup posed a risk: the API’s cloud instances were only allotted 2GB of memory, while the files we needed to upload ran up to 200MB each. Since each file was held in memory for the duration of its transfer, we would be limited to maybe a dozen simultaneous transfers before the whole thing crashed. Thankfully, the underlying AWS SDK for Java exposes an API for multipart uploads.
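To put rough numbers on that risk, here is a minimal back-of-the-envelope sketch in Java. The class and method names are made up for illustration, it treats 2GB as 2048MB, and it ignores whatever the JVM and the API framework need for themselves, so the real ceilings would be lower. It also assumes that a chunked transfer holds one chunk in memory at a time, using the 6MB chunk size from the plan below.

```java
/** Back-of-the-envelope memory budget; hypothetical names, overhead ignored. */
final class MemoryBudget {

    static final long MB = 1024L * 1024L;

    /** How many transfers fit if each one holds bytesPerTransfer in memory at once. */
    static long maxConcurrentTransfers(long memoryBytes, long bytesPerTransfer) {
        return memoryBytes / bytesPerTransfer;
    }

    /** Number of parts needed to cover a file, counting a final partial chunk as a part. */
    static long partsFor(long fileBytes, long chunkBytes) {
        return (fileBytes + chunkBytes - 1) / chunkBytes;
    }

    public static void main(String[] args) {
        long memory = 2048 * MB; // 2GB instance, taken as 2048MB
        long file = 200 * MB;    // worst-case file size
        long chunk = 6 * MB;     // chunk size used in the plan below

        System.out.println(maxConcurrentTransfers(memory, file));  // prints 10
        System.out.println(maxConcurrentTransfers(memory, chunk)); // prints 341
        System.out.println(partsFor(file, chunk));                 // prints 34
    }
}
```

Under those assumptions, whole-file buffering tops out at about ten worst-case transfers, which is in the same ballpark as the estimate above, while chunked transfers leave room for a few hundred chunks in flight.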
The Plan, front to back:
- Take an arbitrarily large file on the local machine and split it into x chunk files of y size each. For us, Pitching Machine used plain Java to split the file into 6MB chunks.
- Initiate an upload via POST from Pitching Machine, containing the filename and a UUID (our process generates one earlier upstream). 3direct would create a new entry in S3 via InitiateMultipartUploadRequest, set the to-be-created S3 object key to the UUID, save the original filename as an amz-metadata tag (so that, when downloaded, the file will have the same name), and return an uploadId to Pitching Machine.
- Pitching Machine would then take all x chunk files and upload each one via PUT, specifying its part number and the uploadId against which to file it. The API, receiving each chunk, invokes an UploadPartRequest and then returns a PartETag to Pitching Machine. That PartETag contains the part number (a 1-based index) paired with a small hash derived from the corresponding part itself.
- Once all parts are uploaded, Pitching Machine collates the PartETags returned from the individual part uploads and sends them, along with the uploadId, to 3direct in a PATCH request. In turn, the API issues a CompleteMultipartUploadRequest and returns a 200 and the file’s key (the initial UUID) to Pitching Machine. The file is now downloadable via a GET request to the API, referencing it by UUID.
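Seen from Pitching Machine’s side, the plan above can be sketched roughly as follows. This is not the actual Pitching Machine code: the Gateway interface is a hypothetical stand-in for the POST, PUT and PATCH calls to 3direct (the HTTP details are left out), PartTag is a local stand-in for the SDK’s PartETag, and the sketch reads chunks straight from the source file rather than writing split files to disk first. It assumes Java 11 or later for InputStream.readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the client-side flow; every name in this class is hypothetical. */
final class MultipartUploadSketch {

    /** 6MB chunks, as in the plan above (taken here as 6 * 1024 * 1024 bytes). */
    static final int CHUNK_SIZE = 6 * 1024 * 1024;

    /** Local stand-in for the SDK's PartETag: a 1-based part number plus its ETag. */
    static final class PartTag {
        final int partNumber;
        final String eTag;

        PartTag(int partNumber, String eTag) {
            this.partNumber = partNumber;
            this.eTag = eTag;
        }
    }

    /** Stand-in for the three gateway calls; a real client would make HTTP requests here. */
    interface Gateway {
        /** POST: starts the upload and returns the uploadId. */
        String initiate(String uuid, String filename) throws IOException;

        /** PUT: uploads one part and returns its ETag. */
        String uploadPart(String uploadId, int partNumber, byte[] data) throws IOException;

        /** PATCH: completes the upload and returns the object key. */
        String complete(String uploadId, List<PartTag> parts) throws IOException;
    }

    /** Uploads the file one chunk at a time and returns the key reported on completion. */
    static String upload(Gateway gateway, Path file, String uuid, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        if (Files.size(file) == 0) {
            throw new IllegalArgumentException("nothing to upload: file is empty");
        }

        String uploadId = gateway.initiate(uuid, file.getFileName().toString());
        List<PartTag> tags = new ArrayList<>();

        try (InputStream in = Files.newInputStream(file)) {
            int partNumber = 1; // part numbers are 1-based
            while (true) {
                // Reads up to chunkSize bytes; an empty array means end of file.
                byte[] chunk = in.readNBytes(chunkSize);
                if (chunk.length == 0) {
                    break;
                }
                String eTag = gateway.uploadPart(uploadId, partNumber, chunk);
                tags.add(new PartTag(partNumber, eTag));
                partNumber++;
            }
        }

        // Parts were uploaded sequentially, so tags is already in ascending part order.
        return gateway.complete(uploadId, tags);
    }
}
```

Because parts go up one at a time, the client holds only a single chunk in memory per transfer, and the tags reach the completion call in ascending part-number order, which is the order AWS S3 expects for a multipart completion. A client that uploaded parts in parallel would need to sort the tags by part number before completing. Note also that AWS S3 itself requires every part except the last to be at least 5MB; whether the S3-like store described here enforces the same floor is not something this sketch assumes.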
Since implementing the split uploads, we’ve seen near-100% uptime for the API, with two or three 200MB files uploading at a time and completing reliably in less than 5 minutes. As the storage system becomes more robust, we’re planning on sharing Pitching Machine with other teams, opening the uploads up to a wider variety of sources.