Shane Brinkman-Davis Delamore is a master of the craft and passionate about UX design and programmer productivity.
Updated Jun 23, 2020
S3P is an open source, massively parallel tool for listing, comparing, copying, summarizing and syncing AWS S3 buckets.
AWS S3 offers industry-leading scalability, data availability, security, and performance. It’s also relatively easy to work with, at least when working with one file at a time. However, once you load a bucket with terabytes of data and millions of files, doing anything over the whole bucket becomes unwieldy. Even getting statistics about file counts and file sizes becomes challenging.
We created S3P to solve two problems: large-bucket content discovery and cross-bucket copying. In our tests S3P was 15x to 100x faster than AWS CLI. We sustained item listing speeds of 20,000 items per second and copy speeds of nearly 9 gigabytes per second.
We needed to refactor a client's AWS infrastructure into a HIPAA-compliant environment. As part of that work, we had to move about 500 terabytes of data into new S3 buckets, and we only had a month to finish the whole project. We knew right away this was going to be trouble.
After some preliminary tests with `aws s3 sync`, we found we could get a maximum of about 150 megabytes/second of throughput. A little back-of-the-envelope math shows it would take over 40 days to complete. Clearly this wasn't going to work.
We also had another problem. The client had dozens of buckets and no good information about the contents of those buckets. Not only did we need to transfer a large amount of data in a small amount of time, we had no idea exactly what data needed to be transferred and no idea what the actual file counts were. By default, S3 doesn't tell you anything about bucket size or file count. S3 + CloudWatch tells you a little, but it isn't very detailed.
The S3 API is structured around listing items in the bucket sequentially. Just listing 5 million files with `aws s3 ls`, at about 1,400 files per second, would take an hour.
We had two problems to solve: fast listing and fast copying.
We solved the listing problem with a clever use of divide-and-conquer and the AWS S3 SDK's `StartAfter` option, introduced with `listObjectsV2`. This was the hardest part of the project. Doing efficient bisection over an unknown key distribution can be tricky, but we made it work. It's blazing fast, performing 100 or more S3 list-object requests in parallel.
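The heart of the trick is guessing a key roughly midway between two known keys, then asking `listObjectsV2` to `StartAfter` the guess so two list requests can run in parallel, one per half. A minimal sketch of the midpoint guess, assuming plain-ASCII keys (the function name is ours for illustration; S3P's actual bisection logic is more involved):

```javascript
// Guess a key roughly midway between two ASCII keys. Each half of the
// range can then be listed in parallel with listObjectsV2({StartAfter: mid}),
// and each half bisected again, recursively.
function midpointKey(low, high) {
  let mid = "";
  for (let i = 0; ; i++) {
    const lo = i < low.length ? low.charCodeAt(i) : 0x20;   // pad with lowest printable char
    const hi = i < high.length ? high.charCodeAt(i) : 0x7e; // pad with highest printable char
    const avg = Math.floor((lo + hi) / 2);
    mid += String.fromCharCode(avg);
    if (avg > lo) return mid; // mid now sorts strictly after `low`
  }
}

console.log(midpointKey("a", "z")); // "m" — splits the range in half
```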
Copying S3 items was more straightforward: we just needed to fire off simultaneous calls to copy-object. We overcame copy-object's lack of support for objects larger than 5 gigabytes by shelling out to `aws s3 cp` for these larger files. What seemed like a kludge at first ended up being a performance win, since it allowed us to leverage multiple CPU cores.
All of this aggressive parallel copying needed limits, however. Without some control over the concurrency, the poor NodeJS instance ran out of memory and crashed. There is a sweet spot for parallel listing as well: too many parallel list-object requests and performance declined. We built tools and options for controlling the concurrency of listing, of copying 'small files' with the SDK, and of copying 'large files' with the aws-cli.
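Capping concurrency amounts to a promise pool: never more than N operations in flight at once. A minimal sketch (not S3P's actual implementation):

```javascript
// Run async task functions with at most `limit` in flight at a time.
async function runWithConcurrency(tasks, limit) {
  const results = [];
  let next = 0;
  // Start `limit` workers; each pulls the next unstarted task until done.
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, async () => {
    while (next < tasks.length) {
      const i = next++; // safe: single-threaded event loop, no await between read and increment
      results[i] = await tasks[i]();
    }
  });
  await Promise.all(workers);
  return results;
}

// Usage sketch: copy many items, but only 100 at a time.
// await runWithConcurrency(items.map((item) => () => copyItem(item)), 100);
```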
Though the current implementation of S3P tops out at about 20,000 items/second for listing and 9 gigabytes/second for copying, we believe there is room for even more performance improvements by forking NodeJS or even distributing the work across multiple machines. There is no documentation about the maximum performance S3 can provide, but whatever it is, S3P’s techniques should allow us to reach it.
If you have S3 data in the petabytes range, and you want an even faster S3P to solve your problems, let us know, we’d love to help you.
Instead of a 40 day aws-sync, with S3P we were able to copy our client's 500 terabytes of S3 data in less than 24 hours.
Before building our own tool, we investigated several alternative solutions for accelerating same-region cross-bucket copying. This Stack Overflow thread nicely sums up the options:
The two main pure-Amazon options were Amazon EMR and S3 transfer acceleration.
I thought EMR would do the trick. Throwing a bunch of parallel hardware at a problem seemed like the right idea. I followed the steps and did a test with 10 large EC2 instances. Shockingly, it still only produced about 150 megabytes/second of throughput. It should have been at least 10x faster. I'm not an expert on Amazon-flavored MapReduce (EMR), and the Amazon-provided "Command Runner" tooling had almost no documentation. I suspect the problem was partly due to the fact that listing S3 buckets is slow. Regardless, we had a short deadline on this project and didn't have time to dive in and debug this option. It turns out that was a very good choice, because S3 speed, in this case, doesn't require huge hardware.
I also experimented with S3 transfer acceleration endpoints, but I was never able to significantly improve on the 150 megabytes/second transfer rate. The documentation for S3 transfer acceleration focuses almost exclusively on uploading data *into* S3, not on transferring between S3 buckets. Reading between the lines, I don't think it actually accelerates transfers between buckets within the same region. Another dead end.
The remaining suggestions basically amounted to parallelizing the copy operations within a single machine. These techniques do accelerate copying, but without solving the serial listing speed problem, they, too, have limitations.
```shell
# run s3p and get the top-level help page
npx s3p

# run s3p and get a summary of the contents of your bucket:
npx s3p summarize --bucket my-bucket-name
```
NOTE: S3P uses the same credentials system as the AWS CLI. More on how to configure your AWS CLI credentials: Configuring the AWS CLI - AWS Command Line Interface
S3P supports many useful operations right out of the box:
```shell
# copy an entire bucket and add .backup to the end of every file
npx s3p copy --bucket my-from-bucket --to-bucket my-to-bucket --to-key "js:(key) => key + '.backup'"
```
You can also iterate over specific sub-ranges of the bucket, either by specifying a `--prefix` or a combination of `--start-after` and `--stop-at` keys:
```shell
# summarize all items with keys starting with "my-folder/"
# Example match: "my-folder/2015-05-23.log"
npx s3p summarize --bucket my-from-bucket --prefix my-folder/

# summarize all items with keys > "2018" and <= "2019".
# Example match: "2018-10-12-report.txt"
npx s3p summarize --bucket my-from-bucket --start-after 2018 --stop-at 2019
```
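Note the asymmetry in the range options: `--start-after` is exclusive (matching S3's own `StartAfter` semantics) while `--stop-at` is inclusive. Expressed as a predicate (a hypothetical helper for illustration, using plain string comparison, which is how S3 sorts keys):

```javascript
// A key is in range when it sorts strictly after `startAfter`
// and at or before `stopAt`.
const inRange = (key, startAfter, stopAt) => key > startAfter && key <= stopAt;

console.log(inRange("2018-10-12-report.txt", "2018", "2019")); // true
console.log(inRange("2018", "2018", "2019")); // false — start-after is exclusive
```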
You can read more about alternative installation options and using S3P as a library in the S3P readme.
With all these massively parallel S3 operations you may be wondering how much this might drive up your S3 bill. The key thing to note is that S3 pricing is not based on speed. The cost is purely based on how much you do, not how fast you do it.
S3P uses two S3 operations: list-objects and copy-object. Currently, both operations cost $.005 per 1,000 requests, or 200,000 requests per $1 (US-East, June 2020).
List-objects can return up to 1,000 items per request, but because we can't assume any a priori knowledge about the distribution of keys, S3P's divide-and-conquer algorithm must make educated guesses, which aren't perfect. S3P always requests 1,000 items per list, but the requests often overlap. In practice, S3P fetches a little more than 500 unique items per request overall.
Effectively, S3P costs about $.01 to scan one million items. To copy one million items costs about $5 (assuming they are all smaller than 5 gigabytes and staying within one AWS region). If each file was just under 5 gigabytes, that’s about $1 per petabyte.
In other words, S3P costs essentially the same as any other method of listing or copying files. It’s just a lot faster. The only real cost danger is it’s so fast, you might use it more, and that might drive up your S3 bill.
S3 pricing is based on how much you use, not how fast you use it.
S3P is an essential tool for working with large S3 buckets. The ability to scan buckets 15x faster and copy data 100x faster opens up whole new capabilities. It allowed us to move half a petabyte of data between buckets in less than a day instead of more than a month. Perhaps even more importantly, it allowed us to inspect the client's S3 buckets in detail to fully understand their data patterns and know with confidence what we needed to do to meet their requirements. Without S3P, we would have been blindly doing large, slow aws-syncs, often getting it wrong as we slowly learned about the data.
S3P’s key innovation is the ability to divide-and-conquer any arbitrary S3 bucket regardless of what keys were used (as long as they only use ASCII characters). This allowed us to break free of the standard, slow linear-list-object algorithm and open up a world of unbounded parallel possibilities.