Shane Brinkman-Davis Delamore is a master of the craft and passionate about UX design and programmer productivity.
Updated Jan 20, 2022
Amazon Simple Storage Service (Amazon S3) is a popular object storage service because of its scalability, data availability, security, and performance. However, if you’ve worked with very large data sets in S3 you may have encountered certain limitations when it comes to timely copying and syncing. This is where our open-source tool S3P comes in.
S3P provides a radically faster way to copy, list, sync, and do other bulk operations over large Amazon S3 buckets by listing items in parallel. In fact, using algorithmic bisection and some intelligent heuristics, S3P can scan the contents of a bucket up to 15x faster than conventional methods.
We created the original version of S3P in the first half of 2020 to solve the problems of large-bucket content discovery and cross-bucket copying in Amazon S3. As more and more people discovered and used S3P, the most common request we received was for an API so S3P could integrate more easily into their workflows. We are grateful for the feedback and are pleased to release S3P v3 with API support.
Now, you can use S3P as a command-line tool for common operations or as a library for just about anything. You can access the latest S3P code at GitHub or on npmjs.com.
S3P was born from a need to do a major refactor of a client's AWS infrastructure–including moving about 500 terabytes of data into new S3 buckets–in a single month. When preliminary tests with aws s3 sync revealed it would take over 40 days to complete, we had to come up with a better solution.
The task was further complicated by the fact that the client had dozens of buckets and no good information on their contents, including file counts and file sizes. The S3 API is structured around listing items in a bucket sequentially; listing the 5 million files in one of these buckets with aws s3 ls takes about an hour.
We needed a way to list S3 items at least 10x faster to aid discovery and copy S3 items between buckets in the same region 10x to 100x faster.
The solution included the following:
By combining divide-and-conquer with the startAfter option that Amazon S3's SDK introduced with listObjectsV2, we were able to solve the listing problem, performing 100 or more S3 list-object requests in parallel.
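To illustrate the idea, here is a minimal sketch (not S3P's actual implementation): split the key space into ranges and list each range concurrently, using startAfter as the entry point for each range. `listPage` is a mock standing in for a single listObjectsV2 request against a 1,000-key bucket; S3P additionally discovers good split points by bisection rather than hard-coding them as done here.

```javascript
// Mock bucket: 1,000 keys in sorted order.
const allKeys = Array.from({ length: 1000 }, (_, i) => `file-${String(i).padStart(4, "0")}`);

// Mock of one listObjectsV2 request: up to maxKeys keys strictly after startAfter.
async function listPage(startAfter, maxKeys = 100) {
  return allKeys.filter(k => k > startAfter).slice(0, maxKeys);
}

// Sequentially list one key range: everything after startAfter, up to and
// including endAt (undefined endAt means "to the end of the bucket").
async function listRange(startAfter, endAt) {
  const out = [];
  let cursor = startAfter;
  while (true) {
    const page = await listPage(cursor);
    if (page.length === 0) break;
    const keep = endAt === undefined ? page : page.filter(k => k <= endAt);
    out.push(...keep);
    if (keep.length < page.length) break; // ran past the range boundary
    cursor = page[page.length - 1];
  }
  return out;
}

// List all ranges in parallel; split points are fixed here, where S3P
// would guess and refine them with heuristics and bisection.
async function parallelList(splits) {
  const bounds = ["", ...splits, undefined];
  const ranges = [];
  for (let i = 0; i + 1 < bounds.length; i++)
    ranges.push(listRange(bounds[i], bounds[i + 1]));
  return (await Promise.all(ranges)).flat();
}

const listed = parallelList(["file-0250", "file-0500", "file-0750"]);
```

Because each range starts from its own startAfter cursor, the four listings proceed independently; with real network latency this is where the parallel speedup comes from.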
S3’s SDK supports copying directly between buckets without the need to stream the data through the controller (e.g., S3P). By issuing 100s of simultaneous copies, fed by our parallel listing algorithm, we were able to stream gigabytes of data a second while running S3P itself either locally or on modest EC2 instances.
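The copy side can be sketched the same way: a fixed-size pool of workers pulls keys off a shared queue, so hundreds of copies are in flight at once. This is a simplified illustration, not S3P's code; `copyObject` here is a mock standing in for the AWS SDK's bucket-to-bucket copyObject call, and it tracks peak concurrency so the pooling behavior is visible.

```javascript
let inFlight = 0, peak = 0;

// Mock of an S3 copyObject request; the real call copies server-side,
// so no object data flows through the controller.
async function copyObject(key) {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise(r => setTimeout(r, 1)); // simulate network latency
  inFlight--;
  return key;
}

// Copy every key with at most `concurrency` requests in flight.
async function copyAll(keys, concurrency = 100) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < keys.length) {
      const key = keys[next++]; // safe: no await between check and increment
      results.push(await copyObject(key));
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}

const done = copyAll(Array.from({ length: 500 }, (_, i) => `obj-${i}`), 20);
```

Fed by the parallel listing above, a pool like this keeps the copy pipeline saturated without the controller ever touching the object bytes.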
We also added support for summarizing and syncing files, to facilitate deep discovery and fast re-syncing after copying, respectively.
Instead of a 40-day aws-sync, with S3P we were able to copy our client's 500 terabytes of S3 data in less than 24 hours. To help others with similar tasks, we made S3P open-source.
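Those figures line up with simple arithmetic. A back-of-the-envelope check, assuming roughly 150 MB/s for aws s3 cp and 8 GB/s for S3P (throughput numbers discussed later in this post):

```javascript
const TB = 1e12, GB = 1e9, MB = 1e6;
const dataBytes = 500 * TB;

// aws s3 cp at ~150 MB/s: about 38.6 days, consistent with "over 40 days"
const awsDays = dataBytes / (150 * MB) / 86400;

// S3P at ~8 GB/s: about 17.4 hours, consistent with "less than 24 hours"
const s3pHours = dataBytes / (8 * GB) / 3600;

console.log(awsDays.toFixed(1), s3pHours.toFixed(1));
```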
The latest version of S3P adds a robust API alongside its existing CLI capabilities. The API mirrors the CLI: each CLI command has a corresponding API call. The best part is you can learn and explore the API using the CLI.
To learn the API call for a specific CLI command, run that command on the command line with the --api-example option.
For example, if you run:
```
> s3p list-buckets --api-example
```
The output is the equivalent JavaScript API call for that command.
You have the option to install S3P locally, which can allow it to run faster. To install it on your current machine, enter `npm install s3p -g`. Running `npx s3p help` returns a list of commands and provides overall help, and running `npx s3p cp --help` outputs detailed help and examples for each command.
In addition to the API, we addressed several other commonly requested features. The new S3P also adds the following common S3 copy options:
- ACL: The canned ACL to apply to the object. Possible values include: private, public-read, public-read-write, authenticated-read, aws-exec-read, bucket-owner-read, and bucket-owner-full-control.
- CacheControl: Specifies caching behavior along the request/reply chain.
- ContentDisposition: Specifies presentational information for the object.
- ContentEncoding: Specifies what content encodings have been applied to the object, and thus what decoding mechanisms must be applied to obtain the media type referenced by the Content-Type header field.
- ContentLanguage: The language the content is in.
- ContentType: A standard MIME type describing the format of the object data.
- Expires: The date and time at which the object is no longer cacheable, e.g., new Date, 'Wed Dec 31 1969 16:00:00 GMT-0800 (PST)', or 123456789.
- RequestPayer: Confirms that the requester knows that they will be charged for the request. Bucket owners need not specify this parameter in their requests.
- StorageClass: By default, Amazon S3 uses the STANDARD storage class to store newly created objects. The STANDARD storage class provides high durability and high availability. Depending on performance needs, you can specify a different storage class. Amazon S3 on Outposts only uses the OUTPOSTS storage class. Possible values include: STANDARD, REDUCED_REDUNDANCY, STANDARD_IA, ONEZONE_IA, INTELLIGENT_TIERING, GLACIER, DEEP_ARCHIVE, and OUTPOSTS.
As a quick recap of what makes S3P special—both in terms of its original capabilities and the recent updates—here is a summary of its current performance capabilities:
S3P can now be run using a CLI or API and works well both in the cloud and locally. S3P’s S3-bucket-listing performance can hit almost 20,000 items per second, and its S3-bucket-copying performance can exceed 8 gigabytes per second. In fact, I’ve seen 9 gigabytes per second sustained on a bucket with an average file size slightly larger than 100 megabytes. By comparison, I've never seen aws-s3-cp exceed 150 MB/s. That's over 53x faster.
I developed S3P to operate on buckets with millions of items and 100s of terabytes. At present, it is still only a single-core Node.js application, but there are opportunities for even more massively parallel S3 operations by forking workers or even distributing the work across instances with something like Elastic-Queue, which could lead to solutions that are 100x to 1000x faster than the AWS CLI.
Since its inception, S3P has proven essential for rapidly listing, copying, and syncing large S3 buckets, with scanning speeds up to 15x faster and copying speeds up to 100x faster than other solutions. This capability is achieved via an innovative divide-and-conquer approach to parallel S3 bucket listing combined with massively parallel S3 file copying.
I’d like to thank the growing S3P community for their ongoing feedback and support. You can learn more about how to get started with S3P on GitHub.com/generalui/s3p.
Can we help you apply these ideas on your project? Send us a message! You'll get to talk with our awesome delivery team on your very first call.