AWS Lambda (part III): How to create zip from files
Hello,
After:
AWS Lambda: How to create zip from files
AWS Lambda (part II): How to create zip from files
here is the updated (part III) version.
You can find the source of this Lambda on my GitHub here.
Yesterday I rewrote part of it, removed useless things and made it more reliable; it still works, and even better than before.
What you need to know about Zip over Lambda
I already talked about this in my previous article, but as a reminder, Lambda has the following constraints:
- Memory (from 128 MB to 3008 MB)
- Disk space (500 MB)
- Execution time (max 300s)
So basically, most of the time you need to generate a zip from files whose total size is larger than your disk space, so there is absolutely no way to fetch your files, store them on your Lambda, zip them and upload the result back to a bucket. To do this you'll have to use STREAMS.
I wrote this about a year ago and had already planned to rewrite it properly and enhance it if needed. Recently I discovered a bug in my previous version: as long as I was working with a bunch of around 20 files, more or less, with a total of 1-2 GB of zipped output, everything was OK.
I started to get into trouble when the files got bigger and the zip output reached 7-10 GB. All the files were in the zip, but some of them were corrupted.
First, I briefly need to recap how all of this works:
- Creating a ReadStream from each file to zip
- Creating an archive object with Archiver
- Appending each ReadStream to my archive
Once all the streams are appended to my archive, I finalize it and send the uploadStream to your destination bucket.
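To make that flow more concrete, here is a minimal sketch of the pipeline with the AWS SDK, archiver and s3-upload-stream. It is not the exact code from my repo: fileKeys, sourceBucket, destBucket and destKey are placeholders.

```javascript
const AWS = require('aws-sdk');
const archiver = require('archiver');

const s3 = new AWS.S3();
const s3Stream = require('s3-upload-stream')(s3);

// Stream a list of S3 objects into a zip that is uploaded on the fly,
// so nothing ever touches the 500 MB of local disk.
function zipToBucket(fileKeys, sourceBucket, destBucket, destKey, callback) {
  const upload = s3Stream.upload({ Bucket: destBucket, Key: destKey });
  const archive = archiver('zip');

  upload.on('uploaded', details => callback(null, details));
  upload.on('error', callback);
  archive.on('error', callback);

  // The archive writes its output straight into the multipart upload stream.
  archive.pipe(upload);

  // Append a ReadStream for each source object.
  // How you append matters, see the corruption story below.
  fileKeys.forEach(key => {
    const readStream = s3.getObject({ Bucket: sourceBucket, Key: key }).createReadStream();
    archive.append(readStream, { name: key });
  });

  // Once everything is appended, finalize so archiver flushes the remaining data.
  archive.finalize();
}
```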
As I said, this was working smoothly; however, I had a few issues in my code. Since I had not read the Archiver documentation properly, I was triggering my callback in my async loop right after the append operation, so I thought the bug could come from this. At the beginning I tried to track down when the append operation was actually ending. You can do it in two ways, by tracking down either:
- the finish event from the ReadStream you are currently appending, or
- the progress event from the append operation
There is also the entry event (see the archiver events), however progress gave me more accurate information about where I was, and it also helped me to be sure that everything was done. I even added a boolean to make sure my callback is called only once.
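To illustrate, here is roughly what that looks like with archiver's progress event and the boolean guard; totalFiles and done are placeholders, not the exact code from my repo.

```javascript
// archiver's 'progress' event reports how many entries have been fully processed.
// The boolean guard makes sure the callback is only called once.
let finished = false;

archive.on('progress', progress => {
  if (!finished && progress.entries.processed === totalFiles) {
    finished = true;
    done(null);
  }
});
```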
Anyway, once I had done this, everything was still working as usual, but my bug was still there. I kept investigating and read from the really good s3-upload-stream package that there are AWS limits: a multipart upload to an S3 bucket can have at most 10,000 parts. I thought the bug could come from this, even if in the end I realized it made no sense, since my issue was with the files inside the zip and the output zip itself was always available in my bucket. Still, you need to know that the default part size in s3-upload-stream is 5 MB, and that you can also increase the concurrency, which can help if you want to maximize the way you upload your data. Obviously, when I changed this, nothing changed and I still had corrupted files inside my zip.
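For reference, both settings can be tuned on the stream returned by s3-upload-stream; the values below are just examples, not a recommendation.

```javascript
const upload = s3Stream.upload({ Bucket: destBucket, Key: destKey });

// Default part size is 5 MB; with the 10,000 parts limit that caps the output
// at roughly 50 GB. Raising the part size raises that ceiling.
upload.maxPartSize(20 * 1024 * 1024); // 20 MB per part (example value)

// Upload several parts in parallel to speed up the transfer.
upload.concurrentParts(5); // example value
```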
After a while I tried something: I changed async.each to async.eachSeries in the loop where I was appending all of my files. The difference between the two is that async.each runs your async tasks in parallel, while eachSeries is the same with a concurrency of 1, so one by one. This makes sense: I was trying to append multiple files at once, and it seems archiver does not handle that; if I'm not mistaken, as your files get bigger it can get you into trouble and generate corrupted entries. Since I changed this, I have been able to generate big zips without any issue.
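Here is a sketch of that fix. To keep the example self-contained I wait on archiver's entry event instead of the progress/boolean approach above; fileKeys, sourceBucket and callback are placeholders.

```javascript
const async = require('async');

// Append the files one by one: eachSeries only starts the next iteration
// once the previous one has called next().
async.eachSeries(
  fileKeys,
  (key, next) => {
    const readStream = s3.getObject({ Bucket: sourceBucket, Key: key }).createReadStream();
    // 'entry' fires once archiver has fully written this entry into the archive.
    archive.once('entry', () => next());
    archive.append(readStream, { name: key });
  },
  err => {
    if (err) return callback(err);
    // Everything is appended: finalize the archive.
    archive.finalize();
  }
);
```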
Based on this research and these discoveries, I updated my repo, and you are free to use and modify this program for your own needs. I added the concurrency and chunk size as configuration options, added a check (a HEAD on the file, see the sketch below) before appending a file to your zip to make sure the file really exists, and did some refactoring. I hope you enjoy it.
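The existence check is just a HEAD request on the object before appending it; a minimal sketch (with placeholder names) could look like this.

```javascript
// HEAD the object first: if it does not exist, S3 answers with a NotFound error
// and we can skip it (or fail) instead of appending a broken entry.
function fileExists(bucket, key, callback) {
  s3.headObject({ Bucket: bucket, Key: key }, err => {
    if (err && err.code === 'NotFound') return callback(null, false);
    if (err) return callback(err);
    callback(null, true);
  });
}
```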
Happy zipping over AWS Lambda, ciao.