AWS Whitepapers: Easy Download for re:Invent!

AWS re:Invent is almost upon us and for anyone outside of the Vegas area – probably the majority of the reported 24,000 attendees – it means a long flight. But what better way to prepare than to catch up on AWS whitepapers on the plane while studying for the beta exams? The AWS page was recently reformatted to consolidate all links (hurrah!). You could click the 150+ links manually, but at Cloudreach we like to automate everything, so let’s try to grab the PDFs more effectively. After a few minutes of hacking, we come up with the following single line:


curl | grep -o '<a .*href=.*>' | sed -e $'s/<a /\\\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d' | egrep -i "whitepapers.*pdf" | awk '{ print "\"https:" $0 "\"" }' | xargs -P5 -n 1 curl -LO


This works on my Cloudreach Mac, but some flavours of Linux may need slight variations.

Essentially, we grab the whitepaper listing page and look for href links. Then we stream edit (sed) the content to get it down to a bare list of link targets, and keep only those containing "whitepapers" and "pdf". A few filenames contain spaces, so we wrap each item in double quotes and prefix it with https: at the same time. At this point we have a flat list of URLs. The last section uses xargs to curl each file individually, running up to five downloads in parallel (-P5) to speed things up, whilst being responsible and self-throttling slightly.
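
If you want to see what each stage does without hitting the network, here is a minimal sketch of the same idea run against a small inline sample. The function name extract_pdf_urls, the sample HTML and the example.com links are all made up for illustration and are not taken from the real AWS page. It also swaps the newline-splitting sed step for a tighter grep pattern, so it is a variation on the one-liner above rather than a copy of it.

```shell
# Hypothetical helper (not part of the original one-liner): reads HTML on
# stdin and prints one double-quoted https URL per line.
extract_pdf_urls() {
  # 1. Pull out each opening <a ...href=...> tag on its own line.
  grep -o '<a [^>]*href=[^>]*>' |
    # 2. Strip everything up to the opening quote of the href value,
    #    then everything from the closing quote onwards.
    sed -e 's/^<a .*href=["'"'"']//' -e 's/["'"'"'].*$//' |
    # 3. Keep only links that look like whitepaper PDFs.
    grep -i 'whitepapers.*pdf' |
    # 4. Prefix the scheme and wrap in quotes so xargs keeps spaces intact.
    awk '{ print "\"https:" $0 "\"" }'
}

# Made-up sample markup; example.com stands in for the real host.
sample_html='<ul><li><a href="//example.com/whitepapers/aws-overview.pdf">Overview</a></li><li><a class="x" href='"'"'//example.com/whitepapers/Security Best Practices.pdf'"'"'>Security</a></li>
<li><a href="//example.com/about.html">About</a></li></ul>'

# Show the flat list of quoted URLs.
printf '%s\n' "$sample_html" | extract_pdf_urls

# Dry run: echo the curl commands instead of downloading anything.
printf '%s\n' "$sample_html" | extract_pdf_urls | xargs -n 1 echo curl -LO
```

Once the dry run prints what you expect, dropping the echo and adding -P5 back to xargs gives the parallel download behaviour described above.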


I pull these into my Dropbox folder ready for offline reading, and I’m done!



You could tweak this approach to pull the Kindle files, and probably optimise a bit, but that is left as an exercise for the reader.

Enjoy your plane journey, enjoy re:Invent, and come and meet me and the rest of the Cloudreach team at booth 225 to see how we can help you operate your infrastructure with automation.