contrarianarchon ([personal profile] contrarianarchon) wrote 2020-04-10 12:03 am

(no subject)

Achievement get! I have successfully used wget to download the entirety of a smallish wordpress site because it took my fancy and I had no other way to sensibly archive the entire thing. (And I spent so long mourning not knowing how to use it and it turned out to be pretty simple, thanks at least in part to [personal profile] brin_bellway already having worked out many of the obvious pitfalls and then posting the details very helpfully.)
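
For the curious: the general shape of it was something like the below (not necessarily my exact flags, and with a placeholder URL standing in for the actual site).

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example-blog.wordpress.com/
# mirror recursively, rewrite links for local browsing, fix up file extensions,
# pull in page requisites (CSS/images), and stay below the starting directory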

...

This is more power than I should have. I suspect I will be using up a worrying amount of data capacity on this... For a good cause, of course, but even so.

[personal profile] brin_bellway 2020-04-10 01:43 am (UTC)
\o/

---

That being said, it *is* a concerning amount of power sometimes, yeah. I, uh, did accidentally fuck over some server admins once. (Not a full-on denial-of-service, but apparently they struggled pretty hard under the sudden spike in load.)

I've learned how to throttle since then ("--wait={{insert number of seconds here}}"), and I recommend you do the same, at least with large websites run by small-timers. (Unfortunately the things they did on their end to keep scraper bots from digging too deep (and ending up downloading one zillion immaterially-different page variants) don't seem to have worked, and I have still never successfully downloaded their site.)
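
Concretely, a polite run looks something like this (numbers made up for illustration; tune them to the site):

wget --mirror --convert-links --no-parent --wait=2 --random-wait --limit-rate=200k https://example.com/
# --wait pauses between requests, --random-wait jitters the pause a bit,
# --limit-rate caps download speed, so small servers don't get hammered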

(I...*guess* I could use my shiny new SingleFile and act as a manual web scraper? But the site is so big--even to an entity capable of seeing past the zillion variants--that that doesn't seem very feasible.)

[personal profile] sigmaleph 2020-04-10 03:13 pm (UTC)
woo, nicely done

[personal profile] mindstalk 2020-06-16 08:35 pm (UTC)
wget has many options. I noted the ones that might be of interest to me:

-o logfile
-nd #no directories
-nH #no host directory
-P prefix #directory prefix, i.e. where to save to, default .
--header=header-line
--referer=url
-U agent-string or --user-agent=""
-r #recursive, default depth 5
-l depth #default 5
-k #--convert-links for local
-m #--mirror, shorthand for -r -N -l inf --no-remove-listing (that last one is an FTP thing)
-N #--timestamping
-p #--page-requisites, dependencies even if beyond depth
-H #--span-hosts
-np #--no-parent, never ascend

so for mirroring I often want wget -mk -nH or wget -mk -nH -np
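
e.g., with example.com standing in for whatever I'm actually grabbing:

wget -mk -nH -np https://example.com/somesite/
# long form: wget --mirror --convert-links --no-host-directories --no-parent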