Have you ever worked with WACZ files? If not, starting out is very easy: https://ovarit.com/o/STEM/511014/simple-tutorial-for-people-starting-out-with-making-personal-offline-archives-of
In addition to the ArchiveExpress method I showed in that tutorial post, there are other ways to make WACZ files!
Using the WebRecorder ArchiveWeb.page Chrome browser extension or the desktop app is very easy, but it requires going into each page manually, which can be VERY tedious, so I can only use it for my own stuff and for some small circles at most...
If anyone here could get Browsertrix working, it would be an IMMENSE help with the Ovarit archiving efforts! (A rough sketch of one way to run it is below.)
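For anyone who wants to try: here is a minimal, untested sketch of running the Browsertrix crawler through Docker. The flags come from the webrecorder/browsertrix-crawler README and may differ between versions, and the circle URL, page limit, and collection name are placeholder examples, not a tested Ovarit configuration.

```bash
# Minimal sketch: crawl one circle and write a WACZ (flag names may vary by version).
docker run -it --rm \
  -v "$PWD/crawls:/crawls/" \
  webrecorder/browsertrix-crawler crawl \
  --url "https://ovarit.com/o/STEM/" \
  --scopeType prefix \
  --limit 50 \
  --generateWACZ \
  --collection ovarit-stem-sample
# The finished WACZ should end up under ./crawls/collections/ovarit-stem-sample/
```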
WACZ files can archive images and even videos... They can pretty much be used to replicate identical copies of websites...
WACZ files can be loaded in ReplayWeb, either online here: https://replayweb.page/ or offline with the desktop app: https://github.com/webrecorder/replayweb.page/releases/tag/v2.3.4
I made my sample archive with a WACZ file: https://ovarit.com/o/STEM/678680/small-sample-archive-of-ovarit-i-made
Tutorial to make an archive like my own: https://ovarit.com/o/STEM/510775/host-your-own-web-archives-on-glitch
My archive is basically powered by ReplayWeb...
A full archive of Ovarit could be like that sample archive, but MUCH bigger and at a more permanent URL with a different name!
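For reference, here is a minimal sketch of what a ReplayWeb-powered static site like that looks like, based on the replayweb.page embedding docs. The file names, CDN URLs, and my-archive.wacz are my assumptions, so check the current docs before copying it.

```bash
# Sketch: build a static site that plays my-archive.wacz with ReplayWeb.page.
mkdir -p site/replay
cp my-archive.wacz site/

# Page that embeds the ReplayWeb.page viewer and points it at the WACZ file.
cat > site/index.html <<'EOF'
<!doctype html>
<html>
  <body>
    <script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
    <replay-web-page source="my-archive.wacz"
                     url="https://ovarit.com/"
                     replayBase="./replay/"></replay-web-page>
  </body>
</html>
EOF

# Service worker stub required by the embed; it just loads ReplayWeb.page's worker.
cat > site/replay/sw.js <<'EOF'
importScripts("https://cdn.jsdelivr.net/npm/replaywebpage/sw.js");
EOF

# Upload the site/ folder to any static host (Glitch, Netlify, etc.).
```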
There are already people interested in hosting an archive of Ovarit, and some have even bought domains ready for it, but we need archive files!
I can’t stay and chat, but I will leave this here:
I might have a lead on automating WACZ collection without triggering Ovarit’s bandwidth restrictions. This weekend, I will test out my plan (below) to see if it’s doable. If I manage to export a WACZ file, I will send it to @femina, if she’s willing, and perhaps she can test whether the file is actually usable, since she’s familiar with ReplayWeb?
Install ArchiveBox.
Write a Bash script to feed the collected URLs through ArchiveBox’s CLI (on a timed loop with pauses, to avoid triggering Ovarit’s bandwidth restrictions), with ArchiveBox set to the lowest settings possible to save hard drive space and computer resources (RAM and CPU). (A rough sketch of this and the conversion step is below, after these steps.)
(Edit: it would really help if I knew what those bandwidth restrictions were. Should I pause every 5 seconds? 30 seconds? If I err on the side of too cautious, we won’t get much archived.)
Install webrecorder/py-wacz
Convert WARC files to WACZ using py-WACZ
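Here is a rough, untested sketch of what steps 2 and 4 could look like, assuming a file called urls.txt with one collected URL per line. The ArchiveBox config keys, the 30-second pause, and the output paths are my assumptions (key names can differ by ArchiveBox version, and the pause is a guess until we know the real bandwidth restrictions); the wacz create call follows the webrecorder/py-wacz README.

```bash
#!/usr/bin/env bash
set -euo pipefail

# One-time setup: create an ArchiveBox collection and turn off the heavier
# extractors to save disk, RAM, and CPU; keep WARC output, since that is what
# py-wacz converts. (Config key names may vary by ArchiveBox version.)
mkdir -p ovarit-archive && cd ovarit-archive
archivebox init
archivebox config --set SAVE_MEDIA=False
archivebox config --set SAVE_SCREENSHOT=False
archivebox config --set SAVE_PDF=False
archivebox config --set SAVE_WARC=True

# Feed URLs one at a time with a pause between requests, so we stay under
# whatever bandwidth restrictions Ovarit enforces (30s is a placeholder).
while read -r url; do
  archivebox add --depth=0 "$url"
  sleep 30
done < ../urls.txt

# Bundle every WARC that ArchiveBox produced into a single WACZ with py-wacz.
find ./archive -name '*.warc.gz' -print0 | xargs -0 wacz create -o ../ovarit.wacz
```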
If successful, I will share my scripts, and I will make them as easy to use as possible.
I’m in the middle of a project that collects all the URLs of each circle in chronological order. Once I have these lists (I will post them), it will be easier to coordinate group archival efforts.
I have a script you can DM me for, but I would rather hold off and see what girl_undone has to offer, because the last thing this website needs is a ton of gremlins running query requests with the sleep time set incredibly low.
I would need to trust that you just want it for knowledge, small personal loads, or for repurposing for other websites.
Wondering: does Browsertrix interact with the site for you, and retain HTTP request information? If it crawls online for you, it might not be that big of a deal, but if it uses a server close to you and doesn't wipe identifying info, your approximate location will be known. You could probably edit the files to remove it, but this format is honestly a pain.
I'm not too sure about your Browsertrix question because I don't know how to use it, but I think it can be used online in the cloud or deployed locally...
I accept your script! I've never used one before, but I wanna take a look... I won't use it right now!
Do you or anyone else know what size these WACZ files typically end up? I know that some posts have a ton more comments than others, or there are image posts, so what range of file size could be expected per file?
A WACZ is like a compressed zip file of many individual pages. You can WACZ one page or a group of pages, and that will affect how big it is.
Wait until after this weekend to worry about mass scraping. We’re going to see about making quality flat files, and someone promising might be able to host.
Thank you!
Are you planning on making the flat files publicly downloadable, or just giving it to those who host?
Not sure yet.
Ok then!