WACZ Files! A File Format that can help us archive Ovarit as faithfully as possible with images and even videos!
Posted March 25, 2025 by Femina in Ovarit

Have you ever worked with WACZ files? If not, starting out is very easy: https://ovarit.com/o/STEM/511014/simple-tutorial-for-people-starting-out-with-making-personal-offline-archives-of

In addition to the ArchiveExpress method I showed in that tutorial post, there are other ways to make WACZ files!

Using the Webrecorder ArchiveWeb.page Chrome extension or desktop app is very easy, but it requires going into each page manually, which can be VERY tedious. So I can only use it for my own stuff and for some small circles at most...

  • There is also a tool called Browsertrix, made by the same folks who made ArchiveWeb.page, that can crawl websites automatically and produce WACZ files, but I don't know how to use it: https://webrecorder.net/browsertrix/ Browsertrix can be used either as a paid cloud service or deployed locally for free: https://docs.browsertrix.com/deploy/

If anyone here could get Browsertrix working, it would be an IMMENSE help with the Ovarit archiving efforts!

My archive is basically powered by ReplayWeb.page...

A full archive of Ovarit could be like that sample archive but MUCH bigger and with a more permanent URL with a different name!

  • Does anybody wanna contribute to this Ovarit WACZ archiving project?

There are people interested in hosting an archive of Ovarit already and some have even bought domains all ready for it but we need archive files!

9 comments

girl_undone [speaking as mod] March 25, 2025 - sticky

Wait until after this weekend to worry about mass scraping. We’re going to see about making quality flat files, and someone promising might be able to host.

OnlyHuman March 25, 2025 (Edited March 25, 2025)

Thank you!

Are you planning on making the flat files publicly downloadable, or just giving them to those who host?

girl_undone March 25, 2025

Not sure yet.

Femina [OP] March 26, 2025

Ok then!

Maplefields March 25, 2025 (Edited March 25, 2025)

I can’t stay and chat, but I will leave this here:

I might have a lead on automating WACZ collection without triggering Ovarit’s bandwidth restrictions. This weekend, I will test out my plan (below) to see if it’s doable. If I manage to export a WACZ file, I will send it to @femina, if she’s willing, and perhaps she can test whether the file is actually usable, since she’s familiar with ReplayWeb?

Plan (automation of WACZ file download):

  1. Install ArchiveBox.

  2. Write a Bash script that archives the collected URLs through ArchiveBox’s CLI (on a timed loop with pauses, so as not to trigger Ovarit’s bandwidth restrictions), with ArchiveBox set to the lowest settings possible to save hard drive space and computer resources (RAM and CPU).

(Edit: it would really help if I knew what those bandwidth restrictions were. Should I pause every 5 seconds? 30 seconds? If I err on the side of too cautious, we won’t get much archived.)

  3. Install webrecorder/py-wacz.

  4. Convert the WARC files to WACZ using py-wacz.

If successful, I will share my scripts, and I will make them as easy to use as possible.
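As a rough illustration of the timed loop in step 2, here is a minimal sketch in Python rather than Bash. It assumes ArchiveBox is installed and that the script runs inside an initialized ArchiveBox data directory, where `archivebox add <url>` archives a single URL. The function names `archivebox_add` and `archive_with_delay` are made up for this example, and the 30-second default pause is only a placeholder, since the real bandwidth limits are not known. The WARC-to-WACZ conversion from step 4 is not shown.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def archivebox_add(url: str) -> bool:
    """Archive one URL with the ArchiveBox CLI; True if the command exited cleanly.

    Assumption: ArchiveBox is installed and the current directory is an
    initialized ArchiveBox data directory.
    """
    result = subprocess.run(["archivebox", "add", url])
    return result.returncode == 0


def archive_with_delay(
    urls: Iterable[str],
    action: Callable[[str], bool] = archivebox_add,
    delay_seconds: float = 30.0,  # placeholder: the real rate limit is unknown
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run `action` on each URL in order, pausing between requests.

    Returns the URLs for which `action` reported failure, so they can be
    retried later.
    """
    failed = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        if not action(url):
            failed.append(url)
    return failed
```

Calling `archive_with_delay(list_of_urls)` would archive the URLs one at a time with a pause between them and hand back the ones that failed. Passing `action` and `sleep` as parameters makes it possible to try the loop out with a dummy action before pointing it at the real site.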

About URLs

I’m in the middle of a project that collects all the URLs of each circle in chronological order. Once I have these lists (I will post them), it will be easier to coordinate group archival efforts.

OnlyHuman March 25, 2025 (Edited March 25, 2025)

I have a script you can DM me for, but I would rather hold off for what girl_undone has to offer, because the last thing this website needs is a ton of gremlins running query requests and setting the sleep time incredibly low.

I would need to trust that you just want it for knowledge, small personal loads, or repurposing for other websites.

Wondering, does Browsertrix interact with the site for you? And does it retain HTTP request information? If it crawls from the cloud for you, it might not be that big of a deal, but if it uses a server close to you and doesn't wipe identifying info, your approximate location will be known. You could probably edit the files to remove it, but this format is honestly a pain.

Femina [OP] March 26, 2025

I'm not too sure about your Browsertrix question because I don't know how to use it, but I think it can be used online in the cloud or deployed locally...

I accept your script! I never used one before but I wanna take a look... I won't use it right now!

being March 25, 2025

Do you or anyone else know what size these WACZ files typically end up being? I know that some posts have a ton more comments than others, or there are image posts, so what range of file sizes could be expected per file?

Maplefields March 25, 2025 (Edited March 25, 2025)

A WACZ is like a compressed zip file of many individual pages. You can put one page or a whole group of pages into a single WACZ, and that affects how big it is.
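Since a WACZ is a ZIP container, one way to see where the size goes is to list what is inside it. The sketch below uses only Python's standard `zipfile` module; the function names `wacz_sizes` and `summarize` are made up for this example. In a typical WACZ, most of the bytes are expected to sit in the WARC data under `archive/`, with smaller index and page-list files next to it, but that depends on what was captured.

```python
import zipfile
from typing import Dict, Tuple


def wacz_sizes(path) -> Dict[str, Tuple[int, int]]:
    """Map each file inside a WACZ to (stored bytes, original bytes).

    `path` can be a filename or an open binary file object.
    """
    with zipfile.ZipFile(path) as archive:
        return {
            info.filename: (info.compress_size, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()
        }


def summarize(path) -> None:
    """Print a per-file size listing and the total stored size."""
    sizes = wacz_sizes(path)
    for name, (stored, original) in sorted(sizes.items()):
        print(f"{name}: {stored} bytes stored, {original} bytes original")
    print("total stored:", sum(stored for stored, _ in sizes.values()), "bytes")
```

Running `summarize("my-archive.wacz")` on a one-page capture and then on a multi-page capture would give a concrete sense of the size range for a particular circle (the filename here is just a placeholder).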