Archiving the Forums?

Just what it says.

Moderator: peterZ

jlev
Posts: 24
Joined: 04 Mar 2014, 00:52
Number of books owned: 0
Country: USA

Archiving the Forums?

Post by jlev »

Hi duerig and all other site admins,

First, thank you so very much for your ongoing work running the site. This website, and the forums in particular, are, in my mind, one of the lasting gems of the internet. That's not only for the current value of the discussions happening here, but also for the historical value.

I'm wondering whether any ongoing work is being done to archive the materials on this site in a lasting way. For example, beyond daily or weekly backups that you might have as site admins, is there any ongoing process in place to put copies of the site materials in the hands of long-term archivists (for example, through submitting periodic database dumps to the Internet Archive, or through spidering the site and turning it into a locally-viewable HTML zip file / tarball)?

I see this being useful for two reasons: first, in case some sort of catastrophe happens in which all of the servers crash at once, or the domain name gets hijacked somehow. Second, to enable research on what's happened here (I do a lot of text mining in my research, for example, and could see other researchers wanting to know how communication networks here have looked over time, what phrases come up most frequently, etc.).

If there isn't a process in place, but you'd be open to the idea, I'd be really happy to help with this. I'm currently finishing a doctorate in psychology and will thereafter be starting work at a university library, and so could also facilitate working with university institutional repositories, if that would be preferable to using the Internet Archive's services.

Thanks for your consideration!

-Jacob
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Archiving the Forums?

Post by dtic »

Not an admin. Aren't the automatic internet archive wayback machine snapshots sufficient? https://web.archive.org/web/*/http://ww ... org/forum/
jlev
Posts: 24
Joined: 04 Mar 2014, 00:52
Number of books owned: 0
Country: USA

Re: Archiving the Forums?

Post by jlev »

I would say not, for two reasons. First, they don't catch everything / they don't always spider the entire site. Second, if the admins are amenable to enabling research on the site, it would be much more straightforward to just make everything accessible directly (this could be automated, too -- there wouldn't need to be constant human action), either through partial database dumps (no user email addresses or passwords, etc.), or through an already-spidered version with all links turned into relative links (such that the site could be viewed in an offline mode on someone's local machine); I suspect that the Wayback Machine is even more difficult to spider, especially if posts include links that resolve back to this domain name (vs. the Wayback Machine's cached versions). That's my thought, but hearing input like yours, dtic, is also why I started the conversation :)

On another note, I've really admired your work here on the forums!
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Archiving the Forums?

Post by duerig »

I help Scann manage the forums. If somebody were interested in getting periodic backups of the current database (or just the post tables), I think we could pass them along pretty easily. The final say is down to Scann, though. And there would have to be somebody who wanted the backups. :-)

-D
jlev
Posts: 24
Joined: 04 Mar 2014, 00:52
Number of books owned: 0
Country: USA

Re: Archiving the Forums?

Post by jlev »

Hi, duerig! Ha, in that case, I'd like a copy! :)
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Archiving the Forums?

Post by daniel_reetz »

This is an important topic - and one dear to my heart.

One important consideration: we need to be careful about backing up or sharing the databases directly, because people's private messages and other non-public data are in there.

jlev, did this discussion ever go further? Do you want to help out a bit with our archiving for the long term?
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Archiving the Forums?

Post by duerig »

There was some followup in email. Basically, we need a script to pull out the public information from the database periodically and save it off. This is different from a normal backup because we only want to pull some information from the database. I talked about this some with Jacob, but he hasn't had time as yet to work on this script. I haven't had the time to work on it myself either.

If there is a member of the community who wants to write a cron job to pull the public information from the database and stuff it into an S3 bucket, I'd be happy to work with them and then set up the script to run automatically.
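
For anyone picking this up, here is a minimal sketch of what such an export script might look like, in Python. It uses an in-memory SQLite table as a stand-in for the forum database. The table and column names (`posts`, `poster_name`, `post_visibility`, and so on) are assumptions for illustration, not the forum's actual schema, and the upload step is left out entirely.

```python
import json
import sqlite3

# Hypothetical column list: only fields that are already publicly visible.
# Nothing from user, e-mail, password, or private-message tables is read.
PUBLIC_COLUMNS = ("post_id", "topic_id", "poster_name", "post_time", "post_text")


def export_public_posts(conn, out_path):
    """Write every visible post to out_path as JSON Lines; return the count."""
    query = (
        "SELECT " + ", ".join(PUBLIC_COLUMNS) + " FROM posts "
        "WHERE post_visibility = 1 ORDER BY post_id"
    )
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for row in conn.execute(query):
            out.write(json.dumps(dict(zip(PUBLIC_COLUMNS, row))) + "\n")
            count += 1
    return count


def make_demo_db():
    """Build a tiny stand-in database (assumed schema) for trying the export."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (post_id INTEGER, topic_id INTEGER, poster_name TEXT,"
        " post_time INTEGER, post_text TEXT, post_visibility INTEGER)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, 10, "alice", 1400000000, "First post", 1),
            (2, 10, "bob", 1400000100, "Hidden post", 0),
            (3, 11, "alice", 1400000200, "Another post", 1),
        ],
    )
    return conn
```

Selecting an explicit list of columns, rather than dumping whole tables, is one way to make sure private fields never reach the export file. A cron entry could run something like this on a schedule and hand the resulting file to whatever uploader is chosen.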

-Jonathon Duerig
jlev
Posts: 24
Joined: 04 Mar 2014, 00:52
Number of books owned: 0
Country: USA

Re: Archiving the Forums?

Post by jlev »

Hi Daniel and Jonathon,

I just remembered to come back to this, after a time out of contact. I hope that this message finds you both well!

I am still interested in contributing to this. I looked into it a bit more today, and found out about PHPBB-Static, which was recommended from here. That second linked page lists several other options for archiving a copy of a BBForum such as this, including using wget or httrack, which is what I originally had in mind, but which would presumably use a lot of bandwidth unless it were done on an offline copy. PHPBB-Static looks like the most promising way to me currently, but I haven't used it yet.

- Jacob
jlev
Posts: 24
Joined: 04 Mar 2014, 00:52
Number of books owned: 0
Country: USA

Re: Archiving the Forums?

Post by jlev »

Ideally, in my mind, the archive would comprise a static HTML mirror of the entire forum, with internal URLs changed to be relative URLs within the archive. In addition, it could be useful to create a partial database dump of post content for anyone wanting to do, say, text analyses in the future.
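
To make the relative-URL idea concrete, here is a rough sketch in Python. The forum root `http://forum.example.org/forum/` and the page paths are placeholders I made up, and the regex only handles double-quoted `href` attributes; a real mirror would more likely use a proper HTML parser or let a mirroring tool do the conversion.

```python
import posixpath
import re

# Hypothetical forum root; the real domain would go here.
FORUM_ROOT = "http://forum.example.org/forum/"

HREF_RE = re.compile(r'href="([^"]*)"')


def to_relative(href, page_path, root=FORUM_ROOT):
    """Make href relative to page_path if it points inside root.

    page_path is the location of the page being rewritten, given
    relative to the archive root (for example "topics/t1.html").
    Links outside root are returned unchanged.
    """
    if not href.startswith(root):
        return href
    rest = href[len(root):]
    # Split off any query string or fragment so only the path is rewritten.
    match = re.match(r"([^?#]*)(.*)", rest, re.S)
    path, suffix = match.group(1), match.group(2)
    start = "/" + posixpath.dirname(page_path)
    return posixpath.relpath("/" + path, start) + suffix


def rewrite_links(html, page_path, root=FORUM_ROOT):
    """Rewrite every double-quoted href in html that points inside root."""
    return HREF_RE.sub(
        lambda m: 'href="%s"' % to_relative(m.group(1), page_path, root),
        html,
    )
```

With links rewritten this way, the mirror can be unpacked anywhere and browsed offline, and posts that link to other posts keep pointing inside the archive rather than back at the live domain.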

One issue is that users have the right to delete their posts on the forum, but would not be able to delete them from an archive in the same way. Having said that, this is already the case with Internet Archive mirrors.