ArchiveBox

🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).
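For a sense of what "browsable HTML clone" means in practice, the archive folder ends up looking roughly like the sketch below. The timestamp folder name and file names here are illustrative, not exact:

```
output/
├── index.html            # master index listing every archived link
└── archive/
    └── 1554243136/       # one folder per link, named by bookmark timestamp
        ├── index.html    # per-link detail page
        ├── output.html   # saved HTML
        └── output.pdf    # printed PDF of the page
```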



After installing the dependencies, just pipe some new links into the ./archive command to start your archive.
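For example, assuming the dependencies are installed and you're in the project directory, usage looks something like this (the URL and file path are placeholders):

```bash
# Pipe one or more URLs straight into the archiver:
echo 'https://example.com' | ./archive

# Or pass an exported bookmarks/history file as an argument:
./archive ~/Downloads/bookmarks_export.html
```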



Running ./archive adds only new, unique links into output/ on each run. Because it ignores duplicates and only archives each link the first time it is added, you can schedule it to run on a timer and re-import all your feeds multiple times a day. It runs quickly even if the feeds are large, because it only archives the newest links since the last run. For each link, it runs through all the archive methods: methods that fail save None and are automatically retried on the next run, while methods that succeed save their output into the data folder and are never retried or overwritten by subsequent runs. Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs).
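As a sketch of that timer setup, a crontab entry like the following would re-import a feed twice a day (the install path and feed URL are hypothetical; since duplicates are skipped, re-running is cheap):

```bash
# m h  dom mon dow  command
0 */12 *   *   *    cd /opt/ArchiveBox && curl -s 'https://getpocket.com/users/USERNAME/feed/all' | ./archive
```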



Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open source tool for your web archiving needs, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!



Learn why archiving the internet is important by reading the "On the Importance of Web Archiving" blog post.




People

Name            Role
Nick Sweeting   Creator
