Online print archives for newspapers

or any other type of publication.

While many publications have already digitized their archives, they lack a customizable solution for making them available to their readers.

This open-source prototype aims to let any publication, from local to international, place its print media archives online, focusing on three essential features:

  1. Visualization
  2. Text Extraction
  3. Search

An easy-to-use interface gives readers a responsive experience: no file downloads are required, and they can search the archive and read any page directly.

Feature Highlights

Mobile and Desktop

Built on native web technologies and designed to be responsive from the start, it adapts to mobile and future devices and works anywhere there is a browser. The underlying approach also carries over to other platforms, such as native mobile apps and VR/AR, making it a future-proof investment.

Search and Filter

You can search the archives by content and filter by date range. Search can be fine-tuned to the publication's language and to other preferences, such as catching misspellings and synonyms.
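
As a rough illustration of how a content search with a date filter might be expressed against Elasticsearch, the sketch below builds a query body using the standard `bool`, `match` and `range` clauses. The function name and the field names `content` and `published_on` are assumptions made for this example, not the prototype's actual schema.

```python
from datetime import date
from typing import Optional


def build_archive_query(text: str,
                        start: Optional[date] = None,
                        end: Optional[date] = None,
                        fuzzy: bool = True) -> dict:
    """Build an Elasticsearch query body: full-text match plus optional date filter.

    The field names "content" and "published_on" are hypothetical.
    """
    match = {"query": text}
    if fuzzy:
        # "AUTO" lets Elasticsearch choose an edit distance from the term
        # length, which tolerates small misspellings.
        match["fuzziness"] = "AUTO"

    bool_query = {"must": [{"match": {"content": match}}]}

    date_range = {}
    if start is not None:
        date_range["gte"] = start.isoformat()
    if end is not None:
        date_range["lte"] = end.isoformat()
    if date_range:
        # Filter clauses narrow the results without affecting relevance scores.
        bool_query["filter"] = [{"range": {"published_on": date_range}}]

    return {"query": {"bool": bool_query}}
```

Calling `build_archive_query("strike", date(1974, 4, 1), date(1974, 4, 30))` yields a body that could be sent to an index's search endpoint. Language-specific behavior and synonym matching would normally be configured on the index analyzer rather than in the query itself.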

Social Sharing

Every single page of your print archive becomes a social object, with optimized embeds, ready to be discovered and shared on social networks.
You can also opt to have the archive indexed by search engines.
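
One common way to make a page shareable is to emit Open Graph meta tags in its HTML head, plus a `robots` tag when search engine indexing is switched off. The following is a minimal sketch of that idea; the function name, parameters and example URLs are illustrative assumptions, not the prototype's actual code.

```python
from html import escape


def page_meta_tags(title: str, page_url: str, image_url: str,
                   indexable: bool = True) -> str:
    """Return Open Graph <meta> tags for one archive page (hypothetical helper)."""
    properties = {
        "og:type": "article",
        "og:title": title,
        "og:url": page_url,
        "og:image": image_url,
    }
    # Escape attribute values so quotes or ampersands in a title cannot break the markup.
    tags = [
        f'<meta property="{name}" content="{escape(value, quote=True)}">'
        for name, value in properties.items()
    ]
    if not indexable:
        # Asks search engines not to index this page.
        tags.append('<meta name="robots" content="noindex">')
    return "\n".join(tags)
```

For example, `page_meta_tags("Front page, 25 April 1974", "https://example.org/pages/1", "https://example.org/pages/1.jpg")` returns four tags that a social network's crawler can use to build a preview card.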

Monetization

A publication's print archive can and should be monetized: it becomes added value offered to subscribers, easily searched and shared with others.

Cloud Hosting

Developed to be a cost-effective solution, even for smaller publications. Choose your own hosting provider for the application; files are stored on Amazon S3 by default, with options to use other cloud storage providers such as Google Cloud Storage and DreamObjects.

API

A basic API is provided for authenticated bulk uploads and some other functionality. It can easily be extended to support more functions and to integrate with your publication's authentication systems, for example to restrict access to subscribers or to gate other features.

Text Extraction

Your print scans are processed with state-of-the-art text recognition, powered by Microsoft Azure Cognitive Services Computer Vision. Options will also be available to use Google Cloud Vision or Tesseract.

Crowdsourced Text Correction

For text that isn't recognized properly, you can crowdsource corrections, engaging your readers to help fix any misspellings.

Documentation

Documentation is provided to help you set up the software and integrate it with your infrastructure, outlining its main features and how to adapt it to your archive and readers.

Open Source

Customize the software to your needs and contribute back. Developed on proven technologies such as Ruby on Rails, Leaflet and Elasticsearch, with the source code available on GitHub under an MIT license. Localizations are available in English and Portuguese. Feel free to use it on your archives.


Contact