This digital history project used a natural language processing program written in R, backed by a database, to build a new and useful index of this corpus of digitized content despite OCR-related errors. The program extracted the names of every person, location, and organization entity that appeared in each edition. Each entity was cataloged in the database and linked to the editions of the newspaper in which it appeared. The database was then published on a public website so that other researchers can use it.
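The cataloging step described above can be sketched as a small relational schema. This is only an illustration in Python with SQLite, not the project's R program: the table names, column names, the `build_index` function, and the sample entities and dates are all assumptions invented for the example.

```python
import sqlite3


def build_index(editions):
    """Catalog entities and link each one to the editions it appeared in.

    `editions` maps an edition date to a list of (name, type) pairs.
    The schema is hypothetical, not the project's actual database design.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE edition (
            id INTEGER PRIMARY KEY,
            published TEXT UNIQUE
        );
        CREATE TABLE entity (
            id INTEGER PRIMARY KEY,
            name TEXT,
            type TEXT,
            UNIQUE (name, type)
        );
        CREATE TABLE entity_edition (
            entity_id INTEGER REFERENCES entity(id),
            edition_id INTEGER REFERENCES edition(id),
            PRIMARY KEY (entity_id, edition_id)
        );
        """
    )
    for published, entities in editions.items():
        conn.execute(
            "INSERT OR IGNORE INTO edition (published) VALUES (?)", (published,)
        )
        edition_id = conn.execute(
            "SELECT id FROM edition WHERE published = ?", (published,)
        ).fetchone()[0]
        for name, etype in entities:
            # Each distinct (name, type) pair is stored once, however often it recurs.
            conn.execute(
                "INSERT OR IGNORE INTO entity (name, type) VALUES (?, ?)",
                (name, etype),
            )
            entity_id = conn.execute(
                "SELECT id FROM entity WHERE name = ? AND type = ?", (name, etype)
            ).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO entity_edition VALUES (?, ?)",
                (entity_id, edition_id),
            )
    conn.commit()
    return conn


# Invented sample data, for illustration only.
sample = {
    "1900-01-04": [("Jane Doe", "person"), ("Exampleville", "location")],
    "1900-01-11": [("Jane Doe", "person")],
}
index = build_index(sample)
```

Storing the entity-to-edition link in its own table means an entity that recurs across many editions is recorded once and referenced many times.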
The resulting index, or finding aid, gives researchers a way to access the Equity beyond full-text searching. People, locations, and organizations appearing in the Equity are listed on the website, and each one links to a page listing every issue in which that entity appeared, along with other entities that may be related to it. Entities misspelled because of OCR errors are listed alongside correctly spelled ones so that researchers can take them into account.
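As a rough illustration of what one entity page holds, the Python sketch below builds, for each entity, the list of issues it appeared in and a list of possibly related entities. The source does not say how relatedness is determined; this sketch assumes it means appearing in the same issue. The function name `build_entity_pages` and the sample names and dates are invented for the example.

```python
from collections import Counter, defaultdict


def build_entity_pages(editions):
    """Return, per entity, its issues and the entities it co-occurs with.

    `editions` maps an issue date to the list of entity names found in it.
    Relatedness is ASSUMED here to mean co-occurrence in the same issue.
    """
    issues = defaultdict(set)
    for published, entities in editions.items():
        for entity in entities:
            issues[entity].add(published)

    pages = {}
    for entity, dates in issues.items():
        shared = Counter()
        for published in dates:
            for other in set(editions[published]):
                if other != entity:
                    shared[other] += 1
        pages[entity] = {
            "issues": sorted(dates),
            # Most frequently co-occurring entities first, ties broken by name.
            "related": sorted(shared.items(), key=lambda kv: (-kv[1], kv[0])),
        }
    return pages


# Invented sample data; "Exarnpleville" stands in for an OCR misreading.
sample = {
    "1900-01-04": ["Jane Doe", "Exampleville"],
    "1900-01-11": ["Jane Doe", "Exarnpleville"],
    "1900-01-18": ["Jane Doe", "Exampleville"],
}
pages = build_entity_pages(sample)
```

In this sketch the misread name keeps its own page rather than being merged or discarded, so the issue it appeared in remains discoverable, in line with the approach of listing OCR variants alongside correct spellings.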
Rendering the text files of each scanned newspaper into entities and indexing them in a database lets researchers interact with the newspaper's content by entity name and type, rather than only as a set of large text files.
GitHub repository for this project.