I noticed this on MyHeritage earlier today: MyHeritage Adds Huge Collection of Historical U.S. City Directories: https://blog.myheritage.com/2020/02/myheritage-adds-huge-collection-of-historical-u-s-city-directories/. There is a lot more information at the link, but I am trying to keep this to around 600 words.
by Esther February 28, 2020
We are pleased to announce the publication of a huge collection of historical U.S. city directories, an effort that has been two years in the making. The collection was produced exclusively by MyHeritage from 25,000 public U.S. city directories published between 1860 and 1960. It comprises 545 million aggregated records consolidated from 1.3 billion records, many of which were similar entries for the same individual. This addition brings the total number of historical records on MyHeritage to 11.9 billion.
The new city directories collection on MyHeritage is a rich source of information for anyone seeking to learn more about their family in the United States in the mid-19th to mid-20th century. The directories contain valuable insights on everyday American life spanning the time period from the Civil War to the Civil Rights Movement.
What are City Directories?
Cities in the United States have been producing and distributing directories since the 1700s as an up-to-date resource to help residents find local individuals and businesses. City directories typically list names (and spouses), addresses, occupations, and workplaces. Sometimes they include additional information.
Thanks to their level of detail, city directories can provide a viable alternative to U.S. census records during non-census years: federal censuses are taken only once every ten years, whereas in many cases city directories were published annually. They can also fill in the gaps where census records were lost or destroyed. In 1921, a fire at the U.S. Department of Commerce destroyed most of the records from the 1890 census. Despite that loss, much of the data can be reconstructed using the 1890 city directories on MyHeritage, which consist of directory books from 344 cities across the country, including 88 of the 100 most populous cities that year.
Unique processing by MyHeritage
The city directories in this collection were published by thousands of cities and towns all over the U.S., and each directory is formatted differently. The huge amount of content and its variety made the project more challenging and required the development of special technology to process the city directories.
We first used Optical Character Recognition (OCR) to convert the scanned images of the directories into text. This process can result in errors in the output, and we created algorithms to detect and correct some of these errors.
Then, we needed to parse the records to identify the different fields in each record: names, occupations, addresses, and more. The differences in formatting between the books presented an additional challenge. Our team employed methods such as Named Entity Recognition (NER) and Conditional Random Fields (CRF) to train an algorithm using a per-book model: for each of the 25,000 books, we manually labeled a sample of the records and used it to train the algorithm to parse that directory. Using this model, the algorithm was able to parse the entire book into a structured index of valuable historical information.
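To make the parsing step more concrete, here is a minimal sketch of the kind of per-token features a CRF tagger might be given for one directory line. This is not MyHeritage's actual code: the feature names, the label set, and the sample line are all assumptions made up for illustration. In practice the features and hand-assigned labels from a book's labeled sample would be passed to a CRF library (sklearn-crfsuite is one option) to train that book's model.

```python
import re

def token_features(tokens, i):
    """Describe token i and its neighbours as a feature dictionary."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        # Hypothetical cue: a letter prefix fused to a house number, e.g. "h5801".
        "house_marker": bool(re.match(r"^[a-z]\d+$", tok)),
        "position": i,
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

def line_to_features(tokens):
    """Build one feature dictionary per token in a directory line."""
    return [token_features(tokens, i) for i in range(len(tokens))]

# Made-up directory line and hand-assigned labels, for illustration only.
tokens = "Kiner Ralph M h5801 Yorkshire av".split()
labels = ["SURNAME", "GIVEN", "GIVEN", "ADDRESS", "ADDRESS", "ADDRESS"]

features = line_to_features(tokens)
```

Each labeled line contributes one list of feature dictionaries and one list of labels. Because every book is formatted differently, a separate model would be trained per book on that book's labeled sample.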
The city directory records for Ralph McPherran Kiner, an American Major League Baseball player and broadcaster, show how our system detected and corrected an OCR error. The 1957 record gives the address incorrectly as 55801 Yorkshire av, whereas the 1958 and 1960 records list it as h5801 Yorkshire av, where the “h” implies that Ralph is the homeowner. We inferred that the first “5” in the 1957 record was an OCR error and should actually be an “h”, and were therefore able to determine that Ralph lived at the same address during these years.
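The inference described for the Kiner records can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method MyHeritage used: the confusion table and the function names are hypothetical, and the rule is simply "prefer a one-character variant if other years' records for the same person agree on it more often".

```python
from collections import Counter

# Characters that OCR might confuse at the start of an address.
# This table is an illustrative assumption, not taken from the article.
OCR_CONFUSIONS = {"5": "h", "1": "l", "0": "o"}

def candidate_readings(address):
    """Yield the address as-is, plus a variant with the first character swapped."""
    yield address
    first = address[:1]
    if first in OCR_CONFUSIONS:
        yield OCR_CONFUSIONS[first] + address[1:]

def reconcile(addresses_by_year):
    """Replace an address with its variant when the variant is seen more often
    across the person's records than the address as originally read."""
    counts = Counter(addresses_by_year.values())
    fixed = {}
    for year, address in addresses_by_year.items():
        best = address
        for variant in candidate_readings(address):
            if counts[variant] > counts[best]:
                best = variant
        fixed[year] = best
    return fixed

# The three addresses quoted in the article.
records = {
    1957: "55801 Yorkshire av",
    1958: "h5801 Yorkshire av",
    1960: "h5801 Yorkshire av",
}
corrected = reconcile(records)
```

With these inputs, the 1957 entry is rewritten to h5801 Yorkshire av because that reading appears twice in the other years, so all three records end up with the same address. An address with no supporting variant in other years is left unchanged.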