About 88 million pages of original handwritten documents spanning the past three and a half centuries line the tiled halls of an unassuming 16th-century trading house in the middle of Seville, Spain. They are stored there only partly transcribed, and some are almost indecipherable. A few were carried back on armadas from the Americas, and a few have been scanned and digitised.
These documents hold answers and context for innumerable questions about the Conquistadors, European history, New World contact and colonialism, politics, law, economics and ancestry. Unfortunately, hardly any of these carefully kept pages have been read or interpreted since they were written and brought to Seville centuries ago, and most of them probably never will be.
Not all hope is lost, however: a researcher from the Stevens Institute of Technology is trying to get computers to read these documents while they are still legible. Fernando Perez-Cruz, a Stevens computer science professor, asks, “What if there was a machine, or a software, that could transcribe all of the documents?”
Perez-Cruz, whose expertise lies in machine learning, also says, “What if there was a way to teach another machine to combine into groups those 88 million pages and convert them into searchable text categorised into topics? Then we can start understanding the themes in those documents and then will be aware where to look in this storehouse of documents for our answers”. Perez-Cruz is working on both parts of this two-fold approach, which, if it succeeds, could be applied to many other emerging data-analysis problems, such as autonomous transport and the analysis of medical data.
Pricing on Amazon, medical study, text reading machines
Perez-Cruz, a veteran of Amazon, Bell Labs, Princeton University and University Carlos III of Madrid, has built a career around scientific challenges. In 2016 he joined Stevens, adding to the growing strength of the university's computer science department. Stevens aims to make this a strong research department, which in turn is drawing more talent and resources, and Perez-Cruz is using this to his advantage in his work. At Stevens he is currently developing what is called ‘interpretable machine learning’: systematised intelligence that humans can still work with.
For the problem of historical document analysis, Perez-Cruz hopes to develop improved character-recognition engines. Using short excerpts of documents written in varied styles, which experts have already transcribed, he aims to teach software to recognise both the shapes of characters and the frequently correlated relationships between letters and words, building up over time a recognition engine that is highly accurate. The remaining question, he says, is how much data, that is, how much transcribed handwriting, is sufficient to do this well. Work on this concept is still in progress.
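The idea of learning associations between letters from expert transcriptions can be illustrated with a character bigram model. The sketch below is not Perez-Cruz's engine, which the article does not describe in detail; it is a minimal illustration, assuming a recogniser has already proposed several candidate readings of a word. The sample transcriptions and the function names `train_bigrams`, `log_score` and `pick_reading` are made up for this example.

```python
import math
from collections import Counter


def train_bigrams(transcriptions):
    """Count adjacent character pairs in expert transcriptions."""
    pair_counts = Counter()   # how often character a is followed by b
    first_counts = Counter()  # how often character a is followed by anything
    alphabet = set()
    for text in transcriptions:
        padded = "^" + text.lower() + "$"  # mark start and end of the text
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts, len(alphabet)


def log_score(word, model):
    """Log-probability of a word under the bigram model (add-one smoothing)."""
    pair_counts, first_counts, size = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + size))
    return total


def pick_reading(candidates, model):
    """Choose the candidate reading the model finds most plausible."""
    return max(candidates, key=lambda word: log_score(word, model))


# Invented sample transcriptions, for illustration only.
samples = ["el rey", "la reina", "real audiencia", "sevilla"]
model = train_bigrams(samples)

# Suppose the recogniser cannot decide whether the middle letter is "e" or "c".
print(pick_reading(["rcy", "rey"], model))
```

Here the model prefers "rey" because the pairs "re" and "ey" occur in the transcribed samples while "rc" and "cy" do not. More transcribed text gives better pair statistics, which is one way to read the open question of how much transcription is enough.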
Perez-Cruz believes that, although this is a technical challenge, it may well be achievable. He is even more fascinated by the next part: organising large quantities of transcribed material into topics that can be taken in at a glance. Once these three and a half centuries of data have been transcribed, he says, the machine should be able to give us information right away, learning by itself from where the words and sentences occur. This is what he calls topic modelling.
A key link: Systematically grouping large data into easily accessible topics
Once enough data has been fed into the algorithm, it begins to spot the most important identifying and organising patterns in the data. Very often, cues from human researchers are vital and are actively sought. Perez-Cruz notes that eventually we might discover that, say, a few hundred topics or descriptions run through the whole archive, and then all of a sudden an 88-million-document problem has been scaled down to 200 or 300 ideas.
If algorithms can consolidate 88 million pages of text into a few hundred groups, historians and researchers stand to gain enormously in organisation and efficiency when deciding which documents, themes or time periods to search, review and analyse in a formerly unmanageable archive. The same approach could be used to find styles, themes and concealed meaning in other vast, unread databases.
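The consolidation described above can be sketched with latent Dirichlet allocation (LDA), a widely used topic-modelling technique. The article does not say which algorithm Perez-Cruz uses, so LDA is an assumption here, and the tiny tokenised documents, the function names `fit_lda` and `top_words`, and the parameter values are all invented for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenised documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # words per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total words per topic
    assignments = []

    # Start from a random topic for every word.
    for d, doc in enumerate(docs):
        topics = [rng.randrange(num_topics) for _ in doc]
        assignments.append(topics)
        for word, k in zip(doc, topics):
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Resample: topics common in this document and for this word win.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Most frequent words of each topic, as a short human-readable summary."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Invented toy documents standing in for transcribed pages.
docs = [
    "ship cargo silver port ship".split(),
    "cargo port fleet silver".split(),
    "court judge law petition".split(),
    "law petition court crown judge".split(),
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
print(top_words(topic_word))
print(doc_topic)
```

Each document ends up described by a handful of topic counts rather than its full text, and each topic by its most frequent words. At archive scale the same idea would use a few hundred topics instead of two, and would need a far more efficient implementation than this sketch.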
He concludes that one begins with a huge quantity of unorganised data, and that to understand what the data contains and how it can be used, some kind of structure has to be brought to it. Once the data is understood, one can begin to read it in a particular way, see more clearly which questions to ask of it, and draw better conclusions.