Searching documents using artificial intelligence
Over and over again, the AI software retrieves familiar patterns and applies them to new scenarios requiring a decision. That’s how it learns – and how it gets better and better.
06.2021 | Text: Thorsten Rienth
Thorsten Rienth writes as a freelance journalist for AEROREPORT. In addition to the aerospace industry, his technical writing focuses on rail traffic and the transportation industry.
Suppose someone in the development department urgently needs a certain piece of information within a clearly defined context—say, information on a specific material for a specific turbine blade in a specific engine program. “That person would first gather the technical documents on the material, component and engine program from previous years and then wade through dozens of documents, each up to 20 or 30 pages long,” says Thomas Piprek, a specialist in PLM (product lifecycle management) data exchange at MTU Aero Engines. By the end of the day, the developer might have a folder full of PDFs with key points highlighted or brief notes in the margins. Or that folder might be empty, simply because the technical documents turned up nothing relevant.
Pre-sorting and classifying documents
“The trove of information contained in PLM technical documents is enormous,” Piprek says. But so is the sheer volume of documents, unfortunately. “The more specific the question, the more time-consuming the search for answers,” he says. But what if the documents came to the developer’s desk already pre-sorted? If relevant resources were separated from irrelevant ones? What if documents likely to lead to the answer were filtered out from those likely to lead nowhere?
This might save hours, if not days, of work for the developer. “In the best-case scenario,” Piprek explains, “the computer would provide them with a selection of those technical documents that are most likely to help with the question—in a matter of seconds.” Piprek set out to develop a program that would do just that. One that uses artificial intelligence, or AI, to pre-sort and classify documents.
On-the-job training for the AI software model
Piprek’s work began with an ordinary off-the-shelf AI software program. With a few tweaks, he and his team adapted it to the characteristics of the technical documents. “A good way to think of the package is as the brain of a toddler: huge potential, but still at a low level.”
Children are encouraged to develop this potential in school, but AI software needs to be trained. In Piprek’s project, the software applies deep learning techniques and is fed with countless texts from MTU’s PLM system. From a software perspective, it’s kind of like on-the-job training. “We know precisely what knowledge is available in the PLM system and how it is structured. That provides us with an excellent basis for training and evaluating the AI.” Such background knowledge is necessary because the AI relies on verified training data as it runs through the decision scenarios.
“In simple terms, what happens is that the AI module first builds itself a statistical model,” Piprek explains. “The next step is for us to give it rules for evaluating the content in a specific way.” Equipped with this combination of statistics and rules, the software gradually figures out a way to make sense of the documents. Over and over again, it retrieves familiar patterns and applies them to new scenarios requiring a decision. By matching the calculated result with the desired target result, the AI model learns. With each training run, it gets a little smarter, a little faster, a little more accurate. The more data available and the higher its quality, the better the training. The technical term for what emerges is a neural network.
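The learning loop Piprek describes—calculate a result, compare it with the desired target result, and correct the model—can be illustrated with a deliberately tiny sketch. MTU’s actual system is not public; the documents, labels, and perceptron-style classifier below are illustrative stand-ins for the principle, not the real pipeline.

```python
# Minimal sketch of the training principle: the model makes a prediction,
# the prediction is compared with the desired target label, and the model's
# weights are nudged toward the target. All data here is made up.
from collections import defaultdict

# Tiny labeled "training data": document text -> relevant (1) / irrelevant (0)
training_data = [
    ("turbine blade coating nickel alloy", 1),
    ("blade material fatigue test results", 1),
    ("cafeteria menu for the week", 0),
    ("parking garage maintenance notice", 0),
]

weights = defaultdict(float)  # one weight per word (bag-of-words model)

def predict(text):
    score = sum(weights[w] for w in text.split())
    return 1 if score > 0 else 0

# Perceptron-style training runs: match the calculated result against the
# target result and correct the weights on every mistake.
for _ in range(10):  # training runs ("epochs")
    for text, target in training_data:
        error = target - predict(text)
        if error != 0:
            for word in text.split():
                weights[word] += error  # nudge weights toward the target

print(predict("nickel alloy blade report"))  # -> 1 (classified as relevant)
```

With each pass over the data the mistakes shrink—the toy version of the article’s “a little smarter, a little faster, a little more accurate” with every training run. A real neural network replaces the single weight layer with many, but the compare-and-correct loop is the same.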
In this network, the software searches for terms that it recognizes from a similar context or that it identifies as synonyms. At some point, it will even be able to make sense of the content of documents that have little structure—and it will no longer be confused by barely legible letters on a scan of yellowed paper.
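Recognizing that two terms come from a similar context can be sketched with simple co-occurrence statistics: two words count as related when the words around them overlap. The three-sentence corpus below is invented for illustration; production systems learn dense embeddings from millions of documents, but the idea is the same.

```python
# Sketch of context-based term matching: terms are related when they appear
# in similar contexts. Corpus and terms are illustrative toy data.
import math
from collections import Counter

corpus = [
    "the blade coating resists heat",
    "the blade layer resists heat",    # "layer" used like "coating"
    "order lunch from the canteen today",
]

def context_vector(term):
    """Count the words that co-occur with `term` in the same sentence."""
    vec = Counter()
    for sentence in corpus:
        words = sentence.split()
        if term in words:
            vec.update(w for w in words if w != term)
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "coating" and "layer" share almost all their context words...
print(cosine(context_vector("coating"), context_vector("layer")))  # high
# ...while "coating" and "lunch" share very little.
print(cosine(context_vector("coating"), context_vector("lunch")))  # low
```

This is also why OCR noise matters less over time: even if a scanned word is mangled, the surrounding context words still place it near its clean neighbors.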
Precise overview: Thomas Piprek’s AI software has already scanned more than 15,000 documents. This training data helps the software build up its neural network as part of a continuous learning process.
The more documents to search, the more time saved
To put the model to the test, Piprek set about creating a large-scale proof of concept. He ran the software on about 15,000 documents. The idea was for its heuristics and algorithms to gather terms and data that initially did not seem obvious and could not be logically classified at first glance. “Depending on the specific question, the software came up with a hit rate of 90 to 95 percent,” Piprek says. “That’s already pretty good.”
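A “hit rate” in this sense is simply the share of documents the software sorts the same way a human expert would. A minimal illustration, with made-up predictions and expert labels:

```python
# Illustrative hit-rate calculation: the share of documents where the
# classifier agrees with an expert's relevant/irrelevant judgment.
# Predictions and expert labels below are invented toy data.
def hit_rate(predictions, expert_labels):
    hits = sum(p == e for p, e in zip(predictions, expert_labels))
    return hits / len(expert_labels)

predictions   = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
expert_labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # one disagreement

print(f"{hit_rate(predictions, expert_labels):.0%}")  # -> 90%
```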
But naturally he wants to get these rates as close to 100 percent as possible and make the application available to other projects. “Once trained for our purposes, we could run the AI model on other partially structured or unstructured data sources outside the PLM system. Take the project drive for a new turbine blade development, for example.” Wherever there are a lot of documents, the model would save huge amounts of time—regardless of whether the input formats are scans, PDFs or e-mails.