Large Scale Document Processing & Tagging
How do you read thousands of documents a day and extract pertinent information?
Problem Statement
Mining companies file thousands of public filings on a daily basis. The most important of these are often annual or quarterly reports, press releases, and technical reports about specific mining projects. These reports contain critical information for investors and stakeholders, including details about mining projects, mineral resources and reserves, production, production costs, environmental impact, and the business case for the mine.
However, the increasing volume of these filings makes it nearly impossible to categorize and sort through them efficiently. Without an effective system for categorizing and sorting through these public filings, important information may be missed or overlooked, leading to potential risks and missed opportunities. Additionally, manually reviewing each filing is time-consuming and can lead to inconsistencies in analysis.
As a company, Prospector needed a system or tool that could efficiently categorize and sort through public filings, with a focus on mining technical reports like the 43-101 (the Canadian standard report) and JORC (the Australian standard report), providing accurate and timely information for investors and stakeholders.
Defining the Approach
How do you read millions of documents?
For each document, we needed a way to filter them by various factors such as minerals, companies, mines, environmental factors, social issues, acquisition details, deposit types, stage, mining methods, start and end points of sections within documents, and type of document.
We also needed readily available summaries of the documents and a means to create a backlog of review, collection, and quality-control tasks for analysts. The ultimate process would summarize documents, then classify them, compare them against existing data, extract KPIs, and transform the data for distribution to data-consuming entities and the Prospector end-user platform.
To satisfy these demands, we decided to take a human-centered artificial intelligence (AI) approach, in which AI models recommend data to extract and analyst review cycles are leveraged to validate results and retrain the models.
Below is a high-level diagram demonstrating the human-in-the-loop nature of the approach, where Machine Learning/Natural Language Processing (ML/NLP) models recommend data to extract. High-confidence items are submitted directly to our database, while low-confidence items are reviewed by specialists to help retrain the models. This allows mining analysts to focus on complex data collection as the AI system continually improves, taking over and shortening more and more data collection tasks.
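To make the routing concrete, here is a minimal sketch of the confidence gate at the heart of this loop. The threshold value and the submit/queue functions are illustrative assumptions, not the actual Prospector implementation.

```python
# Minimal sketch of human-in-the-loop routing: high-confidence extractions go
# straight to the database, low-confidence ones become analyst review tasks
# (and, once reviewed, labeled training data for retraining).
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.90  # assumed cutoff; tuned per model in practice

@dataclass
class Extraction:
    document_id: str
    field: str         # e.g. "primary_mineral"
    value: str         # e.g. "copper"
    confidence: float  # model confidence in [0, 1]

def submit_to_database(extraction: Extraction) -> None:
    # Hypothetical persistence call.
    print(f"DB <- {extraction.field}={extraction.value}")

def queue_for_analyst_review(extraction: Extraction) -> None:
    # Hypothetical task queue; reviewed items feed model retraining.
    print(f"Review queue <- {extraction.document_id}: {extraction.field}")

def route(extraction: Extraction) -> str:
    if extraction.confidence >= CONFIDENCE_THRESHOLD:
        submit_to_database(extraction)
        return "auto-submitted"
    queue_for_analyst_review(extraction)
    return "queued for review"

print(route(Extraction("doc-123", "primary_mineral", "copper", 0.97)))
print(route(Extraction("doc-124", "primary_mineral", "nickel", 0.61)))
```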
Documenting Desired Workflow
What are the inputs and outputs we want to see?
Once we decided to use a human-centered AI approach, we walked through the entire process with our software engineers, data scientists, and analysts. We documented a workflow that captured the initial steps of summarizing each document and preparing it for human-centered AI analysis.
Then we outlined the classification steps to ensure that each document went through the appropriate process. For example, quarterly and annual filings go through different processes than mining technical reports. The output of this classification then needed to be compared against previously collected data so that appropriate linkages could be made. Based on these comparisons, review and collection tasks needed to be outlined for analysts. Finally, all of the collected and aggregated data needed to be transformed and made available both internally and externally to Prospector.
Below is a high-level summary diagram created as the output of the workflow design process we undertook, covering classification, comparison, and KPI extraction as they relate to the documents filed by publicly traded mining companies.
Preprocessing Documents
How do we get the documents ready for human-centered AI?
Once we had outlined our goals, we began by developing mechanisms for preprocessing documents. We quickly realized that it would be more efficient to obtain documents from a single provider rather than from each of the stock exchanges globally. To meet this need, we established a contract with Factset to provide all filings from mining companies on a daily basis.
As these documents were received by Prospector, we stored them in AWS S3 along with metadata such as publication date, headline, filing tags from the exchange, and the file path of the document in our S3 storage.
We then sent the documents through a data extraction process using AWS Lambda functions, which stored and indexed the data in a Snowflake data warehouse, where the metadata could be compared to the extracted text more easily in SQL.
Finally, we extracted page images from each document and stored them in AWS S3 to allow for more complex image analysis at a later date, with the file-path metadata indexed in the MySQL DB as before. This left us with text stored for analysis, entire PDFs that could be returned to various applications, images of pages stored for analysis, and key metadata already in place to assist with the classification process.
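Below is a simplified sketch of what one such extraction step might look like as an AWS Lambda handler, assuming the triggering event carries the S3 location and provider metadata of a newly received filing. The bucket layout, event fields, and use of the pypdf library are illustrative assumptions; the real pipeline also renders page images and writes into the warehouse.

```python
# Sketch of a text-extraction Lambda: fetch the stored PDF from S3, extract
# text page by page, and assemble the metadata record to be indexed.
import io
import json

import boto3
from pypdf import PdfReader  # third-party; bundled with the Lambda deployment

s3 = boto3.client("s3")

def handler(event, context):
    bucket = event["bucket"]  # e.g. "prospector-filings" (assumed name)
    key = event["key"]        # path of the stored PDF in S3

    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Extract text per page so specific pages can be indexed and cited later.
    reader = PdfReader(io.BytesIO(raw))
    pages = [(i + 1, page.extract_text() or "") for i, page in enumerate(reader.pages)]

    record = {
        "file_path": f"s3://{bucket}/{key}",
        "publication_date": event.get("publication_date"),
        "headline": event.get("headline"),
        "exchange_tags": event.get("exchange_tags", []),
        "page_count": len(pages),
    }
    # In the real pipeline this record and the page text are written to the
    # data warehouse; here we simply emit the record.
    print(json.dumps(record))
    return record
```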
Below is a diagram highlighting the four types of data created and stored for each document once it is received from the document provider: the document itself, its text, images of its pages, and the metadata provided by the document provider.
Classifying Documents
How do we determine which models should run on which documents?
After documents have been preprocessed, we need to classify them appropriately. The first step is to use the metadata collected from the document data provider. For many documents, this data can identify items like the company that filed the document and the type of document. Once this pass is done, we send the documents to the appropriate queue.
A second-level analysis then uses the text to classify documents that could not be classified from metadata alone. A key example is 43-101 technical reports filed in Canada versus JORC technical reports filed in Australia. In Canada, a specific metadata tag and a consistent headline naming convention make these documents easy to identify. In Australia, however, we need to run models that look for citations of the JORC standard within the document text in order to classify them.
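A minimal sketch of this two-pass logic is shown below. The metadata tag names and the JORC citation pattern are illustrative assumptions, not the exact rules or models used in production.

```python
# Two-pass report-type classification: try exchange metadata first, then
# fall back to scanning the text for citations of the JORC standard.
import re
from typing import Optional

# Assumed citation pattern; real models are more robust than a single regex.
JORC_CITATION = re.compile(
    r"\bJORC\s+Code\b|\b(?:2004|2012)\s+Edition\s+of\s+the\s+JORC\b",
    re.IGNORECASE,
)

def classify_report_type(metadata: dict, text: str) -> Optional[str]:
    # First pass: metadata (reliable for Canadian 43-101 filings).
    headline = (metadata.get("headline") or "").lower()
    tags = [t.lower() for t in metadata.get("exchange_tags", [])]
    if "43-101" in headline or any("43-101" in t for t in tags):
        return "43-101"
    # Second pass: text analysis (needed for Australian JORC reports).
    if JORC_CITATION.search(text):
        return "JORC"
    return None  # fall through to further models or analyst review

print(classify_report_type(
    {"headline": "NI 43-101 Technical Report", "exchange_tags": []}, ""))
print(classify_report_type(
    {"headline": "Resource update"}, "reported under the 2012 Edition of the JORC Code"))
```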
In addition to type classification, we also leverage manual work previously done by analysts on technical reports. They established datasets capturing where each section of a document starts and ends, the minerals discussed, the authors of the report, and the project location. Using this training data along with the text of the documents, we classify the documents further and apply tags, so you can filter all reports related to a specific mineral, country, author, or other mining-specific criteria.
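As an illustration of how such analyst-built training data can drive tagging, the sketch below trains a simple multilabel tagger using TF-IDF features and one-vs-rest logistic regression. The toy texts and label set stand in for the real analyst datasets, and the production models may be considerably more sophisticated.

```python
# Multilabel document tagger trained from analyst-labeled examples:
# each document can carry several tags (mineral, country, etc.) at once.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for analyst-labeled training data.
train_texts = [
    "Drilling confirmed a high-grade copper porphyry deposit in Chile.",
    "The gold resource estimate for the Western Australia project was updated.",
    "Lithium brine concentrations support an expanded reserve in Argentina.",
]
train_tags = [["copper", "chile"], ["gold", "australia"], ["lithium", "argentina"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(train_tags)  # tags -> binary indicator matrix

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_texts, y)

# Rank tag probabilities for a new filing.
probs = model.predict_proba(["New copper drilling results from the Chilean project."])[0]
for tag, p in sorted(zip(mlb.classes_, probs), key=lambda x: -x[1]):
    print(f"{tag}: {p:.2f}")
```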
Below is a diagram outlining this tagging process, showing how the high-confidence versus low-confidence analyst review loops fit within the context of the ML/NLP models that categorize the documents and then apply various tags. All documents start with the metadata and text from preprocessing and follow this high-level process to better categorize them for action.
Comparing Documents
Does the document talk about any projects we have in our database?
To check whether a document relates to another piece of data we track, like a mine, author, or company, we run the text of the document through a complex model that leverages transformers, named entity recognition, and vector analysis.
A transformer model is used to process the public disclosure documents and identify named entities such as authors, companies, properties, mine locations, and projects mentioned in the text.
The named entities identified by the transformer model are passed through a named entity recognition (NER) algorithm, which classifies the entities into specific categories such as "person," "company," "project," and "location."
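A minimal sketch of this step using an off-the-shelf transformer NER pipeline is shown below. The specific model is an assumption, and its generic labels (person, organization, location) would need a fine-tuned model or additional logic to recover mining-specific categories like "project."

```python
# Extract and categorize named entities with a transformer-based NER pipeline.
# The model choice and label mapping are illustrative assumptions.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

LABEL_MAP = {"PER": "person", "ORG": "company", "LOC": "location"}

text = (
    "The technical report for the Red Lake project was prepared by "
    "Jane Smith of Acme Mining Corp., covering operations in Ontario."
)

for ent in ner(text):
    category = LABEL_MAP.get(ent["entity_group"], "other")
    print(f"{ent['word']!r:30} -> {category} (score={ent['score']:.2f})")
```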
Finally, vector analysis is used to represent the named entities as numerical vectors that capture the meaning of, and relationships between, the entities and their context. By applying mathematical operations such as dot products and cosine similarity to the vectors, the algorithm determines which entities are semantically most relevant and thus represent the correct list of authors, the unique project, and the owning company.
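Below is a sketch of how the vector comparison might work for the author-matching case, assuming a sentence-embedding model and a similarity threshold that are illustrative rather than the production choices.

```python
# Embed an extracted author mention (with surrounding context, to help
# disambiguate common names) and compare it against known author records
# using cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

mention = "Jane Smith, P.Geo., qualified person for the Red Lake gold project"
known_authors = [
    "Jane Smith, geologist, author of gold technical reports in Ontario",
    "J. Smith, metallurgist, authored copper studies in Chile",
]

mention_vec = model.encode(mention)
scores = [(name, cosine(mention_vec, model.encode(name))) for name in known_authors]
best, score = max(scores, key=lambda x: x[1])
print(f"Best match: {best} (similarity={score:.2f})")
# A similarity above an empirically chosen threshold (e.g. 0.8) would link the
# report to the existing author record; anything lower goes to analyst review.
```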
Below is a high-level walkthrough of how this technique is applied to identify whether the author of a technical report has written other technical reports in our database.
End Result
The output of this entire document process is a series of review and data collection tasks, along with the data needed to complete them quickly. With the model-recommended sections, pages, and tags, plus the linked projects and companies, analysts now know exactly which documents and pages they need to review and collect from, minimizing the time it would otherwise take to go through thousands of documents a day.
Below is a summary of all the data captured on a document once it has finished this process. The document itself is stored and ready to be accessed at any time. The document metadata provides general information that allows us to search through headlines and filter by dates. The NLP/ML models for metadata apply a series of tags and categorizations that make it easier to isolate particular documents for analysis. The sections, tables, and figures ML/NLP models identify specific sections of the document and index specific pages with tables and figures that are important for follow-on analysis. Finally, a series of collection and review tasks are automatically created for analysts to collect data from specific pages of the documents.