Data extraction – case study
Explore the most relevant information for your needs. Receive recommendations for similar documents that may be useful to you, faster than the blink of an eye.
FROM PROBLEMS TO SOLUTIONS
PROBLEM (A): SCRAPING THE FILES
We already knew how important it is to fully understand the capabilities and limitations of extracting information from specific file types before building an app around it.
Our initial idea for this project was to create a dashboard that informs users of the information matching their needs and aggregates recommendations of similar files containing information related to the user’s context.
By default, operating system APIs do not allow effective search of the information stored in many file types as-is, especially MS Office and PDF files.
The very first thing we did was to collect all the files to analyze.
SOLUTION (A): SCRAPING THE FILES
We settled on a system-wide approach. With the user’s permission, we implemented file indexing that starts from the root path and builds B-trees and graphs containing all the files. This quickly gave us the right data for the next stages of development.
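The indexing step can be sketched roughly as follows. This is a minimal illustration, not the production system: it uses a plain dictionary instead of B-trees and graphs, and the function name `index_files` and the `INDEXED_EXTENSIONS` set are assumptions made for the example.

```python
import os
from collections import defaultdict

# Hypothetical set of extensions worth indexing; adjust to the real data.
INDEXED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".txt"}


def index_files(root):
    """Walk `root` and group readable files by lower-cased extension."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in INDEXED_EXTENSIONS:
                continue  # skip types that are not useful downstream
            path = os.path.join(dirpath, name)
            if os.access(path, os.R_OK):  # keep only files the user can read
                index[ext].append(path)
    return dict(index)
```

Calling `index_files` on a root path returns a mapping from extension to file paths, which later stages can consume one file type at a time.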
RESULTS (A): SCRAPING THE FILES
The extraction system is selective about the files it collects: it filters out those that the user has no permission to read. It automatically picks up all files of common types, including MS Office and PDF files. The scraping subsystem will be customized to precisely match the client’s data infrastructure architecture.
PROBLEM (B): MULTIPLE MIDDLEWARE: AUTOMATION OF DATA EXTRACTION
The challenge here was to implement a reliable data extraction module for analyzing the files.
SOLUTION (B): MULTIPLE MIDDLEWARE: AUTOMATION OF DATA EXTRACTION
When you look at the files on your filesystem, what do you see?
… a large number of different file types.
How do you deal with them?
Can a single data extraction method be universal and suit most of the popular file types?
The answer is: NO. That’s why we decided to build a fast and reliable system that covers the analysis of most of the popular file types.
What needs to happen for all of this to work properly?
1. Authenticate as a user
2. Fetch all the available files and index them
3. Filter out the data types that are not useful for the recommendations (such as OS files, extensions and so on)
4. Work on those files.
This all happens in the background.
There is a nice way to implement asynchronous middleware. It can be done the redux-multi way: by creating an action that dispatches an array of actions.
It looks like a good starting point, but… it is not. The next problem appears right away. When the initSystem action is called, every action in the array is dispatched immediately. This has to change, otherwise recommendations would be returned based on an incomplete context.
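The problem can be reproduced in a few lines. The sketch below is a Python analogue of dispatching an array of actions, not redux-multi itself; the action names and durations in `INIT_SYSTEM` are hypothetical and only serve the illustration.

```python
import asyncio

# Hypothetical action names and durations (seconds), for illustration only.
INIT_SYSTEM = [("AUTHENTICATE", 0.05), ("FETCH_FILES", 0.03), ("FILTER_FILES", 0.01)]


async def handle(action_type, duration, started, finished):
    started.append(action_type)
    await asyncio.sleep(duration)  # stands in for real asynchronous work
    finished.append(action_type)


async def dispatch_array(actions):
    """Dispatch every action at once, without waiting for earlier ones."""
    started, finished = [], []
    tasks = [
        asyncio.create_task(handle(action_type, duration, started, finished))
        for action_type, duration in actions
    ]
    await asyncio.sleep(0)  # yield once so every handler gets to start
    started_early, finished_early = list(started), list(finished)
    await asyncio.gather(*tasks)
    return started_early, finished_early, finished
```

Running `asyncio.run(dispatch_array(INIT_SYSTEM))` shows that all three handlers start before any of them has finished, and that they finish in order of duration rather than in the order they were dispatched. A later step can therefore run against data that an earlier step has not produced yet.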
To avoid this problem when using multiple middleware, it is important to make sure they complete in order:
1. Authenticate the user
2. Fetch all the available files and index them
3. Filter out the data types that are not useful for giving the recommendations
4. Analyze and summarize the files and get recommendations based on the file context and the user context
To make it behave like a sequence, each middleware waits until the previous one returns a status indicating that its action has completed. For example, when working with Python, you can do it this way:
if action.type == 'FETCH_USER_CONTEXT' and action.status == 200:
    pass  # previous step completed: let the next middleware run
elif action.type == 'FETCH_USER_CONTEXT' and action.status == 400:
    pass  # previous step failed: do not continue the sequence
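A slightly fuller sketch of the same idea is shown below. It assumes each step is a callable that returns an action with a `type` and a `status`, as in the condition above; the `Action` class and the `run_in_order` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Action:
    type: str
    status: int


def run_in_order(steps):
    """Run steps one by one and stop at the first that does not report 200.

    Returns the types of the completed steps and the failing action, if any.
    """
    completed = []
    for step in steps:
        action = step()
        if action.status != 200:
            return completed, action  # later steps never run
        completed.append(action.type)
    return completed, None
```

Because each step is called only after the previous one has reported status 200, the analysis step never sees an incomplete context.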
RESULTS (B): MULTIPLE MIDDLEWARE: AUTOMATION OF DATA EXTRACTION
By implementing this kind of sequencing within the middleware, we were able to organize it and ensure that the whole sequence runs in the correct order. We always choose the best, fastest and most complete solutions for our clients. We also work only to the highest standard, which is why we made sure everything is well separated, so we can start working on parallelizing tasks.
PROBLEM (C): PDF PARSING
The challenge here was to verify the extraction of information from PDF files.
A PDF file can contain at least two types of data, and each has to be parsed in a very different way.
SOLUTION (C): PDF PARSING
When working with PDF files, you can see that they are created in various ways.
Some of them contain text, others contain images that need to be summarized and (if needed) OCRed. We built a subsystem that checks specific flags and, if applicable, adds the PDF file (or a specific range) to the OCR queue.
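One way to decide which ranges go to the OCR queue is sketched below. It assumes the text of each page has already been extracted by a PDF library, and it uses a simple heuristic (a page with no extractable text needs OCR) in place of the flags mentioned above; the function name `pages_needing_ocr` and the `min_chars` parameter are hypothetical.

```python
def pages_needing_ocr(page_texts, min_chars=1):
    """Return inclusive (start, end) ranges of 0-based page indexes
    whose extracted text is shorter than `min_chars` after stripping."""
    ranges = []
    start = None
    for i, text in enumerate(page_texts):
        needs_ocr = len(text.strip()) < min_chars
        if needs_ocr and start is None:
            start = i  # a new run of image-only pages begins
        elif not needs_ocr and start is not None:
            ranges.append((start, i - 1))  # the run ended on the previous page
            start = None
    if start is not None:
        ranges.append((start, len(page_texts) - 1))  # run reaches the last page
    return ranges
```

Grouping consecutive pages into ranges keeps the number of OCR jobs small when a document mixes typed and scanned sections.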
In the case of OCR, there are times when the modules of your project need to communicate with each other, especially without creating individual API calls. When the modules are well separated from each other (for example, by following the SOLID principles), you can use tokens to access the data at specified stages or to confirm other asynchronous actions that work on the data from previous stages.
A good way to implement this is to use message queues. Thankfully, there are good tools for this exact problem, such as RabbitMQ, ZeroMQ, Apache ActiveMQ or Celery (a task queue that runs on top of a message broker).
Using messaging is a fast, useful, and lightweight way to move information.
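The pattern itself can be tried without installing a broker. The sketch below uses Python's standard `queue` and `threading` modules as an in-process stand-in for tools like RabbitMQ; the job format `(path, first_page, last_page)`, the file names and the placeholder OCR result are assumptions made for the example.

```python
import queue
import threading


def ocr_worker(jobs, results):
    """Consume OCR jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work
            break
        path, first_page, last_page = job
        # Placeholder for a real OCR call on the given page range.
        results.put((path, f"ocr text for pages {first_page}-{last_page}"))


def run_demo():
    jobs, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=ocr_worker, args=(jobs, results))
    worker.start()
    jobs.put(("report.pdf", 1, 2))  # hypothetical file names
    jobs.put(("scan.pdf", 0, 4))
    jobs.put(None)
    worker.join()
    return [results.get() for _ in range(results.qsize())]
```

The producer only knows the queue, not the worker, so the OCR module can later be moved to another process or machine by swapping the in-process queue for a real broker.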
RESULTS (C): PDF PARSING
We know that in today’s environment an app has to support multiple file types out of the box, whether office files, PDFs or others.
Starting with general middleware and moving on to more specific ones is one way to keep the code high quality and free of redundancy.
We know that for businesses a short initial time-to-market is crucial and brings many benefits. That’s why we built a fast, complete and easy-to-use solution for our client.