How to Easily Import Data from Word Documents into Your App: A Complete Guide

How to Easily Import Data from Word Documents into Your App: A Complete Guide

Programmatically extract text data from Word files

Introduction

Recently, I was involved in data migration for a client. The data mainly consists of exam questions and their explanations. The data was structured in (.xlsx) format but there was one problem with the content of the data.

Some of the questions included mathematical equations which was a problem for us as it could not be saved as a text format in the cell of the Excel document. Usually, the equations are added in shape format which was difficult to read programmatically.

Some of the equations were very complex e.g.

$$\displaystyle P_\lambda = \frac{2 \pi h c^2}{\lambda^5 \left(e^{\left(\frac{h c}{\lambda k T}\right)} - 1\right)}$$

So, instead of saving the mathematical questions in Excel format, they used DOC format which was way easy as compared to adding equations in an Excel format.

There were around 1000 mathematical questions that included equations in it. One way was to copy/paste manually and the other way was to do it programmatically.

Being a software developer I preferred the second way and you will find that in the next few minutes how I was able to import the data from a Word document, but before that, we should understand the importance of importing and exporting data.

Why Import/export data?

Import and export of data play a significant role in today's world. Almost all businesses require a set of data to formulate growth strategies and enhance operations, analysis, and decision-making.

The key benefits of importing and exporting data are

  • Enhanced decision making

  • Data sharing and Collaboration

  • Data Visualization and Reporting

  • Accuracy and Reliability

  • and many more...

The most common formats to import data are CSV, XML, or JSON as they ensure compatibility across different systems and platforms.

However today we will discuss a different format i.e. Doc files or Word Documents which is mainly a word processing document format. DOC files are used to store data such as formatted text, images, tables, and charts. The most common DOC formats widely used are .doc or .docx and they are merely not just a text file but larger than that.

Setting up the environment

We will be using a Python library python-docx to read and write DOC files. This library works well with the .docx format. If you are having .doc format, you might first need to convert them to .docx formats either by using Microsoft Word or some conversion tool.

One good thing that was done was the Doc file was formatted properly since there were 1000 questions to be imported, and each question with its details was added to a separate table. This not only helped organize the document but also did help in importing the document.

Below is the sample image of the Word document which was used to be imported.

A table with two columns labeled "Question" and "Solution" under the heading "No". The table contains five rows of mathematical questions and their corresponding solutions.

Install python-docx

python-docx is a library for reading, creating, and updating (.docx) files. Let's first install the library.

pip install python-docx

The whole task of extracting data from a Word document was to

  1. Read the Document i.e. use the library python-docx to extract text from the document.

  2. Parse the extracted text into structured JSON

  3. Generate JSON object.

Extracting Data

Read the Document

Below is an example of how to open a document file and read its content.

import docx

# Load the document
doc = docx.Document('sample.docx')

Structure of the Document

The basic structure of the document was in a tabular format, and it looks something like below.

NoQuestionSolution
1cos -640° =?cos 80°
2\= ?½ log 2
3If tan2 x + (1 - ) tan x - = 0 then x isn - /4 or n + /3
4The distance between the foci of the ellipse 4x2+5y2\=20 is2
5If the two circles x2+y2+2gx+2fy\=0 and x2 + y2+2g1x+2f1y\=0f/f1\=g/g1

Now based on the above structure we can easily identify that the first row is the header row and the rest of the rows are the data rows. In the next part, we will parse the data based on the tabular format.

💡
Keep in mind that if your mathematical equations are saved as plain text, they will appear as a regular string. However, if they are stored as LaTeX content, they will display exactly as they were saved.

Parse the Document

Let's parse the data based on the structure. Since our data is stored in tabular format, it is good to make sure that the count of tables in the document is 1 and not more than that.

# Count the No of Tables in the document.
table_count = len(doc.tables)
print("Number of tables in the document:", table_count)

In case it is more than 1 then you need to make sure about the structure of the table and whether it is to be extracted or not.

Let's extract the questions from the table, assuming we have multiple tables and the table to be used is at the index 0.

# Extract questions from the document
questions = extract_questions_from_table(doc)
print(questions)

def extract_questions_from_table(doc):
    return [
        {
            "index": row.cells[0].text,
            "question": row.cells[1].text,
            "solution": row.cells[2].text,
        }
        for row in doc.tables[0].rows[1:]
    ]

In the above code, we are using nested list comprehension. We can extract the data from the table. The outer loop for the table in doc.tables iterates over each table in the document, and the inner loop for row in table.rows[1:] iterates over each row in the current table, starting from the second row.

For each row, a dictionary is created with the text from the first three cells and added to the list.

The output should be something like this

[{'index': '1', 'question': 'cos -640° = ?', 'solution': 'cos 80°'}, 
{'index': '2', 'question': '= ?', 'solution': '½ log 2'}, 
{'index': '3', 'question': 'If  tan2 x + (1 - ) tan x -  = 0 then x is', 'solution': 'n - /4  or n + /3'}, 
{'index': '4', 'question': 'The distance between the foci of the ellipse 4x2+5y2=20 is', 'solution': '2'}, 
{'index': '5', 'question': 'If the two circles x2+y2+2gx+2fy=0 and x2 + y2+2g1x+2f1y=0', 'solution': 'f/f1=g/g1'}]

Generate the JSON from extracted data

Next, we can create the JSON format from the output we received earlier, by using the below code. Make sure the native module json is imported initially.

# Generate JSON based on the output response.
import json

def generate_json(questions, filename):
    questionnaires = []
    for question in questions:
        questionnaire = {
            "index": question["index"],
            "title": question["question"],
            "explanation": question["solution"],
        }
        questionnaires.append(questionnaire)
    with open(filename, "w") as f:
        json.dump(questionnaires, f, indent=4)

# Generate JSON File
jsonfile = generate_json(questions, "sample.json")

And voilà! We have successfully extracted the data from a Word document into JSON format. It is important to note that the more structured the data is, the easier it will be to extract.

The above data can be easily stored in any type of persistent storage, such as RDBMS or NoSQL databases.

Conclusion

That's it for today, and congratulations to everyone who has followed this blog! You've successfully imported data from a structured Word document into JSON format. Awesome job! 🎉

I hope you have learned something new, just as I did. If you enjoyed this article, please like and share it. Also, follow me to read more exciting articles. You can check out my social links here.