Welcome to Fuzzy Table Extractor’s documentation!

Fuzzy Table Extractor is a library for extracting tables from documents. The current version supports only Microsoft Word documents; support for PDF, web pages, and other document types is planned.

What can Fuzzy Table Extractor do?

It finds tables in documents based on a target header: it searches all tables in the document and returns the one whose header is most similar to the target, using a fuzzy search algorithm.
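To make the idea of "most similar header" concrete, here is a minimal sketch of fuzzy header matching. It is not the library's implementation: the scoring function (the standard library's difflib.SequenceMatcher), the averaging rule, and the helper names header_similarity and closest_table are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def header_similarity(target, candidate):
    """Score a candidate header against the target header (0.0 to 1.0).

    Assumption: each target field is scored by its best match among the
    candidate's columns, and the scores are averaged.
    """
    def ratio(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    return sum(max(ratio(t, c) for c in candidate) for t in target) / len(target)


def closest_table(target, tables):
    """Return the table whose first row (the header) best matches the target."""
    return max(tables, key=lambda table: header_similarity(target, table[0]))


# Two hypothetical tables, each a list of rows with the header first
tables = [
    [["product", "price"], ["Pen", "1.50"]],
    [["Name", "Age (years)"], ["Paul", "25"], ["John", "32"]],
]

best = closest_table(["name", "age"], tables)
print(best[0])
```

In this sketch the second table is selected even though its header does not match the target exactly, which is the behaviour a fuzzy search is meant to provide.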

It’s available on PyPI and can be installed with pip:

pip install fuzzy-table-extractor

To extract a table from a simple document like this:

_images/basic_document.png

We need just a few lines of code:

from pathlib import Path

from fuzzy_table_extractor.handlers.docx_handler import DocxHandler
from fuzzy_table_extractor.extractor import Extractor

# Path to the Word document to read
file_path = Path("path_to_document.docx")
handler = DocxHandler(file_path)

# Find the table whose header is closest to ["name", "age"]
extractor = Extractor(handler)
df = extractor.extract_closest_table(["name", "age"])
print("This is the extraction result for a very simple document:")
print(df)

The result is a pandas DataFrame with the following content:

   name age
0  Paul  25
1  John  32

Note that the returned DataFrame contains only the columns specified as search headers. For each search header, the closest column in the original table is selected and renamed to the value passed to the function.
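The select-and-rename behaviour can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the similarity measure (difflib.SequenceMatcher), the helper name select_and_rename, and the sample header and rows are hypothetical, and the library itself returns a pandas DataFrame rather than a list of dictionaries.

```python
from difflib import SequenceMatcher


def select_and_rename(header, rows, search_headers):
    """For each search header, keep the closest original column under the search name."""
    def ratio(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Index of the best-matching original column for each search header
    picks = {
        wanted: max(range(len(header)), key=lambda i: ratio(wanted, header[i]))
        for wanted in search_headers
    }
    # Build one record per row, keyed by the search headers only
    return [{wanted: row[i] for wanted, i in picks.items()} for row in rows]


# Hypothetical original table with an extra column and differently named headers
header = ["Full Name", "Age (years)", "City"]
rows = [["Paul", "25", "Lisbon"], ["John", "32", "Porto"]]

records = select_and_rename(header, rows, ["name", "age"])
print(records)
```

The "City" column is dropped because no search header asks for it, and the two remaining columns take the names "name" and "age" regardless of how they were labelled in the original table.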

Note

This is a development version of Fuzzy Table Extractor.

Contents