Welcome to Fuzzy Table Extractor’s documentation!¶
Fuzzy Table Extractor is a library for extracting tables from documents. In the current version only Microsoft Word documents are supported, but in the future the library will be expanded to cover pdf, web and other types of documents.
What can Fuzzy Table Extractor do?¶
It can finds tables in documents based on a target header, searching all tables in document and returing the one that has the most similar header, using a fuzzy search algorithm.
It’s available on PyPI, so the installation can be done with pip:
pip install fuzzy-table-extractor
To extract a table from a simple document like this:
We need just a few lines of code:
from pathlib import Path
from fuzzy_table_extractor.handlers.docx_handler import DocxHandler
from fuzzy_table_extractor.extractor import Extractor, FieldOrientation
path = r"path_to_document.docx"
file_path = Path(path)
handler = DocxHandler(file_path)
extractor = Extractor(handler)
df = extractor.extract_closest_table(["name", "age"])
print("This is the result extraction of a very simple document:")
print(df)
And the result is a pandas dataframe with the following content:
name age
0 Paul 25
1 John 32
Note that the dataframe returned only has the coluns specified as search headers. The closest column in the original table is selected and renamed to value passed to the function.
Note
This is a development version of Fuzzy Table Extractor.