Extractor

class fuzzy_table_extractor.extractor.Extractor(doc_handler: fuzzy_table_extractor.doc_handlers.BaseHandler)

The Extractor class provides the functions that extract data from tables in a document. It receives a document handler on initialization. Because this handler follows a common interface, the extraction is agnostic of the document type.

extract_closest_table(search_headers: typing.List[str], validation_funtion: typing.Callable[[typing.List[str]], bool] = <function Extractor.<lambda>>, minimum_proximity_ratio: float = 0) → pandas.DataFrame

Extract the table in the document whose header is closest to search_headers.

Parameters

search_headers (List[str]) – list of items to search for in the header.
validation_funtion (Callable[[List[str]], bool], optional) – function that decides whether a table is valid. It receives the table header as its argument and must return True if the table is valid. Defaults to lambda x: True.
minimum_proximity_ratio (float, optional) – minimum proximity ratio required to consider a header a match. Value must be between 0 and 100. Defaults to 0.

Returns

best match

Return type

pd.DataFrame
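How the validation function and minimum_proximity_ratio narrow the choice can be sketched without the package. The block below is an illustration under stated assumptions, not the package's implementation: each table is reduced to its header list, `difflib.SequenceMatcher` stands in for whatever fuzzy scorer the package uses, and `header_score`, `pick_closest_header` and `is_valid` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional


def header_score(document_headers: List[str], search_headers: List[str]) -> float:
    """Mean best similarity of each search header, on a 0-100 scale (assumed scoring)."""
    if not document_headers or not search_headers:
        return 0.0
    best = [
        max(SequenceMatcher(None, s.lower(), d.lower()).ratio() for d in document_headers)
        for s in search_headers
    ]
    return 100 * sum(best) / len(best)


def pick_closest_header(
    table_headers: List[List[str]],
    search_headers: List[str],
    is_valid: Callable[[List[str]], bool] = lambda header: True,
    minimum_proximity_ratio: float = 0,
) -> Optional[List[str]]:
    """Return the header that scores highest among valid tables, or None."""
    best_header, best_score = None, -1.0
    for header in table_headers:
        if not is_valid(header):
            continue  # rejected by the caller-supplied validation function
        score = header_score(header, search_headers)
        if score >= minimum_proximity_ratio and score > best_score:
            best_header, best_score = header, score
    return best_header


tables = [["Name", "Age"], ["Product", "Price", "Qty"]]
print(pick_closest_header(tables, ["product", "price"]))
```

In this sketch, raising minimum_proximity_ratio makes the function return None when no valid table is close enough, rather than a poor match.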

extract_single_field(field: str, orientation: fuzzy_table_extractor.extractor.FieldOrientation, regex: List[str] = [''], title_regex: List[str] = [''], return_multiple: bool = False) → str

Extract a single field from a Word document based on an input string. The data is extracted from the tables in the document.

Parameters

field (str) – search field
orientation (FieldOrientation) – orientation to search the content of field
regex (List[str], optional) – list of regexes to apply to the content. The content is valid only if at least one regex in the list matches. Defaults to [''].
title_regex (List[str], optional) – list of regexes to apply to the title. The title is valid only if at least one regex in the list matches. Defaults to [''].
return_multiple (bool, optional) – if True, returns all matches that have the same proximity ratio. Defaults to False.

Returns

best match

Return type

str
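The "at least one match" rule for regex and title_regex can be illustrated with the standard `re` module. This is a minimal sketch that assumes search semantics (a match anywhere in the text); the package may apply its patterns differently, and `matches_any` is a hypothetical helper name.

```python
import re
from typing import List


def matches_any(patterns: List[str], text: str) -> bool:
    """Return True if at least one pattern is found somewhere in text."""
    return any(re.search(pattern, text) for pattern in patterns)


# The empty pattern matches every string, so a list like [''] filters nothing.
print(matches_any([""], "Invoice 2021-03-15"))  # True
# A date-like pattern accepts only content that contains such a date.
print(matches_any([r"\d{4}-\d{2}-\d{2}"], "Invoice 2021-03-15"))  # True
print(matches_any([r"\d{4}-\d{2}-\d{2}"], "Invoice pending"))  # False
```

Under this assumption, the default of [''] accepts any content or title, and supplying stricter patterns is what turns the filter on.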

static get_columns_fuzzy(df: pandas.DataFrame, columns: List[str], threshold=0) → pandas.DataFrame

Get the columns that most closely match the supplied column names. The selected columns are renamed to the names they were matched with.

Parameters

df (pd.DataFrame) – dataframe to search columns
columns (List[str]) – columns to search
threshold (int, optional) – minimum score to consider a match. Defaults to 0.

Returns

dataframe with the columns that match

Return type

pd.DataFrame
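The select-and-rename behaviour can be sketched with plain Python containers. The example below is an assumption-based illustration, not the package's code: a dict of column lists stands in for the DataFrame, `difflib.SequenceMatcher` stands in for the package's fuzzy scorer, and `similarity` and `pick_columns_fuzzy` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (assumed scoring)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_columns_fuzzy(
    table: Dict[str, list], columns: List[str], threshold: float = 0
) -> Dict[str, list]:
    """Keep the closest existing column for each requested name, renamed to that name."""
    result: Dict[str, list] = {}
    if not table:
        return result
    for wanted in columns:
        best = max(table, key=lambda existing: similarity(wanted, existing))
        if similarity(wanted, best) >= threshold:
            result[wanted] = table[best]  # stored under the requested name
    return result


table = {"Product Name": ["bolt", "nut"], "Unit price": [0.5, 0.1], "Notes": ["", ""]}
print(pick_columns_fuzzy(table, ["product name", "unit price"], threshold=80))
```

With a threshold of 0, every requested name is paired with some column in this sketch, however poor the match, which is why a non-zero threshold is usually the safer choice.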

static headers_association(document_headers: List[str], search_headers: List[str]) → List[fuzzy_table_extractor.extractor.TableMatch]

Determine the best association between two sets of headers.

Parameters

document_headers (List[str]) – headers in document
search_headers (List[str]) – search headers

Returns

list of table headers matches

Return type

List[TableMatch]
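One way to associate two header lists is a greedy one-to-one pairing by descending similarity. The sketch below shows that idea only; it is not claimed to be the package's algorithm. `HeaderMatch` is a hypothetical stand-in for TableMatch (whose fields are not documented here), and `difflib.SequenceMatcher` is an assumed scorer.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple


class HeaderMatch(NamedTuple):
    """Hypothetical record pairing a search header with a document header."""
    search_header: str
    document_header: str
    score: float


def associate_headers(
    document_headers: List[str], search_headers: List[str]
) -> List[HeaderMatch]:
    """Greedily pair headers, best score first, using each header at most once."""
    pairs = [
        HeaderMatch(s, d, 100 * SequenceMatcher(None, s.lower(), d.lower()).ratio())
        for s in search_headers
        for d in document_headers
    ]
    pairs.sort(key=lambda match: match.score, reverse=True)
    used_search, used_document, matches = set(), set(), []
    for match in pairs:
        if match.search_header in used_search or match.document_header in used_document:
            continue  # one side of this pair is already taken
        used_search.add(match.search_header)
        used_document.add(match.document_header)
        matches.append(match)
    return matches


for match in associate_headers(["Qty", "Unit Price", "Product"], ["product", "price"]):
    print(match)
```

A greedy pairing is simple but not guaranteed to maximise the total score; an optimal assignment method would be needed for that.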

static headers_proximity_ratio(document_headers: List[str], search_headers: List[str]) → int

Calculate the proximity ratio between two sets of headers.

Parameters

document_headers (List[str]) – headers in document
search_headers (List[str]) – search headers

Returns

proximity ratio

Return type

int
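A proximity ratio of this kind can be thought of as a single integer score between 0 and 100. The sketch below assumes one plausible definition, the mean of each search header's best similarity, rounded to an integer; the package's actual formula may differ, and `proximity_ratio` is a hypothetical name built on `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from typing import List


def proximity_ratio(document_headers: List[str], search_headers: List[str]) -> int:
    """Mean of each search header's best similarity, rounded to an integer in [0, 100]."""
    if not document_headers or not search_headers:
        return 0
    best_scores = [
        max(SequenceMatcher(None, s.lower(), d.lower()).ratio() for d in document_headers)
        for s in search_headers
    ]
    return round(100 * sum(best_scores) / len(best_scores))


print(proximity_ratio(["Product", "Price"], ["product", "price"]))  # 100
```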