Extractor
- class fuzzy_table_extractor.extractor.Extractor(doc_handler: fuzzy_table_extractor.doc_handlers.BaseHandler)
The Extractor class provides the functions that extract data from tables in a document. It receives a document handler on initialization. Because this handler follows a common interface, the extraction is agnostic of the document type.
- extract_closest_table(search_headers: typing.List[str], validation_funtion: typing.Callable[[typing.List[str]], bool] = <function Extractor.<lambda>>, minimum_proximity_ratio: float = 0) -> pandas.DataFrame
Extract the table in the document whose header is closest to search_headers.
- Parameters
search_headers (List[str]) – list of items to search for in the header.
validation_funtion (Callable[[List[str]], bool], optional) – function that decides whether a table is valid. It receives the table header as its argument and must return True if the table is valid. Defaults to lambda x: True.
minimum_proximity_ratio (float, optional) – minimum proximity ratio required to consider a header a match. The value must be between 0 and 100. Defaults to 0.
- Returns
best match
- Return type
pd.DataFrame
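The selection rule described above can be sketched without the library. This is a minimal illustration, not the package's implementation: it assumes headers are compared with the standard library's `difflib`, that the score is the mean of each search header's best match, and that `None` is returned when no table qualifies. The names `header_score` and `closest_table` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional


def header_score(document_headers: List[str], search_headers: List[str]) -> float:
    """Hypothetical score in [0, 100]: mean of each search header's best match."""
    if not document_headers or not search_headers:
        return 0.0
    best = [
        max(SequenceMatcher(None, s.lower(), d.lower()).ratio() for d in document_headers)
        for s in search_headers
    ]
    return 100 * sum(best) / len(best)


def closest_table(
    tables: Dict[str, List[str]],
    search_headers: List[str],
    validation_function: Callable[[List[str]], bool] = lambda header: True,
    minimum_proximity_ratio: float = 0,
) -> Optional[str]:
    """Return the name of the best-scoring valid table, or None if none qualifies."""
    # Discard tables rejected by the validation function, then score the rest.
    candidates = [
        (header_score(header, search_headers), name)
        for name, header in tables.items()
        if validation_function(header)
    ]
    # Discard tables whose score is below the minimum ratio.
    candidates = [c for c in candidates if c[0] >= minimum_proximity_ratio]
    return max(candidates)[1] if candidates else None


# Assumed sample data: table name -> header row.
tables = {
    "people": ["Name", "Age", "City"],
    "orders": ["Order id", "Product", "Price"],
}
print(closest_table(tables, ["name", "age"]))  # people
```

Passing a validation function such as `lambda header: "Price" in header` restricts the candidates before scoring, which is the role the validation_funtion parameter plays in the signature above.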
- extract_single_field(field: str, orientation: fuzzy_table_extractor.extractor.FieldOrientation, regex: List[str] = [''], title_regex: List[str] = [''], return_multiple: bool = False) -> str
Extract a single field from a Word document based on an input string. The data is extracted from the tables in the document.
- Parameters
field (str) – search field
orientation (FieldOrientation) – orientation to search the content of field
regex (List[str], optional) – list of regular expressions to apply to the content. For content to be valid, at least one regex in the list must match. Defaults to [''].
title_regex (List[str], optional) – list of regular expressions to apply to the title. For a title to be valid, at least one regex in the list must match. Defaults to [''].
return_multiple (bool, optional) – if True, return all matches that have the same proximity ratio. Defaults to False.
- Returns
best match
- Return type
str
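The "at least one regex must match" rule for regex and title_regex can be illustrated in isolation. The sketch below assumes `re.search` semantics (a match anywhere in the text); the source does not say which `re` function the library uses, and `matches_any` is a hypothetical helper name.

```python
import re
from typing import List


def matches_any(text: str, patterns: List[str]) -> bool:
    """True if at least one pattern is found somewhere in text."""
    return any(re.search(pattern, text) for pattern in patterns)


date = r"\d{4}-\d{2}-\d{2}"
print(matches_any("2021-03-15", [date]))           # True
print(matches_any("pending", [date, r"^\d+$"]))    # False
print(matches_any("pending", [""]))                # True
```

The last call shows why the default of `['']` accepts everything: an empty pattern matches any string.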
- static get_columns_fuzzy(df: pandas.DataFrame, columns: List[str], threshold=0) -> pandas.DataFrame
Get the columns that have the closest match to the supplied column names. The columns will be renamed to match the closest column name.
- Parameters
df (pd.DataFrame) – dataframe to search columns
columns (List[str]) – columns to search
threshold (int, optional) – minimum score to consider a match. Defaults to 0.
- Returns
dataframe with the columns that match
- Return type
pd.DataFrame
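As a rough picture of fuzzy column selection, the sketch below works on a plain dictionary of columns rather than a dataframe so that it runs with the standard library alone. It assumes `difflib` similarity scaled to 0-100 and assumes the kept columns are renamed to the requested names; neither detail is confirmed by the text above, and `pick_columns` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def pick_columns(
    table: Dict[str, list], columns: List[str], threshold: float = 0
) -> Dict[str, list]:
    """For each requested name, keep the closest table column under that name."""
    result: Dict[str, list] = {}
    if not table:
        return result
    for wanted in columns:
        # Score every existing column against the requested name (0 to 100).
        scored = [
            (100 * SequenceMatcher(None, wanted.lower(), have.lower()).ratio(), have)
            for have in table
        ]
        score, have = max(scored)
        # Keep the best column only if it reaches the threshold.
        if score >= threshold:
            result[wanted] = table[have]
    return result


# Assumed sample data: column name -> values.
table = {"Full Name": ["Ana", "Bo"], "Age (years)": [31, 44], "Notes": ["", "x"]}
selected = pick_columns(table, ["name", "age"])
print(list(selected))
```

Raising `threshold` drops requested names whose best candidate is too dissimilar, which mirrors the role of the threshold parameter described above.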
- static headers_association(document_headers: List[str], search_headers: List[str]) -> List[fuzzy_table_extractor.extractor.TableMatch]
Determine the best association between two headers.
- Parameters
document_headers (List[str]) – headers in document
search_headers (List[str]) – search headers
- Returns
list of table headers matches
- Return type
List[TableMatch]
- static headers_proximity_ratio(document_headers: List[str], search_headers: List[str]) -> int
Calculate the proximity ratio between two headers.
- Parameters
document_headers (List[str]) – headers in document
search_headers (List[str]) – search headers
- Returns
proximity ratio
- Return type
int
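To make the relationship between header association and the proximity ratio concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the library's algorithm: similarity comes from `difflib`, each search header is paired independently with its best document header, and the ratio is the integer mean of the pair scores. `Match` is a hypothetical stand-in for TableMatch, whose actual fields are not listed in this page.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple


class Match(NamedTuple):
    """Hypothetical stand-in for TableMatch."""
    search_term: str
    document_term: str
    score: int


def associate(document_headers: List[str], search_headers: List[str]) -> List[Match]:
    """Pair each search header with its most similar document header."""
    if not document_headers:
        return []
    matches = []
    for search in search_headers:
        scored = [
            (round(100 * SequenceMatcher(None, search.lower(), doc.lower()).ratio()), doc)
            for doc in document_headers
        ]
        score, doc = max(scored)
        matches.append(Match(search, doc, score))
    return matches


def proximity_ratio(document_headers: List[str], search_headers: List[str]) -> int:
    """Integer mean of the pair scores, or 0 when there is nothing to compare."""
    matches = associate(document_headers, search_headers)
    if not matches:
        return 0
    return sum(m.score for m in matches) // len(matches)


print(associate(["Name", "Age", "City"], ["name", "age"]))
print(proximity_ratio(["Name", "Age", "City"], ["name", "age"]))  # 100
```

Other aggregations, such as the minimum pair score, would be stricter about a single poorly matched header; the mean is used here only to keep the example simple.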