Extractor

class fuzzy_table_extractor.extractor.Extractor(doc_handler: fuzzy_table_extractor.doc_handlers.BaseHandler)

The Extractor class provides the functions to extract data from tables in a document. It receives a document handler on initialization. Because this handler follows a common interface, the extraction is agnostic of the document type.

extract_closest_table(search_headers: typing.List[str], validation_funtion: typing.Callable[[typing.List[str]], bool] = <function Extractor.<lambda>>, minimum_proximity_ratio: float = 0) → pandas.DataFrame

Extract the table in the document whose header is closest to search_headers.

Parameters
  • search_headers (List[str]) – list of items to search for in the header.

  • validation_funtion (Callable[[List[str]], bool], optional) – function that decides whether a table is valid. It receives the table header as its argument and must return True if the table is valid. Defaults to lambda x: True.

  • minimum_proximity_ratio (float, optional) – minimum proximity ratio required to consider a header a match. Value must be between 0 and 100. Defaults to 0.

Returns

the best-matching table

Return type

pd.DataFrame
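As an illustration of the validation_funtion parameter described above, here is a minimal sketch of a callable with the documented shape (it takes the table header as a list of strings and returns a bool). The required column names and the function name has_required_columns are hypothetical, and the commented-out call assumes an already-constructed Extractor named extractor.

```python
from typing import List

# Hypothetical column names that a table must contain to be accepted.
REQUIRED = {"id", "description"}


def has_required_columns(header: List[str]) -> bool:
    """Return True when every required name appears in the table header."""
    normalized = {cell.strip().lower() for cell in header}
    return REQUIRED.issubset(normalized)


# Hypothetical call, assuming `extractor` is an existing Extractor instance:
# df = extractor.extract_closest_table(
#     search_headers=["id", "description", "value"],
#     validation_funtion=has_required_columns,
#     minimum_proximity_ratio=80,
# )
```

A function like this lets a caller reject tables whose header is a close fuzzy match but lacks a column the caller cannot do without.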

extract_single_field(field: str, orientation: fuzzy_table_extractor.extractor.FieldOrientation, regex: List[str] = [''], title_regex: List[str] = [''], return_multiple: bool = False) → str

Extract a single field of a Word document based on an input string. The data is extracted from tables in the document.

Parameters
  • field (str) – field to search for.

  • orientation (FieldOrientation) – orientation in which to search for the content of the field.

  • regex (List[str], optional) – list of regular expressions to apply to the content. For the content to be valid, at least one regular expression in the list must match. Defaults to [''].

  • title_regex (List[str], optional) – list of regular expressions to apply to the title. For the title to be valid, at least one regular expression in the list must match. Defaults to [''].

  • return_multiple (bool, optional) – if True, return all matches that have the same proximity ratio. Defaults to False.

Returns

the best-matching field content

Return type

str
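The "at least one match" rule for regex and title_regex can be sketched with the standard re module. This is an illustration only: the helper name matches_any is hypothetical, and the use of re.search (rather than re.match or re.fullmatch) is an assumption, since the matching mode is not stated here.

```python
import re
from typing import List


def matches_any(text: str, patterns: List[str]) -> bool:
    """Return True if at least one pattern in the list matches the text."""
    return any(re.search(pattern, text) for pattern in patterns)


# The empty pattern matches any string, so a list like [''] accepts everything.
accepts_all = matches_any("any content", [""])

# A stricter list only accepts content containing a digit sequence.
has_number = matches_any("Rev. 12", [r"\d+"])
no_number = matches_any("n/a", [r"\d+"])
```

Under this reading, the default value [''] acts as "no filter", because an empty regular expression matches every string.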

static get_columns_fuzzy(df: pandas.DataFrame, columns: List[str], threshold=0) → pandas.DataFrame

Get the columns that have the closest match with the supplied column names. The columns will be renamed to match the closest column name.

Parameters
  • df (pd.DataFrame) – dataframe to search columns

  • columns (List[str]) – columns to search

  • threshold (int, optional) – minimum score to consider a match. Defaults to 0.

Returns

dataframe with the columns that match

Return type

pd.DataFrame
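The idea of picking the closest column for each supplied name, subject to a threshold, can be sketched with the standard library. This is not the library's implementation: difflib.SequenceMatcher is used here as a stand-in scorer, and the names score and closest_columns are hypothetical. The sketch works on plain lists of names rather than on a dataframe.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def score(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def closest_columns(
    available: List[str], wanted: List[str], threshold: int = 0
) -> Dict[str, str]:
    """Map each best-matching available column to the wanted name."""
    mapping: Dict[str, str] = {}
    if not available:
        return mapping
    for name in wanted:
        best = max(available, key=lambda col: score(col, name))
        # Keep the match only if it reaches the minimum score.
        if score(best, name) >= threshold:
            mapping[best] = name
    return mapping


columns_in_table = ["Item ID", "Descriptin", "Qty"]
mapping = closest_columns(columns_in_table, ["id", "description"])
strict_mapping = closest_columns(columns_in_table, ["id", "description"], threshold=60)
```

A mapping of this shape could then be applied to a dataframe with pandas' DataFrame.rename(columns=mapping), followed by selecting the renamed columns.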

static headers_association(document_headers: List[str], search_headers: List[str]) → List[fuzzy_table_extractor.extractor.TableMatch]

Determine the best association between two headers.

Parameters
  • document_headers (List[str]) – headers in document

  • search_headers (List[str]) – search headers

Returns

list of table headers matches

Return type

List[TableMatch]

static headers_proximity_ratio(document_headers: List[str], search_headers: List[str]) → int

Calculate the proximity ratio between two headers.

Parameters
  • document_headers (List[str]) – headers in document

  • search_headers (List[str]) – search headers

Returns

proximity ratio

Return type

int
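To make the notion of a proximity ratio between two headers concrete, here is a rough sketch under stated assumptions. The scoring formula used by headers_proximity_ratio is not given here, so this example substitutes a simple scheme: each search header is scored against its most similar document header with difflib.SequenceMatcher, and the scores are averaged into an integer between 0 and 100. The names similarity and proximity_ratio are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def proximity_ratio(document_headers: List[str], search_headers: List[str]) -> int:
    """Average, over the search headers, of the best score in the document."""
    if not document_headers or not search_headers:
        return 0
    best_scores = [
        max(similarity(doc, wanted) for doc in document_headers)
        for wanted in search_headers
    ]
    return round(sum(best_scores) / len(best_scores))


exact = proximity_ratio(["ID", "Name"], ["id", "name"])
unrelated = proximity_ratio(["foo"], ["bar"])
```

With a score on this scale, a caller can compare it against a cutoff such as minimum_proximity_ratio to decide whether a table counts as a match at all.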