The table selector handles table recognition and extraction. It covers table extraction on its own, so you don't need to add any other selectors to the parsing flow for it to work.

Automatic table mode

In automatic mode, the table selector detects the table structure automatically when a portion of a table lies inside the data field's region.

In automatic mode, three properties let you filter the extracted data:

  1. Select row
    Defines the rows to extract by their indices, counted from top to bottom. Indices start at 1. If a table header row is present, it is indexed as well.
    You can select a single index or a range. A negative number counts backwards from the end, so -1 is the last row of the table.
    For instance, the 2:-2 range extracts all table rows except the first and the last.
  2. Select column
    Defines the columns to extract by their indices, counted from left to right. Indices start at 1.
    You can select a single column by its index or by its name (which is the same as its header), or a continuous range (start:end).
    A negative number counts backwards from the end, so -1 is the rightmost column of the table.
    For instance, the 2:-2 range extracts all table columns except the leftmost and the rightmost.
  3. Column for table building
    When some table rows lack horizontal borders or the table contains empty cells, rows may be detected inaccurately.
    In this case, the selector uses the "column for table building" as the main reference to determine the table rows.
    For this parameter, we recommend using the number of a column in which every cell is always filled.
    For invoices, for example, this can be the "Total" column.
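The index rules for "Select row" and "Select column" can be illustrated with a short Python sketch. It is not the selector's own implementation: the function names are hypothetical, and the handling of 0 and out-of-range indices is an assumption, since the text above does not define it.

```python
def to_position(index: int, count: int) -> int:
    """Convert a 1-based or negative index into a 1-based position."""
    if index == 0:
        # Assumption: indices start at 1, so 0 is treated as an error.
        raise ValueError("indices start at 1; 0 is not valid")
    position = index if index > 0 else count + index + 1
    if not 1 <= position <= count:
        # Assumption: out-of-range indices are rejected rather than clamped.
        raise IndexError(f"index {index} is outside a table of size {count}")
    return position


def resolve_selection(spec: str, count: int) -> list[int]:
    """Resolve a selector such as "3", "-1" or "2:-2" to 1-based positions.

    `count` is the number of rows (or columns) in the table.
    Ranges include both ends, as in the 2:-2 example.
    """
    spec = spec.strip()
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
        start = to_position(int(start_text), count)
        end = to_position(int(end_text), count)
        return list(range(start, end + 1))
    return [to_position(int(spec), count)]
```

For a table with five rows, resolve_selection("2:-2", 5) yields rows 2, 3 and 4, that is, everything except the first and the last row.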

Advanced table mode

To extract tables with a complex structure (for example, when table headers aren't static), switch to advanced table mode by deselecting the "Automatic headers" option.



This mode lets you set up headers for the columns you want to extract.
You can do this semi-automatically, by selecting the document area that contains the headers, or fully manually, by typing the column names in the "Headers" parameter.
Type the headers one below the other, starting from the leftmost. If a header consists of two or more lines, concatenate these lines into one, separated by spaces.

You can also specify headers as regular expressions by selecting the "regexp" option for the "Format of header" parameter.
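The two header formats can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and whether the selector requires a regular expression to match the whole header (as done here with re.fullmatch) or only a part of it is not specified in this document.

```python
import re


def normalize_header(lines: list[str]) -> str:
    """Join the lines of a multi-line header into one space-separated string."""
    return " ".join(part.strip() for part in lines if part.strip())


def header_matches(expected: str, actual: str, header_format: str = "simple") -> bool:
    """Compare a configured header with a header found in the document."""
    if header_format == "regexp":
        # Assumption: the pattern must match the entire header text.
        return re.fullmatch(expected, actual) is not None
    # "simple": plain string comparison.
    return expected == actual
```

For example, a header printed as "Unit" above "price" would be entered as "Unit price", and a pattern such as Inv(oice)?\s*No\.? would accept both "Inv No" and "Invoice No." in regexp format.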

Multipage tables

If a table spans more than one page, the selection algorithm selects it as a single table, provided that all table columns have the same width on subsequent pages.
The repeated header and footer (if any) are filtered out from the final results, so that only the first header and the last footer are retained.

The multipage selection algorithm also detects and ignores any page headers or footers.

For better results in the case of multipage tables, we recommend using advanced table mode and specifying all table headers explicitly.
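The effect of filtering repeated headers and footers can be pictured with the Python sketch below. It only models the result described in this section, not the selection algorithm itself. The function name and the data layout (one list of rows per page) are hypothetical, and the sketch assumes that the header row is known and repeats unchanged on every page.

```python
def merge_pages(pages, header, footer=None):
    """Merge per-page rows into one table.

    Keeps a single header at the top and, if a footer is given,
    a single footer at the bottom; repeated copies are dropped.
    """
    body = [
        row
        for rows in pages
        for row in rows
        if row != header and row != footer
    ]
    merged = [header] + body
    if footer is not None:
        merged.append(footer)
    return merged
```

Given two pages that each start with the same header row and end with the same footer row, the merged result contains the header once, then the data rows of both pages in order, then the footer once.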

Expert mode

Expert mode keywords

Automatic table mode

table: fit

If the keyword fit is specified, the table headers and the cell position (or the row and column range, if more than a single cell is selected) are determined automatically from the selected region on the page.

Advanced table mode

table: selectRow=1:2, selectColumn=1, numberOfColumns=2, mainColumn=2, format=simple|regexp

  • numberOfColumns is a mandatory property that specifies the number of columns in the table.
  • selectColumn and selectRow define which columns and rows are added to the output.
    Use positive numbers to reference a column's index from left to right or a row's index from top to bottom, negative numbers to count backwards, and ranges with the start:end syntax.
  • mainColumn is the number of the column used as the main reference to determine the table rows.
  • format specifies how column names are matched. By default (format=simple), standard string comparison is used. With format=regexp, you can specify regular expressions for the column names.
  • If only a single cell (n,m) needs to be extracted from the table, you can use the property selectCell=n;m, which is equivalent to specifying selectRow=n, selectColumn=m.

The keyword is followed by the names of the column headers, each on a separate line.
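To make the layout of the keyword concrete, the Python sketch below splits a keyword block into its properties and header names. It is a reading aid rather than the product's parser: the function name is hypothetical, all values are kept as strings, and the whitespace handling is an assumption.

```python
def parse_table_keyword(text: str):
    """Split a "table:" keyword block into (properties, header names)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    keyword, _, rest = lines[0].partition(":")
    if keyword.strip() != "table":
        raise ValueError("expected a line starting with 'table:'")

    props = {}
    for item in rest.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        # A bare word such as "fit" is stored as a flag.
        props[name.strip()] = value.strip() if sep else True

    # selectCell=n;m is equivalent to selectRow=n, selectColumn=m.
    if "selectCell" in props:
        row, column = props.pop("selectCell").split(";")
        props["selectRow"] = row.strip()
        props["selectColumn"] = column.strip()

    # Every following non-empty line is one column header.
    return props, lines[1:]
```

With this sketch, a block that uses selectCell=2;1 produces the same properties as one that uses selectRow=2, selectColumn=1.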

Output data format: lines or tables

Example

table: selectRow=2, selectColumn=1, numberOfColumns=2, mainColumn=2
DATE
INVOICE

This example selects the contents of cell (2,1) in a 2-column table with the column headers "DATE" and "INVOICE".
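As a rough illustration of what cell (2,1) refers to, the Python sketch below applies the same selection to a small sample table. The table contents and the function name are invented for this illustration. The sketch also assumes that the header row counts as row 1, as stated for automatic mode, so row 2 is the first data row.

```python
# Hypothetical sample table; row 1 is the header row.
table = [
    ["DATE", "INVOICE"],
    ["2024-01-15", "INV-001"],
    ["2024-02-01", "INV-002"],
]


def select_cell(rows, row, column):
    """Return the cell at 1-based (row, column); positive indices only."""
    if row < 1 or column < 1:
        raise ValueError("this sketch supports positive 1-based indices only")
    return rows[row - 1][column - 1]


# selectRow=2, selectColumn=1: second row, first (DATE) column.
selected = select_cell(table, 2, 1)
```

Under these assumptions, selected holds the DATE value of the first data row.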

