Add Page.find_table(...) (#873)
Previously, `pdfplumber.Page` had these table-getting methods:

- `.find_tables(...)`
- `.extract_tables(...)`
- `.extract_table(...)`

For consistency and completeness, this commit adds:

- `.find_table(...)`

... which, analogous to `.extract_table(...)`, returns the largest table
on the page, but as a `Table` object rather than as extracted text.

`.extract_table(...)` now calls `.find_table(...)` under the hood.
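
As a rough usage sketch (not part of the commit): the helper name `describe_largest_table`, its parameters, and the PDF path are hypothetical. Only `pdfplumber.open(...)`, `Page.find_table(...)`, `Table.bbox`, and `Table.extract(...)` come from the library, and the example assumes default table settings.

```python
from typing import Any, List, Optional, Tuple


def describe_largest_table(
    pdf_path: str, page_index: int = 0
) -> Optional[Tuple[Any, List[List[Optional[str]]]]]:
    """Return (bbox, rows) for the largest table on a page, or None if there is none."""
    # Third-party dependency, imported here so the sketch can be loaded
    # even where pdfplumber is not installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]

        # New in this commit: get the Table object itself, not just its text.
        table = page.find_table()
        if table is None:
            return None

        # With default settings this should match page.extract_table(),
        # which now delegates to find_table().
        return table.bbox, table.extract()
```

Calling `describe_largest_table("example.pdf")` (a placeholder path) would give access to the table's bounding box as well as its rows, which `.extract_table(...)` alone does not expose.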

Thanks to @pdille for the suggestion, here:
#864 (reply in thread)
jsvine committed Jul 4, 2023
1 parent 57d51bb commit 3772af6
Showing 2 changed files with 22 additions and 11 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -346,8 +346,9 @@ If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyErr
 | Method | Description |
 |--------|-------------|
 |`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
+|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|
 |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|
-|`.extract_table(table_settings={})`|Returns the text extracted from the *largest* table on the page, represented as a list of lists, with the structure `row -> cell`. (If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.)|
+|`.extract_table(table_settings={})`|Returns the text extracted from the *largest* table on the page (see `.find_table(...)` above), represented as a list of lists, with the structure `row -> cell`.|
 |`.debug_tablefinder(table_settings={})`|Returns an instance of the `TableFinder` class, with access to the `.edges`, `.intersections`, `.cells`, and `.tables` properties.|

For example:
30 changes: 20 additions & 10 deletions pdfplumber/page.py
@@ -307,16 +307,9 @@ def find_tables(
         tset = TableSettings.resolve(table_settings)
         return TableFinder(self, tset).tables

-    def extract_tables(
-        self, table_settings: Optional[T_table_settings] = None
-    ) -> List[List[List[Optional[str]]]]:
-        tset = TableSettings.resolve(table_settings)
-        tables = self.find_tables(tset)
-        return [table.extract(**(tset.text_settings or {})) for table in tables]
-
-    def extract_table(
+    def find_table(
         self, table_settings: Optional[T_table_settings] = None
-    ) -> Optional[List[List[Optional[str]]]]:
+    ) -> Optional[Table]:
         tset = TableSettings.resolve(table_settings)
         tables = self.find_tables(tset)

@@ -329,7 +322,24 @@ def sorter(x: Table) -> Tuple[int, T_num, T_num]:

         largest = list(sorted(tables, key=sorter))[0]

-        return largest.extract(**(tset.text_settings or {}))
+        return largest
+
+    def extract_tables(
+        self, table_settings: Optional[T_table_settings] = None
+    ) -> List[List[List[Optional[str]]]]:
+        tset = TableSettings.resolve(table_settings)
+        tables = self.find_tables(tset)
+        return [table.extract(**(tset.text_settings or {})) for table in tables]
+
+    def extract_table(
+        self, table_settings: Optional[T_table_settings] = None
+    ) -> Optional[List[List[Optional[str]]]]:
+        tset = TableSettings.resolve(table_settings)
+        table = self.find_table(tset)
+        if table is None:
+            return None
+        else:
+            return table.extract(**(tset.text_settings or {}))

     def _get_textmap(self, **kwargs: Any) -> TextMap:
         defaults = dict(x_shift=self.bbox[0], y_shift=self.bbox[1])
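
The diff elides the body of `sorter`, so the following is only a sketch of the selection rule the README describes (most cells first, ties broken by proximity to the top of the page). `StubTable` and `pick_largest` are hypothetical stand-ins, not pdfplumber names, and the sort key is an assumption inferred from the `Tuple[int, T_num, T_num]` return type; in particular, the final left-edge tie-breaker is a guess.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom), top measured from the page top


@dataclass
class StubTable:
    """Hypothetical stand-in for pdfplumber's Table: a label, its cells, and a bbox."""

    name: str
    cells: List[BBox]
    bbox: BBox


def pick_largest(tables: Sequence[StubTable]) -> Optional[StubTable]:
    """Return the table with the most cells, or None if there are no tables."""
    if not tables:
        return None

    # Assumed key: negate the cell count so the largest table sorts first,
    # then prefer the smallest top coordinate, then the smallest x0.
    def sorter(t: StubTable) -> Tuple[int, float, float]:
        return (-len(t.cells), t.bbox[1], t.bbox[0])

    return sorted(tables, key=sorter)[0]


cell: BBox = (0.0, 0.0, 1.0, 1.0)
small = StubTable("small", [cell] * 2, (0.0, 10.0, 50.0, 40.0))
lower = StubTable("lower", [cell] * 4, (0.0, 300.0, 50.0, 400.0))
upper = StubTable("upper", [cell] * 4, (0.0, 100.0, 50.0, 200.0))

# "lower" and "upper" tie on cell count; "upper" wins because it is nearer the top.
winner = pick_largest([small, lower, upper])
print(winner.name if winner else None)
```

Returning `None` for an empty list mirrors the `Optional[Table]` return type of `find_table`, which is what lets `extract_table` check `if table is None` before extracting.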