Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract XLSX logic + Add Data Dictionary + Add Artifacts #1263

Merged
merged 9 commits into from
Dec 9, 2024

Conversation

alexrichey
Copy link
Contributor

@alexrichey alexrichey commented Nov 21, 2024

TLDR: I'm working on a real writeup for this, and adding docstrings to make this more readable. If you get there beforehand, the commits are quite atomic and could be easily read that way. Also, there are two tests that we'd expect to fail atm, since they rely on changes in the metadata repo here

There are a few things going on here:

  1. A rewrite of the XLSX code to pull our business logic out, and make it a generalized generator of worksheets with tables in them. This bit I feel pretty good about, although I think we need to add back logic about table headers, rather than generating them abstractly.

  2. (Here's where things get muddy) Facilities to convert Dataset attributes into a table format / abstract of a table. This includes design elements like Tables/Rows/Cells/Styles in design.elements.py, as well as abstract_doc.py which will turn a Pydantic model into a tabular_report'ish thing. I wouldn't spend too much time on details here but would love big thoughts about how to simplify this. Here's a comment outlining my thoughts.

  3. Adds a data dictionary for fields on our metadata. This concept feels solid, so nitpicks very welcome.

  4. Adds declarative instructions for generating XLSX files, including what columns to include, and what objects to table'ify. Again, I think this is a solid concept that needs some help settling in. Please pick those nits.

  5. Modifications to existing scripts to start from the full OrgMetadata instead of specific Datasets. In general, we should really never use the dataset metadata directly, as it will not inherit Org attributes, the data dictionary, or artifacts.

I general, I'd love to get input to clean up 1, 3, and 4 above, then merge.

Then for 2, I think we can rewrite this to accommodate the HTML/PDF generation. So for 2) big picture thoughts welcome, but detailed review isn't needed, unless you notice something glaringly wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think with formatted cells and rows, we're really talking about a Workbook or Sheet or Worksheet here more so than a doc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know part of the idea is to abstract from that a bit, but I don't think we can quite escape it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe its a "formatted table"? "tabular report"? The crux of it seems to be these things that are undoubtedly worksheet tables, with some decoration/metadata that then a specific implementation gets to decide exactly how to dump it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like tabular report. However, I do think you could pretty easily render this as HTML, so while it's currently just Excel workbooks, I think that might be overly specific.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when we're talking about "rows" of "cells" in models.design.elements? That seems distinctly tabular in terms of language choice. But I think that's more me getting caught up on semantics - Is the thought that these still might be displayed in a more nested (json-like) format in an html/pdf? Where each "column" is say a header, each "row" a subheader, and each "cell" and element under that

@fvankrieken
Copy link
Contributor

Not to muddy the waters but have you looked at xlsxwriter? Seems to have a bit more support for formatting. Though it seems like you've managed to make things work pretty well with openpyxl

Copy link

codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 89.19753% with 35 lines in your changes missing coverage. Please review.

Project coverage is 70.41%. Comparing base (65b2420) to head (a4e1633).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
dcpy/lifecycle/package/abstract_doc.py 84.61% 9 Missing and 7 partials ⚠️
dcpy/lifecycle/package/xlsx_writer.py 88.34% 6 Missing and 6 partials ⚠️
dcpy/lifecycle/package/assemble.py 16.66% 5 Missing ⚠️
dcpy/lifecycle/package/_cli.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1263      +/-   ##
==========================================
+ Coverage   69.67%   70.41%   +0.74%     
==========================================
  Files         111      114       +3     
  Lines        5913     6129     +216     
  Branches      659      700      +41     
==========================================
+ Hits         4120     4316     +196     
- Misses       1661     1669       +8     
- Partials      132      144      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexrichey alexrichey force-pushed the ar-new-metadata-tab branch 4 times, most recently from f1e3a7a to b88fd84 Compare December 3, 2024 15:28
@alexrichey alexrichey marked this pull request as ready for review December 3, 2024 18:43
@alexrichey alexrichey changed the title Data Dictionary WIP for XLSX Extract XLSX logic + Add Data Dictionary + Add Artifacts Dec 3, 2024
@alexrichey alexrichey force-pushed the ar-new-metadata-tab branch 2 times, most recently from 8564331 to 2109c2c Compare December 3, 2024 22:06
@alexrichey
Copy link
Contributor Author

Not to muddy the waters but have you looked at xlsxwriter? Seems to have a bit more support for formatting. Though it seems like you've managed to make things work pretty well with openpyxl

@fvankrieken Yeah, I'd started with openpyxl because we were starting from an XSLX provided by OTI, whereas xlsxwriter can't do that. This is less of a consideration now that we're effectively generating everything, though I think it's nice to enable a template XLSX that could be used as a starter. On the flipside, openpyxl hasn't had a commit in nine years (maybe it didn't need any) however, it'd be pretty trivial at this point to switch over, as almost all the logic now lives in abstract_docs.py

@alexrichey
Copy link
Contributor Author

@fvankrieken I'm currently fixing a few things, and documenting. If you happen to start reviewing imminently, here's what I'm thinking kind of broadly:

I like your idea of a tabular_report - specifically I'm thinking of it as an object you can generate for any of our Pydantic models (though just starting with Metadata), that will translate a pydantic class into information displayable in a table, with the field metadata if it's available. Then we can pass the tabular_report to different renderers (ie XLSX, or HTML for @damonmcc). It would also carry around some suggested styling for cells, (e.g. if you render the column.values rows as a str, you should use a monospaced font) that renderers can use or ignore.

So that'd mean probably removing Rows and Cells from design.elements, and making the XLSX writer a little more prescriptive about generating tables (as opposed to constructing them entirely abstractly)

But I think the DataDictionary and Artifact side of things are pretty solid concepts that need some sharpening. So would love feedback on that.

Attached is a sample XLSX for Zoning Map Amendments. It's a good example for the table formatting in the Column Information tab.

zoning_map_amendments_data_dictionary.xlsx

MONOSPACED_FONT = "Consolas"


def _make_title_subtitle_cell(title: str, subtitle: str):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fvankrieken some additional context on a previous comment: this type of logic is what I'm thinking we'll move into the XLSX module, along with basically everything in _make_table_top. I think it's really only useful when made concrete in a renderer.



# TODO: move to pydantic models
def get_data_source(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fvankrieken and this would be folded into the tabular_report, which could probably be a Pydantic class from which we'd inherit. (would probably move get_field_metadata into that class as well)

@@ -26,6 +26,19 @@ class SortedSerializedBase(BaseModel):
_exclude_falsey_values: bool = True
_head_sort_order: list[str] = PrivateAttr(default=["id"])
_tail_sort_order: list[str] = PrivateAttr(default=["custom"])
_repr_functions: dict[str, typing.Callable[[typing.Any], str]] = {}

def field_repr(self, field_name: str):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking we'll move this functionality into a TabularReport class

class DataDictionary(CustomizableBase, TemplatedYamlReader):
org: dict[str, dict[str, FieldDefinition]] = {}
product: dict[str, dict[str, FieldDefinition]] = {}
dataset: dict[str, dict[str, FieldDefinition]] = {}
Copy link
Contributor

@fvankrieken fvankrieken Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slowly working through this. What are your thoughts on this vs each of these being a list of FieldSets, and then FieldSet could also have a label field. If these are analagous to tabs in an excel sheet, at first thought

datasets:
  columns:
    column_name:
      summary: column_name
      extra_description: Name of ...
  attributes:
     ...

is less intuitive than

datasets:
- name: columns # or name: "Column Information" since that's the tab label in the file
  fields:
    column_name:
      summary: column_name
      extra_description: Name of ...

You can have a similar argument for the inner dictionary as well. It's definitely more of a usual pattern for our codebase, so just wanted to know what drew you to this.

Also a little curious how you imagine org and product playing in here, but can talk about that in person

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's sort of open how org and product will work here, since everything is kinda boiled down to a dataset level. I've sort of been assuming that's not the case, but let's def chat about that, as well as the distinction in yaml. (not sure I'm quite following, but I'm sure a quick chat will make it clear)

@@ -123,6 +124,7 @@ class OrgMetadata(SortedSerializedBase, extra="forbid"):
template_vars: dict = Field(default_factory=dict)
metadata: OrgMetadataFile
column_defaults: dict[tuple[str, COLUMN_TYPES], DatasetColumn]
data_dictionary: DataDictionary = DataDictionary()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit pedantic, but I think I'd prefer something like DataDictionarySpec just to clarify that this is a definition of the format of the data dictionary rather than an instance of one

Comment on lines +53 to +60
- uses: actions/checkout@v4
with:
repository: NYCPlanning/product-metadata
path: product_metadata

- name: set_product_metadata_path
run: echo "PRODUCT_METADATA_REPO_PATH=$(pwd)/product_metadata" >> $GITHUB_ENV
working-directory: ./
Copy link
Member

@damonmcc damonmcc Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make more sense to download org metadata via python? we do have dcpy.connectors.edm.product_metadata but maybe dcpy.connectors.github is more relevant

understandable if you'd rather go with this for now and focus on other changes here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, for sure something like that... we'd be using Github as a database at that point, and would probably have to rewire how metadata is deserialized (ie to grab over via http). It'd probably make more sense to just "compile" each dataset's metadata, ie compute the dataset with all the org/product overrides, and store that somewhere. Open to ideas though!

@damonmcc
Copy link
Member

damonmcc commented Dec 6, 2024

@alexrichey from looking at the sample XLSX for Zoning Map Amendments you linked to (zoning_map_amendments_data_dictionary.xlsx), this is great!! I compared it to the XLSX on the (private for now) Open Data page for Zoning Map Amendments and they seem identical to me

when you get through the last fixes, happy to approve the PR. before and after that, I'll keep looking through the changes and following the convos

- formatted key (summary + description)
- value
"""
rows = _make_table_top(
Copy link
Contributor

@fvankrieken fvankrieken Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% on this, but at this level of abstraction it still feels slightly odd to me to have both the "top" and actual "table" all present as "row"s. Not that it's entirely improper - even for a non-excel output, to some extent it makes sense to turn each "table"/section(/tab) just into a series of rows/elements to be formatted. But it also seems similarly valid that these abstract "table"s could still have titles (or "header section", or whatever you want to call it - distinct entity from the tabular report itself), which could still already have some formatting but the individual formatters have their own opinion on how to put those together with the table.

That said, there's also some elegance in the specific formatters just getting a bunch of "row"s and knowing exactly what to do with each row. So really I'm just thinking out loud here without any helpful specific feedback lol

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think a good path forward, as said in person, is to test this out on html/pdfs and see what feels useful there

@fvankrieken
Copy link
Contributor

@fvankrieken I'm currently fixing a few things, and documenting. If you happen to start reviewing imminently, here's what I'm thinking kind of broadly:

I like your idea of a tabular_report - specifically I'm thinking of it as an object you can generate for any of our Pydantic models (though just starting with Metadata), that will translate a pydantic class into information displayable in a table, with the field metadata if it's available. Then we can pass the tabular_report to different renderers (ie XLSX, or HTML for @damonmcc). It would also carry around some suggested styling for cells, (e.g. if you render the column.values rows as a str, you should use a monospaced font) that renderers can use or ignore.

So that'd mean probably removing Rows and Cells from design.elements, and making the XLSX writer a little more prescriptive about generating tables (as opposed to constructing them entirely abstractly)

But I think the DataDictionary and Artifact side of things are pretty solid concepts that need some sharpening. So would love feedback on that.

Attached is a sample XLSX for Zoning Map Amendments. It's a good example for the table formatting in the Column Information tab.

zoning_map_amendments_data_dictionary.xlsx

I like this - when you describe it this way, I feel like this is a (much) more complex version of the little things I was bumping up against with my indented_print. More complex formatting obviously. For generalizing to pydantic models, curious on your thoughts around levels of nested models - would you "truncate" at some point in building a more generic report? Or handle that at the level of renderer? This is your tabulate case - something that needs to be coerced to a cell itself has nested structure. So in other words, should the "abstract_doc" have some knowledge of how deep the data can go, or should a renderer have logic to, when it finds another table inside a cell, to call "tabulate" or something like that. Or do we leave all of that up to the specific artifact definition

@alexrichey alexrichey merged commit 340ec91 into main Dec 9, 2024
18 of 20 checks passed
@alexrichey alexrichey deleted the ar-new-metadata-tab branch December 9, 2024 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants