Extract XLSX logic + Add Data Dictionary + Add Artifacts #1263

alexrichey · 2024-11-21T16:21:30Z

TLDR: I'm working on a real writeup for this, and adding docstrings to make this more readable. If you get there beforehand, the commits are quite atomic and could be easily read that way. Also, there are two tests that we'd expect to fail atm, since they rely on changes in the metadata repo here

There are a few things going on here:

1. A rewrite of the XLSX code to pull our business logic out, and make it a generalized generator of worksheets with tables in them. This bit I feel pretty good about, although I think we need to add back logic about table headers, rather than generating them abstractly.
2. (Here's where things get muddy) Facilities to convert Dataset attributes into a table format / abstract of a table. This includes design elements like Tables/Rows/Cells/Styles in design.elements.py, as well as abstract_doc.py which will turn a Pydantic model into a tabular_report'ish thing. I wouldn't spend too much time on details here but would love big thoughts about how to simplify this. Here's a comment outlining my thoughts.
3. Adds a data dictionary for fields on our metadata. This concept feels solid, so nitpicks very welcome.
4. Adds declarative instructions for generating XLSX files, including what columns to include, and what objects to table'ify. Again, I think this is a solid concept that needs some help settling in. Please pick those nits.
5. Modifications to existing scripts to start from the full OrgMetadata instead of specific Datasets. In general, we should really never use the dataset metadata directly, as it will not inherit Org attributes, the data dictionary, or artifacts.

In general, I'd love to get input to clean up 1, 3, and 4 above, then merge.

Then for 2, I think we can rewrite this to accommodate the HTML/PDF generation. So for 2) big picture thoughts welcome, but detailed review isn't needed, unless you notice something glaringly wrong.

fvankrieken · 2024-11-21T22:05:29Z

dcpy/lifecycle/package/abstract_doc.py

I do think with formatted cells and rows, we're really talking about a Workbook or Sheet or Worksheet here more so than a doc

I know part of the idea is to abstract from that a bit, but I don't think we can quite escape it

Maybe it's a "formatted table"? "tabular report"? The crux of it seems to be these things that are undoubtedly worksheet tables, with some decoration/metadata that then a specific implementation gets to decide exactly how to dump it out

I like tabular report. However, I do think you could pretty easily render this as HTML, so while it's currently just Excel workbooks, I think that might be overly specific.

Even when we're talking about "rows" of "cells" in models.design.elements? That seems distinctly tabular in terms of language choice. But I think that's more me getting caught up on semantics. Is the thought that these still might be displayed in a more nested (json-like) format in an html/pdf? Where each "column" is, say, a header, each "row" a subheader, and each "cell" an element under that?

fvankrieken · 2024-11-21T22:09:50Z

Not to muddy the waters but have you looked at xlsxwriter? Seems to have a bit more support for formatting. Though it seems like you've managed to make things work pretty well with openpyxl

codecov · 2024-12-02T19:01:44Z

Codecov Report

Attention: Patch coverage is 89.19753% with 35 lines in your changes missing coverage. Please review.

Project coverage is 70.41%. Comparing base (65b2420) to head (a4e1633).
Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
dcpy/lifecycle/package/abstract_doc.py	84.61%	9 Missing and 7 partials ⚠️
dcpy/lifecycle/package/xlsx_writer.py	88.34%	6 Missing and 6 partials ⚠️
dcpy/lifecycle/package/assemble.py	16.66%	5 Missing ⚠️
dcpy/lifecycle/package/_cli.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1263      +/-   ##
==========================================
+ Coverage   69.67%   70.41%   +0.74%     
==========================================
  Files         111      114       +3     
  Lines        5913     6129     +216     
  Branches      659      700      +41     
==========================================
+ Hits         4120     4316     +196     
- Misses       1661     1669       +8     
- Partials      132      144      +12

alexrichey · 2024-12-03T22:09:09Z

Not to muddy the waters but have you looked at xlsxwriter? Seems to have a bit more support for formatting. Though it seems like you've managed to make things work pretty well with openpyxl

@fvankrieken Yeah, I'd started with openpyxl because we were starting from an XLSX provided by OTI, whereas xlsxwriter can't do that. This is less of a consideration now that we're effectively generating everything, though I think it's nice to enable a template XLSX that could be used as a starter. On the flipside, openpyxl hasn't had a commit in nine years (maybe it didn't need any). However, it'd be pretty trivial at this point to switch over, as almost all the logic now lives in abstract_doc.py

alexrichey · 2024-12-04T16:22:13Z

@fvankrieken I'm currently fixing a few things, and documenting. If you happen to start reviewing imminently, here's what I'm thinking kind of broadly:

I like your idea of a tabular_report - specifically I'm thinking of it as an object you can generate for any of our Pydantic models (though just starting with Metadata), that will translate a pydantic class into information displayable in a table, with the field metadata if it's available. Then we can pass the tabular_report to different renderers (ie XLSX, or HTML for @damonmcc). It would also carry around some suggested styling for cells, (e.g. if you render the column.values rows as a str, you should use a monospaced font) that renderers can use or ignore.

So that'd mean probably removing Rows and Cells from design.elements, and making the XLSX writer a little more prescriptive about generating tables (as opposed to constructing them entirely abstractly)

But I think the DataDictionary and Artifact side of things are pretty solid concepts that need some sharpening. So would love feedback on that.

Attached is a sample XLSX for Zoning Map Amendments. It's a good example for the table formatting in the Column Information tab.

zoning_map_amendments_data_dictionary.xlsx
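
To make the tabular_report idea above a bit more concrete, here's a minimal sketch. Every name in it (FieldInfo, ReportRow, make_tabular_report, render_plain_text) is hypothetical and not part of dcpy, and a plain dataclass stands in for a Pydantic model to keep it dependency-free. The column values are illustrative only. It shows a report carrying optional field metadata plus a style suggestion that a renderer is free to ignore.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FieldInfo:
    """Hypothetical data-dictionary entry for one model field."""
    summary: str
    description: str = ""


@dataclass
class ReportRow:
    """One displayable row; style_hint is a suggestion, not a requirement."""
    key: str
    label: str
    value: str
    style_hint: Optional[str] = None


@dataclass
class Column:
    """Stand-in for a Pydantic model (assumption: the real models differ)."""
    name: str
    data_type: str
    values: list


def make_tabular_report(model, field_info, monospace_fields=frozenset()):
    """Translate a model instance into rows, using field metadata if available."""
    rows = []
    for f in fields(model):
        info = field_info.get(f.name)
        rows.append(
            ReportRow(
                key=f.name,
                label=info.summary if info else f.name,
                value=str(getattr(model, f.name)),
                style_hint="monospace" if f.name in monospace_fields else None,
            )
        )
    return rows


def render_plain_text(rows):
    """A renderer that ignores the style hints entirely."""
    return "\n".join(f"{r.label}: {r.value}" for r in rows)
```

An XLSX or HTML renderer would take the same rows and decide for itself whether to honor style_hint, which is roughly the split between report and renderer described above.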

alexrichey · 2024-12-04T16:43:45Z

dcpy/lifecycle/package/abstract_doc.py

+MONOSPACED_FONT = "Consolas"
+
+
+def _make_title_subtitle_cell(title: str, subtitle: str):


@fvankrieken some additional context on a previous comment: this type of logic is what I'm thinking we'll move into the XLSX module, along with basically everything in _make_table_top. I think it's really only useful when made concrete in a renderer.

alexrichey · 2024-12-04T16:45:36Z

dcpy/lifecycle/package/abstract_doc.py

+
+
+# TODO: move to pydantic models
+def get_data_source(


@fvankrieken and this would be folded into the tabular_report, which could probably be a Pydantic class from which we'd inherit. (would probably move get_field_metadata into that class as well)

alexrichey · 2024-12-04T16:50:50Z

dcpy/models/base.py

@@ -26,6 +26,19 @@ class SortedSerializedBase(BaseModel):
    _exclude_falsey_values: bool = True
    _head_sort_order: list[str] = PrivateAttr(default=["id"])
    _tail_sort_order: list[str] = PrivateAttr(default=["custom"])
+    _repr_functions: dict[str, typing.Callable[[typing.Any], str]] = {}
+
+    def field_repr(self, field_name: str):


thinking we'll move this functionality into a TabularReport class

fvankrieken · 2024-12-04T20:18:57Z

dcpy/models/product/data_dictionary.py

+class DataDictionary(CustomizableBase, TemplatedYamlReader):
+    org: dict[str, dict[str, FieldDefinition]] = {}
+    product: dict[str, dict[str, FieldDefinition]] = {}
+    dataset: dict[str, dict[str, FieldDefinition]] = {}


Slowly working through this. What are your thoughts on this vs each of these being a list of FieldSets, and then FieldSet could also have a label field? If these are analogous to tabs in an excel sheet, at first thought

datasets:
  columns:
    column_name:
      summary: column_name
      extra_description: Name of ...
  attributes: ...

is less intuitive than

datasets:
  - name: columns # or name: "Column Information" since that's the tab label in the file
    fields:
      column_name:
        summary: column_name
        extra_description: Name of ...

You can have a similar argument for the inner dictionary as well. It's definitely more of a usual pattern for our codebase, so just wanted to know what drew you to this.

Also a little curious how you imagine org and product playing in here, but can talk about that in person

Yeah, it's sort of open how org and product will work here, since everything is kinda boiled down to a dataset level. I've sort of been assuming that's not the case, but let's def chat about that, as well as the distinction in yaml. (not sure I'm quite following, but I'm sure a quick chat will make it clear)
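
For reference while discussing, the two shapes in this thread can be sketched side by side. This is only an illustration: FieldDefinition is reduced here to the two keys that appear in the YAML above, and FieldSet and to_field_sets are hypothetical names, not existing dcpy code.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDefinition:
    """Simplified stand-in; the real class may carry more attributes."""
    summary: str
    extra_description: str = ""


# Shape 1 (as in this PR): field-set name -> field name -> definition
nested = {
    "columns": {
        "column_name": FieldDefinition("column_name", "Name of ..."),
    }
}


@dataclass
class FieldSet:
    """Shape 2 (proposed): a named set that can also carry a tab label."""
    name: str
    fields: dict = field(default_factory=dict)
    label: str = ""


def to_field_sets(nested_sets, labels=None):
    """Convert shape 1 into shape 2, defaulting each label to the set name."""
    labels = labels or {}
    return [
        FieldSet(name=name, fields=dict(defs), label=labels.get(name, name))
        for name, defs in nested_sets.items()
    ]
```

The list form leaves room for per-set attributes such as a label, at the cost of losing direct key lookup by set name.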

fvankrieken · 2024-12-04T20:21:56Z

dcpy/models/product/metadata.py

@@ -123,6 +124,7 @@ class OrgMetadata(SortedSerializedBase, extra="forbid"):
    template_vars: dict = Field(default_factory=dict)
    metadata: OrgMetadataFile
    column_defaults: dict[tuple[str, COLUMN_TYPES], DatasetColumn]
+    data_dictionary: DataDictionary = DataDictionary()


A bit pedantic, but I think I'd prefer something like DataDictionarySpec just to clarify that this is a definition of the format of the data dictionary rather than an instance of one

damonmcc · 2024-12-05T15:27:27Z

.github/workflows/template_build.yml

+      - uses: actions/checkout@v4
+        with:
+          repository: NYCPlanning/product-metadata
+          path: product_metadata
+
+      - name: set_product_metadata_path
+        run:  echo "PRODUCT_METADATA_REPO_PATH=$(pwd)/product_metadata" >> $GITHUB_ENV
+        working-directory: ./


would it make more sense to download org metadata via python? we do have dcpy.connectors.edm.product_metadata but maybe dcpy.connectors.github is more relevant

understandable if you'd rather go with this for now and focus on other changes here

Eventually, for sure something like that... we'd be using Github as a database at that point, and would probably have to rewire how metadata is deserialized (ie to grab over via http). It'd probably make more sense to just "compile" each dataset's metadata, ie compute the dataset with all the org/product overrides, and store that somewhere. Open to ideas though!
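
A rough sketch of what "compiling" could mean, under the assumption that metadata at each level is a plain nested dict and that precedence runs org < product < dataset. The function name and the sample keys are made up for illustration; the real deserialization and override rules may differ.

```python
def compile_dataset_metadata(org: dict, product: dict, dataset: dict) -> dict:
    """Merge the three layers so more specific levels override general ones.

    Assumed precedence (lowest to highest): org, product, dataset.
    Nested dicts are merged key by key; any other value is replaced outright.
    """

    def merge(base: dict, override: dict) -> dict:
        merged = dict(base)  # shallow copy so the inputs are not mutated
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge(merged[key], value)
            else:
                merged[key] = value
        return merged

    return merge(merge(org, product), dataset)
```

The compiled result could then be stored once per dataset, so consumers never need to resolve org or product inheritance themselves.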

damonmcc · 2024-12-06T01:33:27Z

@alexrichey from looking at the sample XLSX for Zoning Map Amendments you linked to (zoning_map_amendments_data_dictionary.xlsx), this is great!! I compared it to the XLSX on the (private for now) Open Data page for Zoning Map Amendments and they seem identical to me

when you get through the last fixes, happy to approve the PR. before and after that, I'll keep looking through the changes and following the convos

fvankrieken · 2024-12-06T15:16:18Z

dcpy/lifecycle/package/abstract_doc.py

+        - formatted key (summary + description)
+        - value
+    """
+    rows = _make_table_top(


Not 100% on this, but at this level of abstraction it still feels slightly odd to me to have both the "top" and actual "table" all present as "row"s. Not that it's entirely improper - even for a non-excel output, to some extent it makes sense to turn each "table"/section(/tab) just into a series of rows/elements to be formatted. But it also seems similarly valid that these abstract "table"s could still have titles (or "header section", or whatever you want to call it - distinct entity from the tabular report itself), which could still already have some formatting but the individual formatters have their own opinion on how to put those together with the table.

That said, there's also some elegance in the specific formatters just getting a bunch of "row"s and knowing exactly what to do with each row. So really I'm just thinking out loud here without any helpful specific feedback lol

But I think a good path forward, as said in person, is to test this out on html/pdfs and see what feels useful there
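
One way to picture the "header as a distinct entity" option: the abstract table keeps its title separate from its rows, and each renderer decides how to combine them. This is a sketch only, with hypothetical class and function names and an illustrative sample row; it is not how abstract_doc.py is structured today.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class TableHeader:
    title: str
    subtitle: str = ""


@dataclass
class AbstractTable:
    header: TableHeader
    rows: list = field(default_factory=list)  # list of (key, value) string pairs


def render_text(table: AbstractTable) -> str:
    """Text renderer: puts the header on its own lines above the rows."""
    lines = [table.header.title.upper()]
    if table.header.subtitle:
        lines.append(table.header.subtitle)
    lines += [f"{key}\t{value}" for key, value in table.rows]
    return "\n".join(lines)


def render_html(table: AbstractTable) -> str:
    """HTML renderer: uses the title as a caption and drops the subtitle."""
    body = "".join(
        f"<tr><th>{escape(key)}</th><td>{escape(value)}</td></tr>"
        for key, value in table.rows
    )
    return f"<table><caption>{escape(table.header.title)}</caption>{body}</table>"
```

The alternative discussed above, where the header is just more rows, would remove TableHeader and leave every renderer with a single list to walk.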

fvankrieken · 2024-12-06T15:25:29Z

I like this - when you describe it this way, I feel like this is a (much) more complex version of the little things I was bumping up against with my indented_print. More complex formatting obviously. For generalizing to pydantic models, curious on your thoughts around levels of nested models - would you "truncate" at some point in building a more generic report? Or handle that at the level of renderer? This is your tabulate case - something that needs to be coerced to a cell itself has nested structure. So in other words, should the "abstract_doc" have some knowledge of how deep the data can go, or should a renderer have logic to, when it finds another table inside a cell, to call "tabulate" or something like that. Or do we leave all of that up to the specific artifact definition
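
As a small illustration of the "truncate at some depth" option, here is a sketch in which the report builder flattens nested data up to max_depth and stringifies anything deeper, leaving it to a renderer to tabulate that value further if it wants. The function name, the dotted-path convention, and the sample data are all assumptions for the example.

```python
def flatten(data: dict, max_depth: int = 2, prefix: str = "", depth: int = 1):
    """Flatten nested dicts into (dotted_path, str_value) rows.

    Dicts are expanded while depth < max_depth; at max_depth a nested
    dict is coerced to a single string cell instead of being expanded.
    """
    rows = []
    for key, value in data.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and depth < max_depth:
            rows.extend(flatten(value, max_depth, prefix=f"{path}.", depth=depth + 1))
        else:
            rows.append((path, str(value)))
    return rows
```

With max_depth=1 every nested model collapses into one cell, which is the case where a renderer would need its own "tabulate" step.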

alexrichey force-pushed the ar-new-metadata-tab branch from 35870e3 to 71f56c8 Compare November 21, 2024 18:43

fvankrieken reviewed Nov 21, 2024

alexrichey force-pushed the ar-new-metadata-tab branch 2 times, most recently from ece315b to b859aca Compare November 22, 2024 17:34

alexrichey marked this pull request as draft November 25, 2024 17:14

alexrichey force-pushed the ar-new-metadata-tab branch from b859aca to d41ce37 Compare November 26, 2024 17:45

alexrichey mentioned this pull request Dec 2, 2024

Add Data Dictionary and Artifacts. Zoning metadata updates per Matt NYCPlanning/product-metadata#24

Merged

alexrichey force-pushed the ar-new-metadata-tab branch from b217b1a to 8edf256 Compare December 2, 2024 15:58

alexrichey force-pushed the ar-new-metadata-tab branch 4 times, most recently from f1e3a7a to b88fd84 Compare December 3, 2024 15:28

alexrichey marked this pull request as ready for review December 3, 2024 18:43

alexrichey changed the title ~~Data Dictionary WIP for XLSX~~ Extract XLSX logic + Add Data Dictionary + Add Artifacts Dec 3, 2024

Alex Richey added 5 commits December 3, 2024 16:47

Add new attrs required for the zoning metadata

be1e344

Add Revisions to the Dataset Model

7fa7302

Add DataDictionary to document fields on our models

5433135

Add __repr__ equivalent to pretty-print types

248ad94

Add Abstract Docs and Artifacts

25d7e35

alexrichey force-pushed the ar-new-metadata-tab branch 2 times, most recently from 8564331 to 2109c2c Compare December 3, 2024 22:06

alexrichey force-pushed the ar-new-metadata-tab branch from 2109c2c to ac0fcc4 Compare December 3, 2024 22:18

Alex Richey added 3 commits December 4, 2024 11:39

Rewrite XLSX generator to take tables + org md

9cdb763

Rename package.oti_xlsx -> xlsx_writer

8c76690

Remove OTI Template Tabs from XLSX

04eaed4

alexrichey force-pushed the ar-new-metadata-tab branch from ac0fcc4 to 04eaed4 Compare December 4, 2024 16:40

alexrichey commented Dec 4, 2024

fvankrieken reviewed Dec 4, 2024

damonmcc reviewed Dec 5, 2024

alexrichey assigned sf-dcp Dec 5, 2024

fvankrieken reviewed Dec 6, 2024

Ongoing docs + quick fixes

a4e1633

alexrichey force-pushed the ar-new-metadata-tab branch from 8146a3e to a4e1633 Compare December 6, 2024 15:46

fvankrieken approved these changes Dec 9, 2024

alexrichey merged commit 340ec91 into main Dec 9, 2024
18 of 20 checks passed

alexrichey deleted the ar-new-metadata-tab branch December 9, 2024 22:12
