-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract XLSX logic + Add Data Dictionary + Add Artifacts #1263
Conversation
35870e3
to
71f56c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think with formatted cells and rows, we're really talking about a Workbook
or Sheet
or Worksheet
here more so than a doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know part of the idea is to abstract from that a bit, but I don't think we can quite escape it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe its a "formatted table"? "tabular report"? The crux of it seems to be these things that are undoubtedly worksheet tables, with some decoration/metadata that then a specific implementation gets to decide exactly how to dump it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like tabular report. However, I do think you could pretty easily render this as HTML, so while it's currently just Excel workbooks, I think that might be overly specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even when we're talking about "rows" of "cells" in models.design.elements
? That seems distinctly tabular in terms of language choice. But I think that's more me getting caught up on semantics - Is the thought that these still might be displayed in a more nested (json-like) format in an html/pdf? Where each "column" is say a header, each "row" a subheader, and each "cell" and element under that
Not to muddy the waters but have you looked at xlsxwriter? Seems to have a bit more support for formatting. Though it seems like you've managed to make things work pretty well with openpyxl |
ece315b
to
b859aca
Compare
b859aca
to
d41ce37
Compare
b217b1a
to
8edf256
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1263 +/- ##
==========================================
+ Coverage 69.67% 70.41% +0.74%
==========================================
Files 111 114 +3
Lines 5913 6129 +216
Branches 659 700 +41
==========================================
+ Hits 4120 4316 +196
- Misses 1661 1669 +8
- Partials 132 144 +12 ☔ View full report in Codecov by Sentry. |
f1e3a7a
to
b88fd84
Compare
8564331
to
2109c2c
Compare
@fvankrieken Yeah, I'd started with openpyxl because we were starting from an XSLX provided by OTI, whereas xlsxwriter can't do that. This is less of a consideration now that we're effectively generating everything, though I think it's nice to enable a template XLSX that could be used as a starter. On the flipside, openpyxl hasn't had a commit in nine years (maybe it didn't need any) however, it'd be pretty trivial at this point to switch over, as almost all the logic now lives in |
2109c2c
to
ac0fcc4
Compare
@fvankrieken I'm currently fixing a few things, and documenting. If you happen to start reviewing imminently, here's what I'm thinking kind of broadly: I like your idea of a So that'd mean probably removing Rows and Cells from design.elements, and making the XLSX writer a little more prescriptive about generating tables (as opposed to constructing them entirely abstractly) But I think the DataDictionary and Artifact side of things are pretty solid concepts that need some sharpening. So would love feedback on that. Attached is a sample XLSX for Zoning Map Amendments. It's a good example for the table formatting in the |
ac0fcc4
to
04eaed4
Compare
MONOSPACED_FONT = "Consolas" | ||
|
||
|
||
def _make_title_subtitle_cell(title: str, subtitle: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fvankrieken some additional context on a previous comment: this type of logic is what I'm thinking we'll move into the XLSX module, along with basically everything in _make_table_top. I think it's really only useful when made concrete in a renderer.
|
||
|
||
# TODO: move to pydantic models | ||
def get_data_source( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fvankrieken and this would be folded into the tabular_report
, which could probably be a Pydantic class from which we'd inherit. (would probably move get_field_metadata
into that class as well)
@@ -26,6 +26,19 @@ class SortedSerializedBase(BaseModel): | |||
_exclude_falsey_values: bool = True | |||
_head_sort_order: list[str] = PrivateAttr(default=["id"]) | |||
_tail_sort_order: list[str] = PrivateAttr(default=["custom"]) | |||
_repr_functions: dict[str, typing.Callable[[typing.Any], str]] = {} | |||
|
|||
def field_repr(self, field_name: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking we'll move this functionality into a TabularReport
class
class DataDictionary(CustomizableBase, TemplatedYamlReader): | ||
org: dict[str, dict[str, FieldDefinition]] = {} | ||
product: dict[str, dict[str, FieldDefinition]] = {} | ||
dataset: dict[str, dict[str, FieldDefinition]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slowly working through this. What are your thoughts on this vs each of these being a list of FieldSet
s, and then FieldSet could also have a label
field. If these are analagous to tabs in an excel sheet, at first thought
datasets:
columns:
column_name:
summary: column_name
extra_description: Name of ...
attributes:
...
is less intuitive than
datasets:
- name: columns # or name: "Column Information" since that's the tab label in the file
fields:
column_name:
summary: column_name
extra_description: Name of ...
You can have a similar argument for the inner dictionary as well. It's definitely more of a usual pattern for our codebase, so just wanted to know what drew you to this.
Also a little curious how you imagine org
and product
playing in here, but can talk about that in person
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's sort of open how org
and product
will work here, since everything is kinda boiled down to a dataset level. I've sort of been assuming that's not the case, but let's def chat about that, as well as the distinction in yaml. (not sure I'm quite following, but I'm sure a quick chat will make it clear)
@@ -123,6 +124,7 @@ class OrgMetadata(SortedSerializedBase, extra="forbid"): | |||
template_vars: dict = Field(default_factory=dict) | |||
metadata: OrgMetadataFile | |||
column_defaults: dict[tuple[str, COLUMN_TYPES], DatasetColumn] | |||
data_dictionary: DataDictionary = DataDictionary() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit pedantic, but I think I'd prefer something like DataDictionarySpec
just to clarify that this is a definition of the format of the data dictionary rather than an instance of one
- uses: actions/checkout@v4 | ||
with: | ||
repository: NYCPlanning/product-metadata | ||
path: product_metadata | ||
|
||
- name: set_product_metadata_path | ||
run: echo "PRODUCT_METADATA_REPO_PATH=$(pwd)/product_metadata" >> $GITHUB_ENV | ||
working-directory: ./ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make more sense to download org metadata via python? we do have dcpy.connectors.edm.product_metadata
but maybe dcpy.connectors.github
is more relevant
understandable if you'd rather go with this for now and focus on other changes here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, for sure something like that... we'd be using Github as a database at that point, and would probably have to rewire how metadata is deserialized (ie to grab over via http). It'd probably make more sense to just "compile" each dataset's metadata, ie compute the dataset with all the org/product overrides, and store that somewhere. Open to ideas though!
@alexrichey from looking at the sample XLSX for Zoning Map Amendments you linked to (zoning_map_amendments_data_dictionary.xlsx), this is great!! I compared it to the XLSX on the (private for now) Open Data page for Zoning Map Amendments and they seem identical to me when you get through the last fixes, happy to approve the PR. before and after that, I'll keep looking through the changes and following the convos |
- formatted key (summary + description) | ||
- value | ||
""" | ||
rows = _make_table_top( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% on this, but at this level of abstraction it still feels slightly odd to me to have both the "top" and actual "table" all present as "row"s. Not that it's entirely improper - even for a non-excel output, to some extent it makes sense to turn each "table"/section(/tab) just into a series of rows/elements to be formatted. But it also seems similarly valid that these abstract "table"s could still have titles (or "header section", or whatever you want to call it - distinct entity from the tabular report itself), which could still already have some formatting but the individual formatters have their own opinion on how to put those together with the table.
That said, there's also some elegance in the specific formatters just getting a bunch of "row"s and knowing exactly what to do with each row. So really I'm just thinking out loud here without any helpful specific feedback lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think a good path forward, as said in person, is to test this out on html/pdfs and see what feels useful there
I like this - when you describe it this way, I feel like this is a (much) more complex version of the little things I was bumping up against with my |
8146a3e
to
a4e1633
Compare
TLDR: I'm working on a real writeup for this, and adding docstrings to make this more readable. If you get there beforehand, the commits are quite atomic and could be easily read that way. Also, there are two tests that we'd expect to fail atm, since they rely on changes in the metadata repo here
There are a few things going on here:
A rewrite of the XLSX code to pull our business logic out, and make it a generalized generator of worksheets with tables in them. This bit I feel pretty good about, although I think we need to add back logic about table headers, rather than generating them abstractly.
(Here's where things get muddy) Facilities to convert Dataset attributes into a table format / abstract of a table. This includes design elements like Tables/Rows/Cells/Styles in
design.elements.py
, as well asabstract_doc.py
which will turn a Pydantic model into a tabular_report'ish thing. I wouldn't spend too much time on details here but would love big thoughts about how to simplify this. Here's a comment outlining my thoughts.Adds a data dictionary for fields on our metadata. This concept feels solid, so nitpicks very welcome.
Adds declarative instructions for generating XLSX files, including what columns to include, and what objects to table'ify. Again, I think this is a solid concept that needs some help settling in. Please pick those nits.
Modifications to existing scripts to start from the full OrgMetadata instead of specific Datasets. In general, we should really never use the dataset metadata directly, as it will not inherit Org attributes, the data dictionary, or artifacts.
I general, I'd love to get input to clean up 1, 3, and 4 above, then merge.
Then for 2, I think we can rewrite this to accommodate the HTML/PDF generation. So for 2) big picture thoughts welcome, but detailed review isn't needed, unless you notice something glaringly wrong.