migrate library scripts to ingest - FacDB #1313

Open
wants to merge 12 commits into
base: main


@fvankrieken fvankrieken commented Dec 10, 2024

Almost closes #1290; it doesn't have nypl libraries yet.

Commit 1 has a big tweak: it adds an option to cast varchar fields from library to bigint on comparison. In my opinion, this should be done in one case only:

  • these columns are cast to int at the beginning of our builds
  • as part of validation/migration, downstream code changes are made to take advantage of the new, more "accurate" types.

Otherwise, this should not be done as part of the comparison when migrating from library to ingest.
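
To make the idea of casting on comparison concrete, here is a minimal sketch rather than the code in this PR. The helper `build_except_query`, the `demo_library` / `demo_ingest` tables, and the use of SQLite as a stand-in database are all assumptions for illustration. Note one difference: Postgres refuses to run an EXCEPT across varchar and bigint columns at all, while SQLite runs it but treats text `'12'` and integer `12` as different values. Either way, casting both sides first is what lets the rows line up.

```python
import sqlite3


def build_except_query(left, right, columns, cast_to_numeric=()):
    """Build `SELECT ... FROM left EXCEPT SELECT ... FROM right`.

    Hypothetical helper: columns named in `cast_to_numeric` are wrapped in
    CAST(... AS BIGINT) on both sides. Identifiers are interpolated directly,
    so this is only suitable for trusted table and column names.
    """
    to_cast = set(cast_to_numeric)
    unknown = to_cast - set(columns)
    if unknown:
        raise ValueError(f"cannot cast columns that are not compared: {sorted(unknown)}")
    select_list = ", ".join(
        f'CAST("{c}" AS BIGINT) AS "{c}"' if c in to_cast else f'"{c}"'
        for c in columns
    )
    return f'SELECT {select_list} FROM "{left}" EXCEPT SELECT {select_list} FROM "{right}"'


# Stand-in tables: the "library" copy stores counts as text, the "ingest" copy as integers.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE demo_library (school_name TEXT, prek TEXT);
    CREATE TABLE demo_ingest (school_name TEXT, prek INTEGER);
    INSERT INTO demo_library VALUES ('School A', '12'), ('School B', '0');
    INSERT INTO demo_ingest VALUES ('School A', 12), ('School B', 0);
    """
)
columns = ["school_name", "prek"]

# Without the cast, every row differs only by type.
uncast = conn.execute(build_except_query("demo_library", "demo_ingest", columns)).fetchall()

# With the cast, the same rows match and nothing is left over.
cast = conn.execute(
    build_except_query("demo_library", "demo_ingest", columns, ["prek"])
).fetchall()
```

Requiring the caller to name each cast column, instead of casting everything that happens to differ, keeps the coercion an explicit decision per dataset.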

For example, when comparing nysed_nonpublicenrollment, I originally got this:

ProgrammingError: (psycopg2.errors.DatatypeMismatch) EXCEPT types character varying and bigint cannot be matched

But now, running

lifecycle scripts validate_ingest compare nysed_nonpublicenrollment --c2n prek --c2n halfk --c2n fullk --c2n gr1 --c2n gr2 --c2n gr3 --c2n gr4 --c2n gr5 --c2n gr6 --c2n gr7 --c2n gr8 --c2n gr9 --c2n gr10 --c2n gr11 --c2n gr12 --c2n institution_id --c2n beds_code --c2n ugs --c2n uge

(a bit verbose, but I think forcing intentionality is good when this is sort of twisting the validation), I get:

________________________________________________________________________________
Tables
    Left: nysed_nonpublicenrollment_library
    Right: nysed_nonpublicenrollment_ingest
________________________________________________________________________________
Row count
    Left: 1822
    Right: 1822
________________________________________________________________________________
Column comparison
    Both
        affliation
        beds_code
        county
        data_library_version
        fullk
        gr1
        gr10
        gr11
        gr12
        gr2
        gr3
        gr4
        gr5
        gr6
        gr7
        gr8
        gr9
        halfk
        institution_id
        ogc_fid
        prek
        school_name
        school_year
        uge
        ugs
    Left only: None
    Right only: None
    Type differences
        Halfk
            Left: character varying
            Right: bigint
        Institution id
            Left: character varying
            Right: bigint
        Gr11
            Left: character varying
            Right: bigint
        Beds code
            Left: character varying
            Right: bigint
        Gr8
            Left: character varying
            Right: bigint
        Gr6
            Left: character varying
            Right: bigint
        Prek
            Left: character varying
            Right: bigint
        Uge
            Left: character varying
            Right: bigint
        Gr7
            Left: character varying
            Right: bigint
        Gr2
            Left: character varying
            Right: bigint
        Gr9
            Left: character varying
            Right: bigint
        Gr10
            Left: character varying
            Right: bigint
        Gr1
            Left: character varying
            Right: bigint
        Gr3
            Left: character varying
            Right: bigint
        Affliation
            Left: character varying
            Right: text
        Gr12
            Left: character varying
            Right: bigint
        Ugs
            Left: character varying
            Right: bigint
        Gr5
            Left: character varying
            Right: bigint
        School name
            Left: character varying
            Right: text
        Gr4
            Left: character varying
            Right: bigint
        School year
            Left: character varying
            Right: text
        County
            Left: character varying
            Right: text
        Fullk
            Left: character varying
            Right: bigint
________________________________________________________________________________
Data comparison
    Compared columns
        affliation
        beds_code
        county
        fullk
        gr1
        gr10
        gr11
        gr12
        gr2
        gr3
        gr4
        gr5
        gr6
        gr7
        gr8
        gr9
        halfk
        institution_id
        prek
        school_name
        school_year
        uge
        ugs
    Ignored columns
        ogc_fid
        data_library_version
    Columns coerced to numeric
        prek
        halfk
        fullk
        gr1
        gr2
        gr3
        gr4
        gr5
        gr6
        gr7
        gr8
        gr9
        gr10
        gr11
        gr12
        institution_id
        beds_code
        ugs
        uge
    Left only
        Empty DataFrame
        Columns: [halfk, institution_id, gr11, beds_code, gr8, gr6, prek, uge, gr7, gr2, gr9, gr10, gr1, gr3, affliation, gr12, ugs, gr5, school_name, gr4, school_year, county, fullk]
        Index: []
    Right only
        Empty DataFrame
        Columns: [halfk, institution_id, gr11, beds_code, gr8, gr6, prek, uge, gr7, gr2, gr9, gr10, gr1, gr3, affliation, gr12, ugs, gr5, school_name, gr4, school_year, county, fullk]
        Index: []
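
The one-flag-per-column pattern used in the command above can be sketched with the standard library's argparse. This is an illustration only: the CLI framework the project actually uses is not shown in this conversation, and the destination name `cast_to_numeric` is an assumption.

```python
import argparse

# Hypothetical parser mirroring the shape of the `compare` command.
parser = argparse.ArgumentParser(prog="compare")
parser.add_argument("dataset")
parser.add_argument(
    "--c2n",
    dest="cast_to_numeric",
    action="append",  # each occurrence adds one column name to the list
    default=[],
    metavar="COLUMN",
    help="column to cast to numeric before comparing; repeat once per column",
)

args = parser.parse_args(
    ["nysed_nonpublicenrollment", "--c2n", "prek", "--c2n", "halfk", "--c2n", "fullk"]
)
# args.cast_to_numeric now holds the columns in the order they were given.
```

A repeatable flag with an empty default means no column is coerced unless it is named explicitly, which matches the intent of keeping the coercion deliberate.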


codecov bot commented Dec 10, 2024

Codecov Report

Attention: Patch coverage is 28.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 72.09%. Comparing base (76fbe85) to head (676653b).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
dcpy/lifecycle/scripts/validate_ingest.py | 0.00% | 11 Missing ⚠️
dcpy/data/compare.py | 0.00% | 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1313      +/-   ##
==========================================
+ Coverage   70.58%   72.09%   +1.51%     
==========================================
  Files         115      113       -2     
  Lines        5966     5935      -31     
  Branches      695      701       +6     
==========================================
+ Hits         4211     4279      +68     
+ Misses       1609     1506     -103     
- Partials      146      150       +4     


@fvankrieken fvankrieken force-pushed the fvk-ingest-facdb branch 16 times, most recently from 36134fd to 82ace91 Compare December 19, 2024 15:10
@fvankrieken fvankrieken marked this pull request as ready for review January 14, 2025 17:51
@fvankrieken

The final commit has one code change. @damonmcc, I could drop it, but it really just takes advantage of the casting to numeric for that dataset. It would rely on actually running ingest for that dataset after this is merged.

@fvankrieken fvankrieken requested a review from damonmcc January 14, 2025 17:53
@fvankrieken

Lemme give this one more good once-over before you have a look @damonmcc
