470 IngestionMediator class #593

Open
wants to merge 12 commits into base: dev

Collaborator

@jkwening commented Oct 1, 2017

Note

This is an overhaul of our ingestion process. This PR is now stable and passes all included unit tests. services.py is now the point of access for user requests related to the ingestion workflow, the weekly_update process, and ancillary features. The following modules are now obsolete:

  • load_data.py: replaced by services.py

  • functions.py: its methods have been refactored into Meta.py, a class encapsulation of meta.json

What's In This

Implemented the mediator design pattern for ingestion objects and workflow. This required:

  • creating an ingestion_mediator module: housing IngestionMediator class
  • creating a base Colleague class that is inherited by all objects that communicate with the mediator
  • refactoring LoadData, Manifest, HISql, and GetApiData to all inherit from the Colleague class and use IngestionMediator to coordinate activities.
    • these are now decoupled from one another and communicate via IngestionMediator and never directly with one another
  • refactored and encapsulated meta.json related methods into Meta class within Meta.py module
  • LoadData can now load data either from existing cleaned.psv files or by processing and cleaning raw_data.csv files.
  • added a 'dependency' field to manifest.csv that tracks dependencies across unique_data_ids; if requested, dependent unique data ids are automatically updated and reloaded. This is the default behavior for the weekly update procedure.
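
A minimal sketch of the mediator arrangement described above, not the code in this PR. The class names come from this PR, but every method name (`set_ingestion_mediator`, `get_path`, `path_for`, `load`) and the dict-backed manifest are assumptions made for illustration only.

```python
class Colleague:
    """Base class: colleagues hold a reference to the mediator, never to each other."""

    def __init__(self):
        self._mediator = None

    def set_ingestion_mediator(self, mediator):  # hypothetical method name
        self._mediator = mediator


class Manifest(Colleague):
    """Stand-in for the manifest.csv wrapper; a plain dict replaces the CSV here."""

    def __init__(self, rows):
        super().__init__()
        self._rows = rows  # {unique_data_id: file path}

    def get_path(self, unique_data_id):  # hypothetical method name
        return self._rows[unique_data_id]


class LoadData(Colleague):
    """Stand-in loader: asks the mediator, not Manifest, where the file lives."""

    def load(self, unique_data_id):  # hypothetical method name
        path = self._mediator.path_for(unique_data_id)
        return "loaded {} from {}".format(unique_data_id, path)


class IngestionMediator:
    """Coordinates the colleagues so they stay decoupled from one another."""

    def __init__(self, manifest, load_data):
        self._manifest = manifest
        self._load_data = load_data
        for colleague in (manifest, load_data):
            colleague.set_ingestion_mediator(self)

    def path_for(self, unique_data_id):  # hypothetical method name
        return self._manifest.get_path(unique_data_id)

    def load(self, unique_data_id):  # hypothetical method name
        return self._load_data.load(unique_data_id)


# Usage with a made-up data id and path
manifest = Manifest({"example_id": "data/example_clean.psv"})
mediator = IngestionMediator(manifest, LoadData())
result = mediator.load("example_id")
```

In this arrangement LoadData never imports or references Manifest, so either class can be replaced or tested on its own with a stub mediator.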

TODO

  • refactor LoadData, Manifest, and Cleaner modules to decouple them
    from each other and have IngestionMediator coordinate their activities
  • decouple the cleaner workflow from the LoadData workflow and coordinate it as a raw data processing workflow. The end goal is for LoadData to load only successfully cleaned PSV files. For prescat raw data, an S3 bucket update will trigger the cleaner process and a reload into the database.
  • implement solution and workflow for resolving the double-adding projects issue based on the
    'dependency' field in manifest (Error logging and cron jobs #470)
    • implement reverse dependency feature (low priority): for a given data id, check whether the data it depends on is loaded in the db; if not, load that first, and recursively whatever it depends on, before loading the originally requested data id
  • merge some of the services.py code into IngestionMediator methods and refactor services.py so it instantiates an IngestionMediator, with one method for regularly scheduled updates on the server and another for on-demand requests by the user (update-only or rebuild; maintain current functionality and expand if needed)
    • refactor weekly_update() in services
    • add methods for on demand requests by user in services.py
    • tie in the feature that sends the log file to the admin
  • clean up unnecessary code that has been refactored or is now obsolete
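
The dependency items above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this PR: the function names `resolve_load_order` and `dependents_of` are hypothetical, the manifest is reduced to a dict, and each 'dependency' value is assumed to hold at most one unique_data_id (or nothing).

```python
def resolve_load_order(unique_data_id, depends_on, already_loaded=()):
    """Return ids to load, prerequisites first, ending with the requested id.

    depends_on maps each id to the single id it depends on, or None
    (an assumption about how the 'dependency' field is filled in).
    """
    order = []
    visiting = set()

    def visit(data_id):
        if data_id in already_loaded or data_id in order:
            return  # already in the db or already scheduled
        if data_id in visiting:
            raise ValueError("circular dependency at {}".format(data_id))
        visiting.add(data_id)
        parent = depends_on.get(data_id)
        if parent:
            visit(parent)  # load whatever this id depends on first
        visiting.discard(data_id)
        order.append(data_id)

    visit(unique_data_id)
    return order


def dependents_of(unique_data_id, depends_on):
    """Return ids whose 'dependency' field points at unique_data_id."""
    return sorted(
        child for child, parent in depends_on.items() if parent == unique_data_id
    )


# Made-up ids: c depends on b, b and d depend on a
depends_on = {"a": None, "b": "a", "c": "b", "d": "a"}
order = resolve_load_order("c", depends_on)
reload_after_a = dependents_of("a", depends_on)
```

`resolve_load_order` covers the reverse lookup (what must be in the db before the requested id), while `dependents_of` covers the forward case used when an updated id should trigger reloading the ids that depend on it.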

Started implementing mediator design pattern for ingestion objects and
work flow. This required:
- creating an ingestion_mediator module: housing IngestionMediator class
- creating a base colleague class that will be inherited by all objects
  that will communicate with the mediator
- started modifying LoadData and Manifest classes to inherit from
  Colleague class so they can communicate with the mediator and the
  mediator with them
- added 'dependency' field to manifest.csv that will track dependencies
  across unique_data_ids so as to prevent them being reloaded without the
  presence of their respective dependent unique_data_id

TODO:
- refactor LoadData, Manifest, and Cleaner modules to decouple them
from each other and have IngestionMediator class coordinate their
activities
- implement solution for resolving double-adding projects issue based on
  'dependency' field in manifest
…coupled and moved into IngestionMediator and any other needed refactoring
…ild activity into ingestion mediator to mirror current workflow of LoadData before beginning decoupling from Manifest and then Cleaner.
…ted Meta.py and additional refactoring. Todo - finish up coordinating writing clean psv file into db
…ance variable to Colleague class, and related methods to IngestionMediator class to allow loading directly from cleaned psv file without having to reprocess and clean raw data files prior to loading to db. TODO - write unit tests to confirm code works appropriately
Completed additional decoupling refactoring for LoadData and SQLWriter
needed to separate out loading via raw data vs clean psv file. Current
code passes unit tests that verify that both methods work
successfully.

TODO:
- add remaining load data activities, specifically zone_facts table
- add activities related to get_api_data.py and incorporate dependency
  workflow
Passed unit tests - can now also load zone_facts table

TODO - implement solution for resolving double-adding projects issue by
refactoring to utilize 'dependency' field in manifest
GetApiData class was added with three methods:
- get all api files
- get files by modules
- get files by unique_data_id

It's much clearer what is going on and additionally honors the Mediator
design pattern for our code ingestion.
Can now load data and trigger loading of dependent unique data ids.
Refactoring code passed unit tests. This code is not stable enough for
PR review and push to dev branch.

TODO - add reverse dependency look up: for unique data id, check to see
if the data it is dependent on is loaded into the db. If not, load that
first and recursively whatever it is dependent on before loading the
originally requested data id.
Includes methods for on demand requests by user in services.py. This
includes a command line interface similar to load_data.py. Moving
forward, this replaces load_data.py as point of user access for
ingestion and processing related workflow.

Additionally, refactored get_api_data.py -> GetApiData.py, now encapsulated
as a class instead of a scripting module. It has been moved into the
'python/housinginsights/ingestion' file path.
@jkwening changed the title from "WIP 470 IngestionMediator class" to "470 IngestionMediator class" on Oct 25, 2017