-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor setup.py #115
Refactor setup.py #115
Changes from all commits
bc3838b
e24a814
767b621
1d6c50c
017a254
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,10 +5,11 @@ | |
/.vscode/ | ||
/htmlcov/ | ||
/logs | ||
/mails/ | ||
mails/ | ||
/media/ | ||
/reports/ | ||
/static/ | ||
venv/ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shouldn't be necessary when using pipenv There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did not know pipenv beforehand so to get the project up and running I just used the (for me well known) python module 'venv'. |
||
|
||
.coverage | ||
.python-version | ||
|
@@ -17,4 +18,5 @@ db.sqlite3 | |
docker-stack*.yml | ||
geckodriver.log | ||
hbscorez.log | ||
src/hbscorez/settings_docker_stack*.py | ||
hbscorez*.log | ||
src/hbscorez/settings_docker_stack*.py | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why remove the trailing space? |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,8 +16,12 @@ def html_dom(html_text: str) -> _Element: | |
|
||
|
||
def parse_association_urls(dom: _Element, root_url: str) -> list[str]: | ||
# Extract all association URLs | ||
items = cast(list[str], dom.xpath('//div[@id="main-content"]//table[@summary]/tbody/tr/td[1]/a/@href')) | ||
return [item if item.startswith('http') else root_url + item for item in items] | ||
|
||
# Filter and return all striped association URLs | ||
#return [item if item.startswith('http') else root_url + item for item in items] | ||
return [root_url + item.lstrip('/') for item in items if not item.startswith('https')] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. how does this improve URL parsing? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There have been some duplicate entries (abbreviations) in this list. I tried to filter out those before making the HTTP-Request. |
||
|
||
|
||
def parse_association_bhv_id_from_dom(dom: _Element) -> int: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -115,7 +115,7 @@ | |
'disable_existing_loggers': False, | ||
'formatters': { | ||
'verbose': { | ||
'format': '{levelname:8} {asctime} {module} - {message}', | ||
'format': '{levelname:8} {asctime} ({module}) [{filename}:{lineno}]: {message}', | ||
'style': '{', | ||
}, | ||
'simple': { | ||
|
@@ -129,6 +129,14 @@ | |
'class': 'logging.FileHandler', | ||
'filename': ROOT_DIR / 'hbscorez.log', | ||
'formatter': 'verbose', | ||
'encoding': 'UTF-8' | ||
}, | ||
'file': { | ||
'level': 'INFO', | ||
'class': 'logging.FileHandler', | ||
'filename': ROOT_DIR / 'hbscorez_INFO.log', | ||
'formatter': 'verbose', | ||
'encoding': 'UTF-8' | ||
Comment on lines
+134
to
+139
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll have to check if the production settings extends or overwrite the logging configuration. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The idea was to split the logging into different files, which retrospecively spoken, does not improve anything. However the encoding could be added due to some german "Umlaute" appearing in the association, district and / or team names. New PR is in the pipeline :) |
||
}, | ||
'console': { | ||
'level': 'INFO', | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,31 +10,23 @@ | |
from base.middleware import env | ||
from base.models import Value | ||
from districts.models import District | ||
from leagues.models import League, LeagueName, Season | ||
from leagues.models import League, LeagueName, Season, YouthUndecidableError | ||
from teams.models import Team | ||
|
||
LOGGER = logging.getLogger('hbscorez') | ||
|
||
logger = logging.getLogger('hbscorez') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shouldn't lowercase this module level constant, see pylint-dev/pylint#2166 |
||
|
||
def add_default_arguments(parser): | ||
parser.add_argument('--associations', '-a', nargs='+', type=int, metavar='orgGrpID', | ||
help="IDs of Associations.") | ||
parser.add_argument('--districts', '-d', nargs='+', type=int, metavar='orgID', | ||
help="IDs of Districts.") | ||
parser.add_argument('--seasons', '-s', nargs='+', type=int, metavar='start year', | ||
help="Start Years of Seasons.") | ||
parser.add_argument('--leagues', '-l', nargs='+', type=int, metavar='score/sGID', | ||
help="IDs of Leagues.") | ||
parser.add_argument('--youth', action='store_true', | ||
help="Include youth leagues.") | ||
|
||
parser.add_argument('--associations', '-a', nargs='+', type=int, metavar='orgGrpID', help="IDs of Associations.") | ||
parser.add_argument('--districts', '-d', nargs='+', type=int, metavar='orgID', help="IDs of Districts.") | ||
parser.add_argument('--seasons', '-s', nargs='+', type=int, metavar='start year', help="Start Year of Seasons.") | ||
parser.add_argument('--leagues', '-l', nargs='+', type=int, metavar='score/sGID', help="IDs of Leagues.") | ||
parser.add_argument('--youth', action='store_true', help="Include youth leagues.") | ||
|
||
class Command(BaseCommand): | ||
|
||
def add_arguments(self, parser): | ||
add_default_arguments(parser) | ||
parser.add_argument('--skip-teams', action='store_true', | ||
help="Skip processing Teams.") | ||
parser.add_argument('--skip-teams', action='store_true', help="Skip processing Teams.") | ||
|
||
def handle(self, *args, **options): | ||
options['processed_districts'] = set() | ||
|
@@ -62,30 +54,28 @@ def scrape_associations(options): | |
try: | ||
scrape_association(bhv_id, options) | ||
except Exception: | ||
LOGGER.exception("Could not create Association") | ||
logger.exception("Could not create Association") | ||
|
||
|
||
def scrape_association(bhv_id: int, options): | ||
url = Association.build_source_url(bhv_id) | ||
def scrape_association(assocication_id: int, options): | ||
url = Association.build_source_url(assocication_id) | ||
html = http.get_text(url) | ||
dom = parsing.html_dom(html) | ||
|
||
name = parsing.parse_association_name(dom) | ||
try: | ||
abbreviation = Association.get_association_abbreviation(name) | ||
except KeyError: | ||
LOGGER.warning("No abbreviation for association '%s'", name) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm fine with restructuring this exception handling to a None-check. |
||
abbreviation = Association.get_association_abbreviation(name) | ||
|
||
if abbreviation is None: | ||
return | ||
|
||
if options['associations'] and bhv_id not in options['associations']: | ||
LOGGER.debug('SKIPPING Association (options): %s %s', bhv_id, name) | ||
if options['associations'] and assocication_id not in options['associations']: | ||
logger.debug('SKIPPING Association (options): %s %s', assocication_id, name) | ||
return | ||
|
||
association, created = Association.objects.get_or_create(name=name, abbreviation=abbreviation, bhv_id=bhv_id) | ||
association, created = Association.objects.get_or_create(name=name, abbreviation=abbreviation, bhv_id=assocication_id) | ||
if created: | ||
LOGGER.info('CREATED Association: %s', association) | ||
logger.info('CREATED Association: %s', association) | ||
else: | ||
LOGGER.info('EXISTING Association: %s', association) | ||
logger.info('EXISTING Association: %s', association) | ||
|
||
|
||
def scrape_districs(association: Association, options): | ||
|
@@ -98,64 +88,64 @@ def scrape_districs(association: Association, options): | |
try: | ||
scrape_district(item, association, options) | ||
except Exception: | ||
LOGGER.exception("Could not create District") | ||
logger.exception("Could not create District") | ||
|
||
|
||
def scrape_district(district_item, association: Association, options): | ||
name = district_item.text | ||
bhv_id = int(district_item.get('value')) | ||
|
||
if options['districts'] and bhv_id not in options['districts']: | ||
LOGGER.debug('SKIPPING District (options): %s %s', bhv_id, name) | ||
logger.debug('SKIPPING District (options): %s %s', bhv_id, name) | ||
return | ||
|
||
district, created = District.objects.get_or_create(name=name, bhv_id=bhv_id) | ||
district.associations.add(association) | ||
if bhv_id in options['processed_districts']: | ||
LOGGER.debug('SKIPPING District: %s %s (already processed)', bhv_id, name) | ||
logger.debug('SKIPPING District: %s %s (already processed)', bhv_id, name) | ||
return | ||
|
||
if created: | ||
LOGGER.info('CREATED District: %s', district) | ||
logger.info('\tCREATED District: %s (in association: %s (%s))', district, association.abbreviation, association.bhv_id) | ||
else: | ||
LOGGER.info('EXISTING District: %s', district) | ||
logger.info('\tEXISTING District: %s (in association: %s (%s))', district, association.abbreviation, association.bhv_id) | ||
options['processed_districts'].add(bhv_id) | ||
|
||
for start_year in range(2004, datetime.now().year + 1): | ||
try: | ||
scrape_season(district, start_year, options) | ||
except Exception: | ||
LOGGER.exception("Could not create Season") | ||
logger.exception("Could not create Season") | ||
|
||
|
||
def scrape_season(district, start_year, options): | ||
if options['seasons'] and start_year not in options['seasons']: | ||
LOGGER.debug('SKIPPING Season (options): %s', start_year) | ||
logger.debug('SKIPPING Season (options): %s', start_year) | ||
return | ||
|
||
season, season_created = Season.objects.get_or_create(start_year=start_year) | ||
if season_created: | ||
LOGGER.info('CREATED Season: %s', season) | ||
logger.info(2*'\t' + 'CREATED Season: %s', season) | ||
else: | ||
LOGGER.info('EXISTING Season: %s', season) | ||
logger.info(2*'\t' + 'EXISTING Season: %s', season) | ||
|
||
for start_date in [date(start_year, 10, 1) + timedelta(days=10 * n) for n in range(4)]: | ||
LOGGER.debug('trying District Season: %s %s %s', district, season, start_date) | ||
logger.debug('Trying District Season: %s %s %s', district, season, start_date) | ||
url = District.build_source_url(district.bhv_id, start_date) | ||
html = http.get_text(url) | ||
dom = parsing.html_dom(html) | ||
league_links = parsing.parse_league_links(dom) | ||
if league_links: | ||
break | ||
else: | ||
LOGGER.warning('District Season without Leagues: %s %s', district, season) | ||
logger.info(3*'\t' + 'District Season without League(s)') | ||
return | ||
|
||
for league_link in league_links: | ||
try: | ||
scrape_league(league_link, district, season, options) | ||
except Exception: | ||
LOGGER.exception("Could not create League") | ||
logger.exception("Could not create League") | ||
|
||
|
||
@transaction.atomic | ||
|
@@ -164,11 +154,11 @@ def scrape_league(league_link, district, season, options): | |
bhv_id = parsing.parse_league_bhv_id(league_link) | ||
|
||
if options['leagues'] and bhv_id not in options['leagues']: | ||
LOGGER.debug('SKIPPING League (options): %s %s', bhv_id, abbreviation) | ||
logger.debug('SKIPPING League (options): %s %s', bhv_id, abbreviation) | ||
return | ||
|
||
if abbreviation == 'TEST': | ||
LOGGER.debug('SKIPPING League (test league): %s %s', bhv_id, abbreviation) | ||
logger.debug('SKIPPING League (test league): %s %s', bhv_id, abbreviation) | ||
return | ||
|
||
url = League.build_source_url(bhv_id) | ||
|
@@ -188,32 +178,37 @@ def scrape_league(league_link, district, season, options): | |
'Vorbereitung', 'F-FS', 'M-FS', 'Quali', | ||
'Freiwurf', 'Maxi', 'turnier', 'Turnier', 'Cup', 'wettbewerb', | ||
'Test', 'Planung', 'planung', | ||
'Freundschaftsspiel' | ||
] | ||
|
||
if any(n in name for n in irrelevant_league_name_indicators): | ||
LOGGER.debug('SKIPPING League (name): %s %s', bhv_id, name) | ||
logger.debug('SKIPPING League (name): %s %s', bhv_id, name) | ||
return | ||
|
||
team_links = parsing.parse_team_links(dom) | ||
if not team_links: | ||
LOGGER.debug('SKIPPING League (no team table): %s %s', bhv_id, name) | ||
logger.debug('SKIPPING League (no team table): %s %s', bhv_id, name) | ||
return | ||
|
||
game_rows = parsing.parse_game_rows(dom) | ||
if not game_rows: | ||
LOGGER.debug('SKIPPING League (no games): %s %s', bhv_id, name) | ||
logger.debug('SKIPPING League (no games): %s %s', bhv_id, name) | ||
return | ||
|
||
if League.is_youth(abbreviation, name) and not options['youth']: | ||
LOGGER.debug('SKIPPING League (youth league): %s %s %s', bhv_id, abbreviation, name) | ||
try: | ||
if League.is_youth(abbreviation, name) and not options['youth']: | ||
logger.debug('SKIPPING League (youth league): %s %s %s', bhv_id, abbreviation, name) | ||
return | ||
except YouthUndecidableError: | ||
logger.info(3*'\t' + 'Youth undecidable for league: %s %s %s', bhv_id, abbreviation, name) | ||
Comment on lines
+198
to
+203
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Which benefit do you see over the current exception handling? |
||
return | ||
|
||
league, league_created = League.objects.get_or_create( | ||
name=name, abbreviation=abbreviation, district=district, season=season, bhv_id=bhv_id) | ||
league, league_created = League.objects.get_or_create(name=name, abbreviation=abbreviation, district=district, season=season, bhv_id=bhv_id) | ||
|
||
if league_created: | ||
LOGGER.info('CREATED League: %s', league) | ||
logger.info(3*'\t' + 'CREATED League: %s', league) | ||
else: | ||
LOGGER.info('EXISTING League: %s', league) | ||
logger.info(3*'\t' + 'EXISTING League: %s', league) | ||
|
||
if options['skip_teams']: | ||
return | ||
|
@@ -222,7 +217,7 @@ def scrape_league(league_link, district, season, options): | |
scrape_team(team_link, league) | ||
|
||
retirements = parsing.parse_retirements(dom) | ||
Team.check_retirements(retirements, league, LOGGER) | ||
Team.check_retirements(retirements, league, logger) | ||
|
||
|
||
def scrape_team(link, league): | ||
|
@@ -236,4 +231,4 @@ def scrape_team(link, league): | |
short_team_names = parsing.parse_team_short_names(game_rows) | ||
short_team_name = Team.find_matching_short_name(name, short_team_names) | ||
|
||
Team.create_or_update_team(name, short_team_name, league, bhv_id, LOGGER) | ||
Team.create_or_update_team(name, short_team_name, league, bhv_id, logger) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -65,8 +65,10 @@ def is_youth(abbreviation: str, name: str) -> bool: | |
'Männer', 'Frauen', 'Herren', 'Damen', | ||
'Hären', 'Dammen', 'Senioren', 'Seniorinnen', | ||
'Hommes', 'Dames', 'Fraen', ' F-', ' M-', | ||
'Inklusion', 'Special Olympics'] | ||
adult_match_strong = any(n in name for n in adult_name_indicators) | ||
'Inklusion', 'Special Olympics', | ||
'-F-', '-M-' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you add a test for this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will do in the new PR :) |
||
] | ||
adult_match_strong = any(n in name for n in adult_name_indicators) and "M-Runde" not in name | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this change means: don't strongly match leagues as adult, whose name contains There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should be backed by a test - There was a corner case where |
||
|
||
if youth_match_strong != adult_match_strong: | ||
return youth_match_strong | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove the leading slash?
this would cause any directory named
mails
in the whole project to be ignored.including any module with that name. when actually we only need to ignore the root mail directory.