Some songs have duplicate rows (due to artist aliases?) #5

Open
colinmorris opened this issue May 26, 2020 · 1 comment
Labels
bug Something isn't working

Comments


colinmorris commented May 26, 2020

In the latest release of the dataset, there are 74 rows corresponding to Liz Phair songs. 61 of those rows are in azlyrics_lyrics_l.csv under the artist name "Liz Phair". 13 are in azlyrics_lyrics_p.csv under "Phair, Liz".

There are 11 songs which appear in both files. As far as I can tell, the lyrics, song url, and song title are identical between the two files - the only field that differs is the artist name.

I guess this is ultimately an issue of jank on the AZLyrics side, since their artist directory has separate listings for 'Liz Phair' and 'Phair, Liz' (both of which lead to the same url, https://www.azlyrics.com/p/phair.html). But it would be nice if the scraping pipeline handled deduplication.

I did a quick analysis and found 6,513 total rows with duplicate song urls.
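A count like this could be reproduced with something along the lines of the sketch below. It is only an illustration: the column name `SONG_URL` and the glob pattern `azlyrics_lyrics_*.csv` are assumptions about the dataset layout, and it counts every row whose URL occurs more than once, which may or may not match how the figure above was computed.

```python
import csv
import glob
from collections import Counter


def count_duplicate_url_rows(rows, url_field="SONG_URL"):
    """Return how many rows have a song URL that occurs more than once.

    `url_field` is an assumed column name; adjust it to the real header.
    """
    counts = Counter(row[url_field] for row in rows)
    # Every row in a group of size > 1 is counted, not just the extras.
    return sum(n for n in counts.values() if n > 1)


def read_rows(pattern):
    """Yield dict rows from every CSV file matching the glob pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)


if __name__ == "__main__":
    # Assumed file naming, based on the two files mentioned above.
    total = count_duplicate_url_rows(read_rows("azlyrics_lyrics_*.csv"))
    print(f"{total} rows share a song URL with at least one other row")
```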

AlbertSuarez (Owner) commented

Hey @colinmorris, you are right. There's a problem in AZLyrics where, as you said, multiple artist listings can lead to the same song URL, which I didn't know about and which sucks. I'm going to try to open a PR that fixes this by checking whether the song URL already exists before adding a row. Thanks!
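The kind of check described above might look roughly like the following. This is a sketch, not the project's actual code: `append_unique`, the `SONG_URL` and `ARTIST_NAME` keys, and the example URL are all hypothetical names chosen for illustration.

```python
def append_unique(rows, seen_urls, new_row, url_field="SONG_URL"):
    """Append new_row to rows unless its song URL was already recorded.

    Returns True if the row was added, False if it was skipped as a
    duplicate. `url_field` is an assumed key name.
    """
    url = new_row[url_field]
    if url in seen_urls:
        return False
    seen_urls.add(url)
    rows.append(new_row)
    return True


# Usage: the same (hypothetical) song URL scraped under two artist listings.
rows, seen = [], set()
song_url = "https://www.azlyrics.com/lyrics/example/song.html"  # placeholder
first = append_unique(rows, seen, {"ARTIST_NAME": "Liz Phair", "SONG_URL": song_url})
second = append_unique(rows, seen, {"ARTIST_NAME": "Phair, Liz", "SONG_URL": song_url})
```

With this approach the first artist name encountered wins, so which alias survives depends on scraping order. A set keeps the membership check constant-time even across all files.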

AlbertSuarez self-assigned this Oct 13, 2020
AlbertSuarez added the bug (Something isn't working) label Oct 13, 2020