Some songs have duplicate rows (due to artist aliases?) #5

colinmorris · 2020-05-26T17:52:43Z

In the latest release of the dataset, there are 74 rows corresponding to Liz Phair songs. 61 of those rows are in azlyrics_lyrics_l.csv under the artist name "Liz Phair". 13 are in azlyrics_lyrics_p.csv under "Phair, Liz".

There are 11 songs which appear in both files. As far as I can tell, the lyrics, song url, and song title are identical between the two files - the only field that differs is the artist name.

I guess this is ultimately an issue of jank on the Azlyrics side, since the site has separate listings for 'Liz Phair' and 'Phair, Liz' in its artist directory (both of which lead to the same URL, https://www.azlyrics.com/p/phair.html). But it would be nice if the scraping pipeline handled deduplication.

I did a quick analysis and found 6,513 total rows with duplicate song urls.
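For anyone who wants to reproduce this kind of count, here is a minimal sketch of one way to do it with the standard library. The column name `SONG_URL`, the helper `count_duplicate_url_rows`, and the sample rows (with example.com URLs) are assumptions made up for illustration, so adjust them to the real CSV headers. It is not necessarily the exact method behind the number above.

```python
import csv
import io
from collections import Counter


def count_duplicate_url_rows(rows, url_field="SONG_URL"):
    """Return (rows whose URL occurs more than once, distinct URLs that repeat).

    `url_field` is an assumed column name; change it to match the real header.
    """
    counts = Counter(row[url_field] for row in rows)
    dup_rows = sum(n for n in counts.values() if n > 1)
    dup_urls = sum(1 for n in counts.values() if n > 1)
    return dup_rows, dup_urls


# Made-up sample standing in for the concatenated per-letter CSV files.
sample = io.StringIO(
    "ARTIST_NAME,SONG_NAME,SONG_URL\n"
    "Liz Phair,Song A,https://example.com/a.html\n"
    '"Phair, Liz",Song A,https://example.com/a.html\n'
    "Liz Phair,Song B,https://example.com/b.html\n"
)
rows = list(csv.DictReader(sample))
print(count_duplicate_url_rows(rows))
```

To run it on the real data, the rows from all the per-letter files would need to be read into one list first, since the duplicates span files.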


AlbertSuarez · 2020-10-13T08:33:41Z

Hey @colinmorris, you are right. There's a problem on AZLyrics where, as you said, multiple artist names can lead to the same song URL, which I didn't know about and which sucks. I'm going to try to open a PR that fixes this by checking, before a row is added, whether its song URL already exists. Thanks!
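A rough sketch of what such a check could look like, assuming rows are dicts and the song URL lives under a `SONG_URL` key. Both of those, and the helper name `append_if_new`, are hypothetical and not taken from the scraper's actual code.

```python
def append_if_new(rows, seen_urls, row, url_field="SONG_URL"):
    """Append `row` to `rows` only if its song URL has not been seen yet.

    `seen_urls` is a set shared across the whole scrape; `url_field` is an
    assumed key name. Returns True if the row was added, False if skipped.
    """
    url = row[url_field]
    if url in seen_urls:
        return False  # Same song reached through another artist alias.
    seen_urls.add(url)
    rows.append(row)
    return True


# Made-up rows: the second one is the same song under an alias listing.
rows, seen_urls = [], set()
first = {"ARTIST_NAME": "Liz Phair", "SONG_URL": "https://example.com/a.html"}
alias = {"ARTIST_NAME": "Phair, Liz", "SONG_URL": "https://example.com/a.html"}
added_first = append_if_new(rows, seen_urls, first)
added_alias = append_if_new(rows, seen_urls, alias)
```

Because the duplicates show up in different per-letter files, the set of seen URLs would have to persist across files rather than be reset for each one. This keeps whichever alias is scraped first.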

AlbertSuarez self-assigned this Oct 13, 2020

AlbertSuarez added the bug label Oct 13, 2020
