Datasets and preprocessing
For this project, we collected several publicly available datasets to support our perspectives on this topic. We then preprocessed each dataset to clean and filter it, mainly by dropping unneeded columns, rows with empty values, and invalid data. Some datasets were also transformed afterwards to make later use easier.
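import pandas as pd

def clean(labels: list[str], filename: str, index=0) -> None:
    # Read the raw CSV; `index` selects the index column (False means none).
    file = pd.read_csv('../origin/' + filename, index_col=index, low_memory=False)
    # Drop the columns we do not need.
    file.drop(labels=labels, axis=1, inplace=True)
    # Drop every row that contains at least one missing value.
    file.dropna(axis=0, how='any', inplace=True)
    # Save the cleaned file and print a preview of the first rows.
    file.to_csv('../cleaned/' + filename)
    print(file.head())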
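To illustrate what the two drop steps in `clean` do, here is a minimal sketch on a made-up table. The column names and values are hypothetical and only stand in for a real CSV file. It also shows why the columns are dropped before the rows: a missing value in an unneeded column would otherwise remove a row that is perfectly usable.

```python
import pandas as pd

# Toy data standing in for one of the downloaded CSV files (values are made up).
raw = pd.DataFrame(
    {
        "artist": ["A", "B", "C"],
        "streams": [100.0, None, 300.0],
        "cover_url": [None, "y", "z"],
    }
)

# Same order as in `clean`: first drop the unneeded column, then the incomplete rows.
cleaned = raw.drop(labels=["cover_url"], axis=1).dropna(axis=0, how="any")

# Reversed order for comparison: row "A" is lost because of its missing cover_url.
reversed_order = raw.dropna(axis=0, how="any").drop(labels=["cover_url"], axis=1)

print(cleaned)
print(reversed_order)
```

With this toy data, `cleaned` keeps artists A and C, while `reversed_order` keeps only C.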
Datasets used
Spotify Weekly Top 200 Songs Streaming Data
Source: https://www.kaggle.com/datasets/yelexa/spotify200
This large dataset records every song that appeared in Spotify's weekly top 200. It contains the artist, the song name, the album the song comes from, the number of streams, and many other values. This is useful because it lets us see how many streams each artist and each song has in total.
Actions performed on the dataset
clean(['uri', 'artists_num', 'artist_id', 'artist_genre', 'artist_img', 'collab', 'album_cover', 'source', 'album_num_tracks', 'acousticness', 'instrumentalness', 'danceability', 'duration', 'energy', 'liveness', 'loudness', 'mode', 'speechiness', 'tempo', 'valence', 'key', 'pivot', 'country', 'region', 'language', 'release_date'], 'final.csv')
rank artist_names artist_individual track_name peak_rank \
0 1 Paulo Londra Paulo Londra Plan A 1
1 2 WOS WOS ARRANCARMELO 2
2 3 Paulo Londra Paulo Londra Chance 3
3 5 Cris Mj Cris Mj Una Noche en Medellín 5
4 6 Emilia Emilia cuatro veinte 6
previous_rank weeks_on_chart streams week
0 1 4 3003411 2022-04-14
1 129 2 2512175 2022-04-14
2 59 2 2408983 2022-04-14
3 5 8 2080139 2022-04-14
4 9 3 1923270 2022-04-14
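One way to obtain the per-artist and per-song stream totals mentioned above is a grouped sum. The sketch below is an assumption about how this could be done, not necessarily the method used in this project. The three rows are copied from the preview above, and the variable names are hypothetical.

```python
import pandas as pd

# Three rows copied from the preview above; the real file has many more.
sample = pd.DataFrame(
    {
        "artist_individual": ["Paulo Londra", "WOS", "Paulo Londra"],
        "track_name": ["Plan A", "ARRANCARMELO", "Chance"],
        "streams": [3003411, 2512175, 2408983],
    }
)

# Total streams per artist, largest first.
streams_per_artist = (
    sample.groupby("artist_individual")["streams"].sum().sort_values(ascending=False)
)

# Total streams per song; grouping on the artist as well keeps equal titles apart.
streams_per_track = sample.groupby(["artist_individual", "track_name"])["streams"].sum()

print(streams_per_artist)
print(streams_per_track)
```

On the full cleaned file, the same two `groupby` calls would sum over every week in which a song charted.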
RIAA Artists By Certified Album Units Sold
Source: https://www.kaggle.com/datasets/darinhawley/riaa-artists-by-certified-units-sold
This dataset contains records of artists and how many certified album units they have sold, along with how many of their albums are certified gold, platinum, multi-platinum, and diamond. It originates from the RIAA's database and was originally made for a quiz. We can use it to see how many certified units each artist has sold and how many platinum records they have. From this we can see which artists are more successful according to the RIAA.
Actions performed on the dataset
clean(['Artist ID'], 'riaakaggle.csv', False)
Artist Certified Units Gold Platinum Multi-Platinum Diamond
0 THE BEATLES 183.0 48 42 26 6
1 GARTH BROOKS 157.0 31 31 17 9
2 ELVIS PRESLEY 139.0 101 57 25 1
3 EAGLES 120.0 13 13 11 3
4 LED ZEPPELIN 111.5 19 18 14 5
Data on Songs from Billboard 1999-2019
Source: https://www.kaggle.com/datasets/danield2255/data-on-songs-from-billboard-19992019
This dataset was originally made for a research project at a university. It contains multiple files about songs, artists, and albums that appeared on Billboard. These records contain information such as the number of awards won, RIAA certifications received, rankings on Spotify, rankings on Billboard, number of streams, song information, and album information. The data points are spread over multiple files to keep them organized.
Actions performed on the dataset
clean(['Genres', 'Group.Solo', 'YearFirstAlbum'], 'artistDf.csv')
Artist Followers NumAlbums Gender
X
0 Ed Sheeran 52698756 8 M
1 Justin Bieber 30711450 10 M
2 Jonas Brothers 3069527 10 M
3 Drake 41420478 11 M
4 Chris Brown 9676862 6 M
clean(['Lyrics', 'Genre', 'Writing.Credits', 'Features'], 'billboardHot100_1999-2019.csv')
Artists Name Weekly.rank Peak.position \
1 Lil Nas, Old Town Road 1 1.0
3 Billie Eilish Bad Guy 3 2.0
4 Khalid Talk 4 3.0
5 Ed Sheeran, Justin Bieber I Don't Care 5 2.0
6 Jonas Brothers Sucker 6 1.0
Weeks.on.chart Week Date
1 7.0 2019-07-06 April 5, 2019
3 13.0 2019-07-06 March 29, 2019
4 20.0 2019-07-06 February 7, 2019
5 7.0 2019-07-06 May 10, 2019
6 17.0 2019-07-06 March 1, 2019
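The preview above shows two date formats in the same file: `Week` uses `2019-07-06`, while `Date` uses `April 5, 2019`. A possible follow-up transformation is to parse both columns into real datetimes so they can be compared and sorted. This is a sketch under that assumption and is not a step confirmed by this project. The values are copied from the preview, and the format strings assume English month names.

```python
import pandas as pd

# Two rows copied from the preview above.
chart = pd.DataFrame(
    {
        "Week": ["2019-07-06", "2019-07-06"],
        "Date": ["April 5, 2019", "March 29, 2019"],
    }
)

# Parse each column with an explicit format so a malformed value raises an error
# instead of being silently misread.
chart["Week"] = pd.to_datetime(chart["Week"], format="%Y-%m-%d")
chart["Date"] = pd.to_datetime(chart["Date"], format="%B %d, %Y")

print(chart.dtypes)
print(chart)
```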
clean(['Genre', 'Album', 'GrammyYear'], 'grammyAlbums_199-2019.csv')
Award Artist
0 Album Of The Year Kacey Musgraves
1 Best Traditional Pop Vocal Album Willie Nelson
2 Best Pop Vocal Album Ariana Grande
3 Best Dance/Electronic Album Justice
4 Best Contemporary Instrumental Album Steve Gadd Band
clean(['X', 'GrammyYear', 'Genre', 'Name'], 'grammySongs_1999-2019.csv')
GrammyAward \
1 Record Of The Year
2 Song Of The Year
3 Best Pop Solo Performance
4 Best Pop Duo/Group Performance
5 Best Dance Recording
Artist
1 Childish Gambino
2 Childish Gambino
3 Lady Gaga
4 Lady Gaga & Bradley Cooper
5 Silk City & Dua Lipa Featuring Diplo & Mark Ro...
clean(['Label'], 'riaaAlbumCerts_1999-2019.csv')
Album Artist Status
0 GREATEST HITS MARIAH CAREY 2x Multi-Platinum
1 THE REMIXES MARIAH CAREY Gold
2 VIEWS DRAKE 6x Multi-Platinum
3 MAJOR KEY DJ KHALED 1x Platinum
4 THE CHRISTMAS SESSIONS MERCYME Gold
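The `Status` column mixes a multiplier and a certification tier in one string, for example `6x Multi-Platinum` or plain `Gold`. If later analysis needs these separately, one option is to split them with a regular expression. The sketch below is an illustration only: the column names `multiplier` and `tier` are hypothetical, and treating a missing multiplier as 1 is an assumption.

```python
import pandas as pd

# Status values copied from the preview above.
status = pd.Series(
    ["2x Multi-Platinum", "Gold", "6x Multi-Platinum", "1x Platinum", "Gold"],
    name="Status",
)

# Optional "<number>x " prefix, followed by the tier name.
parts = status.str.extract(r"^(?:(?P<multiplier>\d+)x\s+)?(?P<tier>.+)$")

# Assumption: a status without an explicit multiplier counts once.
parts["multiplier"] = parts["multiplier"].fillna("1").astype(int)

print(parts)
```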
clean(['Name', 'Label'], 'riaaSingleCerts_1999-2019.csv')
Artist RiaaStatus
X
0 Dj Khaled Gold
1 Ciara 1x Platinum
2 Daddy Yankee 11x Diamond
3 Billie Eilish 1x Platinum
4 Jennifer Lopez 6x Multi-Platinum
clean(['Acousticness', 'Album', 'Danceability', 'Duration', 'Energy', 'Explicit', 'Instrumentalness', 'Liveness', 'Loudness', 'Mode', 'Speechiness', 'Tempo', 'TimeSignature', 'Valence'], 'songAttributes_1999-2019.csv')
Artist Name Popularity
0 Collective Soul Welcome All Again 35
1 Collective Soul Fuzzy 31
2 Collective Soul Dig 30
3 Collective Soul You 35
4 Collective Soul My Days 21
clean(['Features'], 'spotifyWeeklyTop200Streams.csv')
Name Artist Streams Week
0 In My Feelings Drake 30747676 2018-07-20
1 Lucid Dreams Juice WRLD 12930705 2018-07-20
2 Nonstop Drake 12312859 2018-07-20
3 God is a woman Ariana Grande 10771324 2018-07-20
4 SAD! XXXTENTACION 10503061 2018-07-20