Datasets and preprocessing

For this project we collected several publicly available datasets to support our perspectives on the topic. Each dataset was then preprocessed: we dropped the columns we did not need and removed rows with empty or invalid values. Some datasets also received further transformations to make later use easier. The helper below performs the column and row removal for a single CSV file.

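import pandas as pd

def clean(labels: list[str], filename: str, index=0) -> None:
    """Drop the given columns and every incomplete row, then save the result."""
    # index is passed to index_col: a column position, or False for no index column.
    df = pd.read_csv('../origin/' + filename, index_col=index, low_memory=False)
    df = df.drop(columns=labels)       # remove the unneeded columns
    df = df.dropna(axis=0, how='any')  # remove rows with any missing value
    df.to_csv('../cleaned/' + filename)
    print(df.head())                   # preview the cleaned data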
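To make the effect of the two cleaning steps concrete, here is a minimal sketch that applies them to a small in-memory table instead of a file. The helper name `clean_frame` and the toy rows are hypothetical and exist only for illustration; they are not part of the project code or of any dataset.

```python
import pandas as pd

def clean_frame(df: pd.DataFrame, labels: list[str]) -> pd.DataFrame:
    """Hypothetical in-memory variant of clean(): same two steps, no file I/O."""
    return df.drop(columns=labels).dropna(axis=0, how='any')

# Made-up rows for illustration only; not taken from any of the datasets.
toy = pd.DataFrame({
    'artist': ['A', 'B', 'C'],
    'streams': [100.0, None, 300.0],  # the second row has a missing value
    'uri': ['x', 'y', 'z'],           # stands in for an unneeded column
})

cleaned = clean_frame(toy, ['uri'])
print(cleaned)
```

The `uri` column is gone and the row with the missing `streams` value is removed, so only the rows for `A` and `C` remain.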

Datasets used

Spotify Weekly Top 200 Songs Streaming Data

Source: https://www.kaggle.com/datasets/yelexa/spotify200

This large dataset records every song that appeared in Spotify's weekly top 200. For each entry it lists the artist, the song name, the album the song comes from, the number of streams, and many other values. This is useful because it lets us compute the total number of streams for each artist and for each song.

Actions performed on the dataset

clean(['uri', 'artists_num', 'artist_id', 'artist_genre', 'artist_img', 'collab', 'album_cover', 'source', 'album_num_tracks', 'acousticness', 'instrumentalness', 'danceability', 'duration', 'energy', 'liveness', 'loudness', 'mode', 'speechiness', 'tempo', 'valence', 'key', 'pivot', 'country', 'region', 'language', 'release_date'], 'final.csv')
  rank  artist_names artist_individual             track_name peak_rank  \
0    1  Paulo Londra      Paulo Londra                 Plan A         1   
1    2           WOS               WOS           ARRANCARMELO         2   
2    3  Paulo Londra      Paulo Londra                 Chance         3   
3    5       Cris Mj           Cris Mj  Una Noche en Medellín         5   
4    6        Emilia            Emilia          cuatro veinte         6   

  previous_rank weeks_on_chart  streams        week  
0             1              4  3003411  2022-04-14  
1           129              2  2512175  2022-04-14  
2            59              2  2408983  2022-04-14  
3             5              8  2080139  2022-04-14  
4             9              3  1923270  2022-04-14  
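The totals per artist and per song mentioned in the description could be computed with a `groupby`, as in the sketch below. It uses the first three rows of the preview above, re-typed by hand, and the column names `artist_individual`, `track_name`, and `streams` from the cleaned file. Grouping songs by artist and track name together is our assumption: if the file holds one row per individual artist for collaborations, summing by track name alone would count such songs more than once.

```python
import pandas as pd

# First three rows of the cleaned preview above, re-typed by hand.
rows = pd.DataFrame({
    'artist_individual': ['Paulo Londra', 'WOS', 'Paulo Londra'],
    'track_name': ['Plan A', 'ARRANCARMELO', 'Chance'],
    'streams': [3003411, 2512175, 2408983],
})

# Total streams per artist, largest first.
per_artist = (
    rows.groupby('artist_individual')['streams']
    .sum()
    .sort_values(ascending=False)
)

# Total streams per song, keyed by (artist, track) to keep same-titled songs apart.
per_song = rows.groupby(['artist_individual', 'track_name'])['streams'].sum()

print(per_artist)
print(per_song)
```

On the full file the same two calls would add up the streams over all weeks in which an artist or song charted.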

RIAA Artists By Certified Album Units Sold

Source: https://www.kaggle.com/datasets/darinhawley/riaa-artists-by-certified-units-sold

This dataset records how many certified album units each artist has sold and how many of their albums are certified gold, platinum, multi-platinum, or diamond. It originates from the RIAA’s database and was originally compiled for a quiz. We can use it to see the certified sales of each artist and how many platinum records they have, which lets us compare how successful artists are by the RIAA’s measure.

Actions performed on the dataset

clean(['Artist ID'], 'riaakaggle.csv', False)
          Artist  Certified Units  Gold  Platinum  Multi-Platinum  Diamond
0    THE BEATLES            183.0    48        42              26        6
1   GARTH BROOKS            157.0    31        31              17        9
2  ELVIS PRESLEY            139.0   101        57              25        1
3         EAGLES            120.0    13        13              11        3
4   LED ZEPPELIN            111.5    19        18              14        5
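Which artist counts as most successful depends on the column used for the comparison. The sketch below ranks the five artists from the preview above, re-typed by hand, once by `Certified Units` and once by `Platinum`. Picking these two columns as the success measures is our assumption, not something the dataset prescribes.

```python
import pandas as pd

# The five preview rows above, re-typed by hand (only the columns used here).
riaa = pd.DataFrame({
    'Artist': ['THE BEATLES', 'GARTH BROOKS', 'ELVIS PRESLEY', 'EAGLES', 'LED ZEPPELIN'],
    'Certified Units': [183.0, 157.0, 139.0, 120.0, 111.5],
    'Platinum': [42, 31, 57, 13, 18],
})

# Rank the artists by each measure, best first.
by_units = riaa.sort_values('Certified Units', ascending=False)['Artist'].tolist()
by_platinum = riaa.sort_values('Platinum', ascending=False)['Artist'].tolist()

print(by_units)
print(by_platinum)
```

Among these five rows, THE BEATLES lead on certified units while ELVIS PRESLEY leads on platinum certifications, so the two measures do not always agree.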

Data on Songs from Billboard 1999-2019

Source: https://www.kaggle.com/datasets/danield2255/data-on-songs-from-billboard-19992019

This dataset was originally made for a research project at a university. It contains multiple files about songs, artists, and albums that appeared on Billboard. The records include the number of awards won, RIAA certifications received, rankings on Spotify and Billboard, the number of streams, song information, and album information. The data is split across multiple files to keep it organized.

Actions performed on the dataset

clean(['Genres', 'Group.Solo', 'YearFirstAlbum'], 'artistDf.csv')
           Artist  Followers  NumAlbums Gender
X                                             
0      Ed Sheeran   52698756          8      M
1   Justin Bieber   30711450         10      M
2  Jonas Brothers    3069527         10      M
3           Drake   41420478         11      M
4     Chris Brown    9676862          6      M
clean(['Lyrics', 'Genre', 'Writing.Credits', 'Features'], 'billboardHot100_1999-2019.csv')
                     Artists           Name  Weekly.rank  Peak.position  \
1                   Lil Nas,  Old Town Road            1            1.0   
3              Billie Eilish        Bad Guy            3            2.0   
4                     Khalid           Talk            4            3.0   
5  Ed Sheeran, Justin Bieber   I Don't Care            5            2.0   
6             Jonas Brothers         Sucker            6            1.0   

   Weeks.on.chart        Week              Date  
1             7.0  2019-07-06     April 5, 2019  
3            13.0  2019-07-06    March 29, 2019  
4            20.0  2019-07-06  February 7, 2019  
5             7.0  2019-07-06      May 10, 2019  
6            17.0  2019-07-06     March 1, 2019  
clean(['Genre', 'Album', 'GrammyYear'], 'grammyAlbums_199-2019.csv')
                                  Award           Artist
0                     Album Of The Year  Kacey Musgraves
1      Best Traditional Pop Vocal Album    Willie Nelson
2                  Best Pop Vocal Album    Ariana Grande
3           Best Dance/Electronic Album          Justice
4  Best Contemporary Instrumental Album  Steve Gadd Band
clean(['X', 'GrammyYear', 'Genre', 'Name'], 'grammySongs_1999-2019.csv')
                      GrammyAward  \
1              Record Of The Year   
2                Song Of The Year   
3       Best Pop Solo Performance   
4  Best Pop Duo/Group Performance   
5            Best Dance Recording   

                                              Artist  
1                                   Childish Gambino  
2                                   Childish Gambino  
3                                          Lady Gaga  
4                         Lady Gaga & Bradley Cooper  
5  Silk City & Dua Lipa Featuring Diplo & Mark Ro...  
clean(['Label'], 'riaaAlbumCerts_1999-2019.csv')
                     Album        Artist             Status
0            GREATEST HITS  MARIAH CAREY  2x Multi-Platinum
1              THE REMIXES  MARIAH CAREY               Gold
2                    VIEWS         DRAKE  6x Multi-Platinum
3                MAJOR KEY     DJ KHALED        1x Platinum
4   THE CHRISTMAS SESSIONS       MERCYME               Gold
clean(['Name', 'Label'], 'riaaSingleCerts_1999-2019.csv')
           Artist         RiaaStatus
X                                   
0       Dj Khaled               Gold
1           Ciara        1x Platinum
2    Daddy Yankee        11x Diamond
3   Billie Eilish        1x Platinum
4  Jennifer Lopez  6x Multi-Platinum
clean(['Acousticness', 'Album', 'Danceability', 'Duration', 'Energy', 'Explicit', 'Instrumentalness', 'Liveness', 'Loudness', 'Mode', 'Speechiness', 'Tempo', 'TimeSignature', 'Valence'], 'songAttributes_1999-2019.csv')
            Artist               Name  Popularity
0  Collective Soul  Welcome All Again          35
1  Collective Soul              Fuzzy          31
2  Collective Soul                Dig          30
3  Collective Soul                You          35
4  Collective Soul            My Days          21
clean(['Features'], 'spotifyWeeklyTop200Streams.csv')
             Name         Artist   Streams        Week
0  In My Feelings          Drake  30747676  2018-07-20
1    Lucid Dreams     Juice WRLD  12930705  2018-07-20
2         Nonstop          Drake  12312859  2018-07-20
3  God is a woman  Ariana Grande  10771324  2018-07-20
4            SAD!   XXXTENTACION  10503061  2018-07-20
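Because the Billboard data is spread over several files, records from different files have to be matched on the artist name before they can be combined. The previews above show that the spelling is not uniform: riaaAlbumCerts_1999-2019.csv writes names in upper case (`DRAKE`), while artistDf.csv uses mixed case (`Drake`). The sketch below shows one possible way to join the two, using rows re-typed by hand from the previews. The `artist_key` column and the choice of a left join are our assumptions for illustration, not a step taken from the project code.

```python
import pandas as pd

# Rows re-typed by hand from the previews of artistDf.csv and
# riaaAlbumCerts_1999-2019.csv above.
artists = pd.DataFrame({
    'Artist': ['Ed Sheeran', 'Drake'],
    'Followers': [52698756, 41420478],
})
album_certs = pd.DataFrame({
    'Album': ['VIEWS', 'MAJOR KEY'],
    'Artist': ['DRAKE', 'DJ KHALED'],
    'Status': ['6x Multi-Platinum', '1x Platinum'],
})

# Hypothetical join key: the artist name, stripped and upper-cased.
artists['artist_key'] = artists['Artist'].str.strip().str.upper()
album_certs['artist_key'] = album_certs['Artist'].str.strip().str.upper()

# A left join keeps every certification, even without a matching artist row.
merged = album_certs.merge(
    artists[['artist_key', 'Followers']], on='artist_key', how='left'
)
print(merged[['Album', 'Artist', 'Status', 'Followers']])
```

Certifications whose artist has no counterpart in the artist table keep an empty `Followers` value, so unmatched names stay visible instead of silently disappearing.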