Summer Olympics History
This project explores connections between Summer Olympic data by country and corresponding characteristics for each country. These characteristics were derived from additional datasets that included data such as population, land mass area, and national pride for each country. The intention of this project was to visualize the distribution of Summer Olympic medals using Tableau, in relation with population size and general sentiments of patriotism surveyed within each country.
Data Collection
Data for this project came from various sources, as indicated below.
summer.csv — A record of all the Summer Olympic medalists from 1896 to 2012. Additional attributes include sport, event, and medal type (e.g. gold). Data was downloaded as a .csv from Kaggle.
landmass.csv — Provides both the total area of each country in both kilometers and miles, as well as the land area and percentage of the world’s landmass. Used convertcsv.com to convert data from the original website to a .csv file.
patriotism_by_country.csv — Shows percentage of each country’s population that believes their country is the best, which comes from survey data. Also includes the total population of each country for 2022. Data was downloaded as a .csv from World Population Review.
countrynames.csv — Provides alternative names for countries that could be used in other files to enable mapping from one data source to another. Data was downloaded from class website, provided by OpenGeoCode.
patriotism_by_country.csv — Provides IOC country codes for countries that compete in Olympic Games, which was added to the summer.csv file. Data was downloaded from Github.
Data Modeling
Once I collected the data I brainstormed how information within these varying datasets would be utilized within a relational database, as illustrated below.
Figure 1. Initial workflow model.
Figure 2. Entity Relationship Diagram and Relational Vocabulary.
Data Wrangling
After drafting and revising the data model I began the data wrangling process by creating the olympic_country_data database as well as tables for countries and country names as there were various data sources that needed to be combined before bringing in additional data for athletes, events, etc. In this phase I used Python and PostgreSQL to read and clean the datasets, such as updating the format for the landmass data (e.g. total area) into integers and adding alternative names for various countries (as depicted below).
Figure 3. Excerpts of Python code used to insert and clean data for countries’ landmass and names.
Once data was cleaned and added to its appropriate tables I established connections utilizing primary keys and foreign keys based on the data model included above (e.g. inserting country_id into the athletes table). Following this, I queried the database based on inquiries such as, the relationship between a country’s landmass and their population size and the number of Olympic medals a country possesses versus their pride percentage. Examples of these are included below.
Figure 4. Excerpts of Python code used to establish relationships and query tables.
Data Visualization
Once the queries were exported as .csv files they were then imported into Tableau and joined based on shared country names. The map and graph depicted below are examples of data analysis conducted within Tableau, seeking to answer questions like, “What’s the correlation between national pride and the number of Olympic medalist from a country?”, and “What’s the composition of countries based on their population versus their landmass?”.
Figure 5. Examples of data visualization created in Tableau.
Project Learnings
Having to move from a data model to a full blown relational database was a new and challenging experience!
Pulling from different data sources provided new opportunities to tackle real-world data cleaning issues.
It was helpful to have a central topic to defer to when creating and querying the database.
I spent most of my time focused on the data wrangling portion, so I would have liked to devote more time to the data visualizations.