Dataset [6]: Global Temperature Anomaly
The first dataset we examined is the temperature anomaly dataset, which contains global, hemispheric, continental, and coordinate temperature anomalies. The dataset type is a table, and it is downloadable in .csv, .xml, and .json formats. The data is provided by Climate at a Glance and is collected by NOAA's Global Surface Temperature Analysis. The table contains two quantitative attributes: the date of the recorded temperature anomaly in YYYYMM format (191001 for Jan. 1910, 191002 for Feb. 1910, etc.), and the temperature anomaly itself, measured in Celsius as a deviation from a mean temperature that varies depending on the region. [6] Global and hemispheric anomalies are recorded relative to the 1901-2000 average, coordinate anomalies relative to the 1991-2020 average, and all other regional anomalies relative to the 1910-2000 average.
An apparent tradeoff for information about more specific areas is that the period of recorded data is much shorter. While the global and regional data each span more than 100 years, the coordinate data covers only about thirty years, likely because the necessary measurement technology was developed relatively recently. Another limitation of this dataset is that it contains only two attributes, the date and the temperature anomaly. We therefore needed to add an extra categorical attribute identifying the region each record pertains to so that the data could be read by visualization software. Some extra cleaning was also required, including removing the extra header information at the top of each file, as sketched below.
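As an illustration, this cleaning step can be done with pandas. The following is a minimal sketch under assumed conditions: the file names, region labels, and the number of header rows to skip are hypothetical and may differ from the actual files.

```python
import pandas as pd

# Hypothetical file names and region labels; the real files and
# header lengths may differ from these assumptions.
files = {
    "Global": "anomaly_global.csv",
    "Northern Hemisphere": "anomaly_nhem.csv",
}

frames = []
for region, path in files.items():
    # Skip the descriptive header lines above the actual data.
    df = pd.read_csv(path, skiprows=4, names=["Date", "Anomaly"])
    df["Region"] = region  # the added categorical attribute
    # Parse the YYYYMM integer (e.g., 191001) into a proper date.
    df["Date"] = pd.to_datetime(df["Date"].astype(str), format="%Y%m")
    frames.append(df)

pd.concat(frames, ignore_index=True).to_csv("anomalies_cleaned.csv",
                                            index=False)
```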
Dataset [7]: Carbon Dioxide Emissions
The second dataset that we plan to use is the carbon dioxide emissions dataset, which contains data by country. The dataset type is a table, and it is downloadable in the .xlsx format. This dataset includes ten different attributes; however, we only used four, as the other attributes were components of the attributes we chose to use instead. We plan to use one categorical attribute and three quantitative attributes. The categorical attribute is country, which corresponds to the name of the country, and the quantitative attributes are year, which is the year the data was recorded in YYYY format, and two carbon dioxide emission attributes. [7] Both carbon dioxide attributes are quantitative and measured in thousand metric tons: “Total CO2 emissions from fossil-fuels and cement production”, which represents the annual total emissions of carbon dioxide, and “Per capita CO2 emissions”, which represents the total carbon dioxide emissions divided by the population of the country at that time.
While this dataset contains information as early as the 18th century for some countries, others do not go nearly as far back. For example, the United Kingdom has data from 1751-2020, while countries such as Finland only have data from 1860-2020. This means that while creating visualizations from this dataset, we needed to take into consideration the lack of data for certain countries during specific time periods and encode those gaps appropriately; one way to surface the gaps is sketched below.
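As a sketch of how such coverage gaps can be made explicit with pandas, assuming hypothetical file and column names (Country, Year, and Total):

```python
import pandas as pd

# Hypothetical file and column names for the emissions table.
emissions = pd.read_excel("co2_emissions.xlsx")

# Reshape to countries x years; years with no record become NaN,
# which makes each country's coverage gap explicit.
coverage = emissions.pivot_table(index="Country", columns="Year",
                                 values="Total")

# First and last year with data for each country.
first_year = coverage.apply(lambda row: row.first_valid_index(), axis=1)
last_year = coverage.apply(lambda row: row.last_valid_index(), axis=1)
print(pd.DataFrame({"first": first_year, "last": last_year}))
```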
Dataset [8]: Sea Level Rise
The third dataset that we plan to use is the sea level rise dataset, which contains global and regional data. The dataset type is a table, and it is downloadable in .csv, .cn, .pdf, and .png formats; however, only the .csv format is useful for our purposes. [8] The altimetry data in this dataset is provided by the NOAA Laboratory for Satellite Altimetry and was collected by four main satellite missions: TOPEX/Poseidon measured sea level rise from 1992-2006, Jason-1 from 2002-2013, Jason-2 from 2008-2017, and Jason-3 from 2016-2023. There are two different collection sources, one containing data from TOPEX, Jason-1, Jason-2, and Jason-3, and one containing data from several altimeters; both are split based on whether or not seasonal signals are removed. We decided to use the TOPEX and Jason dataset with the seasonal signals removed, as the natural rise of sea level due to seasonal heat is not our main focus, and removing that signal allows us to focus on potential external causes of sea level rise. Our chosen dataset contains five attributes, all of which are quantitative. The year is a quantitative attribute that gives the time of the measurement, represented as a decimal year rather than a date. Although their names are strings, the TOPEX, Jason-1, Jason-2, and Jason-3 columns are quantitative attributes: each records the measured change in sea level in millimeters and indicates which satellite took that particular measurement.
The main restriction of this dataset is its relatively short data collection period. Compared to our other datasets, which contain information spanning hundreds of years, this one only has data for the past 30 years. That restriction made it hard to use in conjunction with our other datasets, so we opted to use it by itself to prevent it from limiting the time span of the others. As with dataset [6], we also needed to add an extra column identifying the region each record pertains to. Since there are twenty regional datasets and the global dataset lacks individual regions, we also needed to aggregate all twenty regional datasets into one large dataset, as sketched below.
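A minimal sketch of this aggregation with pandas, assuming the regional files follow a hypothetical naming pattern such as sea_level_<region>.csv:

```python
import glob
import pandas as pd

frames = []
# Hypothetical naming pattern; the real file names may differ.
for path in sorted(glob.glob("sea_level_*.csv")):
    region = path[len("sea_level_"):-len(".csv")]
    df = pd.read_csv(path)
    df["Region"] = region  # the added column identifying the region
    frames.append(df)

# One large table holding all twenty regional datasets.
sea_levels = pd.concat(frames, ignore_index=True)
sea_levels.to_csv("sea_level_all_regions.csv", index=False)
```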
Data Transformation
Data transformation was performed for task 2, presenting sea level rise data and identifying outliers, by selecting the minimum and maximum sea level records for each year and computing the average value of each year. This created a new table-type transformed dataset with the attributes of minimum, maximum, and average sea level for each year. This transformation was necessary to effectively visualize a large amount of data and pick out meaningful values. The minimum is the lowest sea level of the year, the maximum is the highest sea level, and the average is the mean of all recorded values for the year. However, since numerous factors play a role in sea level differences, our purpose is to see trends over time rather than the specific values themselves. A sketch of the transformation follows.
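A minimal pandas sketch of this transformation, assuming the aggregated table from above with hypothetical Year and SeaLevel column names, where the decimal year is truncated to a calendar year:

```python
import pandas as pd

# Hypothetical file and column names from the aggregation step.
sea_levels = pd.read_csv("sea_level_all_regions.csv")

# Truncate the decimal year (e.g., 1993.54) to a calendar year.
sea_levels["Year"] = sea_levels["Year"].astype(int)

# Minimum, maximum, and mean sea level change (mm) for each year.
yearly = (sea_levels.groupby("Year")["SeaLevel"]
          .agg(minimum="min", maximum="max", average="mean")
          .reset_index())

yearly.to_csv("sea_level_yearly_summary.csv", index=False)
```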
Visualization Tools & Programming Languages
We used Python 3.8.2 [9] as our programming environment, with the NumPy [10] and Pandas [11] libraries for numerical operations, data manipulation, and analysis of the datasets. Matplotlib [12], together with its modules matplotlib.pyplot and matplotlib.colors, is used for basic plotting of maps and animations with some basic interactivity. For developing interactive buttons to control the animations we used ipywidgets [13]. For the color scale of the created animation, we normalized the values onto a color scale using Matplotlib's ScalarMappable class [14]. For creating interactive choropleth maps of CO2 emissions we used the Plotly Express module [15], and for scaling and manipulating the generated images in various formats we used Pillow [16].
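For illustration, a minimal Plotly Express sketch of such a choropleth, assuming hypothetical file and column names (Country, Year, and Total) in the cleaned emissions table:

```python
import pandas as pd
import plotly.express as px

# Hypothetical file and column names for the cleaned emissions data.
emissions = pd.read_csv("co2_emissions_cleaned.csv")

# Animated choropleth: one frame per year, colored by total emissions.
fig = px.choropleth(
    emissions,
    locations="Country",           # match countries by name
    locationmode="country names",
    color="Total",                 # thousand metric tons of CO2
    animation_frame="Year",
    color_continuous_scale="Reds",
)
fig.show()
```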
Software Architecture
We initially used Google Colab [17], but faced difficulties uploading and using the datasets from Google Drive [18], so we later moved to working with Jupyter notebooks [19] and used pip [20] as the installer for downloading the necessary libraries. Tableau [21] was used to create some of the initial artifacts, and we later made use of Jupyter notebooks and other Python packages to improve the visualizations. All the data has been stored locally in .csv, .json, and .xlsx formats for easy retrieval and manipulation of table-based data. The data transformations have been done using pandas in Python, retaining copies of all the intermediate files generated.