Data analysis with Pandas

Learning objectives

After this chapter you will be able to:

  • make a dataframe based on an existing list or dictionary
  • make a series
  • explore data in a dataframe
  • convert data in a dataframe (string operations, datatype conversion)
  • change column names in a dataframe
  • remove columns from a dataframe
  • add columns to a dataframe
  • filter data on columns, rows, conditions, indexes and slicing
  • perform a statistical analysis on data in a dataframe, with or without grouping
  • read data from a csv file, an Excel file, a json file or an sqlite3 database into a dataframe
  • write data from a dataframe to a csv file, an Excel file or a json file
  • visualize data from a dataframe in a line plot, a scatter plot, a bar plot or a pie plot.

Sources

The official pandas site

What is Pandas?

Pandas is a Python library especially suited to processing and analyzing large amounts of data. When you have a collection of tens of thousands of records, Excel might not be the right tool for you. Pandas makes processing these data easy.

You install Pandas via pip. Do not forget to create a virtual environment.

python -m venv .env
source .env/bin/activate    # Windows: .env\Scripts\activate
pip install pandas

We will import Pandas into our Python script.

import pandas as pd # we give Pandas the alias pd by convention

Data structures in Pandas

The two basic data structures in Pandas are Series and DataFrame.

A Series is one-dimensional (one column); a DataFrame is two-dimensional (a table with rows and columns). Pandas is built on top of NumPy and is therefore very efficient in terms of computing power. Both data structures can contain all types of objects. By default the indices are numbers, as in lists, but we can also pass our own index in the form of a list of labels. Obviously, the index list must have the same length as the values list.

import pandas as pd
import numpy as np

rij = pd.Series(np.random.randn(5))
print(rij)

Output:

0    0.309326
1   -0.394174
2    1.722596
3   -0.644675
4   -1.048602
dtype: float64

Example with our own list of indices:

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
print(s)

Output:

a   -0.179841
b    0.474589
c    0.304474
d    0.667548
e   -0.996806
dtype: float64

We can also create a Series object from a dict. The keys are the indices.

d = {"b": 1, "a": 0, "c": 2}
print(pd.Series(d))

Output:

b    1
a    0
c    2
dtype: int64

A DataFrame is two-dimensional and consists of rows and columns with column headers. You can create a DataFrame from elements of type Series, a NumPy array, another DataFrame or a dict. Let's illustrate this with a few examples:

import pandas as pd

# Make a DataFrame from a 'dictionary of lists'.
my_data_frame = pd.DataFrame({
    "Name": ["Pol", "Romeo", "Thomas", "An"],
    "Age": [25, 65, 63, 21],
    "Sex": ["male", "male", "male", "female"]
})

# Dump the DataFrame.
# Note that the keys of the dictionary are used as column names, the values from the lists as column data.
print(my_data_frame)

Output:

     Name  Age     Sex
0     Pol   25    male
1   Romeo   65    male
2  Thomas   63    male
3      An   21  female

Query data from a Series or DataFrame

Each column in a DataFrame is a Series. You can query these separately. Let's take the example above:

print(my_data_frame["Age"])

Output:

0    25
1    65
2    63
3    21

If you want to query one element, you can use the index: my_data_frame["Age"][0]. This will give the value 25.
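Chained indexing like my_data_frame["Age"][0] works for reading, but loc and at select a row and a column in one step, which is the safer choice when you later want to assign values. A small sketch, reusing the DataFrame from above:

```python
import pandas as pd

my_data_frame = pd.DataFrame({
    "Name": ["Pol", "Romeo", "Thomas", "An"],
    "Age": [25, 65, 63, 21],
    "Sex": ["male", "male", "male", "female"]
})

# All three read the same element; at is the fastest for a single scalar.
print(my_data_frame["Age"][0])       # 25
print(my_data_frame.loc[0, "Age"])   # 25
print(my_data_frame.at[0, "Age"])    # 25
```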

Get data from and write data to different data sources

CSV and Excel

Pandas is not only strong at quick statistical analysis; you can also read from many different data sources very easily. In the example below we create a DataFrame by reading the csv file titanic.csv.

titanic_data_frame = pd.read_csv("titanic.csv")

Pandas will create a DataFrame and try to convert all data to the correct data type. Now we can, for example, query the first ten rows:

titanic_data_frame.head(10)
PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S
5            6         0       3  ...   8.4583   NaN         Q
6            7         0       1  ...  51.8625   E46         S
7            8         0       3  ...  21.0750   NaN         S
8            9         1       3  ...  11.1333   NaN         S
9           10         1       2  ...  30.0708   NaN         C

You can also write a DataFrame to a csv file.

df = pd.DataFrame({"brand": ["Mercedes", "BMW", "Audi"], 
                   "color": ["white", "black", "blue"]})
df.to_csv("cars.csv", sep=",", index=False)  # default separator is ','

By using index=False, you avoid that the row numbers are added to the csv file as well.

You can also write a DataFrame to an Excel file, by using the to_excel() method. Similarly, there are other to_* methods, such as to_html(), to_json(), to_sql(), ...

titanic_data_frame.to_excel("myTitanicExcelFile.xlsx", sheet_name="passengerlist", index=False)

It is possible that you get the error "No module named 'openpyxl'". In that case, install the module with pip install openpyxl.

The opposite is also easy, reading an Excel file into a DataFrame.

titanic_data_frame_from_excel = pd.read_excel("myTitanicExcelFile.xlsx", sheet_name="passengerlist")

If you get the error "Missing optional dependency 'xlrd'", install the module with pip install xlrd.

Once you've created a DataFrame, it is time to explore the data in it. You can query 'technical' info about your DataFrame by using the info() method.

titanic_data_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
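The same exploration works on any DataFrame, not just one read from a file. A minimal sketch using the small DataFrame from earlier; shape, dtypes, head() and tail() are the usual first steps:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Pol", "Romeo", "Thomas", "An"],
    "Age": [25, 65, 63, 21],
    "Sex": ["male", "male", "male", "female"]
})

print(df.shape)    # (4, 3): 4 rows, 3 columns
print(df.dtypes)   # data type per column
print(df.head(2))  # first two rows
print(df.tail(1))  # last row
```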

Json

import pandas as pd
import matplotlib.pyplot as plt
import json

# Import the JSON file.
co2_values = pd.read_json("co2.json", typ="series")
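Since the co2.json file itself is not included here, the sketch below illustrates the same idea with a small round trip: we serialize a Series to JSON with to_json() and read it back with read_json(). Note that the keyword is spelled typ, not type. The values are invented for illustration:

```python
import pandas as pd
from io import StringIO

# Invented example data: yearly CO2 concentrations.
s = pd.Series({"2019": 410.5, "2020": 412.4, "2021": 414.7})
json_text = s.to_json()

# Read the JSON back into a Series; the parameter is 'typ', not 'type'.
restored = pd.read_json(StringIO(json_text), typ="series")
print(restored)
```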

Database

import sqlite3
from sqlite3 import Error
import pandas as pd

try:
  conn = sqlite3.connect("chinook.db")
except Error as e:
  print(e)

sql_query = """
SELECT Title, Name
FROM albums
INNER JOIN artists ON artists.ArtistID = albums.ArtistID
"""

albums = pd.read_sql_query(sql_query, conn)
albums.head()
conn.close()
Title Name
0 For Those About To Rock We Salute You AC/DC
1 Balls to the Wall Accept
2 Restless and Wild Accept
3 Let There Be Rock AC/DC
4 Big Ones Aerosmith
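When part of the query depends on user input, read_sql_query() accepts a params argument so you do not have to build the SQL string yourself. A self-contained sketch using an in-memory database (the table and values are invented for illustration; chinook.db is not needed):

```python
import sqlite3
import pandas as pd

# In-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE albums (Title TEXT, ArtistId INTEGER)")
conn.executemany("INSERT INTO albums VALUES (?, ?)",
                 [("Back in Black", 1), ("High Voltage", 1), ("Abbey Road", 2)])

# The '?' placeholder keeps user input out of the SQL string itself.
df = pd.read_sql_query("SELECT Title FROM albums WHERE ArtistId = ?",
                       conn, params=(1,))
print(df)
conn.close()
```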

Select a subset from a DataFrame

If you want to select several columns from a DataFrame, use double square brackets. The inner square brackets form a Python list with the column headers indicating which columns you want to get. The outer square brackets indicate you want to make a selection from your DataFrame.

import pandas as pd

titanic_data_frame = pd.read_csv("titanic.csv")

age_and_sex = titanic_data_frame[["Age", "Sex"]]
print(age_and_sex)
    Age Sex
0   22.0    male
1   38.0    female
2   26.0    female
3   35.0    female
4   35.0    male
... ... ...
886 27.0    male
887 19.0    female
888 NaN female
889 26.0    male
890 32.0    male
891 rows × 2 columns

If you want to select several rows from a DataFrame, you use a condition in the square brackets. In the following example we select the passengers older than 39 from the Titanic passenger list.

older_than_39 = titanic_data_frame[titanic_data_frame["Age"] > 39]
print(older_than_39)
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch      Ticket     Fare Cabin Embarked
6              7         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0       17463  51.8625   E46        S
11            12         1       1                           Bonnell, Miss. Elizabeth  female  58.0      0      0      113783  26.5500  C103        S
15            16         1       2                   Hewlett, Mrs. (Mary D Kingcome)   female  55.0      0      0      248706  16.0000   NaN        S
30            31         0       1                           Uruchurtu, Don. Manuel E    male  40.0      0      0    PC 17601  27.7208   NaN        C
33            34         0       2                              Wheadon, Mr. Edward H    male  66.0      0      0  C.A. 24579  10.5000   NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...         ...      ...   ...      ...
862          863         1       1  Swift, Mrs. Frederick Joel (Margaret Welles Ba...  female  48.0      0      0       17466  25.9292   D17        S
865          866         1       2                           Bystrom, Mrs. (Karolina)  female  42.0      0      0      236852  13.0000   NaN        S
871          872         1       1   Beckwith, Mrs. Richard Leonard (Sallie Monypeny)  female  47.0      1      1       11751  52.5542   D35        S
873          874         0       3                        Vander Cruyssen, Mr. Victor    male  47.0      0      0      345765   9.0000   NaN        S
879          880         1       1      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0      0      1       11767  83.1583   C50        C

[163 rows x 12 columns]

This works because the condition titanic_data_frame["Age"] > 39 returns a Series of booleans. This Series is used to filter the original DataFrame.

age_filter = titanic_data_frame["Age"] > 39
print(age_filter)
0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool

You can filter on travel class by using the isin() method. For instance, you can select the passengers that travel in first or second class by passing a list-like object to the method.

class_one_and_two_passengers = titanic_data_frame[titanic_data_frame["Pclass"].isin([1,2])]
print(class_one_and_two_passengers)
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
1              2         1       1  ...  71.2833   C85         C
3              4         1       1  ...  53.1000  C123         S
6              7         0       1  ...  51.8625   E46         S
9             10         1       2  ...  30.0708   NaN         C
11            12         1       1  ...  26.5500  C103         S
..           ...       ...     ...  ...      ...   ...       ...
880          881         1       2  ...  26.0000   NaN         S
883          884         0       2  ...  10.5000   NaN         S
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
889          890         1       1  ...  30.0000  C148         C

[400 rows x 12 columns]

Some passengers have an unknown age. If you want to restrict operations to passengers that have an age, you can use the method notna(). This returns True when a non-null value is found.

passengers_with_age = titanic_data_frame[titanic_data_frame["Age"].notna()]
print(passengers_with_age)
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2              3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4              5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...   ...      ...
885          886         0       3               Rice, Mrs. William (Margaret Norton)  female  39.0      0      5            382652  29.1250   NaN        Q
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN        S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42        S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

[714 rows x 12 columns]
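Besides filtering with notna(), you can also drop or replace missing values directly. A small sketch with invented data, using dropna() and fillna():

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["A", "B", "C"],
                   "Age": [22.0, np.nan, 35.0]})

print(df[df["Age"].notna()])               # keep rows with a known age
print(df.dropna(subset=["Age"]))           # same result via dropna
print(df["Age"].fillna(df["Age"].mean()))  # or fill the gap with the mean
```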

It is also possible to select rows and columns at once, by using the loc/iloc operators. We use loc with column names, row labels or conditions; we use iloc with integer positions. We first give the row filter and then the column filter.

# Find the names of all the adults
adult_names = titanic_data_frame.loc[titanic_data_frame["Age"] >= 18, "Name"]
adult_names
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
885                 Rice, Mrs. William (Margaret Norton)
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 601, dtype: object
# iloc for indices
titanic_data_frame.iloc[3:6, 2:6]     # Rows 3 to 5, columns 2 to 5.
   Pclass                                          Name     Sex   Age
3       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0
4       3                      Allen, Mr. William Henry    male  35.0
5       3                              Moran, Mr. James    male   NaN
# You can also update data in the same way. Suppose you want to change the names of the first three rows to 'unknown', you can do this as follows: 
titanic_data_frame.iloc[0:3, 3] = "unknown"
titanic_data_frame.head(6)


   PassengerId  Survived  Pclass                                          Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                                       unknown    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1                                       unknown  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                                       unknown  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                      Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5            6         0       3                              Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q

As mentioned earlier, iloc works with integer positions and loc with column names and conditions. Let us for example print the columns Name, Sex and Age:

print(titanic_data_frame.loc[:, ["Name", "Sex", "Age"]])

We first define the rows. With ':' we indicate 'all rows'. For the columns we give a list of column names.

Let's do the same, but for all rows with a minimum age of 30 and a maximum age of 39.

print(titanic_data_frame.loc[(titanic_data_frame["Age"] >= 30) & (titanic_data_frame["Age"] < 40), ["Name", "Sex", "Age"]])

We must use the ampersand ('&') instead of 'and', and the pipe symbol ('|') instead of 'or'.
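A short sketch with invented data showing why each condition also needs its own parentheses: '&' and '|' bind more tightly than comparisons, so without parentheses Python would try to evaluate 30 & df["Age"] first and raise an error:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Pol", "Romeo", "Thomas", "An"],
                   "Age": [25, 65, 63, 21]})

# Each condition in its own parentheses, combined with & (and) or | (or).
thirty_to_seventy = df[(df["Age"] >= 30) & (df["Age"] < 70)]
young_or_old = df[(df["Age"] < 22) | (df["Age"] > 64)]

print(thirty_to_seventy["Name"].tolist())  # ['Romeo', 'Thomas']
print(young_or_old["Name"].tolist())       # ['Romeo', 'An']
```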

Drawing graphs using Pandas

Pandas works with the matplotlib library to draw nice visualizations of your data. Below you will find a few examples of how to draw graphs from Pandas data.

Some of these examples read a csv file. The source file can be found at https://github.com/pandas-dev/pandas/tree/master/doc/data/air_quality_no2.csv or on the Toledo platform.

You can find more information about the optional arguments of read_csv here: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

# Read NO2 values from a csv file. 'index_col' states that the first column should be
# used as index (unique reference) and 'parse_dates' states that the index values
# should be parsed as dates.
air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
air_quality
                     station_antwerp  station_paris  station_london
datetime                                                           
2019-05-07 02:00:00              NaN            NaN            23.0
2019-05-07 03:00:00             50.5           25.0            19.0
2019-05-07 04:00:00             45.0           27.7            19.0
2019-05-07 05:00:00              NaN           50.4            16.0
2019-05-07 06:00:00              NaN           61.9             NaN
...                              ...            ...             ...
2019-06-20 22:00:00              NaN           21.4             NaN
2019-06-20 23:00:00              NaN           24.9             NaN
2019-06-21 00:00:00              NaN           26.5             NaN
2019-06-21 01:00:00              NaN           21.8             NaN
2019-06-21 02:00:00              NaN           20.0             NaN

[1035 rows x 3 columns]

In order to use matplotlib, the library must be installed. It is possible that this is done automatically while installing Pandas, but if it isn't, you will have to run the statement pip install matplotlib.

import matplotlib.pyplot as plt

# A simple line plot using Pandas (and matplotlib in the background).
# One line per column. 
# More info about the plot functions in Pandas:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html?highlight=plot#pandas-series-plot.
air_quality.plot()
plt.show()

[figure: line plot of the NO2 values, one line per station]

# filter on Antwerp
air_quality["station_antwerp"].plot()
plt.show()

[figure: line plot of the NO2 values for station_antwerp]

Add new columns in a DataFrame based on existing columns

Let's take the conversion of the NO2 concentration to mg/m³. If we assume a temperature of 25°C and an air pressure of 1013 hPa, we have to multiply the value by 1.882.

See: https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html.

air_quality["antwerp_mg_per_cubic_meter"] = air_quality["station_antwerp"] * 1.882
print(air_quality)
                     station_antwerp  station_paris  station_london  antwerp_mg_per_cubic_meter
datetime                                                                                       
2019-05-07 02:00:00              NaN            NaN            23.0                         NaN
2019-05-07 03:00:00             50.5           25.0            19.0                      95.041
2019-05-07 04:00:00             45.0           27.7            19.0                      84.690
2019-05-07 05:00:00              NaN           50.4            16.0                         NaN
2019-05-07 06:00:00              NaN           61.9             NaN                         NaN
...                              ...            ...             ...                         ...
2019-06-20 22:00:00              NaN           21.4             NaN                         NaN
2019-06-20 23:00:00              NaN           24.9             NaN                         NaN
2019-06-21 00:00:00              NaN           26.5             NaN                         NaN
2019-06-21 01:00:00              NaN           21.8             NaN                         NaN
2019-06-21 02:00:00              NaN           20.0             NaN                         NaN

[1035 rows x 4 columns]

You can also make calculations with two columns. We use the backslash to split a statement over two lines.

air_quality["ratio_paris_antwerp"] = \
    air_quality["station_paris"] / air_quality["station_antwerp"]
print(air_quality)
                     station_antwerp  station_paris  station_london  antwerp_mg_per_cubic_meter  ratio_paris_antwerp
datetime                                                                                                            
2019-05-07 02:00:00              NaN            NaN            23.0                         NaN                  NaN
2019-05-07 03:00:00             50.5           25.0            19.0                      95.041             0.495050
2019-05-07 04:00:00             45.0           27.7            19.0                      84.690             0.615556
2019-05-07 05:00:00              NaN           50.4            16.0                         NaN                  NaN
2019-05-07 06:00:00              NaN           61.9             NaN                         NaN                  NaN
...                              ...            ...             ...                         ...                  ...
2019-06-20 22:00:00              NaN           21.4             NaN                         NaN                  NaN
2019-06-20 23:00:00              NaN           24.9             NaN                         NaN                  NaN
2019-06-21 00:00:00              NaN           26.5             NaN                         NaN                  NaN
2019-06-21 01:00:00              NaN           21.8             NaN                         NaN                  NaN
2019-06-21 02:00:00              NaN           20.0             NaN                         NaN                  NaN

[1035 rows x 5 columns]

You can rename columns.

air_quality_renamed = air_quality.rename(
    columns={"station_antwerp": "ANTW",
    "station_london": "London WM",
    "station_paris": "Paris"}
)
print(air_quality_renamed)
                     ANTW  Paris  London WM  antwerp_mg_per_cubic_meter  ratio_paris_antwerp
datetime                                                                                    
2019-05-07 02:00:00   NaN    NaN       23.0                         NaN                  NaN
2019-05-07 03:00:00  50.5   25.0       19.0                      95.041             0.495050
2019-05-07 04:00:00  45.0   27.7       19.0                      84.690             0.615556
2019-05-07 05:00:00   NaN   50.4       16.0                         NaN                  NaN
2019-05-07 06:00:00   NaN   61.9        NaN                         NaN                  NaN
...                   ...    ...        ...                         ...                  ...
2019-06-20 22:00:00   NaN   21.4        NaN                         NaN                  NaN
2019-06-20 23:00:00   NaN   24.9        NaN                         NaN                  NaN
2019-06-21 00:00:00   NaN   26.5        NaN                         NaN                  NaN
2019-06-21 01:00:00   NaN   21.8        NaN                         NaN                  NaN
2019-06-21 02:00:00   NaN   20.0        NaN                         NaN                  NaN

[1035 rows x 5 columns]
air_quality_renamed = air_quality_renamed.rename(columns=str.upper)
print(air_quality_renamed)
                     ANTW  PARIS  LONDON WM  ANTWERP_MG_PER_CUBIC_METER  RATIO_PARIS_ANTWERP
datetime                                                                                    
2019-05-07 02:00:00   NaN    NaN       23.0                         NaN                  NaN
2019-05-07 03:00:00  50.5   25.0       19.0                      95.041             0.495050
2019-05-07 04:00:00  45.0   27.7       19.0                      84.690             0.615556
2019-05-07 05:00:00   NaN   50.4       16.0                         NaN                  NaN
2019-05-07 06:00:00   NaN   61.9        NaN                         NaN                  NaN
...                   ...    ...        ...                         ...                  ...
2019-06-20 22:00:00   NaN   21.4        NaN                         NaN                  NaN
2019-06-20 23:00:00   NaN   24.9        NaN                         NaN                  NaN
2019-06-21 00:00:00   NaN   26.5        NaN                         NaN                  NaN
2019-06-21 01:00:00   NaN   21.8        NaN                         NaN                  NaN
2019-06-21 02:00:00   NaN   20.0        NaN                         NaN                  NaN

[1035 rows x 5 columns]

Get statistical information from DataFrames

You can get summary data from a DataFrame with my_data_frame.describe().

             Age
count   4.000000
mean   43.500000
std    23.741665
min    21.000000
25%    24.000000
50%    44.000000
75%    63.500000
max    65.000000

You get summary data from the numerical columns.

To find the eldest person we use the function max().

max_age = my_data_frame["Age"].max()
print("The eldest person in the DataFrame is " + str(max_age) + " years old.")

We also have other functions such as count(), mean(), and so on.

average_age = titanic_data_frame["Age"].mean()
print(f"The average age of the Titanic passengers was {round(average_age, 1)} years.")

titanic_data_frame[["Age", "Fare"]].median()

The function DataFrame.describe() gives standard information. You can customize this with DataFrame.agg().

titanic_data_frame.agg({"Age": ["min", "max", "median", "skew"],
                        "Fare": ["min", "max", "median", "mean"]})
              Age        Fare
min      0.420000    0.000000
max     80.000000  512.329200
median  28.000000   14.454200
skew     0.389108         NaN
mean          NaN   32.204208

Getting statistical data gets really interesting when you can group using groupby.

print(titanic_data_frame[["Sex", "Age"]].groupby("Sex").mean())
              Age
Sex              
female  27.915709
male    30.726645

# Grouping by multiple columns.
print(titanic_data_frame.groupby(["Sex", "Pclass"])["Fare"].mean())
Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64
# Count how many people travel in which class.
print(titanic_data_frame["Pclass"].value_counts())
# or
print(titanic_data_frame.groupby("Pclass")["Pclass"].count())
Pclass
3    491
1    216
2    184
Name: count, dtype: int64
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64

Reorganize a table

It is sometimes handy to reorganize a table so that it is easier to process or show the data.

Examples of reorganizing a table:

- sort tables.

- switch columns and rows in a table.

- ...

titanic_data_frame.sort_values(by="Age")

titanic_data_frame.sort_values(by=["Pclass", "Age"], ascending=False)
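A quick sketch of sort_values() on a small, invented DataFrame; note that it returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Pol", "Romeo", "Thomas", "An"],
                   "Age": [25, 65, 63, 21]})

# Ascending sort on one column; the original df is not modified.
by_age = df.sort_values(by="Age")
print(by_age["Name"].tolist())  # ['An', 'Pol', 'Thomas', 'Romeo']

# Descending sort.
print(df.sort_values(by="Age", ascending=False)["Name"].tolist())
```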

We can also make a pivot table, in which we group and switch columns and rows.

Let's take the example of the air quality data, the extended version: https://github.com/pandas-dev/pandas/blob/main/doc/data/air_quality_no2_long.csv

air_quality_long = pd.read_csv("air_quality_long.csv", 
        index_col="date.utc", parse_dates=True)

# we will only look at the NO2 data.
air_quality_long_no2 = air_quality_long[air_quality_long["parameter"] == "no2"]
print(air_quality_long_no2)

We want to get the values per location per index field (= date and hour). We call this a pivot table, because the location data are not separate rows, but we will find them in the columns.

print(air_quality_long_no2.pivot(columns="location", values="value"))
location                   BETR801  FR04014  London Westminster
date.utc                                                       
2019-04-09 01:00:00+00:00     22.5     24.4                 NaN
2019-04-09 02:00:00+00:00     53.5     27.4                67.0
2019-04-09 03:00:00+00:00     54.5     34.2                67.0
2019-04-09 04:00:00+00:00     34.5     48.5                41.0
2019-04-09 05:00:00+00:00     46.5     59.5                41.0
...                            ...      ...                 ...
2019-06-20 20:00:00+00:00      NaN     21.4                 NaN
2019-06-20 21:00:00+00:00      NaN     24.9                 NaN
2019-06-20 22:00:00+00:00      NaN     26.5                 NaN
2019-06-20 23:00:00+00:00      NaN     21.8                 NaN
2019-06-21 00:00:00+00:00      NaN     20.0                 NaN

[1705 rows x 3 columns]
# This can be plotted easily
air_quality_long_no2.pivot(columns="location", values="value").plot()
plt.show()

[figure: line plot of the pivoted NO2 values, one line per location]

Usually you want a pivot table to give summary information. That's when we use the method pivot_table(). See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot_table.html?highlight=pivot_table#pandas.DataFrame.pivot_table

# Average concentration of NO2 and fine dust per location.
air_quality_long.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean")

This means we will take the location data as rows (index) and calculate the average air quality per location, grouping the values no2 and pm25 in separate columns.

parameter                 no2       pm25
location                                
BETR801             26.950920  23.169492
FR04014             29.374284        NaN
London Westminster  29.740050  13.443568

There is a lot more you can do with Pandas tables, but you can get a long way with the things we covered. For more functions, read the documentation.

Case study: super heroes survey

We will illustrate the above by processing the results of a short survey about super heroes. We asked a few students some questions via an online form:

  • What is your favourite super hero?
  • What is your height in cm?
  • What is the size of your shoes?
  • What is your favourite drink?
  • What is your favourite operating system?

The first column is the date/time in which the survey was completed. You will find all the data in the csv file 'short_survey.csv'.

We will process the data of this survey in the following steps:

  • We will read the csv file.
  • We will explore the data set: shape, data types, statistics.
  • We will show the first five and the last three rows.
  • We will remove the column with timestamps.
  • We will rename the columns to: Super hero, height, shoe size, fav. drink, operating system.
  • We will perform a few analyses:
      • average and median of the shoe sizes;
      • minimum, maximum and average of the height;
      • sorting the rows alphabetically on super hero in descending order;
      • average shoe size and height grouped by operating system.
  • We will visualize the data as follows:
      • a plot of the height with red crosses;
      • a bar diagram of the number of people that chose a certain super hero;
      • a bar diagram of the number of people that chose a certain favourite drink;
      • a pie chart showing the choice of operating system.
  • We will write a number of data to an Excel file using the method to_excel(), passing the file name, the sheet name via the sheet_name argument and index=False. The index argument decides whether row numbers are added to the Excel sheet.

We could execute this in a script, or in a Jupyter notebook. We will choose the latter option.
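As a starting point, the sketch below walks through a few of the steps on an invented mini data set; the column names and values are hypothetical stand-ins for the real short_survey.csv:

```python
import pandas as pd

# Hypothetical survey data standing in for 'short_survey.csv';
# the real file and its column names may differ.
survey = pd.DataFrame({
    "Timestamp": ["2024-01-10 09:00", "2024-01-10 09:05", "2024-01-10 09:07"],
    "Hero": ["Batman", "Superman", "Batman"],
    "Height": [172, 185, 168],
    "Shoe": [41, 44, 39],
})

survey = survey.drop(columns=["Timestamp"])           # remove the timestamp column
survey = survey.rename(columns={"Hero": "Super hero",
                                "Height": "height",
                                "Shoe": "shoe size"})

print(survey["shoe size"].mean())                     # average shoe size
print(survey.sort_values(by="Super hero", ascending=False))
print(survey["Super hero"].value_counts())            # counts per hero
```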