Looping Over Data Sets

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How can I process many data sets with a single command?

Objectives
  • Be able to read and write globbing expressions that match sets of files.

  • Use glob to create lists of files.

  • Write for loops to perform operations on files given their names in a list.

If you don’t have the data files already (e.g., using Google Colab), you will want to download them.

# Same download function as last time

import urllib.request
import os

def download_data(filename):
  if not os.path.exists('data'):
          os.mkdir('data')
  url = 'https://raw.githubusercontent.com/ualberta-rcg/python-intro/gh-pages/data/' + filename
  output_file = 'data/' + filename
  urllib.request.urlretrieve(url, output_file)
  print("Downloaded " + filename + " to the data directory")

# Run the function to download

filenames = ['gapminder_gdp_africa.csv',
             'gapminder_gdp_americas.csv',
             'gapminder_gdp_oceania.csv',
             'gapminder_gdp_europe.csv',
             'gapminder_gdp_asia.csv',
             'gapminder_all.csv']

for filename in filenames:
  download_data(filename)

Use a for loop to process files given a list of their names.

import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    df = pd.read_csv(filename, index_col='country')
    print(filename, df.min())
data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
⋮ ⋮ ⋮
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331
gdpPercap_1957    350
gdpPercap_1962    388
gdpPercap_1967    349
⋮ ⋮ ⋮
gdpPercap_1997    415
gdpPercap_2002    611
gdpPercap_2007    944
dtype: float64

Use glob.glob to find sets of files whose names match a pattern.

import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data/gapminder_all.csv', 'data/gapminder_gdp_africa.csv', \
'data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_asia.csv', 'data/gapminder_gdp_europe.csv', \
'data/gapminder_gdp_oceania.csv']
print('all PDB files:', glob.glob('*.pdb'))
all PDB files: []

Use glob and for to process batches of files.

for filename in glob.glob('data/gapminder_*.csv'):
    df = pd.read_csv(filename)
    print(filename, df['gdpPercap_1952'].min())
data/gapminder_all.csv 298.8462121
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_asia.csv 331.0
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564

Determining Matches

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

  1. data/gapminder_gdp_africa.csv
  2. data/gapminder_gdp_americas.csv
  3. data/gapminder_gdp_asia.csv
  4. 1 and 2 are not matched.

Solution

1 is not matched by the glob.

Minimum File Size

Modify this program so that it prints the number of records in the file that has the fewest records.

import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    df = pd.____(filename)
    fewest = min(____, df.shape[0])
print('smallest file has', fewest, 'records')

Notice that the shape method returns a tuple with the number of rows and columns of the data frame.

Solution

import glob
import pandas as pd
fewest = float('Inf')
for filename in glob.glob('data/*.csv'):
    df = pd.read_csv(filename)
    fewest = min(fewest, df.shape[0])
print('smallest file has', fewest, 'records')

Comparing Data

Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.

Solution

import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    df = pd.read_csv(filename)
    # extract region from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'
    region = filename.rpartition('_')[2][:-4] 
    df.mean().plot(ax=ax, label=region)
plt.legend()
plt.show()

Processing Small Files

Modify this program so that it only processes files with fewer than 50 records.

import glob
import pandas as pd
for filename in glob.glob('data/*.csv'):
    contents = pd.read_csv(filename)
    ____:
        print(filename, len(contents))

Solution

import glob
import pandas as pd
for filename in glob.glob('data/*.csv'):
    contents = pd.read_csv(filename)
    if len(contents)<50:
        print(filename, len(contents))

Key Points

  • Use a for loop to process files given a list of their names.

  • Use glob.glob to find sets of files whose names match a pattern.

  • Use glob and for to process batches of files.