Looping Over Data Sets
Overview
Teaching: 15 min
Exercises: 5 minQuestions
How can I process many data sets with a single command?
Objectives
Be able to read and write globbing expressions that match sets of files.
Use glob to create lists of files.
Write for loops to perform operations on files given their names in a list.
If you don’t have the data files already (e.g., using Google Colab), you will want to download them.
# Same download function as last time
import urllib.request
import os
def download_data(filename):
if not os.path.exists('data'):
os.mkdir('data')
url = 'https://raw.githubusercontent.com/ualberta-rcg/python-intro/gh-pages/data/' + filename
output_file = 'data/' + filename
urllib.request.urlretrieve(url, output_file)
print("Downloaded " + filename + " to the data directory")
# Run the function to download
filenames = ['gapminder_gdp_africa.csv',
'gapminder_gdp_americas.csv',
'gapminder_gdp_oceania.csv',
'gapminder_gdp_europe.csv',
'gapminder_gdp_asia.csv',
'gapminder_all.csv']
for filename in filenames:
download_data(filename)
Use a for
loop to process files given a list of their names.
- A filename is just a character string.
- And lists can contain character strings.
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
df = pd.read_csv(filename, index_col='country')
print(filename, df.min())
data/gapminder_gdp_africa.csv gdpPercap_1952 298.846212
gdpPercap_1957 335.997115
gdpPercap_1962 355.203227
gdpPercap_1967 412.977514
⋮ ⋮ ⋮
gdpPercap_1997 312.188423
gdpPercap_2002 241.165877
gdpPercap_2007 277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952 331
gdpPercap_1957 350
gdpPercap_1962 388
gdpPercap_1967 349
⋮ ⋮ ⋮
gdpPercap_1997 415
gdpPercap_2002 611
gdpPercap_2007 944
dtype: float64
Use glob.glob
to find sets of files whose names match a pattern.
- In Unix, the term “globbing” means “matching a set of files with a pattern”.
- The most common patterns are:
*
meaning “match zero or more characters”?
meaning “match exactly one character”
- Python contains the
glob
library to provide pattern matching functionality - The
glob
library contains a function also calledglob
to match file patterns - E.g.,
glob.glob('*.txt')
matches all files in the current directory whose names end with.txt
. - Result is a (possibly empty) list of character strings.
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data/gapminder_all.csv', 'data/gapminder_gdp_africa.csv', \
'data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_asia.csv', 'data/gapminder_gdp_europe.csv', \
'data/gapminder_gdp_oceania.csv']
print('all PDB files:', glob.glob('*.pdb'))
all PDB files: []
Use glob
and for
to process batches of files.
- Helps a lot if the files are named and stored systematically and consistently so that simple patterns will find the right data.
for filename in glob.glob('data/gapminder_*.csv'):
df = pd.read_csv(filename)
print(filename, df['gdpPercap_1952'].min())
data/gapminder_all.csv 298.8462121
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_asia.csv 331.0
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564
- This includes all data, as well as per-region data.
- Use a more specific pattern in the exercises to exclude the whole data set.
- But note that the minimum of the entire data set is also the minimum of one of the data sets, which is a nice check on correctness.
Determining Matches
Which of these files is not matched by the expression
glob.glob('data/*as*.csv')
?
data/gapminder_gdp_africa.csv
data/gapminder_gdp_americas.csv
data/gapminder_gdp_asia.csv
- 1 and 2 are not matched.
Solution
1 is not matched by the glob.
Minimum File Size
Modify this program so that it prints the number of records in the file that has the fewest records.
import glob import pandas as pd fewest = ____ for filename in glob.glob('data/*.csv'): df = pd.____(filename) fewest = min(____, df.shape[0]) print('smallest file has', fewest, 'records')
Notice that the shape method returns a tuple with the number of rows and columns of the data frame.
Solution
import glob import pandas as pd fewest = float('Inf') for filename in glob.glob('data/*.csv'): df = pd.read_csv(filename) fewest = min(fewest, df.shape[0]) print('smallest file has', fewest, 'records')
Comparing Data
Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.
Solution
import glob import pandas as pd import matplotlib.pyplot as plt fig, ax = plt.subplots(1,1) for filename in glob.glob('data/gapminder_gdp*.csv'): df = pd.read_csv(filename) # extract region from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv' region = filename.rpartition('_')[2][:-4] df.mean().plot(ax=ax, label=region) plt.legend() plt.show()
Processing Small Files
Modify this program so that it only processes files with fewer than 50 records.
import glob import pandas as pd for filename in glob.glob('data/*.csv'): contents = pd.read_csv(filename) ____: print(filename, len(contents))
Solution
import glob import pandas as pd for filename in glob.glob('data/*.csv'): contents = pd.read_csv(filename) if len(contents)<50: print(filename, len(contents))
Key Points
Use a
for
loop to process files given a list of their names.Use
glob.glob
to find sets of files whose names match a pattern.Use
glob
andfor
to process batches of files.