curl -o "data/raw/nyc_taxi/year=${year}/month=${month}/part-0.parquet" \
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_${year}-${month}.parquet"
I’m very into the idea of multilingual data science at the moment. I’m spending a lot of time with Python, but I still want to use R for those use cases where it’s better. For example: I cannot find a Python package for data viz that I like as much as ggplot2. Building simple web apps is very easy in Shiny. Etc.
If you’re using multiple languages you need something to connect them. The book Data Science at the Command Line is very useful for learning about this sort of thing. The idea that “… the command line can act as a glue between many different data science tools” is not new, but I’m using it more than I ever have. Here’s a simple example.
Suppose we want to work with the New York City Taxi Data. We want to download all of the files for the years in which we’re interested. The files follow a set pattern: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_${year}-${month}.parquet
.
We want to write out those data locally, and use Hive-style partitioning by year and month. So we write to data/raw/nyc_taxi/year=${year}/month=${month}/part-0.parquet
.
We could load up a package in Python or R to do this. Voltron Data have these data stored in an S3 bucket at s3://voltrondata-labs-datasets/nyc-taxi
, and a recent Jumping Rivers blog shows how to connect to that with the {arrow} package.
But it’s extremely easy to do from the command line with curl. Something like:
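For context, Hive-style partitioning just means the partition values are encoded in the directory names, so once everything is downloaded the tree should end up looking something like this (illustrative, only a couple of entries shown):

data/raw/nyc_taxi/
  year=2018/
    month=01/
      part-0.parquet
    month=02/
      part-0.parquet
  year=2019/
    ...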
Wrap that in a couple of for loops and the job is done. But I want to manage my project in a reproducible way, which means I need to manage the workflow. If a data file doesn’t exist, or is out of date, I need to download it; if it’s already there, I should do nothing. I also don’t want to waste time downloading things I don’t need: each month’s data is > 100MB.
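For reference, the brute-force looped version might look something like this (just a sketch; note that it re-downloads every file every time it runs, which is exactly the waste I want to avoid):

#!/usr/bin/env bash
# Naive version: loop over every year and month and download unconditionally.
for year in $(seq 2018 2021); do
  for month in $(seq -w 1 12); do
    mkdir -p "data/raw/nyc_taxi/year=${year}/month=${month}"
    curl -o "data/raw/nyc_taxi/year=${year}/month=${month}/part-0.parquet" \
      "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_${year}-${month}.parquet"
  done
done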
If I’m working just in R I’d always use {targets}, but for a project across multiple languages I need something more general, such as GNU Make. And we can create the whole set of targets in a Makefile fairly simply.
# Base URL
BASE_TAXI_URL = https://d37ci6vzurychx.cloudfront.net/trip-data

# Years and months for which to download data
YEARS = $(shell seq 2018 2021)
MONTHS = $(shell seq -w 1 12)

# Directory to store downloaded files
TAXI_DATA_DIR = data/raw/nyc_taxi

# List of all .parquet files to download
PARQUET_FILES = $(foreach year,$(YEARS),$(foreach month,$(MONTHS),$(TAXI_DATA_DIR)/year=$(year)/month=$(month)/part-0.parquet))

# Rule to download a .parquet file (the recipe line must start with a tab)
$(PARQUET_FILES):
	./download_taxi_data $@

## Make datasets
.PHONY: data
data: $(PARQUET_FILES)
Let’s go through this step by step.
Set up the base of the URL.
BASE_TAXI_URL = https://d37ci6vzurychx.cloudfront.net/trip-data
Now create the sequences of years and months for which we want data.
YEARS = $(shell seq 2018 2021)
MONTHS = $(shell seq -w 1 12)
The -w option ensures that all of them have the same width by zero-padding.
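To see exactly what those two variables will contain, you can run the same commands in a shell (Make’s shell function joins the lines with spaces):

$ echo $(seq 2018 2021)
2018 2019 2020 2021
$ echo $(seq -w 1 12)
01 02 03 04 05 06 07 08 09 10 11 12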
The next step creates a variable TAXI_DATA_DIR = data/raw/nyc_taxi. Then we get to the only bit of the process that may not be obvious at first:
PARQUET_FILES = $(foreach year,$(YEARS),$(foreach month,$(MONTHS),$(TAXI_DATA_DIR)/year=$(year)/month=$(month)/part-0.parquet))
Make has a foreach function, which allows us to create a whole list based on a pattern. In this case it’s just a way to do two for loops, one nested in the other:
For each year in YEARS:
  For each month in MONTHS:
    Create a filepath following this pattern.
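If the nested foreach looks opaque, a tiny standalone Makefile (not part of the project, purely an illustration) may help:

# toy.mk — a nested foreach, the same shape as PARQUET_FILES above
LETTERS = a b
NUMBERS = 1 2
COMBOS = $(foreach l,$(LETTERS),$(foreach n,$(NUMBERS),$(l)-$(n)))

# The recipe just prints the expanded list.
show:
	@echo $(COMBOS)

Running make -f toy.mk show prints a-1 a-2 b-1 b-2: one entry per combination, which is exactly how PARQUET_FILES ends up holding one filepath per year/month pair.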
The rest of the Makefile creates the actual targets to be written:
# Rule to download a .parquet file
$(PARQUET_FILES):
	./download_taxi_data $@

## Make datasets
.PHONY: data
data: $(PARQUET_FILES)
The first part defines the rule for how to build each of those targets (i.e. the files to be written out); $@ is Make’s automatic variable for the path of the target currently being built, so that’s what gets passed to the download script. The second part creates another target (or rather a phony target) called data that just depends on all of the individual PARQUET_FILES. That way we can call make data at the command line and get all of them at once.
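Concretely, the commands involved look roughly like this (the -j value is my own choice, and assumes the server is happy with a few simultaneous downloads):

# For a single target, Make effectively runs the script with the output path:
./download_taxi_data data/raw/nyc_taxi/year=2018/month=01/part-0.parquet

# Download everything that's missing; files already on disk are left alone.
make data

# Because each file is an independent target, downloads can also run in parallel.
make -j 4 data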
You’ve probably spotted ./download_taxi_data in that last chunk. That’s a separate script that does the actual download.
#!/usr/bin/env bash

# Script using curl to download yellow taxi data from NYC Taxi and Limousine Commission
# Usage: ./download_taxi_data <output_path>

if [ "$#" -ne 1 ]; then
  echo "Illegal number of parameters"
  echo "Usage: ./download_taxi_data <output_path>"
  exit 1
fi

output_path=$1
month_dir=$(dirname "$output_path")
year_dir=$(dirname "$month_dir")

# Create directories if they don't exist
mkdir -p "$month_dir"

# Extract the month and year from the Hive-style directory names
month=$(basename "$month_dir" | grep -oP '\d{2}')
year=$(basename "$year_dir" | grep -oP '\d{4}')

echo "Downloading yellow taxi data for $year-$month"

curl -o "$output_path" \
  "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_${year}-${month}.parquet" \
  --progress-bar

echo "Download complete"
Why bother with all of this? It seems like a lot, and for this toy example it is. But if you’ve ever felt the pain of wasting hours or days because you reached a point in a project where the state of things was so unclear that the only option remaining was “let me just start everything from scratch”, then the value of managing the workflow like this may be more convincing.
NB. Yes I know that Airflow exists, but I haven’t learned to use it yet.