15  NYC Taxi Ride Records Analysis

In the following exercises, we will clean and analyze a collection of taxi ride records from New York City (see Table 15.1). The records are generated from individual taxi trips. Each trip is logged electronically through a digital metering system that captures detailed information from the moment a taxi trip starts (when a passenger is picked up) to the moment the trip ends (when the passenger reaches their destination). Pickup and drop-off locations correspond to taxi zones within the city’s five districts.

Table 15.1: Taxi ride records from New York City.
pickup_datetime dropoff_datetime passenger_count trip_distance pickup_location_id dropoff_location_id fare_amount tip_amount
2018-06-21 05:52 2018-06-21 06:00 0 1.58 148 231 7.5 1.76
2018-06-21 06:17 2018-06-21 06:21 1 0 140 262 4.5 1
2018-06-21 06:26 2018-06-21 06:53 1 1.69 230 141 0 3.36
2018-06-21 06:28 2018-06-22 05:57 1 1.64 68 230 17.5 3.66
2018-06-21 06:48 2018-06-21 07:03 1 1.66 236 161 11 2.36
2018-06-21 07:38 2018-06-21 07:52 2 2.3 237 234 11.5 2.46
2018-06-21 08:05 2018-06-21 08:11 6 1.02 79 137 6.5 0
2018-06-21 08:26 2018-06-22 07:51 5 1.72 230 90 16 3.46
2018-06-21 09:04 2018-06-21 09:07 3 11.27 138 163 45.5 9
2018-06-21 11:41 2018-06-21 11:48 1 1.46 43 43 7.5 1.76
2018-06-21 12:00 2018-06-21 12:27 1 1.95 162 142 16.5 3.11
2018-06-21 12:01 2018-06-21 12:01 2 0 112 112 2.5 0
2018-06-21 12:02 2018-06-21 12:11 1 0.61 161 233 7 0
2018-06-21 12:02 2018-06-21 12:14 1 1.8 162 90 9.5 2.05
2018-06-21 12:03 2018-06-21 12:28 1 7.99 138 262 25 4
2018-06-21 12:04 2018-06-21 12:28 1 1.59 186 162 15 3.16
2018-06-21 12:04 2018-06-21 12:37 1 5.58 24 233 24 7.44
2018-06-21 12:06 2018-06-21 12:10 5 0.89 142 163 5.5 0
2018-06-21 12:07 2018-06-21 12:38 3 1.8 90 163 18.5 0
2018-06-21 12:07 2018-06-21 12:18 1 2.12 48 239 0 2.06
R1C1 Notation

In Excel, the cell reference notation R1C1 is an alternative to the standard A1 notation. The R stands for row, and the C stands for column. You can enable R1C1 notation in Excel by going to File > Options > Formulas > R1C1 reference style. This notation is useful when working with arrays in VBA.

15.1 Defining Data Structures

Define the following as public so they are available for all the subs/functions in the module(s):

  • tripRecords: a dynamic array of Variant values.
  • cleanedTripRecords: a dynamic array of Variant values.
  • numRows: a Long variable.
  • numCols: a Long variable.
  • WS_TRIPS: a String constant equals "nyc_trip_records".
  • WS_TRIPS_CLEANED: a String constant equals "nyc_trip_records_processed".
  • WS_TRIPS: a String constant equals "nyc_trip_records".
  • WS_TRIPS_CLEANED: a String constant equals "nyc_trip_records_processed".

15.2 Analyzing Data

Create a function called AnalyzeData that:

  • Receives a Range parameter table consisting of a table in an Excel worksheet with headers and numerical values (for example, the table in worksheet WS_TRIPS) and returns a String type.

  • Calculates for each column of the table parameter, the following summary statistics: maximum, minimum, average, total count, and total sum. You cannot use Excel functions in the calculation.

  • Returns a String presenting all summary statistics. For each column, the statistics (i.e., avg, sum, max, min, and count) should be formatted as follows:

    <COLUMN NAME> | Avg=<AVG> | Sum=<SUM> | Max=<MAX> | Min=<MIN> | Count=<COUNT>.

    For example, passing the two last columns in worksheet WS_TRIPS, we have:

    fare_amount | Avg=13.82 | Sum=1382 | Max=57.5 | Min=0 | Count=100
    tip_amount | Avg=2.03 | Sum=203.18 | Max=11.66 | Min=0 | Count=100

    Tip: The command vbCrLf adds new lines in strings. Check the output in the Immediate window using the command Debug.Print.

15.3 Utils: Number of Rows and Columns

  1. Create a function countNumberOfRows that receives a worksheet name as a String parameter wsName and returns a Long value corresponding to the number of filled rows in this worksheet.
  2. Create a function countNumberOfCols that receives a worksheet name as a String parameter wsName and returns a Long value corresponding to the number of filled columns in this worksheet.
  3. Write a function DiffRows that receives as parameters two 2D arrays and returns a Long value corresponding to the difference in the number of rows. For example, if array a1 has 10 lines and array a2 has 8 lines, the function returns 2.

Tip: An empty cell is equal to "" (i.e., empty string). Start counting from the first cell of worksheet WS_TRIPS.

15.4 Loading the Data

Write a sub called LoadTripRecords that executes the following tasks.

  1. Determine the values of public variables numRows and numCols using the functions countNumberOfRows and countNumberOfColumns, respectively. For both functions, use the worksheet name stored in WS_TRIPS as an argument.
  2. Dimension the dynamic array tripRecords so that it has numRows rows and numCols columns. In other words, it now has numCols \(\times\) numRows elements.
  3. Store all the data (i.e., headings and values) from worksheet WS_TRIPS into the public array tripRecords.

15.5 Cleaning the Data

  1. Write a function called IsRowValid to check if a row from the array tripRecords contains valid data. This function receives a row number as a parameter row (Long) and returns True (i.e., a Boolean type) only if:

    • trip_distance is higher than 0, and
    • fare_amount is higher than 0, and
    • passenger_count is higher than 0.

    Tip: Notice that the first three data rows of worksheet WS_TRIPS are invalid.

  2. Create a Sub called CleanInvalidTrips that fills the public array cleanedTripRecords with only the valid rows from array tripRecords (use the IsRowValid function).

15.6 Writing the Data

  1. Create a Sub called WriteValidTripRecords that writes the contents of the public array cleanedTripRecords in worksheet WS_TRIPS_CLEANED.
  2. Add a button to the worksheet WS_TRIPS in column M with the caption “Clean trip records”, which, when clicked, executes a sub called SaveCleanData that:
    • Calls LoadTripRecords.
    • Calls CleanInvalidTrips.
    • Calls WriteValidTripRecords.
    • Shows a message box with the message “ rows have been cleaned.”, where “” corresponds to the difference in the number of rows between public arrays tripRecords and cleanedTripRecords.