| pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_location_id | dropoff_location_id | fare_amount | tip_amount |
|---|---|---|---|---|---|---|---|
| 2018-06-21 05:52 | 2018-06-21 06:00 | 0 | 1.58 | 148 | 231 | 7.5 | 1.76 |
| 2018-06-21 06:17 | 2018-06-21 06:21 | 1 | 0 | 140 | 262 | 4.5 | 1 |
| 2018-06-21 06:26 | 2018-06-21 06:53 | 1 | 1.69 | 230 | 141 | 0 | 3.36 |
| 2018-06-21 06:28 | 2018-06-22 05:57 | 1 | 1.64 | 68 | 230 | 17.5 | 3.66 |
| 2018-06-21 06:48 | 2018-06-21 07:03 | 1 | 1.66 | 236 | 161 | 11 | 2.36 |
| 2018-06-21 07:38 | 2018-06-21 07:52 | 2 | 2.3 | 237 | 234 | 11.5 | 2.46 |
| 2018-06-21 08:05 | 2018-06-21 08:11 | 6 | 1.02 | 79 | 137 | 6.5 | 0 |
| 2018-06-21 08:26 | 2018-06-22 07:51 | 5 | 1.72 | 230 | 90 | 16 | 3.46 |
| 2018-06-21 09:04 | 2018-06-21 09:07 | 3 | 11.27 | 138 | 163 | 45.5 | 9 |
| 2018-06-21 11:41 | 2018-06-21 11:48 | 1 | 1.46 | 43 | 43 | 7.5 | 1.76 |
| 2018-06-21 12:00 | 2018-06-21 12:27 | 1 | 1.95 | 162 | 142 | 16.5 | 3.11 |
| 2018-06-21 12:01 | 2018-06-21 12:01 | 2 | 0 | 112 | 112 | 2.5 | 0 |
| 2018-06-21 12:02 | 2018-06-21 12:11 | 1 | 0.61 | 161 | 233 | 7 | 0 |
| 2018-06-21 12:02 | 2018-06-21 12:14 | 1 | 1.8 | 162 | 90 | 9.5 | 2.05 |
| 2018-06-21 12:03 | 2018-06-21 12:28 | 1 | 7.99 | 138 | 262 | 25 | 4 |
| 2018-06-21 12:04 | 2018-06-21 12:28 | 1 | 1.59 | 186 | 162 | 15 | 3.16 |
| 2018-06-21 12:04 | 2018-06-21 12:37 | 1 | 5.58 | 24 | 233 | 24 | 7.44 |
| 2018-06-21 12:06 | 2018-06-21 12:10 | 5 | 0.89 | 142 | 163 | 5.5 | 0 |
| 2018-06-21 12:07 | 2018-06-21 12:38 | 3 | 1.8 | 90 | 163 | 18.5 | 0 |
| 2018-06-21 12:07 | 2018-06-21 12:18 | 1 | 2.12 | 48 | 239 | 0 | 2.06 |
18 NYC Taxi Ride Records Analysis
In the following exercises, we will clean and analyze a collection of taxi ride records from New York City (see Table 18.1). The records are generated from individual taxi trips. Each trip is logged electronically through a digital metering system that captures detailed information from the moment a taxi trip starts (when a passenger is picked up) to the moment the trip ends (when the passenger reaches their destination). Pickup and drop-off locations correspond to taxi zones within the city’s five districts.
In Excel, the cell reference notation R1C1 is an alternative to the standard A1 notation. The R stands for row, and the C stands for column. You can enable R1C1 notation in Excel by going to File > Options > Formulas > R1C1 reference style. This notation is useful when working with arrays in VBA.
18.1 Defining Data Structures
Define the following as public so they are available for all the subs/functions in the module(s):
tripRecords: a dynamic array ofVariantvalues.cleanedTripRecords: a dynamic array ofVariantvalues.numRows: aLongvariable.numCols: aLongvariable.WS_TRIPS: aStringconstant equals"nyc_trip_records".WS_TRIPS_CLEANED: aStringconstant equals"nyc_trip_records_processed".WS_TRIPS: aStringconstant equals"nyc_trip_records".WS_TRIPS_CLEANED: aStringconstant equals"nyc_trip_records_processed".
18.2 Analyzing Data
Create a function called AnalyzeData that:
Receives a
Rangeparametertableconsisting of a table in an Excel worksheet with headers and numerical values (for example, the table in worksheetWS_TRIPS) and returns aStringtype.Calculates for each column of the table parameter, the following summary statistics: maximum, minimum, average, total count, and total sum. You cannot use Excel functions in the calculation.
Returns a
Stringpresenting all summary statistics. For each column, the statistics (i.e., avg, sum, max, min, and count) should be formatted as follows:<COLUMN NAME> | Avg=<AVG> | Sum=<SUM> | Max=<MAX> | Min=<MIN> | Count=<COUNT>.For example, passing the two last columns in worksheet
WS_TRIPS, we have:Tip: The command
vbCrLfadds new lines in strings. Check the output in theImmediate windowusing the commandDebug.Print.
18.3 Utils: Number of Rows and Columns
- Create a function
countNumberOfRowsthat receives a worksheet name as aStringparameterwsNameand returns aLongvalue corresponding to the number of filled rows in this worksheet. - Create a function
countNumberOfColsthat receives a worksheet name as aStringparameterwsNameand returns aLongvalue corresponding to the number of filled columns in this worksheet. - Write a function
DiffRowsthat receives as parameters two 2D arrays and returns aLongvalue corresponding to the difference in the number of rows. For example, if arraya1has 10 lines and arraya2has 8 lines, the function returns 2.
Tip: An empty cell is equal to "" (i.e., empty string). Start counting from the first cell of worksheet WS_TRIPS.
18.4 Loading the Data
Write a sub called LoadTripRecords that executes the following tasks.
- Determine the values of public variables
numRowsandnumColsusing the functionscountNumberOfRowsandcountNumberOfColumns, respectively. For both functions, use the worksheet name stored inWS_TRIPSas an argument. - Dimension the dynamic array
tripRecordsso that it hasnumRowsrows andnumColscolumns. In other words, it now hasnumCols\(\times\)numRowselements. - Store all the data (i.e., headings and values) from worksheet
WS_TRIPSinto the public arraytripRecords.
18.5 Cleaning the Data
Write a function called
IsRowValidto check if a row from the arraytripRecordscontains valid data. This function receives a row number as a parameterrow(Long) and returnsTrue(i.e., aBooleantype) only if:trip_distanceis higher than 0, andfare_amountis higher than 0, andpassenger_countis higher than 0.
Tip: Notice that the first three data rows of worksheet
WS_TRIPSare invalid.Create a Sub called
CleanInvalidTripsthat fills the public arraycleanedTripRecordswith only the valid rows from arraytripRecords(use theIsRowValidfunction).
18.6 Writing the Data
- Create a Sub called
WriteValidTripRecordsthat writes the contents of the public arraycleanedTripRecordsin worksheetWS_TRIPS_CLEANED. - Add a button to the worksheet
WS_TRIPSin columnMwith the caption “Clean trip records”, which, when clicked, executes a sub calledSaveCleanDatathat:- Calls
LoadTripRecords. - Calls
CleanInvalidTrips. - Calls
WriteValidTripRecords. - Shows a message box with the message “
rows have been cleaned.”, where “ ” corresponds to the difference in the number of rows between public arrays tripRecordsandcleanedTripRecords.
- Calls