pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_location_id | dropoff_location_id | fare_amount | tip_amount |
---|---|---|---|---|---|---|---|
2018-06-21 05:52 | 2018-06-21 06:00 | 0 | 1.58 | 148 | 231 | 7.5 | 1.76 |
2018-06-21 06:17 | 2018-06-21 06:21 | 1 | 0 | 140 | 262 | 4.5 | 1 |
2018-06-21 06:26 | 2018-06-21 06:53 | 1 | 1.69 | 230 | 141 | 0 | 3.36 |
2018-06-21 06:28 | 2018-06-22 05:57 | 1 | 1.64 | 68 | 230 | 17.5 | 3.66 |
2018-06-21 06:48 | 2018-06-21 07:03 | 1 | 1.66 | 236 | 161 | 11 | 2.36 |
2018-06-21 07:38 | 2018-06-21 07:52 | 2 | 2.3 | 237 | 234 | 11.5 | 2.46 |
2018-06-21 08:05 | 2018-06-21 08:11 | 6 | 1.02 | 79 | 137 | 6.5 | 0 |
2018-06-21 08:26 | 2018-06-22 07:51 | 5 | 1.72 | 230 | 90 | 16 | 3.46 |
2018-06-21 09:04 | 2018-06-21 09:07 | 3 | 11.27 | 138 | 163 | 45.5 | 9 |
2018-06-21 11:41 | 2018-06-21 11:48 | 1 | 1.46 | 43 | 43 | 7.5 | 1.76 |
2018-06-21 12:00 | 2018-06-21 12:27 | 1 | 1.95 | 162 | 142 | 16.5 | 3.11 |
2018-06-21 12:01 | 2018-06-21 12:01 | 2 | 0 | 112 | 112 | 2.5 | 0 |
2018-06-21 12:02 | 2018-06-21 12:11 | 1 | 0.61 | 161 | 233 | 7 | 0 |
2018-06-21 12:02 | 2018-06-21 12:14 | 1 | 1.8 | 162 | 90 | 9.5 | 2.05 |
2018-06-21 12:03 | 2018-06-21 12:28 | 1 | 7.99 | 138 | 262 | 25 | 4 |
2018-06-21 12:04 | 2018-06-21 12:28 | 1 | 1.59 | 186 | 162 | 15 | 3.16 |
2018-06-21 12:04 | 2018-06-21 12:37 | 1 | 5.58 | 24 | 233 | 24 | 7.44 |
2018-06-21 12:06 | 2018-06-21 12:10 | 5 | 0.89 | 142 | 163 | 5.5 | 0 |
2018-06-21 12:07 | 2018-06-21 12:38 | 3 | 1.8 | 90 | 163 | 18.5 | 0 |
2018-06-21 12:07 | 2018-06-21 12:18 | 1 | 2.12 | 48 | 239 | 0 | 2.06 |
15 NYC Taxi Ride Records Analysis
In the following exercises, we will clean and analyze a collection of taxi ride records from New York City (see Table 15.1). The records are generated from individual taxi trips. Each trip is logged electronically through a digital metering system that captures detailed information from the moment a taxi trip starts (when a passenger is picked up) to the moment the trip ends (when the passenger reaches their destination). Pickup and drop-off locations correspond to taxi zones within the city’s five districts.
In Excel, the cell reference notation R1C1
is an alternative to the standard A1
notation. The R
stands for row, and the C
stands for column. You can enable R1C1
notation in Excel by going to File > Options > Formulas > R1C1 reference style
. This notation is useful when working with arrays in VBA.
15.1 Defining Data Structures
Define the following as public so they are available for all the subs/functions in the module(s):
tripRecords
: a dynamic array ofVariant
values.cleanedTripRecords
: a dynamic array ofVariant
values.numRows
: aLong
variable.numCols
: aLong
variable.WS_TRIPS
: aString
constant equals"nyc_trip_records"
.WS_TRIPS_CLEANED
: aString
constant equals"nyc_trip_records_processed"
.WS_TRIPS
: aString
constant equals"nyc_trip_records"
.WS_TRIPS_CLEANED
: aString
constant equals"nyc_trip_records_processed"
.
15.2 Analyzing Data
Create a function called AnalyzeData
that:
Receives a
Range
parametertable
consisting of a table in an Excel worksheet with headers and numerical values (for example, the table in worksheetWS_TRIPS
) and returns aString
type.Calculates for each column of the table parameter, the following summary statistics: maximum, minimum, average, total count, and total sum. You cannot use Excel functions in the calculation.
Returns a
String
presenting all summary statistics. For each column, the statistics (i.e., avg, sum, max, min, and count) should be formatted as follows:<COLUMN NAME> | Avg=<AVG> | Sum=<SUM> | Max=<MAX> | Min=<MIN> | Count=<COUNT>
.For example, passing the two last columns in worksheet
WS_TRIPS
, we have:fare_amount | Avg=13.82 | Sum=1382 | Max=57.5 | Min=0 | Count=100 tip_amount | Avg=2.03 | Sum=203.18 | Max=11.66 | Min=0 | Count=100
Tip: The command
vbCrLf
adds new lines in strings. Check the output in theImmediate window
using the commandDebug.Print
.
15.3 Utils: Number of Rows and Columns
- Create a function
countNumberOfRows
that receives a worksheet name as aString
parameterwsName
and returns aLong
value corresponding to the number of filled rows in this worksheet. - Create a function
countNumberOfCols
that receives a worksheet name as aString
parameterwsName
and returns aLong
value corresponding to the number of filled columns in this worksheet. - Write a function
DiffRows
that receives as parameters two 2D arrays and returns aLong
value corresponding to the difference in the number of rows. For example, if arraya1
has 10 lines and arraya2
has 8 lines, the function returns 2.
Tip: An empty cell is equal to ""
(i.e., empty string). Start counting from the first cell of worksheet WS_TRIPS
.
15.4 Loading the Data
Write a sub called LoadTripRecords
that executes the following tasks.
- Determine the values of public variables
numRows
andnumCols
using the functionscountNumberOfRows
andcountNumberOfColumns
, respectively. For both functions, use the worksheet name stored inWS_TRIPS
as an argument. - Dimension the dynamic array
tripRecords
so that it hasnumRows
rows andnumCols
columns. In other words, it now hasnumCols
\(\times\)numRows
elements. - Store all the data (i.e., headings and values) from worksheet
WS_TRIPS
into the public arraytripRecords
.
15.5 Cleaning the Data
Write a function called
IsRowValid
to check if a row from the arraytripRecords
contains valid data. This function receives a row number as a parameterrow
(Long
) and returnsTrue
(i.e., aBoolean
type) only if:trip_distance
is higher than 0, andfare_amount
is higher than 0, andpassenger_count
is higher than 0.
Tip: Notice that the first three data rows of worksheet
WS_TRIPS
are invalid.Create a Sub called
CleanInvalidTrips
that fills the public arraycleanedTripRecords
with only the valid rows from arraytripRecords
(use theIsRowValid
function).
15.6 Writing the Data
- Create a Sub called
WriteValidTripRecords
that writes the contents of the public arraycleanedTripRecords
in worksheetWS_TRIPS_CLEANED
. - Add a button to the worksheet
WS_TRIPS
in columnM
with the caption “Clean trip records”, which, when clicked, executes a sub calledSaveCleanData
that:- Calls
LoadTripRecords
. - Calls
CleanInvalidTrips
. - Calls
WriteValidTripRecords
. - Shows a message box with the message “
rows have been cleaned.”, where “ ” corresponds to the difference in the number of rows between public arrays tripRecords
andcleanedTripRecords
.
- Calls