31  NYC Taxi Rides Data Analysis with Excel

In this assignment, we will clean, explore, and analyze a comprehensive collection of taxi ride records from New York City (see file nyc_taxi_rides.csv). Records are generated from individual taxi trips. Each trip is logged electronically through a digital metering system that captures detailed information from the moment a taxi trip starts (when a passenger is picked up) to the moment it ends (when the passenger reaches their destination). Pickup and drop-off locations correspond to taxi zones (see nyc_taxi_zones.csv) within the city’s five districts, referred to as boroughs (see below).

New York City boroughs (districts) and taxi zones.

New York City boroughs (districts) and taxi zones.

31.1 🧠 Learning Goals

By completing this assignment, you will be able to:

  1. Efficiently filter, select, and transform data using Excel.
  2. Identify and correct common data issues, ensuring clean datasets.
  3. Augment and visualize datasets.

31.2 🔧Instructions

Upon completing the assignment, upload the following:

  • The .xlsx workbook.
  • The bar_avg_tip.svg figure.

31.3 📝Assignments

The following assignments are divided into the following sections:

  • Data Cleaning: Prepare the data for analysis.
  • Data Augmentation: Add new features to the dataset.
  • Data Formatting: Format the data for analysis.
  • Data Querying: Answer specific questions about the data.
  • Data Analysis: Perform statistical analysis on the data.

31.3.1 Importing the Data

  1. Import the .csv data from nyc_taxi_rides.csv into a worksheet “Raw”.
  2. Import the .csv data from nyc_taxi_zones.csv into a worksheet “Taxi zones”.

31.3.2 Cleaning the Data (worksheet “Processed”)

  1. Copy the data and headers from the following columns of the worksheet “Raw” to a new worksheet “Processed”.

    • pickup_datetime: The date and time when the meter was engaged.
    • dropoff_datetime: The date and time when the meter was disengaged.
    • pickup_location_id: Taxi zone in which the taximeter was engaged.
    • dropoff_location_id: Taxi zone where the taximeter was disengaged.
    • passenger_count: The number of passengers in the vehicle.
    • trip_distance: The elapsed trip distance in miles reported by the taximeter.
    • fare_amount: The time-and-distance fare calculated by the meter.
    • tip_amount: Tips given with credit card (cash tips are not accounted).
  2. Turn the copied data into an Excel Table.

  3. Delete all entries where

    • trip_distance ≤ 0, or
    • fare_amount ≤ 0, or
    • passenger_count ≤ 0.

31.3.3 Augmenting the Data (worksheet “Processed”)

  1. Convert the values from trip_distance from miles to kilometers (1 km = 0.621371 mile).
  2. Change the trip_distance column name to trip_distance_km.
  3. Create a column ride_length_class to categorize rides according to their trip_distance_km:
    • Short: trip_distance_km < 1,
    • Medium: 1≤ trip_distance_km < 5, and
    • Long: trip_distance_km ≥ 5.
  4. Use the VLOOKUP or XLOOKUP functions to look up in worksheet “Taxi zones” the names of the boroughs corresponding to each location in column pickup_location_id. Add these boroughs to a new column named pickup_borough.

31.3.4 Formatting the Data (worksheet “Processed”)

  1. Order columns as follows:

    1. pickup_borough
    2. pickup_location_id
    3. pickup_datetime
    4. dropoff_location_id
    5. dropoff_datetime
    6. passenger_count
    7. trip_distance_km
    8. ride_length_class
    9. fare_amount
    10. tip_amount
  2. Adjust column widths so all headings can be seen.

  3. Freeze the first two columns (pickup_borough and pickup_location_id) and the top row.

  4. Sort the data hierarchically (ascending order) according to the following levels:

    1. pickup_borough
    2. pickup_location_id
    3. pickup_datetime
  5. Apply conditional formatting to highlight (yellow fill) all rows where the tip_amount is higher than 10 dollars.

31.3.5 Data Querying (New Worksheet “Queries”)

  1. In cell B2, write a formula that calculates the average fare_amount.
  2. Use a formula in cell B3 to determine the total number of rides whose passenger_count is greater than 3 and tip_amount is greater than 10 dollars.
  3. Use a formula in cell B4 to find the most frequent dropoff_location_id for rides departing from Manhattan.

31.3.6 Data Analysis (New Worksheets “Histogram” and “Pivot”)

  1. Use the Analysis Toolpak to create a histogram of the column trip_distance_km (use ten bins). Display the histogram in a new worksheet “Histogram” (also show Pareto and Cumulative Percentage in the chart). Change the x-axis title to “Trip Distance (Km) and the y-axis title to”Frequency of rides”.

  2. Create a Pivot table of all the data on worksheet “Processed” and place this in a new worksheet called “Pivot”. Then, display per pickup_borough:

    • The total number of rides,
    • The average fare_amount, and
    • The average tip_amount.

    Sort the boroughs in descending order of average tip_amount.

31.3.7 Data Visualization (New Worksheet “Charts”)

  1. Using data from your pivot table, create a bar chart in a worksheet “Charts” showcasing the average tip_amount for each pickup_borough. Add to the chart:

    • An x-axis entitled “Departure borough”;
    • A y-axis entitled “Average tip amount (dollars)”;
    • A title “Tipping behavior of NYC taxi users”.
  2. Export the chart to a .SVG figure entitled bar_avg_tip.svg.