L.A. Parking Citations - EDA Part II

Introduction

This is the second of a four-part exploratory data analysis project on an open source parking citation dataset. The project utilizes Python and several of its numerical packages (NumPy, SciPy, Matplotlib, Pandas, GeoPandas). The dataset I used for this EDA is available for free on Kaggle or at the Los Angeles open source data website.

In this post, we’re going to explore the date-time variables associated with this citation dataset. We’ll look into when citations are given throughout the day, week, and year. We’ll also examine revenue trends, and how many citations are issued on special days of the week or year.

Load Libraries

We’ll be using numpy, scipy, pandas, and matplotlib in this portion of the EDA.

import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Combine Datasets

If you read my previous post, L.A. Parking Citations - EDA Part I, you’ll recall that we broke the dataset up into four parts in order to save easily store it in a repository. Now that we’re looking to work with the dataset once again, we need to combine these pieces back together.

# load individual sets
la_ticket_2017_1 = pd.read_csv("~/Documents/data_science/py_la_tickets/input/la_ticket_2017_1.csv")
la_ticket_2017_2 = pd.read_csv("~/Documents/data_science/py_la_tickets/input/la_ticket_2017_2.csv")
la_ticket_2017_3 = pd.read_csv("~/Documents/data_science/py_la_tickets/input/la_ticket_2017_3.csv")
la_ticket_2017_4 = pd.read_csv("~/Documents/data_science/py_la_tickets/input/la_ticket_2017_4.csv")

# combine dataset
la_ticket_2017 = pd.concat([la_ticket_2017_1, la_ticket_2017_2, la_ticket_2017_3, la_ticket_2017_4])

# view dataset
la_ticket_2017.head()

	Unnamed: 0	Issue time	Meter Id	RP State Plate	Make	Color	Location	Violation code	Violation Description	Fine amount	Latitude	Longitude	Month	Day	Day of Year
0	2777524	104.0	NaN	OR	TOYT	BL	1822 WINONA BLVD	80.56E4+	RED ZONE	93.0	6.470239e+06	1.860397e+06	3	1	60
1	2777558	1050.0	NaN	CA	HOND	BK	710 EL CENTRO AV	4000A1	NO EVIDENCE OF REG	50.0	9.999900e+04	9.999900e+04	3	28	87
2	2827647	115.0	NaN	CA	HOND	GY	7502 WILLIS AVENUE	22500E	BLOCKING DRIVEWAY	68.0	6.424012e+06	1.897916e+06	1	3	3
3	2827648	126.0	NaN	CA	FORD	WT	14957 FRIAR STREET	5204A-	DISPLAY OF TABS	25.0	6.422948e+06	1.890266e+06	1	3	3
4	2827649	132.0	NaN	CA	HOND	BK	14634 SYLVAN ST	22514	FIRE HYDRANT	68.0	6.425072e+06	1.889888e+06	1	3	3

Citations by Day of the Year

Previously, we created the ‘Day of Year’ variable, visible in the above dataset. This variable will be useful in understanding trends in the number of citations issued throughout the year, and will allow us to easily transform the date-time data.

Let’s have a look at this variable now.

la_ticket_2017.hist('Day of Year', bins=365, color = 'seagreen')

png

I’ve binned the results such that each bar corresponds to each day of the year. Number of citations issued is described on the y-axis, and the sequential day of the year is described on the x-axis.

This visualization makes a few things apparent. First, it’s apparent that at least 500 parking citations are given each day. Citations issued max out at around 9000 in a single day. There are three distinct levels of citations being issued. Daily totals cluster at ~2500, ~4000, and ~7500 citations issued.

Let’s see if these clusters hold when we examine how much revenue is generated, each day, on average. To visualize revenue generated, we can sum the ‘Fine amount’ variable as grouped by the day of the year and then create a density plot from this data.

sum_fine_by_day = la_ticket_2017.groupby('Day of Year')['Fine amount'].sum()

sum_fine_by_day.plot.density(color = 'lightslategray')

png

The groupings hold after transforming the data in terms of revenue. We can see three peaks in average daily revenue. A local maximum appears at approximately $175,000, and another at $300,000. These peaks likely correspond to 2500 and 4000 citation groupings. A much larger global maximum arises at just under $600,000$. This corresponds to most frequency grouping of daily citations issued, ~7500.

It’s interesting to see that a typical day will generate between $150000 and $600000 in revenue from parking citations.

Now that we’re sure that there is a discernable pattern in daily citations issued, let’s explore what dictates that pattern. Parking regulations typically change in accordance to the day of the week. For example, many cities have no meter tolls on Sundays. It’s possible that changes in regulation and enforcement throughout the week result in these groupings.

To investigate this hypothesis, we’ll transform and visualize the data in terms of ‘day of the week’.

Citations by Day of the Week

First, we’ll create a dedicated Pandas Series that holds information on parking citations issued per day. That way we don’t have to re-run groupby and count commands each time we want to access data organized in this fashion.

# create a ticket count for each day
tickets_by_day = la_ticket_2017.groupby('Day of Year').count()

# create a stand alone series
ticket_by_day = tickets_by_day.iloc[:,0]

# show created series
ticket_by_day.head()

Day of Year
  1115
  1344
  8109
  7035
  7916
Name: Unnamed: 0, dtype: int64

Next, we’ll test out creating a Day of the Week variable on Sunday. The first day of 2017 was a Sunday, so that means that ‘Day of Year’ number 1, 8, 15, … were all Sundays. We can loop through the Series collecting these Sundays and stuffing them into a new variable.

i = 1 
sunday = []

# while counter is within dataset
while i <= len(ticket_by_day):
    # extract element and append to results array
    sunday.append(ticket_by_day[i])
    # proceed to next week
    i = i + 7

# format and display results
np.hstack(sunday)

array([1115, 2348, 2112, 1224, 1952, 1822, 2251, 1865, 2304, 1931, 2142,
       2372, 2312, 2131, 2479, 2074, 2295, 2200, 2020, 1964, 2087, 2257,
       2285, 2110, 2196, 2553, 2099, 2271, 2499, 2386, 2257, 2243, 2316,
       2407, 1972, 1969, 2124, 2095, 2328, 2115, 2140, 2026, 2319, 1892,
       1920, 1740, 1821, 1803, 2147, 1965, 1642, 1255, 1604])

Success!

Now to create a general function to apply this process to each day of the week.

# define the function name
def by_weekday(day):
    # set the counter equal to the argument
    i = day
    # create an empty array for results
    day = []
    
    # apply while loop to extract data
    while i <= 365:
        day.append(ticket_by_day[i])
        i = i + 7
    
    # format data
    np.hstack(day)
    day = pd.Series(day)
    
    # return results
    return(day)


# testing the function on Sunday
sunday = by_weekday(1)
sunday.head()

  1115
  2348
  2112
  1224
  1952
dtype: int64

Testing the function on Sunday produces the same results as above. Now we can proceed to apply this to each day of the week, creating seven new Pandas Series, then combining them into another DataFrame.

monday = by_weekday(2)
tuesday = by_weekday(3)
wednesday = by_weekday(4)
thursday = by_weekday(5)
friday = by_weekday(6)
saturday = by_weekday(7)

ticket_by_weekday = pd.concat([sunday, monday, tuesday, wednesday, thursday, friday, saturday], axis = 1)
ticket_by_weekday.columns = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
ticket_by_weekday.head()

	Sunday	Monday	Tuesday	Wednesday	Thursday	Friday	Saturday
0	1115	1344.0	8109.0	7035.0	7916.0	7233.0	4859.0
1	2348	6604.0	8087.0	7204.0	6939.0	6980.0	5040.0
2	2112	1375.0	8229.0	7312.0	8201.0	6128.0	4409.0
3	1224	7612.0	8431.0	8127.0	8511.0	7610.0	4806.0
4	1952	8542.0	8678.0	8203.0	8767.0	8016.0	5149.0

Great, the dataset is now in terms of ‘Day of the Week’. Let’s visualize this transformation.

ticket_by_weekday.plot()

png

A beautiful visualization if I do say so myself! It looks like my hunch that the groupings were based on day of the week, was correct. Sunday regularly produces around 2000 citations. Saturday regularly produces 45000 citations. The weekdays average around 7000 to 8000 citations. This graph is a bit hard to follow, the data might be easier to understand if we visualize it as a series of boxplots.

ticket_by_weekday.boxplot()

png

This makes the mean citations issued much clearer. There is a very distinct difference between the number of citations issued on Sundays versus Mondays. But is there a difference between the weekdays? We can do a series of statistical tests to see if the means are significantly different.

There are many procedures to test if two sample means are significantly different. One of the most straightforward is the t-test. Let’s conduct a two-population t-test on the Sunday and Monday data to be sure that the difference observed in the visualization is statistically significant.

# t-test Sunday and Monday
stats.ttest_ind(ticket_by_weekday['Sunday'], ticket_by_weekday['Monday'])

Ttest_indResult(statistic=-14.114555018346483, pvalue=9.993642210838924e-26)

The mean number of citations issued on Sundays is statistically different than the mean number of citations issued on Mondays. Instead of doing t-tests by hand for all the pairs, let’s write a little loop that will do the leg work for us.

# initial day to compare
for day in range(7):
    # second day for the comparison
    for day_2 in range(day+1, 7):
        # print the days being compared
        print(list(ticket_by_weekday)[day], list(ticket_by_weekday)[day_2])
        # conduct the t-test
        print(stats.ttest_ind(ticket_by_weekday.iloc[:,day], ticket_by_weekday.iloc[:,day_2]))

Sunday Monday
Ttest_indResult(statistic=-14.114555018346483, pvalue=9.993642210838924e-26)
Sunday Tuesday
Ttest_indResult(statistic=-39.200105865818514, pvalue=2.5709275857822333e-63)
Sunday Wednesday
Ttest_indResult(statistic=-70.26192329299684, pvalue=3.3199297166490703e-88)
Sunday Thursday
Ttest_indResult(statistic=-36.573858454513214, pvalue=1.9044297958758257e-60)
Sunday Friday
Ttest_indResult(statistic=-28.946699561781212, pvalue=5.397450684984687e-51)
Sunday Saturday
Ttest_indResult(statistic=-23.6204632589478, pvalue=3.772390749169104e-43)
Monday Tuesday
Ttest_indResult(statistic=-3.7245765466041916, pvalue=0.0003209744709376137)
Monday Wednesday
Ttest_indResult(statistic=-2.931192014701394, pvalue=0.004168140587481362)
Monday Thursday
Ttest_indResult(statistic=-3.4884588693442655, pvalue=0.0007185137137979519)
Monday Friday
Ttest_indResult(statistic=-0.8990678820480197, pvalue=0.37073406361282246)
Monday Saturday
Ttest_indResult(statistic=7.205958264020363, pvalue=1.0242512878117741e-10)
Tuesday Wednesday
Ttest_indResult(statistic=2.199934610785886, pvalue=0.03006852630328924)
Tuesday Thursday
Ttest_indResult(statistic=0.3307029989528652, pvalue=0.7415471929779237)
Tuesday Friday
Ttest_indResult(statistic=4.52141745074799, pvalue=1.661069033071054e-05)
Tuesday Saturday
Ttest_indResult(statistic=22.27992785117015, pvalue=5.604256381357363e-41)
Wednesday Thursday
Ttest_indResult(statistic=-1.6734243085378289, pvalue=0.09730802346354772)
Wednesday Friday
Ttest_indResult(statistic=3.6005602951584, pvalue=0.0004922318699883966)
Wednesday Saturday
Ttest_indResult(statistic=31.563121936548725, pvalue=1.9078872746154058e-54)
Thursday Friday
Ttest_indResult(statistic=4.086626355994712, pvalue=8.737245777165502e-05)
Thursday Saturday
Ttest_indResult(statistic=20.84950833547568, pvalue=1.457426897692227e-38)
Friday Saturday
Ttest_indResult(statistic=14.788541058567853, pvalue=4.029675334598723e-27)

These results aren’t pretty, but they do the job! Running through the results, we can see that Sunday is significantly different from each of the other days of the week. Monday and Friday are not significantly different; however, Monday and Wednesday are. There is no meaningful difference between Tuesday and Thursday. The average number of citations issued on Saturday is significantly different from each of the other days of the week as well.

That explains our daily-citation groupings!

Let’s mull over the date-time data from another perspective, and see if we can come up with any more interesting insights.

Citations by Month

Tickets issued by month. In the previous data-wrangling post, we created a stand-alone ‘Month’ variable for each citation. Let’s visualize these now and see if there’s anything interesting to look at.

la_ticket_2017.hist('Month', bins=12, color = 'indigo', edgecolor='black', linewidth=1)

png

A uniform distribution throughout the year. the effect of a short February is evident in the total tickets issued. It seems that November is also a bit slower of a month. There doesn’t seem to be any pattern to speak of here. How about a pattern within the month?

la_ticket_2017.hist('Day', bins=31, color = 'firebrick', edgecolor = 'black')

png

We see a sharp decline in tickets issued on the 29th, 30th, and 31st days of the month. This can likely be attributed to Not all months having a 29th, 30th, and 31st day. Let’s see what happens when we remove the effect of February. Nothing surprising here, last 29, 30, 31 drop off due to not being in every month.

la_ticket_2017[la_ticket_2017['Month'] != 2].hist('Day', bins = 31, color = 'firebrick', edgecolor = 'black')

png

Removing February helps days’ number 29 and 30 close the previously visualized gap. Other than that, there isn’t much of a difference.

Citations by Hours

When visualizing the time a citation is issued, we can expect to see some big trends. Parking enforcement hours, typically sometime between 8am and 5pm, should see a huge surge in citations. The early morning hours should see barely any citations.

Let’s see what kind of story the data tells us this time.

la_ticket_2017.hist('Issue time', range=(0, 2400), bins=200, color = 'forestgreen')

png

This histogram has been binned into 200 columns. This means that each hour has a six, ten-minute columns, and there are several columns of space between the hours (there is no 10:70 AM).

Like we predicted, there are very few citations issued at 5 AM. A massive spike of citations, from a total of 5000, to over 70,000 occurs early in the 8 o’clock hour. We’ll have to take a closer look at this hour specifically. There’s a lull of enforcement after that, and again a jump at 10 AM.

It would appear that the parking officers get a bit lazy after lunch; citations issued drop by half, and steadily decline thereafter.

la_ticket_2017.hist('Issue time', range=(800, 859), bins=60, color = 'forestgreen')

png

Nearly a 500% increase in citations issued between 8:04 AM and 8:05 AM. Perhaps that 4 minutes is the time it takes the official to track down their first victim after clocking in at 8:00 AM. Either way, it’s handy information to have.

It’s also clear that citations group around the 5 minute interval, suggesting that officers have the ability to round the issue time, and sometimes utilize that liberty. I would have thought an electronic clock would time-stamp each citation, but apparently not.

Citations on Holidays

Finally, let’s separate a number of specific days of the year, and see how many citations are issued on these days as compared to other, ‘normal’ days. I’m talking seeing if parking officers feel the holiday spirit, or do they play the part of the Grinch?

# these are the holidays that came to my mind and their dates in 2017
holidays = pd.Series(['2017/01/01',
                      '2017/01/16',
                      '2017/02/14',
                      '2017/05/29',
                      '2017/07/04',
                      '2017/09/04',
                      '2017/10/31',
                      '2017/11/10',
                      '2017/11/23',
                      '2017/12/25'],
                     index = ['New Years Day',
                              'MLK Day',
                              'Valentines Day',
                              'Memorial Day',
                              'Independence Day',
                              'Labor Day',
                              'Halloween',
                              'Veterans Day',
                              'Thanksgiving',
                              'Christmas']
                    )

holidays

New Year’s Day       2017/01/01
MLK Day             2017/01/16
Valentine’s Day      2017/02/14
Memorial Day        2017/05/29
Independence Day    2017/07/04
Labor Day           2017/09/04
Halloween           2017/10/31
Veterans Day        2017/11/10
Thanksgiving        2017/11/23
Christmas           2017/12/25
dtype: object

# change the date format to day of the year for easy use
holidays_date = pd.to_datetime(holidays, format='%Y/%m/%d')
holidays_doy = holidays_date.dt.dayofyear
holidays_doy

New Year’s Day        1
MLK Day              16
Valentine’s Day      45
Memorial Day        149
Independence Day    185
Labor Day           247
Halloween           304
Veterans Day        314
Thanksgiving        327
Christmas           359
dtype: int64

After defining some holidays to consider, I’m going to count the number of tickets given on each of these days, and save the results to a new, ‘holiday_tickets’ Pandas DataFrame.

# create the empty dataframe, named after each holiday
holiday_tickets = pd.DataFrame(columns=['Number of Tickets'], index = holidays_doy.index)

# set a counter equal to zero
i = 0

# for each of the hoidays
while i <= len(holidays_doy) - 1:
    # sum the number of tickets given on that day and save the output to the new DataFrame
    holiday_tickets.iloc[i, ] = sum(la_ticket_2017['Day of Year'] == holidays_doy.iloc[i])
    # move to the next holiday
    i = i + 1

If that worked correctly, this should give us a bar plot of how many tickets are given on each of the prescribed holidays.

holiday_tickets.plot(kind = 'barh', color = 'slateblue').invert_yaxis()

png

Most holidays have far fewer tickets than the ~6000 citation daily average, That is with the exception of Halloween and Valentine’s day. I would assert that these are not federally recognized holidays, and therefore there is no change in enforcement.

In fact, I don’t believe that parking citation officers are feeling more lenient, they just get the day off!

Here’s the associated table.

holiday_tickets

	Number of Tickets
New Years Day	1115
MLK Day	1375
Valentines Day	7388
Memorial Day	1496
Independence Day	1445
Labor Day	1318
Halloween	7498
Veterans Day	1104
Thanksgiving	922
Christmas	893

Christmas turns out to have less citations than any other holiday, is it the least cited day of the entire year?

ticket_by_day.min()

I guess even the parking Scrooges feel a little bit of Christmas spirit!

Conclusions

In conclusion, you should park in L.A. at 5 AM on Christmas Day, especially if it’s a Sunday. If you park on a illegally on a Tuesday or Thursday at 8:20 AM or 12:10 PM, expect to pay!

You should definitely double check the parking signs between 8 AM and 1 PM on the weekdays, and don’t expect the citation officers to give you a break for Valentine’s Day!

There’s so much information stored in date-time variables, I think that’s what makes them so fun to work with. On the next segment of the L.A. Parking Citation Series, I’ll be considering the spatial variables! Where are citations issued? Where are the most grievous parking meters located?

Perhaps by combining the upcoming spatial analysis and this date-time information, you could piece together a devious plan to avoid those pesky parking citations!

Until next time,

- Fisher

Introduction

Load Libraries

Combine Datasets

Citations by Day of the Year

Citations by Day of the Week

Citations by Month

Citations by Hours

Citations on Holidays

Conclusions

Comments

Related Posts

The Stringr Package 16 Jun 2019

The Lubridate Package 16 Jun 2019

The Forcats Package 16 Jun 2019