This is the second of a four-part exploratory data analysis project on an open source parking citation dataset. The project utilizes Python and several of its numerical packages (NumPy, SciPy, Matplotlib, Pandas, GeoPandas). The dataset I used for this EDA is available for free on Kaggle or at the Los Angeles open source data website.
In this post, we’re going to explore the date-time variables associated with this citation dataset. We’ll look into when citations are given throughout the day, week, and year. We’ll also examine revenue trends, and how many citations are issued on special days of the week or year.
We’ll be using numpy, scipy, pandas, and matplotlib in this portion of the EDA.
import numpy as np from scipy import stats import pandas as pd import matplotlib.pyplot as plt plt.style.use('ggplot')
If you read my previous post, L.A. Parking Citations - EDA Part I, you’ll recall that we broke the dataset up into four parts in order to save easily store it in a repository. Now that we’re looking to work with the dataset once again, we need to combine these pieces back together.
# load individual sets la_ticket_2017_1 = pd.read_csv("~/Documents/data_science/py_la_tickets/input/la_ticket_2017_1.csv") la_ticket_2017_2 = pd.read_csv("~/Documents/data_science/py_la_tickets/input/la_ticket_2017_2.csv") la_ticket_2017_3 = pd.read_csv("~/Documents/data_science/py_la_tickets/input/la_ticket_2017_3.csv") la_ticket_2017_4 = pd.read_csv("~/Documents/data_science/py_la_tickets/input/la_ticket_2017_4.csv") # combine dataset la_ticket_2017 = pd.concat([la_ticket_2017_1, la_ticket_2017_2, la_ticket_2017_3, la_ticket_2017_4]) # view dataset la_ticket_2017.head()
|Unnamed: 0||Issue time||Meter Id||RP State Plate||Make||Color||Location||Violation code||Violation Description||Fine amount||Latitude||Longitude||Month||Day||Day of Year|
|0||2777524||104.0||NaN||OR||TOYT||BL||1822 WINONA BLVD||80.56E4+||RED ZONE||93.0||6.470239e+06||1.860397e+06||3||1||60|
|1||2777558||1050.0||NaN||CA||HOND||BK||710 EL CENTRO AV||4000A1||NO EVIDENCE OF REG||50.0||9.999900e+04||9.999900e+04||3||28||87|
|2||2827647||115.0||NaN||CA||HOND||GY||7502 WILLIS AVENUE||22500E||BLOCKING DRIVEWAY||68.0||6.424012e+06||1.897916e+06||1||3||3|
|3||2827648||126.0||NaN||CA||FORD||WT||14957 FRIAR STREET||5204A-||DISPLAY OF TABS||25.0||6.422948e+06||1.890266e+06||1||3||3|
|4||2827649||132.0||NaN||CA||HOND||BK||14634 SYLVAN ST||22514||FIRE HYDRANT||68.0||6.425072e+06||1.889888e+06||1||3||3|
Citations by Day of the Year
Previously, we created the ‘Day of Year’ variable, visible in the above dataset. This variable will be useful in understanding trends in the number of citations issued throughout the year, and will allow us to easily transform the date-time data.
Let’s have a look at this variable now.
la_ticket_2017.hist('Day of Year', bins=365, color = 'seagreen')
I’ve binned the results such that each bar corresponds to each day of the year. Number of citations issued is described on the y-axis, and the sequential day of the year is described on the x-axis.
This visualization makes a few things apparent. First, it’s apparent that at least 500 parking citations are given each day. Citations issued max out at around 9000 in a single day. There are three distinct levels of citations being issued. Daily totals cluster at ~2500, ~4000, and ~7500 citations issued.
Let’s see if these clusters hold when we examine how much revenue is generated, each day, on average. To visualize revenue generated, we can sum the ‘Fine amount’ variable as grouped by the day of the year and then create a density plot from this data.
sum_fine_by_day = la_ticket_2017.groupby('Day of Year')['Fine amount'].sum() sum_fine_by_day.plot.density(color = 'lightslategray')
The groupings hold after transforming the data in terms of revenue. We can see three peaks in average daily revenue. A local maximum appears at approximately $175,000, and another at $300,000. These peaks likely correspond to 2500 and 4000 citation groupings. A much larger global maximum arises at just under $600,000$. This corresponds to most frequency grouping of daily citations issued, ~7500.
It’s interesting to see that a typical day will generate between $150000 and $600000 in revenue from parking citations.
Now that we’re sure that there is a discernable pattern in daily citations issued, let’s explore what dictates that pattern. Parking regulations typically change in accordance to the day of the week. For example, many cities have no meter tolls on Sundays. It’s possible that changes in regulation and enforcement throughout the week result in these groupings.
To investigate this hypothesis, we’ll transform and visualize the data in terms of ‘day of the week’.
Citations by Day of the Week
First, we’ll create a dedicated Pandas Series that holds information on parking citations issued per day. That way we don’t have to re-run groupby and count commands each time we want to access data organized in this fashion.
# create a ticket count for each day tickets_by_day = la_ticket_2017.groupby('Day of Year').count() # create a stand alone series ticket_by_day = tickets_by_day.iloc[:,0] # show created series ticket_by_day.head()
Day of Year 1 1115 2 1344 3 8109 4 7035 5 7916 Name: Unnamed: 0, dtype: int64
Next, we’ll test out creating a Day of the Week variable on Sunday. The first day of 2017 was a Sunday, so that means that ‘Day of Year’ number 1, 8, 15, … were all Sundays. We can loop through the Series collecting these Sundays and stuffing them into a new variable.
i = 1 sunday =  # while counter is within dataset while i <= len(ticket_by_day): # extract element and append to results array sunday.append(ticket_by_day[i]) # proceed to next week i = i + 7 # format and display results np.hstack(sunday)
array([1115, 2348, 2112, 1224, 1952, 1822, 2251, 1865, 2304, 1931, 2142, 2372, 2312, 2131, 2479, 2074, 2295, 2200, 2020, 1964, 2087, 2257, 2285, 2110, 2196, 2553, 2099, 2271, 2499, 2386, 2257, 2243, 2316, 2407, 1972, 1969, 2124, 2095, 2328, 2115, 2140, 2026, 2319, 1892, 1920, 1740, 1821, 1803, 2147, 1965, 1642, 1255, 1604])
Now to create a general function to apply this process to each day of the week.
# define the function name def by_weekday(day): # set the counter equal to the argument i = day # create an empty array for results day =  # apply while loop to extract data while i <= 365: day.append(ticket_by_day[i]) i = i + 7 # format data np.hstack(day) day = pd.Series(day) # return results return(day) # testing the function on Sunday sunday = by_weekday(1) sunday.head()
0 1115 1 2348 2 2112 3 1224 4 1952 dtype: int64
Testing the function on Sunday produces the same results as above. Now we can proceed to apply this to each day of the week, creating seven new Pandas Series, then combining them into another DataFrame.
monday = by_weekday(2) tuesday = by_weekday(3) wednesday = by_weekday(4) thursday = by_weekday(5) friday = by_weekday(6) saturday = by_weekday(7) ticket_by_weekday = pd.concat([sunday, monday, tuesday, wednesday, thursday, friday, saturday], axis = 1) ticket_by_weekday.columns = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'] ticket_by_weekday.head()
Great, the dataset is now in terms of ‘Day of the Week’. Let’s visualize this transformation.
A beautiful visualization if I do say so myself! It looks like my hunch that the groupings were based on day of the week, was correct. Sunday regularly produces around 2000 citations. Saturday regularly produces 45000 citations. The weekdays average around 7000 to 8000 citations. This graph is a bit hard to follow, the data might be easier to understand if we visualize it as a series of boxplots.
This makes the mean citations issued much clearer. There is a very distinct difference between the number of citations issued on Sundays versus Mondays. But is there a difference between the weekdays? We can do a series of statistical tests to see if the means are significantly different.
There are many procedures to test if two sample means are significantly different. One of the most straightforward is the t-test. Let’s conduct a two-population t-test on the Sunday and Monday data to be sure that the difference observed in the visualization is statistically significant.
# t-test Sunday and Monday stats.ttest_ind(ticket_by_weekday['Sunday'], ticket_by_weekday['Monday'])
The mean number of citations issued on Sundays is statistically different than the mean number of citations issued on Mondays. Instead of doing t-tests by hand for all the pairs, let’s write a little loop that will do the leg work for us.
# initial day to compare for day in range(7): # second day for the comparison for day_2 in range(day+1, 7): # print the days being compared print(list(ticket_by_weekday)[day], list(ticket_by_weekday)[day_2]) # conduct the t-test print(stats.ttest_ind(ticket_by_weekday.iloc[:,day], ticket_by_weekday.iloc[:,day_2]))
Sunday Monday Ttest_indResult(statistic=-14.114555018346483, pvalue=9.993642210838924e-26) Sunday Tuesday Ttest_indResult(statistic=-39.200105865818514, pvalue=2.5709275857822333e-63) Sunday Wednesday Ttest_indResult(statistic=-70.26192329299684, pvalue=3.3199297166490703e-88) Sunday Thursday Ttest_indResult(statistic=-36.573858454513214, pvalue=1.9044297958758257e-60) Sunday Friday Ttest_indResult(statistic=-28.946699561781212, pvalue=5.397450684984687e-51) Sunday Saturday Ttest_indResult(statistic=-23.6204632589478, pvalue=3.772390749169104e-43) Monday Tuesday Ttest_indResult(statistic=-3.7245765466041916, pvalue=0.0003209744709376137) Monday Wednesday Ttest_indResult(statistic=-2.931192014701394, pvalue=0.004168140587481362) Monday Thursday Ttest_indResult(statistic=-3.4884588693442655, pvalue=0.0007185137137979519) Monday Friday Ttest_indResult(statistic=-0.8990678820480197, pvalue=0.37073406361282246) Monday Saturday Ttest_indResult(statistic=7.205958264020363, pvalue=1.0242512878117741e-10) Tuesday Wednesday Ttest_indResult(statistic=2.199934610785886, pvalue=0.03006852630328924) Tuesday Thursday Ttest_indResult(statistic=0.3307029989528652, pvalue=0.7415471929779237) Tuesday Friday Ttest_indResult(statistic=4.52141745074799, pvalue=1.661069033071054e-05) Tuesday Saturday Ttest_indResult(statistic=22.27992785117015, pvalue=5.604256381357363e-41) Wednesday Thursday Ttest_indResult(statistic=-1.6734243085378289, pvalue=0.09730802346354772) Wednesday Friday Ttest_indResult(statistic=3.6005602951584, pvalue=0.0004922318699883966) Wednesday Saturday Ttest_indResult(statistic=31.563121936548725, pvalue=1.9078872746154058e-54) Thursday Friday Ttest_indResult(statistic=4.086626355994712, pvalue=8.737245777165502e-05) Thursday Saturday Ttest_indResult(statistic=20.84950833547568, pvalue=1.457426897692227e-38) Friday Saturday Ttest_indResult(statistic=14.788541058567853, pvalue=4.029675334598723e-27)
These results aren’t pretty, but they do the job! Running through the results, we can see that Sunday is significantly different from each of the other days of the week. Monday and Friday are not significantly different; however, Monday and Wednesday are. There is no meaningful difference between Tuesday and Thursday. The average number of citations issued on Saturday is significantly different from each of the other days of the week as well.
That explains our daily-citation groupings!
Let’s mull over the date-time data from another perspective, and see if we can come up with any more interesting insights.
Citations by Month
Tickets issued by month. In the previous data-wrangling post, we created a stand-alone ‘Month’ variable for each citation. Let’s visualize these now and see if there’s anything interesting to look at.
la_ticket_2017.hist('Month', bins=12, color = 'indigo', edgecolor='black', linewidth=1)
A uniform distribution throughout the year. the effect of a short February is evident in the total tickets issued. It seems that November is also a bit slower of a month. There doesn’t seem to be any pattern to speak of here. How about a pattern within the month?
la_ticket_2017.hist('Day', bins=31, color = 'firebrick', edgecolor = 'black')
We see a sharp decline in tickets issued on the 29th, 30th, and 31st days of the month. This can likely be attributed to Not all months having a 29th, 30th, and 31st day. Let’s see what happens when we remove the effect of February. Nothing surprising here, last 29, 30, 31 drop off due to not being in every month.
la_ticket_2017[la_ticket_2017['Month'] != 2].hist('Day', bins = 31, color = 'firebrick', edgecolor = 'black')
Removing February helps days’ number 29 and 30 close the previously visualized gap. Other than that, there isn’t much of a difference.
Citations by Hours
When visualizing the time a citation is issued, we can expect to see some big trends. Parking enforcement hours, typically sometime between 8am and 5pm, should see a huge surge in citations. The early morning hours should see barely any citations.
Let’s see what kind of story the data tells us this time.
la_ticket_2017.hist('Issue time', range=(0, 2400), bins=200, color = 'forestgreen')
This histogram has been binned into 200 columns. This means that each hour has a six, ten-minute columns, and there are several columns of space between the hours (there is no 10:70 AM).
Like we predicted, there are very few citations issued at 5 AM. A massive spike of citations, from a total of 5000, to over 70,000 occurs early in the 8 o’clock hour. We’ll have to take a closer look at this hour specifically. There’s a lull of enforcement after that, and again a jump at 10 AM.
It would appear that the parking officers get a bit lazy after lunch; citations issued drop by half, and steadily decline thereafter.
la_ticket_2017.hist('Issue time', range=(800, 859), bins=60, color = 'forestgreen')
Nearly a 500% increase in citations issued between 8:04 AM and 8:05 AM. Perhaps that 4 minutes is the time it takes the official to track down their first victim after clocking in at 8:00 AM. Either way, it’s handy information to have.
It’s also clear that citations group around the 5 minute interval, suggesting that officers have the ability to round the issue time, and sometimes utilize that liberty. I would have thought an electronic clock would time-stamp each citation, but apparently not.
Citations on Holidays
Finally, let’s separate a number of specific days of the year, and see how many citations are issued on these days as compared to other, ‘normal’ days. I’m talking seeing if parking officers feel the holiday spirit, or do they play the part of the Grinch?
# these are the holidays that came to my mind and their dates in 2017 holidays = pd.Series(['2017/01/01', '2017/01/16', '2017/02/14', '2017/05/29', '2017/07/04', '2017/09/04', '2017/10/31', '2017/11/10', '2017/11/23', '2017/12/25'], index = ['New Years Day', 'MLK Day', 'Valentines Day', 'Memorial Day', 'Independence Day', 'Labor Day', 'Halloween', 'Veterans Day', 'Thanksgiving', 'Christmas'] ) holidays
New Year’s Day 2017/01/01 MLK Day 2017/01/16 Valentine’s Day 2017/02/14 Memorial Day 2017/05/29 Independence Day 2017/07/04 Labor Day 2017/09/04 Halloween 2017/10/31 Veterans Day 2017/11/10 Thanksgiving 2017/11/23 Christmas 2017/12/25 dtype: object
# change the date format to day of the year for easy use holidays_date = pd.to_datetime(holidays, format='%Y/%m/%d') holidays_doy = holidays_date.dt.dayofyear holidays_doy
New Year’s Day 1 MLK Day 16 Valentine’s Day 45 Memorial Day 149 Independence Day 185 Labor Day 247 Halloween 304 Veterans Day 314 Thanksgiving 327 Christmas 359 dtype: int64
After defining some holidays to consider, I’m going to count the number of tickets given on each of these days, and save the results to a new, ‘holiday_tickets’ Pandas DataFrame.
# create the empty dataframe, named after each holiday holiday_tickets = pd.DataFrame(columns=['Number of Tickets'], index = holidays_doy.index) # set a counter equal to zero i = 0 # for each of the hoidays while i <= len(holidays_doy) - 1: # sum the number of tickets given on that day and save the output to the new DataFrame holiday_tickets.iloc[i, ] = sum(la_ticket_2017['Day of Year'] == holidays_doy.iloc[i]) # move to the next holiday i = i + 1
If that worked correctly, this should give us a bar plot of how many tickets are given on each of the prescribed holidays.
holiday_tickets.plot(kind = 'barh', color = 'slateblue').invert_yaxis()
Most holidays have far fewer tickets than the ~6000 citation daily average, That is with the exception of Halloween and Valentine’s day. I would assert that these are not federally recognized holidays, and therefore there is no change in enforcement.
In fact, I don’t believe that parking citation officers are feeling more lenient, they just get the day off!
Here’s the associated table.
|Number of Tickets|
|New Years Day||1115|
Christmas turns out to have less citations than any other holiday, is it the least cited day of the entire year?
I guess even the parking Scrooges feel a little bit of Christmas spirit!
In conclusion, you should park in L.A. at 5 AM on Christmas Day, especially if it’s a Sunday. If you park on a illegally on a Tuesday or Thursday at 8:20 AM or 12:10 PM, expect to pay!
You should definitely double check the parking signs between 8 AM and 1 PM on the weekdays, and don’t expect the citation officers to give you a break for Valentine’s Day!
There’s so much information stored in date-time variables, I think that’s what makes them so fun to work with. On the next segment of the L.A. Parking Citation Series, I’ll be considering the spatial variables! Where are citations issued? Where are the most grievous parking meters located?
Perhaps by combining the upcoming spatial analysis and this date-time information, you could piece together a devious plan to avoid those pesky parking citations!
Until next time,