L.A. Parking Citations - EDA Part I


Introduction

This is the first of a four-part exploratory data analysis project on an open parking citation dataset. The project utilizes Python and several of its data-science-oriented packages (NumPy, SciPy, Matplotlib, Pandas, GeoPandas). This is my first time using Python in a data science project, so Python people, be prepared to be offended by a bunch of non-Pythonic code!

The dataset I used for this EDA is available for free on Kaggle or at the Los Angeles open data website.

The analysis is initially done in JupyterLab, then exported to a markdown file with its associated images. I add YAML front matter (a simple configuration format) to the markdown file, then copy the finished product and its images into my website’s Jekyll folder. Once everything is properly updated and running on my local server, I push the changes to GitHub, which automatically updates my hosted site on GitHub Pages.
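For the curious, the front matter I prepend looks something like this; the exact fields depend on your Jekyll theme, so treat these as illustrative placeholders:

---
layout: post
title: "L.A. Parking Citations - EDA Part I"
date: 2019-01-01   # placeholder date
---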

In this post, I wish to explore the dataset’s variables and observations, trim ‘useless’ variables and transform others into more useful forms, reduce the file size, and store the results for future exploration.

Let’s get to it!


Load Libraries

In the first part of this EDA we’ll only need to load the NumPy and Pandas packages. Importing os lets us interface with the operating system that Python is running on (in my case macOS), which is necessary for calculating file sizes. The warnings module lets us suppress warning messages.

import numpy as np
import pandas as pd
import os
import warnings

warnings.filterwarnings('ignore')
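As an aside, ignoring every warning is a blunt instrument. If the goal is only to silence pandas’ chained-assignment warning (which some assignments later in this post will trigger), a narrower option is:

# narrower alternative: silence only pandas' chained-assignment warning
pd.set_option('mode.chained_assignment', None)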


Verify Dataset

I’ve downloaded the full parking citation dataset from the City of Los Angeles open data website. It’s updated daily and stretches as far back as 2015. That makes for a lot of parking citations. Let’s check out just how large the dataset is, in terms of file size on disk and number of elements.

# return file size in GB
os.path.getsize('input/parking_citation.csv') / (1*10**9)

# read full data csv from input folder (no output)
la_ticket_full = pd.read_csv("~/Documents/data_science/py_la_tickets/input/parking_citation.csv")

# return shape of newly created dataframe
la_ticket_full.shape
1.021944555

(7201006, 19)


With 19 variables and over 7 million rows, it looks like we have a whole lot of data to work with! Here’s a look at the dataframe itself.


# view dataframe
la_ticket_full.head()
| | Ticket number | Issue Date | Issue time | Meter Id | Marked Time | RP State Plate | Plate Expiry Date | VIN | Make | Body Style | Color | Location | Route | Agency | Violation code | Violation Description | Fine amount | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4272349605 | 2015/12/30 12:00:00 AM | 2201.0 | NaN | NaN | CA | 201605.0 | NaN | OLDS | PA | GN | 3069 SAN MARINO ST | 00403 | 54.0 | 80.56E4+ | RED ZONE | 93.0 | 6471840.7 | 1842349.7 |
| 1 | 4272349616 | 2015/12/30 12:00:00 AM | 2205.0 | NaN | NaN | CA | 201508.0 | NaN | HOND | PA | WT | 2936 8TH ST W | 00403 | 54.0 | 80.56E1 | WHITE ZONE | 58.0 | 6473823.2 | 1843512.0 |
| 2 | 4272821512 | 2015/12/30 12:00:00 AM | 1725.0 | NaN | NaN | CA | 10.0 | NaN | TOYT | PA | SL | 301 LAUREL AV N | 00401 | 54.0 | 5204A- | DISPLAY OF TABS | 25.0 | 6451207.5 | 1850273.2 |
| 3 | 4272821523 | 2015/12/30 12:00:00 AM | 1738.0 | WF74 | NaN | CA | 2.0 | NaN | RROV | PA | BK | 8321 3RD ST W | 00401 | 54.0 | 88.13B+ | METER EXP. | 63.0 | 6449387.2 | 1849063.5 |
| 4 | 4272821534 | 2015/12/30 12:00:00 AM | 1807.0 | 13 | NaN | CA | 1.0 | NaN | FORD | PA | GN | 121 CROFT AVE | 00401 | 54.0 | 80.58L | PREFERENTIAL PARKING | 68.0 | 6448347.2 | 1849662.2 |
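File size on disk is one thing, but the in-memory footprint of the loaded dataframe is typically larger, since the string columns are stored as Python objects. If you’re curious, it can be measured directly (a quick sketch; output omitted):

# total in-memory size in GB; deep=True accounts for object (string) columns
la_ticket_full.memory_usage(deep=True).sum() / 10**9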


While 1 GB of data isn’t necessarily huge, I’m going to take a smaller slice of it to conduct my data analysis. GitHub only allows file submissions under 100 MB without Git Large File Storage (LFS), so that’s something to take into account.

I’m going to conduct the analysis on the entire year of 2017 to shorten up the dataset a bit. I’ll also drop some of the variables that seem rather boring, such as ‘Ticket number’ or ‘Plate Expiry Date’.
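As an aside, if the raw file were ever too large to load in one go, the same year filter could be applied chunk by chunk at read time. A rough sketch (the chunk size is arbitrary):

# read the CSV in chunks, keeping only rows whose Issue Date mentions 2017
chunks = pd.read_csv('input/parking_citation.csv', chunksize=500000)
pieces = [c[c['Issue Date'].str.contains('2017', na=False)] for c in chunks]
la_ticket_2017_alt = pd.concat(pieces)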


2017 Data

Reducing the dataset to include tickets issued in 2017 isn’t much of an issue. I’ll employ string matching and indexing to accomplish the task.

# pull Issue Date variable from the main Dataframe to create a Pandas Series of just date-times
la_ticket_issue = la_ticket_full['Issue Date']
la_ticket_issue.head()

# how many of the issue date variables contain '2017'
sum(la_ticket_issue.str.contains('2017') == True)
0    2015/12/30 12:00:00 AM
1    2015/12/30 12:00:00 AM
2    2015/12/30 12:00:00 AM
3    2015/12/30 12:00:00 AM
4    2015/12/30 12:00:00 AM
Name: Issue Date, dtype: object

2254329


There are 2.2 million entries containing the string ‘2017’. Let’s build a boolean index from these matches, then pull the corresponding rows out of the main dataframe.


# create a True / False list for indexing the data
la_ticket_2017_index = la_ticket_issue.str.contains('2017')
la_ticket_2017_index.head()

# apply the index to the full dataframe to create a new '2017' Dataframe
la_ticket_2017 = la_ticket_full[la_ticket_2017_index]

# make sure length of dataset matches previously determined 'sum' of 2017 strings
len(la_ticket_2017)
0    False
1    False
2    False
3    False
4    False
Name: Issue Date, dtype: bool

2254329


The lengths match up, but let’s have another look at the dataframe just to make sure everything is as we expect.


# ensure variables remain untouched
la_ticket_2017.head()
| | Ticket number | Issue Date | Issue time | Meter Id | Marked Time | RP State Plate | Plate Expiry Date | VIN | Make | Body Style | Color | Location | Route | Agency | Violation code | Violation Description | Fine amount | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2735704 | 1115377911 | 2017/12/18 12:00:00 AM | 2205.0 | NaN | NaN | CA | 201712.0 | NaN | HOND | PA | BK | 1323 S FLOWER ST | 00192 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 6.480729e+06 | 1.836883e+06 |
| 2771883 | 1114752936 | 2017/05/11 12:00:00 AM | 800.0 | NaN | NaN | CA | 201712.0 | NaN | FRHT | TR | WH | INDIANA/NOAKES | CM99 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 |
| 2777524 | 4302749861 | 2017/03/01 12:00:00 AM | 104.0 | NaN | NaN | OR | 3.0 | NaN | TOYT | PA | BL | 1822 WINONA BLVD | 00402 | 54.0 | 80.56E4+ | RED ZONE | 93.0 | 6.470239e+06 | 1.860397e+06 |
| 2777558 | 1120840291 | 2017/03/28 12:00:00 AM | 1050.0 | NaN | NaN | CA | 201708.0 | NaN | HOND | PA | BK | 710 EL CENTRO AV | 00001 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 |
| 2777651 | 4308029526 | 2017/05/15 12:00:00 AM | 134.0 | NaN | NaN | CA | NaN | NaN | LEXS | PA | BL | 1701 VINE ST | 00402 | 54.0 | 80.69B | NO PARKING | 73.0 | 6.462770e+06 | 1.859525e+06 |


Everything looks good! The dataset still contains some 2.2 million observations, which should be more than enough to work with! One last check to make sure the data spans only 2017.


# first issue date in the new 2017 dataset
min(la_ticket_2017['Issue Date'])

# last issue date in the new 2017 dataset
max(la_ticket_2017['Issue Date'])
'2017/01/01 12:00:00 AM'

'2017/12/31 12:00:00 AM'


The earliest ticket in the new dataset was issued on January 1st, and the latest on December 31st (taking the min and max of the raw strings works here because the dates are written year-first). The 12:00:00 AM timestamps are just midnight placeholders; the actual times live in the ‘Issue time’ column. Perfect.
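As another sanity check, the filter could be rebuilt on parsed dates rather than substrings, which is more robust than string matching. A sketch, assuming every row matches this timestamp format:

# alternative: parse Issue Date once, then filter on the year directly
issue_dt = pd.to_datetime(la_ticket_full['Issue Date'],
                          format='%Y/%m/%d %I:%M:%S %p')
la_ticket_2017_alt = la_ticket_full[issue_dt.dt.year == 2017]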

The next step in my data-wrangling workflow is to alter and drop the initial variables that aren’t pleasing to work with, starting with the date-time variable.


Create Day of the Year Variable

We’ve been working with the variable ‘Issue Date’. A useful variable, but in a rather unsightly form: issue times are already reported separately in the ‘Issue time’ column, so every Issue Date carries a meaningless midnight timestamp.

I want to drop the time from the issue date, coerce the string into an actual date-time type, and create a corresponding day of the year variable. A day of the year variable will allow me to easily visualize ticket trends across the year as a whole.

# create a new Series to work with outside of the main Dataframe
date_time = la_ticket_2017['Issue Date']

# check what data type 'Issue Date' is stored as
type(date_time.iloc[0,])

# split string into date and time based on the space between them
date_time_split = date_time.str.split(' ', n=1, expand = True)
date_time_split.head()
str

| | 0 | 1 |
| --- | --- | --- |
| 2735704 | 2017/12/18 | 12:00:00 AM |
| 2771883 | 2017/05/11 | 12:00:00 AM |
| 2777524 | 2017/03/01 | 12:00:00 AM |
| 2777558 | 2017/03/28 | 12:00:00 AM |
| 2777651 | 2017/05/15 | 12:00:00 AM |


Date-time has been split into two variables in a new dataframe, date and time. We need to drop the time variable and reformat the date to be more useful.


# use only date, as we have accurate issue times in another dataset variable
date = date_time_split.iloc[:,0]
date.head()

# convert datatype to date-time using pandas function
date = pd.to_datetime(date, format='%Y/%m/%d')
date.head()
2735704    2017/12/18
2771883    2017/05/11
2777524    2017/03/01
2777558    2017/03/28
2777651    2017/05/15
Name: 0, dtype: object



2735704   2017-12-18
2771883   2017-05-11
2777524   2017-03-01
2777558   2017-03-28
2777651   2017-05-15
Name: 0, dtype: datetime64[ns]


Now we’re only working with the date variable, and it has been changed to a datetime64 object. Time to split up the date into the day of the year (1 - 365), month (1 - 12) and day of the month (1 - 31). These variables will be easier to visualize later on.


# use built-in datetime functionality 'dt' to create 3 new variables
day_of_year = date.dt.dayofyear
month = date.dt.month
day = date.dt.day

# create a new_date dataframe, consisting of the date, month, day, and day of the year
new_date = pd.concat([date, month, day, day_of_year], axis = 1)
new_date.columns = ['Date', 'Month', 'Day', 'Day of Year']
new_date.head()
| | Date | Month | Day | Day of Year |
| --- | --- | --- | --- | --- |
| 2735704 | 2017-12-18 | 12 | 18 | 352 |
| 2771883 | 2017-05-11 | 5 | 11 | 131 |
| 2777524 | 2017-03-01 | 3 | 1 | 60 |
| 2777558 | 2017-03-28 | 3 | 28 | 87 |
| 2777651 | 2017-05-15 | 5 | 15 | 135 |


The job’s almost done; now to add the new variables back into the main dataset.


# Copy the new_date variables into the original dataframe, la_ticket_2017
la_ticket_2017['Month'] = new_date['Month']
la_ticket_2017['Day'] = new_date['Day']
la_ticket_2017['Day of Year'] = new_date['Day of Year']

# view la_ticket_2017 dataframe
la_ticket_2017.head()
| | Ticket number | Issue Date | Issue time | Meter Id | Marked Time | RP State Plate | Plate Expiry Date | VIN | Make | Body Style | ... | Route | Agency | Violation code | Violation Description | Fine amount | Latitude | Longitude | Month | Day | Day of Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2735704 | 1115377911 | 2017/12/18 12:00:00 AM | 2205.0 | NaN | NaN | CA | 201712.0 | NaN | HOND | PA | ... | 00192 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 6.480729e+06 | 1.836883e+06 | 12 | 18 | 352 |
| 2771883 | 1114752936 | 2017/05/11 12:00:00 AM | 800.0 | NaN | NaN | CA | 201712.0 | NaN | FRHT | TR | ... | CM99 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 | 5 | 11 | 131 |
| 2777524 | 4302749861 | 2017/03/01 12:00:00 AM | 104.0 | NaN | NaN | OR | 3.0 | NaN | TOYT | PA | ... | 00402 | 54.0 | 80.56E4+ | RED ZONE | 93.0 | 6.470239e+06 | 1.860397e+06 | 3 | 1 | 60 |
| 2777558 | 1120840291 | 2017/03/28 12:00:00 AM | 1050.0 | NaN | NaN | CA | 201708.0 | NaN | HOND | PA | ... | 00001 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 | 3 | 28 | 87 |
| 2777651 | 4308029526 | 2017/05/15 12:00:00 AM | 134.0 | NaN | NaN | CA | NaN | NaN | LEXS | PA | ... | 00402 | 54.0 | 80.69B | NO PARKING | 73.0 | 6.462770e+06 | 1.859525e+06 | 5 | 15 | 135 |

5 rows × 22 columns
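One aside: because la_ticket_2017 was produced by boolean-indexing la_ticket_full, those three assignments trigger pandas’ SettingWithCopyWarning; that is exactly what the warnings filter at the top of this post hides. A cleaner pattern (a sketch) takes an explicit copy before mutating:

# an explicit copy avoids the SettingWithCopyWarning on later assignments
la_ticket_2017 = la_ticket_full[la_ticket_2017_index].copy()
la_ticket_2017[['Month', 'Day', 'Day of Year']] = new_date[['Month', 'Day', 'Day of Year']]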


Remove ‘Useless’ Variables

Using the drop method, we can make quick work of variables that are redundant or uninteresting.

# drop variables, create reduced dataframe
la_ticket_2017_reduced = la_ticket_2017.drop(['Ticket number', 'Issue Date', 'Marked Time',
                                              'Plate Expiry Date', 'VIN', 'Body Style',
                                              'Route', 'Agency'],
                                             axis=1)

# view reduced dataframe
la_ticket_2017_reduced.head()
| | Issue time | Meter Id | RP State Plate | Make | Color | Location | Violation code | Violation Description | Fine amount | Latitude | Longitude | Month | Day | Day of Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2735704 | 2205.0 | NaN | CA | HOND | BK | 1323 S FLOWER ST | 4000A1 | NO EVIDENCE OF REG | 50.0 | 6.480729e+06 | 1.836883e+06 | 12 | 18 | 352 |
| 2771883 | 800.0 | NaN | CA | FRHT | WH | INDIANA/NOAKES | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 | 5 | 11 | 131 |
| 2777524 | 104.0 | NaN | OR | TOYT | BL | 1822 WINONA BLVD | 80.56E4+ | RED ZONE | 93.0 | 6.470239e+06 | 1.860397e+06 | 3 | 1 | 60 |
| 2777558 | 1050.0 | NaN | CA | HOND | BK | 710 EL CENTRO AV | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 | 3 | 28 | 87 |
| 2777651 | 134.0 | NaN | CA | LEXS | BL | 1701 VINE ST | 80.69B | NO PARKING | 73.0 | 6.462770e+06 | 1.859525e+06 | 5 | 15 | 135 |


Explore Variables

We’ve trimmed the scale and scope of the data, accomplishing exactly what we set out to do at the beginning of this post. Now let’s have a peek at the variables, so we can start to think about the challenges to come!

# unique plates
pd.unique(la_ticket_2017_reduced['RP State Plate'])

# see if we have 50 entries for the 50 states
len(pd.unique(la_ticket_2017_reduced['RP State Plate']))
array(['CA', 'OR', 'CO', 'TX', 'WA', 'ID', 'TN', 'FL', 'NV', 'MD', 'AZ',
       'SD', 'CT', 'VA', 'US', 'GA', 'IL', 'UT', 'MA', 'MO', 'WI', 'MI',
       'NE', 'PA', 'ND', 'OH', 'NM', 'NY', 'IN', 'LA', 'SC', 'NC', 'NJ',
       'MN', 'AL', 'AB', 'KS', 'OK', 'KY', 'QU', 'AR', 'MT', 'IA', 'HI',
       'CN', 'DC', 'VI', 'MS', 'WV', 'NH', 'XX', 'WY', 'AK', 'MX', 'RI',
       'ON', 'BC', 'VT', 'DE', 'ME', 'SA', 'FN', 'NW', 'NF', 'CZ', 'NB',
       'MB', 'AS', 'NS', 'PE', 'TT', 'GU', 'PR', 'ML'], dtype=object)


74


There are 74 unique registration plate origins recorded! We have the predicted state plates, like ‘CA’ and ‘CO’, but there are also less predictable plates like ‘BC’ (British Columbia) and ‘ON’ (Ontario). Let’s use the same method to look into how many different car manufacturers are represented, and how many colors are recorded in the citation data.


# how many different makers?
len(pd.unique(la_ticket_2017_reduced['Make']))
1121


1121 different car manufacturers is a bit absurd. To work with this variable in a meaningful way, we’ll have to whittle it down to the top 50 or so manufacturers, as sketched below.
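When that time comes, value_counts(), which sorts categories by frequency, would do most of the work. A sketch (the ‘OTHER’ label is my own placeholder):

# keep the 50 most common makes and lump the long tail into 'OTHER'
top_makes = la_ticket_2017_reduced['Make'].value_counts().head(50).index
make_trimmed = la_ticket_2017_reduced['Make'].where(
    la_ticket_2017_reduced['Make'].isin(top_makes), other='OTHER')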


# unique car colors
pd.unique(la_ticket_2017_reduced['Color'])
array(['BK', 'WH', 'BL', 'GY', 'BR', 'TA', nan, 'WT', 'RD', 'GO', 'MR',
       'SL', 'BN', 'GN', 'PR', 'TN', 'YE', 'OT', 'OR', 'BG', 'RE', 'SI',
       'GR', 'BU', 'MA', 'PU', 'CR', 'MY', 'TU', 'BE', 'TE', 'GD', 'MU',
       'CH', 'YL', 'RU', 'CO', 'PK', 'UN', 'WI', 'BZ', 'PI', 'WE', 'OL',
       'AQ', 'AU', 'WR', 'WA', 'AM', 'GL', 'PE', 'RA', 'BI', 'BA', 'ES',
       'SU', 'PP', 'BW', 'GE', 'VA', 'PL', 'LE', 'CA', 'W', 'ME', 'SV',
       'BH', 'SA', 'AP', 'VO', 'VU', 'TL'], dtype=object)


There appear to be a lot of different ways to declare a car’s color. For example, white is recorded as ‘WH’, ‘WI’, ‘WT’, ‘WE’, and ‘W’. This could pose a problem.
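When cleaning time comes, a hand-built mapping passed to replace() would consolidate the duplicates. A sketch covering just the white variants (the groupings are my guesses, not an official code list):

# collapse the apparent white variants into a single code
white_variants = {'WT': 'WH', 'WI': 'WH', 'WE': 'WH', 'W': 'WH'}
color_cleaned = la_ticket_2017_reduced['Color'].replace(white_variants)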

How many unique violation codes are present, and do they correspond 1:1 with the Violation Description variable?


len(pd.unique(la_ticket_2017_reduced['Violation code']))

len(pd.unique(la_ticket_2017_reduced['Violation Description']))
231


400
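231 codes against 400 descriptions means the mapping is clearly not 1:1. If we wanted to see which codes carry multiple descriptions, a groupby would surface them (a sketch; output omitted):

# count distinct descriptions per code; anything over 1 is inconsistent labeling
desc_per_code = (la_ticket_2017_reduced
                 .groupby('Violation code')['Violation Description']
                 .nunique())
desc_per_code[desc_per_code > 1].sort_values(ascending=False).head()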


What portion of the citation data involves a parking meter?


sum(la_ticket_2017_reduced['Meter Id'].str.contains('N') == False) / len(la_ticket_2017_reduced)
0.22996510269796466
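About 23% of citations carry a meter ID that doesn’t contain the letter ‘N’, which is what this expression counts. A simpler check (not identical, since it would also keep IDs that contain an ‘N’) is to test for non-missing values:

# fraction of citations with any meter ID recorded at all
la_ticket_2017_reduced['Meter Id'].notna().mean()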


Write Reduced 2017 Dataset

Now that we’ve had a look at the variables, let’s finish up this part of the project by splitting the data into smaller, easily saved pieces. Remember that files need to be under 100 MB to be stored on GitHub without issue.

# write full reduced dataframe
la_ticket_2017_reduced.to_csv('input/la_ticket_2017.csv')

# check size again
os.path.getsize('input/la_ticket_2017.csv') / (1*10**9)
0.25989506


0.26 GB is still too large, so we’ll have to break the dataset up into parts.


# create 4 sequences of 3 months each for indexing the reduced dataset
# (Month was built with dt.month, so its values are integers, not strings)
seq_1 = [1, 2, 3]
seq_2 = [4, 5, 6]
seq_3 = [7, 8, 9]
seq_4 = [10, 11, 12]

# create the indexes
index_1 = la_ticket_2017_reduced['Month'].isin(seq_1)
index_2 = la_ticket_2017_reduced['Month'].isin(seq_2)
index_3 = la_ticket_2017_reduced['Month'].isin(seq_3)
index_4 = la_ticket_2017_reduced['Month'].isin(seq_4)

# apply index to separate dataset into 4 pieces
la_ticket_2017_1 = la_ticket_2017_reduced[index_1]
la_ticket_2017_2 = la_ticket_2017_reduced[index_2]
la_ticket_2017_3 = la_ticket_2017_reduced[index_3]
la_ticket_2017_4 = la_ticket_2017_reduced[index_4]

# save the four pieces
la_ticket_2017_1.to_csv('la_ticket_2017_1.csv')
la_ticket_2017_2.to_csv('la_ticket_2017_2.csv')
la_ticket_2017_3.to_csv('la_ticket_2017_3.csv')
la_ticket_2017_4.to_csv('la_ticket_2017_4.csv')
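The same split can also be written as a single loop using a little quarter arithmetic; a sketch (f-strings require Python 3.6+):

# same four-way split in one loop: quarter q covers months 3q-2 through 3q
for q in range(1, 5):
    months = range(3 * q - 2, 3 * q + 1)
    piece = la_ticket_2017_reduced[la_ticket_2017_reduced['Month'].isin(months)]
    piece.to_csv(f'la_ticket_2017_{q}.csv')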


Breaking the 0.26 GB dataset into 4 pieces should give us small enough files to commit, but let’s check to make sure.


# check size of the jan-mar piece
os.path.getsize('la_ticket_2017_1.csv') / (1*10**9)

# check size of the second piece
os.path.getsize('la_ticket_2017_2.csv') / (1*10**9)

# check size of the third piece
os.path.getsize('la_ticket_2017_3.csv') / (1*10**9)

# check size of the final piece
os.path.getsize('la_ticket_2017_4.csv') / (1*10**9)
0.063464771

0.066494809

0.063882298

0.060089451


And that’s it!

To summarize, we’ve loaded a dataset, reduced its range, altered its variables, and explored what remained. We finished by dividing the dataset into 4 roughly equal parts, in order to easily download and upload them.

In part two of this EDA, we’ll work with the newly created time variables, visualizing how many citations are given and when. Later we’ll do geospatial work and categorical analysis. It should be fun!


Until next time,

- Fisher
