Introduction
This is the first of a four-part exploratory data analysis (EDA) project on an open parking citation dataset. The project uses Python and several of its data-science-oriented packages (NumPy, SciPy, Matplotlib, Pandas, GeoPandas). This is my first time using Python for a data science project, so Python people, be prepared to be offended by a bunch of non-Pythonic code!
The dataset I used for this EDA is available for free on Kaggle or at the City of Los Angeles open data website.
The analysis is initially done in JupyterLab, which can be exported to a markdown file along with its associated images. I add YAML front matter (a simple configuration header) to the markdown file, then copy and paste the finished product and its images into my website’s Jekyll folder. Once everything is properly updated and running on my local server, I push the changes to GitHub, which automatically updates my hosted site on GitHub Pages.
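For reference, the front matter looks roughly like this (the layout, title, and category names below are placeholders rather than my site’s actual values):

```yaml
---
layout: post                  # placeholder layout name
title: "LA Parking Citations EDA, Part 1"
categories: [python, eda]     # placeholder categories
---
```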
In this post, I wish to explore the dataset’s variables and observations, trim and transform ‘useless’ variables to become more useful, reduce the file size, and store the results for future exploration.
Let’s get to it!
Load Libraries
In the first part of this EDA we’ll only need to load the NumPy and Pandas packages. Importing os lets Python interface with the underlying operating system (macOS, in my case), which is necessary for calculating file sizes. The warnings module lets us suppress warning messages.
import numpy as np
import pandas as pd
import os
import warnings
warnings.filterwarnings('ignore')
Verify Dataset
I’ve downloaded the full parking citation dataset from the City of Los Angeles open data website. It’s updated daily and stretches as far back as 2015. That makes for a lot of parking citations. Let’s check out just how large the dataset is, in terms of memory size and number of elements.
# return file size in GB
os.path.getsize('input/parking_citation.csv') / (1*10**9)
# read full data csv from input folder (no output)
la_ticket_full = pd.read_csv("~/Documents/data_science/py_la_tickets/input/parking_citation.csv")
# return shape of newly created dataframe
la_ticket_full.shape
1.021944555
(7201006, 19)
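As an aside, that ~1 GB figure is the size of the CSV on disk; the in-memory footprint is usually larger because of all the text columns. A quick sketch for checking it (not part of the outputs above):

```python
# rough in-memory size in GB (deep=True counts the actual string objects)
la_ticket_full.memory_usage(deep=True).sum() / 10**9
```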
With 19 variables and over 7 million rows, it looks like we have a whole lot of data to work with! Here’s a look at the dataframe itself.
# view dataframe
la_ticket_full.head()
Ticket number | Issue Date | Issue time | Meter Id | Marked Time | RP State Plate | Plate Expiry Date | VIN | Make | Body Style | Color | Location | Route | Agency | Violation code | Violation Description | Fine amount | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4272349605 | 2015/12/30 12:00:00 AM | 2201.0 | NaN | NaN | CA | 201605.0 | NaN | OLDS | PA | GN | 3069 SAN MARINO ST | 00403 | 54.0 | 80.56E4+ | RED ZONE | 93.0 | 6471840.7 | 1842349.7 |
1 | 4272349616 | 2015/12/30 12:00:00 AM | 2205.0 | NaN | NaN | CA | 201508.0 | NaN | HOND | PA | WT | 2936 8TH ST W | 00403 | 54.0 | 80.56E1 | WHITE ZONE | 58.0 | 6473823.2 | 1843512.0 |
2 | 4272821512 | 2015/12/30 12:00:00 AM | 1725.0 | NaN | NaN | CA | 10.0 | NaN | TOYT | PA | SL | 301 LAUREL AV N | 00401 | 54.0 | 5204A- | DISPLAY OF TABS | 25.0 | 6451207.5 | 1850273.2 |
3 | 4272821523 | 2015/12/30 12:00:00 AM | 1738.0 | WF74 | NaN | CA | 2.0 | NaN | RROV | PA | BK | 8321 3RD ST W | 00401 | 54.0 | 88.13B+ | METER EXP. | 63.0 | 6449387.2 | 1849063.5 |
4 | 4272821534 | 2015/12/30 12:00:00 AM | 1807.0 | 13 | NaN | CA | 1.0 | NaN | FORD | PA | GN | 121 CROFT AVE | 00401 | 54.0 | 80.58L | PREFERENTIAL PARKING | 68.0 | 6448347.2 | 1849662.2 |
While 1 GB of data isn’t necessarily huge, I’m going to take a smaller slice of it to conduct my data analysis. GitHub only allows files under 100 MB without using Git Large File Storage (LFS), so that’s something to take into account.
I’m going to restrict the analysis to the entire year of 2017 to shorten the dataset up a bit. I’ll also drop some of the variables that seem rather boring, such as ‘Ticket number’ or ‘Plate Expiry Date’.
2017 Data
Reducing the dataset to include tickets issued in 2017 isn’t much of an issue. I’ll employ string matching and indexing to accomplish the task.
# pull Issue Date variable from the main Dataframe to create a Pandas Series of just date-times
la_ticket_issue = la_ticket_full['Issue Date']
la_ticket_issue.head()
# how many of the issue date variables contain '2017'
sum(la_ticket_issue.str.contains('2017') == True)
0 2015/12/30 12:00:00 AM
1 2015/12/30 12:00:00 AM
2 2015/12/30 12:00:00 AM
3 2015/12/30 12:00:00 AM
4 2015/12/30 12:00:00 AM
Name: Issue Date, dtype: object
2254329
There are roughly 2.25 million entries containing the string ‘2017’. Let’s build a boolean index from this and pull the corresponding rows out of the main dataframe.
# create a True / False list for indexing the data
la_ticket_2017_index = la_ticket_issue.str.contains('2017')
la_ticket_2017_index.head()
# apply the index to the full dataframe to create a new '2017' Dataframe
la_ticket_2017 = la_ticket_full[la_ticket_2017_index]
# make sure length of dataset matches previously determined 'sum' of 2017 strings
len(la_ticket_2017)
0 False
1 False
2 False
3 False
4 False
Name: Issue Date, dtype: bool
2254329
The lengths match up, but let’s have another look at the dataframe just to make sure everything is as we expect.
# ensure variables remain untouched
la_ticket_2017.head()
Ticket number | Issue Date | Issue time | Meter Id | Marked Time | RP State Plate | Plate Expiry Date | VIN | Make | Body Style | Color | Location | Route | Agency | Violation code | Violation Description | Fine amount | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735704 | 1115377911 | 2017/12/18 12:00:00 AM | 2205.0 | NaN | NaN | CA | 201712.0 | NaN | HOND | PA | BK | 1323 S FLOWER ST | 00192 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 6.480729e+06 | 1.836883e+06 |
2771883 | 1114752936 | 2017/05/11 12:00:00 AM | 800.0 | NaN | NaN | CA | 201712.0 | NaN | FRHT | TR | WH | INDIANA/NOAKES | CM99 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 |
2777524 | 4302749861 | 2017/03/01 12:00:00 AM | 104.0 | NaN | NaN | OR | 3.0 | NaN | TOYT | PA | BL | 1822 WINONA BLVD | 00402 | 54.0 | 80.56E4+ | RED ZONE | 93.0 | 6.470239e+06 | 1.860397e+06 |
2777558 | 1120840291 | 2017/03/28 12:00:00 AM | 1050.0 | NaN | NaN | CA | 201708.0 | NaN | HOND | PA | BK | 710 EL CENTRO AV | 00001 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 |
2777651 | 4308029526 | 2017/05/15 12:00:00 AM | 134.0 | NaN | NaN | CA | NaN | NaN | LEXS | PA | BL | 1701 VINE ST | 00402 | 54.0 | 80.69B | NO PARKING | 73.0 | 6.462770e+06 | 1.859525e+06 |
Everything looks good! The dataset still contains some 2.2 million observations, which should be more than enough to work with! One last check to make sure we encompass only 2017.
# first issue date in the new 2017 dataset
min(la_ticket_2017['Issue Date'])
# last issue date in the new 2017 dataset
max(la_ticket_2017['Issue Date'])
'2017/01/01 12:00:00 AM'
'2017/12/31 12:00:00 AM'
The earliest ticket in the new dataset was issued on January 1st and the latest on December 31st (the ‘12:00:00 AM’ is just a placeholder midnight time; the real issue times live in the separate ‘Issue time’ column). Perfect.
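As an aside, string matching works here because every Issue Date follows the same ‘YYYY/MM/DD’ pattern. A more defensive alternative (just a sketch, not what I ran above) would be to parse the dates first and filter on the year:

```python
# hypothetical alternative: parse the dates, then keep only rows from 2017
issue_dates = pd.to_datetime(la_ticket_full['Issue Date'], format='%Y/%m/%d %I:%M:%S %p')
la_ticket_2017_alt = la_ticket_full[issue_dates.dt.year == 2017]
```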
The next step in my data-wrangling workflow is to alter or drop the original variables that aren’t pleasing to work with, starting with the date-time variable.
Create Day of the Year Variable
We’ve been working with the ‘Issue Date’ variable. It’s useful, but in a rather unsightly form. The issue time is already reported separately (in ‘Issue time’), which is why every Issue Date carries a placeholder time of midnight.
I want to drop the time from issue date, coerce the structure from a string to an actual date-time, and create a corresponding day of the year variable. A day of the year variable will allow me to easily visualize ticket trends throughout the year as a whole.
# create a new Series to work with outside of the main Dataframe
date_time = la_ticket_2017['Issue Date']
# check what data type 'Issue Date' is stored as
type(date_time.iloc[0])
# split string into date and time based on the space between them
date_time_split = date_time.str.split(' ', n=1, expand = True)
date_time_split.head()
str
0 | 1 | |
---|---|---|
2735704 | 2017/12/18 | 12:00:00 AM |
2771883 | 2017/05/11 | 12:00:00 AM |
2777524 | 2017/03/01 | 12:00:00 AM |
2777558 | 2017/03/28 | 12:00:00 AM |
2777651 | 2017/05/15 | 12:00:00 AM |
Date-time has been split into two variables in a new dataframe, date and time. We need to drop the time variable and reformat the date to be more useful.
# use only date, as we have accurate issue times in another dataset variable
date = date_time_split.iloc[:,0]
date.head()
# convert datatype to date-time using pandas function
date = pd.to_datetime(date, format='%Y/%m/%d')
date.head()
2735704 2017/12/18
2771883 2017/05/11
2777524 2017/03/01
2777558 2017/03/28
2777651 2017/05/15
Name: 0, dtype: object
2735704 2017-12-18
2771883 2017-05-11
2777524 2017-03-01
2777558 2017-03-28
2777651 2017-05-15
Name: 0, dtype: datetime64[ns]
Now we’re only working with the date variable, and it has been changed to a datetime64 object. Time to split up the date into the day of the year (1 - 365), month (1 - 12) and day of the month (1 - 31). These variables will be easier to visualize later on.
# use built-in datetime functionality 'dt' to create 3 new variables
day_of_year = date.dt.dayofyear
month = date.dt.month
day = date.dt.day
# create a new_date dataframe, consisting of the date, month, day, and day of the year
new_date = pd.concat([date, month, day, day_of_year], axis = 1)
new_date.columns = ['Date', 'Month', 'Day', 'Day of Year']
new_date.head()
Date | Month | Day | Day of Year | |
---|---|---|---|---|
2735704 | 2017-12-18 | 12 | 18 | 352 |
2771883 | 2017-05-11 | 5 | 11 | 131 |
2777524 | 2017-03-01 | 3 | 1 | 60 |
2777558 | 2017-03-28 | 3 | 28 | 87 |
2777651 | 2017-05-15 | 5 | 15 | 135 |
The job’s almost done; now to add the new variables back into the main dataset.
# Copy the new_date variables into the original dataframe, la_ticket_2017
la_ticket_2017['Month'] = new_date['Month']
la_ticket_2017['Day'] = new_date['Day']
la_ticket_2017['Day of Year'] = new_date['Day of Year']
# view la_ticket_2017 dataframe
la_ticket_2017.head()
Ticket number | Issue Date | Issue time | Meter Id | Marked Time | RP State Plate | Plate Expiry Date | VIN | Make | Body Style | ... | Route | Agency | Violation code | Violation Description | Fine amount | Latitude | Longitude | Month | Day | Day of Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735704 | 1115377911 | 2017/12/18 12:00:00 AM | 2205.0 | NaN | NaN | CA | 201712.0 | NaN | HOND | PA | ... | 00192 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 6.480729e+06 | 1.836883e+06 | 12 | 18 | 352 |
2771883 | 1114752936 | 2017/05/11 12:00:00 AM | 800.0 | NaN | NaN | CA | 201712.0 | NaN | FRHT | TR | ... | CM99 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 | 5 | 11 | 131 |
2777524 | 4302749861 | 2017/03/01 12:00:00 AM | 104.0 | NaN | NaN | OR | 3.0 | NaN | TOYT | PA | ... | 00402 | 54.0 | 80.56E4+ | RED ZONE | 93.0 | 6.470239e+06 | 1.860397e+06 | 3 | 1 | 60 |
2777558 | 1120840291 | 2017/03/28 12:00:00 AM | 1050.0 | NaN | NaN | CA | 201708.0 | NaN | HOND | PA | ... | 00001 | 1.0 | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 | 3 | 28 | 87 |
2777651 | 4308029526 | 2017/05/15 12:00:00 AM | 134.0 | NaN | NaN | CA | NaN | NaN | LEXS | PA | ... | 00402 | 54.0 | 80.69B | NO PARKING | 73.0 | 6.462770e+06 | 1.859525e+06 | 5 | 15 | 135 |
5 rows × 22 columns
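One caveat: because la_ticket_2017 was created by slicing la_ticket_full, the column assignments above can trigger pandas’ SettingWithCopyWarning (which the warnings filter from the top of the post hides). If you’d rather not suppress warnings, a minimal fix is to take an explicit copy when building the subset:

```python
# make the 2017 subset an independent dataframe, not a view of la_ticket_full
la_ticket_2017 = la_ticket_full[la_ticket_2017_index].copy()
```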
Remove ‘Useless’ Variables
Using the DataFrame drop method, we can make quick work of variables that are redundant or uninteresting.
# drop variables, create reduced dataframe
la_ticket_2017_reduced = la_ticket_2017.drop(['Ticket number', 'Issue Date', 'Marked Time',
'Plate Expiry Date', 'VIN', 'Body Style',
'Route', 'Agency'],
axis=1)
# view reduced dataframe
la_ticket_2017_reduced.head()
Issue time | Meter Id | RP State Plate | Make | Color | Location | Violation code | Violation Description | Fine amount | Latitude | Longitude | Month | Day | Day of Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2735704 | 2205.0 | NaN | CA | HOND | BK | 1323 S FLOWER ST | 4000A1 | NO EVIDENCE OF REG | 50.0 | 6.480729e+06 | 1.836883e+06 | 12 | 18 | 352 |
2771883 | 800.0 | NaN | CA | FRHT | WH | INDIANA/NOAKES | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 | 5 | 11 | 131 |
2777524 | 104.0 | NaN | OR | TOYT | BL | 1822 WINONA BLVD | 80.56E4+ | RED ZONE | 93.0 | 6.470239e+06 | 1.860397e+06 | 3 | 1 | 60 |
2777558 | 1050.0 | NaN | CA | HOND | BK | 710 EL CENTRO AV | 4000A1 | NO EVIDENCE OF REG | 50.0 | 9.999900e+04 | 9.999900e+04 | 3 | 28 | 87 |
2777651 | 134.0 | NaN | CA | LEXS | BL | 1701 VINE ST | 80.69B | NO PARKING | 73.0 | 6.462770e+06 | 1.859525e+06 | 5 | 15 | 135 |
Explore Variables
We’ve trimmed the scale and scope of the data, accomplishing exactly what we set out to do at the beginning of this post. Now let’s have a peek at the variables, so we can start to think about the challenges to come!
# unique plates
pd.unique(la_ticket_2017_reduced['RP State Plate'])
# see if we have 50 entries for the 50 states
len(pd.unique(la_ticket_2017_reduced['RP State Plate']))
array(['CA', 'OR', 'CO', 'TX', 'WA', 'ID', 'TN', 'FL', 'NV', 'MD', 'AZ',
'SD', 'CT', 'VA', 'US', 'GA', 'IL', 'UT', 'MA', 'MO', 'WI', 'MI',
'NE', 'PA', 'ND', 'OH', 'NM', 'NY', 'IN', 'LA', 'SC', 'NC', 'NJ',
'MN', 'AL', 'AB', 'KS', 'OK', 'KY', 'QU', 'AR', 'MT', 'IA', 'HI',
'CN', 'DC', 'VI', 'MS', 'WV', 'NH', 'XX', 'WY', 'AK', 'MX', 'RI',
'ON', 'BC', 'VT', 'DE', 'ME', 'SA', 'FN', 'NW', 'NF', 'CZ', 'NB',
'MB', 'AS', 'NS', 'PE', 'TT', 'GU', 'PR', 'ML'], dtype=object)
74
There are 74 unique plate registration codes recorded! It appears that we have the expected state plates, like ‘CA’ and ‘CO’, but there are also less predictable entries like ‘BC’ (British Columbia) and ‘ON’ (Ontario). Let’s use the same method to look into how many different car manufacturers are represented, and how many colors are recorded in the citation data.
# how many different makers?
len(pd.unique(la_ticket_2017_reduced['Make']))
1121
1121 different car manufacturers is a bit absurd; presumably many of these are misspellings or nonstandard abbreviations of the same makes. To work with this variable in a meaningful way, we’ll have to whittle it down to the top 50 or so manufacturers.
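As a rough sketch of how that trimming might look later (the cutoff of 50 is arbitrary):

```python
# rank make codes by citation count and keep only rows from the 50 most common
top_makes = la_ticket_2017_reduced['Make'].value_counts().head(50)
la_top_makes = la_ticket_2017_reduced[la_ticket_2017_reduced['Make'].isin(top_makes.index)]
```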
# unique car colors
pd.unique(la_ticket_2017_reduced['Color'])
array(['BK', 'WH', 'BL', 'GY', 'BR', 'TA', nan, 'WT', 'RD', 'GO', 'MR',
'SL', 'BN', 'GN', 'PR', 'TN', 'YE', 'OT', 'OR', 'BG', 'RE', 'SI',
'GR', 'BU', 'MA', 'PU', 'CR', 'MY', 'TU', 'BE', 'TE', 'GD', 'MU',
'CH', 'YL', 'RU', 'CO', 'PK', 'UN', 'WI', 'BZ', 'PI', 'WE', 'OL',
'AQ', 'AU', 'WR', 'WA', 'AM', 'GL', 'PE', 'RA', 'BI', 'BA', 'ES',
'SU', 'PP', 'BW', 'GE', 'VA', 'PL', 'LE', 'CA', 'W', 'ME', 'SV',
'BH', 'SA', 'AP', 'VO', 'VU', 'TL'], dtype=object)
There appear to be a lot of different ways to declare a car’s color. For example, white is recorded as ‘WH’, ‘WI’, ‘WT’, ‘WE’, and ‘W’. This could pose a problem.
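One possible fix, sketched here with just the white variants (the full mapping would have to be built out by hand):

```python
# collapse the obvious spellings of white into a single code
white_variants = {'WT': 'WH', 'WI': 'WH', 'WE': 'WH', 'W': 'WH'}
colors_cleaned = la_ticket_2017_reduced['Color'].replace(white_variants)
```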
How many unique violation codes are present, and do they correspond 1:1 with the Violation Description variable?
len(pd.unique(la_ticket_2017_reduced['Violation code']))
len(pd.unique(la_ticket_2017_reduced['Violation Description']))
231
400
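With 231 codes but 400 distinct descriptions, the mapping clearly isn’t one-to-one. A quick sketch (not run above) for finding the codes that carry multiple descriptions:

```python
# count how many distinct descriptions each violation code maps to
desc_per_code = la_ticket_2017_reduced.groupby('Violation code')['Violation Description'].nunique()
desc_per_code[desc_per_code > 1].sort_values(ascending=False).head()
```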
What portion of the citations were issued at a metered space, i.e., have a meter ID recorded?
sum(la_ticket_2017_reduced['Meter Id'].str.contains('N') == False) / len(la_ticket_2017_reduced)
0.22996510269796466
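For a sanity check on that figure, a simpler sketch would be the share of rows with any Meter Id recorded at all:

```python
# fraction of citations that have a non-missing Meter Id
la_ticket_2017_reduced['Meter Id'].notna().mean()
```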
Write Reduced 2017 Dataset
Now that we’ve had a look at the variables, let’s finish up this part of the project by splitting the data into smaller, easily saved pieces. Remember that files need to be under 100 MB to be committed to GitHub without issue.
# write full reduced dataframe
la_ticket_2017_reduced.to_csv('input/la_ticket_2017.csv')
# check size again
os.path.getsize('input/la_ticket_2017.csv') / (1*10**9)
0.25989506
0.26 GB is still too large, so we’ll have to break the dataset up into parts.
# create 4 sequences of 3 months for indexing the reduced dataset
# (Month holds integers, since it came from dt.month, so the sequences must be integers too)
seq_1 = [1, 2, 3]
seq_2 = [4, 5, 6]
seq_3 = [7, 8, 9]
seq_4 = [10, 11, 12]
# create the index
index_1 = la_ticket_2017_reduced['Month'].isin(seq_1)
index_2 = la_ticket_2017_reduced['Month'].isin(seq_2)
index_3 = la_ticket_2017_reduced['Month'].isin(seq_3)
index_4 = la_ticket_2017_reduced['Month'].isin(seq_4)
# apply index to separate dataset into 4 pieces
la_ticket_2017_1 = la_ticket_2017_reduced[index_1]
la_ticket_2017_2 = la_ticket_2017_reduced[index_2]
la_ticket_2017_3 = la_ticket_2017_reduced[index_3]
la_ticket_2017_4 = la_ticket_2017_reduced[index_4]
# save the four pieces
la_ticket_2017_1.to_csv('la_ticket_2017_1.csv')
la_ticket_2017_2.to_csv('la_ticket_2017_2.csv')
la_ticket_2017_3.to_csv('la_ticket_2017_3.csv')
la_ticket_2017_4.to_csv('la_ticket_2017_4.csv')
Breaking the 0.26 GB dataset into 4 pieces should give us small enough files to commit, but let’s check to make sure.
# check size of the jan-mar piece
os.path.getsize('la_ticket_2017_1.csv') / (1*10**9)
# check size of the second piece
os.path.getsize('la_ticket_2017_2.csv') / (1*10**9)
# check size of the third piece
os.path.getsize('la_ticket_2017_3.csv') / (1*10**9)
# check size of the final piece
os.path.getsize('la_ticket_2017_4.csv') / (1*10**9)
0.063464771
0.066494809
0.063882298
0.060089451
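As an aside, another way to stay under GitHub’s limit (a sketch, not what I did here) would be to write a single compressed file, since pandas can gzip the CSV on the way out:

```python
# write one gzipped CSV instead of four plain ones
la_ticket_2017_reduced.to_csv('input/la_ticket_2017.csv.gz', compression='gzip')
```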
And that’s it!
To summarize, we’ve loaded a dataset, reduced its range, altered its variables, and explored the remaining variables. We finished by dividing the dataset into 4 parts of roughly equal size, in order to easily download and upload them.
In part two of this EDA, we’ll work with the newly created time variables, visualizing how many citations are given and when. Later we’ll do geospatial work and categorical analysis. It should be fun!
Until next time,
- Fisher