While many applications provide statistics on crime in New York City, few break those statistics down to the level of a city block or a few city blocks. The NYPD's recent publication of detailed, incident-level data on violent crime across all five boroughs of New York City, for the 2015 calendar year, has provided a unique opportunity to analyze crime rates in different parts of the city. The data's high spatial and temporal granularity means that, rather than aggregating crime rates to the neighborhood or borough level as is often done, one can examine trends to within two to three city blocks and, in the process, identify potential crime hotspots. In addition, by mapping each crime incident to a census tract (the smallest census unit for which detailed demographic data are available), one can attach demographic information to each crime location. The possibility of extracting meaningful insights from this data was my primary motivation. I also saw this as an opportunity to build an interactive tool that users could use to better understand crime rates in their own neighborhoods. Ultimately, I hope to incorporate a prediction engine that estimates the likelihood of falling victim to a violent crime at a given location, on a specific date and time.
How it Works:
The chart below provides a summary of the data flow in the application.
Below is a summary of the development-to-production workflow and the resources used.
- MySQL database
- Flask, a Python-based web framework built on Werkzeug and Jinja2
- Gunicorn, a WSGI application server that receives requests for dynamic content from Nginx and passes them to Flask
- Python libraries: NumPy, Pandas, SciPy
- Supervisor for process automation
- Google API for geocoding
- FCC API for location GEOID
- Amazon Web Services (AWS) for production deployment
- Git for version control
- Jupyter Notebook for testing features and APIs
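The GEOID lookup that links each incident to a census tract can be sketched as below. The exact FCC endpoint and response fields shown are assumptions (the post names only "FCC API for location GEOID"); what is reliable is the GEOID structure itself: a tract GEOID is the first 11 digits of the 15-digit block FIPS code (2 state + 3 county + 6 tract).

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the FCC census block lookup; the post does not
# name the exact API, only "FCC API for location GEOID".
FCC_BLOCK_API = "https://geo.fcc.gov/api/census/block/find"

def block_fips(lat, lon):
    """Look up the 15-digit census block FIPS code for a lat/lon pair."""
    query = urllib.parse.urlencode(
        {"latitude": lat, "longitude": lon, "format": "json"}
    )
    with urllib.request.urlopen(f"{FCC_BLOCK_API}?{query}") as resp:
        data = json.load(resp)
    return data["Block"]["FIPS"]

def tract_geoid(fips):
    """A census-tract GEOID is the first 11 digits of the block FIPS:
    2 (state) + 3 (county) + 6 (tract)."""
    return fips[:11]
```

With the tract GEOID in hand, each crime location can be joined to the detailed demographic data published for that tract.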
I am currently working on a predictive-engine component for the app. To extract features for a logistic regression model, I am combining data from two sources: (1) the incident-level crime data used in this app, which provides location and time features, along with the mapping of each incident to a census tract, which provides demographic features; and (2) geotagged tweets harvested with the Twitter public API for a bounding box defined by New York City's latitude-longitude coordinates, which will allow me to add features based on sentiment analysis of the tweets. The idea is to score each geotagged tweet for negativity against a dictionary of words associated with crime or crime incidents. Geotagged tweets also encode information about the dynamics of a location, i.e., the number of people moving in and out of a place at a given time, and I intend to leverage these data to provide additional features to the model.
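The negativity scoring described above might look like the following minimal sketch. The word list and the normalization by tweet length are illustrative assumptions, not the app's actual dictionary or scoring rule.

```python
import re

# Illustrative crime-related word list; the dictionary actually used by
# the app is not published, so this set is an assumption.
CRIME_WORDS = {
    "shooting", "shot", "robbery", "robbed", "assault", "stabbing",
    "stabbed", "gun", "knife", "mugged", "attacked", "murder",
}

def negativity_score(tweet_text):
    """Fraction of a tweet's tokens that appear in the crime word list."""
    tokens = re.findall(r"[a-z']+", tweet_text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in CRIME_WORDS)
    return hits / len(tokens)
```

Each geotagged tweet would then contribute its negativity score, together with its coordinates and timestamp, as candidate features for the logistic regression model.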
Visit the app: StaySafeNYC
Scripts and notebooks related to this analysis can be found in this GitHub repository.