# Understanding Our Data

To examine how crime relates to poverty rate, unemployment rate, food insecurity rate, and median income across DC wards, we visualized the data using violin plots, bubble charts, and Chernoff faces. We reasoned that high-dimensional charts such as bubble charts and Chernoff faces would make relationships among multiple variables easier to see. We also performed a network analysis to look for possible connections within the data.

## Violin Plots: Investigating Patterns Within and Between Wards

In the violin plots, the X-axis shows the DC wards and the Y-axis shows the attribute (median income, poverty rate, food insecurity rate, or unemployment rate) associated with the crime reports. The box plot inside each violin summarizes the range of the data for that ward and marks the mean, median, Q1, and Q3. The rotated kernel density plot surrounding each box plot shows how the data are distributed.
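The box plot statistics described above can be reproduced numerically as a check on what the plots show. The following is a minimal sketch, not the code used for this analysis: the function name `ward_summaries`, the `(ward, value)` record layout, and the example values are all hypothetical, and the `"inclusive"` quartile method is an assumption, since plotting libraries differ in how they interpolate quartiles.

```python
import statistics
from collections import defaultdict


def ward_summaries(records):
    """Summarize one attribute per ward.

    records: iterable of (ward, value) pairs, one pair per crime report.
    Each ward needs at least two values for quartiles to be defined.
    """
    by_ward = defaultdict(list)
    for ward, value in records:
        by_ward[ward].append(value)

    summaries = {}
    for ward, values in sorted(by_ward.items()):
        # Quartile interpolation is an assumption; "inclusive" treats the
        # data as covering the full range of the population.
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summaries[ward] = {
            "min": min(values),
            "q1": q1,
            "median": median,
            "mean": statistics.mean(values),
            "q3": q3,
            "max": max(values),
        }
    return summaries


# Hypothetical poverty rates attached to crime reports in two wards.
example = [(2, 0.03), (2, 0.05), (2, 0.06), (2, 0.45), (8, 0.30), (8, 0.38)]
for ward, stats in ward_summaries(example).items():
    print(ward, stats)
```

A wide gap between `min` and `max` combined with a low `median` is the numeric counterpart of a long, thin violin whose bulk sits near the bottom of the axis.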

Crimes in Wards 5, 7, and 8 tend to occur in areas with the lowest median income and the highest poverty, food insecurity, and unemployment rates relative to other wards. This is expected, given that Wards 5, 7, and 8 are the most impoverished in DC. Interestingly, for some wards and attributes, the box plot component of the violin plot spans a very wide range of values. For example, in the poverty rate plot for Ward 2, the poverty rate associated with the crime reports ranges from 0.022 to 0.513, which may reflect a wide spread of poverty rates across the ward. The distribution of the crime reports is therefore more informative than the range alone: for Ward 2, most of the data is concentrated at lower poverty rate values.

## Network Analysis

For the network analysis, we manually built a separate network for each of four attributes from the dataset (unemployment rate, poverty rate, median income, and food insecurity rate). Each network is an undirected, unweighted graph in which every crime entry is a distinct node. Two nodes are connected when their values for the chosen attribute differ by no more than the attribute's standard deviation divided by a constant. Building a network from all data entries was too time consuming, so we took a random sample of 1,000 entries.
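The connection rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the submitted code: the function name `build_threshold_network`, the `divisor` value of 10, the use of the population standard deviation, the inclusive `<=` comparison, and the example incomes are all hypothetical, since the write-up does not specify them.

```python
import random
import statistics
from itertools import combinations


def build_threshold_network(values, divisor=10.0):
    """Return (threshold, edges) for an undirected, unweighted graph.

    Node i stands for values[i]. Nodes i and j are linked when
    |values[i] - values[j]| <= stdev(values) / divisor.
    """
    # Assumption: population standard deviation; the sample version
    # (statistics.stdev) would give a slightly larger threshold.
    threshold = statistics.pstdev(values) / divisor
    edges = [
        (i, j)
        for i, j in combinations(range(len(values)), 2)
        if abs(values[i] - values[j]) <= threshold
    ]
    return threshold, edges


# Hypothetical median-income values, one per crime report.
all_values = [42_000, 42_000, 42_000, 58_500, 58_500, 97_250, 97_250, 131_000]

rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = rng.sample(all_values, k=min(1000, len(all_values)))

threshold, edges = build_threshold_network(sample)
print(f"threshold={threshold:.1f}, nodes={len(sample)}, edges={len(edges)}")
```

Because every pair of nodes is compared, a sample of 1,000 entries requires 499,500 comparisons, and the cost grows quadratically with the number of entries, which is consistent with the full dataset being too slow to process.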

All networks generated by the network analysis were very dense, with densities ranging from 0.12 to 0.2. This is understandable given the nature of our dataset and how the network was constructed: the attributes contain many repeated values because identical attribute values were mapped to different crime reports in the dataset. Instead of working with unique values, we included all the data in the network because the number of entries (the distribution of crime reports) is a significant piece of information in our overall analysis. The cluster analysis of the network showed the same effect. The modularity score for each attribute ranged from 0.5 to 0.65. This is a very high score, but it is expected given that the data itself is highly clusterable due to the identical or similar attribute values.
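For reference, the two measures reported above have standard definitions that can be computed directly from an edge list. The sketch below is illustrative only: the function names and the toy graph are hypothetical, and the community assignment is taken as given because the write-up does not state which clustering algorithm produced it. Density is the fraction of possible edges that exist, 2m / (n(n - 1)), and modularity is computed per community as the share of internal edges minus the share expected from node degrees alone.

```python
from collections import Counter


def density(n_nodes, edges):
    """Fraction of possible undirected edges present: 2m / (n(n - 1))."""
    if n_nodes < 2:
        return 0.0
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))


def modularity(edges, community):
    """Newman modularity Q = sum over communities of L_c/m - (d_c / 2m)^2.

    edges:     list of (u, v) pairs with no duplicates or self-loops
    community: dict mapping each node to its community label
    L_c is the number of edges inside community c, d_c the total degree
    of its nodes, and m the total number of edges.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = Counter()
    degree_sum = Counter()
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

print(density(6, toy_edges))
print(modularity(toy_edges, toy_community))
```

Groups of reports with identical attribute values are fully connected to each other under the threshold rule, which raises both the density and the share of edges that fall inside a community.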

Because the network analysis was time consuming, the submitted code runs the analysis only on `MEDIAN_INCOME`. The write-up nevertheless includes the results for all attributes, and the code for running the remaining attributes is included but commented out.

## Chernoff Face: Unexpected Relationships Between Crime Rate & Other Attributes

For each ward, we determined the average poverty rate, unemployment rate, median income, food insecurity rate, and crime rate. The mean values were calculated by averaging each metric over every census tract within a given ward. Each attribute was binned into 3 sections, and the face characteristics were determined by the bin value of each attribute for each ward. The table of the wards and their respective bins is shown below.
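The averaging and binning steps can be sketched as follows. This is a minimal illustration, not the code behind the table: the function names and the example tract values are hypothetical, and equal-width bins are an assumption, since the write-up does not say whether the three bins were equal-width, quantile-based, or chosen by hand.

```python
from collections import defaultdict
from statistics import mean


def ward_means(tracts):
    """Average one metric per ward.

    tracts: iterable of (ward, value) pairs, one pair per census tract.
    """
    by_ward = defaultdict(list)
    for ward, value in tracts:
        by_ward[ward].append(value)
    return {ward: mean(values) for ward, values in by_ward.items()}


def equal_width_bins(values_by_ward, n_bins=3):
    """Map each ward to a bin index from 0 (lowest) to n_bins - 1 (highest)."""
    lo = min(values_by_ward.values())
    hi = max(values_by_ward.values())
    width = (hi - lo) / n_bins
    if width == 0:
        # All wards share the same value, so they share the same bin.
        return {ward: 0 for ward in values_by_ward}
    return {
        # min() keeps the maximum value inside the top bin.
        ward: min(int((value - lo) / width), n_bins - 1)
        for ward, value in values_by_ward.items()
    }


# Hypothetical poverty rates for census tracts in three wards.
tract_poverty = [(2, 0.04), (2, 0.10), (5, 0.18), (5, 0.22), (8, 0.31), (8, 0.37)]
poverty_bins = equal_width_bins(ward_means(tract_poverty))
print(poverty_bins)
```

Each bin index would then select one of three settings for a facial feature, so a ward's face encodes its low, middle, or high standing on every attribute at once.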

Contrary to our predictions, the highest crime rates do not occur in the areas with the lowest median income and the highest food insecurity, poverty, and unemployment rates. We investigated this further in the bubble charts below.

## Bubble Charts: Characterizing Crime Rate Outliers

The bubble charts plot all the census tracts in DC, for a total of 175 data points. Each chart plots the crime rate of every census tract against one of four attributes: median income, poverty rate, food insecurity rate, or unemployment rate. The size of each data point indicates the raw crime count for that census tract, and its color indicates the ward in which the census tract is located.

Notably, each of the graphs has two distinct outliers. These points represent two census tracts in Ward 2, and their large size reflects the high number of crimes in those tracts. We concluded that the high crime rate at these points may be due to a high count of petty theft in the area, given that Ward 2 is one of the more affluent areas in DC; people may be more inclined to commit crimes such as theft and burglary in wards with higher incomes and lower poverty rates. Because our goal was to observe the geographic relationship between crime and the food insecurity and poverty metrics, we reasoned that the aggregate crime rate may not be the most useful measure; more dire situations of poverty and food insecurity may be more relevant to more severe types of crime.