Week 8 – Oct 30

Today I had an overview of the MANOVA test. From what I understand, MANOVA, or Multivariate Analysis of Variance, is a statistical method that expands upon the traditional Analysis of Variance (ANOVA). It is used when you have multiple dependent variables. Instead of analyzing each dependent variable separately, MANOVA treats them as a collective group. The primary goal of MANOVA is to test whether there are significant differences in means among multiple groups, while also considering the interrelationships between these dependent variables.

Key Concepts in MANOVA:

  1. Dependent Variables: In MANOVA, you work with two or more continuous dependent variables. These variables are often interconnected or represent different facets of the same phenomenon.
  2. Independent Variable: Similar to ANOVA, you have one or more categorical independent variables that define the groups you want to compare. These categories can be treatment groups, locations, or any other grouping variable of interest.
  3. Null Hypothesis: The null hypothesis in MANOVA posits that the groups do not differ in their means on the dependent variables. In other words, all group mean vectors are equal.
  4. Alternative Hypothesis: The alternative hypothesis suggests that at least one group mean is different from the others in at least one dependent variable.

MANOVA is a valuable tool when you want to examine the collective effects of multiple dependent variables and ascertain whether there are group differences that may not be apparent when each dependent variable is considered in isolation.
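Here is a minimal sketch of what a MANOVA might look like in Python using statsmodels; the grouping variable and the two dependent variables (dv1, dv2) are invented purely for illustration, not taken from the project data.

```python
# Hypothetical MANOVA sketch with statsmodels; column names and data are made up.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,   # categorical independent variable
    "dv1": [2.1, 2.4, 1.9, 2.2, 2.0,
            3.1, 3.3, 2.9, 3.0, 3.2,
            4.0, 4.2, 3.8, 4.1, 3.9],             # dependent variable 1
    "dv2": [5.0, 5.2, 4.8, 5.1, 4.9,
            6.1, 6.0, 5.9, 6.2, 6.3,
            7.1, 7.0, 6.9, 7.2, 7.3],             # dependent variable 2
})

# Both dependent variables go on the left-hand side of the formula,
# so they are tested jointly rather than one at a time.
fit = MANOVA.from_formula("dv1 + dv2 ~ group", data=df)
print(fit.mv_test())  # reports Wilks' lambda, Pillai's trace, etc.
```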

Week 8 – Oct 27

A scatterplot is a type of data visualization used primarily to understand the relationship between two variables. I believe a scatterplot could be helpful for the project in the following ways:

  1. Identify Patterns and Relationships: Scatterplots allow you to quickly visualize and identify patterns or relationships between two variables. For example, you can see if there’s a clear trend, correlation, or any outliers in your data.
  2. Assess Correlation: If the points in a scatterplot tend to form a line or follow a specific pattern, it indicates a correlation between the two variables. Scatterplots are particularly useful for assessing the strength and direction of this correlation, whether it’s positive (as one variable increases, the other also increases) or negative (as one variable increases, the other decreases).
  3. Outlier Detection: Outliers, which are data points that significantly differ from the majority of the data, are easily visible in a scatterplot. Detecting outliers is important for data cleaning and anomaly detection.
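As a quick sketch, this is how a scatterplot could be drawn with matplotlib; the two variables here are randomly generated stand-ins, not actual columns from the project data.

```python
# Scatterplot sketch with made-up, correlated data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)           # hypothetical variable 1
y = 0.8 * x + rng.normal(0, 5, 200)   # hypothetical variable 2, correlated with x

plt.scatter(x, y, alpha=0.6)
plt.xlabel("Variable 1")
plt.ylabel("Variable 2")
plt.title("Positive correlation; outliers would stand apart from the cloud")
plt.show()
```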

Week 8 – Oct 23

Fuzzy clustering is a variant of traditional clustering methods (like K-Means) that allows data points to belong to multiple clusters to varying degrees. In standard clustering, each data point is assigned exclusively to one cluster. In fuzzy clustering, the degree of membership for each data point in each cluster is expressed as a probability or a degree of belonging, hence the term “fuzzy.” Here’s a simple explanation of fuzzy clustering:

  1. Traditional Clustering vs. Fuzzy Clustering:
    • In traditional clustering (e.g., K-Means), each data point belongs to exactly one cluster. It’s like saying a point must be either in Cluster A or Cluster B.
    • In fuzzy clustering, a data point can belong to multiple clusters simultaneously, with degrees of membership indicating the strength of its association with each cluster. It’s like saying a point can be partially in Cluster A and partially in Cluster B.
  2. Degree of Membership:
    • In fuzzy clustering, the degree of membership for each data point is represented as a value between 0 and 1. A higher value indicates a stronger association with a cluster.
    • These degrees of membership come from an optimization process: fuzzy c-means, for example, iteratively adjusts the cluster centers and memberships so as to minimize the membership-weighted sum of squared distances between data points and cluster centers.
  3. Use Cases:
    • Fuzzy clustering is useful in situations where data points may have mixed characteristics or don’t strictly belong to a single category.
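To see the degrees of membership in action, here is a bare-bones fuzzy c-means sketch written from scratch in numpy (a real project would more likely use a library such as scikit-fuzzy); the data is synthetic.

```python
# Minimal fuzzy c-means: every point gets a degree of membership in
# every cluster, and each point's memberships sum to 1.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)        # random memberships, rows sum to 1
    for _ in range(iters):
        W = U ** m                           # "fuzzified" weights
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted means
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)                # guard against division by zero
        U = 1.0 / d ** (2 / (m - 1))         # closer centroid -> higher degree
        U /= U.sum(axis=1, keepdims=True)
    return centroids, U

# Two loose blobs; points near the boundary end up with split memberships.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
centroids, U = fuzzy_c_means(X, c=2)
print(U[:5].round(2))  # degrees of belonging, e.g. [[0.93, 0.07], ...]
```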

Week 7 – Oct 20

I had an overview of clustering as a whole.

There are various clustering methods available, but two of the most commonly encountered are:

  1. Hierarchical Clustering:
    • Agglomerative: This approach starts with individual data points and gradually combines them into larger clusters. The result is a hierarchical structure, often depicted as a dendrogram.
    • Divisive: In contrast, divisive clustering begins with all data points grouped together and then progressively splits them into smaller clusters until individual data points are reached.
  2. Partitional Clustering:
    • K-Means: K-Means is a widely used partitional clustering method that divides data into ‘k’ clusters, where ‘k’ is a parameter set by the user. It aims to minimize the distance between data points and the center (centroid) of their assigned cluster.
    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN identifies clusters based on the density of data points, forming clusters where data points are densely packed and also detecting noisy data.
    • Gaussian Mixture Models (GMM): GMM assumes data points originate from a mixture of Gaussian distributions. It estimates the parameters of these distributions to find clusters.
    • Fuzzy Clustering: Unlike traditional clustering, where each data point belongs exclusively to one cluster, fuzzy clustering allows data points to have partial membership in multiple clusters.

I believe hierarchical clustering would be quite beneficial in the project.
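As a first experiment, here is a small sketch of agglomerative clustering with scipy; the two random blobs stand in for whatever numeric features the project data ends up providing.

```python
# Agglomerative hierarchical clustering sketch with synthetic data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Ward linkage merges, at every step, the pair of clusters whose union
# increases the within-cluster variance the least.
Z = linkage(X, method="ward")

dendrogram(Z)   # the hierarchy of merges, drawn as a tree
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters
print(labels)
```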

Week 7 – Oct 18

Today I had an overview of Monte Carlo simulation, which is a method used to estimate probabilities for complex problems. This is how you go about it:

  1. Define the problem: Figure out what you want to understand or predict.
  2. Create a model: Make a simplified math model of the situation with key variables.
  3. Use randomness: Randomly guess values for these variables based on their possible outcomes.
  4. Repeat simulations: Run the model with these random values many times.
  5. Analyze results: Look at the outcomes to understand the likelihood of different results and make informed decisions.

It’s like playing out a situation many times with random inputs to see what’s likely to happen.
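A classic toy example of the five steps above is estimating pi by throwing random points at a unit square and counting how many land inside the quarter circle:

```python
# Monte Carlo estimate of pi.
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

x, y = rng.random(n), rng.random(n)   # step 3: random values for the variables
inside = (x**2 + y**2 <= 1.0).sum()   # step 4: run the "model" n times
print(4 * inside / n)                 # step 5: the ratio approaches pi
```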

Week 7 – Oct 16

Today I read about DBSCAN:

DBSCAN is a clustering method used to group similar data points together. It’s good at finding clusters even when they don’t have a regular shape, and it can handle noisy data.

Here’s how it works in simple terms:

  1. Start with a data point and check if there are other nearby points (within a certain distance).
  2. If there are enough nearby points, consider them part of a group.
  3. Keep checking nearby points and expanding the group until you can’t find any more nearby points.
  4. Any data points left alone are considered outliers.

DBSCAN is handy because it can find clusters of different shapes and sizes, and you don’t need to tell it how many clusters to look for in advance. It could speed up the analysis process, as the police database is very large.
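Here is a short sketch of DBSCAN via scikit-learn; eps (the “nearby” distance) and min_samples (how many neighbors count as “enough”) are guesses that would need tuning on the real data.

```python
# DBSCAN sketch: two dense blobs plus scattered noise points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (50, 2)),
    rng.normal(4, 0.3, (50, 2)),
    rng.uniform(-2, 6, (10, 2)),   # sparse points that should become outliers
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks the points left alone (step 4)
```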


Week 6 – Oct 14

Today I read a little bit about Elbow Method:

From what I understand, in unsupervised algorithms, it’s vital to find the right number of clusters for your data, which isn’t predefined. We use methods like the Elbow Method in K-Means clustering to make this determination.

To find the optimal K, you iterate from K=1 to K=n (where n is a chosen upper limit) and calculate the within-cluster sum of squares (WCSS) for each K. WCSS is the sum of squared distances between each data point and the centroid of the cluster it is assigned to. In simpler words, it is basically a measure of how “tidy” our clusters are.

To determine the best K, you create a graph of K against its corresponding WCSS values. Interestingly, this graph often resembles an elbow. At K=1, WCSS is highest, but as K increases, WCSS decreases. The optimal K is usually where the graph starts to straighten out, indicating a point of diminishing returns.
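This is easy to try with scikit-learn, whose KMeans exposes WCSS as the inertia_ attribute; the three synthetic blobs below should produce an elbow around K=3.

```python
# Elbow method sketch: plot K against WCSS (inertia) and look for the bend.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 4, 8)])  # 3 true blobs

ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.show()   # the curve should flatten after the elbow at K=3
```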


Week 6 – Oct 12

Upon going through the data, a few questions popped into my head. How often were the people who were shot armed themselves? How many people who were fleeing were shot by police officers wearing body cameras? Is there a correlation between the number of fatalities and the state in which they occurred? In my opinion, the socio-educational background of each state also needs to be taken into account.


Week 6 – Oct 10

The data given for Project 2 comprises a comprehensive database of police shootings. The Washington Post has compiled a significantly larger database than its contemporaries, which may provide better insights. The database was also compiled in response to the Michael Brown incident of 2014, which makes the analysis all the more crucial.