This is the most common data format. It is the format we are used to in EXCEL and SPSS, for example.
Indicator ID | Snapshot Year | Region of Supervision | County of Residence | Supervision Level | Gender | Age | Crime | Race/Ethnicity |
---|---|---|---|---|---|---|---|---|
2149980 | 2020 | HUDSON VALLEY | ALBANY | LEVEL 4 | FEMALE | 23 | LEGISLATIVE VFO | BLACK |
2149981 | 2020 | HUDSON VALLEY | ALBANY | LEVEL 3 | FEMALE | 26 | LEGISLATIVE VFO | BLACK |
2149982 | 2020 | HUDSON VALLEY | ALBANY | LEVEL 1 | FEMALE | 27 | MAJOR PROPERTY | WHITE |
2149983 | 2020 | HUDSON VALLEY | ALBANY | LEVEL 4 | FEMALE | 27 | OTHER COERCIVE | BLACK |
2149984 | 2020 | HUDSON VALLEY | ALBANY | LEVEL 3 | FEMALE | 29 | MAJOR PROPERTY | WHITE |
2149985 | 2020 | HUDSON VALLEY | ALBANY | LEVEL 3 | FEMALE | 29 | MAJOR PROPERTY | BLACK |
2149986 | 2020 | HUDSON VALLEY | ALBANY | LEVEL 4 | FEMALE | 29 | LEGISLATIVE VFO | BLACK |
2149987 | 2020 | HUDSON VALLEY | ALBANY | LEVEL 4 | FEMALE | 30 | LEGISLATIVE VFO | WHITE |
The data has clearly labeled variables and values associated with each variable. The values take the form of integer, decimal or text and can have one of four types of measures:
Nominal (categorical)
Ordinal (ranked)
Continuous (interval/ratio or scales)
ID (unique to jamovi, variable that is an identifier)
Itβs relatively easy to give answers to well-specified questions using structured data sources and standard statistical analysis. When there is no structure to the data and the problem is not well-specified standard analyses donβt work. Unstructured data makes up 80-90% of all data produced by organizations (Holzinger, Stocker et al., 2013) including police reports, court documents and human service organizations. Social networking algorithms and text mining can help given meaning to narrative, unstructured text.
Consider the report death report from New Jersey).
According to the Comprehensive Child Abuse and Prevention Act, the Department of Children and Families is allowed to release data on child fatalities and near fatalities. These data are located here (DCF | Child Fatalities (nj.gov)).
The columns we tend to focus on are date of incident, allegation type, disposition while ignoring the most important aspects of the case, included in the case disposition as well as historical information.
A graph is a mathematical structure to model pair-wise relationships between objects. Graph or network data is useful for representing highly interconnected data.
The graph structures use nodes, edges, and properties to represent and store graphical data as the next example will illustrate. The node is the entity itself, such as people in a social network. The edge represents the relationship between two entities.
The properties are attributes that can come from structured or unstructured data sets.
**Graph-based data is a natural way to represent social networks. The structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
ID | Name | Sentence |
---|---|---|
0 | Jose | Murder |
1 | Mark | Drugs |
2 | Jerome | Drugs |
3 | Luke | Murder |
4 | Darrell | Murder |
5 | Cam | Drugs |
6 | Burns | Drugs |
7 | Lee | Other |
8 | Bob | Other |
9 | Owen | Drugs |
Source | Target | Interactions |
---|---|---|
Darrell | Jose | 17 |
Luke | Jose | 13 |
Jabba | Jose | 6 |
Darrell | Mark | 5 |
Han | Jerome | 5 |
Darrell | Luke | 3 |
Don | Darrell | 1 |
Burns | Cam | 7 |
Darrell | Burns | 5 |
Darrell | Lee | 16 |
An Overview of Common Errors | |
---|---|
Errors pointing to false values within one data set | |
Error Description | Possible Solution |
Mistakes during data entry | Manual overrules |
Redundant white space | Use string functions/Excel find and replace |
Impossible values | Manual overrules |
Missing values | Remove observation or value |
Outliers | Validate and, if erroneous, treat as missing value, report both with and w/out outlier |
Errors pointing to inconsistencies between data sets | |
Deviations from code book | Match on keys or else use manual overrules |
Different units of measurement | Recalculate |
Different levels of aggregation | Bring to same level of measurement by aggregation or extrapolation |
EXCEL offers some useful ways to explore your data prior to conducting any analysis. Here are some of the most important EXCEL functions IMHO:
We are going to learn how to do all of these tasks now.
Payee | Facility | Home | Total |
---|---|---|---|
All Other | 894 | 1963 | 2857 |
Medi-Cal | 5836 | 15714 | 21550 |
Medicare | 33839 | 125253 | 159092 |
Private | 4222 | 15994 | 20216 |
Self-Pay | 200 | 657 | 857 |
Grand Total | 44991 | 159581 | 204572 |
Make a pivot table of social vulnerability by counties in Ohio.
Navigate to CDC/ATSDR Social Vulnerability Index (SVI) I used these data for this paper Examining Spatial Regimes of Child Maltreatment Allegations in a Social Vulnerability Framework (sagepub.com)
Click Data and Documentation
Enter the following specifications
Go to the documentation for 2020 and read through it
Unless you want to use a programming package, like R, the excel date/time functions are great. As you will see below, I can make multiple different variables out of one date/time column. Unfortunately, however, Excel understands dates as serial numbers only! This can be a real pain. Below I show you how to use the date and time functions to properly create the following columns: day of week, week number, month, day (numeric) and year. Once the data are formatted in this way, we can make nice visualizations using pivot tables including time series charts and heatmaps!
Here are the steps you need to know:
Highlight date column
Right Click and select "Format cells" and the choose "Text"
You should see the date format change to a serial format, i.e. a number
We use the serial date format to further parse the dates
In the new column type =DATEVALUE(and the column you just inserted, example =DATEVALUE(C2)
= Year(Column), = Month(Column), = Day(Column)
= WEEKDAY(Column)
Default is β1β which returns from Sunday through Saturday
To return the actual day name type = TEXT(COLUMN, "dddd")
Open the file dates and times.csv I am going to go through it. Each column after the 'Date Rptd' has a formula, some of which are correct and some of which are not ...
The data are located in Bootcamp/ex/dates_times/Crime_Data_from_2020_to_Present.csv
Here is how I used these data previously. I created the following chart using the google mobility indices and the data from above to demonstrate the association between domestic violence during the pandemic and mobility patterns during the initial phases of COVID-19. You now have access to all of this data, and can create a similar chart using Excel!
if(the Z-score is greater than 2) then it is extreme --> "Yes"
else if (the Z-score is less than -2) then it is extreme --> "Yes"
else it is not extreme --> "No"
Adding observations from one table to another table
You should have something that looks like this
Next, under plots, choose the following selections
And you should have the following images.