Social Content:

This assignment asks students to use data collected by the Digital Almshouse Project to build Python dictionaries that track information about Irish immigration in 1840s New York. Using this information, students then write functions that find trends in the data. The goals of this assignment are to teach students how to use dictionaries, basic sorting, and basic indexing through the exploration of the biases, assumptions, and systems of power extant in this data.

Technical Content:

  • Dictionaries
  • Functions
  • If/Elif/Else Statements
  • Loops
  • Strings

Links:

Digital Almshouse Project Data: https://www.nyuirish.net/almshouse/

Downloadable PDF Assignment: Irish Immigration Assignment

Assignment:

CSCI 1100 Critical Computer Science 1

Homework 7

Dictionaries & Cleaning Data

Reading: “Automating Inequality,” by Virginia Eubanks, Chapter 4, “The Allegheny Algorithm”

OVERVIEW 

This homework is worth 100 points total toward your overall homework grade and is due Thursday, November 14th, 2019 at 11:59:59 pm. There is only one part to this homework, and the files should be submitted as:

hw7Part1.py

README.txt

In this homework, you will examine a data set on Irish immigration from Bellevue Hospital in 1840s New York City, scrutinize the data, and produce a statistical analysis. You will use dictionaries, simple sorting, and indexing to explore the biases, assumptions, and systems of power extant in this data. You will examine how data is structured, what decisions data collectors and “social sorters” like immigration admittors make, how those social decisions become naturalized and cleaned in data, and how these decisions have historical impact.

The complete dataset and information about the Digital Almshouse Project can be found at https://www.nyuirish.net/almshouse/.

Input

The bulk of the homework will focus on reading in a (large) data set, parsing it, “cleaning” the data, and storing the data into nested dictionaries. From these dictionaries, you should read through the data and write functions that identify trends and important statistics.

You should first focus on properly reading in the data. The file you read in is a TSV (tab-separated values) file, meaning the fields in each row are separated by a tab character. This file is best viewed in Excel or another spreadsheet application. Please read the file Excel_Instructions.txt.
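As a rough sketch of the reading step, Python's standard csv module can split tab-separated rows when given a tab delimiter. The column names and the two sample rows below are made up for illustration and are not taken from the real file; check the header row of the actual data before relying on any of them.

```python
import csv
import io


def read_tsv(file_obj):
    """Return a list of row dictionaries, keyed by the names in the header row."""
    reader = csv.DictReader(file_obj, delimiter="\t")
    return list(reader)


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "emigrant_id\tgender\tdisease\n"
    "4765\tfemale\tpregnant\n"
    "4766\tmale\tfever\n"
)
rows = read_tsv(sample)
print(rows[0]["disease"])

# With a real file (the name below is a placeholder, not the actual file name):
# with open("data_1000.tsv", newline="", encoding="utf-8") as f:
#     rows = read_tsv(f)
```

Each row comes back as its own dictionary, which makes it straightforward to copy selected fields into the nested dictionaries later on.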

The data is not “cleaned”: if you look through the set briefly, you’ll see extraneous characters that don’t make sense. Your code should handle these characters, either by ignoring them or by treating them in some other way that you choose.
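One possible way to handle stray characters is sketched below. The helper name clean_value and the set of allowed characters are assumptions chosen for illustration, not a required approach. Note that an ASCII-only filter like this one would also strip accented letters from names, which is exactly the kind of decision worth recording in your README.

```python
import string

# Assumption: letters, digits, spaces, and a little punctuation are "expected".
ALLOWED = set(string.ascii_letters + string.digits + " .'-")


def clean_value(raw):
    """Drop unexpected characters, collapse whitespace, and lowercase the field.

    Returns None when nothing usable is left.
    """
    kept = "".join(ch for ch in raw if ch in ALLOWED)
    kept = " ".join(kept.split())  # collapse runs of spaces and trim the ends
    return kept.lower() if kept else None


print(clean_value("  Preg?nant[] "))  # pregnant
print(clean_value("???"))             # None
```

Returning None instead of an empty string makes it easy to count how many fields were discarded entirely.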

Columns with non-standard data usually have an adjacent column designed to make comparing columns easier. Take these columns into consideration as you decide what information you’ll contribute to your analysis.

As you are cleaning your data, be sure to note down the decisions you’re making in your README. What have you chosen to do with the data that doesn’t fit your program? Why doesn’t it fit? How much have you altered the “original” dataset? What do these “dirty” elements of the dataset represent?

After reading in the data properly, your code needs to store the data in proper entries in nested dictionaries. There should be two main dictionaries: one for Admittors and one for Emigrants. 

Here’s an example of what creating an entry in a nested dictionary would look like:


emigrantID = 4765

emigrants = dict()

emigrants[emigrantID] = { "admittor_id" : "432511107",  "gender" : "female", "disease" : "pregnant", "admittor_1" : "G.W. Anderson", ... }

You can retrieve different entries and information in a few different ways with nested dictionaries. 

For example:

This code:


print(emigrants[emigrantID])

Would print out:


{'admittor_id': '432511107', 'gender': 'female', 'disease': 'pregnant', 'admittor_1': 'G.W. Anderson', ... }

While this code:


print(emigrants[emigrantID]["gender"])

print(emigrants[emigrantID]["disease"])

print(emigrants[emigrantID]["admittor_1"])

Would print out:


female

pregnant

G.W. Anderson
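To connect the lookups above with the two required dictionaries, here is a minimal sketch of a loop that fills both from already-parsed rows. The function name build_dictionaries, the "emigrant_id" key, the "patients" list, and the second sample row are assumptions made for illustration; the other keys follow the example entry above.

```python
def build_dictionaries(rows):
    """Build the Emigrants and Admittors dictionaries from parsed rows."""
    emigrants = {}
    admittors = {}
    for row in rows:
        emigrant_id = row["emigrant_id"]
        admittor_id = row["admittor_id"]
        emigrants[emigrant_id] = {
            "admittor_id": admittor_id,
            "gender": row["gender"],
            "disease": row["disease"],
        }
        # setdefault creates the admittor's entry the first time the ID is seen.
        admittor = admittors.setdefault(
            admittor_id, {"name": row["admittor_1"], "patients": []}
        )
        admittor["patients"].append(emigrant_id)
    return emigrants, admittors


# The second row is invented so that one admittor has two patients.
rows = [
    {"emigrant_id": "4765", "admittor_id": "432511107", "gender": "female",
     "disease": "pregnant", "admittor_1": "G.W. Anderson"},
    {"emigrant_id": "4766", "admittor_id": "432511107", "gender": "male",
     "disease": "fever", "admittor_1": "G.W. Anderson"},
]
emigrants, admittors = build_dictionaries(rows)
print(emigrants["4765"]["gender"])
print(len(admittors["432511107"]["patients"]))
```

Storing the admittor's ID inside each emigrant entry lets you move from one dictionary to the other without duplicating the admittor's details.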

The provided files contain four data sets of different sizes: 1000, 2500, 5000, and 10000 entries. Do not test with the larger files until you are sure your code works on the smaller ones. For grading, we will be testing with all of them.

Creating Dictionaries, Constructing Class

From the data you’ve parsed and stored into dictionaries, you will compute the trends and statistics you’ve chosen to track. You have some choice in this matter, though we do want you to store particular data points for each Emigrant and Admittor. For Emigrants, you are required to track their name, diagnosis, and the location they were sent to. For Admittors, you must track their name, number of patients, and how many Emigrants they sent to each facility (e.g. M.G. Leonard sent 1508 to Facility A, 265 to Facility B, and 12 to Facility C). As you may have noticed, there’s a lot more data provided than what we’re requiring you to track. Below are three examples of what you can track and calculate with your data and how you should format the input prompt.
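The per-facility counts for each Admittor could be kept in a nested dictionary, as in the sketch below. The function name count_by_facility, the (name, facility) pair format, and the sample records are assumptions for illustration only; they are not real counts from the data set.

```python
def count_by_facility(records):
    """Tally total patients and per-facility counts for each admittor.

    records is an iterable of (admittor_name, facility) pairs.
    """
    stats = {}
    for name, facility in records:
        entry = stats.setdefault(name, {"patients": 0, "facilities": {}})
        entry["patients"] += 1
        # get() returns 0 the first time a facility appears for this admittor.
        entry["facilities"][facility] = entry["facilities"].get(facility, 0) + 1
    return stats


# Made-up records, not taken from the data set.
records = [
    ("M.G. Leonard", "Facility A"),
    ("M.G. Leonard", "Facility A"),
    ("M.G. Leonard", "Facility B"),
    ("G.W. Anderson", "Facility A"),
]
stats = count_by_facility(records)
print(stats["M.G. Leonard"])
# {'patients': 3, 'facilities': {'Facility A': 2, 'Facility B': 1}}
```

Dividing a facility count by the admittor's "patients" total gives the share of that admittor's patients sent to the facility.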

With the Bellevue data, these are the three things chosen to track:

1: G.W. Anderson’s total patients and the ratio of the diagnoses given

2: M.G. Leonard’s total patients and the ratio of the diagnoses given

3: The comparison of illnesses diagnosed for each gender

What would you like to view?
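A menu like the one above can be driven by a loop that keeps asking until the user chooses to stop. This is only a sketch: the name run_menu, the "q" quit command, and the placeholder handlers are assumptions, and the read and write parameters exist only so the loop can be exercised with scripted answers instead of live keyboard input.

```python
def run_menu(handlers, read=input, write=print):
    """Keep prompting until the user enters q or quit (an assumed command)."""
    while True:
        choice = read("What would you like to view? ").strip()
        if choice.lower() in ("q", "quit"):
            break
        if choice in handlers:
            write(handlers[choice]())
        else:
            write("Please enter one of: " + ", ".join(sorted(handlers)))


# Placeholder handlers; real ones would compute statistics from the dictionaries.
handlers = {
    "1": lambda: "statistic 1 goes here",
    "2": lambda: "statistic 2 goes here",
    "3": lambda: "statistic 3 goes here",
}

# Scripted answers stand in for keyboard input so the sketch runs unattended.
answers = iter(["1", "9", "q"])
output = []
run_menu(handlers, read=lambda prompt: next(answers), write=output.append)
print(output)
```

Calling run_menu(handlers) with the defaults would read from the keyboard and print directly to the screen.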

For your statistics, you should use two of the stats tracked in the example and add two of your own. At the top of your code, you should type out an explanation (approximately 250 words) about your chosen statistics, why you chose them, and some analysis of the results. We’ll award credit based on what you’ve chosen.

Use the sorting function you designed to explore and analyze the dataset. Answer the questions in your README.txt and turn it in to get full points on this portion.
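A simple sorting function might order a dictionary of counts from most to least common. The sketch below is one option, assuming a dictionary that maps each diagnosis to a count; the name sort_counts and the sample counts are made up for illustration.

```python
def sort_counts(counts):
    """Return (label, count) pairs, highest count first, ties in alphabetical order."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented counts, not taken from the data set.
diagnosis_counts = {"fever": 3, "pregnant": 1, "destitution": 3}
ranked = sort_counts(diagnosis_counts)
print(ranked)
# [('destitution', 3), ('fever', 3), ('pregnant', 1)]
```

Breaking ties alphabetically keeps the output the same from run to run, which makes results easier to compare across the four data files.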

For this homework, you will be graded on:

  • Parsing in and cleaning the data
  • Tracking two of the given statistics
  • Creating and tracking two of your own statistics
  • Taking in continuous input to see those statistics
  • Formatting output
  • Using dictionaries
  • Implementing proper code structure
  • Commenting, variable names
  • Completing the README.txt