# -*- coding: utf-8 -*-
"""A2.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1FmIcaxZPXmJiTaZmse0HIuThNiltnszO

# Assignment 2

## Description


## Problem 1: Distributions [65pts]

Use `numpy.random.randint()` as a fair random number generator (RNG) with 10 values (0, ..., 9).

1. Run the RNG for n iterations, where n = 10 * (last 4 digits of your UF ID + 1000). Record q, the number of times that the first digit of your ID occurs. [5]
2. Run the simulation from part 1 m times, where m = 1000. For each value of q, record N(q), the number of runs that returned q (for example, N(114) = 12 means that q = 114 was returned in 12 of the m runs). [5]
3. Display the empirical probability distribution (histogram) of N(q).
The x-axis lists the possible values of q and the y-axis shows N(q).
Display your ID in the title of the histogram plot. [10]
4. What theoretical distribution fits the histogram of the empirical distribution? [5]
5. Display the theoretical distribution on top of the histogram. [5]
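
The counting experiment in parts 1 to 5 can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: `FIRST_DIGIT` and `LAST_FOUR` are hypothetical placeholders for your own ID, the helper names `count_digit` and `binomial_pmf` are made up for illustration, and the binomial model with success probability 1/10 is one natural candidate for part 4. Plotting is left out; `qs` and `N_q` are the histogram data and `expected_N` is the candidate curve to overlay.

```python
import math
import numpy as np

# Hypothetical placeholders: substitute the digits of your own ID.
FIRST_DIGIT = 3
LAST_FOUR = 1234

n = 10 * (LAST_FOUR + 1000)  # iterations per run (one reading of part 1)
m = 1000                     # number of repeated runs
p = 1 / 10                   # chance of any single digit under a fair RNG

np.random.seed(0)  # fixed seed so the sketch is reproducible


def count_digit(n, digit):
    # Draw n integers from 0..9 (the upper bound 10 is exclusive) and
    # count how often `digit` appears.
    draws = np.random.randint(0, 10, size=n)
    return int(np.count_nonzero(draws == digit))


def binomial_pmf(k, n, p):
    # Work in log space: for n in the tens of thousands the binomial
    # coefficient overflows and p**k underflows as plain floats.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


# Part 2: repeat the run m times and tally how often each q occurs.
q_values = np.array([count_digit(n, FIRST_DIGIT) for _ in range(m)])
qs, N_q = np.unique(q_values, return_counts=True)

# Part 5: expected count for each observed q under the binomial model.
expected_N = np.array([m * binomial_pmf(int(q), n, p) for q in qs])

print("mean of q:", q_values.mean(), "binomial mean n*p:", n * p)
```

Comparing `N_q` against `expected_N` at the same `qs` gives a quick numeric check of the fit before drawing the histogram.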


6. Run the RNG until the first digit of your ID is shown and record the number of iterations q. [5]
7. Run the simulation from part 6 m = 1000 times. For each value of q, record N(q), the
number of runs in which the first digit of your ID was first shown after q iterations. [5]
8. Display the empirical probability distribution (histogram) of N(q). [5]
9. What theoretical distribution fits the histogram of the empirical distribution? [5]
10. Display the theoretical distribution on top of the histogram. [5]
11. How many times do you need to run the RNG to be at least $p=0.95$ confident that the first digit of your ID will be generated?
What is the answer based on your simulation (empirically) and what is it theoretically? [10]
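
The waiting-time experiment and the confidence question above can be sketched as follows. This is a hedged sketch, not a reference solution: `FIRST_DIGIT` is a hypothetical placeholder, `iterations_until` is an illustrative helper name, q is taken to include the draw on which the digit appears, and the geometric model with success probability 1/10 is one natural candidate for the theoretical distribution.

```python
import math
import numpy as np

FIRST_DIGIT = 3    # hypothetical placeholder: use the first digit of your ID
m = 1000           # number of repeated runs
p = 1 / 10         # chance of any single digit under a fair RNG
confidence = 0.95

np.random.seed(0)  # fixed seed so the sketch is reproducible


def iterations_until(digit):
    # Count draws up to and including the first one equal to `digit`.
    q = 1
    while np.random.randint(0, 10) != digit:
        q += 1
    return q


q_values = np.array([iterations_until(FIRST_DIGIT) for _ in range(m)])
qs, N_q = np.unique(q_values, return_counts=True)

# Geometric model: P(Q = q) = (1 - p)**(q - 1) * p for q = 1, 2, ...
expected_N = m * (1 - p) ** (qs - 1) * p

# Theory: smallest k with 1 - (1 - p)**k >= confidence.
k_theory = math.ceil(math.log(1 - confidence) / math.log(1 - p))

# Simulation: smallest k such that at least 95% of the runs
# had already produced the digit within k draws.
k_empirical = int(np.sort(q_values)[math.ceil(confidence * m) - 1])

print("theoretical k:", k_theory, "empirical k:", k_empirical)
```

Under these assumptions the theoretical value works out to 29 draws, since 0.9 to the power 28 is still above 0.05 while 0.9 to the power 29 is below it. The empirical value depends on the seed and on m.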
"""


"""## Problem 2: Shuffling [35pts]

Load the High School and Beyond dataset (**"hsb2.csv"**). Test the null hypothesis $H_0$: gender is not correlated with high school students' scores in **reading**.

1. Plot the two relevant distributions (male and female) in one graph. [5]
2. Use the statistic "difference in median reading score between genders" to decide whether to reject the null hypothesis at the 1% significance level. As no further data is available, resample by randomly shuffling the male/female labels for $n = 10,000$ simulations. [15]
3. Display the empirical distribution of the statistic. [5]
4. Compute and display the 98% confidence interval of the statistic. [5]
5. What theoretical distribution fits the histogram of the empirical distribution? [5]
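
The label-shuffling procedure above can be sketched as follows. This is a minimal sketch under stated assumptions: it runs on synthetic stand-in data rather than "hsb2.csv", so the numbers it prints say nothing about the real dataset. The names `scores`, `is_female`, and `median_diff` are illustrative, the two-sided p-value is one possible choice, and the 98% interval is read here as the 1st to 99th percentile of the shuffled statistic.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for the real data (assumption). With the actual file,
# the scores and labels would come from the CSV instead, for example via
# pandas.read_csv("hsb2.csv"), assuming columns named "read" and "female".
scores = rng.normal(52, 10, size=200).round()
is_female = rng.integers(0, 2, size=200).astype(bool)


def median_diff(scores, is_female):
    # Statistic: median(female scores) - median(male scores).
    return np.median(scores[is_female]) - np.median(scores[~is_female])


observed = median_diff(scores, is_female)

# Shuffle the labels to simulate the null hypothesis that gender and
# reading score are unrelated; group sizes stay the same in every shuffle.
n_sim = 10_000
null_stats = np.empty(n_sim)
for i in range(n_sim):
    shuffled = rng.permutation(is_female)
    null_stats[i] = median_diff(scores, shuffled)

# Two-sided p-value: share of shuffles at least as extreme as the observation.
p_value = np.mean(np.abs(null_stats) >= abs(observed))

# Central 98% interval of the shuffled statistic.
ci_low, ci_high = np.percentile(null_stats, [1, 99])

reject_h0 = p_value < 0.01
print("observed:", observed, "p-value:", p_value)
print("98% interval:", (ci_low, ci_high), "reject H0 at 1%:", reject_h0)
```

A histogram of `null_stats` with a vertical line at `observed` gives the empirical distribution asked for in the list above.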
"""