Assignment 2:
Answer the questions in the provided Jupyter notebook or Python file. Use the data.

Q + A
  • A: There are multiple parts to Assignment 2
    1.1: Assuming your UF ID is 12345678, you run the RNG 66780 times and record how often 1 occurs. Say it occurs q=5000 times.
    1.2: You run part 1.1 1000 times. Each time, you increment the counter N(q) for the q you got back (N(5000) when q=5000, N(4963) when q=4963). Then you display (x,y) = (q, N(q)).
    (Why did we use UFID+1000? Some students have last 4 digits like 0056.)
  • 1.5 Q: Should the histograms be normalized to compare to the theoretical distributions?
    A: Theoretical distributions are usually normalized to have an integral of 1. If your $x$ and $y$ values are in the 100s, an unnormalized overlay would look rather strange (so, yes).
  • 1.10: plot the confidence interval and type out the range if you want to be sure we do not overlook it.
    Q: How do I calculate the confidence interval?
    A: see the data8 book (what should be the limit distribution?)
  • 2.1: Plot the two **relevant** distributions (male and female) in one graph. (Btw: I uploaded the hsbc data to my Google Drive to have a quick look at the content.)
    A: The relevant distributions are the reading scores, plotted separately for male and female.
  • 2.2: I bootstrapped...
    Problem 2 specifies shuffling the labels, not bootstrapping (btw: you should have read the corresponding section in the data8 book by now).
    Take 100 "female" labels and assign them to 100 persons regardless of gender. Then give the label "male" to the rest. According to the null hypothesis, the reading scores do not depend on whether the labels were applied to the correct gender.
  • Q: p-value and alpha value?
    A: You want to see whether the observed statistic lies in (either) tail of the distribution (which distribution?) of the mean differences.
  • A: Slow? Vectorize the `for` loop.
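The repeat-and-count procedure of parts 1.1 and 1.2 can be sketched without an explicit Python loop. This is a minimal sketch, not the assignment's solution: the function name `count_ones`, the assumption that the RNG draws integers uniformly from 0 to `n_outcomes - 1`, and the small run sizes are all hypothetical. Substitute the RNG and the run counts that the assignment specifies.

```python
import numpy as np

def count_ones(n_draws, n_repeats, n_outcomes=2, seed=0):
    """Return q for each repeat: how often the value 1 occurs in n_draws draws.

    Assumption (not from the assignment): the RNG is uniform over the
    integers 0..n_outcomes-1.
    """
    rng = np.random.default_rng(seed)
    # One row per repeat, one column per draw; no explicit for loop.
    draws = rng.integers(0, n_outcomes, size=(n_repeats, n_draws))
    return (draws == 1).sum(axis=1)

# Small illustrative sizes, not the sizes from the assignment.
q_values = count_ones(n_draws=500, n_repeats=200)

# N(q): how many repeats produced each distinct count q.
q, n_of_q = np.unique(q_values, return_counts=True)

# Dividing by the number of repeats makes the values sum to 1, so they
# can be compared with a theoretical probability distribution.
freq = n_of_q / n_of_q.sum()
```

At full size the `draws` array may not fit comfortably in memory. Under the same uniform assumption, `rng.binomial(n_draws, 1 / n_outcomes, size=n_repeats)` yields counts with the same distribution without storing the individual draws.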

    You can submit answers as part of your Python work.

    The grader gave this explanation for Problem 2:
    The intuition behind Problem 2 is that we're comparing the observed difference in scores between males and females with differences in scores between two random groups. We're trying to see whether the observed difference is plausibly due to gender or is just a result of sampling variability.
    Random shuffling is one way to create a new sample. It critically depends on the null hypothesis that there is no correlation between gender and reading scores. We reuse the data since we don't have additional data. If (for example) you have a male pool of size 5 and a female pool of size 6, random shuffling involves taking each of the 11 people and placing them into one of the two pools at random, keeping the pool sizes at 5 and 6. Of course, you can't have duplicate people, so there's no replacement once you've placed a person into a pool. This differs from bootstrapping, which is very similar except that replacement is allowed. In particular, you’re to take the median difference for each of the 10,000 samples: the difference in median between the two groups associated with each sample. In this way, we’re trying to see how the observed difference compares with differences that arise from variability due to random sampling.
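The label shuffle described above can be sketched as follows. This is an illustration under stated assumptions, not the grader's code: the scores are made up (not the course data), the helper name `shuffled_median_diffs` is hypothetical, and the sign convention (female minus male) is an arbitrary choice.

```python
import numpy as np

def shuffled_median_diffs(scores, is_female, n_shuffles=10_000, seed=0):
    """Median difference (female minus male) under repeated label shuffles."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    is_female = np.asarray(is_female, dtype=bool)
    diffs = np.empty(n_shuffles)
    for i in range(n_shuffles):
        # Permuting the labels reassigns people without replacement,
        # so the two group sizes stay the same in every shuffle.
        shuffled = rng.permutation(is_female)
        diffs[i] = np.median(scores[shuffled]) - np.median(scores[~shuffled])
    return diffs

# Made-up scores for 6 "female" and 5 "male" labels, for illustration only.
scores = np.array([52, 47, 63, 57, 50, 44, 68, 55, 60, 39, 71])
is_female = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)

observed = np.median(scores[is_female]) - np.median(scores[~is_female])
diffs = shuffled_median_diffs(scores, is_female, n_shuffles=2000)

# Two-sided p-value: share of shuffles at least as extreme as the observed value.
p_value = np.mean(np.abs(diffs) >= abs(observed))
```

Replacing `rng.permutation(is_female)` with a draw that allows repeats would turn this into a bootstrap-style resample, which is not what Problem 2 asks for.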

    The confidence interval captures this variability in the median difference of these random samples. Displaying the confidence interval means printing out the lower and upper bound values of the median difference associated with the interval. This requires taking all the median-difference values generated and finding which values correspond to the lower and upper bounds, according to their quantiles. For example, for a 90% confidence interval, I would find the 5% (0.05) quantile for the lower bound and the 95% (0.95) quantile for the upper bound. I then know that 90% of the values fall between these two bounds (95% - 5% = 90%). See https://numpy.org/doc/stable/reference/generated/numpy.quantile.html
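The quantile lookup can be sketched with `numpy.quantile`. The array `median_diffs` below holds placeholder random numbers standing in for the simulated median differences; it is an assumption for illustration, not real results.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder values standing in for the simulated median differences.
median_diffs = rng.normal(loc=0.0, scale=2.0, size=10_000)

# 90% interval: the 0.05 quantile is the lower bound, 0.95 the upper bound.
lower, upper = np.quantile(median_diffs, [0.05, 0.95])

# Sanity check: the share of values between the bounds should be about 0.90.
inside = np.mean((median_diffs >= lower) & (median_diffs <= upper))

print(f"90% interval: [{lower:.2f}, {upper:.2f}]")
```

For a different confidence level, change the pair of quantiles symmetrically, for example `[0.025, 0.975]` for 95%.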