If 60% of all women are employed outside the home, find the probability that in a sample of 20 women, exactly 15 are employed outside the home.

STA2023

EXAM 3-Part 2

1-) A study was conducted to determine the number of cell phones each household has.  The data are shown here.

Number of cell phones 0 1 2 3 4
Frequency 2 30 48 13 7

 

  • Construct a probability distribution.
  • Find the mean of the probability distribution.
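One way to approach the two tasks above is sketched here, assuming the probabilities are taken as relative frequencies from the table. The function names are illustrative, not part of the exam.

```python
from fractions import Fraction

# Frequency table from the problem statement: number of cell phones -> households.
counts = {0: 2, 1: 30, 2: 48, 3: 13, 4: 7}

def probability_distribution(freq):
    """Convert a frequency table into P(X = x) using relative frequencies."""
    total = sum(freq.values())
    return {x: Fraction(f, total) for x, f in freq.items()}

def distribution_mean(dist):
    """Mean of a discrete distribution: the sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())

dist = probability_distribution(counts)
mean = distribution_mean(dist)

print({x: float(p) for x, p in dist.items()})
print(float(mean))
```

Exact fractions are used so that the probabilities sum to exactly 1 before they are converted to decimals for display.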

2-) If 60% of all women are employed outside the home, find the probability that in a sample of 20 women, exactly 15 are employed outside the home.
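A minimal sketch of the binomial calculation, assuming the 20 women are sampled independently with a success probability of 0.60 each. The helper name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# n = 20 women sampled, p = 0.60 employed outside the home, k = 15 successes.
prob = binomial_pmf(15, 20, 0.60)
print(round(prob, 4))
```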

3-) The average height of a certain age group of people is 53 inches.  The standard deviation is 4 inches.  If the variable is normally distributed, find the probability that a selected individual’s height will be greater than 59 inches.
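The tail probability can be checked with Python's standard library, as in this sketch. It standardizes 59 inches to a z-score and takes the upper-tail area of the normal curve.

```python
from statistics import NormalDist

# Heights are assumed normal with mean 53 inches and standard deviation 4 inches.
heights = NormalDist(mu=53, sigma=4)

z = (59 - 53) / 4                # standardized value of 59 inches
p_taller = 1 - heights.cdf(59)   # P(X > 59), the area to the right of z

print(z, round(p_taller, 4))
```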

4-) An irate student complained that the cost of textbooks was too high.  He randomly surveyed 36 other students and found that the mean amount of money spent for texts was $121.60.  If the standard deviation of the population was $6.36, find the 90% confidence interval of the true mean.
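Because the population standard deviation is given, a z-interval applies. The sketch below computes it with the standard library; `z_interval` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for a mean when the population sigma is known."""
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_critical * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Values from the problem: mean $121.60, sigma $6.36, n = 36, 90% confidence.
low, high = z_interval(121.60, 6.36, 36, 0.90)
print(round(low, 2), round(high, 2))
```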

5-) A recent study stated that if a person chewed gum, the average number of sticks of gum he or she chewed daily was 8.  To test the claim, a researcher selected a random sample of 36 gum chewers and found the mean number of sticks of gum chewed per day was 9.  The standard deviation of the population is 1. At , is the number of sticks of gum a person chews per day actually greater than 8?  Show hypothesis, test statistic or P-value, and the decision about the claim.
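The test statistic and P-value for this right-tailed z-test (H0: μ = 8 versus H1: μ > 8) can be sketched as follows. The significance level is missing from the problem text, so the code stops at the P-value. The decision follows by comparing that P-value with whichever significance level the exam specifies.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_z_test(sample_mean, mu0, sigma, n):
    """Return (z, p_value) for H0: mu = mu0 against H1: mu > mu0, sigma known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Values from the problem: claimed mean 8, sample mean 9, sigma 1, n = 36.
z, p_value = right_tailed_z_test(9, 8, 1, 36)
print(z, p_value)
```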

How many different combinations are possible for each draw? How many different entries are possible? Why were more lotteries introduced?

Working with Combinatorics Involving the Australian Lottery

Use the information below, Tatts Rules-of-Authorised-Lotteries 11-May-2022, and your knowledge of combinatorics to compare and contrast the different lotteries in NSW.

Some ideas you may like to explore in your report are:

How many different combinations are possible for each draw?
How many different entries are possible?

Cost v winnings
Why were more lotteries introduced?
Are the prizes reflective of the division?
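For the question about the number of combinations per draw, a counting sketch may help. The game format below (6 numbers from a pool of 45) is a hypothetical placeholder, not taken from the Tatts rules document. Substitute the pool size and numbers drawn for each lottery you compare.

```python
from math import comb

def draw_combinations(pool_size, numbers_drawn):
    """Number of unordered selections of numbers_drawn balls from pool_size."""
    return comb(pool_size, numbers_drawn)

# Hypothetical format for illustration only: choose 6 numbers from 45.
example = draw_combinations(45, 6)
print(example)
print(1 / example)  # chance that one standard entry matches every drawn number
```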

You are allowed and encouraged to use tables, graphs or graphics to communicate your ideas.

Let X be a uniform r.v. over the interval (1, 3]. What is cdf and pdf of Y = 1/X? Consider the Laplace variable X with pdf given by

Probability

Consider the limiter g(x) shown in the figure below (left). What is the pdf of the output variable Y = g(X) if the input variable is Gaussian with mean a/2 and variance 1.

Let X be a uniform r.v. over the interval (1, 3]. What is cdf and pdf of Y = 1/X?
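A sketch of the change-of-variables argument for this problem, using only the uniform cdf F_X(x) = (x − 1)/2 on 1 < x ≤ 3:

```latex
% Y = 1/X takes values in [1/3, 1) because X takes values in (1, 3].
F_Y(y) = P\!\left(\frac{1}{X} \le y\right) = P\!\left(X \ge \frac{1}{y}\right)
       = 1 - \frac{1/y - 1}{2} = \frac{3 - 1/y}{2},
       \qquad \frac{1}{3} \le y < 1,

% with F_Y(y) = 0 for y < 1/3 and F_Y(y) = 1 for y \ge 1.
f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{1}{2y^{2}},
       \qquad \frac{1}{3} \le y < 1, \quad \text{and } 0 \text{ otherwise.}
```

As a check, the density integrates to (1/2)(3 − 1) = 1 over [1/3, 1).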

Consider the Laplace variable X with pdf given by

f_X(x) = (a/2) e^(−a|x|)

What is the cdf and pdf of Z = X^3?
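A sketch of the derivation, assuming the pdf is f_X(x) = (a/2) e^(−a|x|) with a > 0 and that the transformation is the cube Z = X^3. Both readings are reconstructions of the garbled problem text. Cubing is strictly increasing, so the cdf follows from the real cube root:

```latex
F_X(x) =
\begin{cases}
\tfrac{1}{2} e^{a x}, & x < 0, \\
1 - \tfrac{1}{2} e^{-a x}, & x \ge 0,
\end{cases}
\qquad
F_Z(z) = P(X^{3} \le z) = F_X\!\left(z^{1/3}\right),

f_Z(z) = f_X\!\left(z^{1/3}\right) \left| \frac{d}{dz} z^{1/3} \right|
       = \frac{a}{6}\, |z|^{-2/3}\, e^{-a |z|^{1/3}}, \qquad z \ne 0.
```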

 

Which movies tend to be better, originals or their sequels? Why do you think this? What kind of factors or data would you consider to answer this question? Include at least three in your response, including how they would help you answer the question.

Introduction:

More and more new movies today are sequels to earlier, successful movies. In this activity we will explore, using statistics, whether original movies or their sequels are generally better.

This activity will span two weeks of our course. In the first week you research the ratings of several movies and their sequels, and make a prediction, based only on your “gut,” on whether original movies tend to be better than their sequels, or whether sequels tend to be better. In the second week you will explore several statistics to see if the data supports a hypothesis that original movies or their sequels tend to be better.

Learning objectives:

  • Describe data by their average value, distribution, and variation including understanding the characteristics and importance of the normal distribution
  • Apply a basic-level understanding of statistical inference, statistical significance, margin of error, and hypothesis testing when encountering these topics in the real world

Procedure:

First phase of activity:

  • Create a table in your Word document. It should have 4 columns and 16 rows. Label the columns, “Original title,” “Sequel title,” “Original rating,” and “Sequel rating.”
  • Find at least 15 movies that have sequels, and enter the titles of the movies in the first two columns of the table.
  • Before we can begin any statistical analyses, we need data! Visit the Metacritic site (https://www.metacritic.com/) and look up and enter into your table the ratings for all of the movies and sequels you found.
  • Now answer the First Week Questions below in your Word document.
  • Note: You must include a citation for the Metacritic site you used to research the movie ratings.

Second phase of activity:

  • Now it’s time to use some statistics! Answer the questions below for the Second Week. You must include calculations for all of your data. You can scan or take a picture of the calculations to include in your Word document; just make sure the images in the document you submit are clear and complete. If you use technology to calculate the statistics, please discuss the specific steps you used with the technology to do the calculations.
  • You will also create a graph. You can use technology to graph, but whether you hand-draw or use technology, you must include the graphs in your Word document.
  • Submit the completed Word Document through this assignment in Blackboard.

Questions:

First phase of activity:

  1. Which movies tend to be better, originals or their sequels? Why do you think this?
  2. What kind of factors or data would you consider to answer this question? Include at least three in your response, including how they would help you answer the question.

Second phase of activity:

  1. Find the Mean, Median, Mode, and Standard Deviation of your two distributions (originals and sequels).
  2. Create a 5-number summary for each distribution
  3. Use your 5-number summary to build two boxplots on the same scale. Do not forget to label the graph appropriately.
  4. Create a histogram for each distribution. Are they skewed in one direction or symmetric? Do not forget to label the graphs appropriately.
  5. Which movies are better, the original or sequel? Use your data analysis to make a case for which is better, not your opinion. Remember, the answer could also be that the data doesn’t support a conclusion that either is consistently better.
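If you use technology for the Second Week statistics, the calculations could be sketched as below. The scores are placeholder numbers, not real Metacritic ratings. The quartiles use the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from the quartile rule in your textbook.

```python
import statistics as st

def describe(ratings):
    """Summary statistics and five-number summary for one list of ratings."""
    ordered = sorted(ratings)
    q1, median, q3 = st.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": st.mean(ordered),
        "median": st.median(ordered),
        "modes": st.multimode(ordered),   # every most-frequent value
        "stdev": st.stdev(ordered),       # sample standard deviation (n - 1)
        "five_number": (ordered[0], q1, median, q3, ordered[-1]),
    }

# Placeholder scores for illustration only; replace with your own table values.
summary = describe([50, 60, 70, 80, 90])
print(summary)
```

Running `describe` once on the originals and once on the sequels gives the two summaries needed for the boxplots.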

 

Construct 95% and 99% confidence intervals using your sample mean for each of the categories listed below.

Complete problems 1-2 below.

For all problems below you must show me work “by hand” as the examples in the textbook illustrate. I have also included video examples that show how to do all of the work, but also show calculator tips. If you are going to do work with pencil/paper then you need to give yourself time to scan your documents into a digital format. A picture of your work with your cell phone is not acceptable. Utilizing a phone scanner app like CamScan is fine, just make sure it is legible. CamScan will allow you to put all pages into one file directly and you can email it to yourself as a PDF file in order to upload it into Canvas. You can choose to type this portion, but you might have difficulty with some of the formulas so it may be easier to just show work by hand and scan.

You must submit ONE pdf file to me; if you use a scanner that takes multiple images you need to import them into ONE file for submission. The file can have as many pages as needed. MAKE SURE THE SCANS ARE LEGIBLE AND READABLE; IF YOUR BOSS WOULDN’T CONSIDER IT ACCEPTABLE THEN NEITHER WILL I. I will only grade ONE file, so if you have more, I will only grade one of them!!!!!!!

As always, work on this ahead of schedule to give yourself time to deal with technical issues that ALWAYS happen unexpectedly.
You will use the mean, standard deviation and sample size of these 2 categories: Cost of Tuition Out-of-State and Dollar Amount of Financial Aid per student.
You will use the mean of the proportions (rounded to 4 decimal places) and the mean of the enrollment (rounded to the nearest integer) for these 2 categories: Proportion of students receiving aid, and overall graduation rate.

Here are the problems you need to complete with your means and standard deviations from Phase 1, Part C:

1) Construct 95% and 99% confidence intervals using your sample mean for each of the categories listed below (2 intervals for each category; you will complete 4 total confidence intervals). Use your mean and standard deviation from Phase 1 to construct the confidence intervals for each category. Use t-intervals, assuming the distributions are normally distributed. Show the work of constructing the intervals by hand, similar to example 3 on page 313, and WRITE AN INTERPRETATION (like in example 3) FOR EACH CONFIDENCE INTERVAL YOU BUILD.

a) Cost of tuition: out-of-state

b) Dollar amount of financial aid per student (no loans)

2) Construct 95% and 99% confidence intervals using your sample mean for each of the categories listed below (2 intervals for each category; you will complete 4 total confidence intervals). Use a 1-proportion z-interval as done in example 2 on page 322; use your sample mean (rounded to 4 decimal places) from Excel as the sample proportion (p-hat) for each category, and use the mean of the enrollment as your sample size (rounded to the nearest integer).

a) Proportion of students receiving aid

b) Graduation rate
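The hand calculations for both problems can be checked with a sketch like the one below. The t critical value is passed in from a t-table (n − 1 degrees of freedom) because the Python standard library has no t-distribution. All inputs shown are placeholders, not values from the assignment data.

```python
from math import sqrt
from statistics import NormalDist

def t_interval(sample_mean, sample_sd, n, t_critical):
    """t-interval for a mean; t_critical is looked up with n - 1 df."""
    margin = t_critical * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

def proportion_z_interval(p_hat, n, confidence):
    """1-proportion z-interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_critical * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Placeholder inputs for illustration only.
t_low, t_high = t_interval(10, 2, 16, 2.131)          # 95% level, df = 15
p_low, p_high = proportion_z_interval(0.5, 100, 0.95)
print((t_low, t_high), (p_low, p_high))
```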

What are your major takeaways from the video? Did you find any of her points to be concerning? If so, which ones? How is this talk relatable to you and your future?

Signature assignment final draft

In this portion of the Signature Assignment, you will include additional information to your draft that will relate your budget to a TED Talk and apply the financial knowledge to solve a problem.

Paragraph 5: Personal Finance Crisis Reflection

Watch Elizabeth White’s TED Talk by clicking the following link. Captioning and transcripts are available at the link.

Write a paragraph summarizing your thoughts about Elizabeth White’s TED Talk.

What are your major takeaways from the video?
Did you find any of her points to be concerning? If so, which ones?
How is this talk relatable to you and your future?

Paragraph 6: Financial Application

Use the periodic payment formula from the Financial Literacy Unit (Chapter 10), and the amount you put aside for savings each month according to your budget, to determine the compound amount in an account that earns 2.5% compounded monthly after 1 year and then after 3 years.
Consider increasing the amount you put aside for savings each month by $100. If you saved $500 originally, now you are saving $600 per month. Now, determine the compound amount in an account that earns 2.5% compounded monthly after 1 year and then after 3 years.
You must thoroughly explain the formula you chose to use, the variables in the formula, and the steps you took to arrive at the answers. Round your answers to the nearest cent.
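A sketch of the savings calculation follows. It assumes that the Chapter 10 formula is the future value of an ordinary annuity, A = P[(1 + r/n)^(nt) − 1] / (r/n), with deposits made at the end of each month. Check this against your textbook before relying on it. The $500 and $600 amounts are the example figures from the prompt.

```python
def savings_future_value(payment, annual_rate, periods_per_year, years):
    """Future value of equal end-of-period deposits with compound interest."""
    rate_per_period = annual_rate / periods_per_year
    number_of_periods = periods_per_year * years
    growth = (1 + rate_per_period) ** number_of_periods
    return payment * (growth - 1) / rate_per_period

for monthly_deposit in (500, 600):
    for years in (1, 3):
        amount = savings_future_value(monthly_deposit, 0.025, 12, years)
        print(monthly_deposit, years, round(amount, 2))
```

Because the formula is proportional to the payment, raising the deposit from $500 to $600 scales every compound amount by the same factor of 1.2.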

Paragraphs 7 & 8: Financial Application Reflection and Conclusion

Reflect on your calculations between the original monthly savings and the new amount of savings (increased by $100). How did the compound amounts change after 1 year? How did the compound amounts change after 3 years? What can you conclude from your calculations? Would it be worth it to you to reduce some expenses in order to save an additional $100 per month?
Conclude the assignment with a short reflection on the financial literacy assignment and what you have learned from it.

Essay Guidelines

Convert all eight paragraphs, bibliography, and budget worksheet in the appendix to a single pdf file and upload it to Canvas. The final draft should be between five to seven pages long (inclusive of the bibliography and the appendix). Your final draft should meet the following formatting guidelines:

Suppose we further assume constant returns-to-scale: α + β = 1. Show that a bivariate regression of ln(Yi/Li) on ln(Ki/Li) (and a constant) identifies the production function parameters, maintaining the independence assumption in (b). How could we test the constant-returns-to-scale assumption here?

Homework (Mathematical Econometrics)

1. [32 points] You observe an iid sample of data (Yi, Li, Ki) across a set of manufacturing firms i. Here Yi denotes the output (e.g. total sales) of the firm in some period, Li measures the labor input (e.g. total wage bill) of the firm in this period, and Ki measures the capital input (e.g. total value of machines and other assets) of the firm in this period. We are interested in estimating a production function: i.e. the structural relationship determining a firm’s ability to produce output given a set of inputs.

 

(a) [6 points] Suppose you estimate a regression of ln Yi on ln Li and ln Ki (and a constant), where ln denotes the natural log. Explain how you would interpret the estimated coefficients on ln Li and ln Ki, without making any assumptions on the structural relationship.

(b) [8 points] Now suppose you assume a Cobb-Douglas production function: Yi = Qi Li^α Ki^β for some parameters (α, β), where Qi denotes the (unobserved) productivity of firm i. Suppose we assume productivity shocks are as-good-as-random across firms: i.e. that Qi is independent of (Li, Ki). Show that under this assumption the regression estimated in (a) identifies α and β.

(c) [8 points] Suppose we further assume constant returns-to-scale: α + β = 1. Show that a bivariate regression of ln(Yi/Li) on ln(Ki/Li) (and a constant) identifies the production function parameters, maintaining the independence assumption in (b). How could we test the constant-returns-to-scale assumption here?

(d) [10 points] Let’s now weaken the as-good-as-random assignment assumption in (b). Suppose we model Qi = Si^θ εi, where Si denotes the observed size of firm i, θ is a parameter governing the relationship between firm size and productivity, and εi is a productivity shock that is independent of (Si, Li, Ki). Specify a regression which identifies β and θ under this assumption, maintaining the assumption of α + β = 1. Do you expect the regression estimated in (c) to overstate or understate β, given the new model?
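The log-linearization behind parts (b) through (d) can be sketched as follows. It assumes the Cobb-Douglas form Y_i = Q_i L_i^α K_i^β and the size model Q_i = S_i^θ ε_i, which are reconstructions of the problem's notation. This is an outline of the algebra, not a complete identification proof.

```latex
% (b) Taking logs makes the production function linear in the parameters:
\ln Y_i = \alpha \ln L_i + \beta \ln K_i + \ln Q_i .

% (c) Imposing \alpha = 1 - \beta and subtracting \ln L_i from both sides:
\ln\!\left(\frac{Y_i}{L_i}\right) = \beta \ln\!\left(\frac{K_i}{L_i}\right) + \ln Q_i .

% (d) Substituting Q_i = S_i^{\theta} \varepsilon_i adds firm size as a regressor:
\ln\!\left(\frac{Y_i}{L_i}\right) = \beta \ln\!\left(\frac{K_i}{L_i}\right)
    + \theta \ln S_i + \ln \varepsilon_i .
```

In each case the regression constant absorbs the mean of the log residual term, and independence of that term from the regressors is what lets the slope coefficients recover the structural parameters.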

2. [32 points] Suppose we are interested in estimating the (potentially different) employment effects of minimum wage increases for high school dropouts and high school graduates. As in Card and Krueger (1994), we observe employment outcomes for a sample of individuals of both educational groups in New Jersey and Pennsylvania, before and after the New Jersey minimum wage increase. Let Yit denote the employment status of individual i at time t, let Di ∈ {0, 1} indicate an individual’s residence in New Jersey (assuming nobody moves between the two time periods), and let Postt ∈ {0, 1} indicate the latter time period. Furthermore let Gradi ∈ {0, 1} indicate high school graduation. Consider the regression of
Yit = μ + αDi + τPostt + γGradi + λDiPostt + δPosttGradi + ψDiGradi + πDiGradiPostt + νit.   (1)
Note that this regression includes all “main effects” (Di, Postt, and Gradi), all two-way interactions (DiPostt, PosttGradi, and DiGradi) as well as the three-way interaction DiGradiPostt.

(a) [7 Points] Suppose we regress Yit on Di, Postt, and DiPostt in the sub-sample of high school dropouts (with Gradi = 0). Derive the coefficients for this sub-sample regression in terms of the coefficients in the full-sample regression (1). Repeat this exercise for the saturated regression of Yit on Di, Postt, and DiPostt in the sub-sample of high school graduates (with Gradi = 1): what do the coefficients for this sub-sample regression equal, in terms of the coefficients in (1)?

(b) [8 Points] Extending what we saw in lecture, state assumptions under which these two sub-sample regressions (in the Gradi = 0 and Gradi = 1 subsamples) identify the causal effects of minimum wage increases on employment for high school dropouts and graduates, respectively. Prove your claims.

(c) [7 Points] Under the assumptions in (b), which coefficient in (1) yields a test for whether the minimum wage effects for high school dropouts and graduates differ? Use your answers in (a).

(d) [10 Points] Suppose New Jersey and Pennsylvania were on different employment trends when the minimum wage was increased, such that your assumptions in (b) fail. However, suppose the difference in employment trends across states is the same for high school dropouts and graduates. Show that under this weaker assumption the coefficient from (c) still identifies the difference in minimum wage effects across the groups.

Evaluate your methods. Use any methods or metrics you deem necessary. Interpret your parameter estimates. Did both inference methods converge on similar parameter estimates? Why or why not?

Probability

The goal of this assignment is to apply the model development and inference tools from class to Gaussian data. Dealing with the particulars of implementation will help in the development of your final project. You should hand in your code and text.

Mixture modelling

A set of observations were generated according to the model

π ∼ Dir(1, 1, 1)

z_i ∼ Cat(π), i = 1, …, n

μ_k ∼ MVN(0, 10 I_2), k = 1, …, K

x_i ∼ MVN(μ_{z_i}, I_2), i = 1, …, n

where Dir is a Dirichlet distribution, Cat is a categorical distribution, MVN is a multivariate normal distribution, I_2 is a 2 × 2 identity matrix, K = 3, and n ∈ {250, 1000, 5000} (three distinct simulations).

Part 1a

  1. We gave you the generative model. Write the other two ways to specify a probabilistic model, namely, a plate diagram and joint probability distribution. Two observed random variables, x_i and x_j, are conditionally independent given what model variables, if any?
  1. Implement a Gibbs Sampler for the aforementioned model. Please document your derivation.

Hint: you may want to keep track of p(x, z, μ, π) or a similar quantity to test for convergence.
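The quantity mentioned in the hint can be read off the generative model. The sketch below writes the joint density and its log as a product over the model's factors; the N(· | ·, ·) notation for the multivariate normal density is a choice made here, not the course's.

```latex
p(x, z, \mu, \pi) = \mathrm{Dir}(\pi \mid 1, 1, 1)
    \prod_{k=1}^{K} \mathcal{N}(\mu_k \mid 0, 10 I_2)
    \prod_{i=1}^{n} \pi_{z_i}\, \mathcal{N}(x_i \mid \mu_{z_i}, I_2),

\log p(x, z, \mu, \pi) = \log \mathrm{Dir}(\pi \mid 1, 1, 1)
    + \sum_{k=1}^{K} \log \mathcal{N}(\mu_k \mid 0, 10 I_2)
    + \sum_{i=1}^{n} \left[ \log \pi_{z_i} + \log \mathcal{N}(x_i \mid \mu_{z_i}, I_2) \right].
```

Tracking the log joint across Gibbs sweeps gives a single scalar trace whose plateau is one common informal signal of convergence.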

Hint: You will almost surely want to work in log space when dealing with small probabilities.
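The log-space hint usually comes down to the log-sum-exp trick, sketched below in plain Python. The helper names are illustrative. In a Gibbs sampler, `normalize_log_weights` would turn the unnormalized log probabilities of the K cluster assignments for one point into a proper distribution.

```python
import math

def log_sum_exp(log_values):
    """Compute log(sum(exp(v))) stably by factoring out the largest term."""
    peak = max(log_values)
    if peak == -math.inf:
        return -math.inf  # every term has probability zero
    return peak + math.log(sum(math.exp(v - peak) for v in log_values))

def normalize_log_weights(log_weights):
    """Convert unnormalized log probabilities into probabilities summing to 1."""
    total = log_sum_exp(log_weights)
    return [math.exp(w - total) for w in log_weights]

# Exponentiating these directly would underflow to 0.0 in double precision.
probs = normalize_log_weights([-1000.0, -1001.0, -1002.0])
print(probs)
```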

 

Part 1b

  1. Implement mean-field variational inference for the aforementioned model. Commenting your code makes it easier to grant partial credit. Document your derivation.
  1. Apply your code to the provided simulated data on the course website (hw1_250.txt, hw1_1000.txt, hw1_5000.txt). How did you decide convergence for both inference algorithms?
  1. Evaluate your methods. Use any methods or metrics you deem necessary (e.g. figures, clustering metrics, runtime comparisons). Interpret your parameter estimates. Did both inference methods converge on similar parameter estimates? Why or why not?

Hint: You may use external libraries for evaluation. For instance, scikit-learn has a number of off-the-shelf options for evaluating clustering metho

 

Consider an execution history with 5 processes. One of the events ‘e’ has vector clock timestamp [1, 3, 1, 8, 4]. Which of the following statements best describes the situation?

Programming

Question #1 (1 point)

Note: ^ is the superscript operator and _ is the subscript operator. Hence, e_1^2 means an event 2 in process 1.

Consider 3 processes: P_1, P_2 and P_3. We have 5 events from P_1 (e_1^1 to e_1^5), 6 events from P_2 (e_2^1 to e_2^6), and 5 events from P_3 (e_3^1 to e_3^5). e_1^2 is a send event and its corresponding receive event is e_2^5. e_1^3 is a send event and its corresponding receive event is e_3^4. e_2^2 is a send event and its corresponding receive event is e_3^3. e_3^1 is a send event and its corresponding receive event is e_1^4. e_3^2 is a send event and its corresponding receive event is e_2^3. e_3^5 is a send event and its corresponding receive event is e_2^6.

Consider a cut C1 such that e_1^3, e_1^4, e_1^5, e_2^4, e_2^5, e_2^6 and e_3^5 are in the future of C1, and all other events are in the past of C1. Which of the following statements is true?

  1. C1 is a consistent cut.
  2. C1 is an inconsistent cut.
  3. C1 is not a cut.
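Whether a cut is consistent can be checked mechanically. In the sketch below, a cut is described by its frontier, the index of the last event of each process that lies in the past of the cut. The cut is treated as inconsistent if any message is received in the past but sent in the future. The data structures and names are illustrative.

```python
# Messages from the execution history, as (send_event, receive_event).
# Each event is a (process, index) pair, so e_1^2 is (1, 2).
MESSAGES = [
    ((1, 2), (2, 5)),
    ((1, 3), (3, 4)),
    ((2, 2), (3, 3)),
    ((3, 1), (1, 4)),
    ((3, 2), (2, 3)),
    ((3, 5), (2, 6)),
]

def is_consistent_cut(frontier, messages):
    """frontier[p] is the index of the last event of process p in the past."""
    for (send_proc, send_idx), (recv_proc, recv_idx) in messages:
        received_in_past = recv_idx <= frontier[recv_proc]
        sent_in_past = send_idx <= frontier[send_proc]
        if received_in_past and not sent_in_past:
            return False  # a receive without its send: an orphan message
    return True

# C1 as described: the past holds e_1^1..e_1^2, e_2^1..e_2^3 and e_3^1..e_3^4.
c1_frontier = {1: 2, 2: 3, 3: 4}
print(is_consistent_cut(c1_frontier, MESSAGES))
```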

Question #2 (1 point)

For the execution history in question 1, list all the earliest events at each of the processes that e_2^4 causally affects.

Question #3 (1 point)

For the execution history in question 1, list all the events in Max_Past(e_2^6).

Question #4 (1 point)

For the execution history in question 1, what is the scalar clock timestamp for e_2^6?

Question #5 (1 point)

For the execution history in question 1, what is the vector clock timestamp for e_2^6?
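Scalar and vector timestamps for the whole history can be computed with a small simulation. The sketch assumes the usual textbook conventions: every clock starts at zero, each event increments its own process's entry by 1, and a receive first takes the component-wise maximum with the sender's timestamp. If your course uses a different increment rule, adjust accordingly.

```python
EVENT_COUNTS = {1: 5, 2: 6, 3: 5}          # process -> number of events
MESSAGES = {                                # send event -> receive event
    (1, 2): (2, 5), (1, 3): (3, 4), (2, 2): (3, 3),
    (3, 1): (1, 4), (3, 2): (2, 3), (3, 5): (2, 6),
}

def compute_clocks(event_counts, messages):
    """Return (vector, scalar) timestamp dictionaries keyed by (process, index)."""
    send_of = {recv: send for send, recv in messages.items()}
    procs = sorted(event_counts)
    vector, scalar = {}, {}
    next_index = {p: 1 for p in procs}
    remaining = sum(event_counts.values())
    while remaining:
        progressed = False
        for p in procs:
            k = next_index[p]
            if k > event_counts[p]:
                continue                    # this process has finished
            send = send_of.get((p, k))
            if send is not None and send not in vector:
                continue                    # the matching send is not stamped yet
            v = list(vector.get((p, k - 1), [0] * len(procs)))
            s = scalar.get((p, k - 1), 0)
            if send is not None:            # receive: merge the sender's clocks
                v = [max(a, b) for a, b in zip(v, vector[send])]
                s = max(s, scalar[send])
            v[procs.index(p)] += 1
            vector[(p, k)], scalar[(p, k)] = v, s + 1
            next_index[p] += 1
            remaining -= 1
            progressed = True
        if not progressed:
            raise ValueError("message pattern is not a valid execution")
    return vector, scalar

vector, scalar = compute_clocks(EVENT_COUNTS, MESSAGES)
print(scalar[(2, 6)], vector[(2, 6)])
```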

Question #6 (1 point)

For the execution history in question 1, what is the vector clock timestamp for e_2^6, when implemented using differential vector clocks?

Question #7 (1 point)

For the execution history in question 1, what is the vector clock timestamp for e_2^6 during execution, when implemented using direct dependency technique?

Question #8 (1 point)

Consider an execution history with 5 processes. One of the events ‘e’ has vector clock timestamp [1, 3, 1, 8, 4]. Which of the following statements best describes the situation?

  1. Exactly 17 events causally precede e.
  2. At most 16 events causally precede e.
  3. At most 17 events causally precede e.
  4. At least 17 events causally precede e.
  5. None of the other options is true.
  6. Exactly 16 events causally precede e.
  7. At least 16 events causally precede e.

Question #9 (1 point)

For the execution history in question 1, what is the timestamp sent by the differential vector clock along with the message from e_3^5 to e_2^6?

Question #10 (1 point)

For the execution history in question 1, what is the timestamp sent by the direct dependency technique along with the message from e_3^5 to e_2^6?

Question #11 (1 point)

In 1 sentence, explain why strong consistency is an important property for developing logical clocks.

 

Summarize the problem with the appliance manufacturing firm’s toaster. Propose the statistical inference to use to solve the problem. Support your decision using a scholarly reference.

Case Study: Statistical Inference

Overview

The research department of an appliance manufacturing firm has developed a new bimetallic thermal sensor for its toaster. The new bimetallic thermal sensor can sense the temperature of the bread and move the lever arm to activate the switch. The research department claims that the new bimetallic thermal sensor will reduce appliance returns under the one-year full warranty by 2%–6%. To determine if the claim can be supported, the testing department selects a group of the toasters manufactured with the new bimetallic thermal sensor and a group with the old thermal sensor and subjects them to a normal year’s worth of wear. Out of 250 toasters tested with the new bimetallic thermal sensor, 8 would have been returned. Seventeen would have been returned out of the 250 toasters with the old thermal sensor. As the manager of the appliance manufacturing process, use a statistical procedure to verify or refute the research department’s claim.

Instructions

Create 8–10 slides, including a cover and a sources list, for a presentation to the director of the manufacturing plant in which you:

Summarize the problem with the appliance manufacturing firm’s toaster.
Propose the statistical inference to use to solve the problem. Support your decision using a scholarly reference.

Using Excel:
Develop a flowchart for the proposed statistical inference, including specific steps.

Compute all statistical calculations.
Place your flowchart in a slide.
Determine if you can verify or refute the research department’s claim.
Choose sources that are credible, relevant, and appropriate. Cite each source listed on your source page at least one time within your assignment.
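The assignment asks for the calculations in Excel; the Python sketch below shows one possible procedure for cross-checking them. It assumes a comparison of two independent proportions and a 95% confidence level, neither of which is specified in the case. It reports a pooled one-sided z-test and an unpooled confidence interval for the reduction in the return rate, which can then be compared with the claimed 2%–6% range.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_summary(x_new, n_new, x_old, n_old, confidence=0.95):
    """z-test and confidence interval for p_old - p_new (reduction in returns)."""
    p_new, p_old = x_new / n_new, x_old / n_old
    diff = p_old - p_new
    pooled = (x_new + x_old) / (n_new + n_old)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_old))
    z = diff / se_pooled
    p_value = 1 - NormalDist().cdf(z)       # one-sided: H1 is p_old > p_new
    se_unpooled = sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_critical * se_unpooled
    return {"diff": diff, "z": z, "p_value": p_value,
            "interval": (diff - margin, diff + margin)}

# Counts from the case: 8 of 250 returns (new sensor), 17 of 250 (old sensor).
result = two_proportion_summary(8, 250, 17, 250)
print(result)
```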