Descriptive Statistics Final Project with Python & R

Overview
Welcome to the Descriptive Statistics Final Project! In this project, you will demonstrate what you have learned in this course by conducting an experiment that involves drawing from a deck of playing cards and creating a writeup of your findings. Be sure to check through the project rubric to self-assess and share with others who will give you feedback.
Questions for Investigation
This experiment will require the use of a standard deck of playing cards. This is a deck of fifty-two cards divided into four suits (spades (♠), hearts (♥), diamonds (♦), and clubs (♣)), each suit containing thirteen cards (Ace, numbers 2-10, and face cards Jack, Queen, and King). You can use either a physical deck of cards for this experiment or a virtual deck of cards such as that found on random.org (http://www.random.org/playingcards/). For the purposes of this task, assign each card a value: the Ace takes a value of 1, numbered cards take the value printed on the card, and the Jack, Queen, and King each take a value of 10.
1. First, create a histogram depicting the relative frequencies of the card values.
2. Now, we will get samples for a new distribution. To obtain a single sample, shuffle your deck of cards and draw three cards from it. (You will be sampling from the deck without replacement.) Record the cards that you have drawn and the sum of the three cards’ values. Repeat this sampling procedure a total of at least thirty times.
3. Let’s take a look at the distribution of the card sums. Report descriptive statistics for the samples you have drawn. Include at least two measures of central tendency and two measures of variability.
4. Create a histogram of the sampled card sums you have recorded. Compare its shape to that of the original distribution. How are they different, and can you explain why this is the case?
5. Make some estimates about values you will get on future draws. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? Make sure you justify how you obtained your values.
Load the libraries
library(ggplot2)

Create card deck
suits <- c("D", "C", "H", "S")
cards <- c("A", as.character(seq(2,10)), "J", "Q", "K")
values <- c(1, 2:9, rep(10, 4))
# Build deck
deck <- expand.grid(cards=cards, suits=suits)
# values has 13 entries and is recycled across the four suits
deck$value <- values
deck

##    cards suits value
## 1      A     D     1
## 2      2     D     2
## 3      3     D     3
## 4      4     D     4
## 5      5     D     5
## 6      6     D     6
## 7      7     D     7
## 8      8     D     8
## 9      9     D     9
## 10    10     D    10
## 11     J     D    10
## 12     Q     D    10
## 13     K     D    10
## 14     A     C     1
## 15     2     C     2
## 16     3     C     3
## 17     4     C     4
## 18     5     C     5
## 19     6     C     6
## 20     7     C     7
## 21     8     C     8
## 22     9     C     9
## 23    10     C    10
## 24     J     C    10
## 25     Q     C    10
## 26     K     C    10
## 27     A     H     1
## 28     2     H     2
## 29     3     H     3
## 30     4     H     4
## 31     5     H     5
## 32     6     H     6
## 33     7     H     7
## 34     8     H     8
## 35     9     H     9
## 36    10     H    10
## 37     J     H    10
## 38     Q     H    10
## 39     K     H    10
## 40     A     S     1
## 41     2     S     2
## 42     3     S     3
## 43     4     S     4
## 44     5     S     5
## 45     6     S     6
## 46     7     S     7
## 47     8     S     8
## 48     9     S     9
## 49    10     S    10
## 50     J     S    10
## 51     Q     S    10
## 52     K     S    10

1. Create a histogram depicting the relative frequencies of the card values
First, create a histogram depicting the relative frequencies of the card values.
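# boundary=0.5 puts bin edges on half-integers, so each bar is centered on a card value
# (older ggplot2 releases called this argument origin)
ggplot(deck, aes(x=value)) +
  geom_histogram(binwidth=1, boundary=0.5, col="white", fill="royalblue", alpha=0.5) +
  labs(x="Value", y="Count", title="Card Value Histogram") +
  scale_x_continuous(breaks = seq(1,10)) +
  theme_bw()

2. Get samples for a new distribution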
Now, we will get samples for a new distribution. To obtain a single sample, shuffle your deck of cards and draw three cards from it. (You will be sampling from the deck without replacement.) Record the cards that you have drawn and the sum of the three cards’ values. Repeat this sampling procedure a total of at least thirty times.
df <- c()
for (i in 1:5000) {
  # Draw three cards without replacement and store the sum of their values
  df$value[i] <- sum(sample(deck$value, 3, replace = FALSE))
}
samples <- data.frame(df)

3. Report descriptive statistics for the samples
Let’s take a look at the distribution of the card sums. Report descriptive statistics for the samples you have drawn. Include at least two measures of central tendency and two measures of variability.
summary(samples)

##      value
##  Min.   : 3.00
##  1st Qu.:16.00
##  Median :20.00
##  Mean   :19.55
##  3rd Qu.:23.00
##  Max.   :30.00

4. Create a histogram of the sampled card sums you have recorded
Create a histogram of the sampled card sums you have recorded. Compare its shape to that of the original distribution. How are they different, and can you explain why this is the case?
# boundary=0.5 centers each bar on an integer sum (older ggplot2 releases called this origin)
ggplot(samples, aes(x=value)) +
  geom_histogram(binwidth=1, boundary=0.5, col="white", fill="royalblue", alpha=0.5) +
  labs(x="Value", y="Count", title="3 Card Sum Histogram (n=5000)") +
  theme_bw()

5. Make some estimates about values you will get on future draws
Make some estimates about values you will get on future draws. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? Make sure you justify how you obtained your values.
a. Within what range will you expect approximately 90% of your draw values to fall?
# The 5th and 95th percentiles bracket the middle 90% of the sampled sums
quantile(samples$value, probs = c(0.05, 0.95))

##  5% 95%
##  11  28

b. Approximate probability that you will get a draw value of at least 20?
Calculate the z score
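atleast <- 20
# Named sample_mean and sample_sd so the built-in mean() and sd() are not shadowed
sample_mean <- mean(samples$value)
sample_sd <- sd(samples$value)
z <- (atleast - sample_mean) / sample_sd
z

## [1] 0.08467613

We could look up the value in a Z-score table, but I want to calculate it.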
Convert the Z score to a p-value with the cumulative distribution function. We want the probability of a value larger than the given number, so we use lower.tail=FALSE.

cdf <- pnorm(z, lower.tail=FALSE)
cdf

## [1] 0.4662594

As we can see, the probability that we will get a draw value of at least 20 is 0.4662594.
Descriptive Statistics Final Project
Load the libraries
In [1]:
%matplotlib inline
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-colorblind'
matplotlib.style.use('seaborn-colorblind')

Create card deck
In [2]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
# list(...) is required on Python 3, where range() no longer returns a list
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)
deck

1. Create a histogram depicting the relative frequencies of the card values
First, create a histogram depicting the relative frequencies of the card values.
In [3]:
deck.hist()
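`deck.hist()` plots raw counts per bin, whereas the question asks for relative frequencies. As a minimal sketch using only the standard library (the names `suit_values`, `deck_values` and `rel_freq` are illustrative and not part of the notebook above), the shares can be tabulated directly:

```python
from collections import Counter
from fractions import Fraction

# One suit: Ace = 1, cards 2-10 at face value, Jack/Queen/King = 10.
suit_values = list(range(1, 11)) + [10] * 3
deck_values = suit_values * 4  # four suits, 52 cards in total

# Count how often each value occurs and divide by the deck size.
counts = Counter(deck_values)
rel_freq = {
    value: Fraction(n, len(deck_values))
    for value, n in sorted(counts.items())
}

for value, freq in rel_freq.items():
    print(f"{value:>2}: {str(freq):>5}  ({float(freq):.4f})")
```

Each of the values 1 to 9 accounts for 4 of the 52 cards (1/13), while the value 10 accounts for 16 of the 52 (4/13), which is why the last bar is so much taller than the others.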
2. Get samples for a new distribution
Now, we will get samples for a new distribution. To obtain a single sample, shuffle your deck of cards and draw three cards from it. (You will be sampling from the deck without replacement.) Record the cards that you have drawn and the sum of the three cards’ values. Repeat this sampling procedure a total of at least thirty times.
In [4]:
samples = []
for i in range(5000):
    # Draw three card values without replacement and record their sum
    samples.append(np.random.choice(deck, 3, replace=False).sum())
samples = pd.Series(samples)
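The simulation above is unseeded, so every run produces slightly different numbers. The following is a hedged sketch of a repeatable variant that uses only the standard library; the helper `draw_sum` and the seed value 42 are hypothetical choices made for illustration, not part of the original notebook.

```python
import random

# Card values for the full deck: Ace = 1, 2-10 at face value, J/Q/K = 10.
deck_values = (list(range(1, 11)) + [10] * 3) * 4


def draw_sum(rng, n_cards=3):
    """Return the sum of n_cards values drawn without replacement."""
    return sum(rng.sample(deck_values, n_cards))


rng = random.Random(42)  # fixed, arbitrary seed so reruns give the same draws
sums = [draw_sum(rng) for _ in range(5000)]

print(len(sums), min(sums), max(sums))
```

`random.Random.sample` never picks the same position twice, so each three-card hand is drawn without replacement, and the deck is implicitly "reshuffled" for every new hand.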
3. Report descriptive statistics for the samples
Let’s take a look at the distribution of the card sums. Report descriptive statistics for the samples you have drawn. Include at least two measures of central tendency and two measures of variability.
In [5]:
samples.describe()
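`describe()` already reports the mean, standard deviation and quartiles. To name two measures of central tendency and two of variability explicitly, a small standard-library sketch could look like the one below. It regenerates its own sample with an arbitrary seed, so its numbers will not match the notebook run exactly, and all variable names are illustrative.

```python
import random
import statistics

deck_values = (list(range(1, 11)) + [10] * 3) * 4
rng = random.Random(0)  # arbitrary seed, only for repeatability
sums = [sum(rng.sample(deck_values, 3)) for _ in range(5000)]

# Measures of central tendency
mean = statistics.mean(sums)
median = statistics.median(sums)
mode = statistics.mode(sums)

# Measures of variability
stdev = statistics.stdev(sums)  # sample standard deviation (n - 1 denominator)
q1, _, q3 = statistics.quantiles(sums, n=4)
iqr = q3 - q1  # interquartile range

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"stdev={stdev:.2f} IQR={iqr}")
```

As a sanity check, the expected value of a single card is 85/13 (about 6.54), so the mean of a three-card sum should land near 3 × 85/13 ≈ 19.6, in line with the mean of 19.55 in the R summary.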
4. Create a histogram of the sampled card sums you have recorded
Create a histogram of the sampled card sums you have recorded. Compare its shape to that of the original distribution. How are they different, and can you explain why this is the case?
In [6]:
samples.hist()
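To help compare the two shapes, the exact distribution of the three-card sum can be worked out by listing every possible hand instead of sampling. This is a sketch under the assumption that all C(52, 3) = 22,100 hands are equally likely; the text histogram and the name `sum_counts` are illustrative only.

```python
from collections import Counter
from itertools import combinations

deck_values = (list(range(1, 11)) + [10] * 3) * 4

# combinations() treats the 52 positions as distinct cards,
# so every three-card hand is counted exactly once.
sum_counts = Counter(sum(hand) for hand in combinations(deck_values, 3))
total = sum(sum_counts.values())

# Crude text histogram of the exact probabilities.
for s in range(3, 31):
    share = sum_counts[s] / total
    print(f"{s:>2} {share:6.3f} {'#' * round(share * 200)}")
```

The single-card distribution is flat with a spike at 10, whereas the sums pile up in the middle: far more hands produce a middling total than an extreme one (only 4 hands sum to 3, for example), and the many ten-valued cards pull the bulk of the distribution toward the upper end.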
5. Make some estimates about values you will get on future draws
Make some estimates about values you will get on future draws. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? Make sure you justify how you obtained your values.
a. Within what range will you expect approximately 90% of your draw values to fall?
In [7]:
samples.quantile(q=[.05,.95])
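One way to justify the range is to check how much of the sample actually falls between the two percentiles. The sketch below does this with the standard library on a freshly simulated sample (arbitrary seed, illustrative names), so its output may differ slightly from the notebook's.

```python
import random
import statistics

deck_values = (list(range(1, 11)) + [10] * 3) * 4
rng = random.Random(1)  # arbitrary seed, only for repeatability
sums = [sum(rng.sample(deck_values, 3)) for _ in range(5000)]

# n=20 yields 19 cut points at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(sums, n=20)
lo, hi = cuts[0], cuts[-1]

# Share of simulated draws that fall inside [lo, hi].
coverage = sum(lo <= s <= hi for s in sums) / len(sums)
print(f"5th percentile: {lo}, 95th percentile: {hi}, coverage: {coverage:.3f}")
```

Because the sums are integers, many draws tie exactly at the boundary values, so the interval usually covers a little more or a little less than 90% rather than exactly 90%.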
b. Approximate probability that you will get a draw value of at least 20?
Calculate the z score
$Z=\frac{X-\mu}{\sigma}$
In [8]:
z = (20 - samples.mean()) / samples.std()
z
We could look up the value in a Z-score table, but I want to calculate it.
Convert the Z score to a p-value with the survival function (also defined as 1 - cdf (cumulative distribution function), but sf is sometimes more accurate).
$S(t)=P(\{T>t\})=\int _{t}^{\infty }f(u)\,du=1-F(t)$
In [9]:
sf = stats.norm.sf(z)
print(sf)
As we can see, under this normal approximation the probability that we will get a draw value of at least 20 is about 0.470.
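The normal approximation treats the card sum as a continuous quantity, although it only takes integer values. As a cross-check, and not as the method used in this post, the sketch below compares three figures: the exact share of hands summing to at least 20, the plain normal approximation, and a normal approximation with a continuity correction (evaluated at 19.5 instead of 20). It assumes every three-card hand is equally likely and uses the exact mean and standard deviation of the sum; all names are illustrative.

```python
from itertools import combinations
from statistics import NormalDist, fmean, pstdev

deck_values = (list(range(1, 11)) + [10] * 3) * 4

# All C(52, 3) = 22,100 equally likely three-card sums.
all_sums = [sum(hand) for hand in combinations(deck_values, 3)]

# Exact probability: share of hands whose sum is at least 20.
exact = sum(s >= 20 for s in all_sums) / len(all_sums)

# Normal approximation built from the exact mean and standard deviation.
normal = NormalDist(fmean(all_sums), pstdev(all_sums))
plain = 1 - normal.cdf(20)        # treats the sum as continuous
corrected = 1 - normal.cdf(19.5)  # continuity correction for integer sums

print(f"exact: {exact:.3f}  normal: {plain:.3f}  corrected: {corrected:.3f}")
```

Under these assumptions the plain normal figure lands close to the 0.47 reported above, while the exact share comes out somewhat higher, a little over one half. That is consistent with the sample median of 20 in the R summary, and the continuity-corrected value sits between the two.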
7 comments on “Descriptive Statistics Final Project with Python & R”
Since R vectors cannot contain data of different types, this:
cards <- c("A", as.character(seq(2,10)), "J", "Q", "K")
is equivalent to this:
cards <- c("A", 2:10, "J", "Q", "K")
Hi George, that’s also possible, thanks for the mention.
Good intro to descriptive statistics in R…just one comment.
In the case of your first plot, I believe your aesthetics should be ‘x=value’, not ‘x=values’ …
ggplot(deck, aes(x=values))
…since that is how you named it when building the deck:
deck$value <- values
but I may be wrong 🙂
Hi Mirek,
You’re right, it’s not correct. In this case it still works because values is repeated over the rows of the deck, so the result stays the same.
I’ve changed my code.
Thanks for the mention 🙂