
Coursera | Applied Plotting, Charting & Data Representation in Python(University of Michigan)| Assignment4

   Links to all related assignments:
  Coursera | Applied Plotting, Charting & Data Representation in Python(University of Michigan)| Assignment1
  Coursera | Applied Plotting, Charting & Data Representation in Python(University of Michigan)| Assignment2
  Coursera | Applied Plotting, Charting & Data Representation in Python(University of Michigan)| Assignment3
  Coursera | Applied Plotting, Charting & Data Representation in Python(University of Michigan)| Week3 Practice Assignment
  Coursera | Applied Plotting, Charting & Data Representation in Python(University of Michigan)| Assignment4
   If I find the time (or there is demand), I will put all the code on GitHub.

Assignment 4 - Becoming an Independent Data Scientist


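  The final week's assignment once again nearly killed me :(
  This one is an independent assignment: you find the data yourself and set the question yourself, so there is a lot of flexibility. I therefore spent a long time picking a topic, read many references, and got happily lost in Wikipedia. In the end I was too torn to decide, so I went with a topic related to the example, which also ties in with my earlier Introduction to Data Science in Python | Assignment4. Feel free to use it as a reference. Discussion and suggestions are welcome~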

Peer Review

Code

Before working on this assignment please read these instructions fully. In the submission area, you will notice that you can click the link to Preview the Grading for each step of the assignment. This is the criteria that will be used for peer grading. Please familiarize yourself with the criteria before beginning the assignment.

This assignment requires that you find at least two datasets on the web which are related, and that you visualize these datasets to answer a question with the broad topic of sports or athletics (see below) for the region of Farmington, Michigan, United States, or United States more broadly.

You can merge these datasets with data from different regions if you like! For instance, you might want to compare Farmington, Michigan, United States to Ann Arbor, USA. In that case at least one source file must be about Farmington, Michigan, United States.

You are welcome to choose datasets at your discretion, but keep in mind they will be shared with your peers, so choose appropriate datasets. Sensitive, confidential, illicit, and proprietary materials are not good choices for datasets for this assignment. You are welcome to upload datasets of your own as well, and link to them using a third party repository such as github, bitbucket, pastebin, etc. Please be aware of the Coursera terms of service with respect to intellectual property.

Also, you are welcome to preserve data in its original language, but for the purposes of grading you should provide english translations. You are welcome to provide multiple visuals in different languages if you would like!

As this assignment is for the whole course, you must incorporate principles discussed in the first week, such as having a high data-ink ratio (Tufte) and aligning with Cairo’s principles of truth, beauty, function, and insight.

Here are the assignment instructions:

  • State the region and the domain category that your data sets are about (e.g., Farmington, Michigan, United States and sports or athletics).
  • You must state a question about the domain category and region that you identified as being interesting.
  • You must provide at least two links to available datasets. These could be links to files such as CSV or Excel files, or links to websites which might have data in tabular form, such as Wikipedia pages.
  • You must upload an image which addresses the research question you stated. In addition to addressing the question, this visual should follow Cairo’s principles of truthfulness, functionality, beauty, and insightfulness.
  • You must contribute a short (1-2 paragraph) written justification of how your visualization addresses your stated research question.

What do we mean by sports or athletics? For this category we are interested in sporting events or athletics broadly; please feel free to creatively interpret the category when building your research question!

Tips

  • Wikipedia is an excellent source of data, and I strongly encourage you to explore it for new data sources.
  • Many governments run open data initiatives at the city, region, and country levels, and these are wonderful resources for localized data sources.
  • Several international agencies, such as the United Nations, the World Bank, and the Global Open Data Index, are other great places to look for data.
  • This assignment requires you to convert and clean datafiles. Check out the discussion forums for tips on how to do this from various sources, and share your successes with your fellow students!

Example

Looking for an example? Here’s what our course assistant put together for the Ann Arbor, MI, USA area using sports and athletics as the topic. Example Solution File

Project info

Name:

Summary of win percentages for the Big4 sports teams in Michigan

Region:

Michigan, United States

Category:

Sports and Athletics

Question:

What are the win percentages of the Big4 sports teams in Michigan, and how are they trending?

Links:

List_of_Detroit_Lions_seasons

List_of_Detroit_Tigers_seasons

List_of_Detroit_Pistons_seasons

List_of_Detroit_Red_Wings_seasons

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import matplotlib.colors as col
import matplotlib.cm as cm
import seaborn as sns
import re

%matplotlib notebook

# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-colorblind'
plt.style.use('seaborn-colorblind')

Load data and clean data

!pip install lxml
dict_datasets={
"Tigers":"List of Detroit Tigers seasons - Wikipedia.html",
"Lions":"List of Detroit Lions seasons - Wikipedia.html",
"Pistons":"List of Detroit Pistons seasons - Wikipedia.html",
"RedWings":"List of Detroit Red Wings seasons - Wikipedia.html",
}

# Lions
df_lions=pd.read_html(dict_datasets['Lions'])[1][6:92]

lions=pd.DataFrame()
lions['Year']=df_lions['NFL season']['NFL season']
lions['Wins']=df_lions['Regular season']['Wins'].astype(int)
lions['Losses']=df_lions['Regular season']['Losses'].astype(int)

lions['Win %_Lions']=lions['Wins']/(lions['Wins']+lions['Losses'])

# Tigers
df_tigers=pd.read_html(dict_datasets['Tigers'])[3]

tigers=pd.DataFrame()
tigers[['Year','Wins','Losses']]=df_tigers[['Season','Wins','Losses']].copy()
tigers['Year']=tigers['Year'].astype(str)
tigers['Year']=tigers['Year'].astype(object)
tigers['Wins']=tigers['Wins'].astype(int)
tigers['Losses']=tigers['Losses'].astype(int)
tigers['Win %_Tigers']=tigers['Wins']/(tigers['Wins']+tigers['Losses'])

# Pistons
df_pistons=pd.read_html(dict_datasets['Pistons'])[1][11:74]

pistons=pd.DataFrame()
pistons['Year']=df_pistons['Team Season'].str[:4]
pistons[['Wins','Losses']]=df_pistons[['Wins','Losses']]
pistons['Wins']=pistons['Wins'].astype(int)
pistons['Losses']=pistons['Losses'].astype(int)

pistons['Win %_Pistons']=pistons['Wins']/(pistons['Wins']+pistons['Losses'])

# Red Wings
df_redw=pd.read_html(dict_datasets['RedWings'])[2][:94]

redw=pd.DataFrame()
redw['Year']=df_redw['NHL season']['NHL season'].str[:4]
redw[['Wins','Losses']]=df_redw['Regular season[3][6][7][8]'][['W','L']]
redw=redw.set_index('Year')

# missing 2004
redw.loc['2004',['Wins','Losses']]=redw.loc['2003'][['Wins','Losses']]

redw['Wins']=redw['Wins'].astype(int)
redw['Losses']=redw['Losses'].astype(int)

redw['Win %_RedWings']=redw['Wins']/(redw['Wins']+redw['Losses'])
redw=redw.reset_index()

# Merge data for visualization
Big4_Michigan=pd.merge(lions.drop(['Wins','Losses'], axis=1),tigers.drop(['Wins','Losses'], axis=1),on='Year')
Big4_Michigan=pd.merge(Big4_Michigan,pistons.drop(['Wins','Losses'], axis=1),on='Year')
Big4_Michigan=pd.merge(Big4_Michigan,redw.drop(['Wins','Losses'], axis=1),on='Year')

Visualize-KDE

%matplotlib notebook
# Draw KDE
kde=Big4_Michigan.plot.kde()
[kde.spines[loc].set_visible(False) for loc in ['top', 'right']]
kde.axis([0,1,0,6])
kde.set_title('KDE of Big4 Win % in Michigan\n(1957-2019)',alpha=0.8)
kde.legend(['Lions','Tigers','Pistons','Red Wings'],loc = 'best',frameon=False, title='Big4', fontsize=10)

Visualize-Line Plot

Big4_Michigan_0019=Big4_Michigan[40:]
fig, ((ax1,ax2), (ax3,ax4)) = plt.subplots(2, 2, sharex=True, sharey=True)
axs=[ax1,ax2,ax3,ax4]

fig.suptitle('Big4 Win % in Michigan\n(2000-2019)',alpha=0.8);

# Properties
columns_w=['Win %_Lions','Win %_Tigers','Win %_Pistons','Win %_RedWings']
colors=['g','b','y','r']
titles=['NFL: Lions','MLB: Tigers','NBA: Pistons','NHL: Red Wings']
axis=[0,20,0,0.8]

y=0.5

for i in range(len(axs)):

    # Draw the subplot
    ax=axs[i]
    # ax.plot(Big4_Michigan_0019['Year'],Big4_Michigan_0019[columns_w[i]],c=colors[i], alpha=0.5)
    # sns.lineplot(x=Big4_Michigan_0019['Year'],y=Big4_Michigan_0019[columns_w[i]], alpha=0.5,ax=ax)
    sns.pointplot(x=Big4_Michigan_0019['Year'],y=Big4_Michigan_0019[columns_w[i]],scale = 0.7, alpha=0.5,ax=ax)
    ax.axhline(y=0.5, color='gray', linewidth=1, linestyle='--')
    ax.fill_between(range(0,20), 0.5, Big4_Michigan_0019[columns_w[i]],where=(Big4_Michigan_0019[columns_w[i]]<y), color='red',interpolate=True, alpha=0.3)
    ax.fill_between(range(0,20), 0.5, Big4_Michigan_0019[columns_w[i]],where=(Big4_Michigan_0019[columns_w[i]]>y), color='blue',interpolate=True, alpha=0.3)

    # Beautify the plot
    [ax.spines[loc].set_visible(False) for loc in ['top', 'right']] # Turn off some plot rectangle spines
    ax.set_ylabel('Win % ', alpha=0.8)
    ax.set_xlabel('Year', alpha=0.8)
    ax.set_title(titles[i], fontsize=10, alpha=0.8)
    ax.axis(axis)
    ax.set_xticks(np.append(np.arange(0, 20, 5),19))
    ax.set_xticklabels(['2000','2005','2010','2015','2019'], fontsize=8, alpha=0.8)
    for label in ax.get_xticklabels() + ax.get_yticklabels():
        label.set_fontsize(8)
        label.set_bbox(dict(facecolor='white',edgecolor='white', alpha=0.8))

Discussion

Justification:

Before visualizing, the data are loaded from Wikipedia and cleaned as needed. For example, tie games are excluded from the win percentage, and missing season data (the Red Wings' 2004 season) are filled with the values from the neighboring year (2003).
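
The fill step can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's exact code: the frame `seasons` and all of its values are made up, and it assumes a missing season should simply copy the previous one.

```python
import pandas as pd

# Hypothetical toy records (values are made up); the 2004 season is absent.
seasons = pd.DataFrame(
    {"Wins": [48, 48, 58], "Losses": [20, 21, 16]},
    index=["2002", "2003", "2005"],
)

# Reindex over every expected season so the gap appears as a NaN row,
# then copy the previous season's values forward into it.
all_years = [str(year) for year in range(2002, 2006)]
filled = seasons.reindex(all_years).ffill().astype(int)

print(filled)
```

Copying a neighboring season keeps the merge on `Year` from dropping that row, at the cost of adding one data point that was never actually played.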

Two visualizations answer the question. The first figure, the KDE, shows the kernel density estimate of the four teams’ win percentages from 1957 to 2019. From it we can read off summary statistics for each team, such as the mean and the standard deviation. For instance, the Tigers keep a relatively stable win percentage across most seasons and show less variation than the other teams. Moreover, the win percentages of the Pistons and the Lions tend to sit slightly below 0.5.
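
The mean and standard deviation read off the KDE can also be computed directly. The sketch below uses made-up win percentages for two hypothetical teams, `Steady` and `Streaky`; it is only meant to show how a narrow curve and a wide curve differ numerically.

```python
import pandas as pd

# Hypothetical win percentages (made-up values), one column per team.
win_pct = pd.DataFrame(
    {
        "Steady": [0.50, 0.52, 0.48, 0.50],
        "Streaky": [0.30, 0.70, 0.25, 0.75],
    }
)

# The mean locates the center of each KDE curve;
# the standard deviation measures how wide the curve is.
summary = win_pct.agg(["mean", "std"])
print(summary)
```

Both toy teams average 0.5, but the smaller standard deviation of `Steady` corresponds to a taller, narrower KDE curve.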

To look at the trend of each team's win percentage, the data from 2000 to 2019 are extracted for the second visualization, where the trend for each team is easy to see. For example, over the last 20 years, the win percentages of the Lions and the Tigers were frequently below 0.5, which is not encouraging. Furthermore, although the Red Wings' win percentages were good most of the time, they are trending downward, so there is no guarantee the team will keep it up next season.
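
One way to put a number on such a trend is a straight-line fit. This is a rough sketch on made-up values, not part of the original notebook, and a linear fit is only an assumption about how to summarize the trend.

```python
import numpy as np

# Hypothetical win percentages for five consecutive seasons (made-up values).
win_pct = np.array([0.70, 0.65, 0.60, 0.55, 0.50])
season_index = np.arange(len(win_pct))

# Fit a straight line; np.polyfit returns coefficients from highest degree down.
slope, intercept = np.polyfit(season_index, win_pct, deg=1)

print(f"change per season: {slope:.3f}")
```

A negative slope means the win percentage falls on average from one season to the next; it describes the past and does not by itself predict the next season.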

Principles (truthfulness, beauty, functionality, insightfulness):

Truthfulness comes first. All data, including Wins and Losses, are taken from Wikipedia, and win percentages are calculated with the formula Wins / (Wins + Losses). All data cleaning was done carefully.
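
As a small worked example of the formula, the sketch below uses made-up season records that include a hypothetical `Ties` column, to show that ties stay out of the denominator.

```python
import pandas as pd

# Hypothetical season records (made-up values) that include tie games.
records = pd.DataFrame(
    {"Wins": [8, 4], "Losses": [7, 4], "Ties": [1, 0]},
    index=["1990", "1991"],
)

# Ties are left out of the denominator, matching Wins / (Wins + Losses).
records["Win %"] = records["Wins"] / (records["Wins"] + records["Losses"])
print(records)
```

For the first toy season this gives 8 / 15, about 0.533, whereas dividing by all 16 games would give exactly 0.5.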

As for beauty, all elements of the visualization are designed with care. For example, in the Big4 Win % in Michigan (2000-2019) figure, a horizontal line at 0.5 is drawn for each team. Win percentages above 0.5 are filled with blue and those below 0.5 with red for comparison. The colors are chosen to be vivid yet soft.

Functionality is equally important. We choose the KDE to visualize the four teams' win percentages over a long period, since it clearly shows their statistical distribution. For the trend, we use 20 years of data and a line plot, which helps us see how win percentages change through time.

Lastly, insightfulness. The horizontal line shows whether a team won more games than it lost in a given year, and the colored shading shows the team's overall record over the last 20 years. For example, comparing the red and blue areas suggests that the Red Wings have been competitive in the NHL.
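
The red-versus-blue comparison can be approximated numerically. The sketch below uses made-up values and sums the per-season margins around 0.5, which is only a rough stand-in for the shaded areas drawn by `fill_between`.

```python
import numpy as np

# Hypothetical win percentages for six seasons (made-up values).
win_pct = np.array([0.60, 0.70, 0.55, 0.45, 0.40, 0.65])
margin = win_pct - 0.5

# Sum the margins on each side of the 0.5 line.
above = margin[margin > 0].sum()    # corresponds to the blue region
below = -margin[margin < 0].sum()   # corresponds to the red region
winning_seasons = int((margin > 0).sum())

print(above, below, winning_seasons)
```

When `above` is clearly larger than `below`, the blue shading dominates the panel, which is the visual cue described in the text.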
