Course Repository
Course materials for General Assembly's Data Science course
Instructor: Ian Hansel
Teaching Assistant: Matt Gibson
Location: Level M, 56-58 York St Sydney NSW 2000
Dates: 21/03/2016 - 01/06/2016
Time: 6:00 p.m. - 9:00 p.m.
Schedule
Monday | Wednesday |
---|---|
21/03: Introduction | 23/03: Basics |
28/03: Easter Break | 30/03: Data Visualisation |
04/04: Linear Regression | 06/04: Logistic Regression |
11/04: Model Evaluation | 13/04: Regularisation |
18/04: Clustering | 20/04: Recommendation Engines |
25/04: Anzac Day | 27/04: Dimensionality Reduction |
02/05: Decision Trees | 04/05: Random Forests & Ensembling |
09/05: Cloud Computing | 11/05: Natural Language Processing |
16/05: Time Series | 18/05: Communication |
23/05: Graphs & Network Analysis | 25/05: Neural Networks & Deep Learning |
30/05: Course Review & Project Presentations | 01/06: Project Presentations |
Pre-Work
Installation and Setup
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "SYD_DAT_4" team and add your photo!
Resources
Readings
- Read the first two chapters of The Data Science Handbook
- Read the first two chapters of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Fourth%20Printing.pdf)
Optional
You're also more than welcome to do the following if you're keen to get ahead before your first class:
- Python Codecademy course
- Chapters 1, 2 and 5 of Python for Data Analysis
- Learn Python the Hard Way
- Command Line Crash Course
- Khan Academy on Probability
Course Project Information
The final project should represent significant original work applying data science techniques to an interesting problem. Final projects are individual work, but you should talk frequently with your instructors and classmates about them.
Address a data-related problem in your professional field or a field you're interested in. Pick a subject that you're passionate about; if you're strongly interested in the subject matter it'll be more fun for you and you'll produce a better project!
Look at past projects on GitHub for some ideas.
Guest Presentations
Over the course of the class we will have guest presenters talking to us about how they run data science teams in the real world. I encourage you to read up on these companies before the presentations so you have some background on the types of work they do.
Class 1: Introduction
- Slides
- Lab
- Introduction to General Assembly
- Course overview: our philosophy and expectations
- Tools: check for proper setup of Git, Anaconda, overview of Slack
Homework:
- Resolve any installation issues before next class.
- Make sure you have a GitHub profile and have created a repo called "SYD_DAT_4"
- Clone the class repo (this one!)
Optional:
- Read Analyzing the Analyzers for a useful look at the different types of data scientists.
- Read about Markdown Techniques
Class 2: Basics
- Slides
- Lab
- Running Jupyter notebooks
- Pandas
- Data Manipulation in Python
- Know how to submit homework (more git skills)
Extra Reading:
- Here's the Airbnb blog post I mentioned tonight: http://nerds.airbnb.com/scaling-data-science/
Class 3: Data Visualisation
- Slides
- Lab
- Understand the goals of data visualisation
- Visualise a data set
- Understand different graph types and when to use them
Extra Reading:
- http://www.theguardian.com/news/datablog/interactive/2014/apr/15/australia-football-interactive-statistics
- http://www.sbs.com.au/news/map/where-australias-immigrants-were-born-sydney
- http://small.mu/work
- The Largest Ever Analysis of Film Dialogue by Gender: 2,000 scripts, 25,000 actors, 4 million lines http://polygraph.cool/films/index.html
- http://graphics.latimes.com/kobe-every-shot-ever/
- https://twitter.com/upshotnyt/status/721738264022020096
- Things to consider when choosing colours for a graph: http://lisacharlotterost.github.io/2016/04/22/Colors-for-DataVis/
- http://flowingdata.com/2016/04/27/global-shipping-in-a-narrated-interactive-map/
Geographic Visualisation:
- https://developers.google.com/kml/documentation/
- http://www.gdal.org/ogr2ogr.html
- http://postgis.net/
- http://leafletjs.com/
- https://github.com/mbostock/topojson/wiki
Class 4: Linear Regression
- Slides
- Lab
- Understand the differences between supervised and unsupervised learning
- Describe the process of building a linear regression model
- Build a linear regression model and interpret the output
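To make the "build and interpret" objective concrete, here is a minimal sketch of ordinary least squares with a single feature, written in plain Python so it runs without any extra packages. The function names and the data are made up for illustration; in the lab you would more likely reach for a library from the Anaconda distribution.

```python
# Ordinary least squares for one feature, standard library only.
# The data points below are invented purely for illustration.

def fit_line(xs, ys):
    """Return (intercept, slope) that minimise the sum of squared errors."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / float(len(ys))
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print("intercept=%.2f slope=%.2f r2=%.3f" % (intercept, slope, r2))
```

The slope reads as "the expected change in y for a one-unit increase in x", and R-squared gives a rough sense of how much of the variation the line captures.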
Homework:
Class 5: Logistic Regression
- Slides
- Lab
- Understand when to use logistic regression and how the model is created
- How logistic regression differs from linear regression
- Build a logistic regression model and interpret the output
- Evaluate a logistic regression model
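As a rough illustration of how logistic regression differs from linear regression, the sketch below passes a linear predictor through the sigmoid function to get a probability, and scores predictions with log loss. The coefficients and data are hypothetical values chosen by hand, not fitted, and log loss is only one of several ways to evaluate such a model.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, intercept, slope):
    """Probability of the positive class for a single feature value x."""
    return sigmoid(intercept + slope * x)


def log_loss(ys, probabilities):
    """Average negative log-likelihood; lower is better."""
    total = 0.0
    for y, p in zip(ys, probabilities):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(ys)


# Hypothetical coefficients, chosen by hand for illustration only.
intercept, slope = -4.0, 2.0
xs = [0.5, 1.0, 3.0, 3.5]
ys = [0, 0, 1, 1]
probabilities = [predict_probability(x, intercept, slope) for x in xs]
odds_ratio = math.exp(slope)  # multiplicative change in the odds per unit of x
print("log loss: %.3f, odds ratio: %.2f" % (log_loss(ys, probabilities), odds_ratio))
```

Unlike a linear regression coefficient, the slope here acts on the log-odds, so exponentiating it gives the factor by which the odds change for each one-unit increase in x.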
Class 6: Model Evaluation
- Slides
- Lab
- Understand the importance of properly evaluating a model
- Explain the bias-variance trade-off
- Explain the basics of Cross-Validation
- Use Cross-Validation when building a model
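The basic mechanics of k-fold cross-validation can be sketched in a few lines. The example below splits the row indices into k folds and scores a deliberately simple baseline "model" that always predicts the training mean. The helper names are made up for this sketch, and it does not shuffle the data, which you would normally want to do first.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, test) index pairs, no shuffling."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def cross_val_mse(ys, k):
    """Cross-validated mean squared error of a predict-the-training-mean baseline."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / float(len(train))
        mse = sum((ys[i] - prediction) ** 2 for i in test) / float(len(test))
        scores.append(mse)
    return sum(scores) / len(scores)


ys = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0, 7.0, 9.0]  # invented values
print("5-fold CV MSE of the baseline: %.3f" % cross_val_mse(ys, 5))
```

Each observation lands in the test set exactly once, so every score is computed on data the model did not see during fitting.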
Extra Reading: Some good resources on data pre-processing and feature transformation:
- http://stats.stackexchange.com/questions/18844/when-and-why-to-take-the-log-of-a-distribution-of-numbers
- https://www.quora.com/When-is-it-good-to-do-feature-transformation
- http://topepo.github.io/caret/preprocess.html (in R, which we will use later in the course)
- https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-prepare-data/
Class 7: Regularisation
- Slides
- Lab
- Select variables for a regression model
- Know methods to automatically select variables for a model
- Explain the difference between Ridge regression and Lasso Regression
- Tie together the concepts of Regression models, bias-variance trade-off, and Cross Validation
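One way to see the difference between Ridge and Lasso is the simplest possible case: a single feature and no intercept, where both have closed-form solutions. This is a sketch under that simplifying assumption, with made-up data and helper names; real models with many features need an iterative solver.

```python
def ridge_coefficient(xs, ys, lam):
    """Minimise sum((y - b*x)^2) + lam * b^2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_coefficient(xs, ys, lam):
    """Minimise sum((y - b*x)^2) + lam * |b| for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-thresholding: shrink towards zero and stop at exactly zero.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


xs = [1.0, 2.0, 3.0]  # invented data; the unpenalised coefficient is 0.5
ys = [0.5, 1.0, 1.5]
for lam in [0.0, 4.0, 14.0]:
    print("lambda=%4.1f ridge=%.3f lasso=%.3f"
          % (lam, ridge_coefficient(xs, ys, lam), lasso_coefficient(xs, ys, lam)))
```

Ridge shrinks the coefficient smoothly but never quite reaches zero, while Lasso sets it to exactly zero once the penalty is large enough, which is why Lasso can double as an automatic variable selection method.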
Homework:
Class 8: Clustering
- Slides
- Lab
- Review unsupervised learning
- Understand the k-means algorithm
- Know if a cluster is a good fit
- Summarise a cluster
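The k-means loop alternates between assigning points to their nearest centroid and moving each centroid to the mean of its points. Here is a minimal plain-Python sketch with invented 2D data. To keep the result reproducible it starts from the first k points, which is an assumption of this sketch rather than the usual random initialisation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=20):
    """Return (centroids, labels) after running the basic k-means loop."""
    # Assumption: start from the first k points so the output is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / float(len(members)) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances; lower means tighter clusters."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, 2)
print(labels, inertia(points, centroids, labels))
```

Comparing inertia across different values of k is one rough way to judge fit, and the centroid coordinates themselves are a simple summary of each cluster.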
Pre-Reading:
- http://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify
- http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
Class 9: Recommendation Engines
- Slides
- Lab
- Know 2 different methods for giving recommendations
- Evaluate recommendation performance
- Extract extra information from the recommendation engine
Class 10: Dimensionality Reduction
- Slides
- Lab
- Know what dimensionality reduction is and when to apply it
- Explain Principal Component Analysis
- Explain Singular Value Decomposition
Online Resources:
- Interactive explainer http://setosa.io/ev/principal-component-analysis/
Class 11: Decision Trees
- Slides
- Lab
- Know the advantages and disadvantages of a decision tree
- Understand how a decision tree decides on the split to make
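To illustrate how a tree picks a split, the sketch below tries every midpoint between sorted feature values and keeps the threshold with the lowest weighted Gini impurity. Gini is one common criterion (entropy is another), and the data and function names here are invented for the example.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, higher for a more mixed node."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, gini(labels))  # no split at all is the starting point to beat
    n = float(len(values))
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2.0
        left = [label for v, label in zip(values, labels) if v <= threshold]
        right = [label for v, label in zip(values, labels) if v > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


ages = [22, 25, 30, 35, 40, 50]  # made-up feature
bought = ["no", "no", "no", "yes", "yes", "yes"]  # made-up target
threshold, impurity = best_split(ages, bought)
print("split at age <= %s, weighted Gini %.2f" % (threshold, impurity))
```

A full tree repeats this search over every feature and then recurses into each side of the chosen split.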
Homework:
Class 12: Random Forests and Ensembling
- Slides
- Lab
- Explain Ensemble learning
- Explain Bagging
- Explain Random Forests
- Explain Boosting
- Run through a Random Forest model and evaluate
Class 13: Cloud Computing
- Slides
- Explain what Spark is
- Be able to set up a Spark cluster
- Run a Spark job through Zeppelin
Pre-Reading:
- Read the Natural Language Toolkit (NLTK) website - http://www.nltk.org/ (5 mins)
- Read and be able to explain one use case of the Alchemy API (10 mins) http://www.alchemyapi.com/
- Download and install NLTK for Python (10 mins)
Class 14: Natural Language Processing
- Slides
- Lab
- Explain the main concepts in Natural Language Processing
- Understand how Sentiment Analysis works
- Be able to perform the preprocessing steps for NLP
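The usual preprocessing steps (lowercasing, tokenising, removing stop words, counting) can be sketched with the standard library alone. In class you would normally use NLTK for this. The tiny stop-word list and the function names below are illustrative assumptions, not NLTK's.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "it", "was"}


def preprocess(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def bag_of_words(text):
    """Count how often each remaining token appears."""
    return Counter(preprocess(text))


sentence = "The lab was great, and the slides were GREAT!"
print(preprocess(sentence))
print(bag_of_words(sentence).most_common(1))
```

The resulting counts are the kind of input a simple sentiment classifier works from.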
Pre-Class Setup:
- Read the first 2 chapters of Forecasting: Principles and Practice https://www.otexts.org/fpp (15 mins)
- Download and Install R (10 mins)
- Download and Install RStudio (3 mins)
- Download the forecast package in R (2 mins)
Class 15: Time Series
- Slides
- Explain what a time series is
- Explain why it needs different treatment
- Know how to decompose a time series
- Know what an ARIMA model is
- Know what an exponential model is
- Know how to evaluate a time series model
- Learn enough R to be dangerous
Class 16: Communication
- Slides
- Lab
- What makes for good data science communication
- How to present results
- Some approaches to networking with people in the industry
- What to expect in a data science interview
Extra Reading: Here are some of the topics that came up tonight:
- Data Science Hackathons, http://90seconds.com.au/anz-data-science-hackathon-2016-highlights/
- CDO roles, http://www.gartner.com/smarterwithgartner/understanding-the-chief-data-officer-role/ & http://www-07.ibm.com/au/pdf/GBE03607USEN9.pdf
- Yanir's blog, https://yanirseroussi.com/
- The Big Short, http://www.wired.com/2015/12/big-short-understanding-economics/
- Great story about the importance of crafting a story with numbers - http://www.nytimes.com/2016/04/24/opinion/sunday/what-happens-when-baseball-stats-nerds-run-a-pro-team.html
Class 17: Graphs and Network Analysis
- Slides
- Lab
- Explain the main concepts in Network Analysis
- Understand the applications of network analysis
- Explain centrality, communities, edges and nodes
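As a small illustration of nodes, edges and centrality, the sketch below builds an undirected graph from an edge list and computes degree centrality, which is the share of the other nodes each node is directly connected to. The names and edges are invented, and degree is only the simplest of several centrality measures.

```python
from collections import defaultdict


def build_graph(edges):
    """Turn a list of (node, node) edges into an undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbours) / float(n - 1) for node, neighbours in graph.items()}


# Made-up friendship network for illustration.
edges = [("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"), ("Bob", "Cat")]
centrality = degree_centrality(build_graph(edges))
for node in sorted(centrality, key=centrality.get, reverse=True):
    print("%s: %.2f" % (node, centrality[node]))
```

Here Ann is connected to everyone else, so she gets the maximum score of 1.0.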
Extra Materials:
- Interactive example, https://linkurio.us/panama-papers-how-linkurious-enables-icij-to-investigate-the-massive-mossack-fonseca-leaks/
Class 18: Neural Networks and Deep Learning
- Slides
- Lab
- Explain what a Neural Network is
- Why the resurgence in popularity?
- Understand the applications of Neural Networks
- Run a NN in TensorFlow
Extra Materials:
- http://www.wired.com/2016/01/googles-go-victory-is-just-a-glimpse-of-how-powerful-ai-will-be/
- http://www.wired.com/2014/01/geoffrey-hinton-deep-learning
- https://sites.google.com/site/deepernn/home/blog/briefsummaryofthepaneldiscussionatdlworkshopicml2015
- Interactive example for TensorFlow
Class 19: Course Review
Where To Now?
- Keep in touch via Slack
- Provide a tailored learning pathway based on your preferences
- Provide feedback on CVs (which you can take or leave)
- Hopefully you’ll get involved in the Sydney Data Science community (see the meetups listed below)
- Keep working on your projects and find new ones to work on (e.g. Kaggle competitions)
Meetups:
- http://www.meetup.com/Sydney-IBM-Open-Cloud-Meetup
- http://www.meetup.com/Sydney-Datapreneurs/
- http://www.meetup.com/tableau-enthusiasts/
- http://www.meetup.com/R-Users-Sydney/
- http://www.meetup.com/Docker-Online-Meetup/
- http://www.meetup.com/Big-Data-Analytics/
- http://www.meetup.com/Data-Science-Sydney/
- http://www.meetup.com/Sydney-Docker-User-Group/
- http://www.meetup.com/Sydney-Apache-Spark-User-Group/
- http://www.meetup.com/The-Sydney-Data-Science-Breakfast-Meetup-Group/