Lesson 1

From Deep Learning Course Wiki

Here's the lesson video and associated Lesson 1 Timeline.

The lesson 1 notebook is available in our github repo. Please read our github wiki page if you are new to github or need help accessing the notebooks.

Full notes for this lesson can be found at Lesson 1 Notes. The 'dogs & cats' data can be found here.

If you are having trouble getting AWS set up, check out the common problems and solutions we've posted on the AWS install wiki page.

Please help by collecting information relevant to lesson 1 here!

Reducing Cost of AWS

  • This post explains how to reduce cost by using a smaller EBS volume.

Fixes for issues in the lesson

  • The aws-alias.sh script Jeremy distributed during the lesson had the home directory hard-coded, which causes it to fail if your username is anything other than 'ubuntu'. He also added an alias during the class to get the IP address of the currently running instance. Both of these minor issues are addressed in the new version of the script.
  • The free t2.micro instance type recommended in the lesson does not have enough RAM to use the VGG16 model. To switch to t2.large, which has enough memory (but costs $0.10/hour), see the Change EC2 Size page.
  • If the data download does not work for you, you may need to update kaggle-cli using pip.

Overview of homework assignment

  1. Get an AWS instance running (g2, or m4 if you are not yet approved for g2), contacting support if necessary
  2. Set up ssh keys as per the instructions in the setup video
  3. Install the bash setup script on the server instance
  4. Launch jupyter notebook on the instance
  5. Once the notebook is running, review the lesson 1 notebook notes and run each code cell to figure out what the python code and VGG are doing
  6. Install the kaggle CLI on the server instance
  7. Use the kaggle CLI to download the current data for the Dogs vs. Cats Redux competition
  8. Arrange the new data in the same file structure that was used in the sample lesson 1 notebook
  9. Make a copy of the lesson 1 notebook and use the new copy to load the new Dogs vs. Cats data (if you copy the notebook outside of the course folder, don't forget utils.py, vgg16.py, etc.)
  10. Run the relevant code cells on the sample set of the new Dogs vs. Cats data to make predictions on the new image data set
  11. Once the sample set works, modify the jupyter notebook to use the new test data images
  12. Write a script that takes the predict() output for the new Dogs vs. Cats data and writes it to a new csv file in the format of the sample_submission.csv file that was downloaded with the Dogs vs. Cats data
  13. Submit the new submission.csv file to kaggle via the CLI tool
  14. Check the public leaderboard for your own ranking
  15. Modify or tune the code in the lesson 1 notebook to try to get into the top 50% of the current Dogs vs. Cats competition
  16. Start exploring the other datasets on kaggle and decide which one you or some teammates would like to study further during the course
  17. Download the new data to your EC2 instance and repeat the previous steps with your brand new data
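
Step 12 above could be sketched roughly as follows (Python 3). This is only an illustration, not the course's own script: the function name write_submission, the "id,label" header, and the assumption that test images are named like 12.jpg are all assumptions, so check the header and id format against the sample_submission.csv you downloaded.

```python
import csv
import os


def write_submission(filenames, dog_probs, out_path="submission.csv"):
    """Write one (id, label) row per test image, sorted by numeric id.

    filenames: paths such as "unknown/12.jpg" (assumed naming scheme).
    dog_probs: one predicted probability per file, in the same order.
    """
    if len(filenames) != len(dog_probs):
        raise ValueError("filenames and dog_probs must have the same length")

    rows = []
    for name, prob in zip(filenames, dog_probs):
        # "unknown/12.jpg" -> "12" -> 12
        stem = os.path.splitext(os.path.basename(name))[0]
        rows.append((int(stem), float(prob)))
    rows.sort()  # order by image id, not by file listing order

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])  # assumed header; verify it
        writer.writerows(rows)
    return rows
```

The file names and probabilities would typically come from the test batches and the prediction array. Pairing each prediction with its own file name, rather than assuming the files were read in numeric order, avoids mislabeled rows.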

Links and resources mentioned in the lesson


  • See this wiki page on getting started with Kaggle CLI
  • In Episode 002 and Episode 003 of the Startup Data Science podcast, Apurva Naik, Alex Au, and Edderic Ugaddan discuss what it's like to set up an AWS instance so that they could run the 7 lines of code that abstracted the VGG model, and offer some tips to speed up the installation process. They also talk about what they find difficult and what they are excited about with regard to the lesson material.
  • When you need to use a trained model for predictions on unseen/unlabeled test data, you can create a "test/unknown" subdirectory (as in "unknown class label") and place your data there. You can then use a method like get_batches() in the same way you've been using it for data with known labels.
  • Be careful when using the ImageDataGenerator.flow_from_directory() method. If the directory contains files 1.jpg, 2.jpg, etc., the images are not collected in the numeric order you might expect (1, 2, 3, etc.). The most recent version of the method uses sorted(os.listdir(subpath)), which orders file names lexicographically (1.jpg, 10.jpg, etc.), while the old version uses plain os.listdir(subpath), which returns files in arbitrary order. This applies even when you pass shuffle=False.
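
The ordering issue in the last point can be seen without Keras at all. The sketch below compares the plain sorted() order with a numeric order. The helper numeric_key is hypothetical and is not part of Keras; it only illustrates why 10.jpg lands before 2.jpg.

```python
import os
import re


def numeric_key(filename):
    """Sort key that puts '2.jpg' before '10.jpg' (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    match = re.search(r"\d+", stem)
    # Files without any digits sort last, then alphabetically.
    number = int(match.group()) if match else float("inf")
    return (number, filename)


names = ["10.jpg", "2.jpg", "1.jpg"]

lexicographic = sorted(names)            # string comparison
numeric = sorted(names, key=numeric_key)  # comparison by leading number

print(lexicographic)  # ['1.jpg', '10.jpg', '2.jpg']
print(numeric)        # ['1.jpg', '2.jpg', '10.jpg']
```

In practice it is usually safer to read the file names back from the generator and match each prediction to its own file name than to assume any particular order.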