Full notes for this lesson can be found at Lesson 1 Notes.
If you are having trouble getting AWS set up, check out the common problems and solutions we've posted on the AWS install wiki page.
Please help by collecting information relevant to lesson 1 here!
Reducing Cost of AWS
- This post explains how to reduce the cost by using a smaller EBS volume
Fixes for issues in the lesson
- The aws-alias.sh script Jeremy distributed during the lesson had the home directory hard-coded, so it fails if your username is anything other than 'ubuntu'. He also added an extra alias during the class to get the IP address of the currently running instance. Both of these minor issues are addressed in the new version of the script
- The free t2.micro instance type recommended in the lesson does not have enough RAM to use the VGG16 model. To switch to t2.large, which has enough memory (but costs $0.10/hour) see the Change EC2 Size page
- If the data download does not work for you, you may need to update kaggle-cli using pip
Overview of homework assignment
- Get AWS instance running (either g2 or m4 if not yet approved for g2) after contacting support etc.
- Setup ssh keys as per instructions in setup video
- install bash setup script onto server instance
- launch jupyter notebook on the instance
- once the notebook is running, review the lesson 1 notebook notes and run each cell of code to figure out what Python and VGG are doing
- install kaggle CLI onto the server instance
- use the kaggle CLI to download the current data for the Dogs vs. Cats Redux competition
- configure the new data to the file structure in the same way that was used in the sample lesson 1 notebook
- make a copy of the lesson 1 notebook and use the new copy to draw in the new Dogs Vs. Cats data (including moving utils.py and vgg16.py to the new folder where the new notebook sits?)
- Run the relevant code cells on the sample set of new Dogs v. Cats data to make a prediction on the new image data set.
- Once the sample set works, modify the Jupyter notebook to use the new test data images
- write a script that takes the predict() output for the new Dogs vs. Cats data and writes it to a new csv file in the format of the sample_submission.csv file that was downloaded with the Dogs vs. Cats data
- submit that new submission.csv file to Kaggle via the CLI tool
- check the public scoreboard for your own ranking
- modify or tune current code in the lesson 1 notebook to try to get into the top 50% ranking of the current Dogs v Cats competition
- start exploring the other new datasets on kaggle and decide which one you or some teammates would like to study further during the course
- download the new data to your EC2 instance and repeat the previous steps with your brand new data.
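The submission-writing step in the list above could look roughly like the sketch below. It is not the course's own script: the function name write_submission and its arguments are hypothetical, it assumes Python 3, and it assumes you already have a list of numeric image ids and a matching list of predicted probabilities. Rather than hard-coding column names, it copies the header row from sample_submission.csv and assumes that file has exactly two columns.

```python
import csv


def write_submission(ids, probs, sample_path, out_path):
    """Write (id, probability) rows under the header of the sample file.

    Hypothetical helper: ids and probs are parallel sequences that the
    caller has already extracted from the model's predictions.
    """
    # Reuse the header from the sample file so the column names match exactly.
    with open(sample_path, newline="") as f:
        header = next(csv.reader(f))
    if len(header) != 2:
        raise ValueError("expected two columns, got %r" % (header,))

    # Sort by numeric id so the rows do not depend on directory listing order.
    rows = sorted(zip(ids, probs), key=lambda r: int(r[0]))

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for image_id, prob in rows:
            writer.writerow([int(image_id), float(prob)])
    return len(rows)
```

A call such as write_submission(ids, probs, "sample_submission.csv", "submission.csv") would then produce the file to upload. How ids and probs are obtained from predict() depends on your notebook and is left out here.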
Links and resources mentioned in the lesson
- See this wiki page on getting started with Kaggle CLI
- When you need to use a trained model for predictions on unseen/unlabeled test data, you can create a "test/unknown" subdirectory (as in "unknown class label") and place your data there. You can then use a method like get_batches() in the same way you've been using it for data with known labels.
- Be careful when using the ImageDataGenerator.flow_from_directory() method. If the directory contains files 1.jpg, 2.jpg, etc., the images are not collected in the order you might expect (1, 2, 3, etc.). The most recent version of the method orders images lexicographically (1.jpg, 10.jpg, etc.) because it uses sorted(os.listdir(subpath)), while the old version uses plain os.listdir(subpath), which returns files in arbitrary order. This applies even when you pass shuffle=False.
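The ordering issue in the last item is easy to reproduce without Keras. The sketch below uses only the standard library and a made-up list of filenames. The helper numeric_key is hypothetical and simply sorts on the first run of digits in each name.

```python
import re


def numeric_key(filename):
    """Sort key: the first run of digits in the name, as an integer."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else -1


names = ["1.jpg", "2.jpg", "3.jpg", "10.jpg", "11.jpg"]

# Plain sorted() compares strings character by character.
lexicographic = sorted(names)
# Sorting on the integer id restores the order most people expect.
numeric = sorted(names, key=numeric_key)

print(lexicographic)  # ['1.jpg', '10.jpg', '11.jpg', '2.jpg', '3.jpg']
print(numeric)        # ['1.jpg', '2.jpg', '3.jpg', '10.jpg', '11.jpg']
```

In practice it is safer to pair each prediction with the filename reported by the generator (for example its filenames attribute) than to assume that prediction number i belongs to image i.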