CS 461 Homework 1

Due: Midnight, January 10, 2008

Part 0: Join the mailing list (5 points)

Go to this URL:

http://lists.wkiri.com/listinfo.cgi/csula-cs461-wkiri.com

and sign up for the course mailing list. Easy!

Part 1: Machine Learning in the real world (30 points)

Where do we find Machine Learning in use outside of this class? Given what we covered in Lecture 1, you should have a good idea of how to spot Machine Learning in action. Your goal for part 1 is to go out on the web and find a

news article,
press release,
product advertisement, or
other noteworthy website

that describes a system, game, application, etc. that uses Machine Learning in some key fashion.

Next, you should compose two paragraphs in legible, polished English (you will be graded on the quality of your writing; use spell-check and proofread carefully):

A summary of the machine learning component of your discovery. What kind of machine learning is being used?
Your opinion, thoughts, and assessment of the system. Does it sound like it actually works, or are you skeptical? (There's a lot of hype out there!) Is it something you yourself would use, or are there drawbacks you see?

Create a text file called <yourlastname>-hw1-ml.txt (fill in your own last name) that includes:

Standard assignment header: your name, the class name and number (CS 461, Machine Learning), the quarter (Winter 2008), the name of the assignment (Homework 1), and the assignment due date (January 10, 2008). You should include a header in this format on everything you turn in.
The URL of your chosen Machine Learning system, game, application, etc.
The two paragraphs described above.

Part 2: Supervised Learning (25 points)

Place your answers to these questions in a file called <yourlastname>-hw1-questions.txt:

What is the difference between classification and regression?
Imagine that you want to train a classifier to automatically label movies by genre (e.g., "drama", "comedy", "documentary", "horror", "science fiction", etc.). List three features you could use to represent the movies for the classifier.
What is a false positive? What is a false negative?
Describe a classification scenario in which false negatives are much worse (more costly) than false positives.
Is k-Nearest Neighbors a parametric method or a nonparametric method? What does "parametric" mean in this context?

Part 3: 1-Nearest Neighbor Classification (40 points)

In this part of the assignment, you will get to implement your own nearest-neighbor classifier in Java.

Download NearestNeighbor.java and update the comments at the top of the file to reflect your own information.

The main method takes in two command-line arguments:
- training data file name
- test data file name
and then calls train() and test() on the classifier object.

There are two (private) object variables in the NearestNeighbor class:
- database: stores the training data as an ArrayList of ArrayLists of Doubles (each instance is an ArrayList of Doubles)
- labels: stores the training data's labels as an ArrayList of Strings
Complete the implementation of the following methods (read the Javadoc-style headers; you'll be required to write your own in future assignments):
- public void train(String trainfilename): Read in the contents of the training file and store each item in the database. The file reading part is already implemented for you. You'll need to add code to store the data in database and labels.
- public String classify(ArrayList<Double> instance): Find the nearest neighbor in the database to the new instance and return its class value. You may want to implement the Euclidean distance calculation as a separate (private) method.
- public double test(String testfilename): Read in the contents of the test file and classify each item using classify(). The file reading part is already implemented for you. You'll need classify each test item, compare the result to the true label for that item, and report the overall average accuracy obtained.
Train and test your 1NN classifier on the following data sets:
- iris-train, iris-test: 150 total iris flowers, described by four features: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm. Three kinds of flowers: Iris Setosa, Iris Versicolor, and Iris Virginica.
- haberman-train, haberman-test: 306 total cancer patients, described by three features: age, year of operation, number of positive axillary nodes detected. The classes indicate whether the patient Survived at least 5 years after surgery or Died within 5 years.
Create a text file called <yourlastname>-1nn-results.txt and describe your results. For each data set, what accuracy did your nearest-neighbor implementation achieve? Which instances (if any) were misclassified? (For each one, show its features, the true label, and the label your classifier assigned.) Which problem do you think is more difficult?

What to turn in

Upload these files to CSNS:

<yourlastname>-hw1-ml.txt (or <yourlastname>-hw1-ml.pdf if you prefer to submit in PDF format)
<yourlastname>-hw1-questions.txt (or .pdf)
NearestNeighbor.java
<yourlastname>-1nnresults.txt (or .pdf)

In addition, email your response to part 1 (without the assignment header, just the URL and your two paragraphs) to the CS 461 mailing list:

  csula-cs461@wkiri.com

Feel free to explore the links posted by other students and discuss which ones you think are most interesting.