158 lines
		
	
	
		
			6.6 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
		
		
			
		
	
	
			158 lines
		
	
	
		
			6.6 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| 
								 | 
							
								SUMMARY & USAGE LICENSE
							 | 
						||
| 
								 | 
							
								=============================================
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								MovieLens data sets were collected by the GroupLens Research Project
							 | 
						||
| 
								 | 
							
								at the University of Minnesota.
							 | 
						||
| 
								 | 
							
								 
							 | 
						||
| 
								 | 
							
								This data set consists of:
							 | 
						||
| 
								 | 
							
									* 100,000 ratings (1-5) from 943 users on 1682 movies. 
							 | 
						||
| 
								 | 
							
									* Each user has rated at least 20 movies. 
							 | 
						||
| 
								 | 
							
								        * Simple demographic info for the users (age, gender, occupation, zip)
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The data was collected through the MovieLens web site
							 | 
						||
| 
								 | 
							
								(movielens.umn.edu) during the seven-month period from September 19th, 
							 | 
						||
| 
								 | 
							
								1997 through April 22nd, 1998. This data has been cleaned up - users
							 | 
						||
| 
								 | 
							
								who had less than 20 ratings or did not have complete demographic
							 | 
						||
| 
								 | 
							
								information were removed from this data set. Detailed descriptions of
							 | 
						||
| 
								 | 
							
								the data file can be found at the end of this file.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Neither the University of Minnesota nor any of the researchers
							 | 
						||
| 
								 | 
							
								involved can guarantee the correctness of the data, its suitability
							 | 
						||
| 
								 | 
							
								for any particular purpose, or the validity of results based on the
							 | 
						||
| 
								 | 
							
								use of the data set.  The data set may be used for any research
							 | 
						||
| 
								 | 
							
								purposes under the following conditions:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								     * The user may not state or imply any endorsement from the
							 | 
						||
| 
								 | 
							
								       University of Minnesota or the GroupLens Research Group.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								     * The user must acknowledge the use of the data set in
							 | 
						||
| 
								 | 
							
								       publications resulting from the use of the data set
							 | 
						||
| 
								 | 
							
								       (see below for citation information).
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								     * The user may not redistribute the data without separate
							 | 
						||
| 
								 | 
							
								       permission.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								     * The user may not use this information for any commercial or
							 | 
						||
| 
								 | 
							
								       revenue-bearing purposes without first obtaining permission
							 | 
						||
| 
								 | 
							
								       from a faculty member of the GroupLens Research Project at the
							 | 
						||
| 
								 | 
							
								       University of Minnesota.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								If you have any further questions or comments, please contact GroupLens
							 | 
						||
| 
								 | 
							
								<grouplens-info@cs.umn.edu>. 
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								CITATION
							 | 
						||
| 
								 | 
							
								==============================================
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								To acknowledge use of the dataset in publications, please cite the 
							 | 
						||
| 
								 | 
							
								following paper:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
							 | 
						||
| 
								 | 
							
								History and Context. ACM Transactions on Interactive Intelligent
							 | 
						||
| 
								 | 
							
								Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
							 | 
						||
| 
								 | 
							
								DOI=http://dx.doi.org/10.1145/2827872
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								ACKNOWLEDGEMENTS
							 | 
						||
| 
								 | 
							
								==============================================
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Thanks to Al Borchers for cleaning up this data and writing the
							 | 
						||
| 
								 | 
							
								accompanying scripts.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								PUBLISHED WORK THAT HAS USED THIS DATASET
							 | 
						||
| 
								 | 
							
								==============================================
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
							 | 
						||
| 
								 | 
							
								Framework for Performing Collaborative Filtering. Proceedings of the
							 | 
						||
| 
								 | 
							
								1999 Conference on Research and Development in Information
							 | 
						||
| 
								 | 
							
								Retrieval. Aug. 1999.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
							 | 
						||
| 
								 | 
							
								==============================================
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The GroupLens Research Project is a research group in the Department
							 | 
						||
| 
								 | 
							
								of Computer Science and Engineering at the University of Minnesota.
							 | 
						||
| 
								 | 
							
								Members of the GroupLens Research Project are involved in many
							 | 
						||
| 
								 | 
							
								research projects related to the fields of information filtering,
							 | 
						||
| 
								 | 
							
								collaborative filtering, and recommender systems. The project is lead
							 | 
						||
| 
								 | 
							
								by professors John Riedl and Joseph Konstan. The project began to
							 | 
						||
| 
								 | 
							
								explore automated collaborative filtering in 1992, but is most well
							 | 
						||
| 
								 | 
							
								known for its world wide trial of an automated collaborative filtering
							 | 
						||
| 
								 | 
							
								system for Usenet news in 1996.  The technology developed in the
							 | 
						||
| 
								 | 
							
								Usenet trial formed the base for the formation of Net Perceptions,
							 | 
						||
| 
								 | 
							
								Inc., which was founded by members of GroupLens Research. Since then
							 | 
						||
| 
								 | 
							
								the project has expanded its scope to research overall information
							 | 
						||
| 
								 | 
							
								filtering solutions, integrating in content-based methods as well as
							 | 
						||
| 
								 | 
							
								improving current collaborative filtering technology.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Further information on the GroupLens Research project, including
							 | 
						||
| 
								 | 
							
								research publications, can be found at the following web site:
							 | 
						||
| 
								 | 
							
								        
							 | 
						||
| 
								 | 
							
								        http://www.grouplens.org/
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								GroupLens Research currently operates a movie recommender based on
							 | 
						||
| 
								 | 
							
								collaborative filtering:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								        http://www.movielens.org/
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								DETAILED DESCRIPTIONS OF DATA FILES
							 | 
						||
| 
								 | 
							
								==============================================
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Here are brief descriptions of the data.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								ml-data.tar.gz   -- Compressed tar file.  To rebuild the u data files do this:
							 | 
						||
| 
								 | 
							
								                gunzip ml-data.tar.gz
							 | 
						||
| 
								 | 
							
								                tar xvf ml-data.tar
							 | 
						||
| 
								 | 
							
								                mku.sh
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
							 | 
						||
| 
								 | 
							
								              Each user has rated at least 20 movies.  Users and items are
							 | 
						||
| 
								 | 
							
								              numbered consecutively from 1.  The data is randomly
							 | 
						||
| 
								 | 
							
								              ordered. This is a tab separated list of 
							 | 
						||
| 
								 | 
							
									         user id | item id | rating | timestamp. 
							 | 
						||
| 
								 | 
							
								              The time stamps are unix seconds since 1/1/1970 UTC   
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								u.info     -- The number of users, items, and ratings in the u data set.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								u.item     -- Information about the items (movies); this is a tab separated
							 | 
						||
| 
								 | 
							
								              list of
							 | 
						||
| 
								 | 
							
								              movie id | movie title | release date | video release date |
							 | 
						||
| 
								 | 
							
								              IMDb URL | unknown | Action | Adventure | Animation |
							 | 
						||
| 
								 | 
							
								              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
							 | 
						||
| 
								 | 
							
								              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
							 | 
						||
| 
								 | 
							
								              Thriller | War | Western |
							 | 
						||
| 
								 | 
							
								              The last 19 fields are the genres, a 1 indicates the movie
							 | 
						||
| 
								 | 
							
								              is of that genre, a 0 indicates it is not; movies can be in
							 | 
						||
| 
								 | 
							
								              several genres at once.
							 | 
						||
| 
								 | 
							
								              The movie ids are the ones used in the u.data data set.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								u.genre    -- A list of the genres.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								u.user     -- Demographic information about the users; this is a tab
							 | 
						||
| 
								 | 
							
								              separated list of
							 | 
						||
| 
								 | 
							
								              user id | age | gender | occupation | zip code
							 | 
						||
| 
								 | 
							
								              The user ids are the ones used in the u.data data set.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								u.occupation -- A list of the occupations.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								u1.base    -- The data sets u1.base and u1.test through u5.base and u5.test
							 | 
						||
| 
								 | 
							
								u1.test       are 80%/20% splits of the u data into training and test data.
							 | 
						||
| 
								 | 
							
								u2.base       Each of u1, ..., u5 have disjoint test sets; this if for
							 | 
						||
| 
								 | 
							
								u2.test       5 fold cross validation (where you repeat your experiment
							 | 
						||
| 
								 | 
							
								u3.base       with each training and test set and average the results).
							 | 
						||
| 
								 | 
							
								u3.test       These data sets can be generated from u.data by mku.sh.
							 | 
						||
| 
								 | 
							
								u4.base
							 | 
						||
| 
								 | 
							
								u4.test
							 | 
						||
| 
								 | 
							
								u5.base
							 | 
						||
| 
								 | 
							
								u5.test
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								ua.base    -- The data sets ua.base, ua.test, ub.base, and ub.test
							 | 
						||
| 
								 | 
							
								ua.test       split the u data into a training set and a test set with
							 | 
						||
| 
								 | 
							
								ub.base       exactly 10 ratings per user in the test set.  The sets
							 | 
						||
| 
								 | 
							
								ub.test       ua.test and ub.test are disjoint.  These data sets can
							 | 
						||
| 
								 | 
							
								              be generated from u.data by mku.sh.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								allbut.pl  -- The script that generates training and test sets where
							 | 
						||
| 
								 | 
							
								              all but n of a users ratings are in the training data.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								mku.sh     -- A shell script to generate all the u data sets from u.data.
							 |