Machine Learning on Big Data: Lessons Learned from Google Projects
Harvard CS264 2011 Guest Lecture Series
“Massively Parallel Computing” Course (http://www.cs264.org)
Speaker: Max Lin (Google Research)
Host: Nicolas Pinto (Harvard, MIT)
Date: 3-29-2011
Time: 7:35 PM
Location: Harvard Maxwell Dworkin G125 (http://j.mp/eCgV66)
Abstract:
Machine learning researchers and practitioners develop computer algorithms that “improve performance automatically through experience”. At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects in online user reviews. Machine learning on big data, however, is challenging. Some “simple” machine learning algorithms with quadratic time complexity, while running fine on hundreds of records, are almost impractical to use on billions of records.
In this talk, I will describe lessons drawn from various Google projects on developing large-scale machine learning systems. These systems build on top of Google’s computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems, along with strategies for scaling and speeding up machine learning systems on web-scale data.
Speaker biography:
Max Lin is a software engineer with Google Research in the New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.

