MLconf Atlanta was our exciting event on September 19th, a block away from Georgia Tech at the Academy of Medicine. This year MLconf focused on ML platforms, tools and algorithms. We hosted a speaker from Facebook who gave us an overview of how machine learning shapes the biggest social network. We were very happy to have speakers from companies like Skytree, with its super fast and scalable machine learning server, and 0xdata, which presented its open source platform and new deep learning toolbox. We found out more about SYSTAP’s graph analytics and machine learning platform on GPUs, and finally Oracle taught us how to do machine learning. For those in the NoSQL arena, Cloudera presented the current trends. Professor Manos Antonakakis drew on his experience in Internet security to explain why machine learning alone cannot solve problems without some domain expertise. Professor Amy Langville, the guru of ranking and author of “Who’s #1?” and “Google’s PageRank and Beyond”, told us how to rank all kinds of data, from sports teams to movies. Netflix and Meetup showed us how they do their recommendations; we were surprised by how much their constraints and data availability differ.
Ewa Dominowska, Engineering Manager, Facebook
In the last decade, Facebook has quickly risen to become the most popular social networking and communication site in the world, with over 750 million daily active users. In this talk, I describe the advertising systems that we have developed to support and complement that network. With over one million active advertisers, Facebook’s ad ranking system evaluates trillions of candidate matches every day in order to select which ads each person sees. I will share a few practical lessons gathered from working on machine learning at that scale and cover some of the broad learnings and improvements that we have made. I will also relate this to the application of machine learning in online advertising overall, and suggest that understanding what separates the current systems and approaches from the real world is more important than striving to achieve perfection.
Ewa Dominowska joined Facebook in spring of 2014 as an Engineering Manager focused on Science and Metrics for Online Advertising. Before coming to Facebook she designed a large scale predictive analytics platform for mobile devices as a Chief Architect at Medio Systems (acquired by Nokia). Prior to her start-up days, Ewa spent 10 years in various roles at Microsoft. At Microsoft, Ewa joined the Online Services Division to help found adCenter, the second largest online advertising platform in the US. Her work focused on real-time ad ranking, targeting, content analysis, click prediction, and pricing models. As part of the small yet dynamic original team, Ewa designed, architected, and built the alpha version of the contextual advertising product. In 2007, Ewa founded the Open Platform Research and Development team. As part of this effort, she organized the Beyond Search academic program, TROA WWW Workshop, and IRA SIGIR Workshop, resulting in a number of very successful collaborations between academia and industry. During her tenure in the Online Services Division, Ewa spent a year serving as the TA for Satya Nadella, where she advised and assisted in operation and planning for the division. The role encompassed architecture, technology, large-scale data services, and cross-organizational efficiency. Ewa was responsible for the intellectual property process, long-term strategy, and prioritization for the division. In 2010 Ewa started the adCenter Marketplace team responsible for all aspects of the advertising marketplace health and tuning. She architected and built a petabyte-scale distributed data and analytics platform and created a suite of marketplace and experimentation tools. Ewa earned her degrees in Electrical Engineering/Computer Science and Mathematics from MIT. Her research focused on machine learning, natural language processing, and predictive, context aware systems applied in the medical field. 
Ewa authored several papers and dozens of patents in the areas of online advertising, search, pricing models, predictive algorithms and user interaction.
Evan Estola, Data Scientist, Meetup.com
Abstract: Beyond Collaborative Filtering: using Machine Learning to power recommendations at Meetup
Collaborative filtering and other common recommendation algorithms are powerful techniques for some scenarios. I will cover how to design a recommendation system from the ground up using an ensemble classifier and supervised learning to avoid some of the pitfalls of collaborative filtering. From sampling to deployment, we’ve had to invent our approach with few non-academic and non-toy examples to follow. At Meetup we’re all about sharing information and empowering communities, so I’ll present the details of our model as well as some of the new features we are still developing.
Evan is a Machine Learning Engineer at Meetup, where he is responsible for building intelligent systems that directly affect the user experience. Evan owns the recommendation engine at Meetup from data collection to production. Previously, Evan was on the Machine Learning Team at Orbitz Worldwide and he got his start in the Information Retrieval Lab at the Illinois Institute of Technology.
Amy Langville, Associate Professor of Mathematics, The College of Charleston in South Carolina
My talk will cover four ranking and clustering projects that I consulted on this past year. The projects range from ranking Olympic athletes, mixed martial arts fighters, and cell phone carriers to clustering sentences to rank individuals by how much humility they evidence in their written language. For each project, I will address the particular data challenges and the solutions and techniques we proposed.
Amy is an Associate Professor of Mathematics at The College of Charleston in South Carolina where she regularly teaches graduate courses in Operations Research and Optimization and undergraduate courses in calculus and linear algebra. Her research focuses on ranking and clustering. She also enjoys solving applied mathematics problems from industry and has consulted with a variety of companies from large search engines and software companies to small start-ups and law firms engaged in patent infringement cases. Amy studied Operations Research for her PhD and web information retrieval for her postdoctorate at N.C. State University. When the surf’s up, Amy’s riding it. When it’s not, she’s training jiu-jitsu, peppering a volleyball, or biking around Folly Beach.
Elizabeth Elhassani, Director, Enterprise Analytics & Insights, LexisNexis
Abstract: Bring Analytics to Life with Passionate Business Uptake
Big Data is an effective tool to stem attrition and many other important challenges companies face. In this session, I will share with you a case study of how LexisNexis Risk Solutions has been an industry leader by integrating predictive analytics throughout our organization from ideation through passionate uptake so that we can understand, anticipate and stymie attrition very early on in the customer lifecycle. During this session, I will explain how you can harness the power of analytics internally at your company, communicate the impact it will have on your business, operationalize it for end users, and create 100 percent buy-in and uptake by senior executives, finance, HR, customer service, sales, marketers and many other departments. In addition, I will talk with you about how we regularly use data mining and predictive analytics to make intelligent business decisions that drive real-time, bottom-line-impacting results.
Elizabeth Elhassani is the Director of Enterprise Analytics and Insights at LexisNexis Risk Solutions. Elizabeth is responsible for leading the design and implementation of short and long term analytic strategies to benefit all of our businesses.
An experienced marketing professional, Elizabeth has 14 years of B2B and B2C analytic experience, with emphasis in designing statistical models, CRM strategies, segmentation schemes and cost benefit analyses. She came to LexisNexis Risk from dunnhumby USA where she was responsible for scoping, pricing and designing consumer analytic insight projects for 10+ key consumer package goods clients utilizing many statistical methodologies to study customer behaviors. Prior to her work at dunnhumby, she was a Statistical Project Director for ChoicePoint Precision Marketing where she was responsible for consulting and directing projects for marketing analytics and acquisition models for external clients. In addition to her analytics expertise, she also brings an understanding of the risk industry with previous experience at Experian and Advanta Bank Business Cards.
Parikshit Ram, Senior Machine Learning Scientist, Skytree
Abstract: Max-kernel search: How to search for just about anything?
Nearest neighbor search is a well studied and widely used task in computer science and is quite pervasive in everyday applications. While search is not synonymous with learning, search is a crucial tool for the most nonparametric form of learning. Nearest neighbor search can directly be used for all kinds of learning tasks — classification, regression, density estimation, outlier detection. Search is also the computational bottleneck in various other learning tasks such as clustering and dimensionality reduction. Key to nearest neighbor search is the notion of “near”-ness or similarity. Mercer kernels form a class of general nonlinear similarity functions and are widely used in machine learning. They can define a notion of similarity between pairs of objects of any arbitrary type and have been successfully applied to a wide variety of object types — fixed-length data, images, text, time series, graphs. I will present a technique to do nearest neighbor search with this class of similarity functions provably efficiently, hence facilitating faster learning for larger data.
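To make the problem statement above concrete, here is a minimal sketch of max-kernel search as a brute-force linear scan: given a query, return the stored object with the largest kernel value. This is only the naive baseline that an efficient method would improve on, not the technique presented in the talk. The names `polynomial_kernel` and `max_kernel_search` and the choice of a polynomial kernel are illustrative assumptions.

```python
import math


def polynomial_kernel(x, y, degree=2, c=1.0):
    """Polynomial Mercer kernel (x . y + c) ** degree; an illustrative choice."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + c) ** degree


def max_kernel_search(query, points, kernel):
    """Linear scan: return (index, value) of the point most similar to the query.

    Hypothetical baseline; it evaluates the kernel once per stored point.
    """
    best_index, best_value = -1, -math.inf
    for i, point in enumerate(points):
        value = kernel(query, point)
        if value > best_value:
            best_index, best_value = i, value
    return best_index, best_value


points = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
index, value = max_kernel_search((1.0, 1.0), points, polynomial_kernel)
```

The scan costs one kernel evaluation per stored object for every query, which is the cost that faster search structures aim to reduce on large data.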
Parikshit Ram is a member of the technical staff at the machine learning startup Skytree (www.skytree.net) where he develops enterprise-grade machine learning algorithms. Prior to this, Pari completed his doctorate in Computer Science at Georgia Tech in the School of Computational Science and Engineering where he was a member of the FASTlab and focused on developing fundamental algorithms and statistical tools for machine learning and data mining. Pari joined Georgia Tech in 2007 after completing his BS and MS in Mathematics and Computing in the Department of Mathematics at the Indian Institute of Technology, Kharagpur, India. Pari has also contributed to the open source machine learning library MLPACK (mlpack.org).
Sri Ambati, CEO, 0xdata
Sri is co-founder and CEO of 0xdata (@hexadata), the builders of H2O. H2O democratizes big data science and makes Hadoop do math for better predictions. Before 0xdata, Sri spent time scaling R over big data with researchers at Purdue and Stanford. Prior to that Sri co-founded Platfora and was the Director of Engineering at DataStax. Before that Sri was Partner & Performance Engineer at the Java multi-core startup Azul Systems, tinkering with the entire ecosystem of enterprise apps at scale.
Before that Sri was on sabbatical pursuing Theoretical Neuroscience at Berkeley. Prior to that Sri worked on a NoSQL trie-based index for semistructured data at the in-memory index startup RightOrder. Sri is known for his knack for envisioning killer apps in fast-evolving spaces and assembling stellar teams to productize that vision. Sri is a regular speaker on the Big Data, NoSQL and Java circuit.
Sandy Ryza, Software Engineer, Cloudera
Abstract: Unsupervised Learning on Huge Data with Apache Spark
Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Spark’s MLlib module contains implementations of several unsupervised learning algorithms that scale to large datasets. In this talk, we’ll discuss how to use and implement large-scale machine learning algorithms with the Spark programming model, diving into MLlib’s K-means clustering and Principal Component Analysis (PCA).
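As a rough illustration of what K-means clustering computes, the sketch below runs the classic assign-then-average iteration (Lloyd's algorithm) on a tiny in-memory data set. It is plain single-machine Python, not Spark or MLlib code, and the function name `kmeans`, the fixed starting centers, and the iteration count are all assumptions made for the example.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: repeatedly assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Index of the center with the smallest squared Euclidean distance.
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

In a distributed setting the assignment step is naturally parallel across data partitions, and only the per-cluster sums and counts need to be combined to update the centers.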
Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera’s Apache Spark development.
Bryan Thompson, Chief Scientist and Founder, SYSTAP, LLC
I will discuss current research on the MapGraph platform. MapGraph is a new and disruptive technology for ultra-fast processing of large graphs on commodity many-core hardware. On a single GPU you can analyze the bitcoin transaction graph in 0.35 seconds. With MapGraph on 64 NVIDIA K20 GPUs, you can traverse a scale-free graph of 4.3 billion directed edges in 0.13 seconds for a throughput of 32 Billion Traversed Edges Per Second (32 GTEPS). I will explain why GPUs are an interesting option for data intensive applications, how we map graphs onto many-core processors, and what the future looks like for the MapGraph platform.
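For readers unfamiliar with the GTEPS metric, the arithmetic is simply edges traversed divided by elapsed seconds, expressed in billions. The short sketch below applies it to the rounded figures quoted above; the helper name `gteps` is an assumption, and because the inputs are rounded the result lands at roughly 33, in line with the quoted 32 GTEPS.

```python
def gteps(edges_traversed, seconds):
    """Billions of traversed edges per second."""
    return edges_traversed / seconds / 1e9


# Rounded figures from the abstract: 4.3 billion edges in 0.13 seconds.
throughput = gteps(4.3e9, 0.13)
```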
MapGraph provides a familiar vertex-centric abstraction, but its GPU acceleration is 100s of times faster than main memory CPU-only technologies and up to 100,000 times faster than graph technologies based on MapReduce or key-value stores such as HBase, Titan, and Accumulo. Learn more at http://MapGraph.io.
Bryan Thompson is the Chief Scientist and co-Founder of SYSTAP, LLC. He is the lead architect for bigdata®, an open source graph database used by Fortune 500 companies including EMC (SYSTAP provides the graph engine for the topology server used in their host and storage management solutions) and Autodesk (SYSTAP provides their cloud solution for graph search). He is the principal investigator for a DARPA research team investigating GPU-accelerated distributed architectures for graph databases and graph mining. He has over 30 years of experience related to cloud computing; graph databases; the semantic web; web architecture; relational, object, and RDF database architectures; knowledge management and collaboration; artificial intelligence and connectionist models; natural language processing; metrics, scalability studies, benchmarks and performance tuning; and decision support systems.
Justin Basilico, Senior Researcher/Engineer in Recommendation Systems, Netflix
Abstract: Learning to Personalize
Netflix instant video streaming represents an estimated one third of peak broadband traffic in the US. Personalization is at the core of our product with recommendations driving about 75% of all viewing. Building a high-quality recommendation system for millions of users requires a careful balancing act of handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and also being flexible to accommodate research and experimentation. In this talk, I will discuss how we use machine learning to drive our recommendation approach. I will describe some of the data, algorithms, metrics, and experimental methodology we use to effectively apply machine learning at scale. I will also highlight the evolution of our personalization approach from rating prediction to ranking to page generation.
Justin Basilico is a Research/Engineering manager for Page Algorithms Engineering at Netflix. He leads an applied research team focused on developing the next generation of algorithms used to generate the Netflix homepage through machine learning, ranking, recommendation, and large-scale software engineering. Prior to Netflix, he worked on machine learning in the Cognitive Systems group at Sandia National Laboratories. He is also the co-creator of the Cognitive Foundry, an open-source software library for building machine learning algorithms and applications.
Tao Ye, Senior Scientist, Pandora
Pandora is best known for the Music Genome Project, a unique and richly labelled data set of more than 1.5 million songs. Naturally, a content-based approach to music recommendation is used as the foundation of our online radio service. Over the years we have improved and transformed the recommendation platform to incorporate multi-faceted data and models on this foundation. Combined with a dynamic ensemble system, this platform now powers the most popular streaming music service in the U.S., with 77 million+ monthly active users. In this talk I will discuss the music recommendation topics we work on at Pandora, such as our ever-evolving machine learning tasks, give examples of user modeling tasks, and share challenges we still face.
Tao Ye has been a Sr. Scientist on the Pandora playlist team since 2010, working on research-driven system building for recommendation systems, measurements and user modeling. She has 15 years of experience in the software industry, holding research scientist and lead engineer positions in social media, networking and mobile systems. She holds 11 granted patents and has published 12 peer-reviewed papers. She received a Master’s degree in Computer Science from UC Berkeley and dual Bachelor’s degrees in Computer Science and Engineering Chemistry from the State University of New York at Stony Brook.
Xia Zhu, Intel
Abstract: Streaming and Online Algorithms for GraphX
GraphX is a resilient distributed graph processing framework on Apache Spark. It is designed for, and is good at, analysis of static graphs. However, it does not yet support analysis of time-evolving graphs. In this talk, I will present graph processing research on streaming enhancements for GraphX, which may be used in either pure stream processing or lambda architectures. I will describe an architecture design and demonstrate how it works with three machine learning algorithms, with detailed evaluation and analysis of performance and scalability.
As a research scientist at Intel Corporation, Xia (Ivy) Zhu works on graph analytics to provide users with end-to-end solutions that include, but are not limited to, graph ETL, graph building and machine learning. Prior to joining Intel Labs in 2005, Ivy worked as a senior scientist at Philips Research East Asia. She holds a Doctorate in Computer Science and 13 patents.
Jacob Mundt, Chief Technology Officer, eBrevia
Abstract: I, Robot, Esquire: Information Extraction and Summarization in Legal Documents
Pundits constantly predict the demise of many types of knowledge workers at the hands of intelligent machines, and few professionals perform more textual document review than lawyers. In this session, I’ll share work that eBrevia has been doing to apply research from the fields of ML and NLP to summarize and extract information from legal contracts to help accelerate corporate mergers and acquisitions. I will look at the unique characteristics of the legal industry, examine some supervised and semi-supervised training strategies and classification models, and discuss the limitations of these techniques and the essential role lawyers will continue to play.
Jacob Mundt is the CTO at legal tech startup eBrevia, applying information extraction and summarization to the text of legal documents and contracts. eBrevia provides software tools that help attorneys to speed their review of legal documents while increasing accuracy. Previously Jacob researched summarization, machine translation, and information extraction under Kathleen McKeown at Columbia University, and led the Research and Development team at Outcome Sciences (acquired by Quintiles) to improve patient health outcomes through collection of clinical data from hundreds of hospitals. He holds a Bachelor of Science from Rice University and a Master of Science from Columbia.
Emmanouil Konstantinos Antonakakis, Assistant Professor of Computer Systems and Software, Georgia Tech
Abstract: So, you think you can model Internet abuse with machine learning?
Abuse on the Internet is an everyday problem. Illicit actors are victimizing people, which results in a variety of significant problems, from losing your private information to having your resources used in other criminal activities. The common denominator behind Internet abuse is a network of infected machines (a.k.a. botnet) under the control of a criminal entity (a.k.a. botmaster). Needless to say, the detection of such “botnet communications” is at the heart of the security problem that a large organization faces every day. Detection approaches based on static methods are doomed to fail, simply because they will always be behind the threat. Thus, the community is in great need of scalable abuse detection solutions.
Unsurprisingly, such newly proposed solutions are often based on machine learning. In this talk I will argue that a fancy machine-learning algorithm (and the pretty graph pictures derived from it) will simply not “cut it” operationally. This is especially true when what you are trying to solve is not your company’s marketing problem but rather the security problem your network and security operations center faces every day. Domain knowledge and constant counterintelligence on the malicious actors are fundamental to properly crafting generic detection and attribution solutions that can keep up with constantly changing malicious methodologies while minimizing false and missed detections.
Manos Antonakakis received his engineering diploma in 2004 from the University of the Aegean, Department of Information and Communication Systems Engineering. From November 2004 up to July 2006, he was working as a guest researcher at the National Institute of Standards and Technology (NIST-DoC), in the area of wireless ad hoc network security, at the Computer Security Division. Before joining the ECE faculty, Dr. Antonakakis held the chief scientist role at Damballa, where he was responsible for advanced research projects, university collaborations, and technology transfer efforts. He currently serves as the co-chair of the Academic Committee for the Messaging Anti-Abuse Working Group (MAAWG). In May 2012, he received his Ph.D. in computer science from the Georgia Institute of Technology under Wenke Lee’s supervision. In his free time, he enjoys watching and playing soccer.
Danai Koutra, CMU/Technicolor Researcher, Carnegie Mellon University
Networks naturally capture a host of interactions in the real world spanning from friendships to brain activity. But, given a massive graph, like the Facebook social graph, what can be said about its structure? Which are its most important structures? How does it compare to other networks like Twitter? This talk will focus on my work developing scalable algorithms and models that help us to make sense of large graphs via pattern discovery and similarity analysis.
I will begin by presenting VoG, an approach that efficiently summarizes large graphs by finding their most interesting and semantically meaningful structures. Starting from a clutter of millions of nodes and edges, such as the Enron who-mails-whom graph, our Minimum Description Length based algorithm disentangles the complex graph connectivity and spotlights the structures that ‘best’ describe the graph.
Then, for similarity analysis at the graph level, I will introduce the problems of graph comparison and graph alignment. I will conclude by showing how to apply my methods to temporal anomaly detection, brain graph clustering, deanonymization of bipartite (e.g., user-group membership) and unipartite graphs, and more.
Danai Koutra is a final-year Ph.D. candidate in the Computer Science Department at Carnegie Mellon University. Her research interests include large-scale graph mining, graph similarity and matching, graph summarization, and anomaly detection. Danai’s research has been applied mainly to social, collaboration and web networks, as well as brain connectivity graphs. She holds 1 “rate-1” patent and has 6 (pending) patents on bipartite graph alignment. Danai has multiple papers in top data mining conferences, including 2 award-winning papers, and her work has been covered by the popular press, such as MIT Technology Review. She has also worked at IBM Hawthorne, Microsoft Research Redmond, and Technicolor Palo Alto/Los Altos. She earned her M.S. in Computer Science from CMU in 2013 and her diploma in ECE at the National Technical University of Athens in 2010.
Hassan Chafi, Research Manager, Oracle Labs
Abstract: PGX: An In-Memory, Parallel Graph Analytic and Query Engine
Brief Description: In-memory (and distributed) graph analytic engine that is tightly coupled with a relational database.
Long Description/Abstract: We present a graph processing system in which a graph database is tightly integrated with a graph analytic engine. Our graph database, based on existing NoSQL and relational databases, provides scalable management of graph data for transactional workloads. Our graph analytic engine, on the other hand, enables rapid execution of analytic workloads.
We first introduce PGX, our in-memory graph analytic engine which initially loads up the graph data from the database and periodically synchronizes afterward. The parallel execution engine of PGX is very efficient – e.g. counting triangles in billion-edge graphs in 2 minutes. The users can also submit their custom graph algorithms written in a domain-specific language; PGX automatically parallelizes them for execution.
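Triangle counting, the workload mentioned above, can be sketched in a few lines for a small undirected graph. The version below is a plain sequential illustration and not PGX code or its domain-specific language; the function name `count_triangles` and the edge-list input format are assumptions, and vertex labels are assumed to be mutually comparable (for example, integers).

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    count = 0
    for u in neighbors:
        for v in neighbors[u]:
            if u < v:
                # Only count common neighbors w with u < v < w,
                # so every triangle is counted exactly once.
                count += sum(1 for w in neighbors[u] & neighbors[v] if w > v)
    return count


one_triangle = count_triangles([(1, 2), (2, 3), (1, 3), (3, 4)])
```

The work for each edge is independent of every other edge, which is what makes this kind of analysis a natural fit for a parallel execution engine.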
Then we introduce PGX.DIST, our distributed graph analytic engine. We show that PGX.DIST is up to orders of magnitude faster than the state-of-the-art graph analytic engine. The DSL compiler can help run the same algorithm on both PGX and PGX.DIST, transparently.
* Graph database tightly integrated with graph analytic engine
* Fast, parallel in-memory graph analytic engine
* Distributed graph analytic engine
* Use of Domain-Specific Language for graph analytics
Hassan Chafi is a Senior Research Manager at Oracle Labs where he currently leads various projects. His research investigates high-performance, parallel, in-memory graph analytics and the use of domain-specific languages (DSLs) to simplify parallel programming. Dr. Chafi received his PhD from Stanford University. His thesis work at Stanford focused on building a domain-specific language infrastructure, Delite. He was advised by Dr. Kunle Olukotun. Prior to that, Hassan worked in the area of hardware transactional memory as part of the Transactional Coherence and Consistency (TCC) project at Stanford where he developed a scalable extension to the original TCC protocol.
Dan Mallinger, Data Science Practice Manager, Think Big Analytics
Abstract: Organizing for Data Science
This talk will introduce a paradigm for enabling access to large, unstructured, and novel datasets in enterprises, while retaining value from existing tools and staff. By following a real-world example, the discussion will walk through how small, central data science teams can make data discoveries and data value accessible to others. We will also review the tools, data science approaches, and best practices for uncovering, polishing, and digesting signal in data to support analytics at the front lines of business.
Dan Mallinger is the Data Science Practice Manager for Think Big Analytics. He has deep experience enabling analytics at enterprises and implementing data science solutions, having helped many of the Fortune 100. Dan has extensive experience working with product, business, and marketing teams across a wide variety of industries. His work with them has been focused on driving value from multi-structured and unstructured data sets. He is formally trained in statistics, computer science, and organizational psychology & leadership.