March 16, 2016 | Arena Stage at the Mead Center for American Theater
About
The Cloudera Technology Day is a technical program for industry and government professionals in the mid-Atlantic region who are using, or aspire to use, Apache Hadoop-based modern data infrastructure. The agenda will feature an opening session by Doug Cutting, Cloudera chief architect and creator of several leading open source projects, including Apache Hadoop, Apache Avro, and Apache Lucene. In addition, this intensive one-day program will include expert briefings on:
- Fast Analytics on Fast Data: Learn the latest on Apache Kudu (incubating), the new columnar data store for the Hadoop ecosystem
- Risk Management for Data: Understand how to orchestrate modern security architecture for Hadoop under the Cloudera Security Maturity Model
- Intuitive Real-Time Analytics: Explore the integrated capabilities of Apache Solr and Hadoop for enabling search-based analytics
- Advanced Analytics with Apache Spark: Get an overview of the real-world applications of Spark for machine-learning use cases
Following a networking lunch, attendees may attend in-depth tutorials on secure architectures for Hadoop clusters and on best practices for running the Hadoop stack in production.
Registration for Cloudera Technology Day is complimentary. Reserve your spot now and we look forward to meeting you in mid-March in Washington, DC.
Agenda
7:00 – 8:30 AM | Registration and Networking Breakfast
8:30 – 9:00 AM | Keynote -- From MapReduce to Spark: An Ecosystem Evolves
Hadoop was the first software to make working with petabytes of data affordable. In the decade since Hadoop was introduced, many other projects have been created around the Hadoop Distributed File System (HDFS) storage layer and its MapReduce processing engine, forming a rich software ecosystem. In this keynote, Doug Cutting will explain how Apache Spark provides a second-generation processing engine that greatly improves on MapReduce, and why this transition exemplifies an evolutionary pattern that gives the data ecosystem its long-term strength.
Doug Cutting, Chief Architect, Cloudera
9:00 – 9:40 AM | Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data
The Hadoop ecosystem has recently improved its real-time access capabilities, narrowing the gap with relational database technologies. However, gaps remain in the storage layer that complicate the transition to Hadoop-based architectures. In this session, the presenter will describe these gaps and discuss the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. The session will also cover Kudu (currently in beta), the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark and Apache Impala (incubating), which achieves both fast scans and fast random access from a single API.
Todd Lipcon, Software Engineer, Cloudera / Kudu Founder
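As context for this session, the sketch below shows the kind of single-API analytic access Kudu aims to provide from Spark. It is a minimal, hypothetical example: the connector format string, master address (kudu-master:7051), table name (metrics), and column names are illustrative assumptions and may differ across Kudu releases.

```python
# Hypothetical sketch: scanning a Kudu table from Spark as a DataFrame.
# Master address, table name, and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-scan-sketch").getOrCreate()

# Load the Kudu table through the kudu-spark connector; the exact format
# string depends on the connector release in use.
metrics = (spark.read
           .format("org.apache.kudu.spark.kudu")
           .option("kudu.master", "kudu-master:7051")
           .option("kudu.table", "metrics")
           .load())

# Analytic scan: filter and aggregate values stored in Kudu's columnar format.
metrics.filter(metrics.host == "web01") \
       .groupBy("metric") \
       .avg("value") \
       .show()
```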
9:40 – 10:20 AM | Risk Management for Data: Secured and Governed
Protecting enterprise data is an increasingly complex challenge given the diversity and sophistication of threat actors and their cyber tactics. In this session, participants will hear a comprehensive introduction to Hadoop security, including the “three A’s” of secure operating environments: authentication, authorization, and audit. In addition, the presenter will cover strategies to orchestrate data security, encryption, and compliance, and will explain the Cloudera Security Maturity Model for Hadoop. Attendees will leave with a greater understanding of how effective INFOSEC relies on an enterprise approach to big data governance and risk management.
Eddie Garcia, Chief Security Architect, Cloudera
10:20 – 10:35 AM | Networking Break
10:35 – 11:15 AM | Intuitive Real-Time Analytics with Search
Text-based search has recently become a critical part of the Hadoop stack and has emerged as one of the highest-performing solutions for big data analytics. In this session, attendees will learn about the new analytics capabilities in Apache Solr that integrate full-text search, faceted search, statistics, and grouping to provide a powerful engine for next-generation big data analytics applications.
Eva Andreasson, Director of Product Management, Cloudera
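To give a flavor of the search-based analytics this session covers, the sketch below issues a faceted query with field statistics against a Solr collection over its standard HTTP query API. The host, collection name (tweets), and field names are assumptions for illustration, not part of the session materials.

```python
# Minimal sketch of a search-analytics query against Apache Solr:
# a full-text match combined with facet counts and field statistics.
# The URL, collection, and field names below are illustrative assumptions.
import requests

params = {
    "q": "text:hadoop",             # full-text search clause
    "rows": 0,                      # only aggregates are needed, not documents
    "facet": "true",
    "facet.field": "user_location", # facet counts grouped by a field
    "stats": "true",
    "stats.field": "retweet_count", # min/max/mean statistics on a numeric field
    "wt": "json",
}

resp = requests.get("http://localhost:8983/solr/tweets/select", params=params)
resp.raise_for_status()
body = resp.json()

# Top facet buckets and the computed statistics for the numeric field.
print(body["facet_counts"]["facet_fields"]["user_location"][:10])
print(body["stats"]["stats_fields"]["retweet_count"])
```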
11:15 – 11:55 AM | Introduction to Machine Learning on Apache Spark MLlib
Spark MLlib is a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take only a few lines of code and leverage hundreds of machines. This talk will demonstrate how to use Spark MLlib to fit a model that predicts which customers of a telecommunications company are likely to stop using the service. It will cover Spark's DataFrames API for fast data manipulation, as well as ML Pipelines for streamlining model development and refinement.
Juliet Hougland, Senior Data Scientist, Cloudera
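For readers unfamiliar with the workflow this talk describes, here is a minimal sketch of a churn model built with Spark's DataFrames API and ML Pipelines. The input path and column names (account_length, day_minutes, eve_minutes, churned) are illustrative assumptions, not the presenter's actual dataset.

```python
# Minimal churn-prediction sketch using Spark DataFrames and ML Pipelines.
# The file path and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-sketch").getOrCreate()

# Load customer records as a DataFrame.
customers = spark.read.csv("hdfs:///data/churn.csv",
                           header=True, inferSchema=True)

# Pipeline stages: encode the label, assemble numeric features, fit a model.
label_indexer = StringIndexer(inputCol="churned", outputCol="label")
assembler = VectorAssembler(
    inputCols=["account_length", "day_minutes", "eve_minutes"],
    outputCol="features")
lr = LogisticRegression(maxIter=20)

pipeline = Pipeline(stages=[label_indexer, assembler, lr])

# Fit on a training split and score the held-out customers.
train, test = customers.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
model.transform(test).select("churned", "prediction", "probability").show(5)
```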
11:55 AM – 1:00 PM | Lunch
1:00 – 4:00 PM | Track A: A Practitioner’s Guide to Securing Your Hadoop Cluster
Why do many Hadoop clusters lack basic security controls? In part because some security features are relatively new, and because Hadoop security can be complex and daunting. Participants in this tutorial will be led through the process of securing a Hadoop cluster. The instructors will begin with a Hadoop cluster with no security and incrementally add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance considerations.
Michael Yoder, Software Engineer, Cloudera
Ben Spivey, Solutions Architect, Cloudera
Sravya Tirukkovalur, Software Engineer, Cloudera
Mubashir Kazia, Solutions Architect, Cloudera
1:00 – 4:00 PM | Track B: Apache Hadoop Operations for Production Systems
Hadoop is emerging as the standard for big data processing and analytics; however, as usage of Hadoop clusters grows, so do the demands of managing and monitoring these systems. In this tutorial, attendees will get an overview of the phases necessary for successfully managing Hadoop clusters, with an emphasis on production systems: from installation and configuration management to service monitoring, troubleshooting, and support integration. Participants will review tooling capabilities and learn which have been most helpful to users, as well as hear lessons learned and best practices from users who depend on Hadoop as a business-critical system.
Sean Kane, Senior Solutions Architect, Cloudera
Jake Miller, Customer Operations Engineer, Cloudera
Greg Phillips, Solutions Architect, Cloudera
Speakers
Eva Andreasson
Director of Product Management, Cloudera
Eva Andreasson has been working with JVMs, SOA, cloud, and infrastructure software for more than 15 years. She holds two patents on JVM garbage collection heuristics and algorithms, and she pioneered deterministic GC, which was productized as JRockit Real Time at BEA Systems (later acquired by Oracle). After two years as product manager for Zing at Azul Systems, she joined Cloudera in 2012 to help drive the future of distributed data processing through Cloudera's Distribution of Hadoop. Since then, she has worked on Hue, ZooKeeper, Oozie, and other components. In 2013 she initiated and launched Cloudera Search, and more recently she drove the partner showcase and the easy-to-get-started trial experience of Cloudera Live.
Doug Cutting
Chief Architect, Cloudera
Doug Cutting is the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera in 2009 from Yahoo!, where he was a key member of the team that built and deployed a production Hadoop storage and analysis cluster for mission-critical business analytics. Doug holds a Bachelor’s degree from Stanford University and is the former Chairman of the Board of the Apache Software Foundation.
Eddie Garcia
Chief Security Architect, Office of the CTO, Cloudera
Eddie Garcia is chief security architect at Cloudera, a leader in enterprise analytic data management. Eddie helps Cloudera enterprise customers reduce the security and compliance risks associated with sensitive data sets stored and accessed in Apache Hadoop environments. Working in the office of the CTO, Eddie also provides security thought leadership and vision for the Cloudera product roadmap. Before Gazzang was acquired by Cloudera, Eddie was its VP of InfoSec and Engineering, where he architected and implemented secure and compliant big data infrastructures for customers in the financial services, healthcare, and public sector industries to meet PCI, HIPAA, FERPA, FISMA, and EU data security requirements. He was also the chief architect of the Gazzang zNcrypt product and is the author of two patents for data security.
Juliet Hougland
Data Scientist, Cloudera
Juliet Hougland is a recent addition to Cloudera’s data science team. She has spent the last three years working on a variety of big data applications, from e-commerce recommendations to predictive analytics for oil and gas pipelines. She holds an MS in Applied Mathematics from the University of Colorado, Boulder, and graduated Phi Beta Kappa from Reed College with a BA in Math-Physics.
Sean Kane
Senior Solutions Architect, Cloudera
Sean is an experienced solutions architect with an extensive background in software engineering and development. During his three-year tenure with Cloudera, he has assisted many customers with system architecture, installation, configuration, performance tuning, and development.
During the past twelve years, Sean has built a broad base of enterprise information integration experience. At Spry, he led the software development team and built solutions using Hadoop and semantic web technologies. At Oracle, Sean worked on a team that supported pre-sales with architecture, proofs of concept (POCs), and reusable technical product demonstrations. Before its acquisition by Oracle, he was at BEA, where he developed solutions for customers using SOA middleware products and open source software. He also worked at Preferred Systems Solutions and MetaMatrix, where he developed reusable components for the federated query and metadata management product, developed and delivered the product training courses, and provided general information technology support.
Sean holds a Bachelor of Science in Information Sciences and Technology from the Pennsylvania State University. He is certified in service-oriented architecture (SOA) and has taken numerous courses covering Oracle and BEA software.
Mubashir Kazia
Solutions Architect, Cloudera
Mubashir Kazia is a solutions architect at Cloudera focusing on security. He started the initiative to integrate Cloudera Manager with Active Directory for Kerberizing clusters and provided sample code. He has also contributed patches to Apache Hive that fixed security-related issues.
Todd Lipcon
Engineer, Cloudera
Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. He is a committer and a Project Management Committee member on the Apache Hadoop, HBase, and Thrift projects. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine learning methods for collaborative filtering. Todd received his bachelor’s degree with honors from Brown University.
Jake Miller
Customer Operations Engineer, Cloudera
Jake Miller is a customer operations engineer working in the public sector. Jake helps public sector customers identify and solve issues that arise during cluster operations. He has a solid background in Linux systems administration and enjoys solving technical problems. Prior to joining Cloudera, Jake spent 14 years working in the public sector as a systems integrator solving challenging technical problems. Jake holds a Master of Science degree in Cyber Security from NYU-Poly.
Greg Phillips
Solutions Consultant, Cloudera Government Solutions
Greg Phillips helps public sector customers optimize their computing resources by implementing current data management and analytics capabilities. His focus is on developing ETL pipelines that meet customer requirements and deliver intuitive ways for analysts to interact effectively with live data.
Prior to his tenure with Cloudera, Greg spent seven years working with the U.S. Government, where in his final assignment, he served as Data Science Analytics Technical Team Lead. He supported enterprise programs for cloud processing architecture, implementation of a Cloudera Hadoop system, and in-depth training for new users to gain immediate value from the available datasets and dashboards.
Greg has a Bachelor of Science in Computer Science from the University of Maryland, holds Cloudera Administrator and Cloudera Developer certifications, and is proficient with a number of commonly-used programming languages and data management and analytics applications.
Ben Spivey
Principal Solutions Architect, Cloudera
Ben Spivey is a principal solutions architect at Cloudera who provides consulting services for large financial-services customers. Ben specializes in Hadoop security and operations. He is the coauthor of Hadoop Security from O’Reilly Media (2015).
Sravya Tirukkovalur
Software Engineer, Cloudera
Sravya Tirukkovalur is a software engineer at Cloudera focusing on Hadoop security, specifically authorization. Sravya is one of the core contributors to Apache Sentry, and she is a committer and PPMC member of the project, helping drive its Apache community. She has spoken about Hadoop security at various meetups and conferences.
Michael Yoder
Software Engineer, Cloudera
Mike Yoder is a software engineer at Cloudera who has worked on a variety of Hadoop security features and internal security initiatives. Most recently, he implemented log redaction and the encryption of sensitive configuration values in Cloudera Manager. Prior to Cloudera, he was a security architect at Vormetric.
Arena Stage at the Mead Center for American Theater
After a decade of planning, design and construction, Arena Stage at the Mead Center for American Theater reopened October 25, 2010. The new building, designed by Bing Thom Architects, one of Canada’s most renowned architectural firms, re-imagines this legendary theater and creates a cultural destination in Southwest Washington.
1101 Sixth Street, SW, Washington, DC 20024
Hadoop10
Happy 10th Birthday, Apache Hadoop
In 2016, Apache Hadoop and its community will turn 10 years young. To help celebrate this momentous event throughout the year, we’ll be spotlighting the community’s accomplishments over the last decade in multiple ways—so visit this page and follow the #Hadoop10 hashtag for updates.