Who should attend
This is an intermediate course, suitable for professionals with some experience in any programming language and in data design. Participants with some business exposure will better appreciate the case studies discussed.
This course targets analytics professionals, including:
- Business and IT professionals seeking analytical skills to handle large amounts of unstructured data (a data lake of, for example, customer feedback, product reviews on social media, and phone call recordings) for insights that improve business processes and decision-making.
- Individuals who have no knowledge or experience in data engineering for analytics and would like to gain some practical skills in this area so that they may explore work opportunities in data engineering.
- Data analysts and data engineers who want to move from structured data to engineering large amounts of unstructured data.
This is an intensive course. It targets professionals higher up the value chain, such as data engineers, data application architects, integration architects, software engineers working on data pipeline processing, and key technology decision makers.
Participants with experience in programming languages such as Python, Java or Scala will benefit more from the course. Participants also need a strong interest in building functional pipelines and should be comfortable working with the Hadoop platform and the Spark framework.
NUS-ISS also offers a range of other basic courses in analytics for participants new to analytics.
About the course
This 5-day course helps data engineers focus on essential design and architecture while building a data lake and relevant processing platform.
Participants will learn various aspects of data engineering while building resilient distributed datasets. They will learn to apply key practices, identify multiple data sources and appraise them against their business value, design the right storage, and implement proper access model(s). Finally, participants will build a scalable data pipeline solution with a pluggable component architecture, based on the combination of requirements, in a vendor- and technology-agnostic manner. Participants will also become familiar with the Spark platform, with additional focus on its query and streaming libraries.
This course is part of the Analytics and Intelligent Systems series offered by NUS-ISS.
Upon successful completion of the course, participants will be able to:
- Understand the growth of big data and the need for a scalable processing framework. Understand the fundamental characteristics, storage, analysis techniques and the relevant distributions.
- Understand distributed storage essentials, storage needs, and the relevant architectural mechanisms for processing large amounts of structured, semi-structured and unstructured data.
- Gain expertise with the fault-tolerant computing framework (e.g. YARN) by setting up pseudo-cluster nodes or cloud-based nodes for processing big data.
- Construct configurable and executable tasks using in-memory processing frameworks (e.g. Spark Core). Understand the nuances of writing functional programs and use the core libraries to manipulate a large corpus of unstructured data residing as Resilient Distributed Datasets.
- Organize, store and manipulate the collected data using processing libraries, for example special statistical operations and stream processing tools (e.g. Spark special libraries).
- Understand the various data processing, querying and persistence options (e.g. Spark SQL APIs) available for use in the context of RDDs. Perform tasks such as filtering, selection and categorization.
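To give a flavour of the functional style behind the filtering, selection and categorization tasks mentioned above, here is a minimal sketch in plain Python. It does not use Spark; it only mimics the shape of a filter / flatMap / map / reduceByKey chain with the standard library. The sample records, the word lists and the helper names (`tokenize`, `categorize`, `add_pair`) are hypothetical and chosen purely for illustration.

```python
from functools import reduce

# Hypothetical sample records standing in for unstructured feedback text.
reviews = [
    "Great product, fast delivery",
    "Terrible support, slow delivery",
    "",
    "Great value and great support",
]

def tokenize(line):
    """Split a line into lowercase words, stripping simple punctuation."""
    return [w.strip(".,!?").lower() for w in line.split()]

def categorize(word):
    """Assign a word to a coarse category (illustrative word lists only)."""
    if word in {"great", "fast"}:
        return "positive"
    if word in {"terrible", "slow"}:
        return "negative"
    return "neutral"

def add_pair(acc, pair):
    """Fold one (key, count) pair into the running totals."""
    key, n = pair
    acc[key] = acc.get(key, 0) + n
    return acc

# Filtering: drop empty records.
non_empty = filter(lambda line: line.strip(), reviews)
# Selection: flatten each line into its words (similar in spirit to flatMap).
words = (w for line in non_empty for w in tokenize(line))
# Categorization: map each word to a (category, 1) pair.
pairs = map(lambda w: (categorize(w), 1), words)
# Aggregation: sum the counts per category (similar in spirit to reduceByKey).
counts = reduce(add_pair, pairs, {})

print(counts)
```

Each step is lazy until the final `reduce` consumes it, which loosely mirrors how transformations on a distributed dataset are only evaluated when an action is called.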
What Will Be Covered
The course objective is to explore the engineering aspects of big data storage, querying and processing techniques. The course aims to teach participants to apply their newly acquired skills by developing data-intensive applications on a distributed compute platform (e.g. the Hadoop platform, the Spark framework and relevant tools).
A brief module description is provided below:
Module 1: Introduction to Data Science, Data Engineering and Big Data
Module 2: Understand Big Data from an Analytics Perspective
Module 3: Architectural Viewpoints in Big Data
Module 4: The Hadoop Ecosystem for Big Data
Module 5: Distributed File Storage
Module 6: NoSQL Databases for Big Data
Module 7: Spark and Functional Programming for Big Data
Module 8: Spark and Resilient Distributed Data Sets
Module 9: Spark SQL for Big Data
Module 10: Spark and Real Time Stream Processing
Module 11: Management of Big Data initiatives
Discussion and Project Requirement Elaboration
Project and Assessment
- Project Demonstration, Report Submission and Presentations. Each team will work on a practical case study, then submit and present its work on the assigned Big Data project.
Suria has twenty years of teaching and consulting experience in areas such as software engineering, application architecture, crafting cloud services, agile development and big data engineering. Her research interest spans around cloud computing, software engineering, test automation and big dat...