Muoro
Building Engineering Teams and Talent In AI
  • B2C
  • B2B
  • Early Stage
    Startup in initial stages

Data Engineer

Reposted: 3 years ago
Job Location
Visa Sponsorship

Not Available

Remote Work Policy

In office

Relocation

Allowed

Skills
SQL
ETL
Apache Spark
AWS
PySpark
Airflow
Jupyter

About the job

Experience: 5+ years in Data Engineering

Key Skills: Cloud platforms (AWS, GCP), Apache Spark, Data Lakehouse architecture, Kubernetes, SQL, Apache Airflow, JupyterLab notebooks

Job Overview:

We are seeking a talented Spark Developer with strong expertise in SQL, Kubernetes, Apache Airflow, AWS, Data Lakehouse architecture, and data pipeline development. The ideal candidate will have hands-on experience with large-scale distributed data processing and cloud technologies, as well as familiarity with JupyterLab-based notebooks for data analysis and reporting. This role is central to building and optimizing scalable, robust data workflows in our cloud-based ecosystem.

Key Responsibilities:

  • Spark Development: Design, develop, and maintain distributed data processing pipelines using Apache Spark to process large datasets in both batch and stream processing modes.
  • SQL & Data Transformation: Write complex SQL queries for data extraction, transformation, and aggregation. Work with both relational and non-relational databases to ensure efficient query execution and optimize performance.
  • Data Lakehouse & Cloud Architecture: Work with Data Lakehouse solutions (e.g., Delta Lake) on AWS to integrate structured and unstructured data into a unified platform for analytics and business intelligence.
  • AWS Integration: Leverage AWS services like S3, EMR, Glue, Redshift, Lambda, and others for data storage, processing, and orchestration. Build cloud-native data pipelines that are scalable and cost-effective.
  • Kubernetes for Orchestration: Deploy, scale, and manage data pipelines and Spark jobs using Kubernetes clusters. Utilize containerization for seamless deployment and management of the application lifecycle.
  • Workflow Automation with Apache Airflow: Create, schedule, and monitor data pipelines with Apache Airflow. Design DAGs (Directed Acyclic Graphs) to orchestrate and automate end-to-end data workflows (a brief illustrative sketch follows this list).
  • JupyterLab-Based Notebooks: Develop, maintain, and optimize JupyterLab notebooks for interactive data analysis, visualizations, and reporting, supporting data scientists and analysts in their work.
  • Collaboration with Cross-Functional Teams: Work closely with data engineers, data scientists, business analysts, and other stakeholders to gather requirements, understand business needs, and build data solutions.
  • Data Quality and Performance Optimization: Ensure high-quality data pipelines, monitor job failures, and troubleshoot issues. Optimize performance by tuning Spark jobs, improving query performance, and resolving bottlenecks in data flows.
  • Documentation & Best Practices: Maintain clear documentation for data pipelines, architecture, and code. Follow best practices for version control, testing, and continuous integration/continuous delivery (CI/CD).
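
For illustration only (not part of the posting): a minimal sketch of the kind of Airflow DAG this role would own, assuming Airflow 2.4+ and a spark-submit entry point; the DAG ID, schedule, paths, and script names below are hypothetical.

    # Hypothetical daily pipeline: submit a PySpark batch job, then run a data quality check.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_sales_lakehouse_load",   # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                     # assumes Airflow 2.4+ ("schedule" parameter)
        catchup=False,
        default_args=default_args,
    ) as dag:
        # Submit the Spark batch job that writes curated tables to the lakehouse.
        transform = BashOperator(
            task_id="spark_transform",
            bash_command=(
                "spark-submit --deploy-mode cluster "
                "s3://example-bucket/jobs/transform_sales.py "  # hypothetical script location
                "--run-date {{ ds }}"
            ),
        )

        # Simple follow-up check; in practice this might validate row counts in the lakehouse.
        quality_check = BashOperator(
            task_id="data_quality_check",
            bash_command="python /opt/checks/validate_sales.py --run-date {{ ds }}",  # hypothetical
        )

        # Run the quality check only after the Spark transform succeeds.
        transform >> quality_check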

Required Skills & Experience:

  • Spark: Strong experience with Apache Spark (both PySpark and Spark SQL) for distributed data processing and job optimization (see the short sketch after this list).
  • SQL: Proficiency in SQL for data wrangling, ETL (Extract, Transform, Load) processes, and performance tuning.
  • Cloud Platforms (AWS): Hands-on experience with AWS services (S3, EMR, Lambda, Glue, Redshift, etc.) for building scalable cloud data solutions.
  • Kubernetes: Experience deploying and managing containerized applications on Kubernetes.
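
For illustration only (not part of the posting): a short sketch of the PySpark and Spark SQL work described above, assuming PySpark 3.x with Parquet data on S3; the bucket paths and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

    # Read raw events from the data lake (hypothetical path).
    events = spark.read.parquet("s3a://example-bucket/raw/sales_events/")

    # DataFrame API: basic cleaning and typing.
    clean = (
        events
        .filter(F.col("amount").isNotNull())
        .withColumn("order_date", F.to_date("order_ts"))
    )

    # Spark SQL: aggregate daily revenue per region.
    clean.createOrReplaceTempView("sales")
    daily_revenue = spark.sql("""
        SELECT region, order_date, SUM(amount) AS revenue
        FROM sales
        GROUP BY region, order_date
    """)

    # Write a partitioned, query-friendly table back to the lakehouse (hypothetical path).
    (
        daily_revenue.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3a://example-bucket/curated/daily_revenue/")
    )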

About the company

Muoro
Building Engineering Teams and Talent In AI

Company Size
11-50

Company Type
Big Data, Artificial Intelligence, Enterprise Software Company, Software Development, Business Analytics