Data Engineer
- $75k – $100k • No equity
- Remote • San Jose
- 3 years of experience
- Full Time
Remote only
About the job
Job Title: Data Engineer – Databricks, Prophecy.io
Role Overview
YMT Evo seeks a skilled Data Engineer to design and maintain a robust, high-performance data processing infrastructure. Using Databricks, Prophecy.io, and open-source tools on VEXXHOST’s hybrid cloud, this role involves building scalable ETL/ELT pipelines, supporting AI model training, and implementing data solutions that align with microservices principles and Kubernetes orchestration. The position also emphasizes Zero Trust architecture, CI/CD pipelines, security protocols, and planning the transition to a microservices-based infrastructure.
Key Responsibilities
• Data Pipeline Development (ETL and ELT): Develop scalable ETL and ELT pipelines using Databricks Delta Live Tables, Prophecy.io, and Apache Kafka. Ensure seamless data ingestion and transformation, supporting real-time and batch processing (see the PySpark sketch after this list).
• Microservices-Ready Data Architecture: Architect data solutions with a focus on modularity and scalability to support future migration to a microservices environment. Collaborate with the team to design independent services that can be deployed, scaled, and managed autonomously.
• Kubernetes and Containerization: Use Kubernetes for container orchestration, managing data pipeline deployments across sandbox and production environments. Implement Zero Trust principles and secure VPN access for service isolation and network segmentation.
• Machine Learning, Deep Learning, and MLOps: Work with data scientists to support deep learning, AI model training, and MLOps workflows. Leverage MLflow and Kubeflow for model lifecycle management, versioning, and deployment within Kubernetes, with plans for a distributed microservices deployment model (see the MLflow sketch after this list).
• Data Processing and Distributed Systems: Use Apache Spark and Hadoop for distributed data processing. Support both batch and streaming analytics with scalable data architectures on hybrid cloud and data lakes.
• Security and Zero Trust Architecture: Implement Zero Trust security models by using identity-based access, RBAC policies, and secure communication protocols. Use advanced authentication techniques (e.g., JWT, OAuth 2.0) and RBAC in Kubernetes to manage and restrict access to sensitive data and services.
• CI/CD and GitOps Practices: Develop and automate CI/CD pipelines using tools like ArgoCD and Kustomize for consistent and reliable deployments. Facilitate GitOps practices to synchronize configurations and deployments across environments, maintaining consistency between source code repositories and Kubernetes clusters.
• Observability and Monitoring: Utilize Dynatrace for real-time monitoring, along with Prometheus for observability and OpenTelemetry for distributed tracing in a microservices environment. Implement logging and alerting systems to identify issues and track performance metrics (see the OpenTelemetry sketch after this list).
• Data Security and Compliance: Establish data governance policies that comply with GDPR, CCPA, and other regulations. Implement secure storage and data encryption, and enforce IAM policies across all data handling operations.
• Centralized Reporting and Integration: Integrate reporting tools from Databricks and Apache Superset into Bitrix24, providing centralized access for insights, metrics, and real-time monitoring across the team.
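To give a feel for the pipeline development work above, here is a rough, illustrative PySpark Structured Streaming sketch that reads from a Kafka topic and appends to a Delta table. The broker address, topic name, event schema, and storage paths are placeholders invented for illustration, not project specifics.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("events-ingest").getOrCreate()

# Hypothetical event schema; the real fields would come from the source system.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("amount", DoubleType()),
])

# Read the raw stream from Kafka (broker and topic are placeholders).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")
    .load()
)

# Parse the JSON payload and apply a watermark for late-arriving data.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withWatermark("event_ts", "10 minutes")
)

# Append the parsed records to a Delta table (paths are placeholders).
query = (
    parsed.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .outputMode("append")
    .start("/mnt/delta/events")
)
query.awaitTermination()

In practice the same logic could be expressed as a Databricks Delta Live Tables pipeline or generated visually in Prophecy.io; the sketch only shows the underlying Spark pattern.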
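For the MLOps responsibility, a minimal MLflow tracking sketch follows. The tracking server URI, experiment name, model, and metric are hypothetical placeholders used only to show the logging pattern.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

mlflow.set_tracking_uri("http://mlflow:5000")   # placeholder tracking server
mlflow.set_experiment("demand-forecasting")     # placeholder experiment name

# Synthetic data stands in for a real training set.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Log parameters, metrics, and the trained model for later versioning and deployment.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, artifact_path="model")

A model logged this way can then be containerized and served from Kubernetes, for example through Kubeflow or another serving layer, which is the deployment path the responsibilities above describe.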
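For the observability responsibility, a small sketch of emitting distributed trace spans from a Python data service with OpenTelemetry. The service name, span names, and batch identifier are placeholders; a real setup would export to an OTLP collector rather than the console.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider; in production the exporter would point at a collector.
provider = TracerProvider(resource=Resource.create({"service.name": "etl-worker"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def run_job(batch_id: str) -> None:
    # Each pipeline stage becomes a span so the whole data flow can be traced end to end.
    with tracer.start_as_current_span("ingest", attributes={"batch.id": batch_id}):
        pass  # read from source
    with tracer.start_as_current_span("transform", attributes={"batch.id": batch_id}):
        pass  # apply transformations
    with tracer.start_as_current_span("load", attributes={"batch.id": batch_id}):
        pass  # write to the data lake

run_job("2024-01-01")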
Qualifications
Education:
Bachelor’s or Master’s degree in Computer Science, Data Engineering, Information Systems, or a related field, or equivalent experience.
Experience:
• 3+ years in data engineering, with experience in microservices architecture, Kubernetes, ETL/ELT processes, big data, and MLOps in cloud-based or hybrid environments.
Technical Skills
Data Architecture and Database Management
• Relational Databases: SQL proficiency, experience with PostgreSQL, MySQL, or SQL Server.
• NoSQL Databases: Familiarity with MongoDB, Cassandra, or DynamoDB.
• Data Warehousing: Knowledge of Snowflake, Amazon Redshift, Google BigQuery, or Azure Synapse.
• Data Modeling: Skilled in designing scalable, microservices-friendly data models.
Programming and Scripting
• Python: Essential for scripting and data processing with Pandas, NumPy, and PySpark.
• SQL: Advanced SQL for querying and transforming data across various systems.
• Java/Scala: Helpful for big data frameworks like Spark.
Big Data Technologies
• Apache Spark and Hadoop: Expertise in distributed processing and real-time data analytics.
• Data Lakes: Knowledge of data lake platforms such as Amazon S3, Azure Data Lake, or HDFS.
ETL/ELT and Data Pipelines
• ETL/ELT Tools: Familiarity with Apache NiFi, Talend, and AWS Glue for data workflows.
• Airflow: Proficiency with Apache Airflow for managing and scheduling data workflows.
• Data Quality: Knowledge of data validation and consistency checks across ETL/ELT processes.
Cloud and Kubernetes
• Kubernetes: Experience in Kubernetes for containerized application management, high availability, and load balancing.
• Containerization: Knowledge of Docker for building and deploying containers, aligned with Zero Trust principles.
• Cloud Platforms: Experience with AWS, GCP, or Azure for cloud storage and compute services.
Data Processing and Streaming
• Streaming and Batch Processing: Real-time and batch data processing with Apache Kafka, Flink, and Spark.
• Event-Driven Architecture: Experience in building event-driven systems for handling high-throughput data.
Machine Learning, Deep Learning, and MLOps
• MLflow and Kubeflow: Experience tracking, deploying, and managing models with MLflow, and running Kubernetes-native ML workflows with Kubeflow.
• Deep Learning: Familiarity with neural network frameworks for deep learning model training and scoring.
• MLOps and AIOps: Knowledge of automated model deployment and AIOps for real-time operational intelligence.
Security and Compliance
• Zero Trust Architecture: Experience implementing RBAC, VPN-based access, and network segmentation for secure network access.
• Data Encryption and IAM: Experience in secure storage, encryption, and IAM in a cloud-native environment.
CI/CD and GitOps
• CI/CD Tools: Experience with ArgoCD, Kustomize, and other GitOps-based tools for continuous deployment.
• Pipeline Orchestration: Knowledge of Zuul or Jenkins for CI/CD, automated testing, and secure deployment.
Data Observability and Logging
• Distributed Tracing with OpenTelemetry: Track distributed data flows across microservices.
• Monitoring: Proficiency in using Dynatrace for real-time performance tracking and Prometheus for metrics.
What We Offer
• Innovative Environment: Work with cutting-edge tools like Databricks, Prophecy.io, MLflow, and Kubeflow, in an architecture primed for microservices.
• Professional Growth: Ongoing training and certifications in Kubernetes, CI/CD, and microservices.
• Collaborative Culture: Join a team dedicated to building scalable, real-time data infrastructure.
• Competitive Compensation: Reflecting experience, technical skills, and cultural alignment.