Betterdata
Actively Hiring
Programmatic Synthetic Data for Data Privacy
  • Early Stage
    Startup in initial stages

Senior Data Engineer

  • $27k – $45k • 0.0% – 0.3%
  • Remote • India
  • 4 years of exp
  • Full Time
Posted: 1 week ago • Recruiter recently active
Job Location
Remote • India
Visa Sponsorship

Not Available

Remote Work Policy

Onsite or remote

Hires remotely
Everywhere
Preferred Timezones
Astana Time, Indochina Time, China Standard Time, Japan Standard Time, Brisbane Time
Collaboration Hours
10:00 AM - 7:00 PM China Standard Time
Relocation
Not Allowed
Skills
Python
Parallel Processing
Apache Spark
Parquet
Dask
PyTorch

About the job

Who Are We Looking For:

We are seeking a Senior Data & Machine Learning Engineer with hands-on experience to transform academic research into scalable, production-ready solutions for synthetic tabular data generation. This is an individual contributor (IC) role suited for someone who thrives in a fast-paced, early-stage startup environment. The ideal candidate has extensive experience scaling systems to handle datasets with hundreds of millions to billions of records and can build and optimize complex data pipelines for enterprise applications.

This role requires someone familiar with the dynamic nature of a startup, capable of rapidly designing and implementing scalable solutions. You'll work closely with research teams to optimize performance and ensure seamless integration of systems, handling data from financial institutions, government agencies, consumer brands, and internet companies.

Key Responsibilities:

ML Concepts & Algorithms:

  • Apply a strong understanding of ML concepts and algorithms, built on hands-on experience with production models in AI / data science teams, to transform research code into scalable, production-ready systems.

Data Ingestion & Integration:

  • Ingest data from enterprise relational databases such as Oracle, SQL Server, PostgreSQL, and MySQL, as well as enterprise SQL-based data warehouses like Snowflake, BigQuery, Redshift, Azure Synapse, and Teradata for large-scale analytics (see the ingestion sketch below).
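
As a rough illustration of the kind of work involved, a minimal ingestion sketch (the connection string, table name, and paths are hypothetical, not Betterdata specifics) might pull rows from PostgreSQL in bounded chunks and land them as Parquet for downstream stages:

```python
# Hypothetical ingestion sketch: stream a large table out of PostgreSQL in
# fixed-size chunks so hundreds of millions of rows never sit in memory at once.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@db-host:5432/analytics")

chunks = pd.read_sql_query(
    "SELECT * FROM transactions WHERE created_at >= '2024-01-01'",  # placeholder query
    engine,
    chunksize=1_000_000,
)
for i, chunk in enumerate(chunks):
    # Land each chunk as a separate Parquet file for downstream batch processing.
    chunk.to_parquet(f"landing/transactions_{i:05d}.parquet", index=False)
```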

Data Validation & Quality Assurance:

  • Ensure ingested data conforms to predefined schemas, checking data types, missing values, and field constraints.
  • Implement data quality checks for nulls, outliers, and duplicates to ensure data reliability (a schema-validation sketch follows this list).
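
A minimal sketch of such checks using Pandera (mentioned under Essential Skills); the schema and constraints here are invented for illustration:

```python
# Hypothetical schema: validate types, nullability, value ranges, and duplicates
# on an ingested frame before it enters the pipeline.
import pandas as pd
import pandera as pa
from pandera import Check, Column

schema = pa.DataFrameSchema({
    "customer_id": Column(int, Check.ge(0), nullable=False, unique=True),
    "amount": Column(float, Check.in_range(0, 1_000_000), nullable=False),
    "country": Column(str, Check.isin(["SG", "IN", "US"]), nullable=True),
})

df = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": [10.5, 99.0],
    "country": ["SG", "IN"],
})
validated = schema.validate(df)  # raises SchemaError on any violation
```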

Data Transformation & Processing:

  • Design scalable data pipelines for batch processing, deciding between distributed computing tools like Spark, Dask, or Ray when handling extremely large datasets across multiple nodes, and single-node tools like Polars and DuckDB for more lightweight, efficient operations. The choice will depend on the size of the data, system resources, and performance requirements.
  • Leverage Polars for high-speed, in-memory data manipulation when working with large datasets that can be processed efficiently in-memory on a single node.
  • Utilize DuckDB for on-disk query execution, offering SQL operations with minimal overhead, suitable for environments that need a balance between memory use and query performance (a single-node sketch follows this list).
  • Seamlessly transform Pandas-based research code into production-ready pipelines, ensuring efficient memory usage and fast data access without adding unnecessary complexity.
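
For flavor, here is one way the single-node side of that trade-off might look (the file glob is a placeholder); both snippets compute the same aggregation:

```python
import duckdb
import polars as pl

# Polars: a lazy scan lets the query planner push filters and projections
# into the Parquet reader instead of materializing the full dataset first.
totals_pl = (
    pl.scan_parquet("landing/transactions_*.parquet")
    .group_by("country")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()
)

# DuckDB: the same aggregation expressed as SQL, executed out-of-core
# directly against the Parquet files on disk.
totals_db = duckdb.sql(
    "SELECT country, SUM(amount) AS total_amount "
    "FROM 'landing/transactions_*.parquet' GROUP BY country"
).df()
```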

Data Storage & Retrieval:

  • Work with internal data representations such as Parquet, Arrow, and CSV to support the needs of our generative models, choosing the appropriate format based on data processing and performance needs (a short PyArrow sketch follows).
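
A short sketch of that choice using PyArrow (table contents and paths are placeholders): Arrow for zero-copy in-memory interchange, Parquet for compressed columnar storage on disk:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# An Arrow table is the in-memory interchange format shared by Pandas,
# Polars, and DuckDB, so data moves between tools without copies.
table = pa.table({"customer_id": [1, 2, 3], "amount": [10.5, 99.0, 7.25]})

# Parquet persists it as compressed, columnar storage.
pq.write_table(table, "features.parquet", compression="zstd")

# Columnar layout means a model can read back only the columns it needs.
subset = pq.read_table("features.parquet", columns=["amount"])
```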

Distributed Systems & Scalability:

  • Ensure that the system can scale efficiently from a single node to multiple nodes, providing graceful scaling for users with varying compute capacities (see the Dask sketch after this list).
  • Optimize SQL-based queries for performance and scalability in enterprise SQL environments, ensuring efficient querying across large datasets.
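
One possible shape for that graceful scaling, sketched with Dask (the scheduler address and paths are placeholders): the pipeline code is identical on one node or many, and only the client changes:

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Single node: a local cluster sized to the machine.
client = Client(LocalCluster(n_workers=4, threads_per_worker=2))
# Multi-node: the same pipeline pointed at a remote scheduler instead, e.g.
# client = Client("tcp://scheduler.internal:8786")

ddf = dd.read_parquet("landing/transactions_*.parquet")
totals = ddf.groupby("country")["amount"].sum().compute()
client.close()
```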

GPU Acceleration & Parallel Processing:

  • Utilize GPU acceleration and parallel processing to improve performance in large-scale model training and data processing (a minimal multi-GPU PyTorch sketch follows).
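
A toy sketch of multi-GPU training with PyTorch DistributedDataParallel (the model and batches are stand-ins for a real generative model and data loader); it would be launched with `torchrun --nproc_per_node=<num_gpus> train.py`, one process per GPU:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK and the rendezvous environment variables.
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    # Placeholder model; DDP synchronizes gradients across GPUs.
    model = DDP(torch.nn.Linear(128, 1).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(100):
        x = torch.randn(1024, 128, device=rank)  # stand-in for a real batch
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```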

Data Lineage & Metadata Management (Reduced Emphasis):

  • Implement basic data lineage for auditability, ensuring traceability in data transformations when required.
  • Manage metadata as needed to document pipelines and workflows.

Error Handling, Recovery, & Performance Monitoring:

  • Design robust error handling mechanisms, with automatic retries and data recovery in case of pipeline failures (a retry sketch follows this list).
  • Track performance metrics such as data throughput, latency, and processing times to ensure efficient pipeline operations at scale.
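
A minimal sketch combining both ideas, using the tenacity library for exponential-backoff retries (the flaky step itself is a placeholder):

```python
import logging
import time

import polars as pl
from tenacity import retry, stop_after_attempt, wait_exponential

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# Retry up to 5 times with exponential backoff capped at 60 seconds.
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=60))
def load_partition(path: str) -> int:
    """Stand-in for a flaky step (network read, warehouse query, ...)."""
    start = time.perf_counter()
    df = pl.read_parquet(path)
    elapsed = time.perf_counter() - start
    # Emit simple throughput/latency metrics for monitoring.
    log.info("loaded %d rows from %s in %.2fs (%.0f rows/s)",
             df.height, path, elapsed, df.height / max(elapsed, 1e-9))
    return df.height
```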

Documentation & Reporting:

  • Create clear documentation of data pipelines, workflows, and system architectures to enable smooth handovers and collaboration across teams.

Essential Skills and Qualifications:

High Priority:

  • Hands-on experience scaling data pipelines and machine learning systems to handle hundreds of millions to billions of rows in enterprise environments.
  • 4+ years of experience building scalable data solutions with Python and libraries such as:
    • Data Science Libraries: Pandas, NumPy, Scikit-learn.
    • Scaling Libraries: Polars for in-memory processing and DuckDB for efficient on-disk queries.
  • Ability to choose the right framework (e.g., Dask, Ray, Polars, DuckDB) depending on the workload and environment, with a focus on balancing simplicity and scalability.
  • Experience in data validation and ensuring data quality with tools like Pandera or Pydantic.
  • Proficiency in building ETL/ELT pipelines and managing data across relational databases, data warehouses, and cloud storage.
  • Strong knowledge of GPU parallelization for deep learning models using PyTorch.

Good to Have:

  • Experience with logging and monitoring in production environments.
  • Understanding of data lineage and metadata management systems to support data transparency.
  • Familiarity with Pytest for testing and validating research code.

Why Join Us:

This is a unique opportunity for someone looking to actively build and scale systems in a fast-moving startup. If you’ve successfully scaled machine learning and data systems to billions of rows and thrive in a dynamic, hands-on environment, this role is for you. We offer competitive compensation, equity options, and the chance to directly impact the future of synthetic data for enterprises.

How to Apply:

Does this role sound like a good fit for you?

  • Visit our career page to learn more.

About the company

Betterdata
Programmatic Synthetic Data for Data Privacy

Company Size
11-50
Company Type
Big Data • Artificial Intelligence • Enterprise Software Company
Company Industries
Governments

Funding

Amount Raised: $1.6M over 1 round
Seed: $1,650,000 (Apr 2023)

Perks

Flexible working hours

Founders

Uzair Javaid
Founder • 3 years • Singapore

Kevin Yee
Founder • 3 years • Singapore

Similar Jobs

Tonbo Imaging Pvt.Ltd
Imaging and Sensor Systems for Defence, Homeland Security and Complex Environments

Imaginate VR/AR
3D Meeting Platform (Metaverse) in VR/AR for Collaborative Training & Support

Gyrus.AI
AI for Video, Business and IoT Predictive Analytics

dresslife
Dresslife provides fashion-specific 1-to-1 personalization with exceptional accuracy

FilterPixel
One Click To Select & Edit Photos In Your Style

Codersarts
Programming Expert Help, Training & Mentorship and Software Development Services