Betterdata
Actively Hiring
Programmatic Synthetic Data for Data Privacy
  • Early Stage
    Startup in initial stages

Senior Data Engineer

  • $27k – $45k • 0.0% – 0.3%
  • Remote • India
  • 4 years of exp
  • Full Time
Posted: 1 week ago • Recruiter recently active
Job Location
Remote • India
Visa Sponsorship

Not Available

Remote Work Policy

Onsite or remote

Hires remotely
Everywhere
Preferred Timezones
Astana Time, Indochina Time, China Standard Time, Japan Standard Time, Brisbane Time
Collaboration Hours
10:00 AM - 7:00 PM China Standard Time
Relocation
Not Allowed
Skills
Python
Parallel Processing
Apache Spark
Parquet
Dask
PyTorch

About the job

Who Are We Looking For:

We are seeking a Senior Data & Machine Learning Engineer with hands-on experience to transform academic research into scalable, production-ready solutions for synthetic tabular data generation. This is an individual contributor (IC) role suited for someone who thrives in a fast-paced, early-stage startup environment. The ideal candidate has extensive experience scaling systems to handle datasets with hundreds of millions to billions of records and can build and optimize complex data pipelines for enterprise applications.

This role requires someone familiar with the dynamic nature of a startup, capable of rapidly designing and implementing scalable solutions. You'll work closely with research teams to optimize performance and ensure seamless integration of systems, handling data from financial institutions, government agencies, consumer brands, and internet companies.

Key Responsibilities:

ML Concepts & Algorithms:

  • Apply a strong understanding of ML concepts and algorithms, built on hands-on experience with production models in AI / data science teams, to transform research code into scalable, production-ready systems.

Data Ingestion & Integration:

  • Ingest data from enterprise relational databases such as Oracle, SQL Server, PostgreSQL, and MySQL, as well as enterprise SQL-based data warehouses like Snowflake, BigQuery, Redshift, Azure Synapse, and Teradata for large-scale analytics (see the ingestion sketch below).
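
As a rough illustration of the kind of work involved, a minimal ingestion sketch (the connection string, table name, and paths are hypothetical, not Betterdata specifics) might pull rows from PostgreSQL in bounded chunks and land them as Parquet for downstream stages:

```python
# Hypothetical ingestion sketch: stream a large table out of PostgreSQL in
# fixed-size chunks so hundreds of millions of rows never sit in memory at once.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@db-host:5432/analytics")

chunks = pd.read_sql_query(
    "SELECT * FROM transactions WHERE created_at >= '2024-01-01'",  # placeholder query
    engine,
    chunksize=1_000_000,
)
for i, chunk in enumerate(chunks):
    # Land each chunk as a separate Parquet file for downstream batch processing.
    chunk.to_parquet(f"landing/transactions_{i:05d}.parquet", index=False)
```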

Data Validation & Quality Assurance:

  • Ensure ingested data conforms to predefined schemas, checking data types, missing values, and field constraints.
  • Implement data quality checks for nulls, outliers, and duplicates to ensure data reliability (a schema-validation sketch follows this list).
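
A minimal sketch of such checks using Pandera (mentioned under Essential Skills); the schema and constraints here are invented for illustration:

```python
# Hypothetical schema: validate types, nullability, value ranges, and duplicates
# on an ingested frame before it enters the pipeline.
import pandas as pd
import pandera as pa
from pandera import Check, Column

schema = pa.DataFrameSchema({
    "customer_id": Column(int, Check.ge(0), nullable=False, unique=True),
    "amount": Column(float, Check.in_range(0, 1_000_000), nullable=False),
    "country": Column(str, Check.isin(["SG", "IN", "US"]), nullable=True),
})

df = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": [10.5, 99.0],
    "country": ["SG", "IN"],
})
validated = schema.validate(df)  # raises SchemaError on any violation
```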

Data Transformation & Processing:

  • Design scalable data pipelines for batch processing, deciding between distributed computing tools like Spark, Dask, or Ray when handling extremely large datasets across multiple nodes, and single-node tools like Polars and DuckDB for more lightweight, efficient operations. The choice will depend on the size of the data, system resources, and performance requirements.
  • Leverage Polars for high-speed, in-memory data manipulation when working with large datasets that can be processed efficiently in-memory on a single node.
  • Utilize DuckDB for on-disk query execution, offering SQL operations with minimal overhead, suitable for environments that need a balance between memory use and query performance (a single-node sketch follows this list).
  • Seamlessly transform Pandas-based research code into production-ready pipelines, ensuring efficient memory usage and fast data access without adding unnecessary complexity.
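
For flavor, here is one way the single-node side of that trade-off might look (the file glob is a placeholder); both snippets compute the same aggregation:

```python
import duckdb
import polars as pl

# Polars: a lazy scan lets the query planner push filters and projections
# into the Parquet reader instead of materializing the full dataset first.
totals_pl = (
    pl.scan_parquet("landing/transactions_*.parquet")
    .group_by("country")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()
)

# DuckDB: the same aggregation expressed as SQL, executed out-of-core
# directly against the Parquet files on disk.
totals_db = duckdb.sql(
    "SELECT country, SUM(amount) AS total_amount "
    "FROM 'landing/transactions_*.parquet' GROUP BY country"
).df()
```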

Data Storage & Retrieval:

  • Work with internal data representations such as Parquet, Arrow, and CSV to support the needs of our generative models, choosing the appropriate format based on data processing and performance needs (a short PyArrow sketch follows).
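
A short sketch of that choice using PyArrow (table contents and paths are placeholders): Arrow for zero-copy in-memory interchange, Parquet for compressed columnar storage on disk:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# An Arrow table is the in-memory interchange format shared by Pandas,
# Polars, and DuckDB, so data moves between tools without copies.
table = pa.table({"customer_id": [1, 2, 3], "amount": [10.5, 99.0, 7.25]})

# Parquet persists it as compressed, columnar storage.
pq.write_table(table, "features.parquet", compression="zstd")

# Columnar layout means a model can read back only the columns it needs.
subset = pq.read_table("features.parquet", columns=["amount"])
```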

Distributed Systems & Scalability:

  • Ensure that the system can scale efficiently from a single node to multiple nodes, providing graceful scaling for users with varying compute capacities (see the Dask sketch after this list).
  • Optimize SQL-based queries for performance and scalability in enterprise SQL environments, ensuring efficient querying across large datasets.
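
One possible shape for that graceful scaling, sketched with Dask (the scheduler address and paths are placeholders): the pipeline code is identical on one node or many, and only the client changes:

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Single node: a local cluster sized to the machine.
client = Client(LocalCluster(n_workers=4, threads_per_worker=2))
# Multi-node: the same pipeline pointed at a remote scheduler instead, e.g.
# client = Client("tcp://scheduler.internal:8786")

ddf = dd.read_parquet("landing/transactions_*.parquet")
totals = ddf.groupby("country")["amount"].sum().compute()
client.close()
```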

GPU Acceleration & Parallel Processing:

  • Utilize GPU acceleration and parallel processing to improve performance in large-scale model training and data processing (a minimal multi-GPU PyTorch sketch follows).
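
A toy sketch of multi-GPU training with PyTorch DistributedDataParallel (the model and batches are stand-ins for a real generative model and data loader); it would be launched with `torchrun --nproc_per_node=<num_gpus> train.py`, one process per GPU:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK and the rendezvous environment variables.
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    # Placeholder model; DDP synchronizes gradients across GPUs.
    model = DDP(torch.nn.Linear(128, 1).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(100):
        x = torch.randn(1024, 128, device=rank)  # stand-in for a real batch
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```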

Data Lineage & Metadata Management (Reduced Emphasis):

  • Implement basic data lineage for auditability, ensuring traceability in data transformations when required.
  • Manage metadata as needed to document pipelines and workflows.

Error Handling, Recovery, & Performance Monitoring:

  • Design robust error handling mechanisms, with automatic retries and data recovery in case of pipeline failures (a retry sketch follows this list).
  • Track performance metrics such as data throughput, latency, and processing times to ensure efficient pipeline operations at scale.
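
A minimal sketch combining both ideas, using the tenacity library for exponential-backoff retries (the flaky step itself is a placeholder):

```python
import logging
import time

import polars as pl
from tenacity import retry, stop_after_attempt, wait_exponential

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# Retry up to 5 times with exponential backoff capped at 60 seconds.
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=60))
def load_partition(path: str) -> int:
    """Stand-in for a flaky step (network read, warehouse query, ...)."""
    start = time.perf_counter()
    df = pl.read_parquet(path)
    elapsed = time.perf_counter() - start
    # Emit simple throughput/latency metrics for monitoring.
    log.info("loaded %d rows from %s in %.2fs (%.0f rows/s)",
             df.height, path, elapsed, df.height / max(elapsed, 1e-9))
    return df.height
```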

Documentation & Reporting:

  • Create clear documentation of data pipelines, workflows, and system architectures to enable smooth handovers and collaboration across teams.

Essential Skills and Qualifications:

High Priority:

  • Hands-on experience scaling data pipelines and machine learning systems to handle hundreds of millions to billions of rows in enterprise environments.
  • 4+ years of experience building scalable data solutions with Python and libraries such as:
    • Data Science Libraries: Pandas, NumPy, Scikit-learn.
    • Scaling Libraries: Polars for in-memory processing and DuckDB for efficient on-disk queries.
  • Ability to choose the right framework (e.g., Dask, Ray, Polars, DuckDB) depending on the workload and environment, with a focus on balancing simplicity and scalability.
  • Experience in data validation and ensuring data quality with tools like Pandera or Pydantic.
  • Proficiency in building ETL/ELT pipelines and managing data across relational databases, data warehouses, and cloud storage.
  • Strong knowledge of GPU parallelization for deep learning models using PyTorch.

Good to Have:

  • Experience with logging and monitoring in production environments.
  • Understanding of data lineage and metadata management systems to support data transparency.
  • Familiarity with Pytest for testing and validating research code.

Why Join Us:

This is a unique opportunity for someone looking to actively build and scale systems in a fast-moving startup. If you’ve successfully scaled machine learning and data systems to billions of rows and thrive in a dynamic, hands-on environment, this role is for you. We offer competitive compensation, equity options, and the chance to directly impact the future of synthetic data for enterprises.

How to Apply:

Does this role sound like a good fit for you?

  • Visit our career page to learn more.

About the company

Betterdata
Programmatic Synthetic Data for Data Privacy

Company Size
11-50
Company Type
Big Data • Artificial Intelligence • Enterprise Software Company
Company Industries
Governments

Funding

Amount Raised: $1.6M over 1 round
Seed: $1,650,000 (Apr 2023)

Perks

Flexible working hours

Founders

Uzair Javaid
Founder • 3 years • Singapore

Kevin Yee
Founder • 3 years • Singapore

Similar Jobs

Tonbo Imaging Pvt.Ltd
Imaging and Sensor Systems for Defence, Homeland Security and Complex Environments

Imaginate VR/AR
3D Meeting Platform (Metaverse) in VR/AR for Collaborative Training & Support

Gyrus.AI
AI for Video, Business and IoT Predictive Analytics

dresslife
Dresslife provides fashion-specific 1-to-1 personalization with exceptional accuracy

FilterPixel
One Click To Select & Edit Photos In Your Style

Codersarts
Programming Expert Help, Training & Mentorship and Software Development Services