Python Data Scientist (Web Crawling & LLM)
- ₹6L – ₹15L
- Remote
- 3 years of experience
- Full Time
About the job
We are looking for a motivated and detail-oriented Junior Python Data Scientist to join our data science team. The ideal candidate will have hands-on experience in web crawling, data cleansing, and data transformation, along with knowledge of building and training machine learning models. Experience with Large Language Models (LLMs) is a plus. You will collaborate with senior data scientists and engineers to support the collection, cleaning, processing, and analysis of large datasets that will drive business insights and model development.
Key Responsibilities:
Web Crawling & Data Collection:
Build and maintain web crawlers to extract large volumes of structured and unstructured data from various online sources using Python libraries like Scrapy, BeautifulSoup, or Selenium.
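By way of illustration, a minimal crawler of the kind described, using requests and BeautifulSoup; the URL and the choice of <h2> headings are placeholders, not a real target:

```python
# Minimal crawler sketch: fetch a page and extract heading text.
# The URL and the <h2> selector are placeholders for illustration only.
import requests
from bs4 import BeautifulSoup

def crawl(url: str) -> list[str]:
    response = requests.get(url, timeout=10, headers={"User-Agent": "example-bot/0.1"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect the text of every <h2> heading on the page.
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

if __name__ == "__main__":
    print(crawl("https://example.com"))
```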
Data Cleansing & Preprocessing:
Clean, preprocess, and standardize raw data from various sources (e.g., scraped data, databases, APIs). Handle missing data, data inconsistencies, and outliers, ensuring the data is ready for analysis and modeling.
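For instance, a small cleansing pass in Pandas; the column names and values are invented for illustration:

```python
# Cleansing sketch on scraped records: normalize strings, fill gaps,
# and clip extreme outliers. Columns and values are made up.
import pandas as pd

df = pd.DataFrame({
    "city":  ["Pune", " pune ", "Mumbai", None],
    "price": [120.0, 115.0, 9_999.0, 130.0],   # 9999 is an obvious outlier
})

df["city"] = df["city"].str.strip().str.title()   # fix inconsistent casing
df["city"] = df["city"].fillna("Unknown")          # handle missing values
low, high = df["price"].quantile([0.05, 0.95])
df["price"] = df["price"].clip(low, high)          # tame extreme outliers

print(df)
```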
Data Transformation & Feature Engineering:
Apply data transformation techniques, such as normalization, aggregation, and encoding, to convert raw data into useful features for machine learning models. Work on feature extraction and engineering from textual and numerical data sources.
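As a sketch, scaling numeric columns and one-hot encoding a categorical one with scikit-learn (column names are illustrative):

```python
# Transformation sketch: scale numeric features, one-hot encode a
# categorical feature, and stack the result for a downstream model.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age":    [23, 35, 41],
    "income": [30_000, 52_000, 61_000],
    "city":   ["Pune", "Mumbai", "Pune"],
})

transform = ColumnTransformer([
    ("scale",  StandardScaler(), ["age", "income"]),
    ("encode", OneHotEncoder(), ["city"]),
])
features = transform.fit_transform(df)   # feature matrix ready for training
print(features)
```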
Exploratory Data Analysis (EDA):
Perform exploratory data analysis to uncover patterns, trends, and insights in the data. Generate visualizations using libraries like Matplotlib, Seaborn, or Plotly to summarize and communicate key findings.
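A minimal EDA sketch; the synthetic price column stands in for scraped data:

```python
# EDA sketch: summary statistics plus a histogram of one numeric field.
# Synthetic data stands in for a real scraped dataset.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.lognormal(mean=4.0, sigma=0.5, size=500)})

print(df.describe())             # per-column summary statistics

df["price"].hist(bins=30)        # distribution of the price field
plt.xlabel("price")
plt.ylabel("count")
plt.title("Price distribution")
plt.savefig("price_hist.png")
```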
Machine Learning Model Training:
Assist in building, training, and optimizing machine learning models for predictive analytics, classification, regression, or clustering using Python frameworks like scikit-learn, TensorFlow, or PyTorch.
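A baseline training workflow of this kind, sketched with scikit-learn's bundled Iris dataset so the example is self-contained:

```python
# Baseline classification sketch: split, fit, and score a model.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```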
Working with Large Language Models (LLMs):
Support senior team members in fine-tuning and deploying Large Language Models (LLMs), such as GPT, BERT, or similar, for NLP tasks like text classification, sentiment analysis, or entity recognition.
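For example, applying a pre-trained model to sentiment analysis via Hugging Face's transformers pipeline; the checkpoint it downloads is a generic default, not a team standard:

```python
# Sentiment-analysis sketch using a pre-trained transformer.
# pipeline() pulls a default fine-tuned BERT-family checkpoint.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The onboarding docs were clear and easy to follow."))
# -> [{'label': 'POSITIVE', 'score': ...}]
```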
Model Evaluation & Optimization:
Evaluate model performance using metrics like accuracy, precision, recall, and F1-score. Assist in optimizing models using techniques such as hyperparameter tuning and cross-validation.
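A sketch of that workflow: cross-validated grid search over a small hyperparameter grid, followed by a precision/recall/F1 report:

```python
# Evaluation and tuning sketch: grid search with 5-fold cross-validation,
# then a held-out precision/recall/F1 report.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)
print("best C:", search.best_params_["C"])
print(classification_report(y_test, search.predict(X_test)))
```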
Documentation & Reporting:
Document your data pipelines, methodologies, and model outputs in a clear and structured manner. Communicate results and findings to both technical and non-technical stakeholders through reports, presentations, or dashboards.
Required Skills & Qualifications:
Education:
Bachelor’s degree in Computer Science, Data Science, Statistics, Mathematics, or a related field.
Programming Languages:
Proficiency in Python and its data-related libraries, including Pandas, NumPy, scikit-learn, and Matplotlib.
Web Crawling:
Experience with web scraping tools and libraries, such as Scrapy, BeautifulSoup, or Selenium, and handling the challenges of web data collection.
Data Cleansing & Preprocessing:
Strong skills in data wrangling and cleansing, including handling missing data, outliers, and data inconsistencies in large datasets.
Machine Learning:
Familiarity with training basic machine learning models for classification, regression, and clustering using libraries like scikit-learn.
Large Language Models (LLMs):
Understanding of NLP techniques and working knowledge of LLMs (e.g., GPT, BERT), or an eagerness to learn and work on LLM-based tasks.
Data Transformation:
Experience with data transformation techniques, such as feature engineering, scaling, and encoding, to prepare data for model training.
Version Control:
Knowledge of version control systems such as Git for collaborative development.
Preferred Qualifications:
Experience with Databases:
Basic experience with SQL or NoSQL databases for querying and retrieving data.
NLP & LLM Experience:
Hands-on experience working with natural language processing (NLP) tasks like sentiment analysis, named entity recognition, or language generation using pre-trained models or custom solutions.
Cloud & Deployment Tools:
Familiarity with cloud platforms like AWS, Google Cloud, or Azure and experience in deploying models into production.
Data Visualization:
Experience with data visualization tools like Tableau, Power BI, or similar platforms for creating dashboards or reports.
Key Competencies:
Analytical Thinking:
Ability to think critically and analytically to solve problems related to data collection, cleansing, and transformation.
Attention to Detail:
Strong attention to detail, especially in dealing with large datasets, to ensure data accuracy and quality.
Adaptability:
Ability to learn and adapt quickly to new tools, libraries, and processes, especially in the fast-evolving data science and AI landscape.
Team Collaboration:
Work effectively within a team environment, collaborating with senior data scientists, engineers, and stakeholders.