Senior Data Engineer · Genesis Technologies · Noida, India
ISHWAR
GUPTA
Azure · Databricks · Snowflake
PySpark · DBT · Delta Lake

I build the invisible infrastructure that turns raw data chaos into revenue intelligence.
4+ Years experience
1B+ Records / day
5 Platform builds
280+ POS sources
Azure Data Factory · Databricks · Snowflake · PySpark · DBT Core · Delta Lake · Medallion Architecture · Unity Catalog · Power BI · Azure Synapse · Apache Airflow · MLflow · ADLS Gen2 · Star Schema · Power Apps
[01] About

Building the pipes
that move billions

I'm a Senior Data Engineer at Genesis Technologies — 4+ years of turning raw, messy, multi-source enterprise data into reliable gold-layer assets that executives actually trust.

I've shipped data infrastructure for global retail POS networks, sustainable forestry investment, healthcare distribution, and luxury CPG brands. Every project has the same obsession: clean architecture, zero-downtime pipelines, governed data that doesn't lie.

My specialty is the full Medallion stack — from designing the landing zone schema in ADLS to the final Snowflake Star Schema that feeds the C-suite dashboard. I don't just move data. I make it trustworthy.

Full stack
Azure Data Factory
Databricks
Snowflake
PySpark
DBT Core
Delta Lake
Azure Synapse
Power BI
Unity Catalog
Apache Airflow
MLflow
ADLS Gen2
Python
SQL
Scala
Power Apps
Current role
Senior Data Engineer
Genesis Technologies · Jan 2022 – Present
Education
B.Tech · CGPA 8.2
AKTU Lucknow · 2018 – 2022
Training
Master in Data Science
DIGIPERFORM · 2023 – 2024
Location
Noida, India
Open to remote & hybrid roles
[02] Work

5 projects.
1 company.
Countless pipelines.

GENESIS
JAN 2022 — PRESENT
● CURRENT
Project 01 — Global Retail
SCOPE
  • Led end-to-end ETL development integrating POS data from 280+ global distributor sources using Azure Data Factory, Power Apps, and Logic Apps.
  • Designed and optimized SQL transforms, views, and stored procedures for high-performance data processing at scale.
  • Implemented FGAC with row-level security and column masking across SQL Server — zero unauthorized data exposure.
  • Delivered curated datasets powering enterprise Power BI KPI dashboards for global retail leadership.
  • Automated cross-functional workflows via Logic Apps and Power Apps — eliminated manual data entry across 3 departments.
ADF · SQL Server · Power BI · Power Apps · Logic Apps · FGAC
Project 02 — Sustainable Investment
NEW FORESTS
  • Architected a Medallion Data Warehouse on Azure Synapse Analytics (Bronze → Silver → Gold) for sustainable forestry & agriculture investment data.
  • Developed Synapse Pipelines and Spark notebooks transforming ADLS Gen2 data via Python, SQL, and DBT.
  • Built automated data quality validation and monitoring that significantly improved pipeline reliability.
  • Designed Power Apps + Power Automate for data validation and MDM workflows used by non-technical stakeholders.
Synapse Analytics · ADLS Gen2 · DBT · PySpark · Power Automate · Medallion
Project 03 — Near Real-Time
DUSK
  • Designed a near real-time ELT pipeline — Azure SQL Server to Snowflake, sub-hour data freshness for transactional data.
  • Built a scalable Star Schema in Snowflake Gold layer — fact + dim tables powering sales and purchase analytics dashboards.
  • Implemented SCD Type 1 and SCD Type 2 across all dimension tables — full historical tracking with zero data loss.
Snowflake · Azure SQL · Star Schema · SCD Type 1/2 · Power BI
Project 04 — Healthcare Distribution
MEDTRITION
  • Built ingestion pipelines for 30+ distributors handling CSV/Excel files via ADF, Databricks, and AWS Lambda.
  • Designed Dataflow transformation layers in Python + Pandas — standardized raw data into analytics-ready formats.
  • Implemented Apache Airflow orchestration of Databricks jobs — custom sensors, retry logic, Slack alerting on failure.
Databricks · AWS Lambda · Airflow · ADF · Python
Project 05 — Enterprise CPG
ESTEE LAUDER
  • Developed Databricks notebooks and workflows for enterprise-grade processing across global business units.
  • Built event-driven ADF pipelines with trigger-based ingestion, making new data available on arrival rather than on a batch schedule.
  • Deployed MLflow ML pipelines for intelligent field-value prediction and automated data enrichment.
  • Architected Snowflake Star Schema with RBAC + Unity Catalog governance — secure enterprise BI integration.
Databricks · MLflow · ADF · Snowflake · RBAC · Unity Catalog
[03] Stack

Tools I actually
use in production

ishwar@genesis:~$ skills --show-all --with-proficiency
cloud-platforms
Azure Data Factory
Azure Databricks
Snowflake
Azure Synapse
ADLS Gen2
Microsoft Fabric
data-engineering
Medallion Arch.
Delta Lake
DBT Core
Star Schema / SCD
Apache Airflow
ETL / ELT Design
programming
PySpark
Python
SQL
Pandas
Scala
Scikit-learn
governance & bi
Unity Catalog
RBAC
Power BI
Power Apps
CI/CD (DevOps)
MLflow
[04] Projects

Things I've shipped
outside work

ML / MLOps · Capstone Project
SmartEstateHub
Python · Scikit-learn · Streamlit · MLOps
A full-stack real estate analytics platform that predicts property prices, recommends similar listings, and visualises India's housing market on an interactive map — all deployed on Streamlit.
  • Price prediction model using Random Forest + feature engineering on 10k+ property listings (sketched after this list)
  • Recommendation engine with cosine similarity for "similar properties" suggestions
  • Interactive choropleth maps showing price heatmaps by city sector via Plotly
  • Full ML pipeline: EDA → feature selection → model training → Streamlit deployment
  • Achieved 94%+ accuracy on price prediction after hyperparameter tuning
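A minimal sketch of the price-prediction core, assuming a cleaned listings DataFrame; the column names and feature set here are illustrative, not the project's actual schema:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# listings: cleaned pandas DataFrame of 10k+ property rows (assumed)
X = listings[["bedrooms", "bathrooms", "built_up_area", "sector", "property_type"]]
y = listings["price"]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["sector", "property_type"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("rf", RandomForestRegressor(n_estimators=300, random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print(f"R² on held-out data: {model.score(X_test, y_test):.3f}")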
Recommendation System
Movie Recommender
Python · NLP · Cosine Similarity · Streamlit
An intelligent movie recommendation engine combining content-based and collaborative filtering to suggest personalised movies based on a user's watch history and preferences, with movie metadata sourced from the TMDB API.
  • Content-based filtering using TF-IDF vectorisation on plot, genre, cast, and director data
  • Cosine similarity matrix computed across 5000+ TMDB movies for fast nearest-neighbour lookup (see the sketch after this list)
  • Collaborative filtering layer using user preference signals to personalise results
  • Movie posters fetched live from TMDB API and displayed in a clean Streamlit UI
  • NLP preprocessing pipeline: stemming, stop-word removal, tag vectorisation
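A minimal sketch of the content-based half, assuming a movies DataFrame with a preprocessed tags column (plot, genre, cast, and director merged) and a default integer index; names are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# movies: pandas DataFrame with "title" and preprocessed "tags" columns (assumed)
vectors = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(movies["tags"])
sim = cosine_similarity(vectors)  # dense similarity matrix across all titles

def recommend(title: str, k: int = 5) -> list[str]:
    # Rank every movie by similarity to the query; skip position 0 (the movie itself)
    idx = movies.index[movies["title"] == title][0]
    ranked = sim[idx].argsort()[::-1][1 : k + 1]
    return movies["title"].iloc[ranked].tolist()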
Sports Analytics · Dashboard
IPL Analytics
Python · Pandas · Plotly · EDA
A comprehensive cricket analytics web app covering 17 IPL seasons — player performance trends, team strategies, win probability models, and head-to-head matchup stats with rich Plotly visualisations.
  • 17 seasons of IPL data analysed — 900+ matches, 12k+ player records
  • Win probability model computed over ball-by-ball delivery data (sketched after this list)
  • Player performance dashboards: batting avg, strike rate, economy, wickets by phase
  • Team strategy heatmaps showing bowling/batting patterns by over and ground
  • Interactive head-to-head comparison charts between any two players or teams
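A hedged sketch of what a ball-by-ball win-probability model can look like; the feature list is an assumption, not the dashboard's actual feature engineering:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# deliveries: ball-by-ball DataFrame with engineered match-state features (assumed)
features = ["runs_left", "balls_left", "wickets_in_hand", "current_run_rate", "required_run_rate"]
X, y = deliveries[features], deliveries["batting_team_won"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

win_prob = clf.predict_proba(X_test)[:, 1]  # probability the batting side wins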
HR Analytics · Tableau
HR Dashboard
Tableau · Python · NLP · Sentiment Analysis
An end-to-end HR analytics platform combining Tableau visualisations with Python NLP sentiment analysis on employee feedback — enabling proactive HR decision-making on attrition, performance, and morale.
  • Attrition analysis across department, tenure, age group, and performance band
  • VADER sentiment analysis on employee survey feedback using Python NLTK (sketched after this list)
  • Performance evaluation trends with YoY comparison and cohort breakdowns
  • Interactive Tableau dashboards with drill-down filters by region, department, grade
  • Identified top 3 attrition risk factors with actionable recommendations for HR
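A minimal sketch of the VADER scoring step with NLTK; the feedback list and the ±0.05 bucketing thresholds (VADER's conventional cutoffs) are the only assumptions:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# feedback: list of free-text survey responses (assumed)
scores = [sia.polarity_scores(text)["compound"] for text in feedback]
labels = [
    "positive" if s >= 0.05 else "negative" if s <= -0.05 else "neutral"
    for s in scores
]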
Census Analytics · Geospatial
India Census Dashboard
Python · Plotly · Streamlit · Geospatial
An interactive geospatial dashboard exploring India's demographic landscape across all 28 states and 8 UTs — population trends, literacy rates, gender distribution, and growth analysis from census data.
  • Choropleth maps built with Plotly + GeoJSON for state-level population density (sketched after this list)
  • Literacy & sex ratio breakdowns by state with drill-down to district level
  • Population growth analysis with decade-on-decade comparison (2001 vs 2011)
  • Dynamic filter panel — filter by state, metric, and demographic group simultaneously
  • Visualised urban vs rural split and top 10 most / least populous states
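A minimal sketch of the choropleth layer, assuming a per-state DataFrame and a state-boundary GeoJSON; the file path and featureidkey depend on the GeoJSON actually used:

import json
import plotly.express as px

# states_df: one row per state with a population_density column (assumed)
india_geo = json.load(open("india_states.geojson"))  # placeholder path

fig = px.choropleth(
    states_df,
    geojson=india_geo,
    locations="state_name",            # column matched against the GeoJSON features
    featureidkey="properties.ST_NM",   # property name varies by GeoJSON source
    color="population_density",
    color_continuous_scale="Reds",
)
fig.update_geos(fitbounds="locations", visible=False)
fig.show()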
Consumer Analytics · Tableau
US Complaints Dashboard
Tableau · LOD Calculations · Data Viz
An advanced Tableau dashboard leveraging Level of Detail (LOD) calculations to analyse CFPB consumer complaint data across financial sectors — geographic heatmaps, trend analysis, and resolution patterns.
  • LOD calculations (FIXED, INCLUDE, EXCLUDE) for accurate aggregation across dimensions
  • Geographic heatmap of complaint density across all 50 US states by product category
  • YoY trend lines across mortgage, credit card, student loan, and debt collection sectors
  • Resolution rate analysis — timely vs disputed vs closed without relief breakdown
  • Ranked top 10 financial companies by complaint volume with drill-through capability
[05] Credentials

Certified & levelling up

Earned credentials
🎓
Data Science — ML & AI Program
DIGIPERFORM · Dec 2024
Statistics · Tableau · Power BI · Python · ML · Deep Learning · NLP · Gen AI · RDBMS
HON196267 · DS/2023-24/DSA/110
🏛
Introduction to Data Science
CISCO · Credly
📊
Business Analytics with Excel
Simplilearn
🐍
Python (Problem Solving)
HackerRank
🗄️
SQL Intermediate
HackerRank
[06] Knowledge

Deep dives
worth reading

01
Architecture · Lakehouse
Medallion Architecture on Azure Databricks
Bronze, Silver, Gold — how I've applied this pattern at scale across 5 enterprise platforms.

What is Medallion?

Three quality layers — Bronze, Silver, Gold. Raw data arrives in Bronze, gets cleaned in Silver, and is modeled for BI in Gold. Principle: separate ingestion, quality, and consumption concerns completely.

Bronze — zero transformation, full fidelity

# Bronze ingestion — Delta + metadata columns
# (the storage account in the abfss:// URIs is a placeholder)
from pyspark.sql import functions as F

df_raw = (
    spark.read.option("header", "true")
    .csv("abfss://bronze@<account>.dfs.core.windows.net/pos/")
)
(df_raw
    .withColumn("_ingestion_ts", F.current_timestamp())   # when the row landed
    .withColumn("_source_file", F.input_file_name())      # lineage back to the file
    .write.format("delta").mode("append")
    .partitionBy("source_id", "load_date")
    .save("abfss://bronze@<account>.dfs.core.windows.net/pos_data/"))

Silver — SCD Type 2 merge

from delta.tables import DeltaTable

(DeltaTable.forPath(spark, silver_path).alias("target")
    .merge(
        df_updates.alias("source"),
        "target.id = source.id AND target.is_current = true")
    .whenMatchedUpdate(
        condition="source.updated_at > target.updated_at",
        # Expire the old version; the new version is appended in a second step
        set={"is_current": "false", "end_date": "source.updated_at"})
    .whenNotMatchedInsertAll()
    .execute())
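One subtlety: whenNotMatchedInsertAll covers brand-new keys, but the matched-and-expired rows still need their fresh versions appended. A minimal sketch of that second step, assuming the set of changed keys is captured before the merge runs:

from pyspark.sql import functions as F

# Captured BEFORE the merge: updates whose key already has a current row
changed = df_updates.join(
    spark.read.format("delta").load(silver_path)
        .where("is_current = true").select("id"),
    on="id", how="inner",
)

# Append the new current versions of those changed keys
(changed
    .withColumn("is_current", F.lit(True))
    .withColumn("end_date", F.lit(None).cast("timestamp"))
    .write.format("delta").mode("append").save(silver_path))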

Gold — DBT Star Schema

SELECT
    s.sale_date,
    p.product_key,
    d.distributor_key,
    SUM(s.quantity) AS total_units,
    SUM(s.revenue)  AS total_revenue
FROM {{ ref('silver_pos_cleaned') }} s
JOIN {{ ref('dim_product') }} p ON s.product_id = p.product_id
JOIN {{ ref('dim_distributor') }} d ON s.distributor_id = d.distributor_id
GROUP BY 1, 2, 3
02
Data Modeling · Warehouse
Star Schema & SCD Types: The Practical Guide
Dimensional modeling in action — how I design fact and dim tables for real warehouses on Snowflake and Synapse.

Why Star Schema?

Central fact table surrounded by denormalized dimensions. Optimized for query speed and human readability. Power BI and Snowflake both perform best against star schemas.

Fact table — Snowflake Gold

CREATE OR REPLACE TABLE gold.fct_sales (
    sale_sk       NUMBER AUTOINCREMENT PRIMARY KEY,
    date_key      NUMBER NOT NULL,
    product_key   NUMBER NOT NULL,
    quantity_sold NUMBER(10,2),
    revenue_usd   NUMBER(14,4)
)
CLUSTER BY (date_key);
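Every fact row joins to a conformed date dimension through date_key; a minimal PySpark sketch for generating one (the column set is illustrative):

from pyspark.sql import functions as F

# One row per calendar day, plus the standard derived attributes
dim_date = (
    spark.sql("SELECT explode(sequence(to_date('2018-01-01'), to_date('2030-12-31'))) AS date")
    .withColumn("date_key", F.date_format("date", "yyyyMMdd").cast("int"))
    .withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
    .withColumn("day_of_week", F.dayofweek("date"))  # 1 = Sunday … 7 = Saturday
    .withColumn("is_weekend", F.dayofweek("date").isin(1, 7))
)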

SCD Type 1 vs Type 2

-- SCD1: overwrite (no history)
MERGE INTO dim_product AS t
USING updates AS s
    ON t.product_id = s.product_id
WHEN MATCHED THEN UPDATE
    SET product_name = s.product_name;

-- SCD2: full history (is_current + effective dates)
CREATE TABLE dim_dist (
    dist_sk        NUMBER,
    dist_id        VARCHAR(50),
    tier           VARCHAR(50),
    effective_from DATE,
    effective_to   DATE,
    is_current     BOOLEAN DEFAULT TRUE
);
03
PySpark · Performance
PySpark Optimization Patterns I Use in Production
Processing 1B+ records daily — the patterns that make the difference between 4 minutes and 4 hours.

1. AQE + salting for data skew

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Salt the hot key so skewed rows spread across 10 partitions instead of one
df = df.withColumn(
    "key_salted",
    F.concat(F.col("key"), F.lit("_"), (F.rand() * 10).cast("int").cast("string")),
)
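Salting only the big side isn't enough: the other side of the join has to be replicated across the same salt range so every salted key finds a partner. A minimal sketch of the matching half, where small_df stands in for the other join input (salt factor 10, as above):

from pyspark.sql import functions as F

# Replicate each small-side row once per salt value 0..9, then join on the salted key
salts = spark.range(10).withColumnRenamed("id", "salt")
small_salted = (
    small_df.crossJoin(salts)
    .withColumn("key_salted", F.concat(F.col("key"), F.lit("_"), F.col("salt").cast("string")))
)
joined = df.join(small_salted, on="key_salted")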

2. Smart caching + broadcast joins

from pyspark.sql.functions import broadcast

large_df.cache()
large_df.count()   # action materializes the cache
apac = large_df.filter(F.col("region") == "APAC").agg(...)
large_df.unpersist()

result = fact.join(broadcast(dim), on="product_id")   # zero shuffle on the fact side

3. Delta Z-Ordering

from delta.tables import DeltaTable

# Co-locate files by the columns queries filter on most
(DeltaTable.forPath(spark, "/gold/fct_sales")
    .optimize()
    .executeZOrderBy("sale_date", "region_key"))

# Drop unreferenced files older than 168 hours (the 7-day default retention)
DeltaTable.forPath(spark, "/gold/fct_sales").vacuum(168)
04
Power Platform · Low-Code
Power Apps for Data Engineers: Workflows Non-Tech Teams Love
Power Apps + Power Automate wired to SQL and Dataverse becomes a serious enterprise workflow layer.

Why Power Apps?

My job doesn't end at pipelines. Business users need to validate data, manage master records, and trigger workflows — without touching the Azure Portal. I've built multi-screen apps for SCOPE, New Forests, and Medtrition.

Ecosystem: Power Apps · Power Automate · Dataverse · SQL · Power BI · SharePoint · Teams

3-screen validation app (New Forests)

  • Screen 1 — Dashboard: pending records, color-coded by quality score
  • Screen 2 — Record detail: field comparison (new vs historical avg)
  • Screen 3 — Approval: triggers Power Automate → updates Silver → fires ADF
// Gallery Items: only pending records in the user's region
Filter(
    SilverRecords,
    validation_status = "PENDING" && region in User().Department
)

// Approve button OnSelect: stamp the record, then fire the flow
Patch(
    SilverRecords,
    LookUp(SilverRecords, record_id = selectedId),
    {validation_status: "APPROVED", approved_by: User().Email}
);
PowerAutomateFlow.Run(selectedId, "APPROVED", commentInput.Text)
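The last hop of that approval flow, firing ADF, is ultimately a pipeline createRun call against the ARM REST API. A hedged Python equivalent of what the Power Automate HTTP action does; the subscription, resource group, factory, pipeline, and parameter names are all placeholders:

import requests
from azure.identity import DefaultAzureCredential

# Acquire an ARM token, then trigger the pipeline with run parameters
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    "https://management.azure.com/subscriptions/<sub-id>"
    "/resourceGroups/<rg>/providers/Microsoft.DataFactory"
    "/factories/<factory>/pipelines/<pipeline>/createRun"
    "?api-version=2018-06-01"
)
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={"record_id": "<record-id>", "status": "APPROVED"},  # pipeline parameters
)
resp.raise_for_status()
print(resp.json()["runId"])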
05
Architecture · Interactive Diagram
Modern Data Warehouse on Azure: Every Layer, Every Tool — Clickable
Complete 5-layer architecture. Click any tab to isolate a layer. Click any box to see what it does and why I chose it.


Ingest
ADF: Batch pipelines
Event Hub: Real-time stream (sketched after this list)
Functions: Serverless
Logic Apps: Automation
Storage
ADLS Gen2: Data Lake
Delta Lake: ACID · Time-travel
Snowflake: Gold warehouse
Synapse: Enterprise SQL
Transform
Databricks: PySpark
DBT Core: SQL · Tests
Airflow: DAG orchestration
MLflow: ML lifecycle
Govern
Unity Catalog: Lineage · ACL
RBAC: Access control
Monitor: Observability
CI/CD: DevOps
Consume
Power BI: BI dashboards
Power Apps: Op workflows
Databricks SQL: Ad-hoc queries
REST APIs: Integration
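As one concrete slice through these layers (Ingest into Storage), a minimal Structured Streaming sketch that reads Event Hubs via its Kafka-compatible endpoint and lands raw events in Bronze Delta; the namespace, topic, connection string, and paths are placeholders:

# Event Hubs speaks the Kafka protocol on port 9093; auth uses the connection string.
# The kafkashaded prefix below is Databricks-specific (plain org.apache... on OSS Spark).
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "pos-events")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
            'username="$ConnectionString" password="<event-hubs-connection-string>";')
    .load()
)

# Land the raw payload in Bronze; checkpointing gives exactly-once delivery into Delta
(stream.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream.format("delta")
    .option("checkpointLocation", "/bronze/_checkpoints/pos_events")
    .start("/bronze/pos_events"))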
06
Statistics · ML · Medium
Non-Parametric vs Parametric Density Estimation on Non-Gaussian Data
When to use KDE vs GMM vs parametric fitting — and why the choice matters more than most realize.

The core distinction

Parametric assumes a known distribution, estimates its parameters. Non-parametric makes zero assumptions — the data defines its own shape. Real data is never Gaussian. Sales are skewed. Latency has heavy tails.

KDE in practice

import numpy as np
from sklearn.neighbors import KernelDensity

kde = KernelDensity(bandwidth=0.5, kernel='gaussian')
kde.fit(revenue_data.reshape(-1, 1))

# score_samples returns log-density; exponentiate to plot the curve
density = np.exp(kde.score_samples(x_range.reshape(-1, 1)))

Decision guide

  • Unimodal, symmetric → Parametric (Gaussian)
  • Unimodal, skewed → KDE or log-normal
  • Multimodal → Gaussian Mixture Model (GMM)
  • Unknown shape → KDE with cross-validated bandwidth (see the sketch below)
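For that last case the bandwidth is the whole game: too small overfits noise, too large smears real modes together. A minimal sketch of cross-validated bandwidth selection with scikit-learn, reusing revenue_data from above (the grid bounds are illustrative):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Score each bandwidth on a log grid by held-out log-likelihood
grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.logspace(-1, 1, 20)},
    cv=5,
)
grid.fit(revenue_data.reshape(-1, 1))
best_kde = grid.best_estimator_
print(f"best bandwidth: {grid.best_params_['bandwidth']:.3f}")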
[07] Contact

Let's build
something
worth running.

Open to Senior Data Engineer and Lead Data Engineer roles — product-first companies where data is treated as a first-class citizen.