Udemy Enrollment Prediction
About Me
Pratham Kamble
London, UK
Tech + Data Science = Me.
- I drive meaningful outcomes with every project I touch.
- I simplify the complex so everyone can grasp it.
- I create clear, beautiful data visuals.
Udemy Student Enrollment Prediction
Udemy can use this model to help instructors estimate enrollments and optimize their courses.
Project Highlights
1. Data collected from Udemy using a custom web scraper: 8,364 courses, 36 languages, 3 main categories
Scraped topics | | |
---|---|---|
machine-learning | web-development | python |
data-science | unity | c-sharp |
artificial-intelligence | google-flutter | javascript |
data-analysis | sql | java |
generative-ai | microsoft-power-bi | c-plus-plus |
business-intelligence | unreal-engine | angular |
business-analytics | game-development | css |
deep-learning | docker | react |
data-modeling | tableau | dax |
business-analysis | | |
Feature | Information | How We Handled It |
---|---|---|
web-scraper-order | Metadata generated by the scraper to indicate the order of scraping. | Dropped as it was unnecessary for analysis. |
web-scraper-start-url | Contains the URL from which the data was scraped, indicating the topic of the course. | Used to extract the course topic for categorization. |
course-title | The title of the course, often with additional information appended. | Dropped, as the course topic from the URL was sufficient for categorization. |
course-price | Price of the course, often missing or marked as free; stored as a string. | Converted string values to numeric and dropped rows with missing prices. |
course-rating | Average course rating, stored as a string (e.g., "Rating: 4.5 out of 5"). | Extracted numeric rating from the string for analysis. |
course-num-of-reviews | Total number of reviews for the course, stored as a string (e.g., "235 reviews" or "1 review"). | Extracted numeric value and standardized singular/plural differences. |
course-total-hour-length | Duration of the course in hours, stored as a string (e.g., "3.5 total hours"). | Extracted the numeric value from the string for analysis. |
course-num-of-lectures | Number of lectures in the course, stored as a string (e.g., "22 lectures"). | Extracted numeric value from the string for analysis. |
course-instructional-level | The difficulty level of the course (e.g., Beginner, Intermediate, etc.). | Kept as-is for analysis, categorized into four distinct levels. |
course-short-description | A brief description of the course content. | Dropped due to limited relevance and complexity in processing text data within the project timeline. |
course-link | A URL leading to the course page. | Dropped as it was redundant and unnecessary for the analysis. |
course-link-href | Another URL leading to the course page. | Dropped as it was redundant. |
course-instructor | Name of the instructor(s) for the course. | Retained the name of the first listed instructor, noting the potential bias in excluding secondary instructors. |
course-language | Language of the course. | Retained for analysis as a categorical variable. |
course-enrolled-student | Number of students currently enrolled in the course, stored as a localized string (e.g., "1,679人の受講生"). | Extracted numeric values. |
raw_stat_texts | Contains instructor-related statistics (e.g., rating, reviews, students, courses), often in a single string. | Split into separate columns for each statistic. Only processed data for the first instructor listed. Converted the strings to numeric values. |
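The string-cleaning steps in the table can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `first_number` and `topic_from_url` are hypothetical, and the start-URL shape is an assumption. The sample strings are the ones quoted in the table.

```python
import re
from urllib.parse import urlparse


def first_number(text):
    """Return the first number found in a scraped string, or None."""
    if not isinstance(text, str):
        return None  # missing value (None/NaN), e.g. from a failed scrape
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Remove thousands separators before converting.
    return float(match.group().replace(",", ""))


def topic_from_url(url):
    """Use the last non-empty path segment of the start URL as the topic."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return segments[-1] if segments else None


# Example strings quoted in the feature table.
samples = [
    "Rating: 4.5 out of 5",
    "235 reviews",
    "1 review",
    "3.5 total hours",
    "22 lectures",
    "1,679人の受講生",
]
for text in samples:
    print(text, "->", first_number(text))

# Assumed URL shape, for illustration only.
print(topic_from_url("https://www.udemy.com/topic/machine-learning/"))
```

With pandas, a helper like this could be applied per column, for example `df["course-rating"].map(first_number)`.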
2. Most courses are in Programming, IT & Software, and Analytics/AI/ML
3. Outlier handling: capped and log-transformed variables to reduce skew
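A minimal sketch of capping followed by a log transform. The 99th-percentile cap and the use of `log1p` are assumptions made for illustration; the project does not state which thresholds or transform variant were used, and the sample values are made up.

```python
import numpy as np
import pandas as pd


def cap_and_log(series, upper_quantile=0.99):
    """Cap values above the given quantile, then apply log(1 + x)."""
    cap = series.quantile(upper_quantile)  # assumed threshold
    capped = series.clip(upper=cap)
    return np.log1p(capped)


# Made-up enrollment counts with one extreme course.
students = pd.Series([12, 40, 95, 310, 1_200, 250_000], name="course-enrolled-student")
transformed = cap_and_log(students)

print("skew before:", round(students.skew(), 2))
print("skew after: ", round(transformed.skew(), 2))
```

Using `log1p` rather than `log` keeps zero counts finite, which matters for courses with no enrollments or reviews.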
4. Correlation analysis: instructor students, reviews, and course price are top predictors
Correlation level table
Correlation Level | Variable | Correlation Value | Correlation Type |
---|---|---|---|
High | Number of students instructor taught | 0.66 | Positive |
High | Instructor reviews | 0.57 | Positive |
Moderate | Course price | 0.35 | Positive |
Moderate | Number of lectures | 0.32 | Positive |
Moderate | English language (is_english) | 0.29 | Positive |
Low | Total course hours | 0.19 | Positive |
Low | Number of courses instructor launched | 0.18 | Positive |
Low | Instructor rating | 0.14 | Positive |
Low | Programming Language category | 0.11 | Positive |
Low | Course difficulty (All Levels) | 0.11 | Positive |
Low | Course difficulty (Intermediate) | -0.03 | Negative |
Low | Analytics, AI & ML category | -0.05 | Negative |
Low | Course difficulty (Expert) | -0.05 | Negative |
Low | IT & Software category | -0.06 | Negative |
Low | Course difficulty (Beginner) | -0.07 | Negative |
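A table like the one above can be produced with pandas. The sketch below assumes Pearson correlation and uses illustrative cut-offs (|r| ≥ 0.5 for High, |r| ≥ 0.25 for Moderate) that are consistent with the table but not stated by the project. The column names and data are synthetic.

```python
import numpy as np
import pandas as pd


def correlation_table(df, target):
    """Correlate every numeric column with the target and label the strength."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    table = corr.to_frame("correlation")
    strength = table["correlation"].abs()
    # Assumed cut-offs, chosen only to match the levels in the table above.
    table["level"] = np.select(
        [strength >= 0.5, strength >= 0.25], ["High", "Moderate"], default="Low"
    )
    table["type"] = np.where(table["correlation"] >= 0, "Positive", "Negative")
    return table.sort_values("correlation", ascending=False)


# Synthetic stand-in data; the real features come from the scraped dataset.
rng = np.random.default_rng(0)
n = 500
instructor_students = rng.normal(size=n)
demo = pd.DataFrame({
    "enrolled": instructor_students + 0.8 * rng.normal(size=n),
    "instructor_students": instructor_students,
    "unrelated_feature": rng.normal(size=n),
})
print(correlation_table(demo, "enrolled"))
```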
5. Decision Tree and Random Forest models trained with cross-validation
Models performance
Model | Mean Test R² (Log Scale) | Mean Test RMSE (Log Scale) | Mean Test R² (Original Scale) | Mean Test RMSE (Original Scale) |
---|---|---|---|---|
Random Forest | 0.6821 ± 0.0187 | 1.4967 ± 0.0363 | 0.5154 ± 0.0252 | 19,633.30 ± 1,274.57 |
Decision Tree | 0.5899 ± 0.0134 | 1.7005 ± 0.0195 | 0.3626 ± 0.0816 | 22,431.38 ± 1,248.31 |
Mean Baseline | — | — | 0.0000 | 31,205.18 |
Median Baseline | — | — | -0.0982 | 32,701.18 |
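The table reports metrics on both the log scale and the original scale. One way to obtain such numbers is to train on a log-transformed target and back-transform predictions inside each cross-validation fold. The sketch below assumes `log1p`/`expm1`, 5-fold `KFold`, and default-like hyperparameters; the data is synthetic and the helper `cross_validate_log_target` is hypothetical, so the printed scores will not match the table.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor


def cross_validate_log_target(model, X, y, n_splits=5, seed=0):
    """K-fold CV on log1p(y); returns (mean, std) of R² and RMSE on both scales."""
    y_log = np.log1p(y)
    scores = {"r2_log": [], "rmse_log": [], "r2_orig": [], "rmse_orig": []}
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        fitted = clone(model).fit(X[train_idx], y_log[train_idx])
        pred_log = fitted.predict(X[test_idx])
        pred_orig = np.expm1(pred_log)  # back to enrollment counts
        scores["r2_log"].append(r2_score(y_log[test_idx], pred_log))
        scores["rmse_log"].append(np.sqrt(mean_squared_error(y_log[test_idx], pred_log)))
        scores["r2_orig"].append(r2_score(y[test_idx], pred_orig))
        scores["rmse_orig"].append(np.sqrt(mean_squared_error(y[test_idx], pred_orig)))
    return {name: (np.mean(vals), np.std(vals)) for name, vals in scores.items()}


# Synthetic, right-skewed target standing in for enrollment counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.exp(3 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300))

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}
results = {name: cross_validate_log_target(m, X, y) for name, m in models.items()}
for name, stats in results.items():
    print(name, {k: f"{mean:.3f} ± {std:.3f}" for k, (mean, std) in stats.items()})
```

Scores on the original scale are usually lower than on the log scale because back-transforming amplifies errors on the largest courses, which is the same pattern the table shows.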
6. Random Forest outperformed Decision Tree (R² = 0.52 vs 0.36 on original scale)
7. Learning Curves
8. Validation curves for hyperparameter tuning
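Learning and validation curves like those in items 7 and 8 can be computed with scikit-learn's `learning_curve` and `validation_curve`. The sketch below is illustrative only: the data is synthetic, and tuning `max_depth` is an assumption, since the page does not say which hyperparameters were explored.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve, validation_curve

# Synthetic log-scale target standing in for log enrollments.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y_log = 3 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Learning curve: score as a function of training-set size.
sizes, lc_train, lc_test = learning_curve(
    model, X, y_log, train_sizes=[0.3, 0.6, 1.0], cv=3, scoring="r2"
)
for size, train, test in zip(sizes, lc_train.mean(axis=1), lc_test.mean(axis=1)):
    print(f"n={size}: train R²={train:.3f}, validation R²={test:.3f}")

# Validation curve: score as a function of one hyperparameter (assumed: max_depth).
depths = [2, 4, 8]
vc_train, vc_test = validation_curve(
    model, X, y_log, param_name="max_depth", param_range=depths, cv=3, scoring="r2"
)
for depth, train, test in zip(depths, vc_train.mean(axis=1), vc_test.mean(axis=1)):
    print(f"max_depth={depth}: train R²={train:.3f}, validation R²={test:.3f}")
```

A widening gap between training and validation scores as the hyperparameter grows is the usual sign of overfitting on these plots.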
Limitations
The dataset may be biased because rows with missing values were excluded during data cleaning. The missing values were caused by network errors and the difficulty of capturing JavaScript-rendered content. For instance, dropping rows with missing course-price data likely excluded free courses, skewing the dataset toward paid offerings. This cleaning reduced the dataset from 19,425 to 8,148 observations, diminishing its diversity and completeness. The smaller dataset is less likely to reflect the population accurately, which increases the risk of overfitting and weakens generalization.
Omitted variables are another potential source of bias and can lead to biased and inconsistent estimates. Some potentially impactful variables, such as course ranking and elapsed time since launch, were not included in the dataset or the model. Omitting them could cause effects to be misattributed to the included features and skew predictions. It also limits the model's ability to capture the complex factors influencing student decisions, potentially reducing its generalizability.
Contact Me
LinkedIn: www.linkedin.com/in/prathamskk/
Explore My Other Projects
- Web Scraping, BigQuery, Data Pipeline, Topic Modelling, Looker, K-Means, GCP, Vertex AI, Gemini
  A powerful tool built for Sense Worldwide, an innovation consulting company, that collects and analyzes social media conversations to identify trends and patterns, presenting key findings through easy-to-use interactive charts and reports.
- React, Vite, Firebase, NoSQL, GCP
  A food ordering app that served 800+ orders and onboarded 600+ users in a single day, featuring real-time order tracking for our college festival.
- Food Fiesta: Landing Website
  HTML, CSS, JavaScript, Parcel, Bootstrap
  A vibrant website promoting our college's Food Fiesta event and our new food ordering app, with details about the festival, featured food items, and easy ways to order through the app.
- XGBoost, EDA, Python, Data Visualization, Machine Learning
  Leveraged XGBoost and customer purchase history to predict product reorder probability with 70% accuracy, analyzing 3 million orders and 50,000 products to help stores manage inventory better and improve the shopping experience.
- React, Vite, TypeScript, AWS, DynamoDB, Voice Transcription
  A training platform that helps nurses practice and improve their patient handoff communication skills through practice scenarios, instant feedback, and progress tracking. Features voice recording capabilities that automatically convert speech to text for easier review.
- Web Scraping, Machine Learning, Python, Pandas, Regression, Random Forest, Hyperparameter Tuning
  Built a predictive model analyzing 9000+ Udemy courses to forecast enrollment numbers using features like course pricing, content length, and instructor ratings. Used Random Forest regression to help course creators optimize their offerings.
- AI Competitor Intelligence Tool
  RAG, Gen AI, LLM, MCP, Streamlit, RAG Evaluation, Spark
  Designed an AI RAG system to analyze ~3 million tweets and understand social media customer support. Optimized the Python pipeline by converting it to Spark, reducing processing time from 2 hours to 5 minutes. Built a user-friendly web interface for the tool using Streamlit.
- OpenCV, YOLOv8, Deep Learning, Python, Data Augmentation, Dataset Generation
  Built a real-time object detection system at BARC Robotics using YOLOv8 and OpenCV. Calibrated cameras for position measurement and improved accuracy by training on real and synthetic images.
- Azure Data Lake + ETL Pipeline
  Azure, Databricks, ETL, ADLS Gen2, Data Lake, Spark
  A modern data platform on Azure cloud that processes e-commerce data through automated pipelines. Azure Data Factory and Databricks transform raw data into clean, organized layers. Data marts are implemented through dbt.