Feature scaling is a technique used in machine learning to bring all numerical features to a similar scale. Many algorithms, especially distance-based and gradient-based ones, converge faster and perform better when features are on comparable scales, because no single feature dominates simply due to its larger magnitude.
Example:
Imagine you have a dataset with height (in cm) and salary (in dollars):
🔹 Height: 150 cm, 160 cm, 170 cm
🔹 Salary: 50,000, 60,000, 70,000
Since salary values are much larger, they can overpower height when used in a model. Feature scaling transforms them to a similar range (0 to 1 or -1 to 1), so that no feature dominates.
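To make this concrete, here is a minimal sketch of min-max scaling using scikit-learn's MinMaxScaler (assuming scikit-learn and NumPy are installed), applied to the height and salary values above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: height (cm), salary ($) -- the same values as in the example above.
data = np.array([
    [150, 50_000],
    [160, 60_000],
    [170, 70_000],
])

scaler = MinMaxScaler()              # rescales each column to the range [0, 1]
scaled = scaler.fit_transform(data)
print(scaled)
# Both columns now span 0 to 1, so neither dominates:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```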
Outliers are data points that are significantly different from the rest of the dataset. They can be much higher or lower than other values and may be caused by errors, variability, or rare events.
Example:
Imagine you collect students' ages in a classroom:
📌 Ages: 18, 19, 20, 21, 50
Here, 50 is an outlier because it's much higher than the other ages. It could be a mistake or a rare case (e.g., a teacher's age mistakenly included).
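One common way to flag such values is the interquartile range (IQR) rule. Here is a small sketch (assuming NumPy is available) applied to the ages above:

```python
import numpy as np

ages = np.array([18, 19, 20, 21, 50])

q1, q3 = np.percentile(ages, [25, 75])       # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # typical IQR fences

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)  # [50] -> flagged as an outlier
```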
Feature selection is the process of choosing the most important features (columns) in a dataset that contribute the most to making predictions. This helps improve model performance, reduce overfitting, and speed up training.
Example:
Imagine you're predicting house prices using these features:
✅ Square Footage (Important)
✅ Number of Bedrooms (Important)
❌ House Color (Not important)
❌ Owner's Name (Not important)
Removing unimportant features like these can improve model performance, as sketched below.
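Here is a minimal sketch of this selection step with pandas (assumed available); the column names and values are made up to mirror the house-price example above. In practice, automated methods such as scikit-learn's SelectKBest or a tree model's feature importances can help choose which columns to keep.

```python
import pandas as pd

# Hypothetical dataset mirroring the house-price example.
houses = pd.DataFrame({
    "square_footage": [1200, 1500, 1800, 2100],
    "num_bedrooms":   [2, 3, 3, 4],
    "house_color":    ["red", "blue", "white", "green"],
    "owner_name":     ["Ana", "Bo", "Cruz", "Dee"],
    "price":          [200_000, 250_000, 300_000, 350_000],
})

# Keep only features that plausibly drive price; drop the irrelevant ones.
selected = houses.drop(columns=["house_color", "owner_name"])

X = selected.drop(columns=["price"])   # model inputs
y = selected["price"]                  # prediction target
print(X.columns.tolist())              # ['square_footage', 'num_bedrooms']
```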