Comprehensive notes, MCQs, and Short Questions for Chapter 5 Data Analytics. Covers statistical concepts, data collection, regression, clustering, and visualization.
Definition: A model is a simplified representation of a real-world problem. It has three parts: Input, Process, and Output.
Why Models are Important:
- Help make decisions
- Save time and resources
- Predict future outcomes
Examples: Weather forecasting, predicting sales, studying disease spread.
Mean (Average): Sum of all values divided by the number of values.
Mode: The value that appears most frequently in a data set.
Median: The middle value when the data is arranged in order. For an even number of values, it is the average of the two middle values.
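A short sketch of mean, mode, and median using Python's built-in statistics module. The scores are made-up values chosen only for illustration.

```python
import statistics

# Hypothetical test scores (not from the chapter), used only as an example.
scores = [70, 85, 85, 90, 60, 75]

mean_score = statistics.mean(scores)      # sum of values / number of values
mode_score = statistics.mode(scores)      # value that appears most often
median_score = statistics.median(scores)  # middle of the sorted data; with six
                                          # values, the average of the middle two

print("Mean:", mean_score)
print("Mode:", mode_score)
print("Median:", median_score)
```

Sorted, the scores are 60, 70, 75, 85, 85, 90, so the median is the average of 75 and 85.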
Variance: The average of the squared differences between each value and the mean. High variance means the values are spread out.
Standard Deviation: The square root of the variance. It shows how far values typically lie from the mean. Easier to interpret than variance because it is in the same units as the data.
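A minimal sketch of variance and standard deviation, worked step by step and then checked against the statistics module. The data values are an assumption for illustration, and the population formulas (divide by the number of values) are used here; sample formulas divide by one less.

```python
import statistics

# Hypothetical data set, used only as an example.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)

# Variance: average of the squared differences from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / len(data)

# Standard deviation: square root of the variance.
std_dev = variance ** 0.5

print("Mean:", mean)
print("Variance:", variance)
print("Standard deviation:", std_dev)

# The library functions for population variance and standard deviation
# should agree with the manual calculation.
print(statistics.pvariance(data), statistics.pstdev(data))
```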
Definition: Probability is a measure of how likely an event is to happen, given as a number between 0 and 1.
Formula (when all outcomes are equally likely): Probability = Number of Favorable Outcomes / Total Number of Outcomes
Example: Probability of heads when tossing a fair coin = 1/2 = 50%
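The formula can be sketched in Python as follows. The helper name probability is hypothetical, and the die example is an extra illustration; both assume every outcome is equally likely.

```python
from fractions import Fraction

def probability(favorable, total):
    """Favorable outcomes divided by total outcomes (equally likely outcomes)."""
    return Fraction(favorable, total)

# Coin toss: 1 favorable outcome (heads) out of 2 possible outcomes.
p_heads = probability(1, 2)

# Extra illustration: rolling an even number on a six-sided die (2, 4, or 6).
p_even = probability(3, 6)

print("P(heads) =", p_heads, "=", f"{float(p_heads):.0%}")
print("P(even)  =", p_even, "=", f"{float(p_even):.0%}")
```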
Uses: Weather forecasting, games, medical testing.
Surveys: Questionnaires given to a group to collect standardized data quickly. Can be online, phone, or paper.
Observations: Watching people or events in their natural setting without asking questions.
Experiments: Changing one variable to see its effect on another. Used to test cause-and-effect relationships.
Data Cleaning: Fixing or removing errors like incorrect entries, missing values, or duplicates.
Data Transformation: Converting data into a format better suited to analysis (e.g., creating new columns, rearranging data).
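A small sketch of cleaning followed by transformation, using plain Python lists and dictionaries. The records, field names, and the validity rule (marks must lie between 0 and the total) are assumptions made for this example.

```python
# Hypothetical student records, used only as an example.
records = [
    {"name": "Ali", "marks": 80, "total": 100},
    {"name": "Sara", "marks": 45, "total": 50},
    {"name": "Ali", "marks": 80, "total": 100},   # duplicate row
    {"name": "Omar", "marks": -5, "total": 100},  # incorrect entry
]

# Data cleaning: drop exact duplicates and rows with impossible marks.
seen = set()
cleaned = []
for row in records:
    key = (row["name"], row["marks"], row["total"])
    is_valid = 0 <= row["marks"] <= row["total"]
    if key in seen or not is_valid:
        continue
    seen.add(key)
    cleaned.append(row)

# Data transformation: add a new "percent" column to each cleaned row.
transformed = [
    {**row, "percent": 100 * row["marks"] / row["total"]}
    for row in cleaned
]

for row in transformed:
    print(row)
```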
Handling Missing Data:
- Imputation: Filling missing values with an estimate, such as the average of the known values.
- Flagging: Marking data as missing.
- Removal: Deleting incomplete records.
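The three approaches above can be sketched on one small list. The ages are made-up values, and None stands in for a missing entry (an assumption of this example).

```python
import statistics

# Hypothetical ages; None marks a missing value.
ages = [15, None, 17, 16, None, 16]

known = [a for a in ages if a is not None]
mean_age = statistics.mean(known)

# Imputation: replace each missing value with the mean of the known values.
imputed = [a if a is not None else mean_age for a in ages]

# Flagging: keep every value but record whether it was missing.
flagged = [(a, a is None) for a in ages]

# Removal: keep only the complete entries.
removed = known

print("Imputed:", imputed)
print("Flagged:", flagged)
print("Removed:", removed)
```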
Linear Regression: Predicts a numeric value (the dependent variable, Y) from another variable (the independent variable, X). Formula: Y = a + bX, where a is the intercept and b is the slope.
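A minimal sketch of fitting Y = a + bX with the usual least-squares formulas. The data points are invented so that they lie exactly on the line Y = 1 + 2X, and the function name predict is hypothetical.

```python
# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b: how much Y changes for each one-unit change in X.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))

# Intercept a: the value of Y when X is 0.
a = mean_y - b * mean_x

def predict(x):
    """Predict Y for a given X using the fitted line."""
    return a + b * x

print("a =", a, "b =", b)
print("Prediction for X = 6:", predict(6))
```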
Logistic Regression: Predicts a Yes/No outcome. It gives a probability between 0 and 1, which is turned into Yes or No using a cut-off (commonly 0.5).
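A sketch of how logistic regression turns a number into a probability. The coefficients a and b below are hypothetical values picked by hand, not fitted from data, and the pass/fail scenario and 0.5 cut-off are assumptions for illustration.

```python
import math

def predict_probability(x, a, b):
    """Logistic (sigmoid) function: always returns a value between 0 and 1."""
    return 1 / (1 + math.exp(-(a + b * x)))

# Hypothetical coefficients: a = -4, b = 1, with x = hours studied.
a, b = -4, 1

for hours in [1, 4, 8]:
    p = predict_probability(hours, a, b)
    answer = "Yes" if p >= 0.5 else "No"  # assumed cut-off of 0.5
    print(f"{hours} hours -> probability of passing {p:.2f} -> {answer}")

p_at_four = predict_probability(4, a, b)  # a + b*x is 0 here, so p is 0.5
```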
Clustering (K-Means): Groups similar items into K clusters without predefined labels. Useful for finding patterns in data.
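A tiny sketch of the K-Means idea on one-dimensional data with K = 2. The points, the starting centroids, and the fixed number of iterations are all assumptions for this example; real tools usually pick starting centroids at random and stop when the clusters no longer change.

```python
def k_means(points, centroids, iterations=10):
    """Very small 1-D K-Means: points and centroids are plain numbers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical data with two obvious groups.
points = [1, 2, 3, 10, 11, 12]
centroids, clusters = k_means(points, centroids=[1, 12])

print("Centroids:", centroids)
print("Clusters:", clusters)
```

No labels were given to the algorithm; the two groups come only from how close the values are to each other.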
Bar Chart: Compares different categories using bars.
Line Graph: Shows changes over time.
Histogram: Shows how data is distributed across ranges.
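A sketch of the counting behind a histogram: values are sorted into ranges (bins) and each bar shows how many values fall in that range. The marks and the bin width of 20 are assumptions for illustration, and the bars are printed as text rather than drawn.

```python
from collections import Counter

# Hypothetical marks, used only as an example.
marks = [35, 42, 47, 51, 58, 63, 65, 68, 72, 88]
bin_width = 20

# Map each mark to the start of its range, e.g. 47 -> 40, 63 -> 60.
counts = Counter((m // bin_width) * bin_width for m in marks)

# Print one text "bar" per range.
for start in sorted(counts):
    end = start + bin_width - 1
    print(f"{start}-{end}: {'#' * counts[start]}")
```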
Scatter Plot: Shows the relationship between two variables.
Box Plot: Shows data spread, median, and outliers.
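A sketch of the numbers a box plot is built from. The data is made up, and the 1.5 x IQR rule for outliers is a common convention assumed here; the notes do not say which rule to use. Quartile values can differ slightly between tools because several calculation methods exist.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

# Quartiles split the sorted data into four equal parts.
q1, median, q3 = statistics.quantiles(data, n=4)

# Interquartile range (IQR): the width of the box.
iqr = q3 - q1

# Assumed convention: values more than 1.5 * IQR outside the box are outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("IQR:", iqr)
print("Outliers:", outliers)
```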
Tools: MS Excel, Google Sheets.