Sample: Use when your data is a subset of a larger group (most common). Divides by n-1. | Population: Use when your data includes every single member of the group. Divides by N.
The Ultimate Guide to Descriptive Statistics and Data Analysis
Whether you are a student, researcher, or analyst, understanding descriptive statistics is one of the most valuable skills you can develop. This guide breaks down every statistic this tool computes, explains the math behind it, and tells you when and why to use each one.
The Mean (or arithmetic average) is calculated by adding all the values in a dataset together and dividing by the total count. It is the single most commonly referenced statistic in everyday life - from average test scores to average income. The formula is straightforward:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Here x_1, ..., x_n are the values in the dataset and n is the number of values.
The Median is the literal middle value of a sorted dataset. If you sort all your numbers from smallest to largest and pick the one in the center, that is the median. If there is an even number of values, you average the two middle numbers. Apart from that one averaging step, the median does not involve any arithmetic on the values themselves - it is purely about position.
So when should you use each? Use the Mean when your data is relatively symmetric and does not contain extreme outliers. Average height, average temperature, or average product ratings are good examples. Use the Median when your data is skewed or contains outliers. Income and housing prices are classic examples: a few billionaires or multi-million-dollar mansions can pull the mean far above what a typical person earns or pays. The median income gives you a much more accurate picture of the "typical" person. A large gap between the mean and the median is usually a sign of skewness or the presence of outliers.
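To make the effect of an outlier concrete, here is a minimal sketch using Python's standard `statistics` module. The income figures are invented for illustration and do not come from any real dataset.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands; the last value is an extreme outlier.
incomes = [32, 35, 38, 41, 45, 48, 52, 5000]
typical = incomes[:-1]  # the same data with the outlier removed

print("with outlier:    mean =", mean(incomes), " median =", median(incomes))
# with outlier:    mean = 661.375  median = 43.0

print("without outlier: mean =", round(mean(typical), 2), " median =", median(typical))
# without outlier: mean = 41.57  median = 41
```

A single extreme value moves the mean from roughly 42 to over 660, while the median shifts only from 41 to 43.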
Why does the sample formula divide by n-1 instead of n? This is one of the most asked questions in introductory statistics, and the answer involves a concept called Bessel's Correction. When you collect a sample from a larger population, you are only seeing part of the full picture. The sample mean you calculate is an estimate of the true population mean, and because it is computed from the sample itself, it sits closer to the data points you collected than the true population mean does (the sample mean is exactly the value that minimizes the sum of squared deviations). As a result, the sum of squared deviations from the sample mean, divided by n, systematically underestimates the true population variance.
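By dividing by n-1 instead of n, you slightly inflate the result to correct for this bias, producing an unbiased estimator of the true population variance. The formulas side by side make the distinction clear:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \quad \text{(sample)}, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \quad \text{(population)}$$
Here \bar{x} is the mean of the n sampled values and \mu is the mean of all N members of the population.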
In practice: if you surveyed 200 people out of a city of 2 million to estimate average commute time, use Sample (n-1). If you had data for all 2 million residents, use Population (N). The Standard Deviation is simply the square root of the variance in either case, which brings the unit back to the original scale of your data (e.g., minutes instead of minutes-squared).
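The bias that Bessel's Correction removes can be checked exactly on a tiny example. The sketch below is an illustration, not the calculator's code: it takes a made-up five-value population, enumerates every possible sample of size 3 drawn with replacement, and averages the two variance formulas over all of them. The helper `variance` and its `ddof` parameter are hypothetical names chosen for this example.

```python
from fractions import Fraction
from itertools import product

def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by len(values) - ddof."""
    n = len(values)
    m = Fraction(sum(values), n)
    return sum((Fraction(x) - m) ** 2 for x in values) / (n - ddof)

population = [2, 4, 4, 6, 9]                   # hypothetical "entire group"
true_variance = variance(population, ddof=0)   # divide by N

# Every possible sample of size 3 drawn with replacement (5**3 = 125 samples).
samples = list(product(population, repeat=3))

avg_div_n = sum(variance(s, ddof=0) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / len(samples)

print("true population variance:", true_variance)      # 28/5
print("average when dividing by n:  ", avg_div_n)         # 56/15
print("average when dividing by n-1:", avg_div_n_minus_1) # 28/5
```

Averaged over all samples, dividing by n gives 56/15 (about 3.73), which is too small, while dividing by n-1 recovers the true value 28/5 (5.6) exactly.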
The Interquartile Range (IQR) measures the spread of the middle 50% of your data. To understand it, picture dividing your sorted dataset into four equal quarters using three dividers called quartiles:
Q1 (First Quartile) is the 25th percentile - the point below which 25% of your data falls. Q2 (Second Quartile) is the 50th percentile, which is simply the median. Q3 (Third Quartile) is the 75th percentile - the point below which 75% of your data falls. The IQR is simply Q3 minus Q1:
$$\mathrm{IQR} = Q_3 - Q_1$$
The IQR is powerful because it is highly resistant to outliers: it depends only on the middle 50% of your data, so making the largest or smallest values even more extreme does not change Q1, Q3, or the IQR at all. This makes it a far more reliable measure of spread than the Range (maximum minus minimum) when dealing with real-world, messy data.
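As a rough illustration of that resistance, the sketch below computes the quartiles, IQR, and range for a small made-up dataset before and after its largest value is replaced with an extreme one. It assumes the "inclusive" interpolation convention of Python's `statistics.quantiles`; several quartile conventions exist, and this calculator may use a different one that gives slightly different values. The helper name `quartile_summary` is hypothetical.

```python
from statistics import quantiles

def quartile_summary(values):
    """Return (Q1, Q3, IQR, range) using the 'inclusive' quartile convention."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1, max(values) - min(values)

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
messy = [1, 2, 3, 4, 5, 6, 7, 8, 900]   # same data, largest value made extreme

print(quartile_summary(clean))  # (3.0, 7.0, 4.0, 8)
print(quartile_summary(messy))  # (3.0, 7.0, 4.0, 899)
```

The range jumps from 8 to 899, while Q1, Q3, and the IQR are unchanged.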
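The most widely used method for outlier detection is Tukey's Fences, also known as the 1.5 x IQR Rule. You compute a lower fence and an upper fence:
$$\text{Lower fence} = Q_1 - 1.5 \times \mathrm{IQR}, \qquad \text{Upper fence} = Q_3 + 1.5 \times \mathrm{IQR}$$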
Any data point that falls below the lower fence or above the upper fence is classified as an outlier. This is the method used by this calculator. It is the same method used in box-and-whisker plots. Keep in mind: outliers are not automatically "bad" data - they may represent genuine, important findings. Always investigate why a value is extreme before removing it.
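The 1.5 x IQR rule can be sketched in a few lines. This is an illustrative version, not the calculator's actual implementation: the function name `tukey_outliers`, the commute times, and the choice of the "inclusive" quartile convention from `statistics.quantiles` are all assumptions, and a different quartile convention would move the fences slightly.

```python
from statistics import quantiles

def tukey_outliers(data, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k x IQR rule."""
    q1, _q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers

# Hypothetical commute times in minutes.
commute_minutes = [12, 15, 18, 20, 22, 25, 27, 30, 95]

lower, upper, outliers = tukey_outliers(commute_minutes)
print(lower, upper, outliers)  # 4.5 40.5 [95]
```

Here Q1 is 18 and Q3 is 27, so the IQR is 9 and the fences sit at 4.5 and 40.5; only the 95-minute commute falls outside them.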
Variance and Standard Deviation both measure how spread out your data is around the mean - but they exist on different scales. Understanding both requires walking through how they are computed.
Step 1: For each data point, compute how far it is from the mean (this is called the deviation).
Step 2: Square each of those deviations (to make them all positive and to amplify larger deviations).
Step 3: Average the squared deviations (using n-1 or N depending on sample vs. population). The result is the Variance.
The problem with variance is its unit. If you measured heights in centimeters, the variance is in centimeters-squared, which is not intuitive. To fix this, take the square root - and you get the Standard Deviation, which is back in the original unit (centimeters). Standard deviation is what most people report and use day-to-day because it is directly interpretable. Variance is more useful in advanced statistical formulas and proofs, where the square root would complicate the algebra; for example, the variances of independent variables add together, while their standard deviations do not. For everyday data analysis, standard deviation is your go-to measure of spread.
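The three steps and the final square root can be traced by hand on a small example. The heights below are invented for illustration; the code is a plain step-by-step sketch rather than how any particular tool computes these values.

```python
from math import sqrt

heights_cm = [160, 165, 170, 175, 180]   # hypothetical measurements
n = len(heights_cm)

mean = sum(heights_cm) / n                      # 170.0
deviations = [x - mean for x in heights_cm]     # Step 1: distance from the mean
squared = [d ** 2 for d in deviations]          # Step 2: square each deviation

sample_variance = sum(squared) / (n - 1)        # Step 3 (sample): 62.5 cm^2
population_variance = sum(squared) / n          # Step 3 (population): 50.0 cm^2

sample_sd = sqrt(sample_variance)               # back in centimeters
population_sd = sqrt(population_variance)

print(sample_variance, round(sample_sd, 2))          # 62.5 7.91
print(population_variance, round(population_sd, 2))  # 50.0 7.07
```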
The Mode is the value (or values) that appear most frequently in your dataset. Unlike the mean and median, the mode can be used with non-numerical data - for example, the most common eye color in a group, or the most popular item sold in a store. In a numerical context, the mode answers the question: "which value shows up the most?"
A dataset with one mode is called unimodal. A dataset in which two values tie for the highest frequency is called bimodal, and one with more than two such values is called multimodal. Bimodal and multimodal distributions are genuinely interesting in data science - they often suggest that two or more distinct subgroups are present in your data. For example, if you measured the heights of a mixed group of adults and children, the distribution might be bimodal, with one peak around adult height and one around child height.
When every value in a dataset appears exactly once, the dataset technically has no mode (or every value is a mode, which conveys no useful information). This calculator will display "No mode" in that case. The mode is rarely used alone for serious analysis, but it is a valuable first-pass check to identify the most representative or popular value in a dataset.
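The mode rules described above can be sketched with a frequency count. This is one possible way to implement them, not necessarily how this calculator does it; the function name `modes` is hypothetical, and an empty list stands in for the "No mode" case.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []                      # empty input: nothing to report
    top = max(counts.values())
    if top == 1:
        return []                      # every value appears exactly once: "No mode"
    return [value for value, count in counts.items() if count == top]

print(modes([1, 2, 2, 3, 3, 4]))                    # [2, 3]  (bimodal)
print(modes([1, 2, 3]))                             # []      (no mode)
print(modes(["brown", "blue", "brown", "green"]))   # ['brown']
```

Because it only counts occurrences, the same function works for non-numerical data such as eye colors.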