Previous Section  < Day Day Up >  Next Section

5.5 Aggregate Functions, GROUP BY, and HAVING

A SELECT statement can produce a list of rows that match a given set of conditions. The list provides the details about the selected rows, but if you want to know about the overall characteristics of the rows, you'll be more interested in getting a summary instead. When that's your goal, use aggregate functions to calculate summary values, possibly combined with a GROUP BY clause to arrange the selected rows into groups so that you can get summaries for each group.

Grouping can be based on the values in one or more columns of the selected rows. For example, the Country table indicates which continent each country is part of, so you can group the records by continent and calculate the average population of countries in each continent:






SELECT Continent, AVG(Population) FROM Country GROUP BY Continent;


Functions such as AVG() that calculate summary values for groups are known as "aggregate" functions because they're based on aggregates or groups of values.

This section describes the aggregate functions available to you and shows how to use GROUP BY to group rows appropriately for the type of summary you want to produce. It also discusses the HAVING clause that enables you to select only groups that have certain characteristics.

5.5.1 Computing Summary Values

An aggregate function calculates a summary value from a group of individual values. There are several types of aggregate functions. Those discussed here are as follows:

  • SUM() and AVG() summarize numeric values to produce sums (totals) and averages.

  • MIN() and MAX() find smallest and largest values.

  • COUNT() counts rows, values, or the number of distinct values.

Aggregate functions may be used with or without a GROUP BY clause that places rows into groups. Without a GROUP BY clause, an aggregate function calculates a summary value based on the entire set of selected rows. (That is, MySQL treats all the rows as a single group.) With a GROUP BY clause, an aggregate function calculates a summary value for each group. For example, if a WHERE clause selects 20 rows and the GROUP BY clause arranges them into four groups of five rows each, a summary function produces a value for each of the four groups.

5.5.1.1 The SUM() and AVG() Aggregate Functions

The SUM() and AVG() functions calculate sums and averages. For example, the Country table in the world database contains a Population column, so we can calculate the total world population and the average population per country like this:






mysql> SELECT SUM(Population), AVG(Population) FROM Country;

+-----------------+-----------------+

| SUM(Population) | AVG(Population) |

+-----------------+-----------------+

|      6078749450 |   25434098.1172 |

+-----------------+-----------------+


SUM() and AVG() are most commonly used with numeric values. If you use them with other types of values, those values are subjected to numeric conversion, which might not produce a sensible result:






mysql> SELECT SUM(Name), SUM(Continent) FROM Country;

+-----------+----------------+

| SUM(Name) | SUM(Continent) |

+-----------+----------------+

|         0 |            754 |

+-----------+----------------+


In this case, the Name values are strings that don't look like numbers. Each converts to zero, so the sum is zero as well. The Continent column is an ENUM, so its values are summed according to their internal numeric representation. In neither case are the results meaningful.

SUM() and AVG() ignore NULL values.

5.5.1.2 The MIN() and MAX() Aggregate Functions

MIN() and MAX() are comparison functions. They return smallest or largest numeric values, lexically first or last string values, and earliest or latest temporal values. The following queries determine the smallest and largest country populations and the lexically first and last country names:






mysql> SELECT MIN(Population), MAX(Population) FROM Country;

+-----------------+-----------------+

| MIN(Population) | MAX(Population) |

+-----------------+-----------------+

|               0 |      1277558000 |

+-----------------+-----------------+

mysql> SELECT MIN(Name), MAX(Name) FROM Country;

+-------------+-----------+

| MIN(Name)   | MAX(Name) |

+-------------+-----------+

| Afghanistan | Zimbabwe  |

+-------------+-----------+


For string values, the behavior of MIN() and MAX() depends on whether the strings are binary (case sensitive) or nonbinary (not case sensitive). Consider a table t containing the following string values:






mysql> SELECT name FROM t;

+--------+

| name   |

+--------+

| Calvin |

| alex   |

+--------+


If the names are nonbinary strings (for example, if they're CHAR values), MAX(name) returns 'Calvin' because C is greater than a in a collating order that isn't case sensitive:






mysql> SELECT MAX(name) FROM t;

+-----------+

| MAX(name) |

+-----------+

| Calvin    |

+-----------+


But if the values are binary strings (for example, CHAR BINARY values), the values are compared using the actual numeric byte values of the characters and thus are treated as case sensitive. If lowercase letters have larger byte values than uppercase letters (as is true for the default character set), a has a larger byte value than C and MAX(name) returns 'alex':






mysql> ALTER TABLE t MODIFY name CHAR(20) BINARY;

mysql> SELECT MAX(name) FROM t;

+-----------+

| MAX(name) |

+-----------+

| alex      |

+-----------+


MIN() and MAX()ignore NULL values.

5.5.1.3 The COUNT() Aggregate Function

The COUNT() function can be used in several ways to count either rows or values. To illustrate, the examples here use the following table that has several rows containing various combinations of NULL and non-NULL values:






mysql> SELECT i, j FROM t;

+------+------+

| i    | j    |

+------+------+

|    1 | NULL |

| NULL |    2 |

|    1 |    1 |

|    1 |    1 |

|    1 |    3 |

| NULL | NULL |

|    1 | NULL |

+------+------+


COUNT() may be used as follows:

  • COUNT(*) counts the total number of rows:

    
    
    
    

    
    mysql> SELECT COUNT(*) FROM t;
    
    +----------+
    
    | COUNT(*) |
    
    +----------+
    
    |        7 |
    
    +----------+
    
    

  • COUNT(expression) counts the number of non-NULL values of the given expression. It's common for expression to be a column name, in which case, COUNT() counts the number of non-NULL values in the column:

    
    
    
    

    
    mysql> SELECT COUNT(i), COUNT(j) FROM t;
    
    +----------+----------+
    
    | COUNT(i) | COUNT(j) |
    
    +----------+----------+
    
    |        5 |        4 |
    
    +----------+----------+
    
    

  • COUNT(DISTINCT expression) counts the number of distinct (unique) non-NULL values of the given expression. expression can be a column name to count the number of distinct non-NULL values in the column:

    
    
    
    

    
    mysql> SELECT COUNT(DISTINCT i), COUNT(DISTINCT j) FROM t;
    
    +-------------------+-------------------+
    
    | COUNT(DISTINCT i) | COUNT(DISTINCT j) |
    
    +-------------------+-------------------+
    
    |                 1 |                 3 |
    
    +-------------------+-------------------+
    
    

    It's also possible to give a list of expressions separated by commas. In this case, COUNT() returns the number of distinct combinations of values that contain no NULL values. The following query counts the number of distinct rows that have non-NULL values in both the i and j columns:

    
    
    
    

    
    mysql> SELECT COUNT(DISTINCT i, j) FROM t;
    
    +----------------------+
    
    | COUNT(DISTINCT i, j) |
    
    +----------------------+
    
    |                    2 |
    
    +----------------------+
    
    

5.5.1.4 Aggregate Values for an Empty Set

A SELECT statement might produce an empty result set if the table is empty or the WHERE clause selects no rows from it. If the set of values passed to an aggregate function is empty, the function computes the most sensible value. For COUNT(), the result is zero. But functions such as SUM(), MIN(), MAX(), and AVG() return NULL. They also return NULL if a nonempty result contains only NULL values. These behaviors occur because there is no way for such functions to compute results without at least one non-NULL input value.

5.5.2 Grouping Rows with GROUP BY

If the query does not contain a GROUP BY clause to place rows of the result set into groups, an aggregate function result is based on all the selected rows. A GROUP BY clause may be added to generate a more fine-grained summary that produces values for subgroups within a set of selected rows.

Suppose that a table named personnel contains the following information about company employees:






mysql> SELECT * FROM personnel;

+---------+--------+---------+-------------+----------+

| pers_id | name   | dept_id | title       | salary   |

+---------+--------+---------+-------------+----------+

|       1 | Wendy  |      14 | Supervisor  | 38000.00 |

|       2 | Wally  |       7 | Stock clerk | 28000.00 |

|       3 | Ray    |       7 | Programmer  | 41000.00 |

|       4 | Burton |      14 | Secretary   | 32000.00 |

|       5 | Gordon |      14 | President   | 78000.00 |

|       6 | Jeff   |       7 | Stock clerk | 29000.00 |

|       7 | Doris  |       7 | Programmer  | 48000.00 |

|       8 | Daisy  |       7 | Secretary   | 33000.00 |

|       9 | Bea    |       7 | Accountant  | 40000.00 |

+---------+--------+---------+-------------+----------+


A count for a query with no GROUP BY produces a single value for the entire set of rows:






mysql> SELECT COUNT(*) FROM personnel;

+----------+

| COUNT(*) |

+----------+

|        9 |

+----------+


Adding a GROUP BY clause arranges rows using the values in the grouping column or columns with the result that COUNT(*) produces a count for each group. To find out how many times each title occurs, do this:






mysql> SELECT title, COUNT(*) FROM personnel

    -> GROUP BY title;

+-------------+----------+

| title       | COUNT(*) |

+-------------+----------+

| Accountant  |        1 |

| President   |        1 |

| Programmer  |        2 |

| Secretary   |        2 |

| Stock clerk |        2 |

| Supervisor  |        1 |

+-------------+----------+


To count the number of people in each department, group by department number:






mysql> SELECT dept_id, COUNT(*) FROM personnel

    -> GROUP BY dept_id;

+---------+----------+

| dept_id | COUNT(*) |

+---------+----------+

|       7 |        6 |

|      14 |        3 |

+---------+----------+


A GROUP BY that names multiple columns arranges rows according to the combinations of values in those columns. For example, to find out how many times each job title occurs in each department, group by both department and title:






mysql> SELECT dept_id, title, COUNT(*) FROM personnel

    -> GROUP BY dept_id, title;

+---------+-------------+----------+

| dept_id | title       | COUNT(*) |

+---------+-------------+----------+

|       7 | Accountant  |        1 |

|       7 | Programmer  |        2 |

|       7 | Secretary   |        1 |

|       7 | Stock clerk |        2 |

|      14 | President   |        1 |

|      14 | Secretary   |        1 |

|      14 | Supervisor  |        1 |

+---------+-------------+----------+


The preceding queries use COUNT(*) to count rows, but you can also use summary functions to compute results based on values in specific columns of the rows in each group. For example, numeric functions can tell you about the salary characteristics of each title or department:






mysql> SELECT title, MIN(salary), MAX(salary), AVG(salary)

    -> FROM personnel

    -> GROUP BY title;

+-------------+-------------+-------------+--------------+

| title       | MIN(salary) | MAX(salary) | AVG(salary)  |

+-------------+-------------+-------------+--------------+

| Accountant  |    40000.00 |    40000.00 | 40000.000000 |

| President   |    78000.00 |    78000.00 | 78000.000000 |

| Programmer  |    41000.00 |    48000.00 | 44500.000000 |

| Secretary   |    32000.00 |    33000.00 | 32500.000000 |

| Stock clerk |    28000.00 |    29000.00 | 28500.000000 |

| Supervisor  |    38000.00 |    38000.00 | 38000.000000 |

+-------------+-------------+-------------+--------------+

mysql> SELECT dept_id, MIN(salary), MAX(salary), AVG(salary)

    -> FROM personnel

    -> GROUP BY dept_id;

+---------+-------------+-------------+--------------+

| dept_id | MIN(salary) | MAX(salary) | AVG(salary)  |

+---------+-------------+-------------+--------------+

|       7 |    28000.00 |    48000.00 | 36500.000000 |

|      14 |    32000.00 |    78000.00 | 49333.333333 |

+---------+-------------+-------------+--------------+


Note that in each of these queries, the output columns consist only of the columns listed in the GROUP BY clause, and values produced by summary functions. If you try to display table columns other than those listed in the GROUP BY clause, the values displayed for the extra columns are unpredictable.

5.5.3 GROUP BY and Sorting

In MySQL, a GROUP BY clause has the side effect of sorting rows. If you already have a GROUP BY clause in your query that produces the desired sort order, there's no need for an ORDER BY. Use of ORDER BY is necessary with GROUP BY only to produce a different sort order than that resulting from the GROUP BY. However, this isn't a portable behavior; for database engines other than MySQL, GROUP BY might not sort rows. To write more portable queries, add an ORDER BY even if MySQL does not require it.

For examples that show the effect of using ORDER BY in conjunction with GROUP BY to produce a different sort order than the grouping does, see the sample exercises at the end of the chapter.

5.5.4 Selecting Groups with HAVING

It could be when you use GROUP BY that you're interested only in groups that have particular summary characteristics. To retrieve just those groups and eliminate the rest, use a HAVING clause that identifies the required group characteristics. HAVING acts in a manner somewhat similar to WHERE, but occurs at a different stage of query processing:

  1. WHERE, if present, identifies the initial set of records to be selected from a table.

  2. GROUP BY arranges the selected records into groups.

  3. Aggregate functions compute summary values for each group.

  4. HAVING identifies which groups to retrieve for the final result set.

The following example shows how this progression works, using the personnel table from the previous section:

  1. A query with no GROUP BY clause or aggregate functions selects a list of records. This list provides details, not overall characteristics:

    
    
    
    

    
    mysql> SELECT title, salary
    
        -> FROM personnel WHERE dept_id = 7;
    
    +-------------+----------+
    
    | title       | salary   |
    
    +-------------+----------+
    
    | Stock clerk | 28000.00 |
    
    | Programmer  | 41000.00 |
    
    | Stock clerk | 29000.00 |
    
    | Programmer  | 48000.00 |
    
    | Secretary   | 33000.00 |
    
    | Accountant  | 40000.00 |
    
    +-------------+----------+
    
    

  2. Adding GROUP BY and aggregate functions arranges rows into groups and computes summary values for each.

    
    
    
    

    
    mysql> SELECT title, COUNT(*), AVG(salary)
    
        -> FROM personnel WHERE dept_id = 7
    
        -> GROUP BY title;
    
    +-------------+----------+--------------+
    
    | title       | COUNT(*) | AVG(salary)  |
    
    +-------------+----------+--------------+
    
    | Accountant  |        1 | 40000.000000 |
    
    | Programmer  |        2 | 44500.000000 |
    
    | Secretary   |        1 | 33000.000000 |
    
    | Stock clerk |        2 | 28500.000000 |
    
    +-------------+----------+--------------+
    
    

  3. Finally, adding HAVING places an additional constraint on the output rows. In the following query, only those groups consisting of two or more people are displayed:

    
    
    
    

    
    mysql> SELECT title, salary, COUNT(*), AVG(salary)
    
        -> FROM personnel WHERE dept_id = 7
    
        -> GROUP BY title
    
        -> HAVING COUNT(*) > 1;
    
    +-------------+----------+----------+--------------+
    
    | title       | salary   | COUNT(*) | AVG(salary)  |
    
    +-------------+----------+----------+--------------+
    
    | Programmer  | 41000.00 |        2 | 44500.000000 |
    
    | Stock clerk | 28000.00 |        2 | 28500.000000 |
    
    +-------------+----------+----------+--------------+
    
    

Sometimes you can place selection criteria in either the WHERE clause or in the HAVING clause. In such cases, it's better to do so in the WHERE clause because that eliminates rows from consideration sooner and allows the query to be processed more efficiently. Choosing values in the HAVING clause might cause the query to perform group calculations on groups in which you have no interest.

    Previous Section  < Day Day Up >  Next Section