
Title:
Complex Queries 2  Intro to Data Science

Description:

The last thing that I want to quickly

discuss ,are some functions that usually exist in query

languages like Sequel. Such as ,group bys and

aggregate functions. So let's say that I wanted to

create some transformed version of my data. For

example, what are the total number of enrollments per

district? . I could write something like this. Select

district,. Sum (aadhaar  generated) from aadhaar  data,

group by district. Let's talk about this

a little bit more. What's happening exactly?

. Sum is what we call an aggregate function. An aggregate function takes some

set of values, usually numbers and performs

a mathematical operation on them. We've used

sum ,but other aggregate functions include count.

Min, mean, max. You get the idea.

Operations that one could perform on a collection

of numbers. But wait. Every single row is

only one number. So how do we get

to collections of numbers? What we basically say here

is take each distinct district. And then for

all of the different values of aadhaar_generated corresponding

to a row ,for that district. Sum them

up. So, we start with our aadhaar data table.

Take each district ,and sum up the count aadhaar

generated. Now in order for our results to make

sense, we are only going to want one row

in our output for each district. So we throw in

this group by clause on the end, which essentially

says. Let's only have one row per district in

our results. There can be numerous clauses in a

group buy. So we could also say this, select district,

subdistrict, sum aadhaar generated, from aadhaar data, group

by district, subdistrict. Note that whatever columns we

select, that we don't aggregate, we need to

group by. In this case, district and subdistrict. We

could also put a where clause here, so

in order to sum up aadhaar generated for people

over 60 in each district, I can just add a where clause in after the table name,

as we discussed earlier. If we were to run this query. Giving us select

district, subdistrict, sum aadhaar generated, from aadhaar data,

where age greater than 60, group by district, subdistrict. If we

were to run this query, we would have

a row for every combination of district and

subdistrict. And we would also have for each

row, account of how many aadhaar were generated ,for

people over the age of 60.