Categorizing ranges of data in a dataframe using Pandas

Issue

I have a timeseries dataframe with data from multiple sites that looks like this:

Site    Date        Variable
1       01/01/2021  -1
1       02/01/2021  0
1       03/01/2021  1
1       04/01/2021  0
1       05/01/2021  -1
1       06/01/2021  0
1       07/01/2021  1
1       08/01/2021  2
1       09/01/2021  1
1       10/01/2021  0
2       01/01/2021  -5
2       02/01/2021  3
2       03/01/2021  2
2       04/01/2021  6
2       05/01/2021  -3
2       06/01/2021  3
2       07/01/2021  1
2       08/01/2021  -4
2       09/01/2021  -5
2       10/01/2021  -1

The plotted data looks like this, with certain sites having a high range and others with a low range: Plotted Data

I’d like to find a way to categorize the data into groups of ‘high’ and ‘low’ range, for example Site 1 would be put into a category that has a range from -2 to 2. I would imagine these would have to be set manually by me, that’s fine.

I’ve had a try with bins and dynamic bins, but from what I can tell these only categorize the individual variables, whereas I need the [Site] as a whole to be looked at and split into categories based on the full range of data within each site. In the end I need something that looks like this:

Site    Date        Variable    Type
1       01/01/2021  -1          LOW
1       02/01/2021  0           LOW
1       03/01/2021  1           LOW
1       04/01/2021  0           LOW
1       05/01/2021  -1          LOW
1       06/01/2021  0           LOW
1       07/01/2021  1           LOW
1       08/01/2021  2           LOW
1       09/01/2021  1           LOW
1       10/01/2021  0           LOW
2       01/01/2021  -5          HIGH
2       02/01/2021  3           HIGH
2       03/01/2021  2           HIGH
2       04/01/2021  6           HIGH
2       05/01/2021  -3          HIGH
2       06/01/2021  3           HIGH
2       07/01/2021  1           HIGH
2       08/01/2021  -4          HIGH
2       09/01/2021  -5          HIGH
2       10/01/2021  -1          HIGH

Solution

You can calculate the range per group (=max-min) and define HIGH/LOW based on a threshold (I used 3 here):

df['Type'] = (df.groupby('Site')
                ['Variable']
                .transform(lambda g: 'HIGH' if g.max()-g.min() > 3 else 'LOW')
             )

output:

    Site        Date  Variable  Type
0      1  01/01/2021        -1   LOW
1      1  02/01/2021         0   LOW
2      1  03/01/2021         1   LOW
3      1  04/01/2021         0   LOW
4      1  05/01/2021        -1   LOW
5      1  06/01/2021         0   LOW
6      1  07/01/2021         1   LOW
7      1  08/01/2021         2   LOW
8      1  09/01/2021         1   LOW
9      1  10/01/2021         0   LOW
10     2  01/01/2021        -5  HIGH
11     2  02/01/2021         3  HIGH
12     2  03/01/2021         2  HIGH
13     2  04/01/2021         6  HIGH
14     2  05/01/2021        -3  HIGH
15     2  06/01/2021         3  HIGH
16     2  07/01/2021         1  HIGH
17     2  08/01/2021        -4  HIGH
18     2  09/01/2021        -5  HIGH
19     2  10/01/2021        -1  HIGH

For an arbitrary number of categories, use pandas.cut:

df['range'] = (df.groupby('Site')['Variable']
                .transform(lambda g: g.max()-g.min())
             )

# group_name: upper bound
groups = {'LOW': 0, 'MEDIUM': 3, 'HIGH': 12}
df['Type'] = pd.cut(df['range'],
                    bins=list(groups.values())+[float('inf')],
                    labels=list(groups)
                   )

output:

    Site        Date  Variable    Type  range
0      1  01/01/2021        -1     LOW      3
1      1  02/01/2021         0     LOW      3
2      1  03/01/2021         1     LOW      3
3      1  04/01/2021         0     LOW      3
4      1  05/01/2021        -1     LOW      3
5      1  06/01/2021         0     LOW      3
6      1  07/01/2021         1     LOW      3
7      1  08/01/2021         2     LOW      3
8      1  09/01/2021         1     LOW      3
9      1  10/01/2021         0     LOW      3
10     2  01/01/2021        -5  MEDIUM     11
11     2  02/01/2021         3  MEDIUM     11
12     2  03/01/2021         2  MEDIUM     11
13     2  04/01/2021         6  MEDIUM     11
14     2  05/01/2021        -3  MEDIUM     11
15     2  06/01/2021         3  MEDIUM     11
16     2  07/01/2021         1  MEDIUM     11
17     2  08/01/2021        -4  MEDIUM     11
18     2  09/01/2021        -5  MEDIUM     11
19     2  10/01/2021        -1  MEDIUM     11

Answered By – mozway

Answer Checked By – Marie Seifert (AngularFixing Admin)

Leave a Reply

Your email address will not be published.