Indexing Rows in Data.Table
During the day (and night), I’m a military scientist at Ft. Detrick. I have to tell you…I use R to do everything on a daily basis. Whether it’s a hard core research project, administrative tasks, such as a manpower analysis, financial modeling, or communicating my work to senior leaders with reproducible research documents in RMarkdown, I use R. R adds efficiency, effectiveness and innovation to the way I manage and perform daily tasks. Imagine performing your daily work without the use of Excel. That’s right, you likely need it to be effective. But, R is Excel on Human Growth Hormone, plus a few other super Steroids. The problem is you’ve learned to think about data through an Excel lens (kudos Bill Gates). Anyways, I digress…
{{ instagram BgMG8vUnQck }}
One of the most common tasks I’ve run into is having to index a dataset. The package data.table provides powerful tools to address many problems like this. When you have a chance, check out my blog on the data.table package here. Like, Share, and Subscribe!
Solution
- Use special read-only symbols that can be used in the j position of a data.table: .SD, .N, .I, .GRP, .BY.
- Use .I to index the data.table
Let’s Try it Out!
Create a Data.Table to Work With
- First let’s make a data.table with letters in the ‘x’ column.
- I’m using the set.seed function to ensure it is reproducible if you try this out later.
- Also, we will use the head function to see the first few rows of data.
- BTW, most of the time, I use ‘dt’ to name objects that are a data.table, ‘df’ for data.frame, ‘ls’ for list, etc…
set.seed(111)
l <- sample(letters[21:26], 20, TRUE)
dt <- data.table(x = l)
head(dt)
x 1: z 2: w 3: x 4: w 5: u 6: w
- By the way, you can set ‘letters’ to ‘LETTERS’ to return capitalized versions.
- Also, change 20:26 to the specific index of each letter in the alphabet to return only the letter you want. For example, 1:5 returns a, b, c, d, and e.
How to Index the Data.Table?
- You may want to index the whole data.table, especially if you plan to split it up and merge it back together at later time .
- While you can simple just add dt$index <- 1:length(nrow(dt)) to add an indexed column, data.table provides special read-only symbols that can be used in the j position: .SD, .N, .I, .GRP, .BY.
- recall that a data.table has similar SQL terminology: dt[i,j,by]
- i is the row index/filter etc…WHERE
- j is to SELECT
- by is to GROUP BY
- Take dt, subset the rows using ‘i’, then calculate ‘j’ grouped by ’by
- Here we will use .I to index in various ways.
How Many Times Does ‘x’ Occur?
- You may want to determine how many times a given variable occurs in your dataset
- WHERE x == ‘x’
- SELECT .I (the indexed number)
- GROUP BY n/a
dt[x == "x", .I]
[1] 1 2 3
At Which Positions Does ‘x’ Occur?
- You may want to determine where (i.e. similar to the function which) a given variable occurs in your dataset
- WHERE x == ‘x’ (inside the j position)
- SELECT .I (the indexed number)
- GROUP BY n/a
dt[, .I[x == "x"]]
[1] 3 9 15
At Which Positions Does Each Unique Letter First Occur?
- You may want to determine the row of the first occurrence for each unique value.
- WHERE n/a
- SELECT .I[1] (the indexed first occurrence)
- GROUP BY x (the column of interest)
dt[, .I[1], by = x]
x V1 1: z 1 2: w 2 3: x 3 4: u 5 5: y 7 6: v 10
At Which Positions Does Each Unique Letter Occur a Second Time?
- You may want to determine the row of the second occurrence for each unique value.
- WHERE n/a
- SELECT .I[2] (the indexed second occurrence)
- GROUP BY x (the column of interest)
dt[, .I[2], by = x]
x V1 1: z 16 2: w 4 3: x 9 4: u 11 5: y 12 6: v 14
Data.table is a power package to manipulate data and has the backbone to handle big data. Once you transition to thinking about data in terms of dt[i,j,by] you can apply these tools to a lot of your daily tasks.
Until next time…