During the day (and night), I’m a military scientist at Ft. Detrick. I have to tell you…I use R to do everything on a daily basis. Whether it’s a hard core research project, administrative tasks, such as a manpower analysis, financial modeling, or communicating my work to senior leaders with reproducible research documents in RMarkdown, I use R. R adds efficiency, effectiveness and innovation to the way I manage and perform daily tasks. Imagine performing your daily work without the use of Excel. That’s right, you likely need it to be effective. But, R is Excel on Human Growth Hormone, plus a few other super Steroids. The problem is you’ve learned to think about data through an Excel lens (kudos Bill Gates). Anyways, I digress…

One of the most common tasks I’ve run into is having to index a dataset. The package data.table provides powerful tools to address many problems like this. When you have a chance, check out my blog on the data.table package here. Like, Share, and Subscribe!

Solution

Use special read-only symbols that can be used in the j position of a data.table: .SD, .N, .I, .GRP, .BY.
Use .I to index the data.table

Let’s Try it Out!

Create a Data.Table to Work With

First let’s make a data.table with letters in the ‘x’ column.
I’m using the set.seed function to ensure it is reproducible if you try this out later.
Also, we will use the head function to see the first few rows of data.
BTW, most of the time, I use ‘dt’ to name objects that are a data.table, ‘df’ for data.frame, ‘ls’ for list, etc…

set.seed(111)
l <- sample(letters[21:26], 20, TRUE)
dt <- data.table(x = l)
head(dt)

x 1: z 2: w 3: x 4: w 5: u 6: w

By the way, you can set ‘letters’ to ‘LETTERS’ to return capitalized versions.
Also, change 20:26 to the specific index of each letter in the alphabet to return only the letter you want. For example, 1:5 returns a, b, c, d, and e.

How to Index the Data.Table?

You may want to index the whole data.table, especially if you plan to split it up and merge it back together at later time .
While you can simple just add dt$index <- 1:length(nrow(dt)) to add an indexed column, data.table provides special read-only symbols that can be used in the j position: .SD, .N, .I, .GRP, .BY.
recall that a data.table has similar SQL terminology: dt[i,j,by]
- i is the row index/filter etc…WHERE
- j is to SELECT
- by is to GROUP BY
- Take dt, subset the rows using ‘i’, then calculate ‘j’ grouped by ’by
Here we will use .I to index in various ways.

How Many Times Does ‘x’ Occur?

You may want to determine how many times a given variable occurs in your dataset
WHERE x == ‘x’
SELECT .I (the indexed number)
GROUP BY n/a

dt[x == "x", .I]

[1] 1 2 3

At Which Positions Does ‘x’ Occur?

You may want to determine where (i.e. similar to the function which) a given variable occurs in your dataset
- WHERE x == ‘x’ (inside the j position)
- SELECT .I (the indexed number)
- GROUP BY n/a

dt[, .I[x == "x"]]

[1] 3 9 15

At Which Positions Does Each Unique Letter First Occur?

You may want to determine the row of the first occurrence for each unique value.
- WHERE n/a
- SELECT .I[1] (the indexed first occurrence)
- GROUP BY x (the column of interest)

dt[, .I[1], by = x]

x V1 1: z 1 2: w 2 3: x 3 4: u 5 5: y 7 6: v 10

At Which Positions Does Each Unique Letter Occur a Second Time?

You may want to determine the row of the second occurrence for each unique value.
- WHERE n/a
- SELECT .I[2] (the indexed second occurrence)
- GROUP BY x (the column of interest)

dt[, .I[2], by = x]

x V1 1: z 16 2: w 4 3: x 9 4: u 11 5: y 12 6: v 14

Data.table is a power package to manipulate data and has the backbone to handle big data. Once you transition to thinking about data in terms of dt[i,j,by] you can apply these tools to a lot of your daily tasks.

Until next time…

Jonathan D. Stallings #DataSolutions #DataScience #DataStories
jdstallings@DataInDeed.io

Indexing Rows in Data.Table