R unnest ragged column into fixed length chunks
R unnest ragged column into fixed length chunks
I have a dataset where one column is a string made of 1 number indicating day of week, followed any number of 10 digit chunks:
# A tibble: 7 x 3
respid record_type record_data
<chr> <chr> <chr>
1 20163911123050111 6 1000456561200035759120000989800
2 20163911123050111 6 2000405161200031719120000999900
3 20163911123050111 6 30004071212000320212200032832220003545620
4 20163911123050111 6 40004051612000326272200033032220003545620
5 20163911123050111 6 5036803031200040404120004051812000434361200045556120003575910
6 20163911123050111 6 6000411161200031720120003283121000344462100035759120004707410
7 20163911123050111 6 70004111312000314261200043334120004535610
I would like an elegant way to turn this into a long format by:
1. splitting column 3 in fixed chunks of length 1 then a series of length 10 characters
2. going from wide to long
so that for instance the first line above would become 3 lines:
respid record_type dayofweek chunk
<chr> <chr> <chr> <chr>
1 20163911123050111 6 1 0004565612
2 20163911123050111 6 1 0003575912
3 20163911123050111 6 1 0000989800
So far I'm using this code for the first part but it's a loop...:
my_list<-list()
for(i in 1:nrow(mydf)){
temp_list<-list()
temp_list
temp_list$respid <- mydf[i,1]
temp_list$record_type <- mydf[i,2]
temp_list$dayofweek <- stringi::stri_sub(t6[i,3],1,1)
temp_list$chunk <- stringi::stri_sub(mydf[i,3],
seq(2, stringi::stri_length(mydf[i,3]), by = 10),
length = 10)
my_list[[i]] <- temp_list
}
is there a way of doing this with something like purrr::map and tidyr::unnest?
2 Answers
2
The approach is first take out 1st character from the record_data as dayofweek. Now one can replace every 10th character and to append a separator (say ,) to prepare record_data to apply tidyr::separate_rows.
1st
record_data
dayofweek
,
record_data
tidyr::separate_rows
library(tidyverse)
df %>%
# 1st character as dayofweek
mutate(dayofweek = substring(record_data, 1,1)) %>%
# Every 10th character appended with ,
mutate(record_data = gsub("(\d{10})","\1,",substring(record_data,2))) %>%
# Remove last ,
mutate(record_data = gsub(",$","",record_data)) %>%
# Expand rows
separate_rows(record_data)
# respid record_type dayofweek record_data
# 1 20163911123050112 6 1 0004565612
# 2 20163911123050112 6 1 0003575912
# 3 20163911123050112 6 1 0000989800
# 4 20163911123050112 6 2 0004051612
# 5 20163911123050112 6 2 0003171912
# 6 20163911123050112 6 2 0000999900
# 7 20163911123050112 6 3 0004071212
# 8 20163911123050112 6 3 0003202122
# 9 20163911123050112 6 3 0003283222
# 10 20163911123050112 6 3 0003545620
# 11 20163911123050112 6 4 0004051612
# 12 20163911123050112 6 4 0003262722
# 13 20163911123050112 6 4 0003303222
# 14 20163911123050112 6 4 0003545620
# 15 20163911123050112 6 5 0368030312
# 16 20163911123050112 6 5 0004040412
# 17 20163911123050112 6 5 0004051812
# 18 20163911123050112 6 5 0004343612
# 19 20163911123050112 6 5 0004555612
# 20 20163911123050112 6 5 0003575910
# 21 20163911123050112 6 6 0004111612
# 22 20163911123050112 6 6 0003172012
# 23 20163911123050112 6 6 0003283121
# 24 20163911123050112 6 6 0003444621
# 25 20163911123050112 6 6 0003575912
# 26 20163911123050112 6 6 0004707410
# 27 20163911123050112 6 7 0004111312
# 28 20163911123050112 6 7 0003142612
# 29 20163911123050112 6 7 0004333412
# 30 20163911123050112 6 7 0004535610
Data:
df <- read.table(text ="
respid record_type record_data
20163911123050111 6 1000456561200035759120000989800
20163911123050111 6 2000405161200031719120000999900
20163911123050111 6 30004071212000320212200032832220003545620
20163911123050111 6 40004051612000326272200033032220003545620
20163911123050111 6 5036803031200040404120004051812000434361200045556120003575910
20163911123050111 6 6000411161200031720120003283121000344462100035759120004707410
20163911123050111 6 70004111312000314261200043334120004535610",
header = TRUE, colClasses = c("numeric", "integer", "character"))
We can define a function that can split the string for every 10 digits and returned a list. We can then split the day of the week and the rest of the chunk using the separate function. We can finally apply the function we defined, and unnest the data frame.
separate
unnest
# Define a function to split the string in every 10 digits
string_split <- function(string, width = 10){
lst <- list()
i <- 1
while (nchar(string) > 0){
lst[[i]] <- substring(string, 1, width)
string <- substring(string, width + 1)
i <- i + 1
}
return(lst)
}
library(tidyverse)
dat2 <- dat %>%
# Split dayofweek and chunk
separate(record_data, into = c("dayofweek", "chunk"), sep = 1) %>%
# Apply the string_split function
mutate(chunk = map(chunk, string_split)) %>%
unnest()
head(dat2)
# respid record_type dayofweek chunk
# 1 20163911123050111 6 1 0004565612
# 2 20163911123050111 6 1 0003575912
# 3 20163911123050111 6 1 0000989800
# 4 20163911123050111 6 2 0004051612
# 5 20163911123050111 6 2 0003171912
# 6 20163911123050111 6 2 0000999900
DATA
dat <- read.table(text = "respid record_type record_data
1 20163911123050111 6 1000456561200035759120000989800
2 20163911123050111 6 2000405161200031719120000999900
3 20163911123050111 6 30004071212000320212200032832220003545620
4 20163911123050111 6 40004051612000326272200033032220003545620
5 20163911123050111 6 5036803031200040404120004051812000434361200045556120003575910
6 20163911123050111 6 6000411161200031720120003283121000344462100035759120004707410
7 20163911123050111 6 70004111312000314261200043334120004535610",
header = TRUE, colClasses = "character")
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
perfect, thanks so much!
– chrisjacques
Jun 30 at 7:13