Clean special character, numeric and character





I have a variable like below in my dataframe


df$emp_length = c("10+ years", "<1 year", "8 years")



I need to clean this variable for better analysis. For example, I want to compare it with other categorical or numerical variables. What is the best way to separate this variable into multiple columns?



I am thinking of splitting this variable on the space, something like below:


df$emp_length = c("10+", "<1", "8")
df$years = c("years", "years", "years")



Also, I would like to know whether a number with special characters like + and < will be treated as numeric in R, or whether I have to separate the special characters from the numbers.



I want to have emp_length variable as numeric and years variable as character.



Please help!





What values are you looking for in place of 10+ and <1?
– MKR
Jun 30 at 10:01






1 Answer



One can use tidyr::extract to first split emp_length into two columns. Then, in the column holding the number, replace every character other than 0-9 with "" and convert the result to numeric.





Option #1: Keep the symbol with the number


library(tidyverse)
# Capture the number with its symbol (+ or <), then the unit word.
df <- df %>% extract(emp_length, c("emp_length", "years"),
                     regex = "([[:digit:]+<]+)\\s+(\\w+)")

df
#   emp_length years
# 1        10+ years
# 2         <1  year
# 3          8 years



Option #2: Keep only the number, with the column converted to numeric


library(tidyverse)

df <- df %>%
  extract(emp_length, c("emp_length", "years"),
          regex = "([[:digit:]+<]+)\\s+(\\w+)") %>%
  # Drop every non-digit character, then convert to numeric.
  mutate(emp_length = as.numeric(gsub("[^0-9]", "", emp_length)))

df
#   emp_length years
# 1         10 years
# 2          1  year
# 3          8 years
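
On the question of whether R treats values such as 10+ or <1 as numeric: it does not, so the symbols have to be stripped before conversion. The short base R check below is only an illustration; the vector x and the names num_direct and num_clean are made up for this sketch.

```r
# Hypothetical sample values, matching the ones in the question.
x <- c("10+", "<1", "8")

# Converting directly fails for any value that still carries a symbol:
# as.numeric() returns NA for it (and emits a coercion warning).
num_direct <- suppressWarnings(as.numeric(x))

# Removing every non-digit character first gives a clean conversion.
num_clean <- as.numeric(gsub("[^0-9]", "", x))

print(num_direct)
print(num_clean)
```

Note that stripping the symbol also drops the information that 10+ means "10 or more" and <1 means "less than 1", so Option #1 may be preferable if that distinction matters for the analysis.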



Data:


df <- data.frame(emp_length = c("10+ years", "<1 year", "8 years"),
stringsAsFactors = FALSE)
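
If the real data has a space between the symbol and the number (for example "< 1 year", as raised in the comments), the pattern above matches "<" as the number and "1" as the unit, which then converts to NA. One possible workaround, sketched here in base R rather than with tidyr::extract, is to derive the two columns independently so the position of the space no longer matters. The column name emp_num is an assumption made for this example.

```r
# Sample data with a space after the "<" symbol, as described in the comments.
df <- data.frame(emp_length = c("10+ years", "< 1 year", "8 years"),
                 stringsAsFactors = FALSE)

# Number: remove everything that is not a digit (symbols, spaces, letters),
# then convert to numeric.
df$emp_num <- as.numeric(gsub("[^0-9]", "", df$emp_length))

# Unit: keep only the last word of the string.
df$years <- sub("^.*\\s+(\\w+)$", "\\1", df$emp_length)

print(df)
```

This assumes every value contains exactly one number followed by a unit word; values such as "n/a" would need separate handling.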





I don't want to create a separate column for the special characters. The problem is that the value <1 year (starting with the < symbol) has been changed to NA and 1 instead of 1 and year. How can I fix this?
– Krishna
Jun 30 at 13:04





Output: num [1:39717] 10 NA 10 10 1 3 8 9 4 NA ...
– Krishna
Jun 30 at 13:14





I just noticed the space between the less-than symbol < and the 1. The actual data is < 1 year. I think that is the problem. How can I fix this?
– Krishna
Jun 30 at 13:40





Perfect! thank you
– Krishna
Jul 1 at 7:18





I did the same. thank you
– Krishna
2 days ago





