r - Tidy data frame: German characters being removed -

January 15, 2011

I am using the following code to convert a data frame to a tidy data frame:

replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% custom_stop_words2$word,
         str_detect(word, "[a-zäöüß]"))

However, this produces a tidy data frame in which the German characters üäöß are removed from the newly created word column. For example, "wählen" becomes two words, "w" and "hlen", and the special character is removed.

I am trying to get a tidy data frame of German words for text analysis and term frequencies.

Could someone point me in the right direction on how to approach this problem?

You need to replace every A-Za-z\\d in the bracket expressions with [:alnum:].

The POSIX character class [:alnum:] matches Unicode letters and digits.

replace_reg <- "https://t.co/[[:alnum:]]+|http://[[:alnum:]]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^[:alnum:]_#@']|'(?![[:alnum:]_#@]))"
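To see why the original pattern breaks up "wählen", here is a minimal sketch that splits a sample string with both versions of the tokenizing pattern. It assumes the stringr package is installed; the sample text and the names old_reg, new_reg, old_tokens and new_tokens are illustrative and not part of the original code. str_split stands in for the regex tokenizer here, it is not what unnest_tokens calls internally.

```r
library(stringr)

# ASCII-only pattern from the question: anything outside A-Za-z, _, digits, #, @, '
# is treated as a separator, so "ä" is a separator too.
old_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

# Pattern from the answer: [:alnum:] also covers non-ASCII letters.
new_reg <- "([^[:alnum:]_#@']|'(?![[:alnum:]_#@]))"

# "wählen gehen", written with a \u escape so the file encoding does not matter
sample_text <- "w\u00e4hlen gehen"

old_tokens <- str_split(sample_text, old_reg)[[1]]
new_tokens <- str_split(sample_text, new_reg)[[1]]

print(old_tokens)  # the umlaut is consumed as a separator
print(new_tokens)  # the word stays intact
```

With the ASCII-only class the umlaut is swallowed as a delimiter, which matches the "w" / "hlen" behaviour described in the question.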

If you are using these patterns with stringr functions, you may also consider using [\\p{L}\\p{N}] instead, as in

unnest_reg <- "([^\\p{L}\\p{N}_#@']|'(?![\\p{L}\\p{N}_#@]))"

where \p{L} matches any Unicode letter and \p{N} matches any Unicode number character, which includes the decimal digits.
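As a rough check of the \p{L}\p{N} variant, the sketch below tokenizes a made-up tweet with stringr alone. It assumes stringr is installed; the tweet text and the variable names tweet and words are hypothetical. Lowercasing and dropping empty strings are done by hand to approximate the output of unnest_tokens, so this is an approximation of the pipeline, not a replacement for it.

```r
library(stringr)

# Tokenizing pattern from the answer, using Unicode property classes
unnest_reg <- "([^\\p{L}\\p{N}_#@']|'(?![\\p{L}\\p{N}_#@]))"

# Hypothetical tweet: "Morgen wählen gehen! #Grüße @müller2017"
tweet <- "Morgen w\u00e4hlen gehen! #Gr\u00fc\u00dfe @m\u00fcller2017"

# Split on every separator character, lowercase, and drop the empty
# strings left behind where two separators are adjacent ("! ")
words <- str_to_lower(str_split(tweet, unnest_reg)[[1]])
words <- words[words != ""]

print(words)
```

Hashtags and mentions containing ä, ö, ü or ß stay in one piece. The final filter in the question, str_detect(word, "[a-zäöüß]"), could be written with "\\p{L}" in the same spirit if letters beyond the German alphabet should be kept.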
