[Wordcloud] Encoding problem

27 Sep 2017

Good evening, everyone!

Based on an example I found online, I tried to build a word cloud from a
plain-text file (I took a short news article from the internet and saved it
as a .txt file named SaoBento).
The code almost works. The problem is that I have not managed to fix the
*encoding* of the text.

- I tried passing encoding = "UTF-8" in the readLines() call, without success.
- I also tried enc2native() on the last line of step 7, but it fails with
the error "argument is not a character vector".
- I saved SaoBento.txt in Notepad++ as UTF-8, and the final cloud still
showed *encoding* problems.
- I put SaoBento.txt alone in its own folder and still could not get the
*encoding* right.
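
For reference, the first attempt in the list can be sketched roughly as follows. This is only an illustration, not a fix: the temporary file and its two sample lines are hypothetical stand-ins for SaoBento.txt, and the checks with Encoding() and validUTF8() are added here just to show what R reports after reading.

```r
# Hypothetical stand-in for SaoBento.txt: a temp file written as UTF-8 bytes.
path <- tempfile(fileext = ".txt")
sample_lines <- c("S\u00e3o Bento \u00e9 um munic\u00edpio.",
                  "A mat\u00e9ria cita a regi\u00e3o.")
con <- file(path, open = "wb")
writeLines(enc2utf8(sample_lines), con, useBytes = TRUE)
close(con)

# Read the file and mark the strings as UTF-8 (the encoding = "UTF-8" attempt).
text <- readLines(path, encoding = "UTF-8")

# Inspect what R reports for each line.
print(Encoding(text))    # encoding mark attached to each string
print(validUTF8(text))   # whether the bytes form valid UTF-8
```

Note that encoding = "UTF-8" only marks the strings that were read; it does not convert them, so a file that is actually saved in another encoding would still come out wrong.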

So I would like to know whether anyone can suggest how to work around this
problem.

Thank you for your attention.

Regards,
-Max Lara

PS: The variable "AQUI_ERRO" is where I see the "distorted" text.
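
One plausible cause of the enc2native() error mentioned above, offered as an assumption and not a confirmed diagnosis: in R versions before 4.0, data.frame() turns strings into factors by default, so the word column is not a character vector. The sketch below uses a hypothetical frequency vector in place of `v` from step 7 and sets stringsAsFactors = TRUE explicitly to mimic that older default.

```r
# Hypothetical frequency vector standing in for `v` from step 7.
v <- setNames(c(5, 3, 2), c("s\u00e3o", "mat\u00e9ria", "bento"))

# stringsAsFactors = TRUE is set explicitly to mimic the default of R < 4.0.
d <- data.frame(word = names(v), freq = v, stringsAsFactors = TRUE)

print(class(d$word))                # the column is a factor here

# enc2native() accepts character vectors only, so convert first.
words <- as.character(d$word)
print(Encoding(words))              # encoding mark on each word
words_native <- enc2native(words)
```

If this is what is happening, converting with as.character() (or building the data frame with stringsAsFactors = FALSE) would at least remove the error, though it may not by itself fix the distorted characters.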

#==============================================
#                             WORDCLOUD
#==============================================

*#1) INSTALL REQUIRED PACKAGES*
install.packages("tm")                         #for text mining
install.packages("SnowballC")             #for text stemming
install.packages("wordcloud")             #wordcloud generator
install.packages("RColorBrewer")        #color palettes

*#2) LOAD REQUIRED PACKAGES*
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

*#3) TEXT MINING*
#LOAD THE TEXT (SAVED LOCALLY)
text <- readLines(file.choose())

*#4) LOAD THE DATA AS A CORPUS*
#VectorSource() wraps the character vector as a source; Corpus() builds the corpus from it
docs <- Corpus(VectorSource(text))
docs <- tm_map(docs, PlainTextDocument)

*#5) TEXT TRANSFORMATION*
#tm_map() function (to replace, for instance, special characters in the text).
#Replacing "/", "@" and "|" with space:

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

*#6) TEXT CLEANING*
#tm_map() (to convert the text to lower case, remove unnecessary white space, etc.)
#Removing common stopwords

docs <- tm_map(docs, content_transformer(tolower))           #Convert the text to lower case
docs <- tm_map(docs, removeNumbers)                          #Remove numbers
docs <- tm_map(docs, removeWords, stopwords("portuguese"))   #Remove common Portuguese stopwords
docs <- tm_map(docs, removePunctuation)                      #Remove punctuation
docs <- tm_map(docs, stripWhitespace)                        #Eliminate extra white space
docs <- tm_map(docs, stemDocument)                           #Text stemming

*#7) BUILD A TERM-DOCUMENT MATRIX (TDM)*
#TDM is a table containing the frequency of the words.
#Row names are words (terms)
#Column names are documents

dtm <- TermDocumentMatrix(docs)
Terms(dtm)                                  #List the terms (tm exports Terms(), with a capital T)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
AQUI_ERRO <- d$word

*#8) GENERATE THE WORD CLOUD*
wordcloud(
  words = AQUI_ERRO,
  freq = d$freq,
  min.freq = 1,
  max.words=200,
  random.order=FALSE,
  rot.per=0.35,
  colors=brewer.pal(8, "Dark2"))
