<html><body><div style="color:#000; background-color:#fff; font-family:times new roman, new york, times, serif;font-size:12pt"><div id="yiv371183567"><div style="background-color: rgb(255, 255, 255); font-family: 'times new roman', 'new york', times, serif; "><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389110">Hello everyone</span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389123"><br></span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389180">Two points about tabulating data. Taking both into account can make it feasible to run a task on a BIG data.frame in
R.</span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389187"><br></span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; ">1) <span style="font-size:12pt;" id="yiv371183567yui_3_2_0_18_1340976957389203">The dataframe package was created to improve performance when working with data.frames. Its author, Tim Hesterberg, works at Google. He presented the package, and the programming ideas behind it, at useR! 2012. R-core member Luke Tierney implemented those ideas in base R, in version 2.15.1 (released a week ago). I read this at </span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span
id="yiv371183567yui_3_2_0_18_1340976957389135"> http://www.r-bloggers.com/r-2-15-1-includes-performance-improvements-inspired-by-dataframe-package/ </span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389140">and I strongly recommend upgrading
R to version 2.15.1, because: </span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389151"> - Tim Hesterberg created the dataframe package to speed up R for 500+ R users at Google, and the talk from his colleague Karl Millar on using Google's big-data infrastructure with R. </span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389160"> - Tim reported that using the dataframe package with R 2.15.0 improved performance by 21% for creation and column subscripting, and by 14% for row subscripting.</span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span
id="yiv371183567yui_3_2_0_18_1340976957389236"> - Beyond that suggestion,
Tim Hesterberg also proposed a change that improves the performance of tabulate(), which table() uses. His suggestion is to pass dup=FALSE in the .C() call inside tabulate(). This is the same .C() interface we sometimes use to call our own C code, and using it with dup=FALSE carries some risk. The change avoids duplicating memory. In other words, it can make the difference between being able to build a table from BIG data in R or not.</span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389285"><br></span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389272">I ran the two tests below under both versions of R:</span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-family: 'times new roman', 'new york',
times, serif; font-size: 12pt; "><span id="yiv371183567yui_3_2_0_18_1340976957389241"><div id="yiv371183567yui_3_2_0_18_134097695738948"> n <- 1e5; ns=10</div><div id="yiv371183567yui_3_2_0_18_134097695738948"> system.time(replicate(ns, data.frame(a=rep(gl(3,5),n))))</div><div id="yiv371183567yui_3_2_0_18_134097695738948"> system.time(replicate(ns, table(rep(gl(3,5),n))))<br></div><div id="yiv371183567yui_3_2_0_18_134097695738948">From R 2.15.0 to
R 2.15.1, the run time dropped by 22% in the first test and by 8% in the second.<br></div><div style="font-size: 12pt; font-family: times, serif; " id="yiv371183567yui_3_2_0_18_1340976957389249"><br></div></span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389167"><br></span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span id="yiv371183567yui_3_2_0_18_1340976957389208">2) A few days ago Walmes showed me an r-bloggers post demonstrating the efficiency of the data.table package </span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span
id="yiv371183567yui_3_2_0_18_1340976957389215"> </span>http://www.r-bloggers.com/transforming-subsets-of-data-in-r-with-by-ddply-and-data-table/</div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><span style="font-size: 12pt; ">That post reported that, under R 2.15.0, using data.table is 95 times faster than using do.call with by and 120 times faster than using ddply (from the plyr package). </span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-family: times, serif; font-size: 12pt; " class="yui_3_2_0_18_134099516272194"><span style="font-size: 12pt; "><br></span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" style="color: rgb(0, 0, 0); font-family: times, serif; font-size: 12pt; " class="yui_3_2_0_18_134099516272194"><span style="font-size: 12pt; ">I tested the following script with R 2.15.0 and R
2.15.1 (it is similar to the one posted on the blog):</span></div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"><span><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"><br></div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> require(plyr)</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> require(data.table)</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> </div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> set.seed(1)</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> df <- data.frame(Company=rep(paste("Company", 1:100),1000),</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194">
Product=gl(50,200), </div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> Year=sort(rep(2002:2011,10000)), </div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> Sales=rnorm(100000,100,10))</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"><br></div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> fn <- function(x) x/sum(x)</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> </div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> r1 <- system.time(</div><div id="yiv371183567yui_3_2_0_18_134097695738948"
class="yui_3_2_0_18_134099516272194"> R1 <- do.call("rbind", as.list(</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> by(df, df[c("Company","Year")], </div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> transform, Share=fn(Sales))</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> ))</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> )</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> </div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> r2 <- system.time(</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> R2 <- ddply(df, c("Company", "Year"), </div><div
id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> transform, Share=fn(Sales))</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> )</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"><br></div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> r3 <- system.time({</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> dt <- data.table(df)</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> setkey(dt, "Year", "Company")</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> X <- dt[, list(SUM=sum(Sales)), by=key(dt)]</div><div id="yiv371183567yui_3_2_0_18_134097695738948"
class="yui_3_2_0_18_134099516272194"> R3 <- dt[X, list(Company, Sales, Product, Share=Sales/SUM)]</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> })</div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194"> </div><div id="yiv371183567yui_3_2_0_18_134097695738948" class="yui_3_2_0_18_134099516272194">Because data.table was already optimized, its run time did not drop significantly from R 2.15.0 to R 2.15.1. With do.call plus by, or with ddply, the run time dropped by 20%. In other words, data.table is still much faster than do.call with by or ddply.</div><div style="color: rgb(0, 0, 0); font-family: times, serif; font-size: 12pt; "><br></div></span></div><div id="yiv371183567yui_3_2_0_18_134097695738954" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; ">Regards,</div><div id="yiv371183567yui_3_2_0_18_1340976957389114"
style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; ">Elias T. Krainski</div><div id="yiv371183567yui_3_2_0_18_1340976957389114" style="color: rgb(0, 0, 0); font-size: 12pt; font-family: times, serif; "><br></div> </div></div></div></body></html>