library(RWordPress)
library(XMLRPC)
library(RISmed)
library(tm)
library(wordcloud2)
library(RColorBrewer)
library(topicmodels)

wordcloud

This article is the Day 14 entry of the Stan Advent Calendar. Today's themes are wordcloud and LDA. As the figure below shows, a wordcloud picks out the words that occur most frequently in a text and draws each at a size proportional to its frequency, so word frequencies can be grasped at a glance. Above all, it looks cool. In R, once you have wrangled your text data into a Corpus or data frame (this part is a massive pain), packages such as wordcloud and wordcloud2 make building a wordcloud easy. wordcloud2 in particular produces slick results like the one below. wordcloud2 also offers letterCloud, which lays the words out in the shape of a letter or word.
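
The letterCloud() call below uses a frequency table term.f that this post never actually builds. As a minimal sketch, assuming abstracts is a hypothetical character vector of texts (the PubMed abstracts pulled below would do), such a table could be assembled like this; wordcloud2 wants exactly the two columns word and freq:

corp <- Corpus(VectorSource(abstracts)) # abstracts: hypothetical character vector of texts
tdm <- TermDocumentMatrix(corp)
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE) # term frequencies over all texts
term.f <- data.frame(word=names(freq), freq=freq) # the columns wordcloud2/letterCloud expect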

letterCloud(subset(term.f,freq>=3), word = "Stan", 
            fontFamily = "Helvetica",
            color=brewer.pal(3,"Paired"), 
            wordSize=0, backgroundColor="white")

Dumping paper information from PubMed into RStudio

The wordcloud above was plotted from the word frequencies in the abstracts of 48 English-language papers returned by a PubMed search for the term bayesian. With RISmed, as shown below, you can pull paper information from PubMed with very little effort.

res <- EUtilsSummary("bayesian", #search term
                     type = "esearch", 
                     db = "pubmed", 
                     datetype = "pdat", 
                     mindate = 2014, #first year of the search window
                     maxdate = 2017, #last year of the search window
                     retmax = 50 #maximum number of records to retrieve
                     )
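
Before downloading anything, it is worth checking how many records the query matched; RISmed's QueryCount() reports this (a quick sanity check I am adding here, not part of the original post):

QueryCount(res) #number of PubMed records matched by the query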

EUtilsSummary() builds the search query object: you specify the search term and the first and last years of the window. EUtilsGet() then retrieves and stores the paper information. From that object, ArticleTitle() extracts the titles and AbstractText() extracts the abstract texts.

  records = EUtilsGet(res) #retrieve the paper records
  pubmed_data <- data.frame(Title = ArticleTitle(records), 
                            Abstract = AbstractText(records))

And just like that, the abstract texts have been extracted. The goal of this analysis: apply LDA, the flagship topic model, to each paper's abstract, assign each paper to a latent topic, and then build a wordcloud for each topic group.

  #take a look at the data
  library(knitr)
  kable(head(pubmed_data,10))
Title Abstract
Phylogenetic analysis of HSP70 and cyt b gene sequences for Chinese Leishmania isolates and ultrastructural characteristics of Chinese Leishmania sp. Leishmaniasis is a worldwide epidemic disease caused by the genus Leishmania, which is still endemic in the west and northwest areas of China. Some viewpoints of the traditional taxonomy of Chinese Leishmania have been challenged by recent phylogenetic researches based on different molecular markers. However, the taxonomic positions and phylogenetic relationships of Chinese Leishmania isolates remain controversial, which need for more data and further analysis. In this study, the heat shock protein 70 (HSP70) gene and cytochrome b (cyt b) gene were used for phylogenetic analysis of Chinese Leishmania isolates from patients, dogs, gerbils, and sand flies in different geographic origins. Besides, for the interesting Leishmania sp. in China, the ultrastructure of three Chinese Leishmania sp. strains (MHOM/CN/90/SC10H2, SD, GL) were observed by transmission electron microscopy. Bayesian trees from HSP70 and cyt b congruently indicated that the 14 Chinese Leishmania isolates belong to three Leishmania species including L. donovani complex, L. gerbilli, and L. (Sauroleishmania) sp. Their identity further confirmed that the undescribed Leishmania species causing visceral Leishmaniasis (VL) in China is closely related to L. tarentolae. The phylogenetic results from HSP70 also suggested the classification of subspecies within L. donovani complex: KXG-918, KXG-927, KXG-Liu, KXG-Xu, 9044, SC6, and KXG-65 belong to L. donovani; Cy, WenChuan, and 801 were proposed to be L. infantum. Through transmission electron microscopy, unexpectedly, the Golgi apparatus were not observed in SC10H2, SD, and GL, which was similar to previous reports of reptilian Leishmania. The statistical analysis of microtubule counts separated SC10H2, SD, and GL as one group from any other reference strain (L. donovani MHOM/IN/80/DD8; L. tropica MHOM/SU/74/K27; L. gerbilli MRHO/CN/60/GERBILLI). The ultrastructural characteristics of Leishmania sp. partly lend support to the phylogenetic inference that Chinese Leishmania sp. is in close relationship with reptilian Leishmania.
Molecular linkage tracing of HIV-1 transmission events in seroconcordant couples in Guangxi Province, Southeastern China. BACKGROUND: Guangxi Province in Southeastern China has one of the highest HIV-1 infection and transmission rates in stable couples. However, the mode of transmission at the molecular level has seldom been reported amongst this group. It is important to investigate this issue to support the treatment-as-prevention approach and for efficient interventions.METHODS: HIV-1 subgenomic regions (1.2 kb of pol and a 660-bp env C2V5 fragment) were sequenced in 42 couples. A couple linkage assessment was performed by phylogenetic analysis of sequences and Bayesian analysis of genetic distances. A subset of pairs was selected for single-genome amplification.RESULTS: Thirty-five pairs (83.3 %, 35/42) were identified as linked, 3 pairs (7.1 %, 3/42) were identified as indeterminate, and 4 pairs (9.5 %) were identified as unlinked. The predominant intra-couple-transmitted HIV-1 subtype was CRF01_AE (80 %, 28/35). The median genetic distance of linked couples was 0.5 %.CONCLUSION: The majority of HIV-1 transmission events in this study occurred within the partnership, and the predominant HIV-1 subtype was CRF01_AE. Further research on the mode of HIV transmission in other locations is needed.
Phylogeography of the small Indian civet and origin of introductions to western Indian Ocean islands. The biogeographic dynamics affecting the Indian subcontinent, East and Southeast Asia during the Plio-Pleistocene has generated complex biodiversity patterns. We assessed the molecular biogeography of the small Indian civet (Viverricula indica) through mitogenome and cytochrome b + control region sequencing of 89 historical and modern samples to (i) establish a time-calibrated phylogeography across the species’ native range and (ii) test introduction scenarios to western Indian Ocean islands. Bayesian phylogenetic analyses identified three geographic lineages (East Asia, sister-group to Southeast Asia and the Indian subcontinent + northern Indochina) diverging 3.2 - 2.3 Mya, with no clear signature of past demographic expansion. Within Southeast Asia, Balinese populations separated from the rest 2.6 - 1.3 Mya. Western Indian Ocean populations were assigned to the Indian subcontinent + northern Indochina lineage and had the lowest mitochondrial diversity. Approximate Bayesian computation did not distinguish between single vs multiple introduction scenarios. The early diversification of the small Indian civet was likely shaped by humid periods in the Late Pliocene - Early Pleistocene that created evergreen rainforest barriers, generating areas of intra-specific endemism in the Indian subcontinent, East and Southeast Asia. Later Pleistocene dispersals through drier conditions in South and Southeast Asia were likely, giving rise to the species’ current natural distribution. Our molecular data supported the delineation of only four subspecies in V. indica, including an endemic Balinese lineage. Our study also highlighted the influence of pre-first millennium AD introductions to western Indian Ocean islands, with Indian and/or Arab traders probably introducing the species for its civet oil.
Using simulation to interpret experimental data in terms of protein conformational ensembles. In their biological environment, proteins are dynamic molecules, necessitating an ensemble structural description. Molecular dynamics simulations and solution-state experiments provide complimentary information in the form of atomically detailed coordinates and averaged or distributions of structural properties or related quantities. Recently, increases in the temporal and spatial scale of conformational sampling and comparison of the more diverse conformational ensembles thus generated have revealed the importance of sampling rare events. Excitingly, new methods based on maximum entropy and Bayesian inference are promising to provide a statistically sound mechanism for combining experimental data with molecular dynamics simulations.
In the shadows: Phylogenomics and coalescent species delimitation unveil cryptic diversity in a Cerrado endemic lizard (Squamata: Tropidurus). The recognition of cryptic diversity within geographically widespread species is gradually becoming a trend in the highly speciose Neotropical biomes. The statistical methods to recognise such cryptic lineages are rapidly advancing, but have rarely been applied to genomic-scale datasets. Herein, we used phylogenomic data to investigate phylogenetic history and cryptic diversity within Tropidurus itambere, a lizard endemic to the Cerrado biodiversity hotspot. We applied a series of phylogenetic methods to reconstruct evolutionary relationships and a coalescent Bayesian species delimitation approach (BPP) to clarify species limits. The BPP results suggest that the widespread nominal taxon comprises a complex of 5 highly supported and geographically structured cryptic species. We highlight and discuss the different topological patterns recovered by concatenated and coalescent species tree methods for these closely related lineages. Finally, we suggest that the existence of cryptic lineages in the Cerrado is much more common than traditionally thought, highlighting the value of using NGS data and coalescent techniques to investigate patterns of species diversity.
HTLV-1aA introduction into Brazil and its association with the trans-Atlantic slave trade. INTRODUCTION: Human T-lymphotropic virus (HTLV) is an endemic virus in some parts of the world, with Africa being home to most of the viral genetic diversity. In Brazil, HTLV-1 is endemic amongst Japanese and African immigrant populations. Multiple introductions of the virus in Brazil from other epidemic foci were hypothesized. The long terminal repeat (LTR) region of HTLV-1 was used to infer the origin of the virus in Brazil, using phylogenetic analysis.METHODS: LTR sequences were obtained from the HTLV-1 database (http://htlv1db.bahia.fiocruz.br). Sequences were aligned and maximum-likelihood and Bayesian tree topologies were inferred. Brazilian specific clusters were identified and molecular-clock and coalescent models were used to estimate each cluster’s time to the most recent common ancestor (tMRCA).RESULTS: Three Brazilian clusters were identified with a posterior probability ranged from 0.61 to 0.99. Molecular clock analysis of these three clusters dated back their respective tMRCAs between the year 1499 and the year 1668. Additional analysis also identified a close association between Brazilian sequences and new sequences from South Africa.CONCLUSION: Our results support the hypothesis of a multiple introductions of HTLV-1 into Brazil, with the majority of introductions occurring in the post-Colombian period. Our results further suggest that HTLV-1 introduction into Brazil was facilitated by the trans-Atlantic slave trade from endemic areas of Africa. The close association between southern African and Brazilian sequences also suggested that greater numbers of the southern African Bantu population might also have been part of the slave trade than previously thought.
Ozone and hypertensive disorders of pregnancy in Florida: Identifying critical windows of exposure. INTRODUCTION: Ozone (O3) has been linked to hypertensive disorders of pregnancy (HDP). However, inconsistent results have been reported, and no study has examined the critical exposure windows during pregnancy.MATERIALS AND METHODS: We used Florida birth vital statistics records to investigate the association between HDP and O3 exposure among 655,529 pregnancies with conception dates between 2005 and 2007. Individual O3 exposure was assessed at mothers’ home address at the time of delivery using the Hierarchical Bayesian space-time statistical model. We examined the association during three predefined exposure windows including trimester 1, trimester 2, and trimesters 1&2, as well as in each week of the first two trimesters using distributed lag models.RESULTS: Pregnancies with HDP had a higher mean exposure to O3 (39.07 in trimester 1, 39.02 in trimester 2, and 39.06 in trimesters 1&2, unit: ppb) than those without HDP (38.65 in trimester 1, 38.57 in trimester 2, and 38.61 in trimesters 1&2, unit: ppb). In the adjusted logistic regression model, increased odds of HDP were observed for each 5 ppb increase in O3 (ORTrimester1=1.04, 95% CI: 1.03, 1.06; ORTrimester2=1.03, 95% CI: 1.02, 1.04; ORTrimester1&2=1.07, 95% CI: 1.05, 1.08). In the distributed lag models, elevated odds of HDP were observed with increased O3 exposure during the 1st to 24th weeks of gestation, with higher odds during early pregnancy.CONCLUSIONS: O3 exposure during pregnancy is related to increased odds of HDP, and early pregnancy appears to be a potentially critical window of exposure.
Bayesian analysis of high-resolution ultrasonography and guided fine needle aspiration cytology in diagnosis of palpable thyroid nodules. INTRODUCTION: To evaluate diagnostic accuracy of high-resolution ultrasonography in differentiation of benign and malignant thyroid nodules in comparison to results of guided fine needle aspiration cytology based on the Bayes rule.OBJECTIVE: To assess the validity of ultrasonography results of thyroid nodules in comparison to guided fine needle aspiration cytology findings.METHODS: This study was done on randomly chosen 80 patients presented with palpable thyroid nodules, undergone real-time sonographic evaluation of thyroid nodules to characterize features, internal consistency, margins, echotexture, calcification, peripheral lucent halo and vascularity. Ultrasonography guided fine needle aspiration cytology studies of thyroid nodules were done.RESULTS: Palpable thyroid nodules were highly prevalent in fourth and fifth decades of life with female-male ratio, 4:1. Solid internal consistency was demonstrated by 75% malignant nodules. Hypoechogenicity and intra-nodular micro-calcifications were observed in 92% malignant nodules; 83% malignant nodules had intra-nodular vascularity and absence of peripheral halo. The pre-test prevalence of malignant nodules in the targeted population was 17.5%. As type I error, 2.5% false-positive cases and as type II error, 5.0% false-negative cases were detected. Values of sensitivity and specificity of the ultrasonography test were 71.43 and 96.97%, respectively.CONCLUSION: Malignant thyroid nodules demonstrated ultrasonography characteristics of hypoechoic texture, intra-nodular micro-calcifications, solid consistency, internal vascularity and absence of peripheral halo. The ultrasonography test has 92.5% diagnostic accuracy to differentiate malignant from benign lesions in comparison to the gold standard fine needle aspiration cytology test.
Exploration and exploitation of Victorian science in Darwin’s reading notebooks. Search in an environment with an uncertain distribution of resources involves a trade-off between exploitation of past discoveries and further exploration. This extends to information foraging, where a knowledge-seeker shifts between reading in depth and studying new domains. To study this decision-making process, we examine the reading choices made by one of the most celebrated scientists of the modern era: Charles Darwin. From the full-text of books listed in his chronologically-organized reading journals, we generate topic models to quantify his local (text-to-text) and global (text-to-past) reading decisions using Kullback-Liebler Divergence, a cognitively-validated, information-theoretic measure of relative surprise. Rather than a pattern of surprise-minimization, corresponding to a pure exploitation strategy, Darwin’s behavior shifts from early exploitation to later exploration, seeking unusually high levels of cognitive surprise relative to previous eras. These shifts, detected by an unsupervised Bayesian model, correlate with major intellectual epochs of his career as identified both by qualitative scholarship and Darwin’s own self-commentary. Our methods allow us to compare his consumption of texts with their publication order. We find Darwin’s consumption more exploratory than the culture’s production, suggesting that underneath gradual societal changes are the explorations of individual synthesis and discovery. Our quantitative methods advance the study of cognitive search through a framework for testing interactions between individual and collective behavior and between short- and long-term consumption choices. This novel application of topic modeling to characterize individual reading complements widespread studies of collective scientific behavior.
A multiscale Bayesian data integration approach for mapping air dose rates around the Fukushima Daiichi Nuclear Power Plant. This paper presents a multiscale data integration method to estimate the spatial distribution of air dose rates in the regional scale around the Fukushima Daiichi Nuclear Power Plant. We integrate various types of datasets, such as ground-based walk and car surveys, and airborne surveys, all of which have different scales, resolutions, spatial coverage, and accuracy. This method is based on geostatistics to represent spatial heterogeneous structures, and also on Bayesian hierarchical models to integrate multiscale, multi-type datasets in a consistent manner. The Bayesian method allows us to quantify the uncertainty in the estimates, and to provide the confidence intervals that are critical for robust decision-making. Although this approach is primarily data-driven, it has great flexibility to include mechanistic models for representing radiation transport or other complex correlations. We demonstrate our approach using three types of datasets collected at the same time over Fukushima City in Japan: (1) coarse-resolution airborne surveys covering the entire area, (2) car surveys along major roads, and (3) walk surveys in multiple neighborhoods. Results show that the method can successfully integrate three types of datasets and create an integrated map (including the confidence intervals) of air dose rates over the domain in high resolution. Moreover, this study provides us with various insights into the characteristics of each dataset, as well as radiocaesium distribution. In particular, the urban areas show high heterogeneity in the contaminant distribution due to human activities as well as large discrepancy among different surveys due to such heterogeneity.

Below, we tidy up the text data for LDA and pour it into a Corpus to make it easier to work with. The literature search and the preprocessing really ought to be done far more carefully than this, but we will skip that today.

#preprocess via a Corpus
pubmed_data$Abstract <- as.character(pubmed_data$Abstract)
docs <- Corpus(VectorSource(pubmed_data$Abstract))

docs <-tm_map(docs,content_transformer(tolower))
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, "•")
docs <- tm_map(docs, toSpace, "”")

docs <- tm_map(docs, removePunctuation)
#Strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove whitespace
docs <- tm_map(docs, stripWhitespace)
#Stem document
docs <- tm_map(docs,stemDocument)
#define and eliminate all custom stopwords
stops<-c("can", "say","one","way","use",
  "also","howev","tell","will",
  "much","need","take","tend","even",
  "like","particular","rather","said",
  "get","well","make","ask","come","end",
  "first","two","help","often","may",
  "might","see","someth","thing","point",
  "post","look","right","now","think","‘ve ",
  "‘re ","anoth","put","set","new","good",
  "want","sure","kind","larg","yes,","day","etc",
  "quit","sinc","attempt","lack","seen","awar",
  "littl","ever","moreov","though","found","abl",
  "enough","far","earli","away","achiev","draw",
  "last","never","brief","bit","entir","brief",
  "great","lot", "background", "aim", "method", "result", "conclusion",
  "show")

myStopwords <- stops

docs <- tm_map(docs, removeWords, myStopwords)
#inspect a document as a check (see below)
#delete empty documents
dtm <- DocumentTermMatrix(docs)
rowTotals <- apply(dtm , 1, sum) #Find the sum of words in each document
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
docs.new <- docs[setdiff(seq_along(docs), as.numeric(empty.rows))] #safe even when no document is empty
dtm.new <- dtm[rowTotals > 0, ] #document-term matrix without the empty rows; the Stan code below uses this
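
A quick way to do that check, as a sketch; tm's inspect() works on slices of a document-term matrix:

dtm.new #prints the dimensions and sparsity of the cleaned matrix
inspect(dtm.new[1:2, 1:5]) #peek at the first two documents and five terms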

LDA in Stan

Now that the data are in shape, let's run LDA. There are two sets of parameters we want from LDA. One is the topic distribution (here, theta): which topics each document is likely to belong to. The other is the word distribution (here, phi): which words are likely to occur under each topic. Written in Stan, LDA looks like the following. The code follows this blog closely, or to be honest, is pretty much copied from it. The Ahiru book also has a wonderfully clear explanation.

data {
  int<lower=2> K;                    // num topics
  int<lower=2> V;                    // num words
  int<lower=1> M;                    // num docs
  int<lower=1> N;                    // total word instances
  int<lower=1,upper=V> W[N];         // word n
  int<lower=1> Freq[N];              // frequency of word n
  int<lower=1,upper=N> Offset[M,2];  // range of word index per doc
  vector<lower=0>[K] Alpha;          // topic prior
  vector<lower=0>[V] Beta;           // word prior
}
parameters {
  simplex[K] theta[M];   // topic dist for doc m
  simplex[V] phi[K];     // word dist for topic k
}
model {
  // prior
  for (m in 1:M)
    theta[m] ~ dirichlet(Alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(Beta);

  // likelihood (unique terms weighted by their frequencies)
  for (m in 1:M) {
    for (n in Offset[m,1]:Offset[m,2]) {
      real gamma[K];
      for (k in 1:K)
        gamma[k] = log(theta[m,k]) + log(phi[k,W[n]]);
      target += Freq[n] * log_sum_exp(gamma);
    }
  }
}

Incidentally, I tried several numbers of topics this time, but nothing separated cleanly except 2 topics. Methods for estimating the number of topics, such as HDP, are something I'd like to tackle if I find the time. For now, save the code above as ldatest.stan, kick it off with the code below, and pant in anticipation (´д`) while it runs.

K=2 # number of topics
V=dtm.new$ncol # vocabulary size
M=dtm.new$nrow # number of papers
N=length(dtm.new$j) # total number of nonzero (document, word) entries
W<-dtm.new$j # word index of each entry
Freq<-dtm.new$v # frequency of each entry
offset <- t(sapply(1:M, function(m){ range(which(dtm.new$i==m)) })) # range of entry indices per document (the document indices are in $i, not $j)
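
One assumption worth making explicit: range() only yields a valid per-document span if the triplet entries of dtm.new are grouped contiguously by document. A quick assertion (my addition, not in the original post) to catch violations:

#each document's span width must equal its number of nonzero terms
stopifnot(all(offset[,2] - offset[,1] + 1 == as.vector(table(dtm.new$i))))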

data <- list(
  K=K,
  M=M,
  V=V,
  N=N,
  W=W,
  Freq=Freq,
  Offset=offset,
  Alpha=rep(1, K), # prior for the topic distribution
  Beta=rep(0.5, V) # prior for the word distribution
)

library(rstan)
stanmodel <- stan_model(file="ldatest.stan")
#fit_nuts <- sampling(stanmodel, data=data, seed=123)
fit_vb <- vb(stanmodel, data=data, seed=123)
## 
## This is Automatic Differentiation Variational Inference.
## 
## (EXPERIMENTAL ALGORITHM: expect frequent updates to the procedure.)
## 
## Gradient evaluation took 0.034384 seconds
## 1000 iterations under these settings should take 34.384 seconds.
## Adjust your expectations accordingly!
## 
## Begin eta adaptation.
## Iteration:   1 / 250 [  0%]  (Adaptation)
## Iteration:  50 / 250 [ 20%]  (Adaptation)
## Iteration: 100 / 250 [ 40%]  (Adaptation)
## Iteration: 150 / 250 [ 60%]  (Adaptation)
## Iteration: 200 / 250 [ 80%]  (Adaptation)
## Success! Found best value [eta = 1] earlier than expected.
## 
## Begin stochastic gradient ascent.
##   iter       ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
##    100     -4e+05             1.000            1.000
##    200     -4e+05             0.501            1.000
##    300     -4e+05             0.334            0.002   MEDIAN ELBO CONVERGED
## 
## Drawing a sample of size 1000 from the approximate posterior... 
## COMPLETED.

So, below we look at the results with 2 topics specified. The plot shows each topic's word distribution, with the words' occurrence probabilities on the horizontal axis and the words themselves on the vertical axis.

library(ggplot2)

ms <- rstan::extract(fit_vb)

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
idx <- expand.grid(1:K, 1:V)

d_qua <- t(apply(idx, 1, function(x) quantile(ms$phi[,x[1],x[2]], probs=probs)))
d_qua <- data.frame(idx, d_qua)
colnames(d_qua) <- c('topic', 'word', paste0('p', probs*100))

p <- ggplot(data=d_qua, aes(x=word, y=p50))
p <- p + theme_bw(base_size=18)
p <- p + facet_wrap(~topic, ncol=3)
p <- p + coord_flip()
p <- p + scale_x_reverse(breaks=c(1, seq(20, 1925, 60)))
p <- p + geom_bar(stat='identity')
p <- p + labs(x='word', y='phi[k,y]')
p

This time, each paper is plotted on the horizontal axis, with its topic probability (the posterior median, p50, is plotted as the point estimate) on the vertical axis. Viewed this way, you can see different papers responding to each of the two topics.

ms <- rstan::extract(fit_vb)

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
idx <- expand.grid(1:M, 1:K)
d_qua=NULL

d_qua <- t(apply(idx, 1, function(x) quantile(ms$theta[,x[1],x[2]],probs=probs)))
d_qua <- data.frame(idx, d_qua)

colnames(d_qua) <- c('article', 'topic', paste0('p', probs*100))

best.topic=NULL
for(i in 1:M){ # loop over all M (here 48) papers rather than hardcoding 48
aaa<-subset(d_qua,article==i)
best.topic<-c(best.topic,max.col(t(aaa$p50)))
}
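
As an aside, the loop above can be vectorized: since idx was built with expand.grid(1:M, 1:K), the article index varies fastest, so d_qua$p50 fills an M-by-K matrix column by column (a sketch giving the same result):

p50.mat <- matrix(d_qua$p50, nrow=M, ncol=K) # rows = articles, columns = topics
best.topic <- max.col(p50.mat) # index of the higher-probability topic per article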


p <- ggplot(d_qua, aes(x=article, y=p50,col=as.factor(topic)))
p <- p +geom_point()
p <- p + facet_wrap(~as.factor(topic))
p +theme_bw()

Assign each paper to the topic it belongs to with higher probability, extract a Corpus for each topic group from docs.new, and then look at the frequent words in each group.

A<-which(best.topic==1)
B<-which(best.topic==2)

dtm.A<-DocumentTermMatrix(docs.new[A])
dtm.B<-DocumentTermMatrix(docs.new[B])

freqA <- colSums(as.matrix(dtm.A))
ordA <- order(freqA,decreasing=TRUE)
freq20A<-data.frame(head(freqA[ordA],50))
freq20A$wordA<-rownames(freq20A)
names(freq20A)<-c("freq","word")

freqB <- colSums(as.matrix(dtm.B))
ordB <- order(freqB,decreasing=TRUE)
freq20B<-data.frame(head(freqB[ordB],50))
freq20B$wordB<-rownames(freq20B)
names(freq20B)<-c("freq","word")
freqs<-data.frame(freq20A$word,freq20B$word,rank=1:50)

kable(freqs)
freq20A.word freq20B.word rank
model model 1
data studi 2
bayesian analysi 3
studi data 4
base bayesian 5
diseas estim 6
speci test 7
approach base 8
effect sampl 9
interv approach 10
popul identifi 11
compar perform 12
observ detect 13
provid infect 14
relat pcr 15
time respons 16
comput genet 17
confid high 18
differ leishmania 19
estim milk 20
spatial predict 21
test specif 22
anophel dynam 23
classifi effect 24
distribut individu 25
gene infer 26
genotyp patient 27
includ provid 28
infer speci 29
inform clinic 30
network composit 31
nodul diseas 32
present investig 33
simul sequenc 34
vaccin structur 35
among treatment 36
associ design 37
dataset develop 38
function differ 39
identifi host 40
perform indian 41
sampl ion 42
tree phylogenet 43
trimest suggest 44
assess within 45
predict evid 46
preval molecular 47
process neuron 48
type relationship 49
within select 50

All that's left is to make a wordcloud for each group, and today's mission is complete. Since the top-ranked words barely differ between the groups, the two clouds look similar and the result is not that exciting this time. Looking closely, though, the right-hand figure has clinically flavored terms scattered through it, such as infection, patient, response, and clinical, so that group seems to collect the more medical studies (roughly speaking).

library(wordcloud)
library(dplyr)
par(mfrow=c(1,2))
wordcloud::wordcloud(docs.new[A], scale = c(4.5, 0.4), max.words = 200, 
          min.freq = 1, random.color=T, random.order = FALSE, rot.per = 0.00, 
          use.r.layout = F, colors = brewer.pal(3, "Paired"),
          family="Helvetica")

wordcloud::wordcloud(docs.new[B], scale = c(4.5, 0.4), max.words = 200, 
          min.freq = 5, random.color=T, random.order = FALSE, rot.per = 0.00, 
          use.r.layout = F, colors = brewer.pal(3, "Paired"),
          family="Helvetica")

Closing thoughts

LDA in Stan: the model itself is simple, but the computation gets quite heavy, so the question is how to speed it up. Handy R packages for LDA (topicmodels and friends) keep appearing, so getting comfortable with topic models through those first is a solid option. I would also like to take on HDP, with gensim and the like.
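
As a minimal sketch of that route, assuming the dtm.new built above: topicmodels::LDA() fits the model via variational EM, and perplexity() offers a rough way to compare candidate numbers of topics (the candidate values here are arbitrary):

library(topicmodels)
ks <- c(2, 3, 4, 5) # candidate numbers of topics (arbitrary)
fits <- lapply(ks, function(k) LDA(dtm.new, k = k, control = list(seed = 123)))
data.frame(K = ks, perplexity = sapply(fits, perplexity)) # lower is better
terms(fits[[1]], 10) # top 10 terms per topic for K = 2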

Enjoy!!