
LDA in Stan
library(RWordPress)
library(XMLRPC)
library(RISmed)
library(tm)
library(wordcloud2)
library(RColorBrewer)
library(topicmodels)

wordcloud

This post is the day-14 entry in the Stan Advent Calendar. Today's themes are wordclouds and LDA. As the figure below shows, a wordcloud picks out the words that appear most frequently in a text and draws each word at a size proportional to its frequency, so word frequencies can be grasped at a glance. Above all, it looks cool. In R, once you have wrangled your text data into a Corpus or data frame (this part is a massive pain), packages such as wordcloud and wordcloud2 make building a wordcloud easy. wordcloud2 in particular produces slick results like the ones below, and it also offers letterCloud, which lays the words out in the shape of a letter or word.

letterCloud(subset(term.f,freq>=3), word = "Stan", 
            fontFamily = "Helvetica",
            wordSize=2, backgroundColor="white")
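
Incidentally, term.f above is just a word-frequency data frame, which this post never builds explicitly. A minimal sketch of how it could be built, assuming a document-term matrix like the dtm created further below:

freq <- colSums(as.matrix(dtm)) # total count of each word across documents
term.f <- data.frame(word = names(freq), freq = freq) # word/freq columns, as wordcloud2 expects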

Dumping paper data from PubMed into RStudio

The wordcloud above plots word frequencies from the abstracts of 48 English-language papers that came up on PubMed for the search term "bayesian". Using RISmed as shown below, you can pull paper records from PubMed with minimal effort.

res <- EUtilsSummary("bayesian", #search term
                     type = "esearch", 
                     db = "pubmed", 
                     datetype = "pdat", 
                     mindate = 2014, #first publication year to search
                     maxdate = 2017, #last publication year to search
                     retmax = 50 #maximum number of records to return
                     )

EUtilsSummary() builds the search query object: you specify the search term and the first and last publication years. EUtilsGet() then downloads and stores the matching records. From those, ArticleTitle() pulls out the titles and AbstractText() pulls out the abstract text.

  records = EUtilsGet(res) #download the records
  pubmed_data <- data.frame(Title = ArticleTitle(records), 
                            Abstract = AbstractText(records))

With that, the abstract text of each paper has been extracted. The goal of today's analysis is to apply LDA, the flagship topic model, to each paper's abstract, assign every paper to a latent topic, and then build a wordcloud for each topic group.

  #check the data
  library(knitr)
  kable(head(pubmed_data,10))
Title Abstract
Molecular characterization and population structure study of cambuci: strategy for conservation and genetic improvement. Cambuci (Campomanesia phaea) belongs to the Myrtaceae family and is native to the Atlantic Forest of Brazil. It has ecological and social appeal but is exposed to problems associated with environmental degradation and expansion of agricultural activities in the region. Comprehensive studies on this species are rare, making its conservation and genetic improvement difficult. Thus, it is important to develop research activities to understand the current situation of the species as well as to make recommendations for its conservation and use. This study was performed to characterize the cambuci accessions found in the germplasm bank of Coordenadoria de Assistência Técnica Integral using inter-simple sequence repeat markers, with the goal of understanding the plant’s population structure. The results showed the existence of some level of genetic diversity among the cambuci accessions that could be exploited for the genetic improvement of the species. Principal coordinate analysis and discriminant analysis clustered the 80 accessions into three groups, whereas Bayesian model-based clustering analysis clustered them into two groups. The formation of two cluster groups and the high membership coefficients within the groups pointed out the importance of further collection to cover more areas and more genetic variability within the species. The study also showed the lack of conservation activities; therefore, more attention from the appropriate organizations is needed to plan and implement natural and ex situ conservation activities.
Dose-Finding Study of Omeprazole on Gastric pH in Neonates with Gastro-Esophageal Acid Reflux Using a Bayesian Sequential Approach. OBJECTIVE: Proton pump inhibitors are frequently administered on clinical symptoms in neonates but benefit remains controversial. Clinical trials validating omeprazole dosage in neonates are limited. The objective of this trial was to determine the minimum effective dose (MED) of omeprazole to treat pathological acid reflux in neonates using reflux index as surrogate marker.DESIGN: Double blind dose-finding trial with continual reassessment method of individual dose administration using a Bayesian approach, aiming to select drug dose as close as possible to the predefined target level of efficacy (with a credibility interval of 95%).SETTING: Neonatal Intensive Care unit of the Robert Debré University Hospital in Paris, France.PATIENTS: Neonates with a postmenstrual age ≥ 35 weeks and a pathologic 24-hour intra-esophageal pH monitoring defined by a reflux index ≥ 5% over 24 hours were considered for participation. Recruitment was stratified to 3 groups according to gestational age at birth.INTERVENTION: Five preselected doses of oral omeprazole from 1 to 3 mg/kg/day.MAIN OUTCOME MEASURES: Primary outcome, measured at 35 weeks postmenstrual age or more, was a reflux index <5% during the 24-h pH monitoring registered 72±24 hours after omeprazole initiation.RESULTS: Fifty-four neonates with a reflux index ranging from 5.06 to 27.7% were included. Median age was 37.5 days and median postmenstrual age was 36 weeks. In neonates born at less than 32 weeks of GA (n = 30), the MED was 2.5mg/kg/day with an estimated mean posterior probability of success of 97.7% (95% credibility interval: 90.3-99.7%). The MED was 1mg/kg/day for neonates born at more than 32 GA (n = 24).CONCLUSIONS: Omeprazole is extensively prescribed on clinical symptoms but efficacy is not demonstrated while safety concerns do exist. When treatment is required, the daily dose needs to be validated in preterm and term neonates. Optimal doses of omeprazole to increase gastric pH and decrease reflux index below 5% over 24 hours, determined using an adaptive Bayesian design differ among neonates. Both gestational and postnatal ages account for these differences but their differential impact on omeprazole doses remains to be determined.
Phylodynamic and Phylogeographic Profiles of Subtype B HIV-1 Epidemics in South Spain. BACKGROUND: Since 1982, HIV-1 epidemics have evolved to different scenarios in terms of transmission routes, subtype distribution and characteristics of transmission clusters. We investigated the evolutionary history of HIV-1 subtype B in south Spain.PATIENTS & METHODS: We studied all newly diagnosed HIV-1 subtype B patients in East Andalusia during the 2005-2012 period. For the analysis, we used the reverse transcriptase and protease sequences from baseline resistance, and the Trugene® HIV Genotyping kit (Siemens, Barcelona, Spain). Subtyping was done with REGA v3.0. The maximum likelihood trees constructed with RAxML were used to study HIV-1 clustering. Phylogeographic and phylodynamic profiles were studied by Bayesian inference methods with BEAST v1.7.5 and SPREAD v1.0.6.RESULTS: Of the 493 patients infected with HIV-1 subtype B, 234 grouped into 55 clusters, most of which were small (44 clusters ≤ 5 patients, 31 with 2 patients, 13 with 3). The rest (133/234) were grouped into 11 clusters with ≥ 5 patients, and most (82%, 109/133) were men who have sex with men (MSM) grouped into 8 clusters. The association with clusters was more frequent in Spanish (p = 0.02) men (p< 0.001), MSM (p<0.001) younger than 35 years (p = 0.001) and with a CD4+ T-cell count above 350 cells/ul (p<0.001). We estimated the date of HIV-1 subtype B regional epidemic diversification around 1970 (95% CI: 1965-1987), with an evolutionary rate of 2.4 (95%CI: 1.7-3.1) x 10-3 substitutions/site/year. Most clusters originated in the 1990s in MSMs. We observed exponential subtype B HIV-1 growth in 1980-1990 and 2005-2008. The most significant migration routes for subtype B went from inland cities to seaside locations.CONCLUSIONS: We provide the first data on the phylodynamic and phylogeographic profiles of HIV-1 subtype B in south Spain. Our findings of transmission clustering among MSMs should alert healthcare managers to enhance preventive measures in this risk group in order to prevent future outbreaks.
A Bayesian network meta-analysis of three different surgical procedures for the treatment of humeral shaft fractures. BACKGROUND: The optimal surgical procedure for humeral shaft fractures remains a matter of debate. We aimed to establish the optimum procedure by performing a Bayesian network meta-analysis.METHODS: PubMed, EMBASE, the Cochrane Library, and Medline were searched for both randomized controlled trials and prospective studies of surgical treatment for humeral shaft fractures. The quality of the included studies was assessed according to the Cochrane Collaboration’s “Risk of bias”.RESULTS: Seventeen RCTs or prospective studies were included in the meta-analysis. The pooled results showed that the occurrence rate of radial nerve injury was lowest for minimally invasive plate osteosynthesis (MIPO; SUCRA probability, 95.1%), followed by open reduction and plate osteosynthesis (ORPO; SUCRA probability, 29.5%), and was highest for intramedullary nailing (IMN; SUCRA probability, 25.4%). The aggregated results of pairwise meta-analysis showed no significant difference in radial nerve injury rate when comparing ORPO versus IMN (OR, 1.92; 95% CI, 0.96 to 3.86), ORPO versus MIPO (OR, 3.38; 95% CI, 0.80 to 14.31), or IMN versus MIPO (OR, 3.19; 95% CI, 0.48 to 21.28). Regarding the nonunion, SUCRA probabilities were 90.5%, 40.2%, and 19.3% for MIPO, ORPO, and IMN, respectively. The aggregated results of a pairwise meta-analysis also showed no significant difference for ORPO versus IMN (OR, 0.83; 95% CI, 0.41 to 1.69), ORPO versus MIPO (OR, 2.42; 95% CI, 0.45 to 12.95), or IMN versus MIPO (OR, 2.49; 95% CI, 0.35 to 17.64).CONCLUSION: The current evidence indicates that MIPO is the optimum choice in the treatment of humeral shaft fractures and that ORPO is superior to IMN.
Bayesian Isotonic Regression Dose-response (BIRD) Model. Understanding dose-response relationship is a crucial step in drug development. There are a few parametric methods to estimate dose-response curves, such as the Emax model and the logistic model. These parametric models are easy to interpret and, hence, widely used. However, these models often require the inclusion of patients on high-dose levels; otherwise, the model parameters cannot be reliably estimated. To have robust estimation, nonparametric models are used. However, these models are not able to estimate certain important clinical parameters, such as ED50 and Emax. Furthermore, in many therapeutic areas, dose-response curves can be assumed as non-decreasing functions. This creates an additional challenge for nonparametric methods. In this paper, we propose a new Bayesian isotonic regression dose-response model which features advantages from both parametric and nonparametric models. The ED50 and Emax can be derived from this model. Simulations are provided to evaluate the Bayesian isotonic regression dose-response model performance against two parametric models. We apply this model to a data set from a diabetes dose-finding study.
The Generalized Quantum Episodic Memory Model. Recent evidence suggests that experienced events are often mapped to too many episodic states, including those that are logically or experimentally incompatible with one another. For example, episodic over-distribution patterns show that the probability of accepting an item under different mutually exclusive conditions violates the disjunction rule. A related example, called subadditivity, occurs when the probability of accepting an item under mutually exclusive and exhaustive instruction conditions sums to a number >1. Both the over-distribution effect and subadditivity have been widely observed in item and source-memory paradigms. These phenomena are difficult to explain using standard memory frameworks, such as signal-detection theory. A dual-trace model called the over-distribution (OD) model (Brainerd & Reyna, 2008) can explain the episodic over-distribution effect, but not subadditivity. Our goal is to develop a model that can explain both effects. In this paper, we propose the Generalized Quantum Episodic Memory (GQEM) model, which extends the Quantum Episodic Memory (QEM) model developed by Brainerd, Wang, and Reyna (2013). We test GQEM by comparing it to the OD model using data from a novel item-memory experiment and a previously published source-memory experiment (Kellen, Singmann, & Klauer, 2014) examining the over-distribution effect. Using the best-fit parameters from the over-distribution experiments, we conclude by showing that the GQEM model can also account for subadditivity. Overall these results add to a growing body of evidence suggesting that quantum probability theory is a valuable tool in modeling recognition memory.
Learning to Be (In)variant: Combining Prior Knowledge and Experience to Infer Orientation Invariance in Object Recognition. How does the visual system recognize images of a novel object after a single observation despite possible variations in the viewpoint of that object relative to the observer? One possibility is comparing the image with a prototype for invariance over a relevant transformation set (e.g., translations and dilations). However, invariance over rotations (i.e., orientation invariance) has proven difficult to analyze, because it applies to some objects but not others. We propose that the invariant transformations of an object are learned by incorporating prior expectations with real-world evidence. We test this proposal by developing an ideal learner model for learning invariance that predicts better learning of orientation dependence when prior expectations about orientation are weak. This prediction was supported in two behavioral experiments, where participants learned the orientation dependence of novel images using feedback from solving arithmetic problems.
Use of spatiotemporal templates for pathway discrimination in peripheral nerve recordings: a simulation study. OBJECTIVE: Extraction of information from the peripheral nervous system can provide control signals in neuroprosthetic applications. However, the ability to selectively record from different pathways within peripheral nerves is limited. We investigated the integration of spatial and temporal information for pathway discrimination in peripheral nerves using measurements from a multi-contact nerve cuff electrode.APPROACH: Spatiotemporal templates were established for different neural pathways of interest, and used to obtain tailored matched filters for each of these pathways. Simulated measurements of compound action potentials propagating through the nerve in different test cases were used to evaluate classification accuracy, percentage of missed spikes, and ability to reconstruct the original firing rates of the neural pathways.MAIN RESULTS: The mean Pearson correlation coefficients between the original firing rates and estimated firing rates over all tests cases was found to be 0.832  ±  0.161, 0.421  ±  0.145, 0.481  ±  0.340 for our algorithm, Bayesian spatial filters, and velocity selective recordings respectively.SIGNIFICANCE: The proposed method shows that the spatiotemporal templates were able to provide more robust spike detection and reliable pathway discrimination than these existing algorithms.
Inferring Alcoholism SNPs and Regulatory Chemical Compounds Based on Ensemble Bayesian Network. The disturbance of consciousness is one of the most common symptoms of those have alcoholism and may cause disability and mortality. Previous studies indicated that several single nucleotide polymorphisms (SNP) increase the susceptibility of alcoholism. In this study, we utilized the Ensemble Bayesian Network (EBN) method to identify causal SNPs of alcoholism based on the verified GAW14 data. Thirteen out of eighteen SNPs directly connected with alcoholism were found concordance with potential risk regions of alcoholism in OMIM database. As a number of SNPs were found contributing to alteration on gene expression, known as expression quantitative trait loci (eQTLs), we further sought to identify chemical compounds acting as regulators of alcoholism genes captured by causal SNPs. Chloroprene and valproic acid were identified as the expression regulators for genes C11orf66 and SALL3 which were captured by alcoholism SNPs, respectively.
Poisson Mixture Regression Models for Heart Disease Prediction. Early heart disease control can be achieved by high disease prediction and diagnosis efficiency. This paper focuses on the use of model based clustering techniques to predict and diagnose heart disease via Poisson mixture regression models. Analysis and application of Poisson mixture regression models is here addressed under two different classes: standard and concomitant variable mixture regression models. Results show that a two-component concomitant variable Poisson mixture regression model predicts heart disease better than both the standard Poisson mixture regression model and the ordinary general linear Poisson regression model due to its low Bayesian Information Criteria value. Furthermore, a Zero Inflated Poisson Mixture Regression model turned out to be the best model for heart prediction over all models as it both clusters individuals into high or low risk category and predicts rate to heart disease componentwise given clusters available. It is deduced that heart disease prediction can be effectively done by identifying the major risks componentwise using Poisson mixture regression model.

Below, the text data are cleaned up for LDA and funneled into a Corpus to make them easier to handle. The paper search and the preprocessing deserve far, far more care than this, but we will skip that today.

#preprocess via Corpus
pubmed_data$Abstract <- as.character(pubmed_data$Abstract)
docs <- Corpus(VectorSource(pubmed_data$Abstract))

docs <-tm_map(docs,content_transformer(tolower))
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, "•")
docs <- tm_map(docs, toSpace, "”")

docs <- tm_map(docs, removePunctuation)
#Strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove whitespace
docs <- tm_map(docs, stripWhitespace)
#Stem document
docs <- tm_map(docs,stemDocument)
#define and eliminate all custom stopwords
stops<-c("can", "say","one","way","use",
  "also","howev","tell","will",
  "much","need","take","tend","even",
  "like","particular","rather","said",
  "get","well","make","ask","come","end",
  "first","two","help","often","may",
  "might","see","someth","thing","point",
  "post","look","right","now","think","‘ve ",
  "‘re ","anoth","put","set","new","good",
  "want","sure","kind","larg","yes,","day","etc",
  "quit","sinc","attempt","lack","seen","awar",
  "littl","ever","moreov","though","found","abl",
  "enough","far","earli","away","achiev","draw",
  "last","never","brief","bit","entir","brief",
  "great","lot", "background", "aim", "method", "result", "conclusion",
  "show")

myStopwords <- stops

docs <- tm_map(docs, removeWords, myStopwords)
#inspect a document as a check
#delete empty documents
dtm <- DocumentTermMatrix(docs)
rowTotals <- apply(dtm, 1, sum) #find the number of words in each document
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
docs.new <- docs[-as.numeric(empty.rows)]
dtm.new <- DocumentTermMatrix(docs.new) #DTM without empty documents (used for Stan below)

LDA in Stan

Now that the data are in shape, let's run LDA. There are two sets of parameters we want from LDA. One is the topic distribution (theta here): how likely each document is to belong to each topic. The other is the word distribution (phi here): how likely each word is to occur under a given topic. Written in Stan, LDA looks like the following. The code is adapted from, or frankly copied wholesale from, this blog post, and the Ahiru book also has a wonderfully clear explanation.

data {
  int<lower=2> K;                    // num topics
  int<lower=2> V;                    // num words
  int<lower=1> M;                    // num docs
  int<lower=1> N;                    // total word instances
  int<lower=1,upper=V> W[N];         // word n
  int<lower=1> Freq[N];              // frequency of word n
  int<lower=1,upper=N> Offset[M,2];  // range of word index per doc
  vector<lower=0>[K] Alpha;          // topic prior
  vector<lower=0>[V] Beta;           // word prior
}
parameters {
  simplex[K] theta[M];   // topic dist for doc m
  simplex[V] phi[K];     // word dist for topic k
}
model {
  // prior
  for (m in 1:M)
    theta[m] ~ dirichlet(Alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(Beta);

  // likelihood
  for (m in 1:M) {
    for (n in Offset[m,1]:Offset[m,2]) {
      real gamma[K];
      for (k in 1:K)
        gamma[k] = log(theta[m,k]) + log(phi[k,W[n]]);
      target += Freq[n] * log_sum_exp(gamma);
    }
  }
}
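
To spell out what the likelihood block is doing: the latent topic assignment of each word token is marginalized out, so for word n in document m the per-token likelihood is

$$ p(W_n \mid \theta_m, \phi) = \sum_{k=1}^{K} \theta_{m,k}\,\phi_{k,W_n}, $$

and log_sum_exp(gamma) evaluates its logarithm in a numerically stable way, with the multiplication by Freq[n] accounting for repeated occurrences of the same word within a document.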

Incidentally, I tried several numbers of topics, but anything other than two topics failed to separate cleanly. Topic-number estimation methods such as HDP are something I'd like to tackle when I find the time (a rough perplexity-based alternative is sketched below). For now, save the code above as ldatest.stan, kick it off with the code that follows, and pant over the results (´д`)ハァハァ.
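As a lighter-weight heuristic than HDP for choosing the number of topics, you can fit LDA for several K with the topicmodels package (loaded at the top) and compare perplexity. A rough sketch, assuming the dtm.new built above (properly you would hold out documents rather than reuse the training DTM):

Ks <- 2:6
perp <- sapply(Ks, function(k){
  fit <- LDA(dtm.new, k = k, control = list(seed = 123)) # VEM fit from topicmodels
  perplexity(fit, dtm.new) # lower is better
})
plot(Ks, perp, type = "b", xlab = "number of topics", ylab = "perplexity")

Now, back to the Stan run: set up the data and kick off the sampler.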

K=2 # number of topics
V=dtm.new$ncol # number of words (vocabulary size)
M=dtm.new$nrow # number of papers
N=length(dtm.new$j) # number of (document, word) entries in the DTM
W<-dtm.new$j # word index of each entry
Freq<-dtm.new$v # count of each entry
offset <- t(sapply(1:M, function(m){ range(which(dtm.new$i==m)) })) # range of entries for each document (dtm.new$i holds document indices)

data <- list(
  K=K,
  M=M,
  V=V,
  N=N,
  W=W,
  Freq=Freq,
  Offset=offset,
  Alpha=rep(1, K), # prior for the topic distributions
  Beta=rep(0.5, V) # prior for the word distributions
)

library(rstan)
stanmodel <- stan_model(file="ldatest.stan")
#fit_nuts <- sampling(stanmodel, data=data, seed=123)
fit_vb <- vb(stanmodel, data=data, seed=123)
## 
## This is Automatic Differentiation Variational Inference.
## 
## (EXPERIMENTAL ALGORITHM: expect frequent updates to the procedure.)
## 
## Gradient evaluation took 0.029386 seconds
## 1000 iterations under these settings should take 29.386 seconds.
## Adjust your expectations accordingly!
## 
## Begin eta adaptation.
## Iteration:   1 / 250 [  0%]  (Adaptation)
## Iteration:  50 / 250 [ 20%]  (Adaptation)
## Iteration: 100 / 250 [ 40%]  (Adaptation)
## Iteration: 150 / 250 [ 60%]  (Adaptation)
## Iteration: 200 / 250 [ 80%]  (Adaptation)
## Success! Found best value [eta = 1] earlier than expected.
## 
## Begin stochastic gradient ascent.
##   iter       ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
##    100     -7e+05             1.000            1.000
##    200     -7e+05             0.504            1.000
##    300     -7e+05             0.336            0.007   MEDIAN ELBO CONVERGED
## 
## Drawing a sample of size 1000 from the approximate posterior... 
## COMPLETED.

So below we look at the results with two topics specified. Each panel puts the word occurrence probability on the horizontal axis and the words on the vertical axis, showing how likely each word is to occur under each topic.

library(ggplot2)

ms <- rstan::extract(fit_vb)

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
idx <- expand.grid(1:K, 1:V)

d_qua <- t(apply(idx, 1, function(x) quantile(ms$phi[,x[1],x[2]], probs=probs)))
d_qua <- data.frame(idx, d_qua)
colnames(d_qua) <- c('topic', 'word', paste0('p', probs*100))

p <- ggplot(data=d_qua, aes(x=word, y=p50))
p <- p + theme_bw(base_size=18)
p <- p + facet_wrap(~topic, ncol=3)
p <- p + coord_flip()
p <- p + scale_x_reverse(breaks=c(1, seq(20, 1925, 60)))
p <- p + geom_bar(stat='identity')
p <- p + labs(x='word', y='phi[k,y]')
p

Next, each paper is placed on the horizontal axis and its topic probability (the posterior median, p50) is plotted on the vertical axis. Viewed this way, you can see that different papers respond to each of the two topics.

ms <- rstan::extract(fit_vb)

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
idx <- expand.grid(1:M, 1:K)
d_qua=NULL

d_qua <- t(apply(idx, 1, function(x) quantile(ms$theta[,x[1],x[2]],probs=probs)))
d_qua <- data.frame(idx, d_qua)

colnames(d_qua) <- c('article', 'topic', paste0('p', probs*100))

best.topic=NULL
for(i in 1:M){ # M = number of papers
aaa<-subset(d_qua,article==i)
best.topic<-c(best.topic,max.col(t(aaa$p50))) # topic with the higher median probability
}


p <- ggplot(d_qua, aes(x=article, y=p50,col=as.factor(topic)))
p <- p +geom_point()
p <- p + facet_wrap(~as.factor(topic))
p +theme_bw()

Each paper is assigned to whichever topic it belongs to with higher probability, and the Corpus for each topic group is pulled out of docs.new. Then we examine the frequent words in each group.

A<-which(best.topic==1)
B<-which(best.topic==2)

dtm.A<-DocumentTermMatrix(docs.new[A])
dtm.B<-DocumentTermMatrix(docs.new[B])

freqA <- colSums(as.matrix(dtm.A))
ordA <- order(freqA,decreasing=TRUE)
freq20A<-data.frame(head(freqA[ordA],50))
freq20A$wordA<-rownames(freq20A)
names(freq20A)<-c("freq","word")

freqB <- colSums(as.matrix(dtm.B))
ordB <- order(freqB,decreasing=TRUE)
freq20B<-data.frame(head(freqB[ordB],50))
freq20B$wordB<-rownames(freq20B)
names(freq20B)<-c("freq","word")
freqs<-data.frame(freq20A$word,freq20B$word,rank=1:50)

kable(freqs)
freq20A.word freq20B.word rank
model model 1
bayesian bayesian 2
speci studi 3
studi predict 4
group structur 5
differ base 6
analysi estim 7
evid high 8
treatment measur 9
predict speci 10
count differ 11
calibr dose 12
estim propos 13
fmt data 14
provid genet 15
gene morpholog 16
lineag analysi 17
data individu 18
patient time 19
sampl compar 20
signific paramet 21
within algorithm 22
approach cluster 23
compar distribut 24
exponenti effect 25
observ identifi 26
paramet level 27
propos present 28
reconstruct primari 29
accid signific 30
base tempor 31
concentr approach 32
cryptanthus network 33
fit test 34
network activ 35
split atom 36
uncertainti chang 37
cluster hyperparathyroid 38
determin includ 39
distribut provid 40
gpp regress 41
hiv relat 42
infer accuraci 43
inform exist 44
subtyp group 45
time indic 46
communiti inform 47
correl phylogenet 48
dwi rate 49
genus respons 50

All that is left is to build a wordcloud per group, and today's mission is complete. The top words hardly differ between the groups, so the two clouds look much alike and the result is admittedly not that exciting. Looking closely, though, the right-hand figure shows a scattering of clinical terms such as infection, patient, response, and clinical, so that group seems to gather the more medical studies (roughly speaking).

library(wordcloud)
library(dplyr)
par(mfrow=c(1,2))
wordcloud::wordcloud(docs.new[A], scale = c(4.5, 0.4), max.words = 200, 
          min.freq = 1, random.color=T, random.order = FALSE, rot.per = 0.00, 
          use.r.layout = F, colors = brewer.pal(3, "Paired"),
          family="Helvetica")

wordcloud::wordcloud(docs.new[B], scale = c(4.5, 0.4), max.words = 200, 
          min.freq = 5, random.color=T, random.order = FALSE, rot.per = 0.00, 
          use.r.layout = F, colors = brewer.pal(3, "Paired"),
          family="Helvetica")

Thoughts

LDA in Stan: the model itself is simple, but the computation gets quite heavy, so the question is how to speed it up. Convenient LDA tools for R (topicmodels and the like) keep appearing, so getting familiar with topic models through those first is a perfectly good option. I would also like to take on HDP with gensim and friends.
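
For reference, here is a minimal sketch of the same analysis done with topicmodels instead (collapsed Gibbs sampling), assuming the dtm.new built above. It runs much faster than the Stan model and is a handy way to get acquainted with topic models first:

lda_fit <- LDA(dtm.new, k = 2, method = "Gibbs",
               control = list(seed = 123, iter = 2000))
topics(lda_fit) # most likely topic for each paper (cf. best.topic above)
terms(lda_fit, 10) # top 10 terms per topic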

Enjoy!!
