Lecture 20: Mining literature and metadata

  • Update on where you are on the labs
  • Answer questions on Labs 4 and 5
  • Introduce Lab 6 and the concept of metadata

In the news:

An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage 
by Aaron M Newman, Scott V Bratman, Jacqueline To, Jacob F Wynne, Neville C W Eclov,
Leslie A Modlin, Chih Long Liu, Joel W Neal, Heather A Wakelee, Robert E Merritt, Joseph B Shrager,
Billy W Loo Jr, Ash A Alizadeh & Maximilian Diehn
Nature Medicine (2014) doi:10.1038/nm.3519

Meta-analysis refers to methods for summarizing evidence across multiple studies. Pooling studies in this way gives you the benefit of a larger dataset. The excellent 200+ page tutorial Performing Meta-Analysis in R (http://www.edii.uclm.es/~useR-2013/Tutorials/kovalchik/kovalchik_meta_tutorial.pdf) by Stephanie Kovalchik, author of RISmed, summarizes the different R methods that can be used for meta-analysis.

The RISmed package is designed for querying PubMed from R. The user manual can be found at: http://cran.r-project.org/web/packages/RISmed/RISmed.pdf

The RISmed package can also make more general E-Utilities requests to any of the databases at NCBI, not just PubMed. To find out more about E-Utilities, see the free book: Entrez Programming Utilities Help at http://www.ncbi.nlm.nih.gov/books/NBK25499/.

Mining PubMed and metadata from other NCBI databases

The RISmed library, loaded with library(RISmed), has the following main methods:
  • Build a query summary object for an NCBI database (EUtilsSummary). This doesn't retrieve the records themselves, but it lets you get the complete query string (QueryTranslation), get the IDs of the matching records (QueryId), and find out how many items in the database match the query (QueryCount).
  • If you just need the query string and don't want to count how many items match the query, use EUtilsQuery.
  • To actually fetch the records, use EUtilsGet. This function takes either a vector of IDs or an EUtilsSummary object. For PubMed queries, this function returns a Medline object whose slots correspond to fields in the PubMed database.

Example 1: How many PubMed entries have PTEN in them?
fit <- EUtilsSummary("PTEN", db = "pubmed")
QueryCount(fit)  # number of PubMed entries matching the query

Example 2: Get the actual information in the form of a Medline object
pQuery <- EUtilsGet(fit)

Example 3: Find out what slots (similar to object attributes) a Medline object has, so that we know what we can ask for.
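A minimal sketch of one way to do this, assuming RISmed is installed and NCBI is reachable. It repeats the query of Example 1 with retmax = 5 (an E-Utilities parameter passed through EUtilsSummary) so that only a handful of records are fetched. The variable names are illustrative.

```r
library(RISmed)

# Run the PTEN query, but ask for only 5 records to keep the fetch small.
fitSmall <- EUtilsSummary("PTEN", db = "pubmed", retmax = 5)
pSmall <- EUtilsGet(fitSmall)

# slotNames() lists the slots of an S4 object such as a Medline object.
medlineSlots <- slotNames(pSmall)
print(medlineSlots)
```

Many of these slots can be read with an accessor function of the same name, for example ArticleTitle(pSmall).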

Example 4: Find the titles of the first 10 articles (there were 8678 articles in the query from Example 1).
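One possible approach, sketched under the assumption that RISmed is installed and NCBI is reachable: limit the query to 10 records with retmax and read the titles with the ArticleTitle accessor. The names fit10 and pQuery10 are illustrative.

```r
library(RISmed)

# Ask for only the first 10 matching records.
fit10 <- EUtilsSummary("PTEN", db = "pubmed", retmax = 10)
pQuery10 <- EUtilsGet(fit10)

# ArticleTitle() returns a character vector with one title per fetched record.
titles <- ArticleTitle(pQuery10)
print(titles)
```

If the Medline object from Example 2 is already in memory, head(ArticleTitle(pQuery), 10) should give the same kind of result without another fetch.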

Example 5: Find the number of articles that have PTEN in the title or in the MeSH terms.
fit1 <- EUtilsSummary("PTEN[Title] OR PTEN[MeSH Terms]", db = "pubmed")
QueryCount(fit1)  # number of matching articles
ids1 <- QueryId(fit1)

Note: There were 3257 articles that met these criteria. By default only the first 1000 IDs are returned. You need to be careful about submitting too many queries at once.

Example 6: Find the IDs of the second 1000 articles that have PTEN in the title or in the MeSH terms.
fit2 <- EUtilsSummary("PTEN[Title] OR PTEN[MeSH Terms]", db = "pubmed", retstart=1000)
ids2 <- QueryId(fit2)

Note: By default, only 1000 entries are returned each time. The retstart parameter allows you to specify that you want the returned values to start at a different position. (Positions start at 0.) The example returns the PubMed IDs of the second 1000 items.
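The offset arithmetic can be checked without contacting NCBI. The sketch below uses a hypothetical helper, pageStarts, which is not part of RISmed, to list the retstart value needed for each request.

```r
# Hypothetical helper (not part of RISmed): compute the retstart offset of
# each request needed to cover `total` matches, `pageSize` records at a time.
pageStarts <- function(total, pageSize = 1000) {
  if (total <= 0) {
    return(numeric(0))
  }
  seq(from = 0, to = total - 1, by = pageSize)
}

# For the 3257 matches reported in Example 5, four requests are needed.
starts <- pageStarts(3257)
print(starts)
```

The second element of starts is the retstart = 1000 used in Example 6.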

Example 7: Return the PubMed summary entries for the query of Example 6
qResults2 <- EUtilsGet(fit2)

Example 8: Get all of the IDs for the query of Example 5, 1000 at a time.
# Find out how many there are and allocate a vector
q <- EUtilsSummary("PTEN[Title] OR PTEN[MeSH Terms]", db = "pubmed")
cnt <- QueryCount(q)
ids <- character(length = cnt)
rStart <- 1      # position in ids where the next batch begins
rStep <- 1000    # number of IDs returned per query
rRestart <- 0    # retstart offset for the next query (positions start at 0)
nInc <- ceiling(cnt / rStep)
# Now loop to fill in
for (k in seq_len(nInc)) {
  q <- EUtilsSummary("PTEN[Title] OR PTEN[MeSH Terms]", db = "pubmed", retstart = rRestart)
  qIds <- QueryId(q)
  rEnd <- rStart + length(qIds) - 1
  ids[rStart:rEnd] <- qIds
  rStart <- rStart + rStep
  rRestart <- rRestart + rStep
  Sys.sleep(0.5)  # pause briefly so queries are not sent all at once
}

Example 9: Write a function to accomplish a more general retrieval
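One way such a function might look, as a sketch rather than a definitive solution. The names getAllIds, risPage, and fakePage are hypothetical and not part of RISmed. The page fetcher is passed in as an argument so that the paging logic can be tried offline with a stand-in before it is pointed at NCBI.

```r
# Page fetcher built on RISmed (needs the package and network access).
# Returns the total match count and one page of IDs.
risPage <- function(query, db, retstart, retmax) {
  q <- RISmed::EUtilsSummary(query, db = db, retstart = retstart, retmax = retmax)
  list(count = RISmed::QueryCount(q), ids = RISmed::QueryId(q))
}

# Collect every ID matching `query`, requesting `step` IDs per call.
getAllIds <- function(query, db = "pubmed", step = 1000, pause = 0.5,
                      fetchPage = risPage) {
  first <- fetchPage(query, db, retstart = 0, retmax = step)
  pages <- list(first$ids)
  # retstart offsets of all pages; the first page is already fetched.
  starts <- seq(from = 0, by = step, length.out = ceiling(first$count / step))
  for (s in starts[-1]) {
    Sys.sleep(pause)  # space out the requests
    pages[[length(pages) + 1]] <- fetchPage(query, db, retstart = s, retmax = step)$ids
  }
  unlist(pages)
}

# Offline check: a fake fetcher that pretends the database holds 2500 IDs.
fakePage <- function(query, db, retstart, retmax) {
  all <- as.character(seq_len(2500))
  last <- min(retstart + retmax, length(all))
  list(count = length(all), ids = all[(retstart + 1):last])
}
ids <- getAllIds("PTEN[Title]", fetchPage = fakePage, pause = 0)
print(length(ids))
```

For a real retrieval, drop the fetchPage argument, for example getAllIds("PTEN[Title] OR PTEN[MeSH Terms]"), which should reproduce the result of Example 8.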