Data mining

Background reading

iPhylo
Facts are Sacred: The power of data (Guardian Shorts) [Kindle and iBooks]
Elsevier Challenge entry
Visualization tutorials list from Compulsive data

Lecture

The themes of this lecture are extracting information from (often messy) data, and the challenge of linking data together.

Data mining and data linking (slides by Roderic Page)

Regular expressions

The first step in any data analysis is to get the data into a form you can analyse. Often the data isn't quite in the right format, and requires a little massaging before you can work with it. In many cases simple tools like "search and replace" are sufficient, but sometimes you need more powerful tools, such as regular expressions.
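As a small concrete example (sketched here in Python; any regex-capable tool works the same way), a pattern-based "search and replace" can do cleaning that a literal search and replace cannot, such as collapsing irregular runs of whitespace:

```python
import re

# Pasted data often contains irregular runs of spaces and tabs.
messy = "Puntarenas,   CR\t1520  "

# \s+ matches one or more whitespace characters (spaces, tabs, newlines);
# replace each run with a single space, then trim the ends.
clean = re.sub(r"\s+", " ", messy).strip()
print(clean)  # Puntarenas, CR 1520
```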

Regular expressions are rules for matching strings, for example:

Pattern          Example match  Comment
\d               1              single digit
\d+              12             one or more digits
[0-9]{4}         2012           four-digit number (e.g., a year)
\w+              abc            word
[A-Z][A-Z]\d+    FJ559180       two capital letters followed by digits
[WE]             W              match either W or E (inside [...] a | would be treated as a literal character)
\d+°\d+'[WE]     10°36'E        longitude

Armed with regular expressions like these we can develop tools to extract information from text, such as the specimen parser which finds museum specimen codes.
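A minimal sketch of such a tool in Python. The pattern (a short run of capital letters, a space, then a catalogue number) is deliberately simplified; real specimen codes are messier, which is why dedicated parsers exist. The codes used here appear in the tables later on this page.

```python
import re

text = ("Specimens examined include FMNH 257669 and USNM 239420, "
        "as well as MVZ 149813.")

# Institution acronym (2-4 capital letters), a space, then a catalogue number.
pattern = re.compile(r"\b[A-Z]{2,4} \d+\b")

codes = pattern.findall(text)
print(codes)  # ['FMNH 257669', 'USNM 239420', 'MVZ 149813']
```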

Regular expressions

Using the Regular Expression Tester below, create regular expressions for the following tasks:

Regular Expression Tester

This regular expression tester (from Rob Locher's web site) uses the regular expression parser in your browser's implementation of JavaScript.


Tips

Expression  Matches
[abc] A single character: a, b, or c
[^abc] Any single character but a, b, or c
[a-z] Any character in the range a-z
[a-zA-Z] Any character in the range a-z or A-Z (any alphabetical character)
\s Any whitespace character [ \t\n\r\f\v]
\S Any non-whitespace character [^ \t\n\r\f\v]
\d Any digit [0-9]
\D Any non-digit [^0-9]
\w Any word character [a-zA-Z0-9_]
\W Any non-word character [^a-zA-Z0-9_]
\b A word boundary between \w and \W
\B A position that is not a word boundary
| Alternation: matches either the subexpression to the left or to the right
() Grouping: group all together for repetition operators
^ Beginning of the string
$ End of the string
 
Repetition operator  Meaning
{n,m} Match the previous item at least n times but no more than m times
{n,} Match the previous item n or more times
{n} Match exactly n occurrences of the previous item
? Match 0 or 1 occurrences of the previous item {0,1}
+ Match 1 or more occurrences of the previous item {1,}
* Match 0 or more occurrences of the previous item {0,}

Option  Description
g "Global" -- find all matches in the string rather than just the first
i "case Insensitive" -- ignore character case when matching
m "Multiline" -- search over more than one line if the text contains line breaks
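These options are not unique to JavaScript. In Python, for example, re.findall is "global" by default, while the "i" and "m" options correspond to the re.IGNORECASE and re.MULTILINE flags:

```python
import re

text = "gene AB123\nGene CD456"

# "i": ignore case; "m": let ^ and $ match at line breaks;
# findall plays the role of "g" by returning every match.
matches = re.findall(r"^gene \w+$", text, re.IGNORECASE | re.MULTILINE)
print(matches)  # ['gene AB123', 'Gene CD456']
```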

Getting data from a paper

Some journals are trying to make data in papers more accessible. For example, Elsevier's "Article of the Future" project includes tools to extract data from the tables in a paper. If you go to a paper such as http://dx.doi.org/10.1016/j.ympev.2009.07.011 you will see a panel labelled Table download. This tool will find tables in the article and make them available for download.

If you download the tables for the paper http://dx.doi.org/10.1016/j.ympev.2009.07.011 you will see something like this:

Taxon and institutional vouchera,Locality ID,Collection locality,Geographic coordinates/approximate location,Elevation (m),GenBank accession number 12S,16S,COI,c-myc
1. UTA A-52449,1,"Puntarenas, CR","(10°18′N, 84°48′W)",1520,EF562312,EF562365,None,EF562417
2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430
3. FMNH 257669,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562320,EF562372,EF562380,EF562432
4. FMNH 257670,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562317,EF562336,EF562376,EF562421
5. FMNH 257671,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562314,EF562374,EF562409,None
6. FMNH 257672,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562318,None,EF562382,None

While this makes the data easy to get, it might not be in the form we need. But because it is available in a widely used format (CSV) we can load it into a spreadsheet or a tool like OpenRefine (ex-Google Refine) and clean it.
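Because the download is plain CSV, it can also be read directly in code. A sketch in Python using the standard csv module (the rows are from the table above; the column headers have been shortened here for readability):

```python
import csv
import io

data = '''Voucher,Locality ID,Collection locality,Coordinates,Elevation (m)
1. UTA A-52449,1,"Puntarenas, CR","(10°18′N, 84°48′W)",1520
2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500'''

# DictReader uses the header row as keys, and handles the quoted
# fields that contain commas (locality and coordinates).
reader = csv.DictReader(io.StringIO(data))
rows = list(reader)

print(rows[0]["Collection locality"])  # Puntarenas, CR
print(rows[1]["Elevation (m)"])        # 1500
```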

Extracting data to create a visualisation

Once we've got the data from a paper, we can create some new visualisations that might not have been available to the paper's authors. For example, with a little data cleaning and regular expressions we can create a geophylogeny.
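One such cleaning step: KML wants decimal degrees, but the table stores coordinates as strings like "(10°18′N, 84°48′W)". A sketch of the conversion in Python (a regular expression pulls out degrees, minutes, and hemisphere):

```python
import re

def to_decimal(coord):
    """Convert a string like "10°18′N" or "84°48′W" to decimal degrees."""
    m = re.match(r"(\d+)°(\d+)′([NSEW])", coord)
    degrees, minutes, hemisphere = int(m.group(1)), int(m.group(2)), m.group(3)
    value = degrees + minutes / 60.0
    # Southern and western hemispheres are negative in decimal degrees.
    return -value if hemisphere in "SW" else value

m = re.match(r"\((.+), (.+)\)", "(10°18′N, 84°48′W)")
lat, lon = to_decimal(m.group(1)), to_decimal(m.group(2))
print(lat, lon)  # 10.3 -84.8
```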

Go to Create KML tree and paste in the tree below, then the data from the paper. You should get a KML tree back (if you have the Google Earth Plug-in you should see the tree displayed in Google Earth embedded in the web page).

Extracting data from PDFs

There are an increasing number of tools for extracting data from PDFs, such as Tabula. Once you've downloaded and installed Tabula, point your web browser at http://127.0.0.1:8080 and follow the instructions.

Taxonomic name cleaning using OpenRefine

OpenRefine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. OpenRefine comes with the ability to use Freebase as well as other external services.

For this course I've implemented the following services:

To use these you need to add the URLs above to OpenRefine (see example below). The EOL, NCBI and WoRMS services do a basic name lookup. The uBio FindIT service extracts a taxonomic name from a string, and can be viewed as a "taxonomic name cleaner".

How to use reconciliation services

Start an OpenRefine session and go to http://127.0.0.1:3333. If you can't get OpenRefine installed on your computer, there is a version running as a container here: https://openrefine.sloppy.zone

Save the names below to a text file and open it as a new project.

Names
Achatina fulica (giant African snail)
Acromyrmex octospinosus ST040116-01
Alepocephalus bairdii (Baird's smooth-head)
Alaska Sea otter (Enhydra lutris kenyoni)
Toxoplasma gondii
Leucoagaricus gongylophorus
Pinnotheres
Themisto gaudichaudii
Hyperiidae

You should see something like this: Refine1

Click on the column header Names and choose Reconcile → Start reconciling.

Refine2

A dialog will popup asking you to select a service.

Refine3

If you've already added a service it will be in the list on the left. If not, click the Add Standard Services... button at the bottom left and paste in the URL (in this case http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php).

Once the service has loaded click on Start Reconciling. Once it has finished you should see most of the names linked to uBio (click on a name to check this):

Refine4

Sometimes there may be more than one possible match, in which case these will be listed in the cell. Once you have reconciled the data you may want to do something with the reconciliation. For example, if you want to get the ids for the names you've just matched you can create a new column based on the reconciliation. Click on the Names column header and choose Edit column → Add column based on this column.... A dialog box will be displayed:

Refine6

In the box labelled Expression enter cell.recon.match.id and give the column a name (e.g., "NamebankID"). You will now have a column of uBio NamebankIDs for the names:

Refine7

You could also get the names uBio extracted by creating a column based on the values of cell.recon.match.name. To compare this with the original values, click on the Names column header and choose Reconcile → Actions → Clear reconciliation data. Now you can see the original input names, and the string uBio extracted from each name:

Refine8

Wordtrees

In the lecture we saw an example of article titles which contain information about host-parasite relationships:

These titles have a similar structure, which suggests we could develop a tool to parse these titles and extract the host-parasite associations. To explore this further we can use a tool like Many Eyes Wordtrees.

IBM's Many Eyes contains tools for creating visualisations (login required; you will also need Java). We will use the Wordtree visualisation to explore the structure of the following sentences:

Eimeria azul sp. n. (Protozoa: Eimeriidae) from the eastern cottontail, Sylvilagus floridanus, in Pennsylvania
Mirandula parva gen. et sp. nov. (Cestoda, Dilepididae) from the long-nosed Bandicoot (Perameles nasuta Geoff.)
Hysterothylacium carutti n. sp. (Nematoda: Anisakidae) from the marine fish Johnius carutta Bloch of Bay of Bengal (Visakhapatnam)
Ctenascarophis lesteri n. sp. and Prospinitectus exiguus n. sp. (Nematoda: Cystidicolidae) from the skipjack tuna, Katsuwonus pelamis
Buticulotrema stenauchenus n. gen. n. sp. (Digenea: Opecoelidae) from Malacocephalus occidentalis and Nezumia aequalis (Macrouridae) from the Gulf of Mexico
Nubenocephalus nebraskensis n. gen., n. sp. (Apicomplexa: Actinocephalidae) from adults of Argia bipunctulata (Odonata: Zygoptera)
Studies on Stenoductus penneri gen. n., sp. n. (Cephalina: Monoductidae) from the spirobolid millipede, Floridobolus penneri Causey 1957
Species of Cloacina Linstow, 1898 (Nematoda: Strongyloidea) from the black-tailed wallaby, Wallabia bicolor (Desmarest, 1804) from eastern Australia
A new marine Cercaria (Digenea: Aporocotylidae) from the southern quahog Mercenaria campechiensis
A new species of Breinlia (Breinlia) (Nematoda: Filarioidea) from the south Indian flying squirrel Petaurista philippensis (Elliot)
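Before reaching for a visualisation, we can confirm the shared structure programmatically. A rough sketch in Python: the new taxon comes before the parenthesised classification, and the host follows "from the". This naive pattern will not handle every title above (e.g. "Studies on..." or "Species of..." prefixes), which is exactly the kind of variation a wordtree helps reveal.

```python
import re

# Two titles from the list above.
titles = [
    "Eimeria azul sp. n. (Protozoa: Eimeriidae) from the eastern cottontail, "
    "Sylvilagus floridanus, in Pennsylvania",
    "Ctenascarophis lesteri n. sp. and Prospinitectus exiguus n. sp. "
    "(Nematoda: Cystidicolidae) from the skipjack tuna, Katsuwonus pelamis",
]

# New taxon before the parenthesised classification; host after "from the".
pattern = re.compile(r"^(?P<parasite>.+?) \((?P<family>.+?)\) from the (?P<host>[^,]+)")

pairs = [(m.group("parasite"), m.group("host"))
         for m in (pattern.search(t) for t in titles) if m]
print(pairs)
```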

Javascript Wordtrees

Jason Davies has created a Javascript-based version of word tree that can also be used (and doesn't have the Java dependency of Many Eyes).

Linked data

The holy grail of data linking is to be able to seamlessly navigate a web of data (in much the same way the World Wide Web enables us to navigate through documents). For example, we could start with a museum specimen and navigate through all the connected data elements (DNA sequences, publications, phylogenies, taxonomic names, ecological data, etc.). Or, more correctly, we could have computers do this for us. Linked data is aimed at making data computer-readable so that we can treat the web as a giant database.

This is the vision of linked data (see TED talk by Tim Berners-Lee below).

For this to happen requires at least three things:

  1. We need to use globally unique identifiers for the same thing (so you and I know we are talking about the same thing)
  2. These identifiers need to be resolvable, that is, I can use a tool such as a web browser and get some information about the thing with that identifier.
  3. The information you get back about an object uses a standard vocabulary to describe that information. For example, if the information is about a person, any source of information should use the same terms to describe the name. If one source uses, say "surname" and the other uses "lastname" to describe the last part of a Western-style name, then we will struggle to integrate that data.
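The "surname"/"lastname" problem in point 3 can be sketched with two toy records (the records and the term mapping below are invented for illustration): once both sources are mapped onto one shared vocabulary, merging becomes trivial.

```python
# Two sources describing the same person with different vocabularies.
source_a = {"surname": "Darwin", "forename": "Charles"}
source_b = {"lastname": "Darwin", "yearOfBirth": 1809}

# Map each source-specific term onto one shared vocabulary term.
mapping = {"surname": "familyName", "lastname": "familyName",
           "forename": "givenName", "yearOfBirth": "birthYear"}

def normalise(record):
    return {mapping.get(key, key): value for key, value in record.items()}

# With a common vocabulary the records combine cleanly.
merged = {**normalise(source_a), **normalise(source_b)}
print(merged)  # {'familyName': 'Darwin', 'givenName': 'Charles', 'birthYear': 1809}
```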

In theory, linked data is a very powerful idea. In practice, there are significant hurdles at each of the three steps above, and we are some way from realising the linked data vision. For example, major providers of bibliographic data have managed to create linked data for the same papers with virtually no overlap (see Linked data that isn't: the failings of RDF).

Answering a question using linked data

Given data from various sources, how can we combine that data to test hypotheses, gain insight, or discover patterns worth exploring further? If we use the same identifier to refer to the same entities (e.g., taxon, publication, specimen, gene sequence) and the same terms ("vocabulary") to refer to attributes of those entities and their interrelationships, then we can combine data sets into a large graph of "linked data". We can query that graph by tracing paths that connect the entities we are interested in.

Part 1: What data do we need to answer the question?

The first question is what kind of data would we need? To help answer this we can draw a "model" of the entities we are interested in and how they are related. For example, we may have specimens and the date they were collected, and taxonomic names and the date they were published, and taxonomic names are associated with type specimens. Can we sketch the relationships between these things?

Part 2: What data do we have?

Next we need to identify the sources of information about the entities in the diagram(s) we created above. What are they, and do they provide their data in a form that we can use?

Part 3: Can the data answer the question?

Once we have a model of the connections between the data that we need, we can now ask whether we can trace those connections in the data we have available. If not, what would we need to make these connections?

Linked data and SPARQL

Linked data can be visualised as a graph, where the nodes are things (or concepts) and the edges (links) are connections between those things. In order to query the data we need to formulate a path through the graph. For example, to ask the question "who publishes in the journal Nature?" we would follow the paths from author to article to journal the article was published in. We can represent this as:

person authorOf article
article partOf journal 
This query resembles one written in SPARQL, which is a simple way to describe paths in a graph in terms of pairs of nodes connected by a (labelled) edge.
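The same path can be traced in a few lines of ordinary code, which makes clear what a SPARQL engine is doing under the hood. A toy graph in Python (the names are invented):

```python
# A triple store is just a collection of (subject, predicate, object) statements.
triples = [
    ("alice",    "authorOf", "article1"),
    ("bob",      "authorOf", "article2"),
    ("article1", "partOf",   "Nature"),
    ("article2", "partOf",   "Science"),
]

# "Who publishes in the journal Nature?": follow partOf edges into Nature,
# then authorOf edges into those articles.
articles = {s for (s, p, o) in triples if p == "partOf" and o == "Nature"}
authors  = {s for (s, p, o) in triples if p == "authorOf" and o in articles}
print(authors)  # {'alice'}
```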

Exercise

Given the frog species named Rana okaloosae we could ask some questions, such as:

  1. When was this species described?
  2. Who described that species?
  3. When was this species sequenced?
  4. How much time elapsed between the collection of the specimens and the description of the species (a recent study suggests that this averages 21 years, see doi:10.1016/j.cub.2012.10.029)

To answer these questions we need some data and we need it in a form that computers can easily digest:

Entity                                  Web page             RDF
Rana okaloosae in NCBI taxonomy         tax_id:190275        http://bioguid.info/taxonomy:190275 [local copy]
GenBank sequence                        AY083283             http://bioguid.info/genbank:AY083283 [local copy]
Taxonomic name Rana okaloosae           ION 492778           urn:lsid:organismnames.com:name:492778 [local copy]
Publication of Rana okaloosae           doi:10.2307/1444847  doi:10.2307/1444847
Specimen USNM 239420 of Rana okaloosae  GBIF                 RDF

When was a specimen of the frog with NCBI tax_id 190275 collected?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tcommon: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX tc: <http://rs.tdwg.org/ontology/voc/TaxonConcept#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX to: <http://rs.tdwg.org/ontology/voc/TaxonOccurrence#>

SELECT * WHERE {
	<http://bioguid.info/taxonomy:190275> tc:nameString ?n . 
	?node to:taxonName ?n .
	?occurrence to:identifiedTo ?node .
	?occurrence to:institutionCode ?inst .
	?occurrence to:catalogNumber ?cat .
	OPTIONAL {
	?occurrence to:earliestDateCollected ?date .
	}
}

When was the DNA sequence from the frog with NCBI tax_id 190275 obtained?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT * WHERE {
  ?sequence dcterms:subject <http://purl.uniprot.org/taxonomy/190275> .
  ?sequence dcterms:created  ?date

} 

What publication was the frog described in?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tcommon: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX tc: <http://rs.tdwg.org/ontology/voc/TaxonConcept#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX to: <http://rs.tdwg.org/ontology/voc/TaxonOccurrence#>

SELECT * WHERE {
<http://bioguid.info/taxonomy:190275> tc:nameString ?name .
?ion tn:nameComplete ?name .
?ion tcommon:PublishedIn ?citation .
}

When was the frog described?

To answer this question we need to add some missing "glue" linking the publication string above to the DOI. Insert this triple into the triple store:

Now we can get the date:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tcommon: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX tc: <http://rs.tdwg.org/ontology/voc/TaxonConcept#>
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX to: <http://rs.tdwg.org/ontology/voc/TaxonOccurrence#>

SELECT * WHERE {
<http://bioguid.info/taxonomy:190275> tc:nameString ?name .
?ion tn:nameComplete ?name .
?ion tcommon:PublishedInCitation ?publication .
?publication dcterms:date ?date .
}

Based on these results answer the following questions:

  1. Was the DNA sequence used in discovering the new species?
  2. What was the time interval between the specimen being collected and the description of the species it belongs to?
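For question 2, once the two dates have been retrieved the interval is simple arithmetic. A Python sketch (the dates below are placeholders, not the actual values for Rana okaloosae):

```python
from datetime import date

# Placeholder dates standing in for the query results above.
collected = date(1970, 4, 2)   # earliestDateCollected for the specimen
described = date(1985, 7, 1)   # publication date of the description

# Difference in days, converted to (approximate) years.
elapsed_years = (described - collected).days / 365.25
print(round(elapsed_years, 1))  # 15.2
```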

Wikidata

Wikidata is a large, rapidly growing knowledge graph that covers everything in Wikipedia, and more. I have some example Wikidata queries, and below are some more examples of using Wikidata to ask various biological questions.

IUCN status

SELECT * WHERE 
{ 
  ?wikidata wdt:P141 ?status. 
  ?wikidata rdfs:label ?name .
  ?wikidata wdt:P225 ?taxon_name .
  ?status rdfs:label ?status_label .
  FILTER (lang(?name) = 'en') .
  FILTER (lang(?status_label) = 'en')
}
LIMIT 10
Try it

Taxa that have been sequenced

SELECT *
WHERE
{
  # taxon 
  ?taxon wdt:P31 wd:Q16521 .
  ?taxon wdt:P225 ?name . # taxonomic name
  ?taxon wdt:P685 ?ncbi .
}
limit 100
Try it

Maps for Hominidae

SELECT ?child_name ?map 
WHERE
{
 VALUES ?root_name {"Hominidae"}
 ?root wdt:P225 ?root_name .
 ?child wdt:P171+ ?root .
 ?child wdt:P171 ?parent .
 ?child wdt:P225 ?child_name .
 ?parent wdt:P225 ?parent_name .
 ?child wdt:P181 ?map . 
}
limit 100
Try it

Distribution maps

SELECT *
WHERE
{
  # taxon 
  ?taxon wdt:P31 wd:Q16521 .
  ?taxon wdt:P181 ?map .
}
limit 100
Try it

Publications describing new species

SELECT *
WHERE
{
  # taxon published in
  ?taxon wdt:P5326 ?work .
  ?work rdfs:label ?work_name .
  ?taxon rdfs:label ?name .
  FILTER (lang(?name) = 'en') .
  FILTER (lang(?work_name) = 'en') .
}
limit 100
Try it

Species hosted by the potato

SELECT *
WHERE
{
  # taxon 
  ?taxon wdt:P225 "Solanum tuberosum" .
  ?parasite wdt:P2975 ?taxon .
  ?parasite wdt:P225 ?parasite_name . 

}
Try it

Host associations

SELECT *
WHERE
{
  # taxon 
  ?taxon wdt:P31 wd:Q16521 .
  ?taxon wdt:P225 ?name . # taxonomic name
  ?taxon wdt:P2975 ?host .
  ?host wdt:P225 ?host_name . # taxonomic name of host
}
limit 100
Try it

Food source for sperm whale

SELECT *
WHERE
{
  # taxon 
  ?taxon wdt:P225 "Physeter macrocephalus" .
  ?taxon wdt:P1034  ?food .
  ?food rdfs:label ?food_label . 
  
  FILTER (lang(?food_label) = 'en') .
}
Try it

Food sources for different species

SELECT *
WHERE
{
  # taxon 
  ?taxon wdt:P31 wd:Q16521 .
  ?taxon wdt:P225 ?name  .
  # food source
  ?taxon wdt:P1034  ?food .
  ?food rdfs:label ?food_label . 
  
  FILTER (lang(?food_label) = 'en') .
 
}
limit 100
Try it

Data mining using R

If you use R then you might be interested in Biodiversity Observations Miner by Muñoz et al. (see https://fgabriel1891.shinyapps.io/biodiversityobservationsminer/).

Reading
