annotate tmp.dat @ 3:0df72a8ab095 draft default tip

planemo upload for repository https://github.com/Alveo/alveo-galaxy-tools commit f2432aaedd36ae7662873623d8861d0982dffdd2
author stevecassidy
date Mon, 20 Nov 2017 22:52:11 -0500
parents a47980ef2b96
children
rev   line source
2
a47980ef2b96 planemo upload for repository https://github.com/Alveo/alveo-galaxy-tools commit b5b26e9118f2ad8af109d606746b39a5588f0511-dirty
stevecassidy
parents: 1
diff changeset
1 Some text that is nøt øß ascii
2 DADA project is developing software for managing language resources and exposing them on the web .
3 Language resources are digital collections of language as audio , video and text used to study language and build technology systems .
4 The project has been going for a while with some initial funding from the ARC to build the basic infrastructure and later from Macquarie University for some work on the Auslan corpus of Australian Sign Language collected by Trevor Johnston .
5 Recently we have two projects which DADA will be part of , and so the pace of development has picked up a little .
6 The Australian National Corpus ( AusNC ) is an effort to build a centralised collection of resources of language in Australia .
7 The core idea is to take whatever existing collections we can get permission to publish and make them available under a common technical infrastructure .
8 Using some funding from HCSNet we built a small demonstration site that allowed free text search on two collections : the Australian Corpus of English and the Corpus of Oz Early English .
9 We now have some funding to continue this work and expand both the size of the collection and the capability of the infrastructure that will support it .
10 What we ’ ve already done is to separate the text in these corpora from their meta - data ( descriptions of each text ) and the annotation ( denoting things within the texts ).
11 While the pilot allows searching on the text the next steps will allow search using the meta - data ( look for this in texts written after 1900 ) and the annotation ( find this in the titles of articles ).
12 This project is funded by the Australian National Data Service ( ANDS ) and is a collaboration with Michael Haugh at Griffith .
13 The Big Australian Speech Corpus , more recently renamed AusTalk , is an ARC funded project to collect speech and video from 1000 Australian speakers for a new freely available corpus .
14 The project involves many partners around the country each of whom will have a ‘ black box ’ recording station to collect audio and stereo video of subjects reading words and sentences , being interviewed and doing the Map task – a game designed to elicit natural speech between two people .
15 Our part of the project is to provide the server infrastructure that will store the audio , video and annotation data that will make up the corpus .
16 DADA will be part of this solution but the main driver is to be able to provide a secure and reliable store for the primary data as it comes in from the collection sites .
17 An important feature of the collection is the meta - data that will describe the subjects in the recording .
18 Some annotation of the data will be done automatically , for example some forced alignment of the read words and sentences .
19 Later , we will move on to support manual annotation of some of the data – for example transcripts of the interviews and map task sessions .
20 All of this will be published via the DADA server infrastructure to create a large , freely available research collection for Australian English .
21 Since the development of DADA now involves people outside Macquarie , we have started using a public bitbucket repository for the code .
22 As of this writing the code still needs some tidying and documentation to enable third parties to be able to install and work on it , but we hope to have that done within a month .
23 The public DADA demo site is down at the moment due to network upgrades at Macquarie ( it ’ s only visible inside MQ ) – I hope to have that fixed soon with some new sample data sets loaded up for testing .
24 2011 looks like it will be a significant year for DADA .
25 We hope to end this year with a number of significant text , audio and video corpora hosted on DADA infrastructure and providing useful services to the linguistics and language technology communities .