diff tmp.tok @ 2:a47980ef2b96 draft
planemo upload for repository https://github.com/Alveo/alveo-galaxy-tools commit b5b26e9118f2ad8af109d606746b39a5588f0511-dirty
author | stevecassidy |
---|---|
date | Wed, 01 Nov 2017 01:19:55 -0400 |
parents | |
children | |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tmp.tok Wed Nov 01 01:19:55 2017 -0400 @@ -0,0 +1,696 @@ +some +text +that +is +nøt +øß +ascii +dada +project +is +developing +software +for +managing +language +resources +and +exposing +them +on +the +web +. +language +resources +are +digital +collections +of +language +as +audio +, +video +and +text +used +to +study +language +and +build +technology +systems +. +the +project +has +been +going +for +a +while +with +some +initial +funding +from +the +arc +to +build +the +basic +infrastructure +and +later +from +macquarie +university +for +some +work +on +the +auslan +corpus +of +australian +sign +language +collected +by +trevor +johnston +. +recently +we +have +two +projects +which +dada +will +be +part +of +, +and +so +the +pace +of +development +has +picked +up +a +little +. +the +australian +national +corpus +( +ausnc +) +is +an +effort +to +build +a +centralised +collection +of +resources +of +language +in +australia +. +the +core +idea +is +to +take +whatever +existing +collections +we +can +get +permission +to +publish +and +make +them +available +under +a +common +technical +infrastructure +. +using +some +funding +from +hcsnet +we +build +a +small +demonstration +site +that +allowed +free +text +search +on +two +collections +: +the +australian +corpus +of +english +and +the +corpus +of +oz +early +english +. +we +now +have +some +funding +to +continue +this +work +and +expand +both +the +size +of +the +collection +and +the +capability +of +the +infrastructure +that +will +support +it +. +what +we’ve +already +done +is +to +separate +the +text +in +these +corpora +from +their +meta-data +( +descriptions +of +each +text +) +and +the +annotation +( +denoting +things +within +the +texts +) +. 
+while +the +pilot +allows +searching +on +the +text +the +next +steps +will +allow +search +using +the +meta-data +( +look +for +this +in +texts +written +after +1900 +) +and +the +annotation +( +find +this +in +the +titles +of +articles +) +. +this +project +is +funded +by +the +australian +national +data +service +( +ands +) +and +is +a +collaboration +with +michael +haugh +at +griffith +. +the +big +australian +speech +corpus +, +more +recently +renamed +austalk +, +is +an +arc +funded +project +to +collect +speech +and +video +from +1000 +australian +speakers +for +a +new +freely +available +corpus +. +the +project +involves +many +partners +around +the +country +each +of +who +will +have +a +‘black +box’ +recording +station +to +collect +audio +and +stereo +video +of +subjects +reading +words +and +sentences +, +being +interviewed +and +doing +the +map +task +– +a +game +designed +to +elicit +natural +speech +between +two +people +. +our +part +of +the +project +is +to +provide +the +server +infrastructure +that +will +store +the +audio +, +video +and +annotation +data +that +will +make +up +the +corpus +. +dada +will +be +part +of +this +solution +but +the +main +driver +is +to +be +able +to +provide +a +secure +and +reliable +store +for +the +primary +data +as +it +comes +in +from +the +collection +sites +. +an +important +feature +of +the +collection +is +the +meta-data +that +will +describe +the +subjects +in +the +recording +. +some +annotation +of +the +data +will +be +done +automatically +, +for +example +some +forced +alignment +of +the +read +words +and +sentences +. +later +, +we +will +move +on +to +support +manual +annotation +of +some +of +the +data +– +for +example +transcripts +of +the +interviews +and +map +task +sessions +. +all +of +this +will +be +published +via +the +dada +server +infrastructure +to +create +a +large +, +freely +available +research +collection +for +australian +english +. 
+since +the +development +of +dada +now +involves +people +outside +macquarie +, +we +have +started +using +a +public +bitbucket +repository +for +the +code +. +as +of +this +writing +the +code +still +needs +some +tidying +and +documentation +to +enable +third +parties +to +be +able +to +install +and +work +on +it +, +but +we +hope +to +have +that +done +within +a +month +. +the +public +dada +demo +site +is +down +at +the +moment +due +to +network +upgrades +at +macquarie +( +it’s +only +visible +inside +mq +) +– +i +hope +to +have +that +fixed +soon +with +some +new +sample +data +sets +loaded +up +for +testing +. +2011 +looks +like +it +will +be +a +significant +year +for +dada +. +we +hope +to +end +this +year +with +a +number +of +significant +text +, +audio +and +video +corpora +hosted +on +dada +infrastructure +and +providing +useful +services +to +the +linguistics +and +language +technology +communities +.