diff tmp.tok @ 2:a47980ef2b96 draft

planemo upload for repository https://github.com/Alveo/alveo-galaxy-tools commit b5b26e9118f2ad8af109d606746b39a5588f0511-dirty
author stevecassidy
date Wed, 01 Nov 2017 01:19:55 -0400
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tmp.tok	Wed Nov 01 01:19:55 2017 -0400
@@ -0,0 +1,696 @@
+some
+text
+that
+is
+nøt
+øß
+ascii
+dada
+project
+is
+developing
+software
+for
+managing
+language
+resources
+and
+exposing
+them
+on
+the
+web
+.
+language
+resources
+are
+digital
+collections
+of
+language
+as
+audio
+,
+video
+and
+text
+used
+to
+study
+language
+and
+build
+technology
+systems
+.
+the
+project
+has
+been
+going
+for
+a
+while
+with
+some
+initial
+funding
+from
+the
+arc
+to
+build
+the
+basic
+infrastructure
+and
+later
+from
+macquarie
+university
+for
+some
+work
+on
+the
+auslan
+corpus
+of
+australian
+sign
+language
+collected
+by
+trevor
+johnston
+.
+recently
+we
+have
+two
+projects
+which
+dada
+will
+be
+part
+of
+,
+and
+so
+the
+pace
+of
+development
+has
+picked
+up
+a
+little
+.
+the
+australian
+national
+corpus
+(
+ausnc
+)
+is
+an
+effort
+to
+build
+a
+centralised
+collection
+of
+resources
+of
+language
+in
+australia
+.
+the
+core
+idea
+is
+to
+take
+whatever
+existing
+collections
+we
+can
+get
+permission
+to
+publish
+and
+make
+them
+available
+under
+a
+common
+technical
+infrastructure
+.
+using
+some
+funding
+from
+hcsnet
+we
+build
+a
+small
+demonstration
+site
+that
+allowed
+free
+text
+search
+on
+two
+collections
+:
+the
+australian
+corpus
+of
+english
+and
+the
+corpus
+of
+oz
+early
+english
+.
+we
+now
+have
+some
+funding
+to
+continue
+this
+work
+and
+expand
+both
+the
+size
+of
+the
+collection
+and
+the
+capability
+of
+the
+infrastructure
+that
+will
+support
+it
+.
+what
+we’ve
+already
+done
+is
+to
+separate
+the
+text
+in
+these
+corpora
+from
+their
+meta-data
+(
+descriptions
+of
+each
+text
+)
+and
+the
+annotation
+(
+denoting
+things
+within
+the
+texts
+)
+.
+while
+the
+pilot
+allows
+searching
+on
+the
+text
+the
+next
+steps
+will
+allow
+search
+using
+the
+meta-data
+(
+look
+for
+this
+in
+texts
+written
+after
+1900
+)
+and
+the
+annotation
+(
+find
+this
+in
+the
+titles
+of
+articles
+)
+.
+this
+project
+is
+funded
+by
+the
+australian
+national
+data
+service
+(
+ands
+)
+and
+is
+a
+collaboration
+with
+michael
+haugh
+at
+griffith
+.
+the
+big
+australian
+speech
+corpus
+,
+more
+recently
+renamed
+austalk
+,
+is
+an
+arc
+funded
+project
+to
+collect
+speech
+and
+video
+from
+1000
+australian
+speakers
+for
+a
+new
+freely
+available
+corpus
+.
+the
+project
+involves
+many
+partners
+around
+the
+country
+each
+of
+who
+will
+have
+a
+‘black
+box’
+recording
+station
+to
+collect
+audio
+and
+stereo
+video
+of
+subjects
+reading
+words
+and
+sentences
+,
+being
+interviewed
+and
+doing
+the
+map
+task
+–
+a
+game
+designed
+to
+elicit
+natural
+speech
+between
+two
+people
+.
+our
+part
+of
+the
+project
+is
+to
+provide
+the
+server
+infrastructure
+that
+will
+store
+the
+audio
+,
+video
+and
+annotation
+data
+that
+will
+make
+up
+the
+corpus
+.
+dada
+will
+be
+part
+of
+this
+solution
+but
+the
+main
+driver
+is
+to
+be
+able
+to
+provide
+a
+secure
+and
+reliable
+store
+for
+the
+primary
+data
+as
+it
+comes
+in
+from
+the
+collection
+sites
+.
+an
+important
+feature
+of
+the
+collection
+is
+the
+meta-data
+that
+will
+describe
+the
+subjects
+in
+the
+recording
+.
+some
+annotation
+of
+the
+data
+will
+be
+done
+automatically
+,
+for
+example
+some
+forced
+alignment
+of
+the
+read
+words
+and
+sentences
+.
+later
+,
+we
+will
+move
+on
+to
+support
+manual
+annotation
+of
+some
+of
+the
+data
+–
+for
+example
+transcripts
+of
+the
+interviews
+and
+map
+task
+sessions
+.
+all
+of
+this
+will
+be
+published
+via
+the
+dada
+server
+infrastructure
+to
+create
+a
+large
+,
+freely
+available
+research
+collection
+for
+australian
+english
+.
+since
+the
+development
+of
+dada
+now
+involves
+people
+outside
+macquarie
+,
+we
+have
+started
+using
+a
+public
+bitbucket
+repository
+for
+the
+code
+.
+as
+of
+this
+writing
+the
+code
+still
+needs
+some
+tidying
+and
+documentation
+to
+enable
+third
+parties
+to
+be
+able
+to
+install
+and
+work
+on
+it
+,
+but
+we
+hope
+to
+have
+that
+done
+within
+a
+month
+.
+the
+public
+dada
+demo
+site
+is
+down
+at
+the
+moment
+due
+to
+network
+upgrades
+at
+macquarie
+(
+it’s
+only
+visible
+inside
+mq
+)
+–
+i
+hope
+to
+have
+that
+fixed
+soon
+with
+some
+new
+sample
+data
+sets
+loaded
+up
+for
+testing
+.
+2011
+looks
+like
+it
+will
+be
+a
+significant
+year
+for
+dada
+.
+we
+hope
+to
+end
+this
+year
+with
+a
+number
+of
+significant
+text
+,
+audio
+and
+video
+corpora
+hosted
+on
+dada
+infrastructure
+and
+providing
+useful
+services
+to
+the
+linguistics
+and
+language
+technology
+communities
+.