some
text
that
is
nøt
øß
ascii
dada
project
is
developing
software
for
managing
language
resources
and
exposing
them
on
the
web
.
language
resources
are
digital
collections
of
language
as
audio
,
video
and
text
used
to
study
language
and
build
technology
systems
.
the
project
has
been
going
for
a
while
with
some
initial
funding
from
the
arc
to
build
the
basic
infrastructure
and
later
from
macquarie
university
for
some
work
on
the
auslan
corpus
of
australian
sign
language
collected
by
trevor
johnston
.
recently
we
have
two
projects
which
dada
will
be
part
of
,
and
so
the
pace
of
development
has
picked
up
a
little
.
the
australian
national
corpus
(
ausnc
)
is
an
effort
to
build
a
centralised
collection
of
resources
of
language
in
australia
.
the
core
idea
is
to
take
whatever
existing
collections
we
can
get
permission
to
publish
and
make
them
available
under
a
common
technical
infrastructure
.
using
some
funding
from
hcsnet
we
build
a
small
demonstration
site
that
allowed
free
text
search
on
two
collections
:
the
australian
corpus
of
english
and
the
corpus
of
oz
early
english
.
we
now
have
some
funding
to
continue
this
work
and
expand
both
the
size
of
the
collection
and
the
capability
of
the
infrastructure
that
will
support
it
.
what
we’ve
already
done
is
to
separate
the
text
in
these
corpora
from
their
meta-data
(
descriptions
of
each
text
)
and
the
annotation
(
denoting
things
within
the
texts
)
.
while
the
pilot
allows
searching
on
the
text
the
next
steps
will
allow
search
using
the
meta-data
(
look
for
this
in
texts
written
after
1900
)
and
the
annotation
(
find
this
in
the
titles
of
articles
)
.
this
project
is
funded
by
the
australian
national
data
service
(
ands
)
and
is
a
collaboration
with
michael
haugh
at
griffith
.
the
big
australian
speech
corpus
,
more
recently
renamed
austalk
,
is
an
arc
funded
project
to
collect
speech
and
video
from
1000
australian
speakers
for
a
new
freely
available
corpus
.
the
project
involves
many
partners
around
the
country
each
of
who
will
have
a
‘black
box’
recording
station
to
collect
audio
and
stereo
video
of
subjects
reading
words
and
sentences
,
being
interviewed
and
doing
the
map
task
–
a
game
designed
to
elicit
natural
speech
between
two
people
.
our
part
of
the
project
is
to
provide
the
server
infrastructure
that
will
store
the
audio
,
video
and
annotation
data
that
will
make
up
the
corpus
.
dada
will
be
part
of
this
solution
but
the
main
driver
is
to
be
able
to
provide
a
secure
and
reliable
store
for
the
primary
data
as
it
comes
in
from
the
collection
sites
.
an
important
feature
of
the
collection
is
the
meta-data
that
will
describe
the
subjects
in
the
recording
.
some
annotation
of
the
data
will
be
done
automatically
,
for
example
some
forced
alignment
of
the
read
words
and
sentences
.
later
,
we
will
move
on
to
support
manual
annotation
of
some
of
the
data
–
for
example
transcripts
of
the
interviews
and
map
task
sessions
.
all
of
this
will
be
published
via
the
dada
server
infrastructure
to
create
a
large
,
freely
available
research
collection
for
australian
english
.
since
the
development
of
dada
now
involves
people
outside
macquarie
,
we
have
started
using
a
public
bitbucket
repository
for
the
code
.
as
of
this
writing
the
code
still
needs
some
tidying
and
documentation
to
enable
third
parties
to
be
able
to
install
and
work
on
it
,
but
we
hope
to
have
that
done
within
a
month
.
the
public
dada
demo
site
is
down
at
the
moment
due
to
network
upgrades
at
macquarie
(
it’s
only
visible
inside
mq
)
–
i
hope
to
have
that
fixed
soon
with
some
new
sample
data
sets
loaded
up
for
testing
.
2011
looks
like
it
will
be
a
significant
year
for
dada
.
we
hope
to
end
this
year
with
a
number
of
significant
text
,
audio
and
video
corpora
hosted
on
dada
infrastructure
and
providing
useful
services
to
the
linguistics
and
language
technology
communities
.
author	stevecassidy
date	Mon, 20 Nov 2017 22:52:11 -0500
parents	a47980ef2b96
children