Mellon grant to fund project
to develop data-mining software for libraries
Andrea Lynn, Humanities Editor
217-333-2177; andreal@uiuc.edu
10/25/04
John Unsworth, dean of the U. of I. Graduate School of
Library and Information Science, will lead a two-year
multi-institutional project funded by the Andrew W.
Mellon Foundation that is expected to produce software
“for discovering, visualizing and exploring
significant patterns across large collections of full-text
humanities resources in digital libraries and collections.”
CHAMPAIGN, Ill.
— Using cutting edge “tools of discovery” and a diamond-sharp
new process called data-mining, information scientists at the University
of Illinois at Urbana-Champaign are beginning work that eventually will
help scholars carve out new literary knowledge in the works of writers
across languages, cultures and time.
The Andrew W. Mellon Foundation is funding the two-year, nearly $600,000
multi-institutional project, which John Unsworth, dean of Illinois’
Graduate School of Library and Information
Science (GSLIS), will lead.
In his winning project, titled “Web-based Text-Mining and Visualization
for Humanities Digital Libraries,” Unsworth expects to produce
software “for discovering, visualizing and exploring significant
patterns across large collections of full-text humanities resources
in digital libraries and collections.” The collections he’s
focusing on are at Illinois, Indiana University, the University of Michigan,
the University of North Carolina, Tufts University, the University of
Virginia and other universities.
In traditional “search-and-retrieval” projects, scholars
bring specific queries to collections of text and get back more or less
useful answers to those queries, Unsworth said.
“By contrast, the goal of data-mining, including text-mining,
is to produce new knowledge by exposing unanticipated similarities or
differences, clustering or dispersal,
co-occurrence and trends.”
During the last decade, he said, many millions of dollars have been
invested in creating digital library collections. Thus, today, terabytes
of full-text humanities resources are publicly available on the Web.
One terabyte, Unsworth said, equals 1,000 gigabytes, or enough storage
for 300 feature-length films in digital form.
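As a quick check on those figures, the arithmetic can be sketched in a few lines of Python. The only inputs are the two numbers quoted above; the per-film size is simply what they imply, not a figure given by Unsworth.

```python
# Back-of-the-envelope check of the storage figures quoted in the article.
GIGABYTES_PER_TERABYTE = 1_000  # decimal definition used in the article
FILMS_PER_TERABYTE = 300        # figure quoted in the article

# Implied average size of one feature-length film, in gigabytes.
gb_per_film = GIGABYTES_PER_TERABYTE / FILMS_PER_TERABYTE

print(f"Implied size per film: {gb_per_film:.1f} GB")
```

In other words, the comparison assumes a film of a little over three gigabytes.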
Those collections, dispersed across many institutions, “are large
enough and rich enough to provide an excellent opportunity for text-mining.
By creating the Web-based software tools, we aim to make those collections
significantly more useful, more informative and more rewarding for research
and teaching.”
With its roots in statistics, artificial intelligence and machine learning,
data-mining has been around since the 1990s. And statistical analysis
of humanities texts is “one of the older activities in humanities
computing,” Unsworth said. “People have been doing it in
authorship-attribution studies, for example, for most of the last half
of the 20th century.
“But data mining per se – discovering patterns in large
textual data sets – is not something that’s been done much
in the humanities. Our project may not be a ‘first,’ but
it is an early entry into the field, certainly.”
Unsworth said he intends to build on data-mining expertise at GSLIS
and on “several years of software development work” that
has been done at the U. of I.’s National Center for Supercomputing
Applications, in particular, work developing the D2K (Data 2 Knowledge)
software in Michael Welge’s Automated Learning Group.
“This project relies on Michael’s D2K, and could not happen
without it,” Unsworth said. “We’re grateful for his
participation.”
Nor is this the first GSLIS project to build on D2K, Unsworth said.
The National Science Foundation and the Mellon Foundation have funded
Stephen Downie, a young scholar in GSLIS, to use D2K in a music information
retrieval project.
With data-mining tools, Unsworth said, you first select a body of material
that you think is important in some way, next select features of those
materials that you similarly think are important, and then “map
the occurrence of those features in the selected materials to see whether
patterns emerge. If patterns do emerge, you analyze them and from that
analysis emerges – if you are lucky – new insights into
the materials.”
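The three steps Unsworth describes can be illustrated with a minimal Python sketch. It does not use D2K or the project's software; the tiny corpus, the text names and the chosen word features are all hypothetical placeholders invented for illustration.

```python
from collections import Counter
import re

# Step 1: select a body of material (hypothetical placeholder snippets,
# not drawn from any real collection).
corpus = {
    "text_a": "the sea was calm and the sea was wide",
    "text_b": "the city was loud and the street was crowded",
    "text_c": "calm sea wide sea and a quiet shore",
}

# Step 2: select features thought to be important (here, a few words).
features = ["sea", "city", "calm", "loud"]


def map_features(text, features):
    """Count how often each selected feature occurs in one text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[feature] for feature in features]


# Step 3: map the occurrence of those features across the materials.
matrix = {name: map_features(text, features) for name, text in corpus.items()}

# Print one row per text so any pattern can be inspected by eye.
for name, row in matrix.items():
    print(name, dict(zip(features, row)))
```

In this toy case two of the texts end up with identical feature counts. Whether such a pattern means anything is the analysis step, which is left to the scholar.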
For example, in the planning grant for this project, members of his
research team, using the full set of Shakespeare’s plays, selected
five “circulation-of-characters”
features – scenes, nodes, singles, loops, switches – as
independent variables, and “genre” as the dependent variable;
they then “attempted to order the plays by feature similarities
and see how that corresponded – or didn’t – to genre,”
he said.
“There was one very interesting result, which was that Othello
fell squarely in with the comedies. If I were to analyze this result,
I’d ask a number of questions about the methods used to produce
the results, but once satisfied that I was not looking at an artifact
of the procedure itself, I would ask what it means that Othello has
the structural features of comedy, and from there, an interesting journal
article might emerge.”
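The idea of ordering works by feature similarity and comparing the result with genre can be sketched roughly as follows. This is not the research team's method or the D2K software: the nearest-neighbor comparison is a stand-in technique, the work names are placeholders, and every number is invented for illustration rather than measured from any play. Only the five feature names come from the description above.

```python
import math

# Feature names taken from the article; the order fixes each vector's layout.
FEATURES = ["scenes", "nodes", "singles", "loops", "switches"]

# Invented values for placeholder works; NOT measurements of any real play.
works = {
    "comedy_1": ("comedy", [10, 14, 3, 6, 5]),
    "comedy_2": ("comedy", [10, 15, 3, 6, 5]),
    "tragedy_1": ("tragedy", [20, 25, 8, 2, 1]),
    "tragedy_2": ("tragedy", [21, 24, 9, 3, 1]),
    "tragedy_3": ("tragedy", [12, 14, 3, 6, 4]),  # deliberately comedy-like
}


def nearest_neighbor(name):
    """Return the other work whose feature vector is closest (Euclidean)."""
    _, vector = works[name]
    candidates = [
        (math.dist(vector, other_vector), other_name)
        for other_name, (_, other_vector) in works.items()
        if other_name != name
    ]
    return min(candidates)[1]


def genre_mismatches():
    """List works whose nearest neighbor carries a different genre label."""
    mismatches = []
    for name, (genre, _) in works.items():
        neighbor = nearest_neighbor(name)
        neighbor_genre = works[neighbor][0]
        if neighbor_genre != genre:
            mismatches.append((name, genre, neighbor, neighbor_genre))
    return mismatches


for name, genre, neighbor, neighbor_genre in genre_mismatches():
    print(f"{name} ({genre}) sits closest to {neighbor} ({neighbor_genre})")
```

A work flagged this way is only a starting point. As Unsworth notes, the first question is whether the mismatch is an artifact of the procedure itself.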
Unsworth said that the Shakespeare example isn’t strictly representative of what he’s
proposing to do in his project in terms of the scope of the data set.
“In the project we plan to explore thousands of works by hundreds
of authors,” he said. “Part of the experimentation will
be to determine what features are meaningful at what level of generality,
what subsets present the richest veins for data-mining and what methods
expose the most interesting patterns at what scope.”
Unsworth said that he and his team of researchers know literary scholars
are interested in the works that make up the data set he proposes to
use – British and American literary texts of many types, mostly
from the 19th century – and he knows that the features they’ll
be identifying are “features of interest,” especially structurally.
“What we don’t know, because this is an experiment with
a tool of discovery, is what interesting patterns we will find as we
map these features across this body of works. It is, therefore, a bit
of a leap of faith to accept the assertion that interesting patterns
will emerge, but I do make that assertion and I am comfortable doing
so.
“To date, we haven’t had a tool that exposes patterns in
literary texts at the level of granularity and the scope that we propose
in this project, but we know that the D2K tools work at that scope and
granularity with other kinds of data, and we know that literature –
and language itself – exhibits some meaningful patterns at every
level we can observe, so it seems reasonable to hypothesize that new
levels of observation across larger scopes of literary text at higher
resolution, with respect to textual features, will expose meaningful
patterns that haven’t been visible before.
“From there, it will be up to literary scholars to analyze, interpret
and explain those patterns, and in a very general way, that activity
is the advance in literary scholarship that we assume will emerge from
this project.”
Additional project partners in humanities research computing are Stephen
Ramsay, English department at the University of Georgia; Matthew Kirschenbaum,
English department at the University of Maryland, and fellow at the
Maryland Institute for Technology in the Humanities; and Tom Horton,
computer science department at the University of Virginia.
The Mellon Foundation provided an earlier $56,000 planning grant for
this project in 2003.
The new Mellon grant is the second major grant Unsworth has won this
fall. With co-project investigator Beth Sandore, associate university
librarian for information technology planning and policy at Illinois,
he won nearly $3 million over three years from the Library of Congress
to take part in a massive project to save at-risk digital materials
nationwide. Through the grant, the U. of I. Library and the U. of I.
Graduate School of Library and Information Science will take a leadership
role in the National Digital Information Infrastructure and Preservation
Program.