Date: Mon, 8 Jul 2002 08:50:32 -0400 (EDT)
From: Ronald Reck <rreck@iama.rrecktek.com>
To: CORPORA@HD.UIB.NO
Subject: [Corpora-List] string frequency reports for Project Gutenberg texts
Hello all,
I have created string frequency reports for 5400+ books (400M words)
from Project Gutenberg:
http://iama.rrecktek.com/text/frequency/
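For readers new to the idea, a string frequency report of this kind can be
sketched in a few lines of Python. This is only an illustration, not the
code in CVS: the function name string_frequencies and the tokenization rule
(lowercased runs of letters and apostrophes) are assumptions made for the
example.

```python
import re
from collections import Counter


def string_frequencies(text):
    """Count how often each word-like string occurs in text.

    Assumed tokenization (hypothetical, not the archive's actual rule):
    case-folded runs of ASCII letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


sample = "The whale, the white whale! The sea."
report = string_frequencies(sample)

# Print the most frequent strings first, one "count<TAB>string" per line.
for word, count in report.most_common(3):
    print(f"{count}\t{word}")
```

Running the same function over each book and writing out the sorted counts
would give one report per text.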
They are searchable here:
http://iama.rrecktek.com/cgi-bin/apps/wordfind/searchpg.pl
The process is described briefly here, with links to all the source in CVS:
http://iama.rrecktek.com/text/
I am looking for help in improving these graphs of string frequency
histograms across the archive when they are rendered in SVG:
http://iama.rrecktek.com/text/frequency/words/seeall.html
I merged some of the results into an SVG (it's worth the plugin hassle):
http://iama.rrecktek.com/~rreck/samplesvg
I also extended the DAML ontology for PG presented here:
http://www.daml.org/ontologies/113
and created RDF metadata for the archive here:
http://iama.rrecktek.com/text/frequency/meta/
The metadata is loaded into a specialty RDF backend called Parka. This
example query shows how to get RF values for an author's use of certain
strings:
http://iama.rrecktek.com/cgi-bin/apps/parka/parka.pl
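As a rough illustration of what such a query returns, the sketch below
computes per-string values from an author's pooled counts. It assumes that
RF means relative frequency (occurrences of a string divided by the author's
total token count); that reading, the function name relative_frequency, and
the sample counts are all assumptions for the example, and nothing here
reflects how Parka itself is queried.

```python
from collections import Counter


def relative_frequency(counts, strings):
    """Return {string: occurrences / total tokens} for the given strings.

    Assumes RF = relative frequency; strings absent from counts get 0.0.
    """
    total = sum(counts.values())
    if total == 0:
        return {s: 0.0 for s in strings}
    return {s: counts[s] / total for s in strings}


# Hypothetical pooled counts for one author (made-up numbers).
author_counts = Counter({"the": 6, "whale": 3, "sea": 1})
rf = relative_frequency(author_counts, ["whale", "ship"])

for string, value in rf.items():
    print(f"{string}\t{value:.3f}")
```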
Comments and criticisms are very much appreciated.
(I know the PNG graphs aren't labeled well; all will get fixed in the SVGs.)
----
Ronald P. Reck
rreck@iama.rrecktek.com