About and Help
Accounts
Test drives are encouraged — required for file upload — getting an account — your account dashboard
Processing
uploading files — retrieving results — processed file formats
Examples
"before and after" examples
Limitations
file sizes — languages & vocab — perfection is impossible — indexing corrected text
Pricing
rates and services
Paper
technical description of overProof delivered to the DATeCH 2014 conference
About Project Computing
who are we?

Accounts

Test Drives are encouraged

We've worked hard to create overProof as a tool which makes searching and using digitised archives much more effective, but only you can assess its suitability for your particular requirements.

We want to make that assessment as easy as possible, so trial accounts let you test overProof on your own content, obligation-free.

You need an account to upload files

Anyone can try overProof by using the Web demo form, but to upload files of OCR'ed text for processing, you'll need an account.

Get an account by emailing us at [email protected] and starting a discussion with us about:

  • your text sources (books, newspapers, ...), formats (plain, ALTO, hOCR, ...) and languages
  • your sample text for assessing overProof's effectiveness
  • your anticipated volumes and required turn-around times
  • any special requirements you may have.

Your account dashboard

Your account dashboard shows the processing status of your uploaded jobs. You can filter the results by processing status and the contents of the job and comment metadata fields provided when you created the job.

Processing

Uploading files

You need an account to upload files.

Files are uploaded over HTTP: the file is sent as the binary request body, your account credentials are supplied using HTTP basic authentication, and the metadata overProof requires is provided in HTTP headers.

Each upload is referred to as a job by overProof. You can check on the progress of your jobs using your account's dashboard, which is shown when you log in to the overProof website.

The required HTTP headers are:

  • X-CUSTOMER - your overProof account name
  • X-JOB - an identifier of the file meaningful to you. It could be the name of the file you are uploading. It is important that you can use this identifier to associate the job with content in your system, so that you know how to process overProof's output file. Up to 128 characters in length.
  • X-COMMENT - a comment you choose to associate with this job. Up to 256 characters in length.
  • X-FORMAT - the format of the supplied file. Must be one of: plain, alto or hocr.
  • X-COMPRESSED - whether the supplied file has been compressed using gzip. Must be one of: y or n.
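
As a minimal sketch, the header rules above could be checked before an upload is attempted. The names build_headers and ALLOWED_FORMATS are illustrative and not part of overProof, and the length limits are assumed to apply to the character count of each value.

```python
ALLOWED_FORMATS = {"plain", "alto", "hocr"}


def build_headers(customer, job, comment, file_format, compressed):
    """Return the five upload headers after checking the documented limits.

    `compressed` is a bool here; it is mapped to the 'y'/'n' values
    that the X-COMPRESSED header requires.
    """
    if len(job) > 128:
        raise ValueError("X-JOB is limited to 128 characters")
    if len(comment) > 256:
        raise ValueError("X-COMMENT is limited to 256 characters")
    if file_format not in ALLOWED_FORMATS:
        raise ValueError("X-FORMAT must be one of: plain, alto or hocr")
    return {
        "X-CUSTOMER": customer,
        "X-JOB": job,
        "X-COMMENT": comment,
        "X-FORMAT": file_format,
        "X-COMPRESSED": "y" if compressed else "n",
    }
```

Rejecting a bad value locally avoids spending an upload on a request that does not meet the documented rules.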

Here's an example of using curl to upload a gzip-compressed ALTO file:

curl --user yourAccount:yourPassword \
  --header 'X-CUSTOMER: yourAccount' --header 'X-JOB: your job description' \
  --header 'X-COMMENT: your job comment' --header 'X-FORMAT: alto' \
  --header 'X-COMPRESSED: y' -i \
  --data-binary '@/some/file/to/upload.gz' http://overproof.projectcomputing.com/supply

You don't have to use curl; you can use any tool or program to generate the simple stream required. Here's a sample stream to upload a file containing two simple lines:

POST /supply HTTP/1.1
Authorization: Basic eW91ckFjY291bnQ6eW91clBhc3N3b3Jk
User-Agent: myDemoProgram
Host: overproof.projectcomputing.com
Accept: */*
X-CUSTOMER: yourAccount
X-JOB: your job description
X-COMMENT: your job comment
X-FORMAT: plain
X-COMPRESSED: n
Content-Length: 18
Content-Type: application/x-www-form-urlencoded

ABC
one two three
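
The same stream can be produced from a program. The sketch below uses only Python's standard library to build an equivalent request. build_upload_request is a hypothetical helper, not an overProof API, and the request is constructed without being sent so that it can be inspected first.

```python
import base64
import urllib.request

SUPPLY_URL = "http://overproof.projectcomputing.com/supply"


def build_upload_request(account, password, body, job, comment,
                         file_format="plain", compressed="n"):
    """Build (but do not send) an upload request for the given bytes."""
    # HTTP basic authentication: base64 of "account:password".
    token = base64.b64encode(f"{account}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": "Basic " + token,
        "X-CUSTOMER": account,
        "X-JOB": job,
        "X-COMMENT": comment,
        "X-FORMAT": file_format,
        "X-COMPRESSED": compressed,
    }
    # urllib adds Host and Content-Length itself when the request is sent.
    return urllib.request.Request(SUPPLY_URL, data=body, headers=headers, method="POST")


# The two-line file from the sample stream: 18 bytes with LF line endings.
request = build_upload_request(
    "yourAccount", "yourPassword", b"ABC\none two three\n",
    job="your job description", comment="your job comment",
)
```

For these sample credentials the Authorization value matches the one in the stream above, and len(request.data) is 18, matching its Content-Length. Passing the request to urllib.request.urlopen would send it.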

The response to the upload request will contain a unique job id generated by overProof which you may use to track the job and download the corrected file. Here's a sample response:

OK: 1382587314752121
Format: plain
Compressed: n
Job: your job description
Comment: your job comment
FileLength: 18
id: 1382587314752121

The job id in the above response is 1382587314752121. Substitute it for {requestId} in the overProof showRequest URL to view the job:

http://overproof.projectcomputing.com/showRequest/{requestId}
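
A program can pick the job id out of the response and build the showRequest URL from it. This sketch assumes the response body always consists of "name: value" lines as in the sample above; parse_upload_response and show_request_url are illustrative names.

```python
SHOW_REQUEST_URL = "http://overproof.projectcomputing.com/showRequest/{requestId}"


def parse_upload_response(body):
    """Split a response body of 'name: value' lines into a dict."""
    fields = {}
    for line in body.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def show_request_url(job_id):
    """Fill the job id into the showRequest URL template."""
    return SHOW_REQUEST_URL.format(requestId=job_id)


sample = """OK: 1382587314752121
Format: plain
Compressed: n
Job: your job description
Comment: your job comment
FileLength: 18
id: 1382587314752121"""

fields = parse_upload_response(sample)
status_url = show_request_url(fields["id"])
```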

Using curl to upload multiple files in a directory

The following shell script will upload all files ending in .gz in the current directory (presumed to be gzip-compressed ALTO files), setting the X-JOB header to the file name, which provides a handy way to associate each file with its overProof job:

for f in *.gz; do
  curl --user yourAccount:yourPassword \
    --header 'X-CUSTOMER: yourAccount' --header "X-JOB: $f" --header 'X-COMMENT: your job comment' \
    --header 'X-FORMAT: alto' --header 'X-COMPRESSED: y' \
    -i --data-binary "@$f" http://overproof.projectcomputing.com/supply
done

Using Java to upload files and retrieve results

Easy-to-use Java classes which upload files and retrieve results are available to account-holders.

Retrieving processing status and results

Account-holders have access to a web dashboard which displays the processing status of all their requests.

They may also use an HTTP REST API to programmatically retrieve processing status and output files.

Processed file formats
  • ALTO

    The ALTO XML standard schema defines the <ALTERNATIVE> element for specifying alternative interpretations of OCR'ed text.

    OverProof generates this element in accordance with the standard. For example, given the following snippet of ALTO to correct:

    <String ID='S14' CONTENT='extensiveenquiry' HPOS='521' VPOS='831' WIDTH='258' HEIGHT='21'
         STYLEREFS='TS4' WC='0.94' CC='6 8 8 9 9 9 8 8 8 9 9 9 8 9 9 9'/>
    

    overProof will generate the following ALTO:

    <String ID='S14' CONTENT='extensiveenquiry' HPOS='521' VPOS='831' WIDTH='258' HEIGHT='21'
         STYLEREFS='TS4' WC='0.94' CC='6 8 8 9 9 9 8 8 8 9 9 9 8 9 9 9'>
         <ALTERNATIVE PURPOSE='overProof' CONTENT='extensive enquiry'/>
    </String>

    As shown, all <ALTERNATIVE> elements generated by overProof contain the standard PURPOSE attribute with a value of overProof.

  • hOCR

    The hOCR format is based on HTML. OverProof corrects the contents of span elements with a class attribute equal to ocrx_word, and adds a CORRECTION attribute to corrected span elements, for example:

    <span class='ocrx_word' id='xword_1_189' title='x_wconf -1'
        CORRECTION='extensive enquiry'>extensiveenquiry</span>
    
  • Plain text

    OverProof generates a plain text file with corrections "in place". For example:

    Input:
    , At 'a banquet giwn inthe Parliannent
    Hous'o,Mdbourno, to' tho Now South
    Walos legislEtors who, lately proceeded
    to Mclbauriie to plaj in a Parlianientorj
    Crickot NIatch, his Bxcelloiiey Sir
    Hcnry B. Locli, in rcturning tlianks for
    tho toast of hie health; said "the future
    of these ureat : Austrdian colonieswas
    hound up in ; the far wider and greater
    in tho namo oi Austral asia

    Output:
    At a banquet given in the Parliament
    House Melbourne, to the New South
    Wales legislators who, lately proceeded
    to Melbourne to play in a Parliamentary
    Cricket Match, his Excellency Sir
    Henry B. Loch, in returning thanks for
    the toast of his health said "the future
    of these great Australian colonies was
    bound up in the far wider and greater
    in the name of Australasia
    

    Note: for plain text input, incoming "white space" formatting (such as new lines and paragraphs) is not preserved: all emitted words are separated by a single space. Use ALTO or hOCR if you need to preserve formatting.
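
Corrections can be read back out of processed ALTO and hOCR files with standard parsers. The sketch below assumes the attribute layout shown in the snippets above; alto_corrections and hocr_corrections are illustrative names, not part of overProof. ALTO files normally declare an XML namespace, so elements are matched on their local tag name.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def _local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def alto_corrections(alto_xml):
    """Return (original, corrected) pairs from overProof ALTERNATIVE elements."""
    pairs = []
    for element in ET.fromstring(alto_xml).iter():
        if _local_name(element.tag) != "String":
            continue
        for child in element:
            if (_local_name(child.tag) == "ALTERNATIVE"
                    and child.get("PURPOSE") == "overProof"):
                pairs.append((element.get("CONTENT"), child.get("CONTENT")))
    return pairs


class _WordCollector(HTMLParser):
    """Collect ocrx_word spans that carry a CORRECTION attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._correction = None
        self._text = None  # list of text chunks while inside a word span

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("class") == "ocrx_word":
            # HTMLParser lowercases attribute names, so CORRECTION
            # arrives here as "correction".
            self._correction = attributes.get("correction")
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._text is not None:
            if self._correction is not None:
                self.pairs.append(("".join(self._text), self._correction))
            self._text = None
            self._correction = None


def hocr_corrections(hocr_html):
    """Return (original, corrected) pairs from corrected hOCR word spans."""
    collector = _WordCollector()
    collector.feed(hocr_html)
    collector.close()
    return collector.pairs
```

Both functions return the uncorrected word alongside its correction, so either form can be used downstream.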

Examples

Here's a list of sample articles from evaluation dataset 2 (SMH) and evaluation dataset 3 (U.S. Chronicling America). The first link in each line shows the uncorrected OCR and corrected text, with changed words highlighted. The second link shows the article (in NLA's Trove) or page PDF (in LoC's Chronicling America).

Raw OCR and corrected text | Article in source corpus
Land Sale | Sydney Morning Herald, 14 Apr 1843 (Trove)
To the Editor | Sydney Morning Herald, 11 Jul 1861 (Trove)
Country Works | Sydney Morning Herald, 21 Jul 1864 (Trove)
Water Police Court | Sydney Morning Herald, 31 Mar 1871 (Trove)
Police | Sydney Morning Herald, 17 Oct 1884 (Trove)
Country News | Sydney Morning Herald, 12 Jan 1889 (Trove)
Salvation Army | Sydney Morning Herald, 11 Mar 1899 (Trove)
Antarctic Exploration | Sydney Morning Herald, 20 May 1901 (Trove)
Personal | Sydney Morning Herald, 30 May 1905 (Trove)
Explosion | Sydney Morning Herald, 01 Sep 1920 (Trove)
Quarter Sessions | Sydney Morning Herald, 20 Oct 1923 (Trove)
School Church | Sydney Morning Herald, 07 Apr 1926 (Trove)
Explosion at Ryde | Sydney Morning Herald, 19 Jul 1935 (Trove)
Visit of Americans | Sydney Morning Herald, 23 Jan 1936 (Trove)
Mothers' Day Appeal | Sydney Morning Herald, 28 Apr 1938 (Trove)
Supreme Court Divorce | Sydney Morning Herald, 21 Jan 1939 (Trove)
Bulgarian Reservists | Sydney Morning Herald, 26 Aug 1939 (Trove)
W H Donald Rescue | Sydney Morning Herald, 26 Feb 1945 (Trove)
£60,000 Estate | Sydney Morning Herald, 05 Feb 1952 (Trove)
Political | Cairo Daily Bulletin (Illinois), 21 Jun 1871 (Chronicling America)
Senate Proceedings (partial) | Cairo Daily Bulletin (Illinois), 21 Dec 1871 (Chronicling America)
Death of Samuel Morse | Cairo Daily Bulletin (Illinois), 5 Apr 1872 (Chronicling America)
Maui Notes | The Independent (Honolulu), 25 Jun 1895 (Chronicling America)
300 letters | The Independent (Honolulu), 21 Jun 1896 (Chronicling America)
Ladies of the Cannibal Isles | The Independent (Honolulu), 21 Jul 1896 (Chronicling America)
Great Britain of the East | The Independent (Honolulu), 21 Jun 1897 (Chronicling America)
Passenger Travel | The Independent (Honolulu), 21 Jun 1898 (Chronicling America)
Commencements | The Independent (Honolulu), 21 Jun 1898 (Chronicling America)
Roman mining | Mohave County Miner, 18 Jun 1887 (Chronicling America)
India copper | Mohave County Miner, 17 Jun 1899 (Chronicling America)
General Mining News | Mohave County Miner, 20 Jun 1903 (Chronicling America)
In Solemn Service | San Francisco Call, 1 Jan 1900 (Chronicling America)
Sparrows attack men | The Washington Times, 21 Jun 1921 (Chronicling America)
Moyer Prison Job | The Washington Times, 21 Jun 1921 (Chronicling America)
D.C. Rent Act | The Washington Times, 21 Jun 1921 (Chronicling America)

Details of our evaluation datasets and improvements in recall, false positives and text readability are discussed here.

Limitations

File size

The default compressed plain text, ALTO and hOCR file size limit is 20MB. Let us know if this is inadequate for your intended usage.
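
A client can compress a file and check it against this limit before uploading. The sketch below is illustrative only: compress_for_upload is a hypothetical helper, and "20MB" is read conservatively as 20 million bytes, which is an assumption about how the limit is measured.

```python
import gzip

# Assumption: "20MB" is taken as 20,000,000 bytes (the stricter reading).
SIZE_LIMIT_BYTES = 20 * 1000 * 1000


def compress_for_upload(raw):
    """Gzip the file contents and refuse anything over the default limit.

    The result is what would be sent with the header X-COMPRESSED: y.
    """
    compressed = gzip.compress(raw)
    if len(compressed) > SIZE_LIMIT_BYTES:
        raise ValueError(
            f"compressed size {len(compressed)} bytes exceeds the default limit"
        )
    return compressed


payload = compress_for_upload(b"ABC\none two three\n")
```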

Languages and vocabularies

The current production version of overProof has been trained on a "general" vocabulary of the English language.

However, overProof's architecture is language- and vocabulary-neutral. If you need to correct texts in other languages or specialist vocabularies (such as industrial patents or physical science papers), please contact us: we'd love to work with your data to extend overProof to meet your needs.

The impossibility of perfect correction, and implications for indexing corrected text

Correction can be neither complete nor perfect. Some source materials are of such a poor quality that even expert human correctors can only guess at the original text, even with reference to the image. Some text contains such atypical language that any statistical language approach to analysing it will fail to correct OCR errors within it, and will even introduce new errors: that is, rather than making it better, it will make it worse.

The development of overProof has focused on achieving the best correction rate possible algorithmically, drawing on an immense corpus of training text and OCR error analysis, but errors are inevitable.

Hence a prudent approach to indexing text will index both the original and the correction. The annoyances of false positives arising from search hits on uncorrected text will typically be outweighed by improved recall, particularly as the majority of corrected words are unsearched "nonsense" words in their pre-corrected version.
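
As a toy illustration of indexing both versions, the sketch below adds the tokens of the original and the corrected text to the same inverted index. The tokenisation is deliberately simplistic and all names are hypothetical; a real search engine would do this with its own analysers and fields.

```python
import string
from collections import defaultdict


def tokens(text):
    """Lowercase, split on white space and trim surrounding punctuation."""
    for word in text.lower().split():
        word = word.strip(string.punctuation)
        if word:
            yield word


def index_document(index, doc_id, original, corrected):
    """Record the document under every token of both text versions."""
    for text in (original, corrected):
        for token in tokens(text):
            index[token].add(doc_id)


index = defaultdict(set)
index_document(
    index,
    "article-1",
    original="Crickot NIatch, his Bxcelloiiey Sir",
    corrected="Cricket Match, his Excellency Sir",
)
```

A search for "cricket" now finds the article through the corrected text, while a search for the raw OCR token "crickot" still finds it through the original.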

Read more about how we have evaluated overProof's performance here.

Pricing

Pricing varies based on the number of words submitted for correction per month:

Words per month | Cost per month
Less than 1 million | $100
Between 1 million and 10 million | $100 + $10 per million words
Between 10 million and 100 million | $190 + $7 per million words
Between 100 million and 200 million | $820 + $6 per million words
Between 200 million and 1 billion | $1420 + $5 per million words
Over 1 billion | $5420 + $4.80 per million words
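
The table can be turned into a small estimator. This is a sketch under an assumption the table does not spell out: that each per-million rate applies only to words above the tier's lower bound, charged pro rata. That reading is consistent with the figures, since each tier's base fee equals the previous tier's cost at its upper bound (for example $100 + 9 x $10 = $190 at 10 million words). monthly_cost and TIERS are illustrative names.

```python
from decimal import Decimal

# (tier lower bound in words, base fee in USD,
#  USD per million words above the lower bound)
TIERS = [
    (0, Decimal("100"), Decimal("0")),
    (1_000_000, Decimal("100"), Decimal("10")),
    (10_000_000, Decimal("190"), Decimal("7")),
    (100_000_000, Decimal("820"), Decimal("6")),
    (200_000_000, Decimal("1420"), Decimal("5")),
    (1_000_000_000, Decimal("5420"), Decimal("4.80")),
]


def monthly_cost(words):
    """Estimate the monthly cost in USD for a given number of words."""
    lower, base, rate = TIERS[0]
    for tier in TIERS:
        # Keep the last tier whose lower bound has been reached.
        if words >= tier[0]:
            lower, base, rate = tier
    return base + rate * Decimal(words - lower) / Decimal(1_000_000)
```

Under this reading, 5.5 million words in a month would cost $100 + 4.5 x $10 = $145. Confirm the exact calculation with Project Computing before relying on it.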

Real-time processing (very rapid turnaround), and highly secure processing (dedicated customer servers) services are available at additional cost.

Costs are quoted in US dollars, and include GST for Australian customers.

Paper

A peer-reviewed technical description of overProof, delivered to the DATeCH 2014 conference (Madrid, May 19-20 2014), is available as a PDF. Slides and video of the presentation are available from the digitisation.eu blog entry: DATeCH 3rd Session - Postcorrection.

About Project Computing

Project Computing Pty Ltd - since 1983

We're an Australian-based software house. Over the past 30 years we've developed systems used by large commercial and government organisations around the world.

We've recently been heavily involved with the design and implementation of the immensely popular and award-winning Newspaper digitisation and Trove discovery systems at the National Library of Australia.

We specialise in large text processing systems and big data

We've been designing and implementing large-scale text searching systems since the 1980s. We also have long and deep experience with large mathematical models of the type used by overProof, which we can use to assist you with other text corpus processing such as quality analysis, vocabulary extraction, named entity extraction, preliminary search term extraction, sentiment analysis and visualisation.

You can find out more about us here.