- Accounts
- Test drives are encouraged — required for file upload — getting an account — your account dashboard
- Processing
- uploading files — retrieving results — processed file formats
- Examples
- "before and after" examples
- Limitations
- file sizes — languages & vocab — perfection is impossible — indexing corrected text
- Pricing
- rates and services
- Paper
- technical description of overProof delivered to the DATeCH 2014 conference
- About Project Computing
- who are we?
Accounts
We've worked hard to create overProof as a tool which makes searching and using digitised archives much more effective, but only you can assess its suitability for your particular requirements.
We want to make that assessment as easy as possible, so trial accounts let you test overProof on your own content, obligation-free.
Anyone can try overProof by using the Web demo form, but to upload files of OCR'ed text for processing, you'll need an account.
Get an account by emailing us at [email protected] and starting a discussion with us about:
- your text sources (books, newspapers, ...), formats (plain, ALTO, hOCR, ...) and languages
- your sample text for assessing overProof's effectiveness
- your anticipated volumes and required turn-around times
- any special requirements you may have.
Your account dashboard shows the processing status of your uploaded jobs. You can filter the results by processing status and the contents of the job and comment metadata fields provided when you created the job.
Processing
You need an account to upload files.
Files are uploaded using an HTTP binary transfer to supply the file, HTTP basic authentication to supply your account credentials, and HTTP headers to provide the metadata overProof requires. Each upload is referred to as a job by overProof. You can check on the progress of your jobs using your account's dashboard, which is shown to you when you log in to the overProof website.
The required HTTP headers are:
- X-CUSTOMER - your overProof account name
- X-JOB - an identifier of the file meaningful to you. It could be the name of the file you are uploading. It is important that you can use this identifier to associate the job with content in your system so that you know how to process overProof's output file. Up to 128 characters in length.
- X-COMMENT - a comment you choose to associate with this job. Up to 256 characters in length.
- X-FORMAT - the format of the supplied file. Must be one of: plain, alto or hocr.
- X-COMPRESSED - whether the supplied file has been compressed using gzip. Must be one of: y or n.
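The constraints in the list above can be checked locally before a file is sent. The following is a minimal sketch, not something supplied by overProof: `check_job_headers` is a hypothetical helper name, and it encodes only the limits stated above (128 and 256 characters, and the permitted X-FORMAT and X-COMPRESSED values).

```shell
# Hypothetical helper (not part of overProof): validate header values
# against the documented limits before calling curl.
# Usage: check_job_headers JOB COMMENT FORMAT COMPRESSED
check_job_headers() {
  job=$1; comment=$2; format=$3; compressed=$4

  # X-JOB: up to 128 characters
  if [ "${#job}" -gt 128 ]; then
    echo "X-JOB is longer than 128 characters" >&2
    return 1
  fi

  # X-COMMENT: up to 256 characters
  if [ "${#comment}" -gt 256 ]; then
    echo "X-COMMENT is longer than 256 characters" >&2
    return 1
  fi

  # X-FORMAT: plain, alto or hocr
  case $format in
    plain|alto|hocr) ;;
    *) echo "X-FORMAT must be plain, alto or hocr" >&2; return 1 ;;
  esac

  # X-COMPRESSED: y or n
  case $compressed in
    y|n) ;;
    *) echo "X-COMPRESSED must be y or n" >&2; return 1 ;;
  esac

  return 0
}
```

A script could call `check_job_headers "$f" "my comment" alto y || exit 1` before each upload so that an invalid value is caught without a round trip to the server.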
Here's an example of using curl to upload an ALTO file:
curl --user yourAccount:yourPassword \
  --header 'X-CUSTOMER: yourAccount' \
  --header 'X-JOB: your job description' \
  --header 'X-COMMENT: your job comment' \
  --header 'X-FORMAT: alto' \
  --header 'X-COMPRESSED: y' \
  -i \
  --data-binary '@/some/file/to/upload.gz' \
  http://overproof.projectcomputing.com/supply
You don't have to use curl; you can use any tool or program to generate the simple stream required. Here's a sample stream to upload a file containing two simple lines:
POST /supply HTTP/1.1
Authorization: Basic eW91ckFjY291bnQ6eW91clBhc3N3b3Jk
User-Agent: myDemoProgram
Host: overproof.projectcomputing.com
Accept: */*
X-CUSTOMER: yourAccount
X-JOB: your job description
X-COMMENT: your job comment
X-FORMAT: plain
X-COMPRESSED: n
Content-Length: 18
Content-Type: application/x-www-form-urlencoded

ABC
one two three
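The Authorization and Content-Length values in the sample stream can be reproduced from the shell, which may help when writing your own client. This sketch assumes the common `base64` utility is installed; the function names are illustrative only.

```shell
# Value of the Authorization header for HTTP basic authentication:
# the word "Basic" followed by base64("account:password").
basic_auth_value() {
  printf 'Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}

# Value of the Content-Length header: the size of the upload file in bytes.
content_length() {
  wc -c < "$1" | tr -d ' '
}
```

For the credentials in the sample stream, `basic_auth_value yourAccount yourPassword` prints `Basic eW91ckFjY291bnQ6eW91clBhc3N3b3Jk`. For a file holding the two sample lines, each ending in a newline character, `content_length` prints `18`.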
The response to the upload request will contain a unique job id generated by overProof which you may use to track the job and download the corrected file. Here's a sample response:
OK: 1382587314752121
Format: plain
Compressed: n
Job: your job description
Comment: your job comment
FileLength: 18
id: 1382587314752121
The job id in the above response is 1382587314752121, which may be provided as a parameter to the overProof showRequest URL:
http://overproof.projectcomputing.com/showRequest/{requestId}
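A script can pull the job id out of the upload response and build the showRequest URL from it. The sketch below assumes the response has the shape of the sample above (an `OK:` field followed by the numeric id) and that the id takes the place of `{requestId}` in the URL; both function names are hypothetical.

```shell
# Read an upload response on standard input and print the job id,
# taken from the first "OK: <digits>" field found.
job_id_from_response() {
  sed -n 's/.*OK: *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the showRequest URL for a given job id.
show_request_url() {
  printf 'http://overproof.projectcomputing.com/showRequest/%s\n' "$1"
}
```

For example, the output of the curl upload command could be piped into `job_id_from_response`, and the resulting id stored alongside the file name in your own system.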
Using curl to upload multiple files in a directory
The following shell script will upload all files (presumed to be ALTO files) ending in .gz in the current directory, setting the X-JOB header to the file name, which provides a handy way to associate each file with its overProof job:
for f in *.gz; do
  curl --user yourAccount:yourPassword \
    --header 'X-CUSTOMER: yourAccount' \
    --header "X-JOB: $f" \
    --header 'X-COMMENT: your job comment' \
    --header 'X-FORMAT: alto' \
    --header 'X-COMPRESSED: y' \
    -i --data-binary "@$f" \
    http://overproof.projectcomputing.com/supply
done
Using Java to upload files and retrieve results
Easy-to-use Java classes which upload files and retrieve results are available to account-holders.
Account-holders have access to a web dashboard which displays the processing status of all their requests.
They may also use an HTTP REST API to programmatically retrieve processing status and output files.
ALTO
The ALTO XML standard schema defines the <ALTERNATIVE> element for specifying alternative interpretations of OCR'ed text.
OverProof generates this element in accordance with the standard. For example, given the following snippet of ALTO to correct:
<String ID='S14' CONTENT='extensiveenquiry' HPOS='521' VPOS='831' WIDTH='258' HEIGHT='21' STYLEREFS='TS4' WC='0.94' CC='6 8 8 9 9 9 8 8 8 9 9 9 8 9 9 9'/>
overProof will generate the following ALTO:
<String ID='S14' CONTENT='extensiveenquiry' HPOS='521' VPOS='831' WIDTH='258' HEIGHT='21' STYLEREFS='TS4' WC='0.94' CC='6 8 8 9 9 9 8 8 8 9 9 9 8 9 9 9'>
  <ALTERNATIVE PURPOSE='overProof' CONTENT='extensive enquiry'/>
</String>
As shown, all <ALTERNATIVE> elements generated by overProof contain the standard PURPOSE attribute with a value of overProof.
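For a quick look at what was corrected in a returned ALTO file, the overProof alternatives can be listed with standard text tools. This is a rough sketch that relies on plain text matching and assumes the elements are written as in the example above; a real XML parser is the more robust choice for production use. The function name is hypothetical.

```shell
# Read ALTO on standard input and print the CONTENT value of every
# <ALTERNATIVE> element whose PURPOSE attribute is overProof.
# Text matching only: assumes each ALTERNATIVE start tag sits on one line.
alto_corrections() {
  grep -o '<ALTERNATIVE [^>]*>' |
    grep "PURPOSE=['\"]overProof['\"]" |
    sed -n -e "s/.*CONTENT='\([^']*\)'.*/\1/p" \
           -e 's/.*CONTENT="\([^"]*\)".*/\1/p'
}
```

Run against the corrected snippet above, `alto_corrections < corrected.xml` would print `extensive enquiry` (the file name here is a placeholder).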
hOCR
The hOCR format is based on HTML. OverProof corrects the contents of span elements with a class attribute equal to ocrx_word, and adds a CORRECTION attribute to corrected span elements, for example:
<span class='ocrx_word' id='xword_1_189' title='x_wconf -1' CORRECTION='extensive enquiry'>extensiveenquiry</span>
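To review corrections in a returned hOCR file, each original word can be paired with its correction. As a sketch only, the following uses text matching and assumes spans are written as in the example above (single-quoted attributes, no nested markup inside the span); the function name is hypothetical.

```shell
# Read hOCR on standard input and print "original -> correction" for
# every span that carries a CORRECTION attribute.
hocr_corrections() {
  grep -o "<span [^>]*CORRECTION='[^']*'[^>]*>[^<]*</span>" |
    sed "s#.*CORRECTION='\([^']*\)'[^>]*>\([^<]*\)</span>#\2 -> \1#"
}
```

For the example span above, this prints `extensiveenquiry -> extensive enquiry`.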
Plain text
OverProof generates a plain text file with corrections "in place". For example:
Input | Output |
---|---|
, At 'a banquet giwn inthe Parliannent Hous'o,Mdbourno, to' tho Now South Walos legislEtors who, lately proceeded to Mclbauriie to plaj in a Parlianientorj Crickot NIatch, his Bxcelloiiey Sir Hcnry B. Locli, in rcturning tlianks for tho toast of hie health; said "the future of these ureat : Austrdian colonieswas hound up in ; the far wider and greater in tho namo oi Austral asia | At a banquet given in the Parliament House Melbourne, to the New South Wales legislators who, lately proceeded to Melbourne to play in a Parliamentary Cricket Match, his Excellency Sir Henry B. Loch, in returning thanks for the toast of his health said "the future of these great Australian colonies was bound up in the far wider and greater in the name of Australasia |
Note: for plain text input, incoming "white space" formatting (such as new lines and paragraphs) is not preserved: all emitted words are separated by a single space. Use ALTO or hOCR if you need to preserve formatting.
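To see what this means for a given file, the same kind of white space flattening can be applied locally, for example to compare your original text with overProof's output word by word. This is only an illustration of the effect described in the note, not overProof's own code; the function name is hypothetical.

```shell
# Replace every run of white space (spaces, tabs, new lines) with a
# single space, and drop a leading or trailing space.
collapse_whitespace() {
  tr -s '[:space:]' ' ' | sed -e 's/^ //' -e 's/ $//'
}
```

A file with paragraphs and line breaks passed through `collapse_whitespace` comes out as one long line of space-separated words.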
Examples
Here's a list of sample articles from evaluation dataset 2 (SMH) and evaluation dataset 3 (U.S. Chronicling America). The first link in each line shows the uncorrected OCR and corrected text, with changed words highlighted. The second link shows the article (in NLA's Trove) or page PDF (in LoC's Chronicling America).
Raw OCR and corrected text | Newspaper and date | Article in source corpus |
---|---|---|
Land Sale | Sydney Morning Herald, 14 Apr 1843 | Trove |
To the Editor | Sydney Morning Herald, 11 Jul 1861 | Trove |
Country Works | Sydney Morning Herald, 21 Jul 1864 | Trove |
Water Police Court | Sydney Morning Herald, 31 Mar 1871 | Trove |
Police | Sydney Morning Herald, 17 Oct 1884 | Trove |
Country News | Sydney Morning Herald, 12 Jan 1889 | Trove |
Salvation Army | Sydney Morning Herald, 11 Mar 1899 | Trove |
Antarctic Exploration | Sydney Morning Herald, 20 May 1901 | Trove |
Personal | Sydney Morning Herald, 30 May 1905 | Trove |
Explosion | Sydney Morning Herald, 01 Sep 1920 | Trove |
Quarter Sessions | Sydney Morning Herald, 20 Oct 1923 | Trove |
School Church | Sydney Morning Herald, 07 Apr 1926 | Trove |
Explosion at Ryde | Sydney Morning Herald, 19 Jul 1935 | Trove |
Visit of Americans | Sydney Morning Herald, 23 Jan 1936 | Trove |
Mothers' Day Appeal | Sydney Morning Herald, 28 Apr 1938 | Trove |
Supreme Court Divorce | Sydney Morning Herald, 21 Jan 1939 | Trove |
Bulgarian Reservists | Sydney Morning Herald, 26 Aug 1939 | Trove |
W H Donald Rescue | Sydney Morning Herald, 26 Feb 1945 | Trove |
£60,000 Estate | Sydney Morning Herald, 05 Feb 1952 | Trove |
Political | Cairo Daily Bulletin (Illinois), 21 Jun 1871 | Chronicling America |
Senate Proceedings (partial) | Cairo Daily Bulletin (Illinois), 21 Dec 1871 | Chronicling America |
Death of Samuel Morse | Cairo Daily Bulletin (Illinois), 5 Apr 1872 | Chronicling America |
Maui Notes | The Independent (Honolulu), 25 Jun 1895 | Chronicling America |
300 letters | The Independent (Honolulu), 21 Jun 1896 | Chronicling America |
Ladies of the Cannibal Isles | The Independent (Honolulu), 21 Jul 1896 | Chronicling America |
Great Britain of the East | The Independent (Honolulu), 21 Jun 1897 | Chronicling America |
Passenger Travel | The Independent (Honolulu), 21 Jun 1898 | Chronicling America |
Commencements | The Independent (Honolulu), 21 Jun 1898 | Chronicling America |
Roman mining | Mohave County Miner, 18 Jun 1887 | Chronicling America |
India copper | Mohave County Miner, 17 Jun 1899 | Chronicling America |
General Mining News | Mohave County Miner, 20 Jun 1903 | Chronicling America |
In Solemn Service | San Francisco Call, 1 Jan 1900 | Chronicling America |
Sparrows attack men | The Washington Times, 21 Jun 1921 | Chronicling America |
Moyer Prison Job | The Washington Times, 21 Jun 1921 | Chronicling America |
D.C. Rent Act | The Washington Times, 21 Jun 1921 | Chronicling America |
Details of our evaluation datasets and improvements in recall, false positives and text readability are discussed here.
Limitations
The default compressed plain text, ALTO and hOCR file size limit is 20MB. Let us know if this is inadequate for your intended usage.
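A file's size can be checked before upload. The sketch below assumes that "20MB" means 20 × 1024 × 1024 bytes, which the text above does not specify, and that the limit applies to the file as uploaded (that is, after gzip compression if you compress). The variable and function names are hypothetical.

```shell
# Assumption: 20MB is read as 20 * 1024 * 1024 bytes.
OVERPROOF_MAX_BYTES=$((20 * 1024 * 1024))

# Succeeds if FILE is no larger than the limit.
# Usage: within_size_limit FILE [LIMIT_BYTES]
within_size_limit() {
  limit=${2:-$OVERPROOF_MAX_BYTES}
  size=$(wc -c < "$1" | tr -d ' ')
  [ "$size" -le "$limit" ]
}
```

In an upload loop, `within_size_limit "$f" || continue` would skip any file that exceeds the default limit.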
The current production version of overProof has been trained on a "general" vocabulary of the English language.
However, overProof's architecture is language- and vocabulary-neutral. If you need to correct texts in other languages or specialist vocabularies (such as industrial patents or physical science papers), please contact us: we'd love to work with your data to extend overProof to meet your needs.
Correction can be neither complete nor perfect. Some source materials are of such a poor quality that even expert human correctors can only guess at the original text, even with reference to the image. Some text contains such atypical language that any statistical language approach to analysing it will fail to correct OCR errors within it, and will even introduce new errors: that is, rather than making it better, it will make it worse.
The development of overProof has focused on providing the best correction rate that can be achieved algorithmically, based on an immense training corpus and OCR error analysis, but errors are inevitable.
Hence a prudent approach to indexing text will index both the original and the correction. The annoyances of false positives arising from search hits on uncorrected text will typically be outweighed by improved recall, particularly as the majority of corrected words are unsearched "nonsense" words in their pre-corrected version.
Read more about how we have evaluated overProof's performance here.
Pricing
Pricing varies based on the number of words submitted for correction per month:
Words per month | Cost per month |
---|---|
Less than 1 million | $100 |
Between 1 million - 10 million | $100 + $10 per million words |
Between 10 million - 100 million | $190 + $7 per million words |
Between 100 million - 200 million | $820 + $6 per million words |
Between 200 million - 1 billion | $1420 + $5 per million words |
Over 1 billion | $5420 + $4.80 per million words |
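One way to read the table is that the per-million rate in each row applies only to words above that row's lower bound. This is an assumption, not something the table states, but it is consistent with the base amounts: under this reading each row's base equals the previous row's cost at its upper bound ($100 + 9 × $10 = $190, $190 + 90 × $7 = $820, and so on). The sketch below estimates a monthly cost on that basis; please confirm the interpretation before relying on it. The function name is hypothetical.

```shell
# Estimate the monthly cost in US dollars for a number of words,
# ASSUMING each per-million rate applies to words above the tier's
# lower bound (an inference from the table, not a stated rule).
monthly_cost() {
  awk -v words="$1" 'BEGIN {
    m = words / 1000000                 # millions of words
    if      (m < 1)     cost = 100
    else if (m <= 10)   cost = 100  + 10  * (m - 1)
    else if (m <= 100)  cost = 190  + 7   * (m - 10)
    else if (m <= 200)  cost = 820  + 6   * (m - 100)
    else if (m <= 1000) cost = 1420 + 5   * (m - 200)
    else                cost = 5420 + 4.8 * (m - 1000)
    printf "%.2f\n", cost
  }'
}
```

Under this reading, `monthly_cost 5000000` prints `140.00` and `monthly_cost 100000000` prints `820.00`.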
Real-time processing (very rapid turnaround), and highly secure processing (dedicated customer servers) services are available at additional cost.
Costs are quoted in US dollars, and include GST for Australian customers.
Paper
A peer-reviewed technical description of overProof, delivered to the DATeCH 2014 conference (Madrid, May 19-20, 2014), is available as a PDF. Slides and video of the presentation are available from the digitisation.eu blog entry: DATeCH 3rd Session - Postcorrection.
About Project Computing
We're an Australian-based software house. Over the past 30 years we've developed systems used by large commercial and government organisations around the world.
We've recently been heavily involved with the design and implementation of the immensely popular and award-winning Newspaper digitisation and Trove discovery systems at the National Library of Australia.
We've been designing and implementing large-scale text searching systems since the 1980s. We also have long and deep experience with large mathematical models of the type used by overProof, which we can use to assist you with other text corpus processing such as quality analysis, vocabulary extraction, named entity extraction, preliminary search term extraction, sentiment analysis and visualisation.
You can find out more about us here.