RSS preview of the HackerNoon blog

How I Cut Agentic Workflow Latency by 3-5x Without Increasing Model Costs

2025-08-19 05:46:21

“The first time I built an agentic workflow, it was like watching magic. That is, until it took 38 seconds to answer a simple customer query and cost me $1.12 per request.”

When you start building agentic workflows, where autonomous agents plan and act across multi-step processes, it’s easy to get carried away. The flexibility is incredible! But so is the overhead that comes with it: slow execution, high compute usage, and a mess of moving parts.

The middle ground of agentic workflows is where the performance problems, and the best optimization opportunities, usually show up.

Over the last year, I’ve learned how to make these systems dramatically faster and more cost-efficient without sacrificing their flexibility, so I decided to turn those lessons into this playbook.

:::tip Before I talk optimization, here’s what I mean when I use the following terms:

  • Workflows: Predetermined sequences that may or may not use an LLM at all.
  • Agents: Self-directing; they decide which steps to take and in what order to execute them.
  • Agentic Workflows: A hybrid where you set a general path but give the agents in your workflow the freedom to decide how to move within certain steps.

:::
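To make the distinction concrete, here’s a minimal sketch of the three shapes. This is my own illustration, not code from the article; the `llm()` and `run_tool()` helpers are placeholders you’d wire to a real model and real tools.

```python
# Hypothetical placeholders for illustration only.
def llm(prompt: str) -> str:
    return "ANSWER This is a placeholder reply."   # stand-in for a real model call

def run_tool(name: str, arg: str) -> str:
    return f"{name}({arg}) -> ok"                  # stand-in for a real tool/API call

# Workflow: a predetermined sequence; the code decides every step.
def workflow(ticket: str) -> str:
    status = run_tool("order_lookup", ticket)
    return llm(f"Draft a reply given this order status: {status}")

# Agent: the model decides which step comes next, in a loop.
def agent(ticket: str, max_turns: int = 5) -> str:
    context = ticket
    for _ in range(max_turns):
        decision = llm(
            f"Given:\n{context}\nReply with 'TOOL <name> <arg>' or 'ANSWER <text>'."
        )
        if decision.startswith("ANSWER"):
            return decision.removeprefix("ANSWER ").strip()
        _, name, arg = decision.split(maxsplit=2)
        context += f"\n{name} -> {run_tool(name, arg)}"
    return "Escalating to a human."

# Agentic workflow: a fixed outer path, with agent-style freedom inside one step.
def agentic_workflow(ticket: str) -> str:
    triage = llm(f"Classify this ticket as 'order' or 'other': {ticket}")    # fixed step
    return agent(ticket) if "order" in triage.lower() else workflow(ticket)  # bounded freedom
```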



Trim the Step Count

Something everyone needs to keep in mind while designing agentic workflows is that every model call adds latency. Every extra hop is another chance for a timeout. Every extra hop also increases the chance of hallucinations, which leads to decisions that stray from the main objective.

The guidelines here are simple:

  • Merge related steps into a single prompt
  • Avoid unnecessary micro-decisions that a single model could handle in one go
  • Design to minimize round-trips

There’s always a fine balance in this phase of design, and the process should always start with the fewest possible steps. When I design a workflow, I always start with a single agent (because maybe we don’t need a workflow at all) and then evaluate it against the metrics and checks I have in place.

Based on where it fails, I decompose the parts whose evaluation scores didn’t meet the minimum criteria, and iterate from there. Eventually I hit the point of diminishing returns, much like the elbow method in clustering, and fix my step count accordingly.
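To make step-merging concrete, here’s a minimal sketch of collapsing two round-trips into one. The `llm()` helper, the JSON schema, and the ticket fields are assumptions for illustration, not the exact workflow from the article; wire the call to whatever client you actually use.

```python
import json

# Hypothetical helper: send a prompt to your model/provider and return the text completion.
def llm(prompt: str, max_tokens: int = 256) -> str:
    raise NotImplementedError("wire this up to your model client")

# Before: two round-trips (classify, then extract) -- two chances to time out.
def handle_ticket_two_hops(ticket: str) -> dict:
    intent = llm(f"Classify the intent of this support ticket in one word:\n{ticket}")
    order_id = llm(f"Extract the order ID from this ticket, or reply 'none':\n{ticket}")
    return {"intent": intent.strip(), "order_id": order_id.strip()}

# After: one merged prompt that returns both fields as JSON -- one round-trip.
def handle_ticket_one_hop(ticket: str) -> dict:
    prompt = (
        "Read the support ticket and return JSON with exactly these keys:\n"
        '{"intent": "<one word>", "order_id": "<id or null>"}\n\n'
        f"Ticket:\n{ticket}"
    )
    return json.loads(llm(prompt))
```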

Parallelize Anything That Doesn’t Have Dependencies

Building on the point above, sequential chains are latency traps, too. If two tasks don’t need each other’s output, run them together!

As an example, I wanted to mimic a customer support agentic workflow that helps a customer get their order status, analyzes the sentiment of the request, and generates a response. I started off with a sequential approach, but then realized that getting the order status and analyzing the sentiment of the request do not depend on each other. Sure, they might be correlated, but that doesn’t change the action I’m trying to take.

Once I had these two responses, I would feed the order status and the detected sentiment to the response generator, and that alone shaved the total time taken from 12 seconds to 5.
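Here’s a minimal sketch of that fan-out with `asyncio.gather`; the three helpers are placeholders (my assumptions) for the order lookup, the sentiment call, and the response generator.

```python
import asyncio

# Placeholder helpers: in practice these wrap your order API and your model calls.
async def get_order_status(order_id: str) -> str:
    return "shipped"          # stand-in for an order-management API call

async def analyze_sentiment(message: str) -> str:
    return "frustrated"       # stand-in for a small sentiment model/LLM call

async def generate_response(status: str, sentiment: str, message: str) -> str:
    return f"[{sentiment}] Your order is {status}."  # stand-in for the response LLM

async def handle_request(order_id: str, message: str) -> str:
    # Order lookup and sentiment analysis don't need each other's output,
    # so fan them out concurrently instead of chaining them.
    status, sentiment = await asyncio.gather(
        get_order_status(order_id),
        analyze_sentiment(message),
    )
    # Only the final step depends on both results.
    return await generate_response(status, sentiment, message)

# print(asyncio.run(handle_request("A-1042", "Where is my order? It's been two weeks.")))
```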

Cut Unnecessary Model Calls

We’ve all seen the posts online about how ChatGPT can get a little iffy when it comes to math. That’s a good reminder that these models were not built for it. Yes, they might get it right 99% of the time, but why leave that to chance?

Also, if we know the kind of calculation that needs to take place, why not just code it into a deterministic function, instead of having an LLM figure it out on its own? If a rule, regex, or small function can do it, skip the LLM call. This shift eliminates needless latency, reduces token costs, and increases reliability all in one go.
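A minimal sketch of the deterministic-first, LLM-as-fallback pattern; the order-ID format, the refund rule, and the `llm` callable are assumptions for the example, not anything prescribed by the article.

```python
import re

ORDER_ID_RE = re.compile(r"\b[A-Z]-\d{4}\b")  # assumed order-ID format for this example

def extract_order_id(message: str) -> str | None:
    """Deterministic extraction: no tokens, no extra latency, no hallucinations."""
    match = ORDER_ID_RE.search(message)
    return match.group(0) if match else None

def refund_amount(item_price: float, qty: int, restocking_fee: float = 0.10) -> float:
    """A known business rule is plain arithmetic, not an LLM call."""
    return round(item_price * qty * (1 - restocking_fee), 2)

def get_order_id(message: str, llm) -> str | None:
    # Only fall back to the model when the cheap deterministic path fails.
    found = extract_order_id(message)
    if found:
        return found
    answer = llm(f"Extract the order ID from this message, or reply 'none':\n{message}").strip()
    return None if answer.lower() == "none" else answer
```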

Match The Model To The Task

“Not every task is built the same” is a fundamental principle of task management and productivity: tasks vary in their nature, demands, and importance. In the same way, we need to make sure we’re assigning the right tasks to the right model. Models now come in different flavors and sizes, and we don’t need a Llama 405B model for a simple classification or entity extraction task; an 8B model is more than enough.

It is common these days to see people designing their agentic workflows around the biggest, baddest model that’s just come out, but that comes at the cost of latency. The bigger the model, the more compute it requires, and hence the higher the latency. Sure, you could host it on a larger instance and get away with it, but that comes at a cost, literally.

Instead, when I design a workflow, I again start with the smallest model. My go-to is Llama 3.1 8B, which has proven to be a faithful warrior for decomposed tasks. I start by having all my agents use the 8B model and then decide whether I need a bigger model or, if the task is simple enough, maybe even a smaller one.

Sizes aside, there has been a lot of tribal knowledge about which flavors of LLMs do better at each task, and that’s another consideration to take into account, depending on the type of task you’re trying to accomplish.
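A sketch of what that routing can look like; the task names and model identifiers below are assumptions for illustration, not a prescription.

```python
# A minimal routing table. The model names are the sizes I reach for (an assumption);
# swap in whatever your stack actually serves.
MODEL_FOR_TASK = {
    "classification": "llama-3.1-8b-instruct",
    "entity_extraction": "llama-3.1-8b-instruct",
    "summarization": "llama-3.1-70b-instruct",
    "multi_step_planning": "llama-3.1-405b-instruct",
}

DEFAULT_MODEL = "llama-3.1-8b-instruct"  # start small; escalate only when evals fail

def pick_model(task_type: str) -> str:
    return MODEL_FOR_TASK.get(task_type, DEFAULT_MODEL)

# Example: the classifier agent never touches the 405B model.
# reply = client.generate(model=pick_model("classification"), prompt=prompt)  # hypothetical client
```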

Rethinking Your Prompt

It’s common knowledge by now: as we go through our evaluations, we tend to add more and more guardrails to the LLM’s prompt. This inflates the prompt and, in turn, the latency. There are various methods for building effective prompts that I won’t get into in this article, but the ones that did the most to reduce my round-trip response time were:

  • Prompt caching for static instructions and schemas
  • Adding dynamic context at the end of the prompt for better cache reuse
  • Setting clear response length limits so the model doesn’t eat up time giving me unnecessary information
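Here’s a rough sketch of that prompt layout; the message format, the `client.chat` call, and the schema are placeholders, since the exact prompt-caching mechanics differ by provider.

```python
# Keep the static instructions and schema in a stable prefix so provider-side
# prompt/prefix caching can reuse it; append only the per-request context at the end.
STATIC_SYSTEM_PROMPT = (
    "You are a customer support assistant.\n"
    "Answer in at most three sentences.\n"
    'Return JSON: {"reply": "<text>", "escalate": true or false}\n'
)

def build_messages(customer_context: str, question: str) -> list[dict]:
    return [
        # Static, cache-friendly prefix (identical across requests).
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        # Dynamic context goes last so the shared prefix stays byte-identical.
        {"role": "user", "content": f"Context:\n{customer_context}\n\nQuestion:\n{question}"},
    ]

# Hypothetical call; provider APIs and caching knobs vary.
# reply = client.chat(messages=build_messages(ctx, question), max_tokens=150)  # hard length cap
```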

Cache Everything

In a previous section, I talked about prompt caching, but that shouldn’t be where your caching efforts stop. Caching isn’t just for final answers; it should be applied wherever applicable. While optimizing certain expensive tool calls, I cached both intermediate and final results.

You can even implement KV caches for partial attention states and, of course, cache any session-specific data like customer details or sensor states. With these caching strategies in place, I was able to slash the latency of repeated work by 40-70%.
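As an illustration of caching expensive tool calls, here’s a minimal in-memory TTL cache. The decorator and the `fetch_customer_profile` tool are my own sketch, not from the original setup; in production you’d likely reach for Redis or your framework’s built-in cache.

```python
import time
from functools import wraps

def ttl_cache(seconds: int = 300):
    """Tiny in-memory TTL cache for expensive, repeatable tool calls (illustrative only)."""
    def decorator(fn):
        store = {}  # maps args -> (timestamp, result)

        @wraps(fn)
        def wrapper(*args):
            now = time.time()
            hit = store.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]                 # cache hit: skip the slow call entirely
            result = fn(*args)
            store[args] = (now, result)
            return result
        return wrapper
    return decorator

@ttl_cache(seconds=600)
def fetch_customer_profile(customer_id: str) -> dict:
    # Placeholder for an expensive CRM lookup or tool call.
    time.sleep(1.5)
    return {"id": customer_id, "tier": "gold"}

# The first call pays the 1.5 s; repeats within 10 minutes return instantly.
```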

Speculative Decoding

Here’s one for the advanced crowd: use a small “draft” model to guess the next few tokens quickly and then have a larger model validate or correct them in parallel. A lot of the bigger infrastructure companies out there that promise faster inference do this behind the scenes, so you might as well utilize it to push your latency down further.
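For intuition, here’s a toy, greedy version of the idea; `draft_next` and `target_next` are assumed callables, and a real implementation verifies all draft positions in a single batched forward pass of the big model (and uses rejection sampling) rather than this sequential check.

```python
# Toy, greedy sketch of speculative decoding. Real serving stacks batch the verification;
# this loop only illustrates the accept/correct logic.
def speculative_step(tokens: list[int], draft_next, target_next, k: int = 4) -> list[int]:
    # 1) The cheap draft model proposes k tokens.
    proposal = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) The big model checks the proposals position by position.
    accepted = []
    ctx = list(tokens)
    for t in proposal:
        verified = target_next(ctx)
        if verified == t:
            accepted.append(t)         # agreement: keep the cheap token
            ctx.append(t)
        else:
            accepted.append(verified)  # first disagreement: take the big model's token
            break
    else:
        accepted.append(target_next(ctx))  # all k accepted: big model adds a bonus token

    return tokens + accepted
```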

Save Fine-Tuning For Last - and Do It Strategically

Fine-tuning is something a lot of people talked about in the early days, but some of the newer adopters of LLMs don’t seem to know why or when to use it. Look it up and you’ll see that it’s a way to have your LLM understand your domain and/or your task in more detail, but how does this help latency?

Well, this is something not a lot of people talk about, and there’s a reason I save this optimization for last. When you fine-tune an LLM for a task, the prompt required at inference is considerably smaller than what you would need otherwise, because much of what you used to put in the prompt is now baked into the weights through the fine-tuning process.

This, in turn, feeds directly into the earlier point about reducing prompt length, and hence into latency gains.

Monitor Relentlessly

This is the most important step I took when trying to reduce latency. It sets the groundwork for every optimization listed above and gives you clarity on what works and what doesn’t. Here are some of the metrics I used:

  • Time to First Token (TTFT)
  • Tokens Per Second (TPS)
  • Routing Accuracy
  • Cache Hit Rate
  • Multi-agent Coordination Time

These metrics tell you where to optimize and when; without them, you’re flying blind.
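Here’s a small sketch of how the first two metrics can be captured from a streaming response. The `stream` iterable is an assumption about your client, and chunks are used as a rough proxy for tokens.

```python
import time

def measure_stream(stream) -> dict:
    """Wrap any chunk iterator from an LLM client and record TTFT and throughput."""
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    text = []

    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()   # time to first token
        n_chunks += 1
        text.append(chunk)

    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    duration = max(end - (first_token_at or start), 1e-9)
    return {
        "ttft_s": round(ttft, 3),
        "chunks_per_s": round(n_chunks / duration, 1),  # proxy for tokens per second
        "output": "".join(text),
    }

# stats = measure_stream(client.stream(prompt))  # log alongside cache hit rate, routing accuracy, etc.
```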


Bottom Line

The fastest, most reliable agentic workflows don’t just happen. They are the result of ruthless step-cutting, smart parallelization, deterministic code, model right-sizing, and caching everywhere it makes sense. Do this, evaluate your results, and you’ll find that 3-5x speed improvements (and probably major cost savings) are absolutely within reach.

"She Never Existed"

2025-08-19 04:57:31

It began with a photo on his feed — not glossy, not curated, just… there. A girl mid-laugh, hair whipped by the kind of wind that smells of rain before it falls. She held a chipped mug of tea, steam unraveling upward. The caption was a simple confession: “Bitterness reminds me I’m alive.” He liked it without thinking. She replied as though she’d been waiting for him.

Her name was Aanya. Pune-born, she claimed, with a job in digital marketing, she spoke about it as if it were a necessary evil. She loved indie music, loathed coriander. Her voice notes carried the unhurried weight of someone who refused to rush for the world. She remembered the smallest details of his life — the week his manager blindsided him, the night he couldn’t sleep. She sent playlists that mapped his moods before he even named them.

They never video-called. “I hate cameras,” she’d said. “They show too much.” He didn’t push. The mystery suited him. It made her seem like a story that might vanish if examined too closely.

Three months in, he no longer noticed the gap between them. Her words filled rooms he didn’t realize had gone empty. She spoke of a seaside town she ached for — fried fish curling in hot oil, the tide clapping like an audience that never tired. He promised to take her there.

So, he booked the trip.

Something shifted. She hesitated. Said her work was peaking. Said disappearing from her socials would strangle her metrics. “Metrics” caught in his head like a burr. She’d never sounded so… brand-aware.

He searched her profile again. It had grown — followers multiplying, captions sharpened, engagement immaculate. But her photographs were too precise now. Light fell at identical angles. Her smile’s edges never wandered. He ran one through a reverse image search.

It wasn’t her. Or rather, it was no one. The face belonged to a public dataset meant to teach machines what “human” looked like.

The café she swore she’d claimed as her sanctuary was a stock image. The seaside town? Footage from a travel B-roll library. The warm laugh that had knotted something inside him came from a free sound archive, clipped neatly at the end.

The chat thread on his phone felt different now. Each message still there, still hers, but calcified — artifacts from a world that had only pretended to breathe. The final one was a lone heart emoji, still pulsing blue, like a small machine unaware its power source was gone.

He deleted her number.

The algorithm noticed the absence and reshaped his world. No more tea-stained jokes. No songs that seemed to understand him. Only ads: therapy apps, dating platforms, other faceless promises.

A week later, he saw her again — not her name, but her shape. Same eyes, same practiced windblown hair, now selling organic skincare. He scrolled past without pausing.

But the silence left in her wake wasn’t clean. It clung. It whispered through the moments between real conversations. It curled around the way he now doubted a stranger’s kindness. He stopped sending voice notes to anyone.

\ SHE NEVER EXISTED!

In the first place.

Anyone can chat with a girl who never existed, and the worst thing is… it’s not a human on the other side, just an algorithm and a lot of mental illness :)


Specialist or Generalist?

2025-08-19 04:43:42

Two quotes crashed into my brain this evening and started a brawl, revealing a tension that defines the modern creator's journey:

“A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.” — Robert Heinlein

And then, this:

“It is a foolish dominant assumption to simply say we’ll do it all. It’s an unsustainable position and certainly not the way to become distinctive. I think that we have to begin to create some space to explore what is essential.” — Greg McKeown

\ If you’re a creator, this probably feels like a tug-of-war you know all too well.

For New Creators: You’re just starting, juggling ideas like “Should I be a photographer, a writer, a coach, or all three?” Everyone’s yelling “pick a niche!” while you’re thinking: “But I love photography AND writing AND teaching… what if I pick the wrong one and waste my time?” You’re busy trying everything, but the income isn’t following.

For Proven Creators (10K+ followers): You’ve found your footing, built an audience, and you’re still pulled in a million directions. You’ve got amazing content on Instagram, products on Gumroad, services on Calendly, and a community on Discord. You’re constantly creating, engaging, and selling across platforms.

Yet, despite all that activity, you’re leaving money on the table, feeling the friction of disconnected systems, and wondering why your growth isn’t compounding. You’re busy, but not as profitable as you could be.

Here’s the truth for both of you: You’re asking the wrong question about “what” to focus on, and that’s why you’re busy but broke.

The “Pick Your Passion” Scam


Let’s debunk the biggest lie in career advice: “Follow your passion.”

This assumes you have one lifelong passion waiting to be unearthed. But humans don’t work that way.

Does Lionel Messi have one favorite shoe? Does Beyoncé have one favorite outfit? Do I have one favorite SpongeBob episode?

When you explore the world, you discover many things you love. If you have only one favorite song, you’re not listening to enough music. If you have only one favorite dish, you haven’t eaten enough food.

The “follow your passion” crowd wants you to pick a color before seeing the full spectrum.

Here’s what they don’t tell you: Passion without competence is just expensive daydreaming. Loving basketball at 4 feet tall doesn’t make you NBA material. Adoring music doesn’t make you a musician. Dreaming of writing doesn’t make you a writer.

You can’t monetize what you’re passionate about if you suck at it.

Exploration → Exploitation


Naval nailed it: Early in your career, you explore. Later, you exploit.

Most people get this backwards. They try to exploit before they’ve explored enough. They pick a lane before they know what roads exist.

As a creator, your job isn’t to find your passion. It’s to build competence—where your skills meet what the world needs.

Competence = Talent + Market

To be competent means you’re good at something people actually want. Not just something you love. Not just something you’re naturally good at. Something at the intersection of what you do well and what solves real problems.

Ask yourself:

  • What are you naturally good at? (Be honest—your mom’s opinion doesn’t count.)
  • What problems does this solve for people?
  • Will people pay for this solution?

Start there. Build competence first. Let passion follow.

The Synergy Secret


Think of your creator journey like building a house. As a beginner, you’re laying the foundation—testing skills, finding what works, building competence. As you grow, you’re furnishing the house, connecting every room (content, products, services) so it feels like one cohesive home. A unified system is your front door, inviting your audience in and guiding them effortlessly to everything you offer.

Take Canva. They don’t have fewer features than Photoshop. They have features that all serve one goal: making visual content creation simple for non-designers. Every tool, template, and sharing option works together toward that mission.

Your to-do list works the same way. One task reminds you of others. Completing one strengthens the whole system, compared to storing random tasks in your head. It’s not minimalism for minimalism’s sake—it’s strategic integration.

The Cost of Being a Jack-of-All-Trades


Visit Google.com and Yahoo.com and observe the difference. Yahoo tried to be everything—a search engine, email provider, news portal, and more. Google focused on being the best search engine. By niching down to solve one problem exceptionally well, Google became the go-to solution, while Yahoo faded. As a creator, scattering your efforts across disconnected platforms is like being Yahoo. Niching down to a unified system makes you Google—effective, not just busy.

Mind you, Google is not ONLY a search engine, but there’s a synergy of multiple products working toward one mission.

For example, take a YouTuber who started with random vlogs but switched to focused tutorials on video editing. Their views doubled, and they launched a course that sold out in a week. By focusing on one problem (teaching video editing), they became the go-to expert, not a jack-of-all-trades.

Why Your Scattered Creator Life Is Bleeding Money


You’re probably doing this:

  • Amazing content on Instagram
  • Products on Gumroad
  • Services on Calendly
  • Community on Discord
  • Email list on ConvertKit
  • Course on Teachable

Each platform works fine alone. But they’re not working together.

Every platform switch creates friction. Every disconnected interaction is a missed opportunity. When someone finds your content, they can’t easily see your services. When they buy your product, they might never discover your community.

You’re not just losing efficiency. You’re sabotaging your compound growth.

The Antifragile Advantage


When you scatter across platforms, you’re building on rented land. Algorithm changes, policy updates, platform shutdowns—each can disrupt your business. You’re not diversifying. You’re creating vulnerabilities.

But when you build on your own foundation, chaos becomes your edge. While others scramble to adapt to platform changes, you’re thriving from a position of strength. Your system becomes antifragile—getting stronger from disorder.

The Compound Effect You’re Missing


Here’s what happens when everything lives in your ecosystem:

  • Your blog post leads to your products.
  • Your products introduce your services.
  • Your services deepen community engagement.
  • Your community creates testimonials that enhance your marketing.

Every element amplifies the others. This compounding is impossible when your presence is fragmented. A follower on Platform A might never find your products on Platform B. Each isolated interaction loses leverage.

The Focus Paradox Solved


So, should you be a Swiss Army knife or a scalpel?

The answer is both—and neither.

You focus on one unified goal: serving your audience and growing your business. But you develop multiple capabilities that work together toward that goal. You don’t specialize in content, OR products, OR services. You specialize in creating value for your people, using whatever tools and skills serve that mission.

Instead of writing a book, start a newsletter. Instead of just a newsletter, write X threads. Instead of just threads, share ideas on your WhatsApp status. Each step teaches you something that makes the next step easier and more effective.

Your Next Move


You’ve just seen the paradox. The cure isn’t a choice between two paths; it’s a new system for thinking that builds an antifragile business.

Grab "The Antidote to Industrial-Age Thinking" to escape the next set of traps.


Share this essay with another creator who’ll find this helpful.

Talk soon,

Praise

Improving OCR Accuracy in Historical Archives with Deep Learning

2025-08-19 00:46:30

Table of Links

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

2.5 Latin

Vamvakas et al. (2008) presented a complete OCR methodology for recognizing historical documents. It is possible to apply this methodology to both machine-printed and handwritten documents. Due to its ability to adjust depending on the type of documents that we wish to process, it does not require any knowledge of fonts or databases. Three steps were involved in the methodology: the first two involved creating a database for training based on a set of documents, and the third involved recognizing new documents. First, a pre-processing step that includes image binarization and enhancement takes place. In the second step, a top-down segmentation approach is used to detect text lines, words, and characters. A clustering scheme is then adopted to group characters of similar shapes. In this process, the user is able to interact at any time in order to correct errors in clustering and assign an ASCII label. Upon completion of this step, a database is created for the purpose of recognition. Lastly, in the third step, the above segmentation approach is applied to every new document image, while the recognition is performed using the character database that was created in the previous step. Based on the results of the experiments, the model was found to be 83.66% accurate. In future work, the authors plan to improve the current recognition results by exploiting new approaches for segmentation and new types of features.

\ In typical OCR systems, binarization is a crucial preprocessing stage where the input image is transformed into a binary form by removing unwanted elements, resulting in a clean and binarized version for further processing. However, binarization is not always perfect, and artifacts introduced during this process can lead to the loss of important details, such as distorted or fragmented character shapes. Particularly in historical documents, which are prone to higher levels of noise and degradation, binarization methods tend to perform poorly, impeding the effectiveness of the overall recognition pipeline. To address this issue, Yousefi et al. (2015) proposes an alternative approach that bypasses the traditional binarization step. They propose training a 1D LSTM network directly on gray-level text data. For their experiments, they curated a large dataset of historical Fraktur documents from publicly accessible online sources, which served as training and test sets for both grayscale and binary text lines. Additionally, to investigate the impact of resolution, they utilized sets of both low and high resolutions in their experiments. The results demonstrated the effectiveness of the 1D LSTM network compared to binarization. The network achieved significantly lower error rates, outperforming binarization by 24% on the low-resolution set and 19% on the high-resolution set. This approach offers a promising alternative by leveraging LSTM networks to directly process gray-level text data, bypassing the limitations and artifacts associated with traditional binarization methods. It proves particularly beneficial for historical documents and provides improved accuracy in OCR tasks.

\ According to Springmann et al. (2016), achieving accurate OCR results for historical printings requires training recognition models using diplomatic transcriptions, which are scarce and time-consuming resources. To overcome this challenge, the authors propose a novel method that avoids training separate models for each historical typeface. Instead, they employ mixed models initially trained on transcriptions from six printings spanning the years 1471 to 1686, encompassing various fonts. The results demonstrate that using mixed models yields character accuracy rates exceeding 90% when evaluated on a separate test set comprising six additional printings from the same historical period. This finding suggests that the typography barrier can be overcome by expanding the training beyond a limited number of fonts to encompass a broader range of (similar) fonts used over time. The outputs of the mixed models serve as a starting point for further development using both fully automated methods, which employ the OCR results of mixed models as pseudo ground truth for training subsequent models, and semi-automated methods that require minimal manual transcriptions. In the absence of actual ground truth, the authors introduce two easily observable quantities that exhibit a strong correlation with the actual accuracy of each generated model during the training process. These quantities are the mean character confidence (C), determined by the OCR engine OCRopus, and the mean token lexicality (L), which measures the distance between OCR tokens and modern wordforms while accounting for historical spelling patterns. Through an ordinal classification scale, the authors determine the most effective model in recognition, taking into account the calculated C and L values. The results reveal that a wholly automatic method only marginally improves OCR results compared to the mixed model, whereas hand-corrected lines significantly enhance OCR accuracy, resulting in considerably lower character error rates. The objective of this approach is to minimize the need for extensive ground truth generation and to avoid relying solely on a pre-existing typographical model. By leveraging mixed models and incorporating manual corrections, the proposed method demonstrates advancements in OCR results for historical printings, offering a more efficient and effective approach to training recognition models.

\ Bukhari et al. (2017) introduced the ”anyOCR” system, which focuses on the accurate digitization of historical archives. This system, being open-source, allows the research community to easily employ anyOCR for digitizing historical archives. It encompasses a comprehensive document processing pipeline that supports various stages, including layout analysis, OCR model training, text line prediction, and web applications for layout analysis and OCR error correction. One notable feature of anyOCR is its capability to handle contemporary images of documents with diverse layouts, ranging from simple to complex. Leveraging the power of LSTM networks, modern OCR systems enable text recognition. Moreover, anyOCR incorporates an unsupervised OCR training framework called anyOCRModel, which can be readily trained for any script and language. To address layout and OCR errors, anyOCR offers web applications with interactive tools. The anyLayoutEdit component enables users to rectify layout issues, while the anyOCREdit component allows for the correction of OCR errors. Additionally, the research community can access a Dockerized Virtual Machine (VM) that comes pre-installed with most of the essential components, facilitating easy setup and deployment. By providing these components and tools, anyOCR empowers the research community to utilize and enhance them according to their specific requirements. This collaborative approach encourages further refinement and advancements in the field of historical archive digitization.

\ Springmann et al. (2018) provided resources for historical OCR called the GT4HistOCR dataset, which consists of printed text line images accompanied by corresponding transcriptions. This dataset encompasses a total of 313,173 line pairs derived from incunabula spanning the 15th to the 19th centuries. It is made publicly available under the CC-BY 4.0 license, ensuring accessibility and usability. The GT4HistOCR dataset is particularly well-suited for training advanced recognition models in OCR software that utilize recurrent neural networks, specifically the LSTM architecture, such as Tesseract 4 or OCRopus. To assist researchers, the authors have also provided pretrained OCRopus models specifically tailored to subsets of the dataset. These pretrained models demonstrate impressive character accuracy rates of 95 percent for early printings and 98.5 percent for 19th-century Fraktur printings, showcasing their effectiveness even on unseen test cases.

\ According to Nunamaker et al. (2016), historical document images must be accompanied by ground truth text for training an OCR system. However, this process typically requires linguistic experts to manually collect the ground truths, which can be time-consuming and labor-intensive. To address this challenge, the authors propose a framework that enables the autonomous generation of training data using labelled character images and a digital font, eliminating the need for manual data generation. In their approach, instead of using actual text from sample images as ground truth, the authors generate arbitrary and rule-based ”meaningless” text for both the image and the corresponding ground truth text file. They also establish a correlation between the similarity of character samples in a subset and the performance of classification. This allows them to create upper- and lower-bound performance subsets for model generation using only the provided sample images. Surprisingly, their findings indicate that using more training samples does not necessarily improve model performance. Consequently, they focus on the case of using just one training sample per character. By training a Tesseract model with samples that maximize a dissimilarity metric for each character, the authors achieve a character recognition error rate of 15% on a custom benchmark of 15th-century Latin documents. In contrast, when a traditional Tesseract-style model is trained using synthetically generated training images derived from real text, the character recognition error rate increases to 27%. These results demonstrate the effectiveness of their approach in generating training data autonomously and improving the OCR performance for historical documents.

\ Koistinen et al. (2017) documented the efforts undertaken by the National Library of Finland (NLF) to enhance the quality of OCR for their historical Finnish newspaper collection spanning the years 1771 to 1910. To analyze this collection, a sample of 500,000 words from the Finnish language section was selected. The sample consisted of three parallel sections: a manually corrected ground truth version, an OCR version corrected using ABBYY FineReader version 7 or 8, and an ABBYY FineReader version 11-reOCR version. Utilizing the original page images and this sample, the researchers devised a re-OCR procedure using the open-source software Tesseract version 3.04.01. The findings reveal that their method surpassed the performance of ABBYY FineReader 7 or 8 by 27.48 percentage points and ABBYY FineReader 11 by 9.16 percentage points. At the word level, their method outperformed ABBYY FineReader 7 or 8 by 36.25 percent and ABBYY FineReader 11 by 20.14 percent. The recall and precision results for the re-OCRing process, measured at the word level, ranged between 0.69 and 0.71, surpassing the previous OCR process. Additionally, other metrics such as the ability of the morphological analyzer to recognize words and the rate of character accuracy demonstrated a significant improvement following the re-OCRing process.

\ Reul et al. (2018) examined the performance of OCR on 19th-century Fraktur scripts using mixed models. These models are trained to recognize various fonts and typesets from previously unseen sources. The study outlines the training process employed to develop robust mixed OCR models and compares their performance to freely available models from popular open-source engines such as OCRopus and Tesseract, as well as to the most advanced commercial system, ABBYY. To evaluate a substantial amount of unknown information, the researchers utilized 19th-century data extracted from books, journals, and a dictionary. Through the experiment, they found that combining models with real data yielded better results compared to combining models with synthetic data. Notably, the OCR engine Calamari demonstrated superior performance compared to the other engines assessed. It achieved an average CER of less than 1 percent, a significant improvement over the CER exhibited by ABBYY.

\ According to Romanello et al. (2021), commentaries have been a vital publication format in literary and textual studies for over a century, alongside critical editions and translations. However, the utilization of thousands of digitized historical commentaries, particularly those containing Greek text, has been challenging due to the limitations of OCR systems in terms of poor-quality results. In response to this, the researchers aimed to evaluate the performance of two OCR algorithms specifically designed for historical classical commentaries. The findings of their study revealed that the combination of Kraken and Ciaconna algorithms achieved a significantly lower CER compared to Tesseract/OCR-D (average CER of 7% versus 13% for Tesseract/OCR-D) in sections of commentaries containing high levels of polytonic Greek text. On the other hand, in sections predominantly composed of Latin script, Tesseract/OCR-D exhibited slightly higher accuracy than Kraken + Ciaconna (average CER of 8.2% versus 8.2%). Additionally, the study highlighted the availability of two resources. Pogretra is a substantial collection of training data and pre-trained models specifically designed for ancient Greek typefaces. On the other hand, GT4HistComment is a limited dataset that provides OCR ground truth specifically for 19th-century classical commentaries.

According to Skelbye and Dannélls (2021), the use of deep CNN-LSTM hybrid neural networks has proven to be effective in improving the accuracy of OCR models for various languages. In their study, the authors specifically examined the impact of these networks on OCR accuracy for Swedish historical newspapers. By employing the open source OCR engine Calamari, they developed a mixed deep CNN-LSTM hybrid model that surpassed previous models when applied to Swedish historical newspapers from the period between 1818 and 1848. Through their experiments using nineteenth-century Swedish newspaper text, they achieved a remarkable average Character Accuracy Rate (CAR) of 97.43 percent, establishing a new state-of-the-art benchmark in OCR performance.

\ Based on Aula (2021), scanned documents can contain deteriorations acquired over time or as a result of outdated printing methods. There are a variety of visual attributes that can be observed on these documents, such as variations in style and font, broken characters, varying levels of ink intensity, noise levels and damage caused by folding or ripping, among others. Modern OCR tools are unfavorable to many of these attributes, leading to failures in the recognition of characters. To improve the result of character recognition, they used image processing methods. Furthermore, common image quality characteristics of scanned historical documents with unidentifiable text were analyzed. For the purposes of this study, the opensource Tesseract software was used for optical character recognition. To prepare the historical documents for Tesseract, Gaussian lowpass filtering, Otsu’s optimum thresholding method, and morphological operations were employed. The OCR output was evaluated based on the Precision and Recall classification method. It was found that the recall had improved by 63 percentage points and the precision by 18 percentage points. This study demonstrated that using image pre-processing methods to improve the readability of historical documents for the use of OCR tools has been effective.

According to Gilbey and Schönlieb (2021), scans of historical and contemporary printed documents often have extremely low resolutions, such as 60 dots per inch (dpi). While humans can still read these scans fairly easily, OCR systems encounter significant challenges. The prevailing approach involves employing a super-resolution reconstruction method to enhance the image, which is then fed into a standard OCR system along with an approximation of the original high-resolution image. However, the researchers propose an end-to-end method that eliminates the need for the super-resolution phase, leading to superior OCR results. Their approach utilizes neural networks for OCR and draws inspiration from the human visual system. Remarkably, their experiments demonstrate that OCR can be successfully performed on scanned images of English text with a resolution as low as 60 dpi, which is considerably lower than the current state of the art. The results showcase an impressive Character Level Accuracy (CLA) of 99.7% and a Word Level Accuracy (WLA) of 98.9% across a corpus comprising approximately 1000 pages of 60 dpi text in diverse fonts. When considering 75 dpi images as an example, the mean CLA and WLA achieved were 99.9% and 99.4%, respectively.


:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Howler, Kurdistan Region - Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Howler, Kurdistan Region - Iraq ([email protected]).

:::


:::info This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-NODERIVS 4.0 INTERNATIONAL license.

:::


Advances in OCR for Historical Chinese, Japanese, Coptic, and Greek Texts

2025-08-19 00:46:21


2.2 Chinese/Japanese

Historical Chinese characters have posed one of the greatest challenges in pattern recognition, due to their large character set and varied writing styles. To address this issue, Li et al. (2014) proposed a method of recognizing historical Chinese characters by incorporating STM into an MQDF classifier. The experiments were conducted on historical documents from Dunhuang and on traditional Chinese fonts, with the optimal parameters chosen after testing many different settings. They conducted two separate sets of experiments: the first used printed traditional Chinese characters, and the second used samples taken from historical Chinese documents. In addition, the method may be improved by introducing nonlinear transfer and integrating it with other approaches. The system was also tested with a variety of features and classifiers. The experimental results suggest that supervised STMs may improve the generalization of classifiers; the error rate was reduced considerably and the method showed significant potential. For example, the error rate on one of the tested documents could be reduced by 60% by tagging only 10% of the samples with labels.

\ The lack of labeled training samples makes recognition of historical Chinese characters very challenging. Therefore, Feng et al. (2015) proposed a non-linear Style Transfer Mapping (STM) model based on Gaussian Process (GP-STM), which extends the traditional linear STM model. By using GP-STM, existing printed samples of Chinese characters were used to recognize historical Chinese characters. To prepare the GP-STM framework, the researchers compared a number of methods for extracting features, trained a Modified Quadratic Discriminant Function (MQDF) classifier on examples of Chinese characters printed on paper, and then applied the model to historical documents from Dunhuang. The impact of different kernels and parameters was measured, in addition to the impact of the number of training samples. In the experiments, the results indicate that the GP-STM is capable of achieving an accuracy of 57.5%, an improvement of over 8% over the STM.

\ It is difficult to recognize Chinese characters directly by using classical methods when they appear in historical documents since they can be categorized into more than 8000 different categories. Due to the lack of well-labeled data, deep learning based methods are unable to recognize them. The authors of Yang et al. (2018) presented a historical Chinese text recognition algorithm based on data that was labeled at the page level without aligning each line of text. In order to reduce the influence of misalignment between text line images and labels, they proposed Adaptive Gradient Gate (AGG). The proposed text recognizer can reduce its error rate by over 35 percent with the help of AGG. Furthermore, they found that establishing an implicit language model using Convolutional Neural Networks (CNNs) and Connectionist Temporal Classification (CTC) is one of the key factors in achieving high recognition performance. With an accuracy rate of 94.64%, the proposed system outperformed other optical character recognition systems.

\ Deep reinforcement learning has found successful applications across various fields. Sihang et al. (2020) presented an innovative approach, based on deep reinforcement learning, to enhance the F-measure score for Chinese character detection in historical documents. Their method introduced a novel fully convolutional network called fully convolutional network with position-sensitive Region-of-Interest (RoI) pooling FCPN. Unlike fixed-size character patches, this network could accommodate patches of varying sizes and incorporate positional information into action features. Additionally, they proposed a Dense Reward Function (DRF) that effectively rewarded different actions based on environmental conditions, thereby enhancing the decisionmaking capability of the agent. The method was designed to be applicable to the output of character-level or word-level text detectors, resulting in more precise outcomes. The effectiveness of their approach was demonstrated through its application to the Tripitaka Koreana in Han (TKH) and Multiple Tripitaka in Han (MTH) datasets, where a notable improvement was observed, achieving an Intersection over Union (IoU) of 0.8.

\ The introduction of ARCED by Ly et al. (2020) presents a novel attention-based row-column encode-decoder model for recognizing multiple text lines in images without requiring explicit line segmentation. The recognition system comprises three key components: a feature extractor, a row-column encoder, and a decoder. By adopting an attention-based seq2seq approach, the proposed model achieves significantly lower error rates compared to previous state-of-the-art methods for both single and multiple text line recognition. The encoder component leverages a row-column Bidirectional Long Short-Term Memory (BLSTM) network, enabling the capture of sequential order information in both horizontal and vertical dimensions. This contributes to further reducing error rates within the attention-based model. Additionally, a residual LSTM network utilizes all prior attention vectors to generate predictive distributions in the decoder, leading to improved accuracy. Training of the entire system is conducted using a cross-entropy loss function, utilizing only document images and ground-truth text. To evaluate the performance of ARCED, the Kana-PRMU dataset, comprising Japanese historical documents, is employed. Experimental results demonstrate that ARCED outperforms existing recognition methods. Specifically, when evaluated on the level 2 and level 3 subsets of the Kana-PRMU dataset, the proposed ARCED model achieves character error rates of 4.15% and 12.69% respectively. Future work aims to enhance ARCED’s capabilities for recognizing entire Japanese document pages. Furthermore, incorporating a language model into ARCED is anticipated to further enhance its performance.

2.3 Coptic

According to Bulert et al. (2017) due to non-standard fonts and varying paper and font quality, OCR results may not be satisfactory when applied to historical texts. Further, historical texts are not transmitted in their entirety over time, but rather include gaps and fragments. As a result, automatic post-correction is more difficult when it comes to historical texts than when it comes to modern texts. Two tools were used to create recognition patterns (or models) specific to different languages and documents to recognize printed Coptic texts. Historically, Coptic was the last stage in the development of the pre-Arabic language that was indigenous to Egypt. In addition, it led to the creation of a rich and unique body of literature, including monastic texts, Gnostic texts, Manichaean texts, magical texts, and translations of the Bible and patristic texts. According to the researchers, Coptic texts possess properties that make them excellent candidates for computer-based reading. As a result of their limited number and the fact that most handwritten texts exhibit highly consistent forms, the characters can easily be identified.

2.4 Greek

A study by Simistira et al. (2015) investigated the performance of LSTM networks for OCR of Greek polytonic script. Even though there are many Greek polytonic manuscripts, digitization of such documents has not been widely applied, and very little work has been done on the recognition of these scripts. In this study, they collected many diverse Greek polytonic script pages into a novel database, called Polyton-DB, containing 15,689 text lines of synthetic and authentic printed scripts, and conducted baseline experiments with LSTM networks. The LSTM is shown to have an error rate between 5.51 and 14.68 percent (depending on the document) and outperforms Tesseract and ABBYY FineReader, two well-known OCR engines.

It is not possible to recognize Greek characters in early printed Greek books using traditional character recognition techniques, because the way the same or consecutive words are written does not permit character or word segmentation. To address this issue, Poulos et al. (2010) developed a novel OCR system combining image preprocessing with computational geometry. Their objective was to perform OCR digitization of a large collection of digitized Greek early printed books dating from the late 15th century to the mid-18th century. In this method, image processing is performed through image binarization and enhancement, the creation of a convex polygon that represents the extracted features of each font, and the development of training and identification procedures based on algorithms for intersecting convex polygons. Among the major advantages of this method was the ability to verify an image of a published document, or a partial modification of it, with a high degree of reliability. The proposed system thus uses smart geometric practices to classify a candidate letter. According to experimental results, the proposed method yields positive and negative verification scores that are greater than 92% correct.


:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Howler, Kurdistan Region - Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Howler, Kurdistan Region - Iraq ([email protected]).

:::


:::info This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-NODERIVS 4.0 INTERNATIONAL license.

:::


Can AI Finally Crack Ottoman Text Recognition?

2025-08-19 00:46:15


2 Related work

This section reviews the literature by focusing on machine-typed historical documents. To the best of our knowledge, currently, there is no OCR system that can accurately extract text from old Kurdish publications written in Arabic-Persian script. Therefore, we concentrate on the related work for other languages.

2.1 Arabic/Persian

According to Ozturk et al. (2000), it is difficult to implement an Ottoman character recognition system, and there are insufficient studies in this field. They therefore trained an artificial neural network model on 28 different Ottoman machine-printed documents to develop an OCR system that recognizes different fonts. Three Ottoman newspapers were used to prepare their data. For documents with a trained font, the accuracy was 95%, while for documents with an unknown font, it was 70%.

\ According to Ataer and Duygulu (2007), it may not be possible to obtain satisfactory results using character recognition-based systems due to the characteristics of Ottoman documents. Moreover, it is desirable to store the documents as images, since the documents may contain important drawings, especially signatures. The author viewed Ottoman words as images and proposed a matching technique to solve the problem because of these reasons. According to the author, the bag-of-visterm approach was shown to be successful in classifying objects and scenes, which is why he adopted the same approach for matching word images. Using vector quantization of Scale-Invariant Feature Transform (SIFT) descriptors, word images were represented by sets of visual terms. By comparing the visual terms’ distributions, similar words are then matched. Over 10,000 words were included in the printed and handwritten documents used in the experiments. In the experiment, the highest accuracy was 91% and the lowest accuracy was 30%.

\ Kilic et al. (2008) developed an OCR system specifically designed for Ottoman script segmentation, normalization, edge detection, and recognition. The Ottoman characters were categorized into four distinct forms based on their position within a word: beginning, middle, end, and isolated form. Images of printed papers containing Ottoman script were used for data acquisition. The process involved segmentation and normalization of the images, followed by edge detection using Cellular Neural Networks for feature extraction. Subsequently, a Support Vector Machine (SVM) was employed to accurately identify these multi-font Ottoman characters. The SVM training involved the utilization of Polynomial (linear and quadratic) and Gaussian Radial Basis Function kernels. The proposed recognition system achieved an impressive accuracy rate of 87.32 percent for character classification.

\ Shafii (2014) proposed a new technique in two important preprocessing steps, skew detection and page segmentation, after reviewing the existing technology. Instead of utilizing the usual practice of segmenting characters, they suggested segmenting subwords to avoid challenges with segmentation due to Persian script’s highly cursive nature. Feature extraction was implemented using a hybrid scheme that combines three commonly used methods before being classified using a nonparametric method. Based on their experimental tests on a library of 500 words, they were able to recognize 97% of the words.

Due to the challenges of the Arabic heritage collection, which consists of early prints and manuscripts, it is difficult to extract text from its documents. To address these problems, Stahlberg and Vogel (2016) developed a system called QATIP (Qatar Computing Research Institute Arabic Text Image Processing) to OCR these kinds of documents. A sophisticated text-to-image binarization technique was used in conjunction with Kaldi, which was originally designed for speech recognition. The paper contributed in two major areas: the creation of both a graphical user interface for users and API endpoints for integration, and new approaches to modeling language and ligatures. After testing the system, they found that the newly proposed technique for language and ligature modeling was highly successful. The system achieved 37.5% WER and 12.6% CER for early books.

In order to recognize Ottoman-Turkish characters, Doğru (2016) used the Tesseract optical character recognition system. In addition, various transcription methods have been developed from Ottoman Turkish to Latin. Optical character recognition could not recognize certain Ottoman-Turkish characters. As a result, Ottoman-Turkish keyboards were developed to facilitate the writing of unrecognized characters using Ottoman-Turkish alphabets. For the transcription process, dictionary tables were used. This resulted in an increase in the success rate of transcription when enrichment data was included in the dictionary tables. Therefore, an application was developed to enrich dictionary tables with data. The recognition rates for the first two pages of an Ottoman book were between 75.88% and 77.38%. Based on the results of the experiments, the author concludes that recognition rates can vary based on quality, style, and whether the documents or images are printed or handwritten. High-quality, printed images can be recognized with a 100% accuracy rate, while handwritten and low-quality documents or images cannot be recognized by optical character recognition. It is therefore necessary to write these kinds of documents or images again in Ottoman-Turkish.

An analytical approach to cursive scripts such as Arabic can be very challenging, especially for segmentation, because of the frequent overlap between characters. Because of that, Nashwan et al. (2017) proposed a segmentation-based holistic approach to solve this issue. Since the holistic approach deals with the entire word as a single unit, it improves the error rate for cursive scripts. On the other hand, it increases computational complexity, especially if the application has a huge vocabulary. In their view, their holistic approach, based on Discrete Cosine Transforms (DCTs) and local block features, is computationally efficient. In addition, they developed a method for reducing the size of the lexicon by clustering words that have similar shapes. The proposed system was tested on a wide range of datasets and achieved a 47.8% word recognition rate (WRR), which increased to 65.7% when considering the top-10 hypotheses.

By employing deep convolutional neural networks, Küçükşahin (2019) devised an offline OCR system that demonstrates the ability to recognize Ottoman characters. The proposed methodology encompasses multiple stages, including image processing, image digitization, character segmentation, adaptation of inputs for the network, network training, recognition, and evaluation of outcomes. To create a character dataset, text images of varying lengths were segmented from diverse samples of Ottoman literature obtained from the Turkish National Library’s digital repository. Two convolutional neural networks of differing complexity were trained using the generated character dataset, and the correlation between recognition rates and network complexity was examined. The dataset’s features were extracted through the Histogram of Oriented Gradients and Principal Component Analysis techniques, while classification of Ottoman characters was achieved using the widely employed k-Nearest Neighbor Algorithm and Support Vector Machines. Results from the conducted analyses revealed that both networks exhibit recognition rates comparable to traditional classifiers; however, the more intricate deep neural network outperformed others in terms of accuracy and loss. After 100 epochs, the most accurate model achieved an impressive accuracy of 97.58 percent.

\ Dolek and Kurt (2021) presented an OCR tool developed for printed Ottoman documents in Naskh font. The tool was developed using a deep learning model trained with data sets containing both original and synthetic documents. The model was compared with free and opensource OCR engines using a test dataset comprising 21 pages of original documents. In terms of accuracy rates, their model outperformed the other tools with 88.64% raw, 95.92% normalized, and 97.18% joined. Additionally, their model achieved an accuracy rate of 58 percent for word recognition, which is the only rate above 50 percent among the OCR tools that were compared.


:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Howler, Kurdistan Region - Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Howler, Kurdistan Region - Iraq ([email protected]).

:::


:::info This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-NODERIVS 4.0 INTERNATIONAL license.

:::
