March 1, 2025

Engineering at Astera Software

At Astera I worked on ReportMiner, a platform for processing unstructured documents like PDFs, scanned images, invoices, and contracts into structured data. Most of my time was on the OCR and extraction side of the product.

A lot of the work started with understanding where the system struggled. Documents with inconsistent layouts or noisy scanned inputs would often break extraction pipelines, so I spent a fair amount of time digging into those edge cases and figuring out why different OCR approaches behaved the way they did. I ended up experimenting with combining multiple OCR engines, including open-source solutions and cloud services like AWS Textract and Google Vision, instead of relying on a single model. ReportMiner supports multiple OCR strategies under the hood, each with different tradeoffs in accuracy and layout detection, so the goal was to make the system more adaptable rather than forcing one approach everywhere.
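To make the multi-engine idea concrete, here is a minimal sketch of trying engines in order and keeping the most confident result. It is not ReportMiner's actual implementation: `OcrResult`, `best_ocr_result`, the `good_enough` threshold, and the two stand-in engines are all hypothetical names, and a real setup would wrap calls to an open-source engine or a cloud service behind the same callable shape.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class OcrResult:
    engine: str
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a (hypothetical) engine wrapper


# Assumption: every engine is wrapped as a callable from image bytes to OcrResult.
OcrEngine = Callable[[bytes], OcrResult]


def best_ocr_result(
    image: bytes,
    engines: Sequence[OcrEngine],
    good_enough: float = 0.9,
) -> OcrResult:
    """Try engines in order; stop early once one is confident enough."""
    if not engines:
        raise ValueError("at least one OCR engine is required")
    best: Optional[OcrResult] = None
    for engine in engines:
        result = engine(image)
        if best is None or result.confidence > best.confidence:
            best = result
        if best.confidence >= good_enough:
            break  # skip the remaining (possibly slower or paid) engines
    return best


# Stand-in engines with canned output, for illustration only.
def local_engine(image: bytes) -> OcrResult:
    return OcrResult("local", "lnvoice 1O23", 0.62)


def cloud_engine(image: bytes) -> OcrResult:
    return OcrResult("cloud", "Invoice 1023", 0.95)


result = best_ocr_result(b"fake-image-bytes", [local_engine, cloud_engine])
print(result.engine, result.text)
```

Ordering the list from cheapest to most expensive means the costlier engines only run when the cheaper ones are not confident, which is one way to trade accuracy against cost per document.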

I also worked on making the extraction pipeline more flexible. ReportMiner relies heavily on template-based extraction for documents with similar structures, so I focused on improving how pipelines could fall back or adapt when documents didn’t perfectly match expected formats. That kind of edge-case handling is easy to overlook, but it matters a lot in production, where real-world documents never quite look the way you expect.
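As a rough illustration of that fallback behavior, the sketch below tries a list of templates in order and flags the document for review when none of them match. The template names, field names, and regular expressions are invented for the example and are not ReportMiner's template format.

```python
import re
from typing import Dict, Optional

# Hypothetical templates: each maps a field name to a regex with one capture group.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "invoice_v1": {
        "invoice_number": r"Invoice No:\s*(\S+)",
        "total": r"Total:\s*\$?([\d.,]+)",
    },
    "invoice_v2": {
        "invoice_number": r"Inv #\s*(\S+)",
        "total": r"Amount Due\s*\$?([\d.,]+)",
    },
}


def apply_template(text: str, patterns: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return all fields if every pattern matches, otherwise None."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            return None  # a required field is missing, so this template does not fit
        fields[name] = match.group(1)
    return fields


def extract(text: str) -> dict:
    """Try each template in order; fall back to manual review if none fit."""
    for template_name, patterns in TEMPLATES.items():
        fields = apply_template(text, patterns)
        if fields is not None:
            return {"template": template_name, "fields": fields}
    return {"template": None, "fields": {}, "needs_review": True}


print(extract("Inv # A-77\nAmount Due $1,250.00"))
print(extract("A scanned page with no recognizable fields"))
```

The useful property is that a mismatch is reported explicitly instead of producing a half-filled record, so downstream steps can route those documents differently.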

On the processing side, the platform supports automation, batch processing, and scheduled ingestion, so even small inefficiencies compound quickly at scale. Making the pipeline smarter about when and how extraction runs helped reduce unnecessary processing and improved overall reliability.
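One simple way to avoid redundant work in batch runs, shown here purely as an assumption about what "smarter" could look like, is to skip documents whose content has already been processed. The `process_batch` function and its in-memory `seen` cache are hypothetical; a production system would likely persist the cache between scheduled runs.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Optional


def process_batch(
    documents: Iterable[bytes],
    extract: Callable[[bytes], dict],
    seen: Optional[Dict[str, dict]] = None,
) -> List[dict]:
    """Run extraction once per unique document, keyed by a SHA-256 content hash."""
    seen = {} if seen is None else seen
    results = []
    for doc in documents:
        key = hashlib.sha256(doc).hexdigest()
        if key not in seen:
            seen[key] = extract(doc)  # only pay for extraction on new content
        results.append(seen[key])
    return results


calls = []


def fake_extract(doc: bytes) -> dict:
    # Stand-in for a real extraction step; records how often it is invoked.
    calls.append(doc)
    return {"length": len(doc)}


batch = [b"invoice-a", b"invoice-b", b"invoice-a"]
print(process_batch(batch, fake_extract))
print(len(calls))
```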

Beyond OCR, I also built AI-driven features in Centerprise and put together Astera’s AI App Orchestrator Server using Python and FastAPI. That was a service for coordinating AI workloads across the platform, and it was a good exercise in thinking about how to keep things modular when the underlying models and APIs are constantly changing.

Honestly, this role was less about plugging in tools and more about understanding how unstructured data behaves in the real world: messy inputs, inconsistent formats, and edge cases that only show up at scale. I think that kind of work teaches you a lot about building systems that are actually robust, not just ones that work on clean test data.
