Automating Document Data Extraction with AWS Textract and Streamlit
In the digital age, managing and processing vast amounts of data is crucial for businesses and individuals alike. One of the key challenges faced is extracting meaningful information from documents like PDFs and images. Manual data entry is not only time-consuming but also prone to errors. To address this challenge, we can leverage the power of machine learning and cloud services to automate the process.
In this blog, we will walk you through the process of building an application that extracts data from documents using AWS Textract and presents it in a user-friendly web interface built with Streamlit. Whether you’re handling invoices, receipts, or other documents, this solution can save you time and improve accuracy.
What is AWS Textract?
AWS Textract (officially Amazon Textract) is a machine learning service from Amazon Web Services (AWS) that automatically extracts printed text, handwriting, and structured data from scanned documents. Beyond plain OCR, it can identify form fields (key-value pairs) and tables, which makes it well suited to processing a variety of document types.
Why Streamlit?
Streamlit is an open-source app framework for Machine Learning and Data Science projects. It allows you to create interactive web applications with minimal effort, making it perfect for quickly building and deploying data-driven applications.
What You Will Learn
In this tutorial, we will cover:
- Setting Up the Environment: Installing necessary libraries and setting up AWS credentials.
- Uploading Documents: Building a Streamlit interface to upload PDFs and images.
- Extracting Data with AWS Textract: Using AWS Textract to extract text and tables from uploaded documents.
- Processing and Displaying Data: Parsing the Textract response and displaying the extracted data in a structured format.
- Saving Results: Saving the extracted data to a text file.
Let’s get started!
Prerequisites
Before starting, ensure you have the following:
- AWS Account: Create or use an existing AWS account.
- AWS IAM Credentials: Obtain AWS Access Key ID and Secret Access Key with permissions for AWS Textract and S3.
- Python Environment: Install Python (version 3.6+) and pip package manager.
- Streamlit: Install Streamlit to build the web interface.
Step 1: Setup AWS Credentials
1. Create AWS IAM User:
— Go to AWS IAM console.
— Create a new IAM user or use an existing one.
— Attach policies for AWS Textract and S3 access.
2. Obtain Access Key ID and Secret Access Key:
— In AWS IAM, navigate to the user’s security credentials.
— Generate or use existing Access Key ID and Secret Access Key.
3. Configure AWS Credentials Locally:
— Install AWS CLI: `pip install awscli`.
— Configure AWS CLI with your Access Key ID, Secret Access Key, and AWS Region:
aws configure
AWS Access Key ID: [YourAccessKeyID]
AWS Secret Access Key: [YourSecretAccessKey]
Default region name: [YourAWSRegion]
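The policies attached in item 1 can be broader than this application needs. As a rough sketch, the script below prints a narrower policy document covering only the Textract and S3 calls used later in this post. The function name build_policy and the bucket placeholder are illustrative assumptions, and the statement list may need adjusting for your own account setup.

```python
import json


def build_policy(bucket_name):
    """Return a narrow IAM policy document (as a dict) for this tutorial.

    build_policy is a hypothetical helper; the actions listed are the
    Textract and S3 operations that the application in this post calls.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TextractCalls",
                "Effect": "Allow",
                "Action": [
                    "textract:DetectDocumentText",
                    "textract:AnalyzeDocument",
                    "textract:StartDocumentAnalysis",
                    "textract:GetDocumentAnalysis",
                ],
                "Resource": "*",
            },
            {
                "Sid": "BucketObjects",
                "Effect": "Allow",
                # PutObject for the upload, GetObject so the async job can read it.
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Paste the printed JSON into the IAM console's policy editor.
    print(json.dumps(build_policy("your-s3-bucket-name"), indent=2))
```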
Step 2: Set Up Python Environment
1. Create a New Python Virtual Environment:
— Open a terminal or command prompt.
— Create a new virtual environment:
python -m venv textract-env
2. Activate the Virtual Environment:
— Activate the virtual environment:
— On Windows:
textract-env\Scripts\activate
— On macOS/Linux:
source textract-env/bin/activate
3. Install Required Python Packages:
- Create a requirements.txt file. It should look like this:
streamlit
boto3
pandas
textract-trp
botocore
amazon-textract-response-parser
Save this file in your project directory, then install all the listed dependencies with the following command:
pip install -r requirements.txt
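After installing, it can be useful to confirm that the modules app.py imports are actually visible to the active virtual environment. The snippet below is a small sketch of such a check; missing_modules is a hypothetical helper name, and the module list simply mirrors the imports used later in this post.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Import names used by app.py; "trp" is the Textract response parser module.
    required = ["streamlit", "boto3", "pandas", "trp"]
    missing = missing_modules(required)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```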
Configuring AWS Credentials
To interact with AWS services, you’ll need to configure your AWS credentials. Create a file named config.py and add your AWS credentials:
AWS_ACCESS_KEY_ID = 'your-access-key-id'
AWS_SECRET_ACCESS_KEY = 'your-secret-access-key'
AWS_REGION = 'your-aws-region'
BUCKET_NAME = 'your-s3-bucket-name'
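Replace the placeholders with your actual AWS access key, secret key, region, and S3 bucket name. Because config.py contains secrets, keep it out of version control (for example, by adding it to .gitignore).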
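If you would rather not keep keys in a source file at all, one alternative is to read the same four settings from environment variables. The sketch below assumes the variable names match those in config.py; load_settings is a hypothetical helper, not part of the original project. When the two key values come back as None, boto3 should fall back to its default credential chain, which includes the profile created by aws configure.

```python
import os


def load_settings(env=None):
    """Read the tutorial's settings from environment variables.

    The region and bucket name are required; the two key values are
    optional and returned as None when unset.
    """
    env = os.environ if env is None else env
    required = ["AWS_REGION", "BUCKET_NAME"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {
        "AWS_REGION": env["AWS_REGION"],
        "BUCKET_NAME": env["BUCKET_NAME"],
        "AWS_ACCESS_KEY_ID": env.get("AWS_ACCESS_KEY_ID"),
        "AWS_SECRET_ACCESS_KEY": env.get("AWS_SECRET_ACCESS_KEY"),
    }


if __name__ == "__main__":
    try:
        settings = load_settings()
        print("Region:", settings["AWS_REGION"])
    except RuntimeError as err:
        print(err)
```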
Step 3: Building the Streamlit Interface
Create a new Python file named app.py and start by importing the necessary libraries:
import streamlit as st
import boto3
import pandas as pd
import time
from io import BytesIO
from datetime import datetime
from botocore.exceptions import ClientError
from config import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME
from trp import Document
Next, initialize the Textract client:
client = boto3.client('textract',
                      aws_access_key_id=AWS_ACCESS_KEY_ID,
                      aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                      region_name=AWS_REGION)
Create the main function to define the Streamlit interface:
def main():
    st.title("AWS Textract Data Extraction")
    uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"])
    if uploaded_file is not None:
        file_bytes = uploaded_file.read()
        original_filename = uploaded_file.name
        try:
            if uploaded_file.type == "application/pdf":
                # PDFs are uploaded to S3 and analyzed with the asynchronous API.
                st.write("Extracting text from PDF...")
                response = extract_text_from_pdf(file_bytes, original_filename)
                if response:
                    lines, tables, key_values, date = process_textract_response(response)
                else:
                    st.error("Error occurred during document analysis.")
                    return  # nothing to display, so stop here
            else:
                # Images are sent directly to the synchronous API.
                st.write("Extracting text from Image...")
                # Note: extract_text_from_image_table is not listed in this post;
                # it is expected to return a list of pandas DataFrames.
                tables = extract_text_from_image_table(file_bytes)
                lines = extract_text_from_image(file_bytes)
                key_values, date = {}, None

            # Display the results.
            st.subheader("Extracted Text:")
            for line in lines:
                st.write(line)
            if tables:
                st.subheader("Extracted Tables:")
                for i, table in enumerate(tables):
                    st.write(f"Table {i+1}:")
                    st.table(table)
            if key_values:
                st.subheader("Extracted Key-Value Pairs:")
                st.write(key_values)

            # Build the output filename, e.g. my_invoice_2024-01-31.txt
            date_str = datetime.now().strftime("%Y-%m-%d")
            base_name = original_filename.rsplit('.', 1)[0]
            sanitized_base_name = base_name.replace(' ', '_')
            output_filename = f"{sanitized_base_name}_{date_str}.txt"

            if st.button("Save to File"):
                with open(output_filename, "w") as text_file:
                    text_file.write("Extracted Tables:\n")
                    for i, table in enumerate(tables):
                        text_file.write(f"Table {i+1}:\n")
                        text_file.write(table.to_string() + "\n\n")
                    text_file.write("Extracted Key-Value Pairs:\n")
                    for key, value in key_values.items():
                        text_file.write(f"{key}: {value}\n")
                    if date:
                        text_file.write(f"Date: {date}\n")
                st.success(f"Text saved to {output_filename}")
        except Exception as e:
            st.error(f"An error occurred: {e}")


# Keep these two lines at the very end of app.py, after the helper
# functions added in the following steps, so they exist when main() runs.
if __name__ == "__main__":
    main()
Step 4: Extracting Data with AWS Textract
Uploading Documents to S3
Add the function to upload documents to S3:
def upload_to_s3(file_bytes, original_filename):
    s3 = boto3.client('s3', aws_access_key_id=AWS_ACCESS_KEY_ID,
                      aws_secret_access_key=AWS_SECRET_ACCESS_KEY, region_name=AWS_REGION)
    bucket_name = BUCKET_NAME
    # Build the object key, e.g. my_invoice_2024-01-31.pdf
    date_str = datetime.now().strftime("%Y-%m-%d")
    base_name, extension = original_filename.rsplit('.', 1)
    sanitized_base_name = base_name.replace(' ', '_')
    file_name = f'{sanitized_base_name}_{date_str}.{extension}'
    try:
        s3.upload_fileobj(BytesIO(file_bytes), bucket_name, file_name)
        return bucket_name, file_name
    except ClientError as e:
        st.error(f"Error uploading file to S3: {e}")
        return None, None
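One side effect of the naming scheme above is that two different files with the same name uploaded on the same day overwrite each other in the bucket. A possible variation, sketched below, appends a short content hash to the key and also tolerates filenames without an extension. The helper name build_object_key and the exact key layout are assumptions for illustration only.

```python
import hashlib
import re
from datetime import date


def build_object_key(original_filename, file_bytes, today=None):
    """Build an S3 object key from the filename, the date, and a content hash."""
    today = today or date.today()
    base, dot, ext = original_filename.rpartition(".")
    if not dot or not base:
        # No usable extension: treat the whole name as the base.
        base, ext = original_filename, ""
    # Replace anything outside a conservative character set with "_".
    safe_base = re.sub(r"[^A-Za-z0-9._-]+", "_", base).strip("_") or "document"
    # Eight hex characters are enough to tell same-day uploads apart.
    digest = hashlib.sha256(file_bytes).hexdigest()[:8]
    key = f"{safe_base}_{today:%Y-%m-%d}_{digest}"
    return f"{key}.{ext.lower()}" if ext else key


if __name__ == "__main__":
    print(build_object_key("My Invoice (1).PDF", b"example bytes"))
```

Because the hash depends only on the file contents, uploading the same file twice on the same day produces the same key rather than a duplicate object.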
Extracting Text from PDFs
Add the function to start document analysis with AWS Textract:
def extract_text_from_pdf(file_bytes, original_filename):
    bucket_name, file_name = upload_to_s3(file_bytes, original_filename)
    if not bucket_name or not file_name:
        return None
    try:
        # Start the asynchronous analysis job on the uploaded object.
        response = client.start_document_analysis(
            DocumentLocation={
                'S3Object': {
                    'Bucket': bucket_name,
                    'Name': file_name
                }
            },
            FeatureTypes=['TABLES', 'FORMS']
        )
        job_id = response['JobId']
        # Poll until the job leaves the IN_PROGRESS state. PARTIAL_SUCCESS is
        # also a terminal status, so include it to avoid polling forever.
        while True:
            response = client.get_document_analysis(JobId=job_id)
            status = response['JobStatus']
            if status in ['SUCCEEDED', 'PARTIAL_SUCCESS', 'FAILED']:
                break
            time.sleep(5)
        if status == 'FAILED':
            raise Exception("Document analysis failed")
        # Collect all pages of the response
        all_blocks = response['Blocks']
        next_token = response.get('NextToken', None)
        while next_token:
            response = client.get_document_analysis(JobId=job_id, NextToken=next_token)
            all_blocks.extend(response['Blocks'])
            next_token = response.get('NextToken', None)
        response['Blocks'] = all_blocks
        return response
    except ClientError as e:
        st.error(f"AWS Textract error: {e}")
        return None
    except Exception as e:
        st.error(f"An unexpected error occurred: {e}")
        return None
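The polling loop above waits indefinitely if a job never finishes. As a rough sketch, the two helpers below separate the waiting and the pagination steps and add a timeout. The names wait_for_job and collect_blocks are hypothetical, and the client argument is assumed to be the boto3 Textract client created earlier (or any object with a compatible get_document_analysis method).

```python
import time


def wait_for_job(client, job_id, timeout_s=300, interval_s=5,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job leaves IN_PROGRESS or the timeout expires.

    Returns (status, first_response_page). sleep and clock are injectable
    so the loop can be exercised without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status, response
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still running after {timeout_s}s")
        sleep(interval_s)


def collect_blocks(client, job_id, first_page):
    """Follow NextToken links and return the blocks from every page."""
    blocks = list(first_page.get("Blocks", []))
    token = first_page.get("NextToken")
    while token:
        page = client.get_document_analysis(JobId=job_id, NextToken=token)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
    return blocks
```

Inside extract_text_from_pdf, these could replace the two while loops: call wait_for_job first, check the returned status, then pass the first page to collect_blocks.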
Extracting Text from Images
Add the function to extract text from images:
def extract_text_from_image(file_bytes):
    try:
        # Synchronous call: the image bytes are sent directly, no S3 needed.
        response = client.detect_document_text(Document={'Bytes': file_bytes})
        lines = [item['Text'] for item in response['Blocks'] if item['BlockType'] == 'LINE']
        return lines
    except ClientError as e:
        st.error(f"AWS Textract error: {e}")
        return []
    except Exception as e:
        st.error(f"An unexpected error occurred: {e}")
        return []
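The main() function also calls extract_text_from_image_table, which is not listed in this post. The sketch below is one possible way to build such a function, not the author's implementation. It assumes you have already called client.analyze_document(Document={'Bytes': file_bytes}, FeatureTypes=['TABLES']) and passes the resulting response['Blocks'] list to a hypothetical helper, tables_from_blocks, which rebuilds each table as a grid of cell strings.

```python
def tables_from_blocks(blocks):
    """Rebuild tables from Textract blocks as lists of rows of cell text."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the Ids listed under CHILD relationships, if any.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    def cell_text(cell):
        # A cell's text is the space-joined text of its WORD children.
        children = [by_id[i] for i in child_ids(cell) if i in by_id]
        return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = [
            by_id[i] for i in child_ids(block)
            if by_id.get(i, {}).get("BlockType") == "CELL"
        ]
        if not cells:
            continue
        n_rows = max(cell["RowIndex"] for cell in cells)
        n_cols = max(cell["ColumnIndex"] for cell in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = cell_text(cell)
        tables.append(grid)
    return tables
```

Since main() calls st.table(table) and table.to_string(), each grid would need to be wrapped with pd.DataFrame(grid) before being returned to it.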
Step 5: Processing and Displaying Data
Add the function to process the Textract response and display extracted data:
def process_textract_response(response):
    try:
        if 'Blocks' not in response:
            raise ValueError("Response does not contain 'Blocks'")
        # Wrap the raw response in the parser's Document model.
        doc = Document(response)
        lines = [line.text for page in doc.pages for line in page.lines if line.text]
        tables = []
        key_values = {}
        for page in doc.pages:
            # Convert each table into a DataFrame, one list per row.
            for table in page.tables:
                table_data = []
                for row in table.rows:
                    row_data = [cell.text if cell.text else "" for cell in row.cells]
                    table_data.append(row_data)
                df = pd.DataFrame(table_data)
                tables.append(df)
            # Collect form fields as key-value pairs.
            for field in page.form.fields:
                key = field.key.text if field.key and field.key.text else ""
                value = field.value.text if field.value and field.value.text else ""
                key_values[key] = value
        # Use the first line mentioning 'Date' as the document date, if any.
        date = next((line for line in lines if 'Date' in line), None)
        return lines, tables, key_values, date
    except Exception as e:
        st.error(f"Error processing Textract response: {e}")
        return [], [], {}, None
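The date lookup above returns the whole first line containing the word 'Date', which may or may not hold an actual date. As an illustrative alternative, the sketch below searches the lines for two common date formats and returns a parsed date. The helper name find_first_date and the chosen formats are assumptions; in particular, slash-separated dates are read as month/day/year here, which will not suit every document.

```python
import re
from datetime import datetime

# Each entry pairs a regular expression with the strptime format used to
# validate and parse whatever the expression matched.
_DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),
]


def find_first_date(lines):
    """Return the first valid date found in the lines, or None."""
    for line in lines:
        for pattern, fmt in _DATE_PATTERNS:
            match = pattern.search(line)
            if not match:
                continue
            try:
                return datetime.strptime(match.group(0), fmt).date()
            except ValueError:
                # Looked like a date but is not one (e.g. month 13).
                continue
    return None


if __name__ == "__main__":
    print(find_first_date(["Invoice 1001", "Date: 2024-03-07"]))
```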
Step 6: Saving Results
The save logic is the final block inside main(), shown in Step 3. It is repeated here for reference: when the user clicks the button, the tables, key-value pairs, and detected date are written to a text file on the machine running Streamlit:
if st.button("Save to File"):
    with open(output_filename, "w") as text_file:
        text_file.write("Extracted Tables:\n")
        for i, table in enumerate(tables):
            text_file.write(f"Table {i+1}:\n")
            text_file.write(table.to_string() + "\n\n")
        text_file.write("Extracted Key-Value Pairs:\n")
        for key, value in key_values.items():
            text_file.write(f"{key}: {value}\n")
        if date:
            text_file.write(f"Date: {date}\n")
    st.success(f"Text saved to {output_filename}")
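The save block writes only the tables and key-value pairs, and it writes them to disk on the machine running the app. As a sketch of one alternative, the hypothetical helper format_report below builds the whole report, including the extracted text lines, as a single string. It assumes each table exposes a to_string() method, as pandas DataFrames do.

```python
def format_report(lines, tables, key_values, date=None):
    """Assemble the extracted data into one plain-text report string."""
    parts = ["Extracted Text:"]
    parts.extend(lines)
    parts.append("")
    parts.append("Extracted Tables:")
    for i, table in enumerate(tables, start=1):
        parts.append(f"Table {i}:")
        parts.append(table.to_string())  # DataFrame-style rendering
        parts.append("")
    parts.append("Extracted Key-Value Pairs:")
    for key, value in key_values.items():
        parts.append(f"{key}: {value}")
    if date:
        parts.append(f"Date: {date}")
    return "\n".join(parts) + "\n"
```

The resulting string could be written with text_file.write(report), or handed to st.download_button("Download results", data=report, file_name=output_filename) so the file lands on the user's machine instead of the server.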
For a complete walkthrough and access to the full source code of this AWS Textract data extraction application, please visit my project’s GitHub repository.
Run the Streamlit Application:
- In your terminal or command prompt, navigate to the directory containing `app.py`.
- Run the Streamlit application:
streamlit run app.py
Interact with the Application:
- Open the provided Streamlit URL in your web browser.
- Upload a PDF or image file to extract text using AWS Textract.
- View the extracted text, tables, and key-value pairs, or any errors, in the Streamlit interface.
Demo of the Application
To give you a quick overview of how the AWS Textract data extraction application works, I’ve included a demo GIF below. This short animation demonstrates the user-friendly interface of the application, from uploading a document to viewing the extracted tables and key-value pairs. You’ll see how the application seamlessly processes both PDFs and images, extracting valuable information with precision. This visual guide should help you understand the application’s capabilities and how it can be used to streamline document data extraction tasks.
Conclusion
Congratulations! You have built an AWS Textract data extraction application with Streamlit. This tutorial covered setting up AWS credentials, configuring a Python environment, building the Streamlit interface, and extracting text, tables, and key-value pairs from PDFs and images using AWS Textract. Automating these steps significantly reduces the time and effort required for manual data entry. From here, you can experiment further, for example by handling additional file formats or by changing how results are displayed and saved.
By leveraging AWS Textract, we can handle a wide range of document types and formats, making this solution versatile and scalable. Streamlit provides an intuitive interface for users to interact with the application, upload documents, and view extracted data in real-time.
Feel free to extend and customize this application to suit your specific needs. Happy coding!
Feel free to connect on LinkedIn!