AI Framework has You Covered on Image-to-Text Workflows

by ritabratamaiti, December 31st, 2024

Too Long; Didn't Read

What: Transform math equation images into LaTeX using AnyModal’s modular vision–language pipeline.
How: Use pretrained weights for quick inference or train a custom model with your own dataset.
Where: Find full examples, code, and model weights on GitHub and Hugging Face.
Why: Easily integrate multiple AI components (vision + text) without writing extensive bridging code.

About AnyModal

AnyModal is a framework designed to unify multiple “modalities” (such as images, text, or other data) into a single, coherent workflow. Instead of juggling separate libraries or writing custom code to bridge vision and language models, AnyModal provides a structured pipeline where each component—image encoders, tokenizers, language models—can be plugged in without heavy customization. By handling the underlying connections between these pieces, AnyModal lets you focus on the high-level process: feeding in an image, for instance, and getting out a textual result.


In practice, AnyModal can help with tasks like image captioning, classification, or, in the case demonstrated here, LaTeX OCR. Because the framework is modular, it is relatively simple to swap one model for another (e.g., a different vision backbone or a new language model), which makes it flexible for experimentation or specialized use cases.


The LaTeX OCR Use Case

Converting an image of a mathematical expression into a valid LaTeX string requires bridging computer vision and natural language processing. The image encoder’s job is to extract features or symbolic patterns from the equation, such as recognizing “plus,” “minus,” and other symbols. The language component then uses these features to predict the proper LaTeX tokens in sequence.
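
To make the encoder's output a little more concrete, here is a rough sketch of how many feature vectors a patch-based vision encoder produces for one image. It assumes a ViT-style encoder that splits a square image into non-overlapping square patches; the helper name `num_patch_tokens` is hypothetical and not part of AnyModal, and some encoders add an extra class token on top of this count.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# A 224x224 image cut into 16x16 patches gives 14 * 14 patches.
print(num_patch_tokens(224, 16))  # 196
```

Each of these patch features is what the language component receives, alongside the text prompt, when it predicts LaTeX tokens.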


LaTeX OCR with AnyModal is essentially a demonstration of how quickly you can pair a vision encoder with a language model. While this example specifically addresses equations, the overall approach can be extended to other image-to-text scenarios, including more advanced or specialized mathematical notation.


By the end of this tutorial, you will know how to use AnyModal with Llama 3.2 1B and Google’s SigLIP to build a small vision-language model (VLM) for LaTeX OCR. Here is a sample result:

Actual Caption: f ( u ) = u + \sum _ { n = o d d } \alpha _ { n } \left[ \frac { ( u - \pi ) } { \pi } \right] ^ { n } ,

Generated Caption using AnyModal/LaTeX-OCR-Llama-3.2-1B: f ( u ) = u + \sum _ { n = o o d } \alpha _ { n } \left[ \frac { ( u - \pi ) ^ { n } } { \pi } \right],


Note that the weights released at AnyModal/LaTeX-OCR-Llama-3.2-1B were obtained by training on only 20% of the unsloth/LaTeX_OCR dataset.


You will likely get a better model by training on the entire dataset for more epochs.


Quick Inference Example

For those primarily interested in generating LaTeX from existing images, here’s a demonstration using pretrained weights. This avoids the need to train anything from scratch, offering a quick path to see AnyModal in action. Below is a concise overview of setting up your environment, downloading the necessary models, and running inference.


Clone the AnyModal Repository:

git clone https://github.com/ritabratamaiti/AnyModal.git


Install the necessary libraries:

pip install torch torchvision huggingface_hub pillow


Then, download pretrained weights hosted on the Hugging Face Hub:

from huggingface_hub import snapshot_download

snapshot_download("AnyModal/LaTeX-OCR-Llama-3.2-1B", local_dir="latex_ocr_model")


These specific weights can be found here: LaTeX-OCR-Llama-3.2-1B on Hugging Face


Next, load the vision encoder and language model:

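# llm, anymodal, and vision are modules from the cloned AnyModal repository;
# run this script from a directory where they are importable.
import llm
import anymodal
import vision
from PIL import Image

# Load language model and tokenizer
tokenizer, model = llm.get_llm("meta-llama/Llama-3.2-1B")

# Load vision-related components.
# Note: the introduction names Google's SigLIP as the vision encoder, while this
# listing loads a ViT checkpoint; use the encoder the released weights were trained with.
image_processor, vision_model, vision_hidden_size = vision.get_image_encoder('google/vit-base-patch16-224')
vision_encoder = vision.VisionEncoder(vision_model)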

# Configure the multimodal pipeline
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision.Projector(vision_hidden_size, llm.get_hidden_size(tokenizer, model), num_hidden=1),
    language_tokenizer=tokenizer,
    language_model=model,
    prompt_text="The LaTeX expression of the equation in the image is:"
)

# Load the pretrained model weights
multimodal_model._load_model("latex_ocr_model")
multimodal_model.eval()


Finally, provide the image and get the LaTeX output:

# Replace with the path to your equation image
image_path = "path_to_equation_image.png"
image = Image.open(image_path).convert("RGB")
processed_image = image_processor(image, return_tensors="pt")
processed_image = {k: v.squeeze(0) for k, v in processed_image.items()}

latex_output = multimodal_model.generate(processed_image, max_new_tokens=120)
print("Generated LaTeX:", latex_output)


This short sequence of steps runs the entire pipeline: analyzing the image, projecting its features into the language model’s embedding space, and generating the corresponding LaTeX.
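
Because the model generates LaTeX token by token, nothing guarantees that the output is well formed. A cheap sanity check before rendering is to confirm that the curly braces are balanced. The sketch below is one way to do that; `braces_balanced` is a hypothetical helper (not part of AnyModal), and it only checks brace nesting, not full LaTeX validity.

```python
def braces_balanced(latex: str) -> bool:
    """Return True if unescaped { and } in the string are properly nested."""
    depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            # Skip the character after a backslash so \{ and \} are ignored.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace with no matching opener
                return False
        i += 1
    return depth == 0


print(braces_balanced(r"\frac { a } { b }"))  # True
print(braces_balanced(r"\frac { a } { b"))    # False
```

A failed check is a reasonable signal to retry generation or flag the image for manual review.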


Training Tutorial

For those who want more control, such as adapting the model to new data or exploring the mechanics of a vision-language pipeline, the training process provides deeper insight. The sections below illustrate how data is prepared, how the model’s components are integrated, and how they are jointly optimized.


Rather than relying on pretrained components alone, you can acquire a training dataset of paired images and LaTeX labels. One example is the unsloth/LaTeX_OCR dataset, which contains images of equations along with their LaTeX strings. After installing dependencies and setting up your dataset, the steps to train include creating a data pipeline, initializing the model, and looping through epochs.


Here’s an outline for preparing the dataset and loading it:

from torch.utils.data import Subset
import vision

# Load training and validation sets.
# image_processor is the one returned by vision.get_image_encoder(...) in the inference setup.
train_dataset = vision.ImageDataset("unsloth/LaTeX_OCR", image_processor, split='train')
val_dataset = vision.ImageDataset("unsloth/LaTeX_OCR", image_processor, split='test')

# Optionally use a smaller subset for faster iteration
subset_ratio = 0.2
train_dataset = Subset(train_dataset, range(int(subset_ratio * len(train_dataset))))
val_dataset = Subset(val_dataset, range(int(subset_ratio * len(val_dataset))))
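
Note that `range(int(subset_ratio * len(dataset)))` keeps the first 20% of examples in dataset order. If the dataset happens to be ordered in some way, a seeded random sample may be more representative. Here is a minimal sketch using only the standard library; `sample_indices` is a hypothetical helper, and the list it returns could be passed to `Subset` in place of the `range(...)` call.

```python
import random


def sample_indices(dataset_len: int, ratio: float, seed: int = 0) -> list[int]:
    """Pick a reproducible random subset of indices, sorted for readability."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    k = int(ratio * dataset_len)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return sorted(rng.sample(range(dataset_len), k))


indices = sample_indices(1000, 0.2)
print(len(indices))  # 200
```

Fixing the seed keeps the subset identical across runs, which makes experiments easier to compare.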


At this point, you’d build or reuse the same AnyModal pipeline described earlier. Instead of loading pretrained weights, you would initialize the model so it can learn from scratch or from partially pretrained checkpoints.

multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision.Projector(vision_hidden_size, llm.get_hidden_size(tokenizer, model), num_hidden=1),
    language_tokenizer=tokenizer,
    language_model=model,
    prompt_text="The LaTeX expression of the equation in the image is:"
)


You can then create a training loop to optimize the model’s parameters. A common approach uses PyTorch’s AdamW optimizer and optionally employs mixed-precision training for efficiency:

from tqdm import tqdm
import torch

optimizer = torch.optim.AdamW(multimodal_model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 gradient underflow
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

num_epochs = 5
multimodal_model.train()  # enable training-mode behavior such as dropout
for epoch in range(num_epochs):
    # Wrap the loader itself so tqdm can infer the number of batches
    for batch_idx, batch in enumerate(tqdm(train_loader, desc=f"Epoch {epoch+1} Training")):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            logits, loss = multimodal_model(batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
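
The loop above computes `loss` at every step but never reports it. A small running-average helper is one way to log progress per epoch. `RunningMean` below is a hypothetical helper, not part of AnyModal or PyTorch, and it assumes you feed it plain floats (for example `loss.item()`).

```python
class RunningMean:
    """Tracks the mean of a stream of float values."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    @property
    def mean(self) -> float:
        # Return 0.0 before the first update to avoid dividing by zero.
        return self.total / self.count if self.count else 0.0


meter = RunningMean()
for step_loss in [2.0, 1.0, 0.5]:  # stand-in values for loss.item()
    meter.update(step_loss)
print(f"mean loss: {meter.mean:.3f}")  # mean loss: 1.167
```

In the training loop, you would create one meter per epoch, call `meter.update(loss.item())` after each step, and print `meter.mean` when the epoch ends.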


After each epoch, or at least once training concludes, evaluate the model on a validation set to check that it generalizes to new data:

val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False)
multimodal_model.eval()  # switch off training-mode behavior such as dropout
with torch.no_grad():  # gradients are not needed for generation
    for batch_idx, batch in enumerate(val_loader):
        predictions = multimodal_model.generate(batch['input'], max_new_tokens=120)
        for idx, prediction in enumerate(predictions):
            print(f"Actual LaTeX: {batch['text'][idx]}")
            print(f"Generated LaTeX: {prediction}")


In addition to confirming performance, this validation step can guide improvements such as adjusting hyperparameters, switching to a different base model, or refining your data preprocessing. By following these training steps, you gain a better understanding of the interplay between the vision encoder and language model, and you can extend the workflow to additional tasks or more specialized domains.
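
Printed pairs are useful for spot checks, but a numeric score makes runs easier to compare. The captions shown earlier are space-separated LaTeX tokens, so one simple option is a token-level edit (Levenshtein) distance between the reference and the prediction. This is a sketch of one possible metric, not the evaluation used by the AnyModal authors; `token_edit_distance` is a hypothetical helper.

```python
def token_edit_distance(reference: str, prediction: str) -> int:
    """Minimum number of token insertions, deletions, or substitutions."""
    ref = reference.split()
    hyp = prediction.split()
    # prev[j] holds the distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a predicted token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(token_edit_distance(r"\frac { a } { b }", r"\frac { a } { c }"))  # 1
```

Dividing the distance by the number of reference tokens gives a length-normalized error rate that can be averaged over the validation set.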


Additional Resources