AI & Data Engineering

Using LLMs for Automated Data Extraction and Classification

February 10, 2025
By Ahmed Gharib

Extracting structured data from unstructured text is a common challenge in data engineering. Large Language Models (LLMs) have revolutionized this task, enabling more accurate and flexible extraction compared to traditional methods.

The Evolution of Data Extraction

Data extraction has evolved through several phases:

  1. Rule-based systems with regular expressions
  2. Machine learning classifiers with feature engineering
  3. Deep learning approaches like BERT and RoBERTa
  4. Large Language Models with in-context learning and task-specific fine-tuning

LLMs for Structured Data Extraction

1. Zero-Shot Information Extraction

Modern LLMs can extract structured information without task-specific training:

        from langchain.chat_models import ChatOpenAI
        
        # gpt-4 is a chat model, so it needs the chat wrapper rather than
        # the completion-style OpenAI class
        llm = ChatOpenAI(model_name="gpt-4")
        
        text = """
        In the quarterly earnings call on March 15, 2025, CEO Jane Smith announced 
        that XYZ Corporation achieved $1.2 billion in revenue, representing a 15% 
        year-over-year growth. The company's EBITDA margin improved to 28%, and 
        they plan to invest $300 million in R&D for their new AI product line.
        """
        
        prompt = f"""
        Extract the following information from the text as JSON:
        - Company name
        - CEO name
        - Revenue amount
        - Revenue growth percentage
        - EBITDA margin
        - R&D investment amount
        - Date of announcement
        
        Text: {text}
        
        JSON:
        """
        
        # predict() takes a single string; generate() expects a list of prompts
        result = llm.predict(prompt)
        print(result)
        

Expected output:

        {
          "company_name": "XYZ Corporation",
          "ceo_name": "Jane Smith",
          "revenue_amount": "$1.2 billion",
          "revenue_growth": "15%",
          "ebitda_margin": "28%",
          "rd_investment": "$300 million",
          "announcement_date": "March 15, 2025"
        }
        

2. Document Parsing and Table Extraction

LLMs can extract structured tables from text, even when formatting is inconsistent:

        from langchain.llms import Anthropic
        
        llm = Anthropic(model="claude-2")
        
        text = """
        Product Performance Summary - Q1 2025
        
        Product A: 12,500 units sold, $2.5M revenue, 22% margin
        Product B: 8,300 units sold, $1.8M revenue, 31% margin
        Product C: 15,750 units sold, $3.2M revenue, 18% margin
        """
        
        prompt = f"""
        Convert the following product performance text into a structured JSON array with 
        each product having units_sold, revenue, and margin fields:
        
        {text}
        """
        
        result = llm.predict(prompt)
        print(result)
        
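A plausible output (values carried over from the source text; the exact field layout will vary by model):

```json
[
  {"product": "Product A", "units_sold": 12500, "revenue": "$2.5M", "margin": "22%"},
  {"product": "Product B", "units_sold": 8300, "revenue": "$1.8M", "margin": "31%"},
  {"product": "Product C", "units_sold": 15750, "revenue": "$3.2M", "margin": "18%"}
]
```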

3. Classification and Categorization

LLMs excel at classifying text into categories based on content:

        customer_feedback = """
        I've been using your analytics dashboard for three months now, and while I love the 
        visualizations, the data refresh is too slow. Sometimes I wait 30 seconds for the 
        dashboard to update when I change date ranges. Could you optimize this?
        """
        
        prompt = f"""
        Categorize the following customer feedback into the most appropriate categories 
        from: UI/UX, Performance, Feature Request, Bug Report, Data Quality, or Security.
        
        Also assign a sentiment score from 1-5 where 1 is very negative and 5 is very positive.
        
        Provide your answer as JSON with 'categories' and 'sentiment_score' fields.
        
        Feedback: {customer_feedback}
        """
        
        result = llm.predict(prompt)
        print(result)
        

Building a Production-Ready Data Extraction Pipeline

1. Pre-processing for Optimal Extraction

Before sending text to LLMs, consider these preprocessing steps:

  • Document segmentation to handle context length limitations
  • Noise removal (headers, footers, watermarks)
  • Table and layout recognition for structured documents
  • Language detection for multilingual corpora
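The segmentation step can be sketched as a simple paragraph-boundary chunker. This is a minimal illustration: `max_chars` stands in for a real token budget, and a production pipeline would count tokens with the model's tokenizer instead.

```python
def segment_document(text: str, max_chars: int = 4000) -> list[str]:
    """Split a document into chunks that fit a model's context window.

    Splits on paragraph boundaries so each chunk stays semantically
    coherent; empty segments are dropped as part of noise removal.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # drop empty segments
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent through the extraction prompt separately and the results merged downstream.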

2. Prompt Engineering for Accuracy

Well-designed prompts significantly improve extraction quality:

  • Include examples of desired output format
  • Specify expected data types and formats
  • Add constraints and validation rules
  • Use chain-of-thought prompting for complex reasoning
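A prompt applying these ideas might look like the following sketch; the `FEW_SHOT_EXAMPLE` text and the field constraints are illustrative, not a prescribed format.

```python
# One worked example pins the output format for the model (few-shot prompting)
FEW_SHOT_EXAMPLE = (
    'Text: Acme Inc. reported $500 million in revenue on January 5, 2025.\n'
    'JSON: {"company_name": "Acme Inc.", "revenue_amount": "$500 million", '
    '"announcement_date": "2025-01-05"}'
)

def build_extraction_prompt(text: str) -> str:
    """Assemble an extraction prompt that shows the desired output format,
    pins expected data types, and adds a rule for missing values."""
    return (
        "Extract company_name (string), revenue_amount (string, e.g. "
        '"$1.2 billion"), and announcement_date (ISO 8601 date) from the '
        "text. Respond with JSON only; use null for any missing field.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n\n"
        f"Text: {text}\n"
        "JSON:"
    )
```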

3. Post-processing and Validation

LLM outputs require validation and normalization:

  • JSON schema validation
  • Data type conversion and normalization
  • Business rule validation
  • Outlier detection

        import json
        from jsonschema import ValidationError, validate
        
        # Define schema for validation
        schema = {
            "type": "object",
            "required": ["company_name", "revenue_amount", "announcement_date"],
            "properties": {
                "company_name": {"type": "string"},
                "revenue_amount": {"type": "string", "pattern": r"^\$[0-9]+(\.[0-9]+)? (billion|million)$"},
                # "format" is only enforced if a FormatChecker is passed to validate()
                "announcement_date": {"type": "string", "format": "date"}
            }
        }
        
        # Parse and validate the LLM output
        try:
            extracted_data = json.loads(llm_output)
            validate(instance=extracted_data, schema=schema)
            # Further processing...
        except json.JSONDecodeError:
            print("Failed to parse LLM output as JSON")
        except ValidationError as e:
            print(f"Validation error: {e}")
        

4. Human-in-the-Loop for Edge Cases

For critical applications, implement human review workflows:

  • Confidence scoring to flag uncertain extractions
  • Review queues for human validation
  • Feedback loops to improve extraction over time
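The routing logic can be sketched as follows, assuming each extraction arrives with some confidence estimate (the threshold and class names here are hypothetical):

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.8  # illustrative cutoff; tune per application

@dataclass
class ExtractionResult:
    data: dict
    confidence: float  # e.g. self-reported score or validator agreement

@dataclass
class ReviewQueue:
    pending: list = field(default_factory=list)

    def route(self, result: ExtractionResult) -> str:
        # Low-confidence extractions go to human review; the rest pass through
        if result.confidence < REVIEW_THRESHOLD:
            self.pending.append(result)
            return "needs_review"
        return "auto_approved"
```

Human corrections from the queue can then feed the training data used for fine-tuning, closing the feedback loop.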

Fine-tuning for Domain-Specific Extraction

For specialized domains, fine-tuning LLMs can improve accuracy:

  • Create domain-specific training data
  • Fine-tune smaller models for production deployment
  • Consider instruction tuning for specific extraction tasks
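Reviewed extractions can be recycled into training data. A minimal sketch, assuming the common prompt/completion JSONL format (field names vary by provider):

```python
import json

def to_training_record(text: str, extraction: dict) -> str:
    """Format one human-reviewed extraction as a JSONL line in the
    prompt/completion style used for instruction tuning."""
    record = {
        "prompt": f"Extract the financial figures from this text as JSON:\n{text}",
        "completion": json.dumps(extraction),
    }
    return json.dumps(record)
```

Writing one such line per reviewed document yields a JSONL file that most fine-tuning APIs accept after format-specific adjustments.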

Cost and Performance Optimization

LLM-based extraction can be expensive at scale. Consider:

  • Using smaller models for simple extraction tasks
  • Implementing caching for repeated extractions
  • Batching requests where possible
  • Using embeddings + retrieval for large document sets
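Caching can be sketched as a hash-keyed memo around whatever extraction call you use; `extract_fn` is a stand-in for the real LLM call:

```python
import hashlib

_cache: dict[str, dict] = {}

def cached_extract(text: str, extract_fn) -> dict:
    """Memoize extraction results so identical inputs never trigger a
    second (paid) model call. Keyed on a hash of the normalized text."""
    key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(text)
    return _cache[key]
```

In production the in-memory dict would typically be replaced by Redis or a database table so the cache survives restarts and is shared across workers.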

Conclusion

LLMs have transformed data extraction capabilities, enabling more flexible and accurate processing of unstructured data. By implementing proper pre-processing, prompt engineering, validation, and optimization techniques, data engineers can build robust extraction pipelines that unlock value from previously inaccessible data sources.

About the Author

Ahmed Gharib

Advanced Analytics Engineer with expertise in data engineering, machine learning, and AI integration.