Project Stats

Name
Ritual
Industry
Online Ordering
Employees
50-100
Location
Toronto, Ontario
Year
2022
Duration
6 months
Team
2 AI Engineers, 1 Fullstack Developer
Tech Stack
  • Scrapy
  • LangChain
  • Pandas
Ritual | Automating Merchant Data Collection with GPT

Project Overview

Ritual, a leading ordering app with over $130M in funding, connects coworkers and colleagues for seamless pick-up and payment at local restaurants and coffee shops. To enhance operational efficiency and reduce costs, Ritual partnered with 10xStudio to develop AI-driven solutions that automate customer support processes and streamline data collection.

The project focused on creating a robust system to extract accurate and structured information from merchant websites, such as addresses, opening hours, and delivery availability.
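As a concrete illustration, the structured record such a system targets might look like the sketch below. The field names are illustrative, not Ritual's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MerchantRecord:
    """Illustrative target schema for data extracted from a merchant site."""
    name: str
    address: Optional[str] = None                      # e.g. "12 King St W, Toronto"
    opening_hours: dict = field(default_factory=dict)  # day -> "08:00-16:00"
    delivery_available: Optional[bool] = None
    source_urls: list = field(default_factory=list)    # pages the data came from

record = MerchantRecord(
    name="Cafe Uno",
    address="12 King St W, Toronto",
    opening_hours={"mon": "08:00-16:00"},
    delivery_available=True,
    source_urls=["https://example.com/contact"],
)
```

Defining the schema up front makes every downstream step (extraction, aggregation, validation) a matter of filling in or reconciling these fields.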

Goals

• Implement GPT-powered automation to replace manual processes in customer support and sales.

• Develop a system to scrape accurate and consistent merchant data, such as locations and delivery options.

• Navigate anti-bot blocking technology and dynamically loaded content for uninterrupted scraping.

• Minimize processing costs and improve system efficiency with optimized AI workflows.

Challenges

1. HTML Noise in Scraped Data:

Websites often contained unnecessary HTML elements that interfered with meaningful data extraction. Cleaning this noise without losing critical details required a tailored solution.

2. Inconsistent Data Across Pages:

Merchant websites frequently spread essential details like addresses and operational hours across multiple pages, so there was no one-size-fits-all solution. Aggregating and validating this information to ensure accuracy was a complex task.

3. Anti-Bot Blocking Technology:

Many websites employed advanced anti-bot systems to prevent automated scraping. Overcoming these protections required careful implementation of adaptive techniques to ensure uninterrupted data collection.

4. Navigating JavaScript and Link Cycles:

• Dynamically loaded content through JavaScript made traditional scraping methods ineffective.

• Internal link cycles on some websites created infinite loops during crawling, necessitating solutions to limit scraping depth.

5. Advanced Techniques Integration:

Integrating new techniques, including Visual LLMs and Agents, was necessary to process data efficiently and to handle the diverse formats and structures found across websites.

Solutions

1. HTML Cleaning and Optimization:

• Built a custom algorithm to remove HTML noise, preserving only relevant content for data extraction.

• Reduced token usage in GPT calls by minimizing extraneous data, lowering costs and improving response times.
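The idea behind this kind of cleaning can be sketched with Python's standard-library HTML parser: strip tags whose content is never useful for extraction (scripts, styles, navigation) and keep only visible text. This is a minimal illustration, not Ritual's actual algorithm:

```python
from html.parser import HTMLParser

class NoiseStripper(HTMLParser):
    """Keep visible text; drop <script>/<style>/<nav>-style noise."""
    SKIP = {"script", "style", "nav", "footer", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

def clean_html(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    return "\n".join(parser._chunks)

page = "<html><script>var x=1;</script><p>Open daily 8-16</p><nav>Home</nav></html>"
print(clean_html(page))  # Open daily 8-16
```

Because GPT is billed per token, passing only the cleaned text instead of the raw markup is what drives the cost reduction noted above.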

2. Advanced AI Integration:

• Utilized Agents to dynamically navigate complex websites and extract relevant data efficiently.

• Applied Visual LLMs to interpret structured and semi-structured visual elements, such as menus or tables, directly from merchant websites.
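A multimodal request of this kind typically packages a page screenshot alongside a text prompt. The sketch below builds such a payload in the OpenAI-style chat format; the model name is an assumption and no request is actually sent:

```python
import base64

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """Assemble an OpenAI-style multimodal chat payload (illustrative only)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",  # assumed model name, not necessarily what Ritual used
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_vision_request(b"\x89PNG...",
                               "Extract the menu items and prices as JSON.")
```

The same screenshot-plus-prompt pattern works for tables, menus, and other layouts that are awkward to parse from raw HTML.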

3. Accurate Data Aggregation:

• Extracted data from multiple pages and aggregated it into a unified format, ensuring consistency and accuracy across attributes like addresses and delivery availability.

• Applied filtering techniques to eliminate duplicates and validate extracted information.
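With pandas (part of the stack above), merging per-page extractions into one record per merchant can be sketched as taking the first non-null value for each attribute. The sample data is invented:

```python
import pandas as pd

# Invented sample: one merchant's details scattered across three pages.
records = [
    {"name": "Cafe Uno", "page": "/contact", "address": "12 King St W", "hours": None},
    {"name": "Cafe Uno", "page": "/about",   "address": None,           "hours": "8-16"},
    {"name": "Cafe Uno", "page": "/home",    "address": "12 King St W", "hours": None},
]
df = pd.DataFrame(records)

# "first" skips nulls, so each attribute gets its first non-null value,
# and duplicate addresses collapse into one row per merchant.
merged = df.groupby("name", as_index=False).agg({"address": "first", "hours": "first"})
```

Validation (e.g. cross-checking an address that appears on several pages) then operates on this single consolidated record rather than on scattered fragments.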

4. Overcoming Anti-Bot Technology:

• Implemented adaptive scraping techniques to bypass anti-bot protections.

• Conducted continuous monitoring to identify and counter evolving anti-bot mechanisms.
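One common adaptive tactic, sketched here with a stubbed fetch function rather than Ritual's actual implementation, is to retry blocked requests with a different User-Agent and an exponential backoff delay:

```python
import itertools
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def fetch_with_rotation(url, fetch, max_tries=3, base_delay=0.0):
    """Retry `fetch(url, user_agent)` with rotating User-Agents on a 403.
    `fetch` is a caller-supplied function returning (status, body)."""
    agents = itertools.cycle(USER_AGENTS)
    for attempt in range(max_tries):
        status, body = fetch(url, next(agents))
        if status != 403:
            return body
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"blocked after {max_tries} attempts: {url}")

# Stub: pretend the site blocks only the first User-Agent.
def fake_fetch(url, ua):
    return (403, "") if "Windows" in ua else (200, "<html>menu</html>")

print(fetch_with_rotation("https://example.com", fake_fetch))  # <html>menu</html>
```

Real deployments layer further measures on top (proxy rotation, request pacing, headless-browser rendering), which is why the continuous monitoring mentioned above matters.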

5. Scrapy for Depth-Limited Crawling:

• Used Scrapy to restrict crawling depth to avoid infinite loops caused by cyclical internal links.

• Combined with Agents to prioritize high-value pages dynamically, maximizing data collection efficiency.
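Scrapy exposes this through its DEPTH_LIMIT setting; the underlying idea (a visited set plus a depth cutoff, so cyclical links can never loop forever) can be sketched in plain Python over a mock link graph:

```python
from collections import deque

# Mock site: /a and /b link to each other, forming a cycle.
LINKS = {
    "/": ["/a", "/menu"],
    "/a": ["/b"],
    "/b": ["/a", "/contact"],
    "/menu": [],
    "/contact": [],
}

def crawl(start, max_depth=2):
    """Breadth-first crawl that stops at max_depth and never revisits a URL."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached; do not follow further links
        for nxt in LINKS.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

print(crawl("/"))  # ['/', '/a', '/menu', '/b']; /contact is beyond depth 2
```

The visited set breaks the /a ↔ /b cycle, and the depth cutoff bounds how far the crawler wanders from the start page, which is exactly the failure mode the internal link cycles created.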

Results

1. Streamlined Data Extraction:

Successfully extracted structured, accurate data about restaurants, including operational hours, delivery options, and multiple branch locations.

2. Cost Savings:

Optimized HTML cleaning and reduced token usage in GPT calls, cutting operational costs and improving system performance.

3. Improved Data Accuracy:

By leveraging Visual LLMs and robust aggregation methods, the system provided consistent, high-quality merchant data with minimal errors.

4. Enhanced Resilience:

Adaptive techniques for anti-bot detection and JavaScript navigation ensured reliable scraping, even under restrictive conditions.

5. Foundation for Scalability:

With advanced AI tools and a scalable architecture, the system is well-positioned to handle future growth and additional use cases.

Ready to grow your idea 10x?

MVPs, POCs, production-grade pipelines: we have built it all

No salespeople, no commitment