Project Overview
Ritual, a leading ordering app with over $130M in funding, connects coworkers for seamless pick-up and payment at local restaurants and coffee shops. To enhance operational efficiency and reduce costs, Ritual partnered with 10xStudio to develop AI-driven solutions that automate customer support processes and streamline data collection.
The project focused on creating a robust system to extract accurate and structured information from merchant websites, such as addresses, opening hours, and delivery availability.
Goals
• Implement GPT-powered automation to replace manual processes in customer support and sales.
• Develop a system to scrape accurate and consistent merchant data, such as locations and delivery options.
• Circumvent anti-bot blocking technology and handle dynamically loaded content to keep scraping uninterrupted.
• Minimize processing costs and improve system efficiency with optimized AI workflows.
Challenges
1. HTML Noise in Scraped Data:
Websites often contained unnecessary HTML elements that interfered with meaningful data extraction. Cleaning this noise without losing critical details required a tailored solution.
2. Inconsistent Data Across Pages:
Merchant websites frequently spread essential details like addresses and operational hours across multiple pages, so there was no one-size-fits-all solution. Aggregating and validating this information to ensure accuracy was a complex task.
3. Anti-Bot Blocking Technology:
Many websites employed advanced anti-bot systems to prevent automated scraping. Overcoming these protections required careful implementation of adaptive techniques to ensure uninterrupted data collection.
4. Navigating JavaScript and Link Cycles:
• Dynamically loaded content through JavaScript made traditional scraping methods ineffective.
• Internal link cycles on some websites created infinite loops during crawling, necessitating solutions to limit scraping depth.
5. Advanced Techniques Integration:
New technologies, including Visual LLMs and Agents, had to be integrated to process data more efficiently and cope with the diverse formats and structures found across websites.
Solutions
1. HTML Cleaning and Optimization:
• Built a custom algorithm to remove HTML noise, preserving only relevant content for data extraction.
• Reduced token usage in GPT calls by minimizing extraneous data, lowering costs and improving response times.
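The custom cleaning algorithm itself is proprietary, but the core idea can be sketched with Python's standard-library HTML parser: drop tags that rarely carry merchant data (scripts, styles, navigation chrome) and collapse whitespace, so far fewer tokens reach the GPT call. The tag list below is an illustrative assumption, not the production configuration.

```python
from html.parser import HTMLParser


class NoiseStripper(HTMLParser):
    """Collect visible text while skipping tags that rarely hold merchant data."""

    NOISE = {"script", "style", "noscript", "svg", "nav", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0   # >0 while inside a noise element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def clean_html(raw_html: str) -> str:
    """Return only the visible, content-bearing text of a page."""
    parser = NoiseStripper()
    parser.feed(raw_html)
    return " ".join(parser.chunks)
```

Because billing for GPT calls scales with input tokens, every stripped `<script>` or `<style>` block translates directly into lower cost per page.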
2. Advanced AI Integration:
• Utilized Agents to dynamically navigate complex websites and extract relevant data efficiently.
• Applied Visual LLMs to interpret structured and semi-structured visual elements, such as menus or tables, directly from merchant websites.
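As a hedged illustration of the Visual LLM step (the production pipeline is not shown in this case study), a request to a vision-capable model such as OpenAI's gpt-4o can be assembled as below. The model name, prompt, and helper function are assumptions made for this sketch.

```python
def build_vision_request(image_url: str, question: str) -> dict:
    """Assemble a chat-completion payload asking a vision-capable model
    to read a structured visual element (menu, hours table) from an image.
    Hypothetical helper; model choice and prompt are illustrative only."""
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


# With the OpenAI Python SDK this payload would be sent as, e.g.:
# client.chat.completions.create(**build_vision_request(url, question))
```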
3. Accurate Data Aggregation:
• Extracted data from multiple pages and aggregated it into a unified format, ensuring consistency and accuracy across attributes like addresses and delivery availability.
• Applied filtering techniques to eliminate duplicates and validate extracted information.
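The merge-and-filter step can be sketched as follows, assuming each crawled page yields a partial dict of merchant fields: later pages only fill fields earlier pages left empty, and list-valued fields (such as branch locations) are de-duplicated while preserving order. Field names here are illustrative.

```python
def aggregate_merchant_data(page_results: list[dict]) -> dict:
    """Merge per-page extractions into one merchant record.

    A later page only fills a field that earlier pages left empty, and
    list fields are de-duplicated while preserving first-seen order.
    """
    merged: dict = {}
    for result in page_results:
        for field, value in result.items():
            if isinstance(value, list):
                seen = merged.setdefault(field, [])
                for item in value:
                    if item not in seen:      # drop duplicates across pages
                        seen.append(item)
            elif field not in merged or not merged[field]:
                merged[field] = value         # fill only missing/empty fields
    return merged
```

Keeping the merge deterministic (first non-empty value wins) makes validation simpler: any conflict between pages is visible as a discarded value rather than a silently overwritten one.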
4. Overcoming Anti-Bot Technology:
• Implemented adaptive scraping techniques to bypass anti-bot protections.
• Conducted continuous monitoring to identify and counter evolving anti-bot mechanisms.
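The exact countermeasures used in production are not detailed in this case study. One common adaptive technique, rotating User-Agent headers combined with exponential backoff on block responses, can be sketched like this; the header pool, retry counts, and injected `fetch` transport are assumptions for the example.

```python
import random
import time

# Illustrative pool; a real deployment would rotate many more identities.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


def fetch_with_retries(url, fetch, max_attempts=4, base_delay=1.0):
    """Retry with a fresh User-Agent and exponential backoff whenever the
    site answers with a block status (HTTP 403/429).

    `fetch(url, headers)` is an injected transport returning (status, body),
    which keeps the retry policy testable without real network access.
    """
    for attempt in range(max_attempts):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        status, body = fetch(url, headers)
        if status not in (403, 429):
            return body
        time.sleep(base_delay * 2 ** attempt)  # back off before retrying
    raise RuntimeError(f"blocked after {max_attempts} attempts: {url}")
```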
5. Scrapy for Depth-Limited Crawling:
• Used Scrapy to restrict crawling depth to avoid infinite loops caused by cyclical internal links.
• Combined with Agents to prioritize high-value pages dynamically, maximizing data collection efficiency.
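In Scrapy, depth limiting is a built-in setting (`DEPTH_LIMIT`) and revisits are caught by its duplicate filter. The underlying idea can be sketched in plain Python as a breadth-first crawl that tracks depth and refuses to revisit URLs, so cyclic internal links cannot loop forever; `get_links` stands in for fetching a page and extracting its hrefs.

```python
from collections import deque


def crawl(start_url: str, get_links, max_depth: int = 3) -> list[str]:
    """Breadth-first crawl that stops at max_depth and skips seen URLs.

    `get_links(url)` is an injected callable returning the links on a page,
    which makes the traversal logic testable without real HTTP requests.
    """
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                       # mirrors Scrapy's DEPTH_LIMIT cutoff
        for link in get_links(url):
            if link not in visited:        # mirrors Scrapy's duplicate filter
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Breadth-first order also plays well with the agent-driven prioritization described above: shallow, high-value pages (home, contact, locations) are reached before deep pagination chains.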
Results
1. Streamlined Data Extraction:
Successfully extracted structured, accurate data about restaurants, including operational hours, delivery options, and multiple branch locations.
2. Cost Savings:
HTML cleaning reduced token usage in GPT calls, cutting operational costs and improving system performance.
3. Improved Data Accuracy:
By leveraging Visual LLMs and robust aggregation methods, the system provided consistent, high-quality merchant data with minimal errors.
4. Enhanced Resilience:
Adaptive techniques for anti-bot detection and JavaScript navigation ensured reliable scraping, even under restrictive conditions.
5. Foundation for Scalability:
With advanced AI tools and a scalable architecture, the system is well-positioned to handle future growth and additional use cases.