How does HypeLab process 200 million ad events into ML-ready features? We use our distributed preprocessing framework running on cloud-native pipeline infrastructure to transform raw Web3 advertising data into optimized training data every two weeks. This preprocessing pipeline handles categorical encoding, target encoding, type preservation, and device filtering at scale, ensuring our crypto ad network delivers accurate predictions for advertisers and publishers alike.
At HypeLab, the leading Web3 ad platform, we serve billions of ad impressions across premium crypto apps like Phantom, Rainbow, Zerion, and DeBank. Every click, impression, and conversion generates event data that feeds our machine learning models. But raw events contain strings, nested structures, inconsistent formats, and missing values. The model expects numeric features in consistent shapes. Bridging this gap at scale requires serious data engineering, and getting it right means better ROI for advertisers and higher eCPMs for publishers.
Why does preprocessing matter for Web3 advertising?
Preprocessing quality directly affects model accuracy, which determines whether advertisers see strong ROI and publishers maximize revenue. Poor preprocessing leads to unreliable predictions, wasted ad spend, and missed monetization opportunities.
Why Does ML Preprocessing Need Its Own Infrastructure?
Dedicated preprocessing infrastructure separates concerns and enables scale. For a blockchain ad network processing hundreds of millions of events from apps like MetaMask, StepN, and Axie Infinity, in-training preprocessing simply does not work. Without proper preprocessing infrastructure, models train on inconsistent data, predictions fail in production, and ad performance suffers. Here is why dedicated infrastructure matters:
Memory constraints. Raw event data does not fit in memory on a single machine. Even if it did, transforming it in memory would require holding both raw and transformed representations simultaneously.
Compute efficiency. Preprocessing involves operations that parallelize well: encoding, filtering, joining. These operations should run on distributed infrastructure optimized for data processing, not on ML training infrastructure optimized for gradient computation.
Reproducibility. Preprocessing must be deterministic. The same raw data should produce the same features every time. Running preprocessing as part of training makes this harder to guarantee and harder to debug when things go wrong.
HypeLab scale: 200 million events spanning several weeks, processed into compressed training-ready datasets. Raw data includes device models, publisher slugs, creative types, and dozens of categorical fields from our network of 500+ Web3 publishers.
What Is Our Distributed Preprocessing Framework and Why Use It for Ad Tech?
Our distributed preprocessing framework provides a unified model for defining data processing pipelines, and it has become essential infrastructure for crypto ad networks handling high-volume event data. The key abstraction is a distributed dataset that transformations operate on. You define your pipeline using Python, then execute it on a managed pipeline service optimized for cloud infrastructure.
The "write once, run anywhere" approach means the same pipeline code can execute locally during development, on managed cloud infrastructure in production, or on alternative distributed compute platforms if needed. For Web3 advertising platforms like HypeLab, this flexibility is valuable. We develop and test locally, then deploy to our managed pipeline service for production runs processing data from publishers across DeFi, gaming, and wallet verticals.
Distributed pipeline programming takes getting used to. It looks like Python, but the execution model is fundamentally different. Operations do not execute immediately; they define a computation graph that the execution engine optimizes and executes. This is powerful but requires thinking in terms of transformations over collections rather than imperative loops.
Core concepts:
Distributed dataset: An immutable collection of data distributed across workers
Transform: An operation from one dataset to another
Pipeline: A directed graph of transforms connecting datasets
Runner: The execution engine (local for development, managed cloud service for production)
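The deferred-execution model above can be illustrated with a toy stand-in. This `Dataset` class is a hypothetical sketch, not the real framework's API: each transform only records an operation on an immutable dataset, and nothing executes until `run()` walks the graph.

```python
# Toy illustration of deferred execution: transforms build a graph,
# and no work happens until the graph is run. Hypothetical API.
class Dataset:
    def __init__(self, source, ops=()):
        self.source, self.ops = source, ops  # immutable: ops is a tuple

    def map(self, fn):
        # Returns a NEW dataset with one more node; does not transform data yet.
        return Dataset(self.source, self.ops + (("map", fn),))

    def filter(self, fn):
        return Dataset(self.source, self.ops + (("filter", fn),))

    def run(self):
        # Only here does computation actually happen (in the real framework,
        # this step is distributed across workers).
        out = list(self.source)
        for kind, fn in self.ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

events = Dataset([{"clicks": 3}, {"clicks": 0}])
graph = events.filter(lambda e: e["clicks"] > 0).map(lambda e: e["clicks"])
# No work has happened yet; run() executes the whole graph.
result = graph.run()  # [3]
```

The key point is that `graph` is a description of work, not a result, which is exactly why imperative-loop intuitions mislead.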
What Data Does the Preprocessing Pipeline Receive?
The preprocessing pipeline does not operate on raw event logs. An upstream cleaning job handles deduplication, filtering invalid events, and basic normalization. The preprocessing pipeline receives cleaned data covering several weeks of history.
Why this window size? The window balances two concerns. We want enough history to compute stable statistics for target encoding. Rare feature combinations need sufficient data to estimate click rates reliably. But we do not want stale data biasing the model toward outdated patterns. Our current window is validated by model performance experiments.
Input data lives in BigQuery, partitioned by date. The pipeline reads from BigQuery using Beam's BigQuery source, which handles pagination and parallelization automatically. We specify the date range and relevant columns, and Beam handles the rest.
How Does Categorical Encoding Work for Ad Data?
Machine learning models consume numbers, not strings. A device model like "iPhone 15 Pro Max" must become an integer the model can process. This categorical encoding (mapping string categories to integers) is fundamental to how blockchain ad platforms like HypeLab predict click-through rates and optimize ad delivery.
Simple encoding assigns integers sequentially: device model A becomes 0, B becomes 1, and so on. The challenge is consistency. The same device model must map to the same integer in training and inference. New device models appearing after training must map to an "unknown" category.
Our pipeline builds encoding dictionaries from the training data. Each categorical field gets a dictionary mapping values to integers. Values below a frequency threshold get mapped to a special "rare" category rather than their own integer. This bounds the vocabulary size and ensures the model does not memorize rare values.
Encoding example: Device model has thousands of distinct values, but the long tail (devices seen fewer than 100 times) gets grouped into a single "rare" category. This reduces vocabulary size while preserving signal for common devices.
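A minimal sketch of this dictionary-building step follows. The id layout (0 for "rare", 1 for "unknown") and the threshold value are illustrative, not HypeLab's production layout; in the real pipeline the dictionary is built as a distributed transform over the training window.

```python
from collections import Counter

RARE_ID, UNKNOWN_ID = 0, 1  # reserved ids; illustrative layout

def build_encoder(training_values, min_count=100):
    """Map frequent values to stable integers; everything else shares a bucket."""
    counts = Counter(training_values)
    # Sorted for determinism: the same data always yields the same dictionary.
    frequent = sorted(v for v, c in counts.items() if c >= min_count)
    vocab = {v: i + 2 for i, v in enumerate(frequent)}  # ids start at 2
    seen = set(counts)

    def encode(value):
        if value in vocab:
            return vocab[value]
        # Below-threshold values seen in training map to "rare";
        # values never seen in training map to "unknown".
        return RARE_ID if value in seen else UNKNOWN_ID

    return encode

enc = build_encoder(["iPhone 15"] * 150 + ["RareFone"] * 3, min_count=100)
enc("iPhone 15")  # 2 (frequent, gets its own id)
enc("RareFone")   # 0 (seen but below threshold -> rare)
enc("NewPhone")   # 1 (never seen in training -> unknown)
```

Because the dictionary is frozen at training time, training and inference are guaranteed to see the same mapping.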
What Is Target Encoding and Why Does It Improve Ad Predictions?
Target encoding is a powerful technique for high-cardinality categorical features. Instead of assigning arbitrary integers, we encode each category by its historical relationship with the target variable (clicks).
For example, instead of encoding publisher A as 1 and publisher B as 2, we encode each publisher by its historical click-through rate. A DeFi dashboard like DeBank with 2% CTR gets encoded as 0.02. A portfolio tracker like Zerion with 1.5% CTR becomes 0.015. This embeds predictive signal directly into the feature value.
Target encoding requires careful implementation to avoid data leakage. We compute target statistics on the training data only, never including validation or test data. We also apply smoothing, blending category-specific statistics with global statistics to handle categories with few observations.
Target encoding approach:
We blend category-specific click rates with global statistics using a smoothing parameter
The smoothing parameter balances category-specific signal against the global prior
This prevents rare categories with extreme CTRs from dominating predictions
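The smoothing described above can be written as a single blend. The exact functional form and the smoothing value below are assumptions for illustration, not HypeLab's production formula:

```python
def target_encode(clicks, impressions, global_ctr, smoothing=100.0):
    """Blend a category's observed CTR with the global prior.

    With few impressions the encoding stays near global_ctr; with many
    impressions it converges to the category's own click rate.
    """
    return (clicks + smoothing * global_ctr) / (impressions + smoothing)

# A publisher with 20 clicks on 1,000 impressions, global CTR 1%:
# (20 + 100 * 0.01) / (1000 + 100) = 21 / 1100, slightly below the raw 2%
target_encode(20, 1000, 0.01)

# A brand-new publisher with no data falls back to the global prior:
target_encode(0, 0, 0.01)  # 0.01
```

Computed on training data only, this is what prevents a category with three impressions and one click from being encoded as a 33% CTR.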
Why Is Numeric Type Preservation Critical for ML Pipelines?
A subtle but critical preprocessing concern is preserving numeric types. Floats must stay floats. Integers must stay integers. Mixing them causes problems because some ML frameworks handle mixed types poorly, and serialization formats may lose precision.
Pipeline transformations and optimized output formats both support typed data, but you must be explicit. Our pipeline enforces type consistency at each transformation stage. If a numeric field enters as float64, it must exit as float64.
This matters for features like bid amounts (float), impression counts (integer), and time-of-day (integer hour). Accidental type coercion (converting an integer to a float) can introduce subtle bugs that are hard to catch during training but cause issues in production inference.
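One way to catch coercion early is an explicit type check at stage boundaries. This is a simplified sketch; the field names and schema representation are illustrative, not the pipeline's actual schema machinery:

```python
# Hypothetical per-field expected types for a feature row.
SCHEMA = {"bid_amount": float, "impressions": int, "hour": int}

def enforce_types(row, schema=SCHEMA):
    """Fail fast on silent type coercion instead of letting it reach training."""
    for field, expected in schema.items():
        if type(row[field]) is not expected:  # strict check, no int->float slack
            raise TypeError(
                f"{field}: expected {expected.__name__}, "
                f"got {type(row[field]).__name__}"
            )
    return row

enforce_types({"bid_amount": 0.45, "impressions": 3, "hour": 14})   # passes
# enforce_types({"bid_amount": 0.45, "impressions": 3.0, "hour": 14})  # TypeError
```

Failing loudly at the transformation boundary is far cheaper than debugging a production inference discrepancy weeks later.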
How Does Device Model Filtering Improve Prediction Accuracy?
Device model is one of our most important features because click behavior varies significantly by device. Crypto users on iPhone 15 Pro behave differently than those on budget Android devices. But the device model space is enormous. New devices launch constantly. The long tail of rare devices provides little predictive signal while inflating vocabulary size.
Our pipeline filters device models aggressively. Models seen fewer than a threshold number of times get mapped to a generic category based on device type (smartphone, tablet, desktop). This preserves signal for common devices while bounding complexity.
The filtering threshold is tuned experimentally. Too aggressive, and we lose signal from moderately common devices. Too lenient, and the vocabulary explodes. Our current threshold was determined by measuring model performance across different settings.
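The filtering step can be sketched as follows. The threshold value and the `other_<type>` bucket naming are illustrative assumptions; the production threshold is the experimentally tuned one described above:

```python
from collections import Counter

def filter_device_models(rows, min_count=50):
    """Map rare device models to a generic bucket keyed by device type.

    min_count=50 is illustrative; the real threshold is tuned experimentally.
    """
    counts = Counter(r["device_model"] for r in rows)
    out = []
    for r in rows:
        if counts[r["device_model"]] < min_count:
            # Rare model: fall back to a generic per-device-type bucket.
            r = {**r, "device_model": f"other_{r['device_type']}"}
        out.append(r)
    return out

rows = [{"device_model": "Pixel 8", "device_type": "smartphone"}] * 60 \
     + [{"device_model": "ObscureTab", "device_type": "tablet"}]
filtered = filter_device_models(rows)
# The rare tablet collapses to "other_tablet"; Pixel 8 rows are unchanged.
```

In the distributed pipeline the counting pass and the mapping pass are separate transforms, since the counts must be global across all workers.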
Why Is Device Category Standardization Essential for Web3 Ad Targeting?
Device category (smartphone, tablet, desktop, TV, etc.) comes from user agent parsing. Different parsing libraries produce slightly different category names. Upstream data might contain "mobile" and "smartphone" referring to the same category. Traffic from Phantom wallet users on iOS differs from MetaMask browser extension users on desktop, and the model needs consistent categorization to learn these patterns for effective crypto advertising targeting.
Standardization maps variant names to a canonical vocabulary. This is a simple lookup transformation, but it must be comprehensive. Missing mappings create new categories that the model has not seen, degrading prediction quality.
We maintain the mapping dictionary as a configuration file, updated when we discover new variants. The preprocessing pipeline applies this mapping early, ensuring all downstream transformations see consistent categories.
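The lookup itself is simple; what matters is that gaps surface loudly rather than silently minting new categories. The variant map below is a small illustrative subset, not the full configuration file:

```python
# Illustrative subset of the variant -> canonical mapping kept in config.
CANONICAL = {
    "mobile": "smartphone",
    "phone": "smartphone",
    "smartphone": "smartphone",
    "tablet": "tablet",
    "desktop": "desktop",
    "pc": "desktop",
    "tv": "tv",
}

def standardize_category(raw):
    """Map a parsed user-agent category to the canonical vocabulary."""
    category = CANONICAL.get(raw.strip().lower())
    if category is None:
        # Surface unmapped variants loudly so the config gets updated,
        # instead of leaking a never-before-seen category into the model.
        raise KeyError(f"unmapped device category: {raw!r}")
    return category

standardize_category("Mobile")  # "smartphone"
```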
Why Is an Optimized Columnar Format Best for ML Training Data?
The preprocessing pipeline outputs optimized columnar format files to cloud storage. For Web3 ad platforms handling massive datasets, this format is ideal for ML training data:
- Columnar storage: Models read specific features, not entire rows. Columnar format enables reading only needed columns.
- Compression: Columnar formats achieve 5-10x compression on our data, reducing storage cost and I/O time.
- Type preservation: Schema enforcement prevents the type coercion bugs mentioned earlier.
- Partitioning: Large datasets can be partitioned into multiple files, enabling parallel reads during training.
Our training pipeline reads these files directly into distributed computing infrastructure. The preprocessing and training pipelines are decoupled: preprocessing runs on our managed pipeline service, training runs on cloud ML infrastructure, and optimized training data in cloud storage connects them.
What Makes Distributed Pipeline Programming Challenging to Learn?
Distributed pipeline programming is genuinely difficult to learn. It looks like Python, but thinking imperatively will lead you astray.
The key mental shift is that pipeline code defines a computation graph, not a sequence of operations. When you write a transformation, you are not transforming data; you are adding a node to the graph. The execution engine runs the graph later, potentially on hundreds of workers in parallel.
Debugging is particularly challenging. Errors may occur on remote workers, far from where you defined the transformation. Print statements do not work as expected because execution is distributed. You must use proper logging facilities and learn to read execution logs in the cloud console.
Debugging tips:
Test locally first: Local execution makes debugging tractable
Use assertions liberally: Validate assumptions about data shapes and types
Sample data: Test transformations on a small sample before running at scale
Log strategically: Use proper logging (not print) and check execution logs in cloud console
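In practice these tips combine: run the per-element logic on a tiny sample with assertions before submitting the distributed job. `encode_row` and the field names below are hypothetical stand-ins for a real transform:

```python
# Hypothetical per-element transform, tested locally on a small sample
# before running at scale.
def encode_row(row):
    return {"device_id": int(row["device_id"]), "ctr": float(row["ctr"])}

sample_rows = [{"device_id": "7", "ctr": "0.021"}]

for out in map(encode_row, sample_rows):
    # Validate shapes and types up front; failures on remote workers
    # are far more expensive to diagnose.
    assert set(out) == {"device_id", "ctr"}
    assert type(out["device_id"]) is int and type(out["ctr"]) is float
```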
How Do You Manage Pipeline Operations in Production?
Running the pipeline on managed cloud infrastructure involves operational considerations beyond writing correct code.
Worker configuration. The pipeline service autoscales workers based on data volume and processing rate. You can set min/max worker counts and machine types. For our workload, the worker pool typically scales up during peak processing stages and back down as they complete.
Cost management. Cloud pipelines charge for worker time. Large pipelines can become expensive. We monitor costs and tune configurations because more workers finish faster but cost proportionally more. The optimal balance depends on urgency.
Failure handling. The pipeline service retries failed tasks automatically, but some failures require intervention. We monitor pipeline status and have alerts for stuck or failed pipelines.
Job templates. For recurring pipelines, we use templates (pre-compiled pipeline definitions that can be launched with different parameters). Our biweekly preprocessing runs use a template, launched by a scheduled job.
How Does Preprocessing Fit Into the Full ML Pipeline?
Preprocessing is one stage in a larger ML pipeline that powers HypeLab's Web3 ad platform:
Full pipeline: Raw events (BigQuery) > Cleaning job (SQL + some Python) > Preprocessed data (optimized training data via our pipeline service) > Model training (distributed computing on cloud ML infrastructure) > Model selection > Calibration > Model registry > Production deployment
Each stage is independently testable and deployable. Preprocessing failures do not affect the cleaning job. Training failures do not corrupt preprocessed data. This modularity makes the overall system more reliable and easier to maintain.
The interfaces between stages are well-defined. Preprocessing expects cleaned data in BigQuery with specific columns. Training expects optimized training data in cloud storage with specific schema. These contracts are documented and tested.
What Performance Can You Expect from Distributed Preprocessing?
Our preprocessing pipeline processes approximately 200 million events in about 45 minutes using autoscaled cloud workers on distributed cloud infrastructure.
HypeLab preprocessing vs. alternatives:
Pandas on a single machine: Would take 8+ hours and likely run out of memory
Self-managed distributed compute: Similar scale, but more operational complexity
Our managed pipeline approach: 45 minutes, fully managed, autoscaling
This combination gives us production-grade reliability with minimal operational burden.
The bottleneck is not computation but I/O: reading from BigQuery and writing to cloud storage. Our pipeline infrastructure handles I/O parallelization well, but there are limits. Adding more workers beyond a certain point does not speed up I/O-bound stages.
We have tuned the pipeline over time, primarily by reducing unnecessary data shuffling. Operations that require shuffling (grouping, joining) are expensive. Where possible, we restructure transformations to avoid shuffles or perform them on smaller intermediate datasets.
How Does This Impact Advertisers and Publishers?
The bottom line: Preprocessing quality directly affects model quality, which affects advertiser ROI and publisher revenue. Every improvement in our preprocessing pipeline translates to better ad targeting and higher earnings for our network.
If categorical encoding is inconsistent, the model sees different features in training versus inference. Predictions become unreliable. If target encoding leaks test data, the model appears better than it is. Performance degrades in production.
If preprocessing is slow, model updates are delayed. The system cannot adapt quickly to changing conditions in the fast-moving crypto market. If preprocessing is expensive, costs eat into margins.
Good preprocessing is invisible when done right. The ML model just works. Advertisers running campaigns on HypeLab see strong click-through rates. Publishers monetizing with our network see consistent, competitive eCPMs. But getting preprocessing wrong causes cascading failures that ultimately hurt everyone in the ecosystem.
What results does this enable for HypeLab's network?
Our preprocessing pipeline supports real-time bidding across premium Web3 inventory from apps like Phantom, Zerion, DeBank, and StepN. The resulting ML models help advertisers reach crypto-native audiences efficiently while maximizing publisher revenue through intelligent ad selection.
What Are the Key Lessons for Building ML Preprocessing Pipelines?
For teams building similar pipelines, whether for crypto ad networks, DeFi analytics, or other Web3 data processing, we offer these recommendations:
- Invest in learning distributed pipelines properly. The learning curve is steep, but the capability is worth it. Half-learned pipeline programming leads to buggy systems.
- Test locally before scaling. Local execution catches most bugs without the complexity of distributed execution.
- Define clear interfaces. Document what each pipeline stage expects and produces. Test those contracts.
- Monitor costs from the start. Cloud pipelines can get expensive. Build cost awareness into your development process.
- Prefer optimized columnar formats for ML data. Columnar, compressed, typed. It is the right format for the job.
Preprocessing is not exciting, but it is foundational. Get it right, and the rest of the ML system builds on solid ground. At HypeLab, this infrastructure enables us to deliver industry-leading ad performance for crypto advertisers while maximizing revenue for our publisher partners.
Ready to leverage ML-powered ad targeting for your Web3 project? Advertisers can launch campaigns on HypeLab's premium publisher network in minutes, with access to 500+ crypto apps and games. Publishers can start monetizing with high-quality, brand-safe crypto ads today. Get started at app.hypelab.com.
Frequently Asked Questions
- Why not just use pandas? Pandas works well for data that fits in memory, but ad event data at scale exceeds single-machine limits. Our distributed preprocessing framework provides horizontal scaling. The same pipeline code runs on a laptop during development and on hundreds of cloud workers in production. This scalability is essential when processing hundreds of millions of events.
- What transformations does the pipeline perform? The preprocessing pipeline handles: categorical encoding (converting strings to integers for model consumption), target encoding (encoding categories by their historical click rates), float and integer preservation (maintaining numeric types through the pipeline), device model filtering (reducing rare device models to a manageable vocabulary), and device category standardization. The output is optimized training data ready for model consumption.
- How often does preprocessing run? The preprocessing pipeline runs every two weeks, aligned with the model retraining schedule. Each run processes several weeks of cleaned event data. This sliding window approach ensures the model trains on recent data while having enough history for stable feature statistics. The window balances recency against sufficient data volume for rare feature combinations.