As soon as a CAR file is uploaded to IPFS via the web3.storage API, the file is copied to an S3 bucket and a message is enqueued in an SQS queue. The Indexing Lambda is configured to use this SQS queue as its event trigger.
Upon starting, the Lambda streams the file (leveraging the fact that the CAR format is optimized for sequential parsing) and validates its format.
For each data block found in the CAR file, the function saves the block type, offset and length in the Blocks DynamoDB table. The record also contains the list of all CAR files in which the block was found. The multihash of the block, which is a universally unique identifier for the block itself, is enqueued in an SQS queue that is later consumed by the Publishing Subsystem.
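The sequential scan described above can be sketched as follows. This is a minimal illustration, not the actual Lambda code: it assumes a CARv1 layout (a varint-prefixed header followed by sections of the form varint length, CID, block data), it skips decoding and validating the header, and the names BlockEntry, read_cid and iter_blocks are hypothetical. It also assumes that the stored offset is the position of the block data within the file.

```python
import io
from typing import BinaryIO, Iterator, NamedTuple, Optional, Tuple


class BlockEntry(NamedTuple):
    multihash: bytes  # raw multihash bytes taken from the block's CID
    codec: int        # multicodec code from the CID (the "block type")
    offset: int       # byte offset of the block data inside the CAR file
    length: int       # length of the block data in bytes


def read_exact(stream: BinaryIO, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated CAR file")
    return data


def read_varint(stream: BinaryIO) -> Optional[Tuple[int, bytes]]:
    """Read an unsigned LEB128 varint; return (value, raw bytes), or None at a clean EOF."""
    value, shift, raw = 0, 0, b""
    while True:
        byte = stream.read(1)
        if not byte:
            if raw:
                raise ValueError("truncated varint")
            return None
        raw += byte
        value |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return value, raw
        shift += 7


def require_varint(stream: BinaryIO) -> Tuple[int, bytes]:
    result = read_varint(stream)
    if result is None:
        raise ValueError("truncated CAR file")
    return result


def skip(stream: BinaryIO, n: int) -> None:
    """Move past n bytes without keeping them (seek when possible, else read and discard)."""
    if stream.seekable():
        stream.seek(n, io.SEEK_CUR)
        return
    while n:
        chunk = stream.read(min(n, 65536))
        if not chunk:
            raise ValueError("truncated CAR file")
        n -= len(chunk)


def read_cid(stream: BinaryIO) -> Tuple[int, bytes, int]:
    """Return (codec, multihash, bytes consumed) for a CIDv0 or CIDv1."""
    first = read_exact(stream, 1)[0]
    if first == 0x12:  # CIDv0: a bare sha2-256 multihash, implicitly dag-pb (0x70)
        rest = read_exact(stream, 33)
        if rest[0] != 0x20:
            raise ValueError("malformed CIDv0")
        return 0x70, bytes([first]) + rest, 34
    if first != 0x01:
        raise ValueError("unsupported CID version")
    codec, codec_raw = require_varint(stream)
    _, hash_raw = require_varint(stream)
    digest_len, len_raw = require_varint(stream)
    digest = read_exact(stream, digest_len)
    multihash = hash_raw + len_raw + digest
    return codec, multihash, 1 + len(codec_raw) + len(multihash)


def iter_blocks(stream: BinaryIO) -> Iterator[BlockEntry]:
    """Yield one entry per block, reading only the section prefix and CID of each block."""
    header_len, raw = require_varint(stream)
    skip(stream, header_len)  # header decoding and validation omitted in this sketch
    pos = len(raw) + header_len
    while True:
        section = read_varint(stream)
        if section is None:
            return  # clean end of file
        section_len, raw = section
        pos += len(raw)
        codec, multihash, cid_len = read_cid(stream)
        data_len = section_len - cid_len
        if data_len < 0:
            raise ValueError("section shorter than its CID")
        data_offset = pos + cid_len
        skip(stream, data_len)  # the block payload itself is never inspected
        pos = data_offset + data_len
        yield BlockEntry(multihash, codec, data_offset, data_len)
```

Each yielded entry would then be written to the Blocks table and its multihash enqueued for publishing; those AWS calls are left out here.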
At the same time, the Lambda also stores its progress through the CAR file (represented as the byte offset of the last analyzed block in the file) in another DynamoDB table.
The indexing status of each CAR file is not stored solely so that other systems can track progress or report it to the user. One very interesting aspect of the CAR format is that the parsing cost is determined not by the file size but by the number of blocks inside the file.
When a CAR file represents a big archive file (e.g. an ISO image), the number of blocks in the file will be roughly the total file size in KB divided by 256 (which is the typical raw block size). As the indexer does not have to read the entire raw block but only the block header (which is only a few bytes), parsing and indexing will be very fast.
When the CAR file is instead made of a lot of very small files (e.g. an Ethereum transactions log), it can contain millions of blocks even if the overall file size is relatively small (hundreds of MBs). In that case the parsing and indexing will be much slower, because the cost of the database writes (which are network-close but not machine-local) becomes significant.
In the latter case, it is very easy to fail to process a file within the maximum execution time allowed by AWS Lambda (15 minutes). Also, in case of errors, it would be very costly to have to restart indexing from the beginning of the file.
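A back-of-the-envelope calculation makes the contrast between the two cases concrete. The sketch below uses the 256 KB block size and the 15-minute limit from the text; the per-block write cost, the 4 GiB ISO size and the block count of the small-files archive are hypothetical figures chosen only for illustration, not measurements.

```python
import math

RAW_BLOCK_SIZE_KB = 256            # typical raw block size, from the text
LAMBDA_TIMEOUT_SECONDS = 15 * 60   # maximum AWS Lambda execution time
ASSUMED_WRITE_SECONDS = 0.005      # hypothetical cost of indexing one block


def blocks_in_archive(file_size_kb: int) -> int:
    """Approximate block count for a CAR file that wraps one large file."""
    return math.ceil(file_size_kb / RAW_BLOCK_SIZE_KB)


def fits_in_one_run(block_count: int,
                    seconds_per_block: float = ASSUMED_WRITE_SECONDS) -> bool:
    """True if all blocks could be indexed before the Lambda timeout."""
    return block_count * seconds_per_block <= LAMBDA_TIMEOUT_SECONDS


iso_blocks = blocks_in_archive(4 * 1024 * 1024)  # hypothetical 4 GiB ISO image
small_files_blocks = 2_000_000                   # hypothetical archive of tiny files

print(iso_blocks)                           # 16384
print(fits_in_one_run(iso_blocks))          # True
print(fits_in_one_run(small_files_blocks))  # False
```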
Thus, we update the status of each CAR file after every block is analyzed, and the Indexing Lambdas are able to resume indexing exactly where the previous run left off.
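The resume logic can be sketched as below. This is only an illustration under stated assumptions: a plain dictionary stands in for the DynamoDB progress table, the function name index_with_checkpoint and its parameters are hypothetical, and blocks already indexed are skipped by iterating past them, whereas a real implementation would more likely start reading the file at the stored offset.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def index_with_checkpoint(
    car_id: str,
    blocks: Iterable[Tuple[int, bytes]],         # (offset, multihash) in file order
    progress: Dict[str, int],                    # stand-in for the progress table
    handle_block: Callable[[int, bytes], None],  # e.g. write the record, enqueue the multihash
    deadline: float,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Index blocks until done or until the deadline; return True when the file is complete."""
    resume_after = progress.get(car_id, -1)
    for offset, multihash in blocks:
        if offset <= resume_after:
            continue  # already indexed by a previous run
        if clock() >= deadline:
            return False  # out of time; the next run resumes from the checkpoint
        handle_block(offset, multihash)
        progress[car_id] = offset  # checkpoint after every block
    return True
```

Because the checkpoint is written only after a block has been handled, a run that stops or fails midway repeats at most the block that was in flight.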