Bulk Delete Documents in Azure Cosmos DB Container

While working on a project, I came across a use case that required bulk deleting documents in an Azure Cosmos DB container using Azure Data Factory (ADF). There are a few approaches, but most involve implementing either a stored procedure or an Azure Function. This article walks through the steps to bulk delete documents in Cosmos DB entirely within an ADF data flow.

Problem statement

An Azure Cosmos DB SQL API container holds a large number of documents. The Azure Data Factory pipeline needs to bulk delete the existing documents before loading a new set.

Step 1: Prerequisites

  • Access to an Azure subscription
  • A data source, such as a CSV or Excel file containing the data
  • A data sink, in this case an Azure Cosmos DB SQL API instance
  • An ADF pipeline that extracts data from the source, transforms it, and loads it into the Cosmos DB sink

Step 2: Add a data flow to the existing pipeline to bulk delete documents

Azure Data Factory DataFlow

i. Add a source pointing to the Cosmos DB container that holds the documents to be deleted.
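
Behind the designer, ADF represents a data flow as a script, viewable through the ‘Script’ button on the data flow canvas. The sketches in the following steps show roughly what each step contributes to that script; stream names such as CosmosSource are placeholders of my own, not names ADF assigns. A bare Cosmos DB source looks approximately like this:

    source(allowSchemaDrift: true,
        validateSchema: false,
        format: 'document') ~> CosmosSource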

Azure Data Factory DataFlow Source

ii. Go to the ‘Source options’ tab of the source and select ‘Query’ so you can enter the SQL query that retrieves the documents to delete. In this example, I select all three-bedroom home listings so they can be deleted from Cosmos DB.
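
Note that the query should return at least the document id (and the partition key, if the container is partitioned), since Cosmos DB needs these to delete a document; selecting all columns covers this. In the generated script the query appears as a source option. A minimal sketch, assuming the listing documents carry a bedrooms property:

    source(allowSchemaDrift: true,
        validateSchema: false,
        format: 'document',
        query: 'SELECT * FROM c WHERE c.bedrooms = 3') ~> CosmosSource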

Azure Data Factory DataFlow Alter Row

iii. Add an ‘Alter Row’ transformation after the source and go to ‘Alter row settings’. Select ‘Delete if’ under ‘Alter row conditions’ and enter ‘true()’ as the expression; this marks every row returned by the query for deletion.
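
In script form this step is a single line; MarkAsDelete is again a placeholder stream name:

    CosmosSource alterRow(deleteIf(true())) ~> MarkAsDelete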

Azure Data Factory DataFlow Sink

iv. Add a sink pointing to the same Cosmos DB container.

Azure Data Factory DataFlow Sink settings

v. Go to the sink settings and select ‘Allow delete’ under ‘Update method’. If the container has a large number of documents to delete, set a ‘Batch size’ to split the deletion into smaller batches.
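
A sketch of the corresponding sink definition, assuming only deletes are allowed (insert, update, and upsert switched off):

    MarkAsDelete sink(allowSchemaDrift: true,
        validateSchema: false,
        deletable: true,
        insertable: false,
        updateable: false,
        upsertable: false,
        format: 'document') ~> CosmosSink

The ‘Batch size’ value set in the UI also ends up in this sink definition; the exact property name is easiest to confirm by setting it in the settings pane and checking the generated script.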

To bulk delete documents in Azure Cosmos DB, Azure Data Factory provides an easier approach because the flow lives entirely in the data factory and does not call external stored procedures or functions. This also allows better control over the data flow and makes troubleshooting easier when errors occur.