Cleaning up 1000+ rows of bad real estate data

bernard

Suppose you had 1,000 rows of real estate data, each row corresponding to a project with aggregate data (features, price ranges, square footage, etc.) that contains errors, specifically miscategorized bedroom counts, which makes aggregates like the price ranges or size ranges for studios and 1-bedrooms faulty.

The data has been collected by scraping other real estate sites, which in turn get it from agents, who are the ones making the "mistakes", likely to get more views when people sort by lowest price and such.

Would you attempt to clean this up algorithmically with some kind of data engineering voodoo, or would you clean it up manually, perhaps by building a backend for it with Retool, for example, and having some virtual assistants do it?

Since it's aggregate data and project features, it's unlikely to change much for a while, so I'm considering whether yearly manual updates would be worth it. No one else seems to be doing this, though, so I wonder if they just don't care. Either way, I need accurate data for my purposes.
 
For something like this, as much as I'm into "automating the planet", I'd consider a manual process, or at the very least something with a "human in the loop" for supervision. Unless you can define the full list of things that need to happen to "fix" the data, plus every possible thing to look out for, and that list never changes, any attempt at porting it to a script isn't going to work well. This doesn't sound like something that can be put into a simple algo; you'd constantly be adding edge cases to cover the spots where the algo doesn't know what to do.

You could consider an AI agent that can help reason things out, but with something like 1,000 rows it's more cost-effective, in both time and money, to let a good VA sort it out.
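To make the human-in-the-loop part concrete, here's a rough sketch in Python, assuming the projects sit in a CSV; the column names (min_price, max_price, sqm_min, sqm_max) are made up. The idea is to flag anything obviously broken and only hand the flagged rows to the VA, rather than auto-correcting anything.

```python
# Human-in-the-loop sketch: flag broken aggregates for a VA to work through
# in Retool or a spreadsheet instead of auto-correcting them.
# Column names (min_price, max_price, sqm_min, sqm_max) are hypothetical.
import pandas as pd

df = pd.read_csv("projects.csv")

def looks_broken(row) -> bool:
    # Example rules only; the list grows as new edge cases turn up.
    if row["min_price"] <= 0 or row["sqm_min"] <= 0:
        return True  # impossible values
    if row["min_price"] > row["max_price"]:
        return True  # inverted price range
    if row["sqm_min"] > row["sqm_max"]:
        return True  # inverted size range
    return False

df["needs_review"] = df.apply(looks_broken, axis=1)

# The VAs only ever see the flagged subset.
df[df["needs_review"]].to_csv("review_queue.csv", index=False)
print(df["needs_review"].sum(), "rows flagged for manual review")
```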
 
I would definitely attempt to automate it first (with a safe backup). Because doing it manually sounds like a total bitch.

I would use an agent for this by providing it the correct data and telling it to audit for mistakes.

If you have it locally, it might be worth showing it to Claude Code or any agent in your IDE: put it in a directory, open the agent in it, and tell it what you need done.

If the size of the data is a problem, you can always batch it out by splitting the 1,000 rows into 5 groups of 200, for example, and doing them one at a time.
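Something like this would do the splitting, assuming the rows are in a plain CSV (the file names and batch size are just examples):

```python
# Split the export into fixed-size batches so each agent run stays small.
# File name and batch size are examples only.
import csv
from itertools import islice

BATCH_SIZE = 200

with open("projects.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for i, batch in enumerate(iter(lambda: list(islice(reader, BATCH_SIZE)), [])):
        with open(f"batch_{i:02d}.csv", "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(batch)
```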

The hard part would be providing clean guidelines for which data entries count as incorrect; otherwise it's highly likely to make mistakes. And if setting up a reliable automation system gets too time-consuming, it may be faster to just do it manually.
 
It's kind of difficult for me to see how I can tell the difference between a scraped 1-bedroom and a misclassified studio, unless I can look at the images.

Of course I can make certain "approximate rules" for when listings cluster at the top or bottom boundaries for size and price, but from what I see, a 50 sqm apartment can often be either a studio or a 1-bedroom, depending on whether the owner decided to put up a dividing wall, for example. For a user, you'd want to know that there's a separate bedroom, even if the project technically lists it as a studio, right?
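For what it's worth, the "approximate rules" I have in mind look roughly like this; the 40-55 sqm overlap band and the column names are guesses, not measured numbers, and anything inside the band just gets marked for a closer look rather than reclassified:

```python
# Flag size combinations that sit at the studio / 1-bed boundary instead of
# trying to hard-classify them. Column names and the 40-55 sqm overlap band
# are assumptions, not measured values.
import pandas as pd

df = pd.read_csv("projects.csv")

OVERLAP_LOW, OVERLAP_HIGH = 40, 55  # sqm band where both labels are plausible

def size_check(row) -> str:
    if row["unit_type"] == "studio" and row["sqm_min"] > OVERLAP_HIGH:
        return "likely mislabeled 1-bed"
    if row["unit_type"] == "1br" and row["sqm_max"] < OVERLAP_LOW:
        return "likely mislabeled studio"
    if OVERLAP_LOW <= row["sqm_min"] <= OVERLAP_HIGH:
        return "ambiguous - check images"
    return "ok"

df["size_check"] = df.apply(size_check, axis=1)
print(df["size_check"].value_counts())
```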

Of course I could use one of the image recognition AI APIs to look at the images and deduce the bedrooms, but that also raises the amount of scraping to much higher levels. It's not so much the price, which is fairly cheap through, say, Scrapingbee, but the fact that it's kind of a shitty thing to do to another real estate website.
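If I did go the image route, the API call itself wouldn't be the hard part. A rough sketch against OpenAI's vision-capable chat API; the model name, prompt, and URL are placeholders, and it obviously doesn't solve the scraping-volume issue:

```python
# Rough sketch of asking a vision-capable model whether a listing photo or
# floor plan shows a separate bedroom. Model, prompt, and URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def bedroom_label(image_url: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this floor plan or photo show a separate bedroom? "
                         "Answer exactly 'studio' or '1+ bedroom'."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()

print(bedroom_label("https://example.com/listing-photo.jpg"))
```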

The plan was more to get some good initial data, get some traction, then get data directly from real estate agencies and tell them to clean up their data.
 