Methodology
How cleaning works
CleanSheet is deterministic. The same input produces the same output every time, every change is shown to you before it is applied, and nothing is ever guessed or invented. This page documents every rule the tool uses, including its assumptions, so you can decide whether it fits your data.
Column detection
The tool samples up to the first 100 rows of each column and tests them against known patterns. A column is assigned a type (phone, email, date, ZIP code, or number) only when at least 60% of its sampled values match that type. Anything below that threshold is treated as plain text and receives only whitespace and capitalization cleanup. A structured-looking column that fails detection is never force-converted.
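The threshold rule above can be sketched in a few lines of Python. This is an illustration only, not CleanSheet's actual code: the patterns, the names `PATTERNS` and `detect_type`, the order in which types are tried, and the choice to count blank cells as non-matches are all assumptions. Phone and date patterns are left out to keep the sketch short.

```python
import re

# Hypothetical patterns for illustration; the tool's real patterns are not published here.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "zip": re.compile(r"\d{5}(-?\d{4})?"),
    "number": re.compile(r"-?\d+(\.\d+)?"),
}

SAMPLE_SIZE = 100   # "up to the first 100 rows"
THRESHOLD = 0.60    # "at least 60% of its sampled values"


def detect_type(values):
    """Return the first type whose pattern matches >= 60% of the sample, else "text"."""
    sample = [v.strip() for v in values[:SAMPLE_SIZE]]
    if not sample:
        return "text"
    for name, pattern in PATTERNS.items():
        matches = sum(1 for v in sample if pattern.fullmatch(v))
        # Assumption: blank cells stay in the sample and count as non-matches.
        if matches / len(sample) >= THRESHOLD:
            return name
    return "text"
```

With this sketch, a column where three of five sampled values look like emails is typed as email, while a column with only one email-like value in five falls back to plain text.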
Phone numbers
- US 10-digit numbers in any format (555.123.4567, 5551234567, (555) 123 4567) become (555) 123-4567.
- 11-digit numbers starting with 1 become +1 (555) 123-4567.
- Anything else (international formats, extensions, short numbers) is left exactly as it was. The tool never truncates or reformats a number it cannot confidently parse.
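The three phone rules above could be written roughly as follows. This is a sketch under stated assumptions, not the tool's implementation: the function name `clean_phone` and the set of separator characters it accepts are hypothetical. Any value containing other characters (such as an extension marker) is returned untouched.

```python
import re


def clean_phone(value):
    """Format US 10- and 11-digit numbers; return everything else unchanged."""
    text = value.strip()
    has_plus = text.startswith("+")
    body = text[1:] if has_plus else text
    # Assumption: only digits, spaces, dots, hyphens and parentheses are
    # treated as "confidently parseable" separators.
    if not re.fullmatch(r"[\d\s().-]+", body):
        return value
    digits = re.sub(r"\D", "", body)
    if len(digits) == 10 and not has_plus:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    if len(digits) == 11 and digits[0] == "1":
        d = digits[1:]
        return f"+1 ({d[:3]}) {d[3:6]}-{d[6:]}"
    # International formats, extensions and short numbers fall through as-is.
    return value
```

The fall-through `return value` at the end is the important design choice: the original string, not a partially cleaned one, is what comes back when no rule applies.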
Email addresses
Emails are lowercased and trimmed of surrounding whitespace, so an address typed with capital letters or stray spaces is stored in its lowercase, trimmed form. Invalid-looking emails are left unchanged rather than “fixed.”
Dates
Recognized dates are standardized to ISO format (YYYY-MM-DD), which sorts chronologically even when a spreadsheet treats the column as plain text. Recognized input formats include 12/25/2023, 2023-1-5, 12-25-2023, Jan 3 2024, and 3 Jan 2024. European dotted dates (25.12.2023) are not yet supported and are left unchanged.
Important assumption: ambiguous slash dates are read as US month-first. 4/1/24 becomes 2024-04-01 (April 1), not January 4. If your data uses day-first (UK/EU) dates, do not enable date cleaning on that column yet; a day-first option is on the roadmap. Unparseable values are left unchanged.
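To make the month-first assumption concrete, here is a minimal Python sketch that tries a fixed list of formats in order and leaves anything unparseable alone. The format list and the name `clean_date` are assumptions chosen to cover the examples on this page; the tool's real parser may differ. Month abbreviations such as "Jan" are matched according to the process locale, so the sketch assumes an English or C locale.

```python
from datetime import datetime

# Hypothetical format list covering the examples on this page.
# Slash and dash dates are month-first, matching the US assumption.
FORMATS = [
    "%Y-%m-%d",   # 2023-1-5
    "%m/%d/%Y",   # 12/25/2023
    "%m/%d/%y",   # 4/1/24
    "%m-%d-%Y",   # 12-25-2023
    "%b %d %Y",   # Jan 3 2024
    "%d %b %Y",   # 3 Jan 2024
]


def clean_date(value):
    """Return the date as YYYY-MM-DD, or the original value if no format fits."""
    text = value.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    # Dotted European dates and other unrecognized values are left unchanged.
    return value
```

Because no day-first format appears in the list, 4/1/24 can only be read as April 1, which is exactly the behavior to watch for with UK/EU data.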
Names and text
- Repeated spaces and tabs are collapsed to a single space, and leading/trailing whitespace is removed.
- Values that look like names get title case with correct handling of apostrophes and hyphens: sam o’neil becomes Sam O’Neil, mary-jane becomes Mary-Jane.
- Business suffixes stay uppercase: LLC, LLP, PLLC, LP.
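The text rules in the list above might look like this in Python. It is a sketch, not the tool's code: the names `title_case` and `SUFFIXES` are hypothetical, the step that decides whether a value "looks like a name" is omitted, and only unaccented a-z letters are handled. It also assumes a letter after an apostrophe is capitalized only when the apostrophe follows a single leading letter (as in O’Neil), so a possessive such as sam's is not turned into Sam'S.

```python
import re

SUFFIXES = {"LLC", "LLP", "PLLC", "LP"}  # business suffixes that stay uppercase


def _cap_word(word):
    word = word.lower()
    # Uppercase the first letter of the word and the letter after each hyphen.
    word = re.sub(r"(?:^|(?<=-))[a-z]", lambda m: m.group(0).upper(), word)
    # Uppercase after an apostrophe only in the O'Neil / D'Angelo pattern.
    word = re.sub(r"^([A-Z]['’])([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), word)
    return word


def title_case(value):
    """Collapse whitespace, then title-case each word, keeping suffixes uppercase."""
    text = re.sub(r"\s+", " ", value).strip()
    words = []
    for word in text.split(" "):
        if word.strip(".,").upper() in SUFFIXES:
            words.append(word.upper())
        else:
            words.append(_cap_word(word))
    return " ".join(words)
```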
ZIP codes
5-digit ZIPs are kept as 5 digits; 9-digit ZIPs become ZIP+4 format (12345-6789). Other postal formats are left unchanged.
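As a rough illustration of the ZIP rule, the following Python sketch accepts exactly five digits or exactly nine digits (with or without a hyphen) and returns everything else untouched. The function name `clean_zip` is hypothetical, and whether the tool accepts an already-hyphenated ZIP+4 as input is an assumption.

```python
import re


def clean_zip(value):
    """Keep 5-digit ZIPs, format 9-digit ZIPs as ZIP+4, leave the rest alone."""
    text = value.strip()
    if re.fullmatch(r"\d{5}", text):
        return text
    match = re.fullmatch(r"(\d{5})-?(\d{4})", text)
    if match:
        return f"{match.group(1)}-{match.group(2)}"
    # Non-US postal codes and malformed values are returned as they came in.
    return value
```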
What the tool never does
- Never invents or fills in missing values. Empty cells stay empty.
- Never deletes rows or columns. Duplicate column headers are renamed (Email, Email_2) so no data is lost.
- Never applies a change without showing it to you first in the before/after preview.
- Never sends your data to an AI model. All cleaning is rule-based and runs on our server.
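The duplicate-header renaming mentioned in the list above (Email, Email_2) can be sketched as a simple counter. The function name `dedupe_headers` is hypothetical, and numbering a third duplicate as Email_3 is an assumption extrapolated from the documented example. The sketch does not handle the edge case where a file already contains a column literally named Email_2.

```python
def dedupe_headers(headers):
    """Rename repeated headers with a numeric suffix so every column keeps its data."""
    seen = {}
    result = []
    for header in headers:
        seen[header] = seen.get(header, 0) + 1
        count = seen[header]
        # The first occurrence keeps its name; later ones get _2, _3, ...
        result.append(header if count == 1 else f"{header}_{count}")
    return result
```

For example, `dedupe_headers(["Email", "Name", "Email"])` yields `["Email", "Name", "Email_2"]`, so both email columns survive the export.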
Limits
Free tier: CSV files up to 2 MB or 2,500 rows. Excel support is coming soon; until then, save your workbook as CSV first (in Excel: File > Save As > CSV). Exports are deleted from our server after 24 hours. Credit packs for larger files are coming; join the waitlist on the homepage.
How the rules improve
When the tool meets a value it cannot clean, it records only the anonymized shape of that value (letters become “a”, digits become “9”, so 555-0123 is stored as 999-9999). Real data is never stored for this purpose. We review these shapes weekly and add new rules by hand. Questions about a specific format? Ask us.
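The shape described above is easy to reproduce, which may help when you want to tell us about a format without sharing the data itself. A minimal Python sketch, with the hypothetical name `shape` and the assumption that punctuation and spaces are kept as they are:

```python
def shape(value):
    """Replace every letter with "a" and every digit with "9"; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)  # assumption: separators and spaces pass through
    return "".join(out)
```

Running `shape("555-0123")` gives `"999-9999"`, matching the example above.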