Regex - Tumblr Posts
I'm sorry for this in advance, but I've seen one too many of these bullshit memes on CS meme pages and I can't take it anymore.
This is a crap regex and here's why:
Who the hell is visiting a website via SMTP? That doesn't even make sense. It's an email protocol, for god's sake. FTP is a stretch, but possible I guess (Firefox dropped support a year ago, but maybe this screenshot is old or from a lame non-Firefox user). You've also forgotten other much saner URI schemes like file:// (local files) and ws:// (websockets), among others.
Starting with an optional www, interesting start. The thing is, although we see www used as a subdomain a lot, on a technical level, it's just an arbitrary subdomain. There's nothing special about it. And you forgot to escape the dot so now you've got a wildcard which allows illegal domain name characters in.
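To make the unescaped-dot problem concrete, here's a minimal Python sketch. The two patterns and the example.com hostnames are made up for illustration; they are not the regex from the screenshot, just the same mistake in miniature.

```python
import re

# Hypothetical patterns for illustration only (not the regex from the screenshot).
unescaped = re.compile(r"^(www.)?example\.com$")   # "." after www matches ANY character
escaped = re.compile(r"^(www\.)?example\.com$")    # "\." matches only a literal dot


def matches(pattern, host):
    """Return True if the whole host string matches the pattern."""
    return pattern.match(host) is not None


print(matches(unescaped, "www.example.com"))  # True
print(matches(unescaped, "www!example.com"))  # True: the wildcard let "!" through
print(matches(escaped, "www!example.com"))    # False
```

One backslash is the entire difference between "optional www subdomain" and "optional www followed by literally anything".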
Ok, so we're ignoring every non-www subdomain, hmmmm can't think of an obvious example of those you're missing (*cough cough* tumblr blogs *cough cough*). Anyways, good on you for realizing you can have numbers in domain names, but you've missed other characters allowed in domain names, most notably the hyphen, best known for how it saved experts-exchange from being known as ExpertSexChange (nothing wrong with Expert Sex Changes but that's not what Experts Exchange was/is for).
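Here's a rough sketch of what "allow hyphens and arbitrary subdomains" could look like, assuming the usual letters-digits-hyphen rule for hostname labels (1 to 63 characters, no hyphen at either end). The names ALNUM_ONLY, LDH_LABEL and is_hostname are invented for this example, and it only handles plain ASCII hostnames, so treat it as an illustration rather than a validator.

```python
import re

# Roughly what an "alphanumeric only" label check looks like (hypothetical).
ALNUM_ONLY = re.compile(r"[A-Za-z0-9]+")

# Letters, digits, hyphens; 1-63 chars; no leading or trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_hostname(host, label_pattern):
    """Check every dot-separated label of host against label_pattern."""
    return all(label_pattern.fullmatch(label) for label in host.split("."))


print(is_hostname("experts-exchange.com", ALNUM_ONLY))  # False: hyphen rejected
print(is_hostname("experts-exchange.com", LDH_LABEL))   # True
print(is_hostname("example.tumblr.com", LDH_LABEL))     # True: any subdomain depth
print(is_hostname("-bad.com", LDH_LABEL))               # False: leading hyphen
```

Splitting on dots and checking each label separately also means www gets no special treatment, which is the point.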
Using "any string of letters" for TLDs is certainly a stretch, but I'll let it slide. Aside from an exhaustive list or some approximation of Mozilla's Public Suffix List, I'm not sure if there's a much better way to do it.
Time to take a look at the path. In a real URI, the path is essentially an arbitrary collection of characters, some of which are URI encoded I guess, but trying to enforce a ton of structure there is just going to go wrong. Which is exactly what this regex does :( Notably, we've disallowed query strings, allowed anchors in the middle of the path, and restricted the character set to be alphanumeric, all of which are going to cause problems.
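For comparison, here's a small sketch of what letting a parser deal with the path, query string, and anchor looks like, using Python's standard urllib.parse. The URL is made up for the example.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Made-up URL for illustration.
url = "https://blog.example.com/posts/regex%20rant?tag=regex&page=2#notes"
parts = urlsplit(url)

print(parts.scheme)           # https
print(parts.hostname)         # blog.example.com
print(parts.path)             # /posts/regex%20rant
print(unquote(parts.path))    # /posts/regex rant
print(parse_qs(parts.query))  # {'tag': ['regex'], 'page': ['2']}
print(parts.fragment)         # notes
```

The query string and the anchor come out as their own fields, and the path is left alone as the mostly-arbitrary string it is. Note that urlsplit only splits the string into pieces; it doesn't promise that the result is a sane URL, so any further checks are still on you.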
Also, general regex notes: you'll want to take advantage of the built-in character classes like \w and \d for word characters and digits, and in the last match group, if you move your \/? out of the group, you'll improve the performance on large URIs. Also, in the future, just don't use a regex for URI matching. Trust me, it gets super fucked super fast. URIs are actually really complex beasts once you get into the weeds there (source: I've been in those weeds).
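One caveat on the shorthand classes, sketched in Python (other regex flavors differ in the details): \w is not quite "alphanumeric". It also matches the underscore, and for ordinary strings it matches non-ASCII letters too unless you pass re.ASCII. The sample strings below are made up.

```python
import re

explicit = re.compile(r"[A-Za-z0-9_]+")    # spelled out by hand
shorthand = re.compile(r"\w+", re.ASCII)   # same set, thanks to re.ASCII
unicode_w = re.compile(r"\w+")             # default str mode: Unicode-aware

for text in ["snake_case_42", "caf\u00e9", "experts-exchange"]:
    print(
        repr(text),
        bool(explicit.fullmatch(text)),
        bool(shorthand.fullmatch(text)),
        bool(unicode_w.fullmatch(text)),
    )
# 'snake_case_42' True True True
# 'café' False False True
# 'experts-exchange' False False False

print(bool(re.fullmatch(r"\d+", "2024")))  # True
```

So \w and \d clean up the pattern, but \w still won't match a hyphen, and it will happily match an underscore.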
a regex god
Writing regex gives me such a rush, every time. I feel like an evil wizard conjuring a dastardly curse.