Zend Framework: Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl) Component Proposal
| Proposed Component Name | Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl) |
|---|---|
| Developer Notes | http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl) |
| Proposers | Martin Hujer; Alexander Veremyev (Zend Liaison) |
| Revision | 1.0 - 24th January 2008: Initial proposal; 1.1 - 28th January 2008: Renamed to Zend_Filter_Sanitize (wiki revision: 16) |
1. Overview
Zend_Filter_Sanitize is a component that converts a string into an SEO-friendly URL.
2. References
3. Component Requirements, Constraints, and Acceptance Criteria
- This filter must convert any string into an SEO-friendly URL without requiring any configuration.
- This component will correctly convert a string into an SEO-friendly URL.
- This component will convert all characters to lowercase.
- This component will transliterate special characters such as 'č' or 'ů' into 'c' and 'u' (handled by Zend_Filter_Transliteration).
- This component will allow the word delimiter characters to be changed (defaults are ' ', '.', '\', '/', '-', '_').
- This component will allow the delimiter replacement character to be changed (default is a dash).
- This component will strip characters that are not allowed in a URL.
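The requirements above read as a small pipeline: transliterate, lowercase, replace delimiters, strip whatever is left. The following is a minimal sketch of that pipeline, written in Python rather than the proposal's PHP and meant only to illustrate the intended behaviour. The function name `sanitize`, its parameters, the illustrative delimiter tuple, and the use of Unicode decomposition in place of Zend_Filter_Transliteration are all assumptions, not the component's actual API.

```python
import re
import unicodedata

# Illustrative single-character delimiters (an assumption; callers can override).
DEFAULT_DELIMITERS = (" ", ".", "/", "-", "_")


def sanitize(text, delimiters=DEFAULT_DELIMITERS, replacement="-"):
    """Hypothetical sketch: turn an arbitrary string into a URL-friendly slug."""
    # 1. Transliterate: decompose accented letters and drop the combining marks.
    #    Letters without a decomposition (e.g. "ß") are simply stripped in step 4.
    decomposed = unicodedata.normalize("NFKD", text)
    result = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # 2. Lowercase and trim surrounding whitespace.
    result = result.lower().strip()

    # 3. Replace every run of delimiter characters with the replacement.
    delimiter_class = "[" + re.escape("".join(delimiters)) + "]+"
    result = re.sub(delimiter_class, replacement, result)

    # 4. Strip everything that is not a-z, 0-9 or the replacement character.
    result = re.sub("[^a-z0-9" + re.escape(replacement) + "]", "", result)

    # 5. Collapse repeated replacements and trim them from both ends.
    if replacement:
        result = re.sub("(?:" + re.escape(replacement) + ")+", replacement, result)
        result = result.strip(replacement)
    return result
```

Under these assumptions, `sanitize("Příliš žluťoučký kůň")` yields `prilis-zlutoucky-kun`, and passing `replacement="_"` switches the separator to an underscore.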
4. Dependencies on Other Framework Components
- Zend_Filter_Interface
- Zend_Filter_Exception
- Zend_Filter_StringToLower
- Zend_Filter_StringTrim
- Zend_Filter_Transliteration
5. Theory of Operation
...
6. Milestones / Tasks
- Milestone 1: [DONE] Finalize this proposal
- Milestone 2: [DONE] Working prototype checked into http://zfdev.googlecode.com/svn/trunk/ZendFilterSanitize/
- Milestone 3: [DONE] Unit tests exist, work, and are checked into SVN.
- Milestone 4: Community and Zend review
7. Class Index
- Zend_Filter_Sanitize
8. Use Cases
9. Class Skeletons
37 Comments
Jan 25, 2008
Tomas Markauskas
Will this work with non-Latin characters, like Cyrillic etc?
I like this, but I would suggest creating a more general transliteration filter. Then anyone could just replace all spaces with dashes if wanted...
Jan 25, 2008
Martin Hujer
Hi,
it works with these characters: I'm from the Czech Republic and I need to convert Czech characters such as ě š č ř ž ý á í é into their ASCII equivalents. That was one of the reasons I decided to write this class.
I'm not sure I understand your second point. Do you mean creating a class that just replaces spaces with dashes, or just adding some options to this class?
Jan 25, 2008
Tomas Markauskas
No, I meant that this could be not a SeoURL filter but just a transliteration filter that outputs any text as transliterated ASCII text.
After that you could use the output for URLs; you would only need to lowercase the text and replace all the spaces with dashes...
Jan 27, 2008
Martin Hujer
Yes, good idea, but when I want to transliterate 'ěščřžýáíéůú' (typical Czech symbols) it works this way:
Jan 30, 2008
Martin Hujer
I've solved it.
You can check out Zend_Filter_Transliteration from SVN.
Jan 27, 2008
Daniel Freudenberger
I'd suggest renaming this class to Zend_Filter_Sanitize. The same name is used in other languages, and it seems wrong to bind a class name to only one task (SEO optimization) when it could be used for several other tasks as well.
Jan 27, 2008
Joó Ádám
I agree with Daniel: this class can be useful for several other tasks as well, so renaming it to Zend_Filter_Sanitize makes sense. Of course, it would be useful only if it can handle every Latin-derived (or close-to-it) alphabet (European alphabets with diacriticized Latin letters, Cyrillic and so on).
Jan 27, 2008
Martin Hujer
It works correctly with Czech, so I suppose it will work well with other European alphabets. If you have some problematic characters in your language, I'll be happy to add them to the unit tests.
Jan 28, 2008
Martin Hujer
Renamed to Zend_Filter_Sanitize
Jan 28, 2008
Vincent
Perhaps it'd also be appropriate to update the proposal to reflect the new name (e.g. "This component will correctly converts [sic] string into SEO url."). On the other hand, if you were going to rename it SanitizeUrl (which I also think is more appropriate) you can keep on renaming...
Jan 28, 2008
Renan Gonçalves
I think you will have problems with the encoding of the characters (UTF-8 and ISO-8859-1, for example).
How will you handle that?
I always use Sanitize in my projects. I think I can help with my experience.
Jan 28, 2008
Martin Hujer
It converts UTF-8 strings correctly. Input in any other encoding needs to be converted to UTF-8 first.
I'd really appreciate your tips.
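A minimal sketch of that pre-conversion step, in Python rather than PHP and only as an illustration: the caller declares the source encoding and the bytes are re-encoded as UTF-8 before the filter runs. The helper name `to_utf8` and the ISO-8859-1 default are assumptions chosen to match the example encodings mentioned in the question.

```python
def to_utf8(raw, source_encoding="iso-8859-1"):
    """Hypothetical helper: re-encode bytes from a known encoding as UTF-8."""
    # Decode with the declared source encoding, then encode the text as UTF-8.
    return raw.decode(source_encoding).encode("utf-8")
```

The source encoding has to be known up front; guessing it from the bytes alone is unreliable, which is why this sketch takes it as a parameter.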
Jan 28, 2008
Cristian Bichis
Nice idea about this class, Martin...
Btw, I haven't seen you on #zftalk at all in the last few months...
Jan 28, 2008
Ralph Schindler
Hey Martin,
I might propose SanitizeUrl over simply 'Sanitize'.
The current naming doesn't give any context to what you are attempting to filter for and against.
Another thought might be to simply create a "Transliteration" filter first, as that might have a much broader audience than that of URLs.
My 2 cents
-ralph
Jan 28, 2008
Daniel Freudenberger
Hey Ralph,
I think this filter could also be used to filter directory names on the filesystem (for example). I don't think it's a good idea to rename it to *anything*Url.
Jan 28, 2008
Simone Carletti
I agree with Ralph, simply because this class has some replacements that have been specifically designed for URLs, especially SEO URLs.
In particular, replacing spaces with dashes (instead of underscores) is a common practice when you design search-engine-friendly routes.
I have a question for you, Martin.
How will the filter handle special characters such as slashes or dots?
Additionally, if you really want to make a URL search engine friendly, you should ensure that URLs with and without a trailing slash are normalized to a single version (usually the one without, for this type of framework-powered URL).
Do you think this filter should normalize trailing slash too?
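One possible shape for that normalization, sketched in Python for illustration only: pick one canonical form (here, no trailing slash) and map every variant onto it. The helper `normalize_trailing_slash` is hypothetical and is not part of the proposed filter.

```python
def normalize_trailing_slash(path):
    """Hypothetical helper: map '/blog/post/' and '/blog/post' to one form."""
    stripped = path.rstrip("/")
    # Keep the root path itself as "/" instead of turning it into "".
    return stripped if stripped else "/"
```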
Feb 09, 2008
Joó Ádám
I'd prefer to leave dots untouched; that would be handy when normalizing filenames such as MoZiLlA FiREfOx 1.0.0.12.EXE (mozilla-firefox-1.0.0.12.exe).
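A rough sketch of how such a switch could behave, in Python for illustration only. The function `sanitize_filename` and its `keep_dots` flag are hypothetical names, not options of the proposed filter, and transliteration is left out to keep the example short.

```python
import re


def sanitize_filename(name, keep_dots=True, replacement="-"):
    """Hypothetical sketch: slugify a filename, optionally preserving dots."""
    # Dots count as word delimiters only when they are not being preserved.
    delimiters = " /\\_-" if keep_dots else " ./\\_-"
    text = name.lower().strip()
    text = re.sub("[" + re.escape(delimiters) + "]+", replacement, text)

    # Allow a-z, 0-9, the replacement character and (optionally) the dot.
    allowed = "a-z0-9" + re.escape(replacement) + (r"\." if keep_dots else "")
    text = re.sub("[^" + allowed + "]", "", text)
    return text.strip(replacement)
```

With dots preserved, `sanitize_filename("Release Notes 1.0.0.12.EXE")` gives `release-notes-1.0.0.12.exe`; with `keep_dots=False` the same input becomes `release-notes-1-0-0-12-exe`.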
Feb 10, 2008
Martin Hujer
Maybe an option to handle this would do it.
Feb 18, 2008
Martin Hujer
Added
See use cases and update from SVN.
Martin H.
Jan 28, 2008
Ralph Schindler
That said, doesn't that also make a good case for a dedicated Zend_Filter_Transliteration filter?
After just a cursory review, I think having a transliteration filter would be a "Good Thing" (r).
-ralph
Jan 28, 2008
Martin Hujer
Hello,
I have thought about it (discussion on IRC helped me) and I will create a proposal for Zend_Filter_Transliteration (or just Translitere). It will handle just transliteration and maybe removal of apostrophes.
Simone: slashes and dots are currently stripped away, but they should be converted to a dash (or to the configured replacement character). I'll do it tomorrow.
New name of this component could be one of these:
Jan 28, 2008
Tomas Markauskas
I would vote for Zend_Filter_Sanitize. It could then be used to filter strings for URLs, maybe for filenames (i.e. so that uploaded files do not contain invalid or unwanted characters) and probably for lots of other things.
Jan 29, 2008
Simone Carletti
The idea behind Zend_Filter_StringToSlug is nice.
Being inspired by Tomas's feedback, what about StringToPath?
Jan 29, 2008
Lars Strojny
Your proposal is basically an aggregation of three filters:
I suggest implementing them as-is and then aggregating those three filters into a single one, which could be called NiceUrl or something (btw: URLs are overrated in SEO, domains are not, but this doesn't matter here).
Jan 30, 2008
Martin Hujer
I have split off part of this component into [Zend_Filter_Transliteration http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Transliteration+-+Martin+Hujer]
Jan 30, 2008
Simone Carletti
Do you think it is necessary? :|
I don't know if it can help you, but no more than 3 days ago I did the same for a Rails project.
Here's my piece of code.
I used iconv as well, as you posted at http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Sanitize+-+Martin+Hujer?focusedCommentId=42104#comment-42104
Feel free to use it for inspiration if you need it.
Jan 30, 2008
Martin Hujer
Thanks, Rails looks like Python
I wanted to write a more general transliteration filter, and when I use this iconv command, it converts some diacritic marks into ' or " or ^
The Zend_Filter_Transliteration filter strips these characters out.
And the Sanitize filter has some improvements to make it more general (e.g. creating file system paths).
Martin.
Feb 08, 2008
Cristian Bichis
I started using sanitize instead of my own helpers.
Works great so far; I will test it more thoroughly over the next few days.
May 28, 2008
Ben Scholzen
A correction from my side to Zend_Filter_Transliteration, which I just found out:
The transliteration table should look more like this for German (according to German grammar):
Jun 07, 2008
Martin Hujer
Code updated to reflect this.
Jun 09, 2008
Ben Scholzen
Have to add another thing to that topic:
When the word is written entirely uppercase (ÄPFEL), it should result in "AEPFEL", while "drüben" still results in "drueben".
I guess you could check whether the next character in the word is uppercase. If there is no next character, check the previous character. And don't worry about words with just two letters; afaik there aren't any with umlauts.
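The neighbour check described above could look roughly like this, sketched in Python for illustration. The mapping table is an assumption based on the conventional German digraphs (ä to ae, ö to oe, ü to ue); the table originally posted in this thread is not reproduced here, and `transliterate_german` is a hypothetical name.

```python
# Assumed mapping: conventional German umlaut digraphs, title-cased for capitals.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def transliterate_german(text):
    """Hypothetical sketch: expand umlauts, using 'AE' inside all-caps words."""
    out = []
    for i, ch in enumerate(text):
        if ch not in UMLAUTS:
            out.append(ch)
            continue
        replacement = UMLAUTS[ch]
        if ch.isupper():
            # Look at the next letter in the word; if the umlaut ends the
            # word, fall back to the previous character instead.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            neighbour = nxt if nxt.isalpha() else prev
            if neighbour.isupper():
                replacement = replacement.upper()
        out.append(replacement)
    return "".join(out)
```

Under these assumptions "ÄPFEL" becomes "AEPFEL", "Äpfel" becomes "Aepfel", and "drüben" becomes "drueben".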
May 29, 2008
Joó Ádám
It would be nice to have an option to supply a dictionary in which you can specify alternatives for signs in common use: there's nothing more annoying than URLs like /blog/dr-jekyll-mr-hide generated from the title Dr Jekyll & Mr Hide. In this case the ampersand could be replaced with the text 'and', 'und', 'et', 'y', 'és' and so on...
This dictionary could be specified in a PHP array, INI or XML config file.
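A rough sketch of such a dictionary, in Python for illustration only. The replacement words come from the suggestion above; the language codes, the layout of `SYMBOL_WORDS`, and the helper `expand_symbols` are assumptions, and a real implementation would load the table from configuration rather than hardcode it.

```python
# Assumed layout: language code -> {symbol: replacement word}.
SYMBOL_WORDS = {
    "en": {"&": "and"},
    "de": {"&": "und"},
    "fr": {"&": "et"},
    "es": {"&": "y"},
    "hu": {"&": "és"},
}


def expand_symbols(title, language="en"):
    """Hypothetical helper: spell out symbols before the title is slugified."""
    for symbol, word in SYMBOL_WORDS.get(language, {}).items():
        # Pad with spaces so the word never fuses with its neighbours.
        title = title.replace(symbol, " " + word + " ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(title.split())
```

Running this before the delimiter replacement would turn "Dr Jekyll & Mr Hide" into "Dr Jekyll and Mr Hide", so the resulting slug keeps the conjunction.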
May 29, 2008
Ben Scholzen
Yeah, I was also going to mention that. I suggest that you could supply a Zend_Translate instance.
But I also see a problem with that. It may be problematic when you have a title such as "That silly &nbsp;", which would then be converted to "that-silly-andnbsp".
Sure, in that case you could say that there must be a space before and after the &, but how about "The & character in HTML & other things", which would be converted to "the-and-character-in-html-and-other-things"? You would want only the last & to be converted, not the first.
As you can see, there is no logical way to determine whether to convert a special character or not.
May 30, 2008
Joó Ádám
Yes, Zend_Translate support would be nice.
However, I don't really see the problem here. The first case is more or less unambiguous: I personally pronounce it that way, 'and-n-b-s-p', and I suppose I'm not the only one. Also, we could optionally use regular expressions to say that character references should be converted to the form 'ampersand-nbsp' or 'nbsp', or, if you want, just hardcode it and use 'non-breaking-space'.
In the second case, I suppose you will admit that it does not seem too realistic, and 'the-and-character-in-html-and-other-things' is a totally acceptable conversion.
May 30, 2008
Ben Scholzen
Ok, agreed. Just wanted to make sure that everything is clear
Jul 01, 2008
Martin Hujer
This filter is ready for Zend review (together with Zend_Filter_Transliteration).
I'm not sure whether I should decouple this from Zend_Filter_Transliteration.
These two filters were originally a single 15-line filter that created an SEO-friendly URL from a string.
Aug 05, 2008
Alexander Veremyev
The proposal is archived because it has a hard dependency on the Zend_Filter_Transliteration component proposal, which is already archived.