src/0000755000000000000000000000000013105122453010337 5ustar rootrootsrc/test/0000755000000000000000000000000013105122453011316 5ustar rootrootsrc/test/resources/0000755000000000000000000000000013105122453013330 5ustar rootrootsrc/test/resources/test9.html0000644000000000000000000000110411501643045015265 0ustar rootroot ZOKA ROKA MIRA PERA

Paragraf 1 Mojamala nema mane...

McKenzy

freestylo

something after

bla bla bla

src/test/resources/test8.html0000644000000000000000000000061412544757566015317 0ustar rootrootThis is in different FONT Some text here....

He said,

"Hi there!" My link
This is BOLD text?!
src/test/resources/test7.html0000644000000000000000000000015611474551373015304 0ustar rootrootTEST src/test/resources/test6.html0000644000000000000000000000136311502705145015272 0ustar rootrootat first text bibi
content 1111

brbrbrbr KAKAAK

var x = 10;

zeka Type your name:

this "works" wrong

fKa
one two three PRITISNI
['][B][ ][•]
A B
[α][é][‾]
src/test/resources/test5HTML5.html0000644000000000000000000041320712523434626016056 0ustar rootrootYahoo!
Yahoo! Search - Hot artists to listen to: Mariah Carey, Janet Jackson.

toggle search suggestions
Yahoo! Search

News Navigation

In the News

Open a no-fee IRA at Scottrade
Popular puzzle games
Jewel Quest II. Bejeweled 2. Burger Rush. Get the most popular puzzle games at Yahoo! Games. Download now.


Home theaters can be confusing - Learn the ins and outs of your new high-definition equipment on Yahoo! Tech.
Go to Yahoo! Shopping and save on the hot new styles.
Monopoly Here and Now is here now. The classic game with a modern makeover. Play now at Yahoo! Games. src/test/resources/test5.html0000644000000000000000000040742112113037735015300 0ustar rootroot Yahoo!
Yahoo! Search - Hot artists to listen to: Mariah Carey, Janet Jackson.

toggle search suggestions

News Navigation

In the News

Open a no-fee IRA at Scottrade
 Popular puzzle games
Jewel Quest II. Bejeweled 2. Burger Rush. Get the most popular puzzle games at Yahoo! Games. Download now.

Home theaters can be confusing - Learn the ins and outs of your new high-definition equipment on Yahoo! Tech.
Go to Yahoo! Shopping and save on the hot new styles.
Monopoly Here and Now is here now. The classic game with a modern makeover. Play now at Yahoo! Games.
src/test/resources/test4.html0000644000000000000000000000132512113037735015270 0ustar rootroot
&"'<>
ô‰×Ÿ€
І‹
content of unknown tag content of deprecated tag
LINK 1
aaa
1.&"'<>
2.&"'<>
src/test/resources/test33_expected.html0000644000000000000000000000051713046661522017240 0ustar rootroot src/test/resources/test33.html0000644000000000000000000000040513046661522015353 0ustar rootrootsrc/test/resources/test32.html0000644000000000000000000000006112513700446015345 0ustar rootroot

Text

src/test/resources/test31.html0000644000000000000000000000057612502765267015371 0ustar rootroot
]]>
src/test/resources/test30.html0000644000000000000000000017473112457157675015404 0ustar rootroot Cool Quiz! Trivia, Quizzes, Puzzles, Jokes, Useless Knowledge, FUN!

 Search Cool Quiz!
 
 Advanced Search »

Trivia Quizzes Puzzles Humor Fun Pages Connect Make a Quiz!Message BoardsSend This to a Friend!View Your Profile

Did You Know? Back
Back
Tell a Friend!
Tell a Friend

"Merry Christmas" Around The World "Merry Christmas" Around The World

When is "Merry Christmas" not "Merry Christmas"?

When you say it in another language.

Read on to see how people around the world wish each other a happy holiday.

 

Country Native Greeting(s)
Afghanistan De Christmas akhtar de bakhtawar au newai kal de mubarak sha
Albania Gëzuar Krishlindjet
Algeria Mboni Chrismen
American Samoa La Maunia Le Kilisimasi
Andorra Bon Nadal
Angola Boas Festas
Antarctica Merry Christmas, Felices Pasquas, Hristos Razdajetsja
Antigua and Barbuda Merry Christmas
Argentina Feliz Navidad!
Armenia Shnorhavor Sourp Dzunount
Aruba Bon Pasco, Bon Anja
Australia Happy Christmas
Austria Frohe Weihnachten
Azerbaijan Tezze Iliniz Yahsi Olsun
Bahamas Happy Christmas
Bahrain Mboni Chrismen
Bangladesh Shuvo Baro Din
Barbados Merry Christmas
Belarus Winshuyu sa Svyatkami
Belgium Zalig Kerstfeest
Belize Merry Christmas
Benin Joyeux Noel
Bermuda Merry Christmas
Bhutan krist Yesu Ko Shuva Janma Utsav Ko Upalaxhma Hardik Shuva
Bolivia Feliz Navidad
Bosnia and Herzegowina Sretam Bozic, Hristos se rodi
Botswana Merry Christmas
Brazil Feliz Natal
British Indian Ocean Territory Happy Christmas
Brunei Darussalam Selamat Hari Natal
Bulgaria Vessela Koleda
Burkina Faso Joyeux Noel
Burundi Noeli Nziza, Joyeux Noel,
Cameroon Merry Christmas, Joyeux Noel
Canada Merry Christmas, Joyeux Noel, Merry Christmas, Selamat Hari Natal
Cape Verde Boas Festas
Cayman Islands Merry Christmas
Central African Republic Joyeux Noel
Chad Joyeux Noel, Mboni Chrismen
Chile Feliz Navidad
China Sheng Tan Kuai Loh
Christmas Island Merry Christmas
Colombia Feliz Navidad para todos
Comoros Joyeux Noel, Mboni Chrismen
Congo Joyeux Noel
Cook Islands Merry Christmas, Kia orana e kia manuia rava i teia Kiritime
Costa Rica Feliz Navidad
Cote D'ivoire Joyeux Noel
Croatia Sretan Bozic
Cuba Feliz Navidad
Cyprus Eftihismena Christougenna, Noeliniz kutlu olsun ve yeni yili
Czech Republic Vesele Vanoce
Democratic People's Republic of Korea Sung Tan Chuk Ha
Denmark Glaedelig Jul
Djibouti Joyeux Noel, Mboni Chrismen
Dominica Merry Christmas
Dominican Republic Feliz Navidad
Ecuador Feliz Navidad
Egypt Mboni Chrismen
El Salvador Feliz Navidad
Equatorial Guinea Joyeux Noel, Feliz Navidad
Eritrea Melkam Yelidet Beaal, Poket Kristmet
Estonia Haid Joule, Rõõmsaid Jõule
Ethiopia Melkam Yelidet Beaal, Poket Kristmet, Merry Christmas
Falkland Islands (Malvinas) Merry Christmas
Faroe Islands Gledhilig jol
Federated States of Micronesia Merry Christmas
Fiji Merry Christmas
Finland Hauskaa Joulua
France Joyeux Noel
French Guiana Joyeux Noel
French Polynesia Joyeux Noel, La ora i te Noera
French Southern Territories Joyeux Noel
Gabon Joyeux Noel
Gambia Merry Christmas
Georgia Gilotsavt Krist'es Shobas
Germany Frohliche Weihnachten
Ghana Afishapa
Gibraltar Merry Christmas, Feliz Navidad
Greece Eftihismena Christougenna
Greenland Glædelig Jul, Juullimi Ukiortaassamilu Pilluarit
Grenada Merry Christmas
Guadeloupe Joyeux Noel
Guam Merry Christmas, Felis Pasgua
Guatemala Feliz Navidad
Guinea Joyeux Noel
Guinea-bissau Boas Festas
Guyana Merry Christmas
Haiti Jwaye Nwel
Honduras Feliz Navidad
Hong Kong Sing dan fiy loc, Merry Christmas
Hungary Boldog Karácsonyt
Iceland Gleðileg Jól
India Shub Naya Baras
Indonesia Salamet Hari Natal
Iraq Idah Saidan Wasanah Jadidah
Ireland Nollaig Shona dhuit
Israel Mo'adim Lesimkha
Italy Buon Natale
Jamaica Merry Christmas
Japan Merii Kurisumasu
Jordan Mboni Chrismen, Merry Christmas
Kazakhstan Hristos Razdajetsja, Rozdjestvom Hristovim
Kenya Merry Christmas
Kiribati Merry Christmas
Kuwait Mboni Chrismen, Merry Christmas
Kyrgyzstan Hristos Razdajetsja
Latvia Priecigus ziemassvetkus!
Lebanon Milad Majeed
Lesotho Happy Christmas
Liberia Happy Christmas
Libyan Arab Jamahiriya Mboni Chrismen, Buon Natale, Happy Christmas
Liechtenstein Frohliche Weihnachten
Lithuania Laimingu Kaledu
Luxembourg Schéi Krëschtdeeg
Macau Boas Festas, Sing dan fiy loc
Madagascar Joyeux Noel, Arahaba tratry ny Krismasy
Malawi Merry Christmas, Moni Wa Chikondwelero Cha X'mas
Malaysia Selamat Hari Krimas
Mali Joyeux Noel
Malta Il-Milied it-Tajjeb
Malta Il-Festi t-Tajba
Marshall islands Monono ilo raaneoan Nejin
Martinique Joyeux Noel
Mauritius Merry Christmas
Mayotte Krismas Njema Na Heri Za Mwaka Mpya, Joyeux Noel
Mexico Feliz Navidad
Monaco Joyeux Noel
Montserrat Merry Christmas
Morocco Mboni Chrismen
Mozambique Boas Festas
Namibia Geseende Kersfees
Nepal krist Yesu Ko Shuva Janma Utsav Ko Upalaxhma Hardik Shuva
Netherlands Prettige Kerstdagen
Netherlands Antilles Bon Pasco, Bon Anja
New Caledonia Joyeux Noel
New zealand Happy Christmas
Nicaragua Feliz Navidad
Niger Joyeux Noel
Nigeria Merry Christmas
Norfolk Island Merry Christmas
Northern Mariana Islands Filis Pasgua, Merry Christmas
Norway Gledelig Jul
Oman Mboni Chrismen
Pakistan Bara Din Mubarrak Ho
Palau Merry Christmas
Panama Feliz Navidad
Papua New Guinea Bikpela hamamas blong dispela Krismas
Paraguay Feliz Navidad
Peru Feliz Navidad
Philippines Maligayang Pasko
Pitcairn Merry Christmas
Poland Boze Narodzenie
Portugal Boas Festas
Puerto Rico Feliz Navidad, Felices Pascuas, Felicidades
Qatar Mboni Chrismen
Republic of Korea Sungtan Chukha
Republic of Moldova Craciun fericit si un An Nou fericit!
Reunion Joyeux Noel
Romania Sarbatori vesele
Russian Federation Hristos Razdajetsja, Rozdjestvom Hristovim
Rwanda Noheli Nziza
Saint Kitts and Nevis Happy Christmas
Saint Lucia Happy Christmas
Saint Vincent and The Grenadines Happy Christmas
Samoa Manuia Le Kirisimasi
San Marino Buon Natale
Sao Tome and Principe Boas Festas
Saudi Arabia Mboni Chrismen
Senegal Joyeux Noel
Seychelles Happy Christmas, Joyeux Noel
Sierra Leone Happy Christmas
Singapore Sheng Tan Kuai Loh, Nathar Puthu Varuda Valthukkal, Happy Ch
Slovakia (Slovak Republic) Vesele Vianoce
Slovenia Srecen Bozic
South Africa Geseënde Kersfees, Happy Christmas
South Georgia and The South Sandwich Islands Happy Christmas
Spain Feliz Navidad
Sri Lanka Subha nath thalak Vewa, Nathar Puthu Varuda Valthukkal
St. Helena Happy Christmas
St. Pierre and Miquelon Joyeux Noel
Sudan Wilujeng Natal
Suriname Zalig Kersfeest, Wang swietie Kresnetie
Svalbard and Jan Mayen Islands Hristos Razdajetsja, Gledelig Jul
Swaziland Happy Christmas
Sweden God Jul
Switzerland Fröhlichi Wiehnacht, Joyeux Noel
Syrian Arab Republic Mboni Chrismen
Taiwan Kung His Hsin Nien bing Chu Shen Tan
Thailand Ewadee Pe-e Mai
The Democratic Republic of The Congo Joyeux Noel
The Former Yugoslav Republic of Macedonia Streken Bozhik
Togo Joyeux Noel
Tokelau Merry Christmas
Tonga Kilisimasi Fiefia
Trinidad and Tobago Happy Christmas
Tunisia Mboni Chrismen
Turkey Mutlu Noeller
Turks and Caicos Islands Happy Christmas
Uganda Webale Krismasi
Ukraine Veseloho Vam Rizdva
United Arab Emirates I'd miilad said oua sana saida
United Kingdom Merry Christmas, Happy Christmas, Nadolig Llawen
United Republic of Tanzania Krismas Njema Na Heri Za Mwaka Mpya, Happy Christmas
United States Merry Christmas, Happy Holidays, Season's Greetings
Uruguay Feliz Navidad
Vanuatu Merry Christmas, Joyeux Noel
Venezuela Feliz Navidad
Viet Nam Chuc mung Giang Sinh
Virgin Islands (British) Merry Christmas
Virgin Islands (U.S.) Merry Christmas
Wallis and Futuna Islands Joyeux Noel
Yemen Mboni Chrismen
Yugoslavia Cestitamo Bozic
Zambia Happy Christmas
Zimbabwe Happy Christmas

More Trivia Fun!
» Trivia Facts about Christmas
» How did Christmas cards come to be?
» What exactly are the "Twelve Days of Christmas?"
» How did the idea for Santa Claus originate?
» Why do we get kissed if we stand under the mistletoe?
Join Cool Quiz and Win Prizes!JOIN COOL QUIZ!

Login (your email)


Password (forget?)

Featured Trivia
Phobias - What are you afraid of?
What is a BOOGER made of?
Smileys and E-mail Shorthand
What do you call a group of?
Unusual
U.S. Town Names
More...

Privacy Policy | Terms of Use | Media Kit | About Us | Make Us Your Homepage
src/test/resources/test3.html0000644000000000000000000016500011474551373015300 0ustar rootroot BMW of North America, LLC
topLeftPageGradient
topRightPageGradient
 
 
 
BMW Owners
My BMW
Create a My BMW account to save configurations, rate videos & more.
 
View All View all
src/test/resources/test29.html0000644000000000000000000000027512411226141015352 0ustar rootroot
Col1 Col2
Content Content
src/test/resources/test28_expected.html0000644000000000000000000000027712317631450017244 0ustar rootroot
Col1 Col2
Content Content
src/test/resources/test28.html0000644000000000000000000000026512317631450015360 0ustar rootroot
Col1 Col2
Content Content
src/test/resources/test27_expected.html0000644000000000000000000000031612317631450017235 0ustar rootroot
Col1 Col2
Content Content
src/test/resources/test27.html0000644000000000000000000000026512317631450015357 0ustar rootroot
Col1 Col2
Content Content
src/test/resources/test26_expected.html0000644000000000000000000000025112317631450017232 0ustar rootroot
Col1 Col2
Content Content
src/test/resources/test26.html0000644000000000000000000000027012317631450015352 0ustar rootroot
Col1 Col2
Content Content
src/test/resources/test25_expected.html0000644000000000000000000000211312312077674017237 0ustar rootroot Test case for #104

Test case for #104

Linking from SVG: see bug #104 src/test/resources/test25.html0000644000000000000000000000201412312077674015356 0ustar rootroot Test case for #104

Test case for #104

Linking from SVG: see bug #104 src/test/resources/test24_expected.html0000644000000000000000000000043012312075660017227 0ustar rootroot

Rubrique 1

Un projet low-cost au coeur des discussions PSA-General Motors ?

src/test/resources/test24.html0000644000000000000000000000036412312075660015354 0ustar rootroot

Rubrique 1

Un projet low-cost au coeur des discussions PSA-General Motors ?

src/test/resources/test23.html0000644000000000000000000004050612271674571015367 0ustar rootroot OSS Watch - independent expert advice on open source software

Open Source Options for Education

open source options logo

Updated 03/01/2014 Looking for open source alternatives to closed source software? We maintain a list of options for open source software for use in the education sector for everything from libraries and administration systems to subject-specific teaching tools. Wherever possible we also link to real-life examples and case studies of use by educators and institutions.


Briefing Notes

We publish Briefing Notes covering a wide range of issues relating to open source software use and development.

Latest briefing notes:

More Briefing Notes

You can search our resources using the search box above or you can browse specific resource aimed at your particular role or by subject using the links at the bottom of the page.

If you don't find what you are looking for please mail us and we will do our best to help.

Recent Posts

More on the OSS Watch blog

Sign Up For Our Newsletter

Each month OSS Watch publishes a newsletter, and you can subscribe to receive it by email.

Sign up now

Events

Come and meet us at the events below

src/test/resources/test22.html0000644000000000000000000000025012254020046015335 0ustar rootroot testcase src/test/resources/test21_expected.html0000644000000000000000000000077712250332403017232 0ustar rootroot Test SVG cleaning

before

Circle

after

src/test/resources/test21.html0000644000000000000000000000077512250332403015347 0ustar rootroot Test SVG cleaning

before

Circle

after

src/test/resources/test20_expected.html0000644000000000000000000000073312250332403017221 0ustar rootroot Test SVG cleaning

before

after

src/test/resources/test20.html0000644000000000000000000000073112250332403015336 0ustar rootroot Test SVG cleaning

before

after

src/test/resources/test2.html0000644000000000000000000000017411474551373015277 0ustar rootroot
AAA
mama
var x = 10; src/test/resources/test19_expected.html0000644000000000000000000000103412250332403017224 0ustar rootroot Test SVG cleaning

before

A circle

after

src/test/resources/test19.html0000644000000000000000000000105012250332403015341 0ustar rootroot Test SVG cleaning

before

A circle

after

src/test/resources/test18_expected.html0000644000000000000000000000146412312067122017234 0ustar rootroot Test SVG cleaning

before

after

src/test/resources/test18.html0000644000000000000000000000146212312067122015351 0ustar rootroot Test SVG cleaning

before

after

src/test/resources/test17_expected.html0000644000000000000000000000042312245366372017243 0ustar rootroot src/test/resources/test17.html0000644000000000000000000000037612245366372015371 0ustar rootroot src/test/resources/test16_expected.html0000755000000000000000000000015212235021564017232 0ustar rootroot
src/test/resources/test16.html0000755000000000000000000000014012235021564015346 0ustar rootroot
src/test/resources/test15_expected.html0000644000000000000000000000021012230164650017220 0ustar rootroot

A test of mixing XML and HTML

src/test/resources/test15.html0000644000000000000000000000017612230164650015352 0ustar rootroot

A test of mixing XML and HTML

src/test/resources/test14_expected.html0000644000000000000000000000020512230164650017223 0ustar rootroot A test of mixing XML and HTML src/test/resources/test14.html0000644000000000000000000000017312230164650015346 0ustar rootroot A test of mixing XML and HTML src/test/resources/test13_expected.html0000644000000000000000000001202512230164650017225 0ustar rootroot A Simple HTML5 RDFa Example

A Simple HTML5 RDFa Example

Crete 2010

figurines
Minoan Figurines, Crete photo by Irene.

Richard knows

src/test/resources/test13.html0000644000000000000000000001202312230164650015342 0ustar rootroot A Simple HTML5 RDFa Example

A Simple HTML5 RDFa Example

Crete 2010

figurines
Minoan Figurines, Crete photo by Irene.

Richard knows

src/test/resources/test12_expected.html0000644000000000000000000000042312663076123017232 0ustar rootroot src/test/resources/test12.html0000644000000000000000000000041512220563212015337 0ustar rootroot src/test/resources/test11_expected.html0000644000000000000000000000031412245660030017220 0ustar rootroot

<![CDATA[test]]>

src/test/resources/test11.html0000644000000000000000000000024012220532065015334 0ustar rootroot

src/test/resources/test10.html0000644000000000000000000000046312136224000015333 0ustar rootroot Strict DTD XHTML Example

Hello World

src/test/resources/test1.html0000644000000000000000000000054511474551373015300 0ustar rootrootat first text bibi
content 1111

brbrbrbr KAKAAK

var x = 10;

zeka Type your name: src/test/resources/test-chinese-issue-64.html0000644000000000000000000014224712124633725020211 0ustar rootroot

 

作者:李航

 

机器学习是关于计算机基于数据构建模型并运用模型来模拟人类智能活动的一门学科。随着计算机与网络的飞速发展,机器学习在我们的生活与工作中起着越来越大的作用,正在改变着我们的生活和工作。

 

1.日常生活中的机器学习

 

我们在日常生活经常使用数码相机。你也许不知道,数码相机上的人脸检测技术是基于机器学习技术的!我认识三位了不起的科学家与工程师,他们是Robert Schapire,Paul Viola,劳世竑。他们三位都与这有关。RobertYoav Freund一起发明了非常有效的机器学习算法AdaBoost。Paul将AdaBoost算法成功地应用到人脸检测。劳世竑和他领导的Omron团队将AdaBoost人脸检测算法做到了芯片上。据说现在世界上有百分之六七十的数码相机上的人脸检测都是用Omron的芯片。

 

在我们的工作与生活中,这种例子曾出不穷。互联网搜索、在线广告、机器翻译、手写识别、垃圾邮件过滤等等都是以机器学习为核心技术的。

 

不久以前,机器学习国际大会(International Conference on Machine Learning,ICML 2011)在美国华盛顿州的Bellevue市举行。约有7百多位科研人员、教授、学生参加,创造了历史最高纪录。大会的三个主题演讲分别介绍了机器学习在微软的Kinnect游戏机用户感应系统、谷歌的Goggles图片搜索系统、IBM的 Watson自动问答系统中的应用。这些事实让人预感到机器学习被更广泛应用的一个新时代的到来。

 

2.机器学习与人工智能

 

智能化是计算机发展的必然趋势。人类从事的各种智能性活动,如数学、美术、语言、音乐、运动、学习、游戏、设计、研究、教学等等,让计算机做起来,现在还都是很困难的。这是几十年来人工智能研究得到的结论。

 

人工智能研究中,人们曾尝试过三条路。我将它们称之为外观(extrospection)、内省(introspection)和模拟(simulation)。

 

所谓外观,指的是观察人的大脑工作情况,探求其原理,解明其机制,从而在计算机上“实现”人类大脑的功能。比如,计算神经学(computational neuroscience)的研究就是基于这个动机的。然而,人脑的复杂信息处理过程很难观测和模型化。就像我们仅仅观测某个计算机内的信号传输过程,很难判断它正在做什么样的计算一样。

 

内省就是反思自己的智能行为,将自己意识到的推理、知识等记录到计算机上,从而“再现”人的智能,比如专家系统(expert system)的尝试就属于这一类。内省的最大问题是它很难泛化,也就是举一反三。无论是在什么样的图片中,甚至是在抽象画中,人们能够轻而易举地找出其中的人脸。这种能力称为泛化能力。通过内省的方法很难使计算机拥有泛化能力。自己的智能原理,对人类来说很有可能是不可知的(agnostic)。笼子里的老鼠可能认为触动把手是得到食物的“原因”,但它永远也不能了解到整个笼子的食物投放机制。

 

模拟就是将人的智能化操作的输入与输出记录下来,用模型来模拟,使模型对输入输出给出同人类相似的表现,比如统计机器学习(statistical machine learning)。实践表明,统计机器学习是实现计算机智能化这一目标的最有效手段。统计学习最大的优点是它具有泛化能力;而缺点是它得到的永远是统计意义下的最优解(例如,人脸检测)。现在当人们提到机器学习时,通常是指统计机器学习或统计学习。

 

3.机器学习的优缺点

 

下面看一个简单的例子。由这个例子可以说明统计学习的基本原理,以及由此带来的优缺点。

 

假设我们观测到一个系统的输出是一系列的10,要预测它的下一个输出是什么。如果观测数据中1和0各占一半,那么我们只能0.5的准确率做出预测。但是,如果我们同时观测到这个系统有输入,也是一系列的1和0,并且输入是1时输出是0的比例是0.9,输入是0时输出是1的比例也是0.9。这样我们就可以从已给数据中学到“模型”,根据系统的输入预测其输出,并且把预测准确率从0.5提高到0.9。以上就是统计学习,特别是监督学习的基本想法。事实上,这是世界上最简单的统计机器学习模型!条件概率分布P(Y|X),其中随机变量X与Y表示输入与输出,取值1与0。可以认为所有的监督学习模型都是这个简单模型的复杂版。我们用这个模型根据给定的输入特征,预测可能的输出。

 

统计学习最大的优点是它具有泛化能力,对于任意给定的X,它都能预测相应的Y。Vapnik的统计学习理论还能对预测能力进行分析,给出泛化上界。但从这个例子中也可以看到统计学习的预测准确率是不能保证100%的。比如,人脸检测会出错,汉语分词会出错。

 

统计学习是“乡下人”的办法。有个笑话。一个乡下人进城,到餐馆吃饭,不知如何在餐馆用餐,就模仿旁边的人。别人做什么,他也就学着做什么。邻桌的一位故意戏弄他,将桌上的蜡烛卷在饼里,趁乡下人不注意时把蜡烛扔到地上,然后咬了一口卷着的饼。乡下人也跟着学,大咬了一口自己的饼。统计学习只是根据观测的输入与输出,“模仿”人的智能行为。有时能够显得非常智能化。但它本质上只是基于数据的,是统计平均意义下的“模仿”。如果观测不到关键的特征,它就会去“咬卷着蜡烛的饼”。

 

4.机器学习与互联网搜索

 

我与同事们在从事互联网搜索相关的研究。据调查,60%的互联网用户每天至少使用一次搜索引擎,90%的互联网用户每周至少使用一次搜索引擎。搜索引擎大大提高了人们工作、学习以及生活的质量。而互联网搜索的基本技术中,机器学习占据着重要的位置。

 


 

在我看来,互联网搜索有两大挑战和一大优势。挑战包括规模挑战与人工智能挑战;优势主要是规模优势。

 

规模挑战:比如,搜索引擎能看到trillion量级的URL,每天有几亿、几十亿的用户查询,需要成千上万台的机器抓取、处理、索引网页,为用户提供服务。这需要系统、软件、硬件等多方面的技术研发与创新。

 

人工智能挑战:搜索最终是人工智能问题。搜索系统需要帮助用户尽快、尽准、尽全地找到信息。这从本质上需要对用户需求(如查询语句),以及互联网上的文本、图像、视频等多种数据进行“理解”。现在的搜索引擎通过关键词匹配以及其他“信号”,能够在很大程度上帮助用户找到信息。但是,还是远远不够的。

 

规模优势:互联网上有大量的内容数据,搜索引擎记录了大量的用户行为数据。这些数据能够帮助我们找到看似很难找到的信息。比如,“纽约市的人口是多少”,“约市的人口是多少”,“春风又绿江南岸作者是谁”。注意这些数据都是遵循幂函数分布的。它们能帮助Head(高频)需求,对 tail(低频)需求往往是困难的。所以,对tail说人工智能的挑战就更显著。

 

现在的互联网搜索在一定程度上能够满足用户信息访问的一些基本需求。这归结于许多尖端技术包括机器学习技术的成功开发与应用,比如排序学习算法、网页重要度算法等等。这些机器学习算法在一定程度上能够利用规模优势去应对人工智能挑战。

 

但是、当今的互联网搜索距离 “有问必答,且准、快、全、好”这一理想还是有一定距离的。这就需要开发出更多更好的机器学习技术解决人工智能的挑战,特别是在tail中的挑战。

 

展望未来,机器学习技术的研究与开发会帮助我们让明天更美好!

(注:本文所有图片均来自网络)

社交网搜索成为网络搜索学界最炙手可热的话题

李航博士

微软亚洲研究院互联网搜索与挖掘组高级研究员及主任研究员。李航的研究方向包括信息检索,自然语言处理,统计机器学习,及数据挖掘。

 

相关阅读

跳出盒子的想象与机器学习

社交网搜索成为网络搜索学界最炙手可热的话题

下一代互联网搜索的前沿:意图、知识与云

自然用户界面在微软技术节大放异彩

                                                                               

欢迎关注

微软亚洲研究院人人网主页:http://page.renren.com/600674137

微软亚洲研究院微博:http://t.sina.com.cn/msra 

src/test/resources/severalTagsClosedByChildBreakHTML5.html0000644000000000000000000000027712523434626022627 0ustar rootroot
  • Some incomplete li

  • Another li
  • src/test/resources/severalTagsClosedByChildBreak.html0000644000000000000000000000027712113037735022050 0ustar rootroot
  • Some incomplete li

  • Another li
  • src/test/resources/severalTagsClosedByChildBreak-cleanedHTML5.html0000644000000000000000000000023112523434626024206 0ustar rootrootsrc/test/resources/severalTagsClosedByChildBreak-cleaned.html0000644000000000000000000000022112113037735023426 0ustar rootroot
  • Some incomplete li

  • Another li
    • This is some text
  • src/test/resources/script_test.html0000644000000000000000000000032112113211532016550 0ustar rootroot
    home
    vvflag_greekvflag_engvfb

    Technology is complex. Technology is fast. At the same time, technology is the only way to go.

    In aace, we definitely believe in technology and that is why we specialize in the design and installation of advanced systems for the construction industry. Particularly, we specialize in the areas of home automation and structural health monitoring (SHM).

    Home Automation

    aace provides advanced systems for home automation that integrate multimedia (audio, video), lighting, climate control, security and home management (e.g. water sprinklers, doors etc). We are proud to offer the Control4 system, one of the best systems in world, since it combines cutting edge technology and uncomparable design. For more please click here.

    Structural Health Monitoring

    In SHM, our sole mission is to provide state-of-the-art, comprehensive, reliable and timely solutions through the use of advanced wireless systems. We offer the unique Sensametrics wireless sensor network technology to assess structural integrity of buildings, bridges, energy facilities, tunnels and other civil structures. For more please click here.

    src/test/resources/oome_70.html0000644000000000000000000004630712124737107015505 0ustar rootroot "Most topped" in a Blog/Give your blog links/ | Article Directory

    "Most topped" in a Blog/Give your blog links/

    354 pageview(s)  2007-10-26 05:43  
    See Why Our Team Is The Fastest
    Growing team in Network Marketing VT and we are all over YOUTUBE!
    How I Became #1 On The Leader Board
    Find The Secrets To Get An Unlimited Supply of Leads And Make 100% Commission At The Same Time.
    How To Build A Targeted List
    The List Building Club 10 Free Quick Start Videos “How To Get Your Own Profitable Website”
    What i thought when i got this idea was to get a concentrated look to the best quality writings that got more tops here at Apsense.I was thinking that i would do a "favor" to the members who will be placing their blog links here, as their blogs will get hits and be read but also to new members joining, to give them the possibility to have an idea about Apsense.

    So i will be posting for the moment the blogs which are topped over 20. 
    And then anyone who reaches this number can post a comment and i will put their blog link too.



    Top Blogs

    These are the Steps to get your Blog Topped / Attention Newbies

    I am at a loss today...

    How to Achieve What You Desire With the Step By Step Goal Achievement System – Part I

     



    And i also want to invite members who created blogs which have got over 50 comments to post their blog links too.This is important because the discussions and comments about the blog can contain useful and helpful information.I will create another special table for that too. 

    And at last please top this blog as it would be seen and this can make your blog links too be seen.
    Regards Indrit


    Related Articles

    View all 3 comments
    runningman72  Committed  Oct 29th 09:26
    Thanks for sharing this useful info. Stephen
    Blog
    [url=www.apsense.com/group/hits4pay]Group[/url]

    indrit  Senior  Oct 29th 09:49
    You are welcome.I will be soon be adding top subjects and top blogs for comments

    web20empire  Advanced  Jan 16th 12:04
    How does one get their blog posts "topped" here on Apsense. I know how to generate traffic other places but I'm not sure of the infrastructure here just yet. Usually, I have friend requests by now but it seems that here I have yet to exist. You got any tips on exposure on Apsense? Sheree Motiska



    src/test/resources/gg_prob_cleaned.html0000644000000000000000000000016512113037735017320 0ustar rootroot
    Some text 0
    Some text 1
    Some text 3
    src/test/resources/gg_prob.html0000644000000000000000000002363012113037735015647 0ustar rootrootGoogle
    Παγκόσμιος ιστός Εικόνες Ειδήσεις Μετάφραση Ιστολόγια Ημερολόγιο Gmail περισσότερα
    iGoogle | Αναζήτηση ρυθμίσεων | Είσοδος
    Για πιο γρήγορη περιήγηση στο διαδίκτυο


     
      Σύνθετη Αναζήτηση
      Γλωσσικά εργαλεία
    Αναζήτηση:


    Προγράμματα Διαφήμισης - Επιχειρηματικές λύσεις - Σχετικά με τη Google - Google.com in English

    ©2009 - Απόρρητο

    src/test/resources/Expected_1.html0000644000000000000000000002333312245660030016205 0ustar rootroot home
    home
    vvflag_greekvflag_engvfb

    Technology is complex. Technology is fast. At the same time, technology is the only way to go.

    In aace, we definitely believe in technology and that is why we specialize in the design and installation of advanced systems for the construction industry. Particularly, we specialize in the areas of home automation and structural health monitoring (SHM).

    Home Automation

    aace provides advanced systems for home automation that integrate multimedia (audio, video), lighting, climate control, security and home management (e.g. water sprinklers, doors etc). We are proud to offer the Control4 system, one of the best systems in world, since it combines cutting edge technology and uncomparable design. For more please click here.

    Structural Health Monitoring

    In SHM, our sole mission is to provide state-of-the-art, comprehensive, reliable and timely solutions through the use of advanced wireless systems. We offer the unique Sensametrics wireless sensor network technology to assess structural integrity of buildings, bridges, energy facilities, tunnels and other civil structures. For more please click here.

src/test/java/0000755000000000000000000000000013105122453012237 5ustar rootrootsrc/test/java/org/0000755000000000000000000000000013105122453013026 5ustar rootrootsrc/test/java/org/htmlcleaner/0000755000000000000000000000000013105122453015324 5ustar rootrootsrc/test/java/org/htmlcleaner/XPatherTest.java0000644000000000000000000001376211474551373020431 0ustar rootroot
package org.htmlcleaner;

import junit.framework.TestCase;

import java.io.File;

/**
 * Testing XPath expressions against TagNodes results from cleaning process.
 */
public class XPatherTest extends TestCase {

    private TagNode rootNode;

    protected void setUp() throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean( new File("src/test/resources/test5.html") );
    }

    public void testPathExpression() throws XPatherException {
        assertTrue( rootNode.evaluateXPath( "//div//a" ).length == 160 );
        assertStringArray(
            rootNode.evaluateXPath("//div//a[@id][@class]"),
            new Object[] { "Ocean", "More Yahoo! Services" }
        );
        assertStringArray(
            rootNode.evaluateXPath("/body/*[1]/@type"),
            new Object[] { "text/javascript" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//div[3]//a[@id]"),
            new Object[] { "In the News", "World", "Local", "Finance" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//div[3]//a[@id][@href='r/n4']"),
            new Object[] { "Local" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//div[3]//a['video'=@class]"),
            new Object[] {
                "An on-court proposal",
                "See a one-armed basketball champ ",
                "Israeli police raid the home of gunman behind school shooting",
                "Clinton continues to question Obama's experience",
                "Zero emission sports car unveiled at Switzerland auto show",
            }
        );
        assertStringArray(
            rootNode.evaluateXPath("//div[3]//a[@style]/..//li[a]"),
            new Object[] { "News", "Popular", "Election '08" }
        );
        assertStringArray(
            rootNode.evaluateXPath("(//body//div[3][@class]/span)[4]/@id"),
            new Object[] { "featured4ct" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//body//div[3][@class]//span[2]/@id"),
            new Object[] { "featured2ct", "worldnewsct" }
        );
        assertStringArray(
            rootNode.evaluateXPath("(//div[last() >= 4]//./div[position() = last()])[position() > 22]//li[2]//a"),
            new Object[] { "Awesome Chicken Noodle...", "Celebrity Rehab", "24" }
        );
        assertEquals( rootNode.evaluateXPath("//*[@class][@id]//*[@style]").length, 23 );
        assertEquals( rootNode.evaluateXPath("//div/@class").length, 43 );
        assertEquals( rootNode.evaluateXPath("//div//@class").length, 130 );
        assertStringArray(
            rootNode.evaluateXPath("(//div[@id]//@class)[position() < 5]"),
            new Object[] { "eyebrowborder", "mastheadbd", "iemw", "ac_container" }
        );
        assertEquals( rootNode.evaluateXPath("//div[2]/@*").length, 33 );
        assertStringArray(
            rootNode.evaluateXPath("//div[2]/@*[2]"),
            new Object[] { "ad", "bd", "bd", "papreviewdiv", "ad", "bd", "bd" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//div[2]//a[. = \"Images\"]/@href"),
            new Object[] { "r/00/*-http://images.search.yahoo.com/search/images" }
        );
    }

    public void testFunctions() throws XPatherException {
        assertNumber( rootNode.evaluateXPath("count(//div//img)"), 26 );
        assertStringArray(
            rootNode.evaluateXPath("data(//div//a[@id][@class])"),
            new Object[] { "Ocean", "More Yahoo! Services" }
        );
        assertStringArray(
            rootNode.evaluateXPath("count(//a)"),
            new Object[] { "160" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//p/last()"),
            new Object[] { "2", "2" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//style/position()"),
            new Object[] { "1", "2", "3", "4", "5", "6", "7" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//body//div[3][@class]//span[last()<=4]/@id"),
            new Object[] { "inthenews2ct", "worldnewsct", "localnewsct", "finsnewsct" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//body//div[3][@class]//span[12.2= 4][position() <= 2]//li[4]//a)"),
            new Object[] { "Video", "7 top cities for a great weekend trip", "Chrysler", "Jeep", "Saturn", "Insurance" }
        );
        assertStringArray(
            rootNode.evaluateXPath("//a['v' < @id]/@id"),
            new Object[] { "vsearchmore", "worldnews" }
        );
        assertStringArray(
            rootNode.evaluateXPath("data(//a['v' < @id])"),
            new Object[] { "More", "World" }
        );
    }

    private void assertNumber(Object array[], double number) {
        assertTrue(array != null);
        assertTrue(array.length == 1);
        assertTrue(array[0] instanceof Number);
        assertTrue(array[0] instanceof Number);
        assertTrue(((Number)array[0]).doubleValue() == number);
    }

    private void assertStringArray(Object array1[], Object array2[]) {
        assertNotNull( array1 );
        assertNotNull( array2 );
        assertEquals( array1.length, array2.length );
        for (int i = 0; i < array1.length; i++) {
            assertNotNull( array1[i] );
            assertNotNull( array2[i] );
            String s1 = array1[i] instanceof TagNode ? ((TagNode)array1[i]).getText().toString() : array1[i].toString();
            String s2 = array2[i] instanceof TagNode ? ((TagNode)array2[i]).getText().toString() : array2[i].toString();
            assertEquals(s1, s2);
        }
    }

}
src/test/java/org/htmlcleaner/XmlDeclarationsTest.java0000644000000000000000000001134212177141130022122 0ustar rootroot/* Copyright (c) 2006-2013, HtmlCleaner project team (Vladimir Nikic, Scott Wilson, Pat Moore) All rights reserved.
Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; import org.junit.BeforeClass; import org.junit.Test; public class XmlDeclarationsTest { static HtmlCleaner cleaner; static CompactXmlSerializer serializer; static final String expectedOutput = "\n\n\n\n

    test

    "; @BeforeClass public static void setup(){ cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); properties.setOmitXmlDeclaration(false); properties.setOmitDoctypeDeclaration(false); properties.setIgnoreQuestAndExclam(false); serializer = new CompactXmlSerializer(properties); } // // No Newlines // @Test public void checkXml(){ TagNode cleaned = cleaner.clean("\n

    test

    "); String output = serializer.getAsString(cleaned); assertEquals(DoctypeToken.XHTML1_0_STRICT, cleaned.getDocType().getType()); assertTrue(cleaned.getDocType().isValid()); assertEquals(expectedOutput, output); } // // Newlines // @Test public void checkWhitespace(){ TagNode cleaned = cleaner.clean("\n\n

    test

    "); String output = serializer.getAsString(cleaned); assertEquals(DoctypeToken.XHTML1_0_STRICT, cleaned.getDocType().getType()); assertTrue(cleaned.getDocType().isValid()); assertEquals(expectedOutput, output); } /** * This is to test issue #67 */ @Test public void checkXmlNoExtraWhitespace(){ String expected = "\n\n

    test

    "; TagNode cleaned = cleaner.clean(expected); cleaner.getProperties().setAddNewlineToHeadAndBody(false); Serializer theSerializer = new SimpleXmlSerializer(cleaner.getProperties()); String output = theSerializer.getAsString(cleaned); cleaner.getProperties().setAddNewlineToHeadAndBody(true); assertEquals(expected, output); } @Test public void checkXmlNoEncoding(){ TagNode cleaned = cleaner.clean("\n

    test

    "); String output = serializer.getAsString(cleaned); assertEquals(DoctypeToken.XHTML1_0_STRICT, cleaned.getDocType().getType()); assertTrue(cleaned.getDocType().isValid()); assertEquals(expectedOutput, output); } } src/test/java/org/htmlcleaner/VisitorTest.java0000644000000000000000000000333212113037735020475 0ustar rootrootpackage org.htmlcleaner; import java.io.File; import java.io.IOException; import junit.framework.TestCase; /** * Testing TagNodeVisitor traversal of the TagNode tree produced by the cleaning process. */ public class VisitorTest extends TestCase { private TagNode node; private CleanerProperties props; protected void setUp() throws Exception { HtmlCleaner cleaner = new HtmlCleaner(); props = cleaner.getProperties(); node = cleaner.clean( new File("src/test/resources/test9.html") ); } public void testNodeTraverse() throws IOException, XPatherException { final StringBuffer superstar = new StringBuffer(); node.traverse(new TagNodeVisitor() { public boolean visit(TagNode parentNode, HtmlNode node) { if (node instanceof TagNode) { TagNode tagNode = (TagNode) node; String name = tagNode.getName(); if ( "p".equals(name) ) { tagNode.removeAllChildren(); } else if ("h1".equals(name)) { if ("superstar".equals(tagNode.getAttributeByName("id"))) { superstar.append(tagNode.getText()); return false; } } } else if (node instanceof ContentNode) { } else if (node instanceof CommentNode) { } return true; } }); assertEquals(node.evaluateXPath("//p[1]/*").length, 0); assertTrue("freestylo".equals(superstar.toString())); assertEquals(node.evaluateXPath("//p[2]/*").length, 1); } } src/test/java/org/htmlcleaner/UtilsTest.java0000644000000000000000000000445312574564070020153 0ustar rootrootpackage org.htmlcleaner; import static org.junit.Assert.assertEquals; import org.junit.Test; /** * * @author Eugene Sapozhnikov (blackorangebox@gmail.com) * */ public class UtilsTest extends Utils { /** * Test for code points above 65535 - see bug #152 */ @Test public void testConvertUnicode(){ 
String result = new String("UTF-8"); String input = "😎"; String output = "😎"; result = Utils.escapeXml(input, true, true, true, false, false, false); assertEquals(output, result); input = "🙏"; output = "🙏"; result = Utils.escapeXml(input, true, true, true, false, false, false); assertEquals(output, result); } @Test public void testEscapeXml_transResCharsToNCR() { String res = Utils.escapeXml("1.&\"'<>", true, true, true, false, true, false); assertEquals("1.&"'<>", res); res = Utils.escapeXml("2.&"'<>", true, true, true, false, true, false); assertEquals("2.&"'<>", res); res = Utils.escapeXml("1.&\"'<>", true, true, true, false, false, false); assertEquals("1.&"'<>", res); res = Utils.escapeXml("2.&"'<>", true, true, true, false, false, false); assertEquals("2.&"'<>", res); } @Test public void testEscapeXml_recognizeUnicodeChars() { String res = Utils.escapeXml("[α][é][‾]", true, false, true, false, false, false); assertEquals("[α][é][‾]", res); res = Utils.escapeXml("[α][é][‾][Σ]", true, true, true, false, false, false); assertEquals("[α][é][‾][Σ]", res); } @Test public void testEscapeXml_transSpecialEntitiesToNCR_withHex() { String res = Utils.escapeXml("'¡", true, false, true, false, false, true); assertEquals("'¡", res); res = Utils.escapeXml("'¡", true, false, true, false, false, true); assertEquals("'¡", res); res = Utils.escapeXml("'¡", true, false, true, false, false, false); assertEquals("'¡", res); } } src/test/java/org/htmlcleaner/TransformationTest.java0000644000000000000000000001152512113037735022047 0ustar rootrootpackage org.htmlcleaner; import junit.framework.TestCase; import java.io.IOException; import java.io.File; import java.util.regex.Pattern; /** * Testing tag transformations. 
*/ public class TransformationTest extends TestCase { private HtmlCleaner cleaner; @Override protected void setUp() throws Exception { cleaner = new HtmlCleaner(); } public void test1() throws IOException { CleanerTransformations transformations = new CleanerTransformations(); TagTransformation tagTransformation = new TagTransformation("strong", "span", false); tagTransformation.addAttributeTransformation("style", "font-weight:bold"); transformations.addTransformation(tagTransformation); CleanerProperties props = cleaner.getProperties(); props.setCleanerTransformations(transformations); props.setOmitXmlDeclaration(true); props.setAddNewlineToHeadAndBody(false); TagNode node = cleaner.clean("
    Mama
    "); assertEquals( "
    Mama
    ", new CompactXmlSerializer(props).getAsString(node) ); } public void test2() throws IOException { CleanerProperties props = cleaner.getProperties(); CleanerTransformations transformations = props.getCleanerTransformations(); TagTransformation t = new TagTransformation("blockquote"); transformations.addTransformation(t); t = new TagTransformation("tags:bold", "td", false); t.addAttributeTransformation("style", "font-weight:bold;"); transformations.addTransformation(t); t = new TagTransformation("table", "table", false); t.addAttributeTransformation("style", "${style};background:${bgcolor};border:solid ${border};"); transformations.addTransformation(t); t = new TagTransformation("font", "span", true); t.addAttributeTransformation("style", "${style};font-family:${face};font-size:${size};color:${color};"); t.addAttributeTransformation("face"); t.addAttributeTransformation("size"); t.addAttributeTransformation("color"); t.addAttributeTransformation("name", "${face}_1"); transformations.addTransformation(t); TagNode node = cleaner.clean( new File("src/test/resources/test8.html"), "UTF-8" ); String xml = new PrettyXmlSerializer(props).getAsString(node); assertTrue("Shouldn't have blockquote in it "+xml, xml.indexOf("blockquote") < 0 ); assertTrue( xml.indexOf(""Hi there!"") >= 0 ); assertTrue( xml.indexOf("tags:bold") < 0 ); assertTrue( xml.indexOf("This is BOLD text?!") >= 0 ); assertTrue( xml.indexOf("bgcolor=#DDDDDD") < 0 ); assertTrue( xml.indexOf("style=\"padding:5\"") < 0 ); assertTrue( xml.indexOf("") >= 0 ); assertTrue( xml.indexOf("") < 0 ); assertTrue( xml.indexOf("color=red") < 0 ); assertTrue( xml.indexOf("color=\"red\"") < 0 ); assertTrue( xml.indexOf("size=16") < 0 ); assertTrue( xml.indexOf("size=\"16\"") < 0 ); assertTrue( xml.indexOf("face=\"Arial\"") < 0 ); assertTrue( xml.indexOf("id=\"fnt_1\"") >= 0 ); assertTrue( xml.indexOf("name=\"Arial_1\"") >= 0 ); assertTrue( xml.indexOf("style=\";font-family:Arial;font-size:16;color:red;\"") >= 0 ); } /** * 
* @throws IOException */ public void testGlobalTransformations() throws IOException { CleanerTransformations transformations = new CleanerTransformations(); // no "on*" attributes AttributeTransformationPatternImpl attPattern = new AttributeTransformationPatternImpl(Pattern.compile("^\\s*on", Pattern.CASE_INSENSITIVE), null, null); transformations.addGlobalTransformation(attPattern); AttributeTransformationPatternImpl attPattern1 = new AttributeTransformationPatternImpl(null, Pattern.compile("^\\s*javascript:", Pattern.CASE_INSENSITIVE), null); transformations.addGlobalTransformation(attPattern1); CleanerProperties props = cleaner.getProperties(); props.setCleanerTransformations(transformations); props.setOmitXmlDeclaration(true); props.setAddNewlineToHeadAndBody(false); TagNode node = cleaner.clean("

    Mama

    "); assertEquals( "

    Mama

    ", new CompactXmlSerializer(props).getAsString(node) ); } }src/test/java/org/htmlcleaner/ThreadSafetyTest.java0000644000000000000000000001576412176737510021444 0ustar rootroot/* Copyright (c) 2006-2013, HtmlCleaner Team (Vladimir Nikic, Pat Moore, Scott Wilson) All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ package org.htmlcleaner; import java.io.IOException; import java.io.StringWriter; import java.util.ArrayList; import java.util.List; import java.util.UUID; import java.util.regex.Matcher; import java.util.regex.Pattern; import junit.framework.TestCase; import org.htmlcleaner.CleanerProperties; import org.htmlcleaner.HtmlCleaner; import org.htmlcleaner.Serializer; import org.htmlcleaner.SimpleHtmlSerializer; import org.htmlcleaner.TagNode; /** * Test case for determining whether HtmlCleaner is thread-safe. * * Thanks to Tobias for the test case and report (see bug #86) * */ public class ThreadSafetyTest extends TestCase { private static final int NUM_THREADS = 20; private static final int NUM_RUNS = 100; private static final HtmlCleaner HTML_CLEANER; private static final Serializer SERIALIZER; private static final Pattern uidPattern = Pattern.compile( "\\b[A-F0-9]{8}(?:-[A-F0-9]{4}){3}-[A-Z0-9]{12}\\b", Pattern.CASE_INSENSITIVE); static { final CleanerProperties props = new CleanerProperties(); props.setOmitDoctypeDeclaration(true); props.setOmitXmlDeclaration(true); props.setPruneTags("script"); props.setTranslateSpecialEntities(true); props.setTransSpecialEntitiesToNCR(true); props.setTransResCharsToNCR(true); props.setRecognizeUnicodeChars(false); props.setUseEmptyElementTags(false); props.setIgnoreQuestAndExclam(false); props.setUseCdataForScriptAndStyle(false); props.setIgnoreQuestAndExclam(true); HTML_CLEANER = new HtmlCleaner(props); SERIALIZER = new SimpleHtmlSerializer(props); } public ThreadSafetyTest() { super(); } public void testThreadSafety() throws Exception { Thread[] threads = new Thread[NUM_THREADS]; CheckHtmlCleaner[] runnables = new CheckHtmlCleaner[NUM_THREADS]; for (int i = 0; i < NUM_THREADS; i++) { runnables[i] = new CheckHtmlCleaner(); threads[i] = new Thread(runnables[i]); threads[i].start(); } for (int i = 0; i < NUM_THREADS; i++) { threads[i].join(); if (false == runnables[i].errors.isEmpty()) { throw runnables[i].errors.get(0); } } } 
private static final class CheckHtmlCleaner implements Runnable { boolean onlyDetectForeignMarkers = true; List<AssertionError> errors = new ArrayList<AssertionError>(); public void run() { for (int i = 0; i < NUM_RUNS; i++) { String marker = UUID.randomUUID().toString(); String html = "\n" + " \n" + " \n" + " \n" + " \n" + "
    \n" + " wurst\n" + "
    \n" + "
    \n" + " gurke\n" + "
    \n" + "
    \n" + " hund\n" + "
    \n" + "
    \n" + " " + marker +"\n" + "
    \n" + "
    \n" + " autobahn\n" + "
    \n" + "
    \n" + " suppe\n" + "
    \n" + "
    \n" + "  \n" + "
    \n" + " \n" + "" ; try { TagNode htmlNode = HTML_CLEANER.clean(html); StringWriter writer = new StringWriter(); SERIALIZER.write(htmlNode, writer, "UTF-8"); String cleanedHtml = writer.getBuffer().toString(); assertNotNull(cleanedHtml); Matcher matcher = uidPattern.matcher(cleanedHtml); if (onlyDetectForeignMarkers) { if (matcher.find()) { assertEquals("Cleaned HTML contains foreign marker", marker, matcher.group()); } } else { assertTrue("Cleaned HTML contains no marker", matcher.find()); assertEquals("Cleaned HTML contains foreign marker", marker, matcher.group()); assertTrue("Cleaned HTML appears to be too short", cleanedHtml.length() > 600); assertTrue("Cleaned HTML appears to be too long", cleanedHtml.length() < 700); } } catch (AssertionError e) { errors.add(e); break; } catch (RuntimeException e) { // we want to find assertion errors continue; } catch (IOException e) { fail(e.getMessage()); } } } } } src/test/java/org/htmlcleaner/TagManipulationTest.java0000644000000000000000000000401112544757566022150 0ustar rootrootpackage org.htmlcleaner; import junit.framework.TestCase; import java.io.File; import java.io.IOException; /** * Testing node manipulation after cleaning. */ public class TagManipulationTest extends TestCase { private HtmlCleaner cleaner; @Override protected void setUp() throws Exception { cleaner = new HtmlCleaner(); } public void testInnerHtml() throws XPatherException, IOException { TagNode node = cleaner.clean(new File("src/test/resources/test2.html")); cleaner.setInnerHtml((TagNode) (node.evaluateXPath("//table[1]")[0]), "
    "); } public void testManipulation() throws XPatherException, IOException { TagNode node9 = cleaner.clean(new File("src/test/resources/test9.html")); TagNode pNode = (TagNode) node9.evaluateXPath("//p[1]")[0]; pNode.removeAllChildren(); TagNode h3 = new TagNode("h3"); pNode.addChild(h3); TagNode h2 = new TagNode("h2"); TagNode h4 = new TagNode("h4"); pNode.insertChildBefore(h3, h2); pNode.insertChildAfter(h3, h4); ContentNode testContent = new ContentNode("TEST BEFORE H3 AND AFTER H2"); pNode.insertChild(1, testContent); pNode.addChild(new ContentNode("LAST_ONE")); assertTrue(pNode.getChildIndex(h4) == 3); CleanerProperties props = new CleanerProperties(); props.setOmitXmlDeclaration(true); props.setNamespacesAware(false); String pNodeAsString = new CompactXmlSerializer(props).getAsString(pNode); pNodeAsString = pNodeAsString.replaceAll("\n", ""); assertEquals("

    TEST BEFORE H3 AND AFTER H2

    LAST_ONE

    ", pNodeAsString); } }src/test/java/org/htmlcleaner/TagCopyingAndLimitingTest.java0000644000000000000000000001453212524371242023226 0ustar rootroot/* Copyright (c) 2006-2014, The HtmlCleaner Project All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ package org.htmlcleaner; import java.io.IOException; import java.io.StringReader; import java.io.StringWriter; import javax.xml.parsers.ParserConfigurationException; import org.jdom2.Document; import org.jdom2.output.Format; import org.jdom2.output.XMLOutputter; import junit.framework.TestCase; /** * Tests the effect of successively having to copy identical tags in a list. 
*/ public class TagCopyingAndLimitingTest extends TestCase { public void testTagCopyingAndLimitingHTML4() throws IOException, ParserConfigurationException { StringBuilder sb = new StringBuilder(); sb.append(" // collapsing assertTrue(xmlString, xmlString.indexOf("") >= 0); properties.setUseEmptyElementTags(false); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("
    row1row2row3"); assertEquals(node.evaluateXPath("//table[1]/tbody[1]/tr[1]/td").length, 3); assertEquals(cleaner.getInnerHtml((TagNode) (node.evaluateXPath("//table[1]")[0])), "
    row1row2row3
    ") >= 0); } public void testAllowMultiWordAttributes() throws Exception { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); String xmlString; properties.setAdvancedXmlEscape(false); properties.setUseEmptyElementTags(false); properties.setAllowMultiWordAttributes(false); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("
    ") < 0); assertTrue(xmlString.indexOf("
    ") >= 0); properties.setAllowMultiWordAttributes(true); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("
    ") >= 0); properties.setAllowHtmlInsideAttributes(true); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("LINK 1") >= 0); properties.setAllowHtmlInsideAttributes(false); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("LINK 1") < 0); assertTrue(xmlString.indexOf("Title is here">LINK 1") >= 0); properties.setIgnoreQuestAndExclam(true); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("<!INSTRUCTION1 id="aaa">") < 0); assertTrue(xmlString.indexOf("<?INSTRUCTION2 id="bbb">") < 0); properties.setIgnoreQuestAndExclam(false); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("<!INSTRUCTION1 id="aaa">") >= 0); assertTrue(xmlString.indexOf("<?INSTRUCTION2 id="bbb">") >= 0); properties.setNamespacesAware(true); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("") >= 0); assertTrue(xmlString.indexOf("aaa") >= 0); properties.setNamespacesAware(false); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("= 0); assertTrue(xmlString.indexOf("aaa") >= 0); } public void testAllowHtmlInsideAttributes() throws Exception { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); String xmlString; properties.setAdvancedXmlEscape(false); properties.setAllowHtmlInsideAttributes(true); xmlString = getXmlString(cleaner, properties); assertTrue( xmlString.indexOf("LINK 1") >= 0 ); properties.setAllowHtmlInsideAttributes(false); xmlString = getXmlString(cleaner, properties); assertTrue( xmlString.indexOf("LINK 1") < 0 ); xmlString = getXmlString(cleaner, properties); assertTrue( xmlString.indexOf("Title is here">LINK 1") >= 0 ); } public void testIgnoreQuestAndExclam() throws Exception { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); String xmlString; properties.setAdvancedXmlEscape(false); properties.setIgnoreQuestAndExclam(true); xmlString 
= getXmlString(cleaner, properties); assertTrue( xmlString.indexOf("<!INSTRUCTION1 id="aaa">") < 0 ); xmlString = getXmlString(cleaner, properties); assertTrue( xmlString.indexOf("<?INSTRUCTION2 id="bbb">") < 0 ); properties.setIgnoreQuestAndExclam(false); xmlString = getXmlString(cleaner, properties); assertTrue( xmlString.indexOf("<!INSTRUCTION1 id="aaa">") >= 0 ); xmlString = getXmlString(cleaner, properties); assertTrue( xmlString.indexOf("<?INSTRUCTION2 id="bbb">") >= 0 ); } /** * @throws IOException */ public void testComments() throws IOException { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); properties.setNamespacesAware(false); properties.setOmitComments(false); assertTrue(getXmlString(cleaner, properties).indexOf("") >= 0); properties.setOmitComments(true); assertTrue(getXmlString(cleaner, properties).indexOf("") < 0); properties.setOmitComments(false); assertTrue(getXmlString(cleaner, properties).indexOf("") >= 0); properties.setHyphenReplacementInComment("*"); assertTrue(getXmlString(cleaner, properties).indexOf("") >= 0); } /** * @throws IOException */ public void testOmitXmlDeclaration() throws IOException { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); properties.setNamespacesAware(false); properties.setOmitXmlDeclaration(false); assertTrue(getXmlString(cleaner, properties).indexOf("= 0); properties.setOmitXmlDeclaration(true); assertTrue(getXmlString(cleaner, properties).indexOf("") >= 0); properties.setOmitDoctypeDeclaration(true); assertTrue(getXmlString(cleaner, properties).indexOf( "") < 0); } /** * @throws IOException */ public void testOmitHtmlEnvelope() throws IOException { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); properties.setHtmlVersion(4); properties.setNamespacesAware(false); properties.setAddNewlineToHeadAndBody(false); String xmlString; properties.setOmitHtmlEnvelope(true); 
xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("") < 0); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("") < 0); properties.setOmitHtmlEnvelope(false); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString, xmlString.indexOf("") >= 0); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString, xmlString.indexOf("") >= 0); } /** * @throws IOException */ public void testOmitHtml5Envelope() throws IOException { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); properties.setHtmlVersion(5); properties.setNamespacesAware(false); properties.setAddNewlineToHeadAndBody(false); String xmlString; properties.setOmitHtmlEnvelope(true); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("") < 0); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("") < 0); properties.setOmitHtmlEnvelope(false); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString, xmlString.indexOf("") >= 0); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString, xmlString.indexOf("") >= 0); } public void testPruneProperties() throws Exception { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); properties.reset(); properties.setPruneTags("div,mytag"); String xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("= 0); assertTrue(getXmlString(cleaner, properties).indexOf("") >= 0); properties.setBooleanAttributeValues("empty"); assertTrue(getXmlString(cleaner, properties).indexOf("") >= 0); properties.setBooleanAttributeValues("true"); assertTrue(getXmlString(cleaner, properties).indexOf("") >= 0); properties.setBooleanAttributeValues("selft"); assertTrue(getXmlString(cleaner, properties).indexOf("") >= 0); } private String getXmlString(HtmlCleaner cleaner, CleanerProperties properties) throws IOException { TagNode node = cleaner.clean(new 
File("src/test/resources/test4.html"), "UTF-8"); String xmlString = new SimpleXmlSerializer(properties).getAsString(node); return xmlString; } public void testNbsp() throws Exception { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); properties.setTranslateSpecialEntities(false); properties.setOmitDoctypeDeclaration(false); properties.setOmitXmlDeclaration(true); properties.setAdvancedXmlEscape(true); properties.setAddNewlineToHeadAndBody(false); // test first when generating xml TagNode node = cleaner.clean("\n" + "
    &"''<> &garbage;&
    "); SimpleXmlSerializer simpleXmlSerializer = new SimpleXmlSerializer(properties); String xmlString = simpleXmlSerializer.getAsString(node, "UTF-8"); assertEquals("\n" + "
    &"''<>" + String.valueOf((char) 160) + "&garbage;&
    ", xmlString.trim()); simpleXmlSerializer.setCreatingHtmlDom(true); // then test when generating html String domString = simpleXmlSerializer.getAsString(node, "UTF-8"); assertEquals("\n" + // "
    &"''<> &garbage;&
    ", "
    &"''<> &garbage;&
    ", domString.trim()); } /** * make sure that the unicode character has leading 'x'. *
      *
    • ŠA; is converted by FF to 3 characters: Š + 'A' + ';'
    • *
    • �x138A; is converted by FF to 6? 7? characters: � 'x'+'1'+'3'+ * '8' + 'A' + ';' #0 is displayed kind of weird
    • *
    • ᎊ is a single character
    • *
    * * @throws Exception */ public void testHexConversion() throws Exception { CleanerProperties properties = new CleanerProperties(); properties.setOmitHtmlEnvelope(true); properties.setOmitXmlDeclaration(true); SimpleXmlSerializer simpleXmlSerializer = new SimpleXmlSerializer(properties); simpleXmlSerializer.setCreatingHtmlDom(false); String xmlString = simpleXmlSerializer.getAsString( "
    ŠA;
    "); assertEquals("
    "+new String(new char[] {138, 'A',';'})+"
    ", xmlString); xmlString = simpleXmlSerializer.getAsString( "
    "); assertEquals("
    "+new String(new char[] {0x138A})+"
    ", xmlString); properties.reset(); } public void testPattern() { for (Object[] test : new Object[][] { new Object[] { "0x138A;", false, -1, -1, null, true, 0, 7, "x138A", true, 0, 1, "0" }, new Object[] { "x138A;", true, 0, 6, "x138A", true, 0, 6, "x138A", false, -1, -1, null }, new Object[] { "138;", false, -1, -1, null, false, -1, -1, null, true, 0, 4, "138" }, new Object[] { "139", false, -1, -1, null, false, -1, -1, null, true, 0, 3, "139" }, new Object[] { "x13A", true, 0, 4, "x13A", true, 0, 4, "x13A", false, -1, -1, null }, new Object[] { "13F", false, -1, -1, null, false, -1, -1, null, true, 0, 2, "13" }, new Object[] { "13", false, -1, -1, null, false, -1, -1, null, true, 0, 2, "13" }, new Object[] { "X13AZ", true, 0, 4, "X13A", true, 0, 4, "X13A", false, -1, -1, null } }) { int i = 0; String input = (String) test[i++]; boolean strict = (Boolean) test[i++]; int sstart = (Integer) test[i++]; int send = (Integer) test[i++]; String sgroup = (String) test[i++]; boolean relaxed = (Boolean) test[i++]; int rstart = (Integer) test[i++]; int rend = (Integer) test[i++]; String rgroup = (String) test[i++]; boolean decimal = (Boolean) test[i++]; int dstart = (Integer) test[i++]; int dend = (Integer) test[i++]; String dgroup = (String) test[i++]; Matcher m = Utils.HEX_STRICT.matcher(input); boolean actual = m.find(); assertEquals(input, strict, actual); if (actual) { assertEquals(input + " strict start ", sstart, m.start()); assertEquals(input + " strict end ", send, m.end()); assertEquals(input + " strict group ", sgroup, m.group(1)); } m = Utils.HEX_RELAXED.matcher(input); actual = m.find(); assertEquals(input, relaxed, actual); if (actual) { assertEquals(input + " relaxed start ", rstart, m.start()); assertEquals(input + " relaxed end ", rend, m.end()); assertEquals(input + " relaxed group ", rgroup, m.group(1)); } m = Utils.DECIMAL.matcher(input); actual = m.find(); assertEquals(input, decimal, actual); if (actual) { assertEquals(input + " decimal start ", 
dstart, m.start()); assertEquals(input + " decimal end ", dend, m.end()); assertEquals(input + " decimal group ", dgroup, m.group(1)); } } } public void testConvertUnicode() throws Exception { CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitHtmlEnvelope(true); cleanerProperties.setOmitXmlDeclaration(true); cleanerProperties.setUseEmptyElementTags(false); // right tick is special unicode character 8217 String output = new SimpleXmlSerializer(cleanerProperties).getAsString( "

    President’s Message

    "); assertEquals("

    President’s Message

    ", output); } private static final String HTML_COMMENT_OUT_BEGIN = ""; private static final String SAMPLE_JS = "var x = ['foo','bar'];"; private static final String COMMENT_START = ""; /** * Test conversion of former ( now bad practice ) of: * *
         * <style><!-- style info --></style>
         * 
    * * into <style>/(star)<![CDATA[(star)/ style info * /(star)]]>(star)/</style> * * Note: disabled because it doesn't test actual behavior * @throws IOException */ public void disabledTestConvertOldStyleComments() throws IOException { // TODO: May need additional flag to handle '<' inside of scripts // dontEscape() in xml serializer should not be triggered based on use // cdata // but dontEscape is used by subclasses -- need to investigate best // solution. // maybe o.k. to have the < > be translated. That is what original test // does. // but the ' should probably not be touched?? HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = new CleanerProperties(); properties.setOmitXmlDeclaration(true); properties.setUseCdataForScriptAndStyle(true); properties.setAddNewlineToHeadAndBody(false); // test for positive matches to old-style comment hacks for (String[] testData : new String[][] { // normal case - remove old-style comment out hack new String[] { HTML_COMMENT_OUT_BEGIN + "//" + COMMENT_START + "\n" + SAMPLE_JS + "//" + COMMENT_END + "\n" + HTML_COMMENT_OUT_END, HTML_COMMENT_OUT_BEGIN + CData.SAFE_BEGIN_CDATA + "\n" + SAMPLE_JS + CData.SAFE_END_CDATA + "\n" + HTML_COMMENT_OUT_END }, // don't let random whitespace confuse things new String[] { HTML_COMMENT_OUT_BEGIN + "\n\n\n\n" + "//" + " \t" + COMMENT_START + "\n" + SAMPLE_JS + "\n\n\n" + "//" + COMMENT_END + "\n\n\t\n" + HTML_COMMENT_OUT_END, HTML_COMMENT_OUT_BEGIN + "\n\n\n\n" + CData.SAFE_BEGIN_CDATA + "\n" + SAMPLE_JS + "\n\n\n" + "//" + CData.SAFE_END_CDATA + "\n\n\t\n" + HTML_COMMENT_OUT_END }, }) { doTestConvertOldStyleComments(cleaner, properties, testData); } // test for false positives for (String[] testData : new String[][] { // make sure not to remove real comments new String[] { HTML_COMMENT_OUT_BEGIN + "//" + "an ordinary comment" + "\n" + SAMPLE_JS + "//" + "a final remark" + HTML_COMMENT_OUT_END, HTML_COMMENT_OUT_BEGIN + CData.SAFE_BEGIN_CDATA + "//" + "an ordinary comment" 
+ "\n" + SAMPLE_JS + "//" + "a final remark" + CData.SAFE_END_CDATA + HTML_COMMENT_OUT_END }, }) { doTestConvertOldStyleComments(cleaner, properties, testData); } } /** * @param cleaner * @param properties * @param testData */ private void doTestConvertOldStyleComments(HtmlCleaner cleaner, CleanerProperties properties, String[] testData) throws IOException { TagNode node = cleaner.clean(testData[0]); // test to make sure the no-op still works properties.setUseCdataForScriptAndStyle(false); String xmlString = new SimpleXmlSerializer(properties).getAsString(node); assertEquals(testData[0], xmlString); // now test actual properties.setUseCdataForScriptAndStyle(true); xmlString = new SimpleXmlSerializer(properties).getAsString(node); assertEquals(testData[1], xmlString); } public void testIgnoreClosingCData() throws IOException { String html = "\n" + "ASWA - Events" + ""; CleanerProperties properties = new CleanerProperties(); properties.setOmitXmlDeclaration(true); properties.setUseCdataForScriptAndStyle(true); properties.setAddNewlineToHeadAndBody(false); properties.setIgnoreQuestAndExclam(false); HtmlCleaner cleaner = new HtmlCleaner(properties); TagNode node = cleaner.clean(html); //properties.setUseCdataForScriptAndStyle(false); String xmlString = new SimpleXmlSerializer(properties).getAsString(node); assertEquals(html, xmlString); } public void testTransResCharsToNCR() throws Exception { HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties properties = cleaner.getProperties(); String xmlString; properties.setNamespacesAware(false); properties.setAdvancedXmlEscape(true); properties.setTransResCharsToNCR(true); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("
    1.&"'<>
    ") >= 0); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("
    2.&"'<>
    ") >= 0); properties.setTransResCharsToNCR(false); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("
    1.&"'<>
    ") >= 0); xmlString = getXmlString(cleaner, properties); assertTrue(xmlString.indexOf("
    2.&"'<>
    ") >= 0);
    }
}

// src/test/java/org/htmlcleaner/NamespacesTest.java
/*  Copyright (c) 2006-2013, the HtmlCleaner Project
    All rights reserved.

    Redistribution and use of this software in source and binary forms,
    with or without modification, are permitted provided that the following
    conditions are met:

    * Redistributions of source code must retain the above
      copyright notice, this list of conditions and the
      following disclaimer.

    * Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the
      following disclaimer in the documentation and/or other
      materials provided with the distribution.

    * The name of HtmlCleaner may not be used to endorse or promote
      products derived from this software without specific prior
      written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.
*/

package org.htmlcleaner;

import java.io.IOException;

import org.junit.Test;

public class NamespacesTest extends AbstractHtmlCleanerTest{

    /**
     * Tests that we can handle XMLNS="" attributes.
     * See issue #135
     * @throws IOException
     */
    @Test
    public void xmlnsAttributeInUpperCase() throws IOException{
        String initial = "";
        String expected = "\n\n";
        assertCleaned(initial, expected);
    }

    /**
     * Tests that we can handle xmlns="" attributes. See issue #135
     * @throws IOException
     */
    @Test
    public void emptyNamespaces() throws IOException{
        String initial = readFile("src/test/resources/test32.html");
        String expected = "\n\n

    Text

    ";
        assertCleaned(initial, expected);
    }

    /**
     * Uses an RDFa example to test that we retain namespace declarations. See issue #63
     * @throws IOException
     */
    @Test
    public void RDFa() throws IOException{
        String initial = readFile("src/test/resources/test13.html");
        String expected = readFile("src/test/resources/test13_expected.html");
        assertCleaned(initial, expected);
    }

    /**
     * Uses a namespace prefix for an element. See issue #63
     * @throws IOException
     */
    @Test
    public void DCElement() throws IOException{
        String initial = readFile("src/test/resources/test14.html");
        String expected = readFile("src/test/resources/test14_expected.html");
        assertCleaned(initial, expected);
    }

    /**
     * Uses a namespace prefix for an attribute. See issue #63
     * @throws IOException
     */
    @Test
    public void DCAttribute() throws IOException{
        String initial = readFile("src/test/resources/test15.html");
        String expected = readFile("src/test/resources/test15_expected.html");
        assertCleaned(initial, expected);
    }

    /**
     * If we aren't NS aware, strip out the xmlns attr and process everything
     * as HTML.
     */
    @Test
    public void testTableCellsWithoutNamespaceAwareness() throws IOException{
        cleaner.getProperties().setNamespacesAware(false);
        String initial = readFile("src/test/resources/test26.html");
        String expected = readFile("src/test/resources/test26_expected.html");
        assertCleaned(initial, expected);
    }

    /**
     * If we are namespace-aware and use the legacy HTML namespace, we should
     * treat the content as HTML.
     * See issue #115
     */
    @Test
    public void testTableCellsUsingNamespaceAwareAndLegacyHtmlNS() throws IOException{
        cleaner.getProperties().setNamespacesAware(true);
        cleaner.getProperties().setOmitUnknownTags(true);
        String initial = readFile("src/test/resources/test26.html");
        String expected = readFile("src/test/resources/test26_expected.html");
        assertCleaned(initial, expected);
    }

    /**
     * If we're NS-aware and using XHTML, treat the content as HTML tags and
     * insert TBODY into the table (etc) but retain the xmlns attr on the html
     * tag
     */
    @Test
    public void testTableCellsUsingNamespaceAwareAndXhtmlNS() throws IOException{
        cleaner.getProperties().setNamespacesAware(true);
        cleaner.getProperties().setOmitUnknownTags(true);
        String initial = readFile("src/test/resources/test27.html");
        String expected = readFile("src/test/resources/test27_expected.html");
        assertCleaned(initial, expected);
    }

    /**
     * If we are namespace-aware and use an unknown namespace,
     * all the content will be treated as foreign markup; this means
     * there will be no insertion of TBODY tags as the table element
     * is not interpreted as being a HTML table element
     */
    @Test
    public void testTableCellsUsingNamespaceAwareAndUnknownNS() throws IOException{
        cleaner.getProperties().setNamespacesAware(true);
        cleaner.getProperties().setOmitUnknownTags(true);
        String initial = readFile("src/test/resources/test28.html");
        String expected = readFile("src/test/resources/test28_expected.html");
        assertCleaned(initial, expected);
    }
}

// src/test/java/org/htmlcleaner/JDomSerializerTest.java
/*  Copyright (c) 2006-2013, the HtmlCleaner Project
    All rights reserved.

    Redistribution and use of this software in source and binary forms,
    with or without modification, are permitted provided that the following
    conditions are met:

    * Redistributions of source code must retain the above
      copyright notice, this list of conditions and the
      following disclaimer.

    * Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the
      following disclaimer in the documentation and/or other
      materials provided with the distribution.

    * The name of HtmlCleaner may not be used to endorse or promote
      products derived from this software without specific prior
      written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.
*/

package org.htmlcleaner;

import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.jdom2.Document;
import org.jdom2.Namespace;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.junit.Test;

public class JDomSerializerTest extends AbstractHtmlCleanerTest {

    /**
     * Tests that we comment CDATA in JDom
     * @throws IOException
     */
    @Test
    public void SafeCData1() throws IOException{
        String initial = "";
        String expected = "\n\n";
        CleanerProperties props = new CleanerProperties();
        props.setOmitCdataOutsideScriptAndStyle(true);
        props.setAddNewlineToHeadAndBody(false);
        TagNode tagNode = new HtmlCleaner(props).clean(initial);
        Document doc = new JDomSerializer(props, true).createJDom(tagNode);
        XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n"));
        String output = outputter.outputString(doc);
        assertEquals(expected, output);
    }

    /**
     * Tests that we comment CDATA in JDom; in this case preserving existing comments
     * @throws IOException
     */
    @Test
    public void SafeCData2() throws IOException{
        String initial = "";
        String expected = "\n\n";
        CleanerProperties props = new CleanerProperties();
        props.setOmitCdataOutsideScriptAndStyle(true);
        props.setAddNewlineToHeadAndBody(false);
        TagNode tagNode = new HtmlCleaner(props).clean(initial);
        Document doc = new JDomSerializer(props, true).createJDom(tagNode);
        XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n"));
        String output = outputter.outputString(doc);
        assertEquals(expected, output);
    }

    /**
     * Tests that we comment CDATA in JDom; in this case that we normalise comment style
     * @throws IOException
     */
    @Test
    public void SafeCData3() throws IOException{
        String initial = "";
        String expected = "\n\n";
        CleanerProperties props = new CleanerProperties();
        props.setOmitCdataOutsideScriptAndStyle(true);
        props.setAddNewlineToHeadAndBody(false);
        TagNode tagNode = new
            HtmlCleaner(props).clean(initial);
        Document doc = new JDomSerializer(props, true).createJDom(tagNode);
        XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n"));
        String output = outputter.outputString(doc);
        assertEquals(expected, output);
    }

    /**
     * Tests that we comment CDATA in JDom; in this case a more complex example
     * @throws IOException
     */
    @Test
    public void SafeCData4() throws IOException{
        String initial = readFile("src/test/resources/test33.html");
        String expected = readFile("src/test/resources/test33_expected.html");
        CleanerProperties props = new CleanerProperties();
        props.setOmitCdataOutsideScriptAndStyle(true);
        props.setAddNewlineToHeadAndBody(false);
        TagNode tagNode = new HtmlCleaner(props).clean(initial);
        Document doc = new JDomSerializer(props, true).createJDom(tagNode);
        XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n"));
        String output = outputter.outputString(doc);
        assertEquals(expected, output);
    }

    /**
     * Tests that we comment CDATA in JDom
     * @throws IOException
     */
    @Test
    public void SafeCData5() throws IOException{
        String initial = "";
        String expected = "\n\n";
        CleanerProperties props = new CleanerProperties();
        props.setOmitCdataOutsideScriptAndStyle(true);
        props.setUseCdataForScriptAndStyle(true);
        props.setDeserializeEntities(true);
        props.setAddNewlineToHeadAndBody(false);
        TagNode tagNode = new HtmlCleaner(props).clean(initial);
        Document doc = new JDomSerializer(props, true).createJDom(tagNode);
        XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n"));
        String output = outputter.outputString(doc);
        assertEquals(expected, output);
    }

    /**
     * Tests that we comment CDATA in JDom; this test uses CSS
     * @throws IOException
     */
    @Test
    public void SafeCData6() throws IOException{
        String initial = "";
        String expected = "\n\n";
        CleanerProperties props = new CleanerProperties();
        props.setOmitCdataOutsideScriptAndStyle(true);
        props.setUseCdataForScriptAndStyle(true);
        props.setAddNewlineToHeadAndBody(false);
        TagNode tagNode = new HtmlCleaner(props).clean(initial);
        Document doc = new JDomSerializer(props, true).createJDom(tagNode);
        XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n"));
        String output = outputter.outputString(doc);
        assertEquals(expected, output);
    }

    /**
     * See issue #95
     */
    @Test
    public void testNPE(){
        String validhtml5StringCode = "";
        CleanerProperties props = new CleanerProperties();
        props.setOmitHtmlEnvelope(true);
        TagNode tagNode = new HtmlCleaner(props).clean(validhtml5StringCode);
        new JDomSerializer(props, true).createJDom(tagNode);
    }

    /**
     * See issue 106
     * @throws IOException
     */
    @Test
    public void CDATA() throws Exception{
        cleaner.getProperties().setUseCdataForScriptAndStyle(true);
        cleaner.getProperties().setOmitCdataOutsideScriptAndStyle(true);
        String initial = readFile("src/test/resources/test22.html");
        TagNode tagNode = cleaner.clean(initial);
        JDomSerializer ser = new JDomSerializer(cleaner.getProperties());
        Document doc = ser.createJDom(tagNode);
        assertEquals("org.jdom2.CDATA", doc.getRootElement().getChild("head").getChild("script").getContent().get(1).getClass().getName());
    }

    /**
     * See issue 106
     * @throws IOException
     */
    @Test
    public void noCDATA() throws Exception{
        cleaner.getProperties().setUseCdataForScriptAndStyle(false);
        cleaner.getProperties().setOmitCdataOutsideScriptAndStyle(true);
        String initial = readFile("src/test/resources/test22.html");
        TagNode tagNode = cleaner.clean(initial);
        JDomSerializer ser = new JDomSerializer(cleaner.getProperties());
        Document doc = ser.createJDom(tagNode);
        assertEquals("org.jdom2.Text", doc.getRootElement().getChild("head").getChild("script").getContent().get(0).getClass().getName());
    }

    /**
     * Test we handle foreign markup OK
     * @throws Exception
     */
    @Test
    public void namespaces() throws Exception{
        cleaner.getProperties().setNamespacesAware(true);
        String initial = readFile("src/test/resources/test21.html");
        TagNode tagNode = cleaner.clean(initial);
        JDomSerializer ser = new JDomSerializer(cleaner.getProperties());
        Document doc = ser.createJDom(tagNode);

        //
        // These will fail with an NPE if the namespaces are not correct
        //
        doc.getRootElement().getChild("body", Namespace.getNamespace("http://www.w3.org/1999/xhtml")).getNamespaceURI();
        doc.getRootElement().getChild("body", Namespace.getNamespace("http://www.w3.org/1999/xhtml")).getChild("svg", Namespace.getNamespace("http://www.w3.org/2000/svg")).getNamespaceURI();
        doc.getRootElement().getChild("body", Namespace.getNamespace("http://www.w3.org/1999/xhtml")).getChild("svg", Namespace.getNamespace("http://www.w3.org/2000/svg")).getChild("title", Namespace.getNamespace("http://www.w3.org/2000/svg"));
    }
}

// src/test/java/org/htmlcleaner/HtmlCleanerTest.java
package org.htmlcleaner;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;

import javax.xml.parsers.ParserConfigurationException;

import org.junit.Ignore;
import org.junit.Test;

public class HtmlCleanerTest extends AbstractHtmlCleanerTest {

    /**
     * Prune tags test - see bug #188
     */
    @Test
    public void pruneTest() throws Exception {
        String initial = "

    alert using script:ipt>alert(\"Hello\");ipt>

    \n"; String expected = "

    alert using script:

    ";
        cleaner.getProperties().setPruneTags("script");
        cleaner.getProperties().setOmitHtmlEnvelope(true);
        assertCleanedHtml(initial, expected);
    }

    /**
     * first attribute of duplicates is selected - see bug #57
     */
    @Test
    public void duplicateAttributes() throws Exception {
        cleaner.getProperties().setOmitHtmlEnvelope(true);
        assertCleanedHtml("

    ", "

    "); assertCleanedHtml("

    ", "

    "); assertCleanedHtml("

    ", "

    "); assertCleanedHtml("

    ", "

    ");
    }

    /**
     * attribute names for HTML and XML - see bug #175
     */
    @Test
    public void attributeNames() throws Exception {
        cleaner.getProperties().setOmitHtmlEnvelope(true);
        cleaner.getProperties().setNamespacesAware(true);

        // Try to quietly fix bad names with no prefixes
        assertCleanedHtml("

    ", "

    ");

        // OK - characters
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // Numbers - OK in HTML, invalid in XML
        // First, let's clean them with a prefix.
        cleaner.getProperties().setInvalidXmlAttributeNamePrefix("hc-generated-");
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // Now, without a prefix - they have to be removed
        cleaner.getProperties().setInvalidXmlAttributeNamePrefix("");
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // Colons - OK but assumed to be NS prefixes
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // Dashes - OK in HTML and in XML
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // Semicolons - OK in HTML, invalid in XML
        cleaner.getProperties().setInvalidXmlAttributeNamePrefix("hc-generated-");
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    "); assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    "); cleaner.getProperties().setAllowInvalidAttributeNames(true); assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    "); cleaner.getProperties().setAllowInvalidAttributeNames(false); cleaner.getProperties().setInvalidXmlAttributeNamePrefix(""); assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // SOLIDUS - invalid in both
        assertCleanedHtml("

    ", "

    "); cleaner.getProperties().setAllowInvalidAttributeNames(false); cleaner.getProperties().setInvalidXmlAttributeNamePrefix("hc-generated-"); assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    "); cleaner.getProperties().setAllowInvalidAttributeNames(true); assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // SOLIDUS
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // APOS
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // EQUALS
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    ");

        // NULL
        assertCleanedHtml("

    ", "

    "); assertCleaned("

    ", "

    "); assertCleanedDom("

    ", "

    "); assertCleanedJDom("

    ", "

    "); } @Test public void attributesRealExample() throws IOException{ cleaner.getProperties().setOmitHtmlEnvelope(true); cleaner.getProperties().setAllowInvalidAttributeNames(true); String original = "

    "; String expected = "
    "; assertCleanedHtml(original, expected); } // // Test for bug #142 // @Test @Ignore // TODO Still need to fix this public void tokens() throws IOException{ String html = "TEST ONE
    --- TEST TWO (THREE) FOUR"; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setHtmlVersion(5); props.setOmitUnknownTags(true); props.setIgnoreQuestAndExclam(true); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertEquals("\nTEST ONE
    --- TEST TWO (THREE) FOUR", htmlcontent); } // // Tables with missing TDs // @Test public void tableFix() throws IOException{ String html = "

    Hello

    "; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setHtmlVersion(5); props.setAllowHtmlInsideAttributes(true); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertEquals("\n

    Hello

    ", htmlcontent); } // // Tables with missing TDs // @Test public void tableFix2() throws IOException{ String html = "
    Hello"; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setHtmlVersion(5); props.setAllowHtmlInsideAttributes(true); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertEquals("\n
    Hello
    ", htmlcontent); } // // Tables with missing TDs // @Test public void tableFix3() throws IOException{ String html = "

    Hello

    "; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setHtmlVersion(5); props.setAllowHtmlInsideAttributes(true); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertEquals("\n

    Hello

    ", htmlcontent); } // // Test for bug #166 - ensure we insert a LI rather than just shove the tag into the parent UL // @Test public void html5pos() throws IOException{ String html = "

      Hello

    "; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertEquals("\n
    • Hello

    ", htmlcontent); } // // Test for bug #170 // @Test public void zoom() throws IOException{ String html = "test"; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setAllowHtmlInsideAttributes(true); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertEquals("\ntest", htmlcontent); } // // Test for bug #168 - if ns-aware is false, we shouldn't have any xmlns attributes // @Test public void ignoreNStest() throws IOException{ String html = ""; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setNamespacesAware(false); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertEquals("\n", htmlcontent); } // // Test for bug #173 // @Test public void loopTest() throws IOException{ String html = "

    Some text

    Other text.

    "; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setNamespacesAware(true); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertTrue(htmlcontent.contains("

    Some text

    Other text.

    ")); } // // Test for bug #182 // @Test public void directivesIgnoreQuestandExclaim() throws IOException{ String html = "
    Hmailserver service shutdown:Ok
    "; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setIgnoreQuestAndExclam(false); props.setNamespacesAware(true); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertTrue(htmlcontent.contains("<!==><!==>Hmailserver service shutdown:<!==><!==>Ok")); } // // Test for bug #183 // @Test public void casing() throws IOException{ String html = "" + "" + "about" + "" + "About INMA"; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setNamespacesAware(true); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertTrue(htmlcontent.contains("aboutAbout INMA")); } // // Test for bug #178 // @Test public void arrayError(){ final String HTML = "" + "" + "" + "
      " + "

      d

      " + "
    " + "
    " + "
    " + "" + ""; final HtmlCleaner cleaner = new HtmlCleaner(); cleaner.clean(HTML); } // // See issue #118 // @Test public void nbsp() throws IOException{ String html = "One Two"; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setTranslateSpecialEntities(false); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertTrue(htmlcontent.contains("One Two")); } // // See issue #118 // @Test public void pound() throws IOException{ String html = "£160"; ByteArrayOutputStream htmlOutputStream = new ByteArrayOutputStream(); HtmlCleaner cleaner = new HtmlCleaner(); CleanerProperties props = cleaner.getProperties(); props.setTranslateSpecialEntities(false); TagNode node = cleaner.clean(html); new SimpleHtmlSerializer(props).writeToStream(node, htmlOutputStream); String htmlcontent = htmlOutputStream.toString(); assertTrue(htmlcontent.contains("£160")); } // // Test for issue #176 // @Test public void invalidIUnicodeCodePoint() { final String HTML = "" + "Brine�s." 
+ "" + ""; try { final TagNode tagNode = new HtmlCleaner().clean(HTML); final CleanerProperties cleanerProperties = new CleanerProperties(); new DomSerializer(cleanerProperties).createDOM(tagNode); } catch (IllegalArgumentException e) { fail(); } catch (ParserConfigurationException e) { fail(); } } // // Tests for \u0000 (UTF8 Null) - see issue #165 // @Test public void UTFnulls() throws IOException{ String input = "\u0000"; InputStream is = new ByteArrayInputStream(input.getBytes()); HtmlCleaner cleaner = new HtmlCleaner(); cleaner.getProperties().setTranslateSpecialEntities(true); TagNode html = cleaner.clean(is, "UTF-8"); String cleanHtml = new SimpleXmlSerializer(cleaner.getProperties()).getAsString(html); if(cleanHtml.contains("\u0000")) throw new AssertionError("U+0000 is an invalid XHTML char."); } @Test public void whiteSpace() throws IOException{ String html = "One Two"; TagNode node = cleaner.clean(html); StringWriter writer = new StringWriter(); new PrettyHtmlSerializer(cleaner.getProperties(), " ") .serialize(node, writer); } // // MathML-specific test - see bug #172 // @Test public void mtdMissingParentDefinition() throws IOException{ String initial = "S"; String expected = "S"; cleaner.getProperties().setAddNewlineToHeadAndBody(false); cleaner.getProperties().setNamespacesAware(true); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } @Test public void testScriptEscape() throws IOException { final String input = ""; HtmlCleaner cleaner = new HtmlCleaner(); cleaner.getProperties().setUseCdataForScriptAndStyle(true); cleaner.getProperties().setAdvancedXmlEscape(true); cleaner.getProperties().setDeserializeEntities(true); TagNode cleaned = cleaner.clean(input); StringWriter writer = new StringWriter(); serializer = new SimpleXmlSerializer(cleaner.getProperties()); serializer.write(cleaned, writer, "UTF-8"); } @Test public void testEscape() throws IOException { final String input = "
    <?xml version=\"1.0\"?>
    "; HtmlCleaner cleaner = new HtmlCleaner(); TagNode cleaned = cleaner.clean(input); StringWriter writer = new StringWriter(); serializer = new PrettyHtmlSerializer(cleaner.getProperties()); serializer.write(cleaned, writer, "UTF-8"); } /** * @throws IOException */ @Test @Ignore // Still an issue with this one - basically self-closing tags don't seem to close properly public void testSelfClosingTagNonHtml() throws IOException { final String input = "

    "; final String expected = "


    "; TagNode cleaned = new HtmlCleaner().clean(input); StringWriter writer = new StringWriter(); serializer = new SimpleHtmlSerializer(cleaner.getProperties()); serializer.write(cleaned, writer, "UTF-8"); assertEquals(expected, writer.toString()); } /** * @throws IOException */ @Test public void testSelfClosingTag() throws IOException { final String input = "

    "; final String expected = "

    "; TagNode cleaned = new HtmlCleaner().clean(input); StringWriter writer = new StringWriter(); serializer = new SimpleHtmlSerializer(cleaner.getProperties()); serializer.write(cleaned, writer, "UTF-8"); assertEquals(expected, writer.toString()); } /** * Test for bug #158 * @throws IOException */ @Test public void testNPE() throws IOException { final String HTML = "" + "" + "" + "" + "" + "" + "" + "" + "
    " + "

    " + "
    " + "
    " + "
    " + "" + ""; final String expected = "

    "; TagNode cleaned = new HtmlCleaner().clean(HTML); StringWriter writer = new StringWriter(); serializer = new SimpleHtmlSerializer(cleaner.getProperties()); serializer.write(cleaned, writer, "UTF-8"); assertEquals(expected, writer.toString()); } /** * Test for bug #156 * @throws IOException */ @Test public void testStyleIsNotRemoved() throws IOException{ final String original = "
    42
    "; final String expected = "
    42
    "; cleaner.getProperties().setOmitHtmlEnvelope(true); TagNode node = cleaner.clean(original); StringWriter writer = new StringWriter(); serializer = new SimpleHtmlSerializer(cleaner.getProperties()); serializer.write(node, writer, "UTF-8"); assertEquals(expected, writer.toString()); } /** * Test for bug #154 * @throws IOException */ @Test public void attributeSerialization() throws IOException{ final String original = "

    text

    "; final String expectedHtml = "

    text

    "; final String expectedXml = "

    text

    "; cleaner.getProperties().setOmitHtmlEnvelope(true); TagNode node = cleaner.clean(original); StringWriter writer = new StringWriter(); serializer = new SimpleHtmlSerializer(cleaner.getProperties()); serializer.write(node, writer, "UTF-8"); assertEquals(expectedHtml, writer.toString()); // // TODO this should also work for XML in some cases - I've commented this out for now but will return to it later. // //writer = new StringWriter(); //serializer = new SimpleXmlSerializer(cleaner.getProperties()); //assertEquals(expectedXml, writer.toString()); } /** * This is to test issue #157 * @throws IOException */ @Test public void math() throws IOException{ String initial = ""; String expected = ""; cleaner.getProperties().setAddNewlineToHeadAndBody(false); cleaner.getProperties().setNamespacesAware(true); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); initial = ""; expected = ""; cleaner.getProperties().setAddNewlineToHeadAndBody(false); cleaner.getProperties().setNamespacesAware(true); cleaned = cleaner.clean(initial); output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * This is to test issue #131 * @throws IOException */ @Ignore // We should fix this, but it isn't critical @Test public void moveTableContent() throws IOException{ String initial = "

    hi

    "; String expected = "

    hi

    "; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * This is to test issue #136 * @throws IOException */ @Test public void emptyXmlns() throws IOException{ String initial = ""; String expected = ""; cleaner.getProperties().setAddNewlineToHeadAndBody(false); cleaner.getProperties().setNamespacesAware(true); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * This is to test issue #139 * @throws IOException */ @Test public void optGroupTest() throws IOException{ String initial = ""; String expected = ""; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * This is to test issue #149 * @throws IOException */ @Test public void rbTest() throws IOException{ String initial = "




    "; String expected = "




    "; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } // See bug #147 @Test public void testCorrectUlStructure(){ String initial = "
    • 1
    • 2
      • 3
      "; String expected = "
      • 1
      • 2
        • 3
        "; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } // See bug #145 @Test @Ignore // We do want to fix this, but its not critical public void testCorrectTableStructure(){ String initial = "
        "; String expected = "
        "; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } // See bug #146 @Test public void testMissingTr(){ String initial = "
        banana
        "; String expected = "
        banana
        "; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } // See bug #129 @Test public void testLegend(){ String initial = "
        banana"; String expected = "
        banana
        "; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } // See bug #126 @Test public void testFragment(){ String initial = " * TODO: Passes but not with ideal result. */ @Test public void testUselessTr() throws IOException { cleaner.getProperties().setAddNewlineToHeadAndBody(false); String start = "
        "; String expected = "
        "; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } // See bug #140 @Test public void testSource(){ String initial = ""; String expected = ""; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } @Test public void testTwiddleTR(){ String initial = "
        test
        "; String expected = "
        test
        "; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } @Test public void testMissingRuby(){ String initial = "test"; String expected = ""+initial+""; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } @Test public void testMissingRt(){ String initial = "(ㄏㄢˋ)"; String expected = ""+initial+""; cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * Label tag - see Bug #138 */ @Test public void testLabel(){ String initial = "
        "; String expected = ""+initial+""; cleaner.getProperties().setNamespacesAware(true); cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * Option tags have two fatal tags - see Bug #137 */ @Test public void testSelect(){ String initial = ""; String expected = ""; cleaner.getProperties().setNamespacesAware(true); cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * This is to test that we don't get an NPE with a malformed HTTPS XHTML namespace. See issue #133 */ @Test public void testNPEWithHttpsNamespace(){ String initial="
        "; String expected="
        "; cleaner.getProperties().setNamespacesAware(true); cleaner.getProperties().setAddNewlineToHeadAndBody(false); TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * This is to test issue #132 * @throws IOException */ @Test public void classCastTest() throws IOException{ String initial = readFile("src/test/resources/test30.html"); TagNode node = cleaner.clean(initial); } /** * This is to test issue #93 */ @Test public void closingDiv(){ // // Check that when a tag is self-closing, we close it and start again rather than // let it remain open and enclose the following tags // String initial = "
        something
        "; String expected = "\n\n
        something
        "; TagNode cleaned = cleaner.clean(initial); String output = serializer.getAsString(cleaned); assertEquals(expected, output); // // This should also result in the same output // initial = "
        something
        "; cleaned = cleaner.clean(initial); output = serializer.getAsString(cleaned); assertEquals(expected, output); } /** * This is to test issue #67 */ @Test public void testXmlNoExtraWhitesapce(){ CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitXmlDeclaration(false); cleanerProperties.setOmitDoctypeDeclaration(false); cleanerProperties.setIgnoreQuestAndExclam(true); cleanerProperties.setAddNewlineToHeadAndBody(false); HtmlCleaner theCleaner = new HtmlCleaner(cleanerProperties); String initial = "\n\n

        test

        \n"; String expected = "\n\n

        test

        "; TagNode cleaned = theCleaner.clean(initial); Serializer theSerializer = new SimpleXmlSerializer(theCleaner.getProperties()); String output = theSerializer.getAsString(cleaned); assertEquals(expected, output); } /** * Test for #2901. */ @Test public void testWhitespaceInHead() throws IOException { String initial = readFile("src/test/resources/Real_1.html"); String expected = readFile("src/test/resources/Expected_1.html"); assertCleaned(initial, expected); } /** * Mentioned in #2901 - we should eliminate the first
        "; String end = ""; assertCleaned(start + "" + end, //start+"
        stuff
        stuff
        " + end // "ideal" output start + "stuff" + end // actual ); } /** * Collapsing empty tr to */ @Test public void testUselessTr2() throws IOException { cleaner.getProperties().setAddNewlineToHeadAndBody(false); String start = ""; String end = "
        "; assertCleaned(start + " stuff" + end, start + "stuff" + end); } /** * For #2940 */ @Test public void testCData() throws IOException { cleaner.getProperties().setAddNewlineToHeadAndBody(false); String start = ""; String end = "1"; assertCleaned(start + "" + end, start + "" + end); } /** * Report in issue #64 as causing issues. * @throws Exception */ @Test public void testChineseParsing() throws Exception { String initial = readFile("src/test/resources/test-chinese-issue-64.html"); TagNode node = cleaner.clean(initial); final TagNode[] imgNodes = node.getElementsByName("img", true); assertEquals(5, imgNodes.length); } /** * Report in issue #70 as causing issues. * @throws Exception */ @Test public void testOOME_70() throws Exception { String initial = readFile("src/test/resources/oome_70.html"); TagNode node = cleaner.clean(initial); final TagNode[] imgNodes = node.getElementsByName("img", true); assertEquals(17, imgNodes.length); } @Test public void testOOME_59() throws Exception { String in = "
        "; CleanerProperties cp = new CleanerProperties(); cp.setOmitUnknownTags(true); HtmlCleaner c = new HtmlCleaner(cp); TagNode root = c.clean(in); assertEquals(1, root.getElementsByName("legend", true).length); } /** * Check that we no longer require block-level restrictions for anchors, as per HTML5. See issue #82 * @throws IOException */ @Test public void noAnchorBlockLevelRestriction() throws IOException{ String initial = readFile("src/test/resources/test24.html"); String expected = readFile("src/test/resources/test24_expected.html"); assertCleaned(initial,expected); } } src/test/java/org/htmlcleaner/EntityDeserializationTest.java0000644000000000000000000000533012336370166023366 0ustar rootroot/* Copyright (c) 2006-2014, the HtmlCleaner project All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
    IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY
    DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
    (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
    LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
    ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
    SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
package org.htmlcleaner;

import junit.framework.TestCase;

public class EntityDeserializationTest extends TestCase {

    private HtmlCleaner cleaner;

    @Override
    public void setUp() {
        CleanerProperties cp = new CleanerProperties();
        cp.setDeserializeEntities(true);
        cleaner = new HtmlCleaner(cp);
    }

    @Override
    public void tearDown() {
        cleaner = null;
    }

    private void doTest(String input, String output) {
        assertEquals(
            output,
            cleaner.clean("" + input + "")
                .findElementByName("body", true)
                .getText()
                .toString()
        );
    }

    public void testNamedEntity() {
        doTest(""", "\"");
    }

    public void testDecimalEntity() {
        doTest(" ", "\u00a0");
    }

    public void testHexadecimalEntity() {
        doTest(" ", "\u00a0");
    }

    public void testAbortedEntity() {
        doTest("&"", "&\"");
    }

    public void testCData() {
        doTest("", "&");
    }
}

src/test/java/org/htmlcleaner/DomSerializerTest.java

/*  Copyright (c) 2006-2013, the HtmlCleaner Project
    All rights reserved.

    Redistribution and use of this software in source and binary forms,
    with or without modification, are permitted provided that the following
    conditions are met:

    * Redistributions of source code must retain the above copyright notice,
      this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice,
      this list of conditions and the following disclaimer in the documentation
      and/or other materials provided with the distribution.
    * The name of HtmlCleaner may not be used to endorse or promote products
      derived from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.
*/
package org.htmlcleaner;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertNotNull;

import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;

import org.jdom2.input.DOMBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.junit.Assert;
import org.junit.Ignore;
import org.junit.Test;
import org.w3c.dom.Document;

public class DomSerializerTest extends AbstractHtmlCleanerTest {

    @Test
    public void errorChecking() throws ParserConfigurationException{
        TagNode node = cleaner.clean("

        "); DomSerializer ser = new DomSerializer(cleaner.getProperties(), true, true, false); Document document = ser.createDocument(node); assertFalse(document.getStrictErrorChecking()); } /** * See issue 108 * @throws IOException */ @Test @Ignore public void html5doctype() throws Exception{ cleaner.getProperties().setUseCdataForScriptAndStyle(true); cleaner.getProperties().setOmitCdataOutsideScriptAndStyle(true); String initial = readFile("src/test/resources/test23.html"); TagNode tagNode = cleaner.clean(initial); DomSerializer ser = new DomSerializer(cleaner.getProperties()); Document dom = ser.createDOM(tagNode); assertNotNull(dom.getChildNodes().item(0).getChildNodes().item(0)); assertEquals("head", dom.getChildNodes().item(0).getChildNodes().item(0).getNodeName()); } /** * See issue 127 * @throws IOException */ @Test public void rootNodeAttributes() throws Exception{ cleaner.getProperties().setUseCdataForScriptAndStyle(true); cleaner.getProperties().setOmitCdataOutsideScriptAndStyle(true); String initial = readFile("src/test/resources/test29.html"); TagNode tagNode = cleaner.clean(initial); DomSerializer ser = new DomSerializer(cleaner.getProperties()); Document dom = ser.createDOM(tagNode); assertNotNull(dom.getChildNodes().item(0).getChildNodes().item(0)); assertEquals("http://unknown.namespace.com", dom.getChildNodes().item(0).getAttributes().getNamedItem("xmlns").getNodeValue()); assertEquals("27", dom.getChildNodes().item(0).getAttributes().getNamedItem("id").getNodeValue()); // // Check we have a real ID attribute in the DOM and not just a regular attribute // assertEquals("http://unknown.namespace.com", dom.getElementById("27").getAttribute("xmlns")); } @Test public void cdata() throws Exception{ cleaner.getProperties().setUseCdataForScriptAndStyle(true); cleaner.getProperties().setOmitCdataOutsideScriptAndStyle(true); String initial = ""; TagNode tagNode = cleaner.clean(initial); DomSerializer ser = new DomSerializer(cleaner.getProperties(), 
                cleaner.getProperties().isAdvancedXmlEscape(), true);
        Document dom = ser.createDOM(tagNode);
        DOMBuilder in = new DOMBuilder();
        org.jdom2.Document jdomDoc = in.build(dom);
        XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n"));
        String actual = outputter.outputString(jdomDoc);
        Assert.assertTrue(actual.contains("this > that"));
    }

    @Test
    public void cdata2() throws Exception{
        cleaner.getProperties().setUseCdataForScriptAndStyle(true);
        cleaner.getProperties().setOmitCdataOutsideScriptAndStyle(true);
        String initial = "";
        TagNode tagNode = cleaner.clean(initial);
        DomSerializer ser = new DomSerializer(cleaner.getProperties(), cleaner.getProperties().isAdvancedXmlEscape(), false);
        Document dom = ser.createDOM(tagNode);
        DOMBuilder in = new DOMBuilder();
        org.jdom2.Document jdomDoc = in.build(dom);
        XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n"));
        String actual = outputter.outputString(jdomDoc);
        Assert.assertTrue(actual.contains("this > that"));
    }
}

src/test/java/org/htmlcleaner/DocTypesTest.java

/*  Copyright (c) 2006-2013, HtmlCleaner project team (Vladimir Nikic, Scott Wilson, Pat Moore)
    All rights reserved.

    Redistribution and use of this software in source and binary forms,
    with or without modification, are permitted provided that the following
    conditions are met:

    * Redistributions of source code must retain the above copyright notice,
      this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice,
      this list of conditions and the following disclaimer in the documentation
      and/or other materials provided with the distribution.

    * The name of HtmlCleaner may not be used to endorse or promote products
      derived from this software without specific prior written permission.
    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.

    You can contact Vladimir Nikic by sending e-mail to
    nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the
    subject line.
*/
package org.htmlcleaner;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.io.File;
import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;

import org.junit.BeforeClass;
import org.junit.Test;
import org.w3c.dom.Document;

public class DocTypesTest {

    static HtmlCleaner cleaner;
    static SimpleHtmlSerializer serializer;

    @BeforeClass
    public static void setup(){
        cleaner = new HtmlCleaner();
        CleanerProperties properties = cleaner.getProperties();
        properties.setOmitXmlDeclaration(true);
        properties.setOmitDoctypeDeclaration(false);
        serializer = new SimpleHtmlSerializer(properties);
    }

    @Test
    public void DocTypeUsingDom() throws IOException, ParserConfigurationException{
        CleanerProperties cleanerProperties = new CleanerProperties();
        cleanerProperties.setOmitXmlDeclaration(false);
        cleanerProperties.setOmitDoctypeDeclaration(false);
        cleanerProperties.setIgnoreQuestAndExclam(false);
        cleaner = new HtmlCleaner(cleanerProperties);
        DomSerializer domSerializer = new
                DomSerializer(cleaner.getProperties());
        String initial = readFile("src/test/resources/test12.html");
        TagNode cleaned = cleaner.clean(initial);
        Document doc = domSerializer.createDOM(cleaned);
        assertEquals("html", doc.getDoctype().getName());
        assertEquals("-//W3C//DTD XHTML 1.0 Strict//EN", doc.getDoctype().getPublicId());
        assertEquals("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd", doc.getDoctype().getSystemId());
    }

    // TODO remove and make this class a subclass of AbstractHtmlCleanerTest
    protected String readFile(String filename) throws IOException {
        File file = new File(filename);
        CharSequence content = Utils.readUrl(file.toURI().toURL(), "UTF-8");
        return content.toString();
    }

    //
    // Check all the valid doctypes
    //

    @Test
    public void html_5() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("html", cleaned.getDocType().getPart1());
        assertEquals(null, cleaned.getDocType().getPart2());
        assertEquals("", cleaned.getDocType().getPublicId());
        assertEquals("", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML5, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_5_upper() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals(null, cleaned.getDocType().getPart2());
        assertEquals("", cleaned.getDocType().getPublicId());
        assertEquals("", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML5, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_5_legacy() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals("SYSTEM", cleaned.getDocType().getPart2());
        assertEquals("about:legacy-compat", cleaned.getDocType().getPublicId());
        assertEquals("", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML5_LEGACY_TOOL_COMPATIBLE, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_5_legacy_alternate() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals("SYSTEM", cleaned.getDocType().getPart2());
        assertEquals("about:legacy-compat", cleaned.getDocType().getPublicId());
        assertEquals("", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML5_LEGACY_TOOL_COMPATIBLE, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_0() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD HTML 4.0//EN", cleaned.getDocType().getPublicId());
        assertEquals("", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML4_0, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_0_strict() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD HTML 4.0//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/REC-html40/strict.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML4_0, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_01_strict_identifierOnly() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD HTML 4.01//EN", cleaned.getDocType().getPublicId());
        assertEquals("", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML4_01_STRICT, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_01_strict_mixed() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("html", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD HTML 4.01//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/html4/strict.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML4_01_STRICT, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_01_strict() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD HTML 4.01//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/html4/strict.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML4_01_STRICT, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_01_transitional() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD HTML 4.01 Transitional//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/html4/loose.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML4_01_TRANSITIONAL, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_01_frameset() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("HTML", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD HTML 4.01 Frameset//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/html4/frameset.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.HTML4_01_FRAMESET, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void
            xhtml_1_strict() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("html", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD XHTML 1.0 Strict//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.XHTML1_0_STRICT, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_transitional() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("html", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD XHTML 1.0 Transitional//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.XHTML1_0_TRANSITIONAL, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_frameset() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("html", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD XHTML 1.0 Frameset//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.XHTML1_0_FRAMESET, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_1() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("html", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD XHTML 1.1//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.XHTML1_1,
                cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_1_basic() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals("html", cleaned.getDocType().getPart1());
        assertEquals("PUBLIC", cleaned.getDocType().getPart2());
        assertEquals("-//W3C//DTD XHTML Basic 1.1//EN", cleaned.getDocType().getPublicId());
        assertEquals("http://www.w3.org/TR/xhtml11/DTD/xhtml-basic11.dtd", cleaned.getDocType().getSystemId());
        assertEquals(DoctypeToken.XHTML1_1_BASIC, cleaned.getDocType().getType());
        assertTrue(cleaned.getDocType().isValid());
    }

    //
    // Now some invalid ones
    //

    @Test
    public void empty() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.UNKNOWN, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void not_html() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.UNKNOWN, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_0_wrong_id_type() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.UNKNOWN, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_0_wrong_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.HTML4_0, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_01_wrong_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.HTML4_01_STRICT, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_01_transitional_bad_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.HTML4_01_TRANSITIONAL, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void html_4_01_frameset_bad_id() throws IOException{
        TagNode cleaned =
                cleaner.clean("");
        assertEquals(DoctypeToken.HTML4_01_FRAMESET, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_0_with_wrong_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.XHTML1_0_STRICT, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_0_transitional_with_wrong_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.XHTML1_0_TRANSITIONAL, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_0_frameset_with_wrong_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.XHTML1_0_FRAMESET, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_1_with_wrong_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.XHTML1_1, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void xhtml_1_1_with_no_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertFalse(cleaned.getDocType().isValid());
        assertEquals(DoctypeToken.XHTML1_1, cleaned.getDocType().getType());
    }

    @Test
    public void xhtml_1_1_basic_with_no_id() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.XHTML1_1_BASIC, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    @Test
    public void weird_token() throws IOException{
        TagNode cleaned = cleaner.clean("");
        assertEquals(DoctypeToken.UNKNOWN, cleaned.getDocType().getType());
        assertFalse(cleaned.getDocType().isValid());
    }

    //
    // Serializer
    //

    @Test
    public void html_4_01_serialize() throws IOException{
        TagNode cleaned = cleaner.clean("");
        String output = serializer.getAsString(cleaned);
        assertTrue(output.startsWith(""));
    }

    @Test
    public void html_4_01_domserialize() throws IOException,
ParserConfigurationException{ TagNode cleaned = cleaner.clean(""); DomSerializer domSerializer = new DomSerializer(cleaner.getProperties()); Document doc = domSerializer.createDOM(cleaned); assertEquals("html", doc.getDocumentElement().getNodeName()); assertEquals("HTML", doc.getDoctype().getName()); assertEquals("-//W3C//DTD HTML 4.01//EN", doc.getDoctype().getPublicId()); assertEquals("http://www.w3.org/TR/html4/strict.dtd", doc.getDoctype().getSystemId()); } @Test public void html_4_01_case_correct() throws IOException{ TagNode cleaned = cleaner.clean(""); String output = serializer.getAsString(cleaned); assertTrue(output.startsWith("")); } @Test public void xhtml_1_1_serialize() throws IOException{ TagNode cleaned = cleaner.clean(""); String output = serializer.getAsString(cleaned); assertTrue(output.startsWith("")); } @Test public void xhtml_1_0_strict_serialize() throws IOException{ TagNode cleaned = cleaner.clean(""); String output = serializer.getAsString(cleaned); assertTrue(output.startsWith("")); } @Test public void xhtml_1_0_strict_serialize_case_correct() throws IOException{ TagNode cleaned = cleaner.clean(""); String output = serializer.getAsString(cleaned); assertTrue(output.startsWith("")); } @Test public void html5_serialize() throws IOException{ TagNode cleaned = cleaner.clean(""); String output = serializer.getAsString(cleaned); assertTrue(output.startsWith("")); } @Test public void html5_serialize_case_correct() throws IOException{ TagNode cleaned = cleaner.clean(""); String output = serializer.getAsString(cleaned); assertTrue(output.startsWith("")); } // // Misc // @Test public void checkToString(){ TagNode cleaned = cleaner.clean(""); assertEquals(cleaned.getDocType().getContent(), cleaned.getDocType().toString()); } } src/test/java/org/htmlcleaner/ConstructorTest.java0000644000000000000000000000266412200235145021362 0ustar rootrootpackage org.htmlcleaner; import junit.framework.TestCase; import java.io.ByteArrayInputStream; /** * Testing 
HtmlCleaner constructors. */ public class ConstructorTest extends TestCase { public void testPropertiesConstructor() throws Exception { CleanerProperties props = new CleanerProperties(); props.setOmitComments(true); HtmlCleaner cleaner1 = new HtmlCleaner(props); TagNode node1 = cleaner1.clean("text text"); assertTrue( new SimpleXmlSerializer(props).getAsString(node1).indexOf("") < 0 ); HtmlCleaner cleaner2 = new HtmlCleaner(props); TagNode node2 = cleaner2.clean("DDDD text"); assertTrue( new SimpleXmlSerializer(props).getAsString(node2).indexOf("") < 0 ); HtmlCleaner cleaner3 = new HtmlCleaner(props); props.setOmitComments(false); TagNode node3 = cleaner3.clean("EEEEEEE text"); assertTrue( new SimpleXmlSerializer(props).getAsString(node3).indexOf("") > 0 ); TagNode node4 = cleaner3.clean( new ByteArrayInputStream( ("FIRST" + (char)0x2 + (char)0x3 + "SECOND").getBytes() ), "ASCII" ); assertTrue( new CompactXmlSerializer(props).getAsString(node4).indexOf("FIRST SECOND") >= 0 ); } }src/test/java/org/htmlcleaner/CollapseHtmlTest.java0000644000000000000000000002366112523434626021441 0ustar rootrootpackage org.htmlcleaner; import java.io.IOException; import org.htmlcleaner.conditional.TagNodeEmptyContentCondition; import org.htmlcleaner.conditional.TagNodeInsignificantBrCondition; import junit.framework.TestCase; /** * Various tests for collapseNullHtml mode. */ public class CollapseHtmlTest extends TestCase { /** * */ private static final String CANNOT_ELIMINATE_ANYTHING_IN_THIS_TR = "

        "; /** * */ private static final String IMAGE = ""; /** * */ private static final String DONT_COLLAPSE = "" + IMAGE + "" + "

        " + IMAGE + "

        " + "

        bar

        Cannot eliminate anything in this row
        " + IMAGE + "
        foo

        "; private static final String DONT_COLLAPSE_OUTPUT = "" + IMAGE + "" + "

        " + IMAGE + "

        " + "

        bar

        " + IMAGE + "

        foo

        "; private HtmlCleaner cleaner; private CleanerProperties properties; private SimpleXmlSerializer serializer; @Override protected void setUp() throws Exception { cleaner = new HtmlCleaner(); properties = cleaner.getProperties(); properties.setOmitHtmlEnvelope(true); properties.setOmitXmlDeclaration(true); serializer = new SimpleXmlSerializer(properties); properties.addPruneTagNodeCondition(new TagNodeEmptyContentCondition(properties.getTagInfoProvider())); properties.addPruneTagNodeCondition(new TagNodeInsignificantBrCondition()); } /** * Make sure that single empty tag is dropped out. * * @throws IOException */ public void testCollapseSingleEmptyTag() throws IOException { TagNode collapsed = cleaner.clean(""); assertEquals("", serializer.getAsString(collapsed)); } /** * Make sure that tags with internal blanks are collapsed. */ public void testCollapseSingleTagWithBlanks() throws IOException { TagNode collapsed = cleaner.clean(" "); assertEquals("", serializer.getAsString(collapsed)); collapsed = cleaner.clean(" "); assertEquals("", serializer.getAsString(collapsed)); // Strange msword insert // collapsed = // cleaner.clean("  "); // assertEquals("", serializer.getAsString(collapsed)); } /** * make sure that non-breaking spaces are also collapsed away. */ public void testCollapseSingleTagWithNbsp() throws IOException { TagNode collapsed = cleaner.clean("   "); assertEquals("", serializer.getAsString(collapsed)); collapsed = cleaner.clean("   "); assertEquals("", serializer.getAsString(collapsed)); collapsed = cleaner.clean("   "); assertEquals("", serializer.getAsString(collapsed)); collapsed = cleaner.clean(" " + SpecialEntities.NON_BREAKABLE_SPACE + " "); assertEquals("", serializer.getAsString(collapsed)); } /** * make sure that multiple null tags are collapsed. */ public void testCollapseMultipleEmptyTags() throws IOException { TagNode collapsed = cleaner.clean(""); assertEquals("", serializer.getAsString(collapsed)); // test with slightly bad html. 
collapsed = cleaner.clean(""); assertEquals("", serializer.getAsString(collapsed)); // test with slightly bad html. collapsed = cleaner.clean("notme"); assertEquals("notme", serializer.getAsString(collapsed)); } /** * make sure that insignificant br tags are collapsed */ public void testCollapseInsignificantBr() throws IOException { TagNode collapsed = cleaner.clean("


        Some text

        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); collapsed = cleaner.clean("

        Some text

        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); collapsed = cleaner.clean("


        Some
        text

        "); assertEquals("

        Some
        text

        ", serializer.getAsString(collapsed)); collapsed = cleaner.clean("



        Some text look here

        "); assertEquals("

        Some text look here

        ", serializer.getAsString(collapsed)); collapsed = cleaner.clean("Some text
        "); assertEquals("Some text", serializer.getAsString(collapsed)); } /** * make sure TagTransformations do not interfere with collapse */ public void testCollapseEmptyWithTagTransformations() throws IOException { CleanerTransformations transformations = properties.getCleanerTransformations(); TagTransformation t = new TagTransformation("font", "span", true); t.addAttributeTransformation("style", "${style};font-family:${face};font-size:${size};color:${color};"); t.addAttributeTransformation("face"); t.addAttributeTransformation("size"); t.addAttributeTransformation("color"); t.addAttributeTransformation("name", "${face}_1"); transformations.addTransformation(t); TagNode collapsed = cleaner.clean(""); assertEquals("", serializer.getAsString(collapsed)); } /** * test to make sure that multiple
        * 'br' elements are eliminated */ public void testChainCollapseInsignificantBrs() throws IOException { TagNode collapsed = cleaner.clean("



        Some
        text


        "); assertEquals("

        Some
        text

        ", serializer.getAsString(collapsed)); } /** * make sure that intervening empty elements still cause unneeded
        * 'br' elements to be eliminated. */ public void testCollapseInsignificantBrWithEmptyElementsHTML4() throws IOException { properties.setHtmlVersion(HtmlCleaner.HTML_4); properties.addPruneTagNodeCondition(new TagNodeEmptyContentCondition(properties.getTagInfoProvider())); TagNode collapsed = cleaner.clean("

         
        Some text

        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); collapsed = cleaner.clean("

        Some text


        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); collapsed = cleaner.clean("

        Some text


        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); } public void testCollapseInsignificantBrWithEmptyElementsHTML5() throws IOException { properties.setHtmlVersion(HtmlCleaner.HTML_5); properties.addPruneTagNodeCondition(new TagNodeEmptyContentCondition(properties.getTagInfoProvider())); TagNode collapsed = cleaner.clean("

         
        Some text

        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); collapsed = cleaner.clean("

        Some text


        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); collapsed = cleaner.clean("

        Some text


        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); } /** * Br nested in formatting elements should be eliminated. */ public void testInsureMeaninglessBrsStillCollapseEmptyElementsHTML4() throws IOException { properties.setHtmlVersion(HtmlCleaner.HTML_4); properties.addPruneTagNodeCondition(new TagNodeEmptyContentCondition(properties.getTagInfoProvider())); TagNode collapsed; collapsed = cleaner.clean("


        Some text


        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); } public void testInsureMeaninglessBrsStillCollapseEmptyElementsHTML5() throws IOException { properties.setHtmlVersion(HtmlCleaner.HTML_5); properties.addPruneTagNodeCondition(new TagNodeEmptyContentCondition(properties.getTagInfoProvider())); TagNode collapsed; collapsed = cleaner.clean("


        Some text


        "); assertEquals("

        Some text

        ", serializer.getAsString(collapsed)); } /** * because elements with ids can be referred to by javascript, don't assume * that such elements can be eliminated. */ public void testCollapseOnlyFormattingElementsWithNoIds() throws IOException { TagNode collapsed = cleaner.clean(""); assertEquals("", serializer.getAsString(collapsed)); collapsed = cleaner.clean(""); assertEquals("", serializer.getAsString(collapsed)); } public void testCollapseAggressively() throws IOException { properties.addPruneTagNodeCondition(new TagNodeEmptyContentCondition(properties.getTagInfoProvider())); TagNode collapsed; collapsed = cleaner.clean("

        "); assertEquals("", serializer.getAsString(collapsed)); collapsed = cleaner.clean(DONT_COLLAPSE); assertEquals(DONT_COLLAPSE_OUTPUT, serializer.getAsString(collapsed)); collapsed = cleaner .clean("

        " + " \n" + CANNOT_ELIMINATE_ANYTHING_IN_THIS_TR + "
        Nor me
        "); assertEquals("

        " + CANNOT_ELIMINATE_ANYTHING_IN_THIS_TR + "
        Nor me
        ", serializer.getAsString(collapsed)); } } src/test/java/org/htmlcleaner/ClosedTagReopenTest.java package org.htmlcleaner; import java.io.IOException; import org.junit.Test; import junit.framework.TestCase; /** * Tests that a tag closed because of one of its children (when the child tag is not allowed inside its parent) is then * reopened. * Examples: *
         * 

        text1
        text2
        text3

        *
        * A table is not allowed inside a 'p' element; most browsers handle this by placing the table close to the line before and the line after, and in general allowing it. * * Cleaning here normally would result in: *

         * 

        text1
        text2
        text3

        *
        * 'text3' is no longer inside the original element type ( 'p' ). Instead 'text3' is now within a 'div'. * text3 would no longer be styled correctly. * * A more correct result is: *
         * 

        text1
        text2

        text3

        *
        */ public class ClosedTagReopenTest extends TestCase { public void testSimpleHTML4() throws IOException { CleanerProperties properties = new CleanerProperties(); properties.setHtmlVersion(HtmlCleaner.HTML_4); properties.setOmitXmlDeclaration(true); properties.setOmitHtmlEnvelope(true); SimpleXmlSerializer serializer = new SimpleXmlSerializer(properties); String[][] tests= { new String[] { "

        text1
        text2
        text3

        ", "

        text1

        text2

        text3

        " }, new String[] {"

        text1","text1"}, new String[] {"

        text1

        text2
        text3

        ", "

        text1

        text2

        text3

        "}, new String[] { "
        text1

        text2

        text3
        ", "
        text1

        text2

        text3
        "}, new String[] {"text1

        text2

        text3
        ", "text1

        text2

        text3"}, new String[] {"

        text1

        text2
        text3
        text4

        ", "

        text1

        text2

        text3

        text4
        "}, new String[] {"

        text1

        text2

        ", "

        text1

        text2
        "}, new String[] {"

        text1

        text2

        ", "

        text1

        text2

        "}, //test multiple internal breaks new String[] {"

        text1

        text2

        text3

        text4

        text5

        ","

        text1

        text2

        text3

        text4

        text5
        "}, // test attribute preservation new String[] { "

        text1
        text2
        text3

        ", "

        text1

        text2

        text3

        " }, // but not all attributes ( id attribute must be unique ) // TODO: maybe a generated id so that correlation can be found? new String[] { "

        text1
        text2
        text3

        ", "

        text1

        text2

        text3

        " }, // test multiple replacements // test to see if nested good

        can be handled. new String[] { "

        text1
        text2

        text2a

        text3

        • text4
        text5
        • text6

        ", "

        text1

        text2

        text2a

        " + "

        text3

        " + "
        • text4
        " + "

        text5

        " + "
        • text6
        " }, new String[] { "

        text1
        text2

        text2a

        test2b
        test2c

        text3

        • text4
        text5
        • text6

        ", "

        text1

        text2

        text2a

        test2b

        test2c

        " + "

        text3

        " + "
        • text4
        " + "

        text5

        " + "
        • text6
        " }, new String[]{"

        text1
        text2
        text3
        text4

        ","

        text1

        text2
        text3

        text4

        "} }; for(String[] test: tests) { String cleaned = serializer.getAsString(test[0]); assertEquals("started with="+test[0], test[1], cleaned); } } @Test public void testSimpleHTML5() throws IOException { CleanerProperties properties = new CleanerProperties(); properties.setHtmlVersion(HtmlCleaner.HTML_5); properties.setOmitXmlDeclaration(true); properties.setOmitHtmlEnvelope(true); SimpleXmlSerializer serializer = new SimpleXmlSerializer(properties); String[][] tests= { new String[] { "

        text1
        text2
        text3

        ", "

        text1

        text2

        text3

        " }, new String[] {"

        text1","text1"}, new String[] {"

        text1

        text2
        text3

        ", "

        text1

        text2

        text3

        "}, new String[] { "
        text1

        text2

        text3
        ", "
        text1

        text2

        text3
        "}, new String[] {"text1

        text2

        text3", "text1

        text2

        text3"}, new String[] {"

        text1

        text2
        text3
        text4

        ", "

        text1

        text2

        text3

        text4
        "}, new String[] {"

        text1

        text2

        ", "

        text1

        text2
        "}, new String[] {"

        text1

        text2

        ", "

        text1

        text2

        "}, //test multiple internal breaks new String[] {"

        text1

        text2

        text3

        text4

        text5

        ","

        text1

        text2

        text3

        text4

        text5
        "}, // test attribute preservation new String[] { "

        text1
        text2
        text3

        ", "

        text1

        text2

        text3

        " }, // but not all attributes ( id attribute must be unique ) // TODO: maybe a generated id so that correlation can be found? new String[] { "

        text1
        text2
        text3

        ", "

        text1

        text2

        text3

        " }, // test multiple replacements // test to see if nested good

        can be handled. new String[] { "

        text1
        text2

        text2a

        text3

        • text4
        text5
        • text6

        ", "

        text1

        text2

        text2a

        " + "

        text3

        " + "
        • text4
        " + "

        text5

        " + "
        • text6
        " }, new String[] { "

        text1
        text2

        text2a

        test2b
        test2c

        text3

        • text4
        text5
        • text6

        ", "

        text1

        text2

        text2a

        test2b

        test2c

        " + "

        text3

        " + "
        • text4
        " + "

        text5

        " + "
        • text6
        " }, new String[]{"

        text1
        text2
        text3
        text4

        ","

        text1

        text2
        text3

        text4

        "} }; for(String[] test: tests) { String cleaned = serializer.getAsString(test[0]); assertEquals("started with="+test[0], test[1], cleaned); } } } src/test/java/org/htmlcleaner/CDATATest.java0000644000000000000000000005722113074160135017656 0ustar rootroot/* Copyright (c) 2006-2013, the HtmlCleaner Project All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ package org.htmlcleaner; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; import java.io.IOException; import javax.xml.parsers.ParserConfigurationException; import org.junit.Test; public class CDATATest extends AbstractHtmlCleanerTest { // // Test for bug #185 // @Test public void noEndTokenLong() throws IOException{ String html = ""; String expected = initial; assertCleaned(initial, expected); } /** * In this test the script has no CDATA, an unescaped CDATAsection in a * script tag, and there is also an incorrect CDATA declaration in a * paragraph tag. * * @throws IOException */ @Test public void CDATAmixed() throws IOException{ String initial = readFile("src/test/resources/test11.html"); String expected = readFile("src/test/resources/test11_expected.html"); assertCleaned(initial, expected); } @Test public void CDATAandDocType() throws IOException{ CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitXmlDeclaration(false); cleanerProperties.setOmitDoctypeDeclaration(false); cleanerProperties.setIgnoreQuestAndExclam(false); this.cleaner = new HtmlCleaner(cleanerProperties); this.serializer = new SimpleXmlSerializer(cleaner.getProperties()); String initial = readFile("src/test/resources/test12.html"); String expected = readFile("src/test/resources/test12_expected.html"); assertCleaned(initial, expected); } @Test public void scriptAndCData() throws IOException { CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitXmlDeclaration(false); cleanerProperties.setOmitDoctypeDeclaration(false); cleanerProperties.setIgnoreQuestAndExclam(false); cleanerProperties.setAddNewlineToHeadAndBody(false); cleanerProperties.setUseCdataFor("script,style,altscript"); this.cleaner = new HtmlCleaner(cleanerProperties); this.serializer = new SimpleXmlSerializer(cleaner.getProperties()); assertHTML("", ""); assertHTML("", ""); assertHTML("", ""); assertHTML("", ""); assertHTML("", 
""); assertHTML("", ""); assertHTML("", ""); assertHTML("", ""); assertHTML("", ""); assertHTML("", ""); assertHTML("/*\n/*]]>*/", "<>"); assertHTML( "", "" ); } @Test public void scriptAndCDataDom() throws IOException, ParserConfigurationException { CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitXmlDeclaration(false); cleanerProperties.setOmitDoctypeDeclaration(false); cleanerProperties.setIgnoreQuestAndExclam(false); cleanerProperties.setAddNewlineToHeadAndBody(false); cleanerProperties.setUseCdataFor("script,style,altscript"); this.cleaner = new HtmlCleaner(cleanerProperties); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("", ""); assertHTMLUsingDomSerializer("/*\n/*]]>*/", "<>"); assertHTMLUsingDomSerializer( "", "" ); } @Test public void scriptAndCDataJDom() throws IOException, ParserConfigurationException { CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitXmlDeclaration(false); cleanerProperties.setOmitDoctypeDeclaration(false); cleanerProperties.setIgnoreQuestAndExclam(false); cleanerProperties.setAddNewlineToHeadAndBody(false); cleanerProperties.setUseCdataFor("script,style,altscript"); this.cleaner = new HtmlCleaner(cleanerProperties); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); assertHTMLUsingJDomSerializer("", ""); 
assertHTMLUsingJDomSerializer("/*\n/*]]>*/", "<>"); assertHTMLUsingJDomSerializer( "", "" ); } @Test public void escapingCDATA() throws IOException{ CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitXmlDeclaration(false); cleanerProperties.setOmitDoctypeDeclaration(false); cleanerProperties.setIgnoreQuestAndExclam(false); cleanerProperties.setAdvancedXmlEscape(true); cleanerProperties.setAddNewlineToHeadAndBody(false); cleanerProperties.setDeserializeEntities(true); cleanerProperties.setUseCdataFor("script,style,altscript"); this.cleaner = new HtmlCleaner(cleanerProperties); this.serializer = new SimpleXmlSerializer(cleaner.getProperties()); assertHTML("", ""); assertHTML("/*\n/*]]>*/", "<>"); } @Test public void removeCDATA() throws IOException{ CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitCdataOutsideScriptAndStyle(true); cleanerProperties.setAddNewlineToHeadAndBody(false); cleanerProperties.setUseCdataFor("script,style,altscript"); cleaner = new HtmlCleaner(cleanerProperties); serializer = new SimpleXmlSerializer(cleaner.getProperties()); // Verify that CDATA not inside SCRIPT or STYLE elements are considered comments in HTML and thus stripped // when cleaned. assertHTML("

        ", "

        "); assertHTML("

        &&

        ", "

        &&

        "); assertHTML("", ""); } /** * Using the default setup, we should strip out CData outside * of script and style tags. */ @Test public void CDATAinthewrongplace(){ CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setIgnoreQuestAndExclam(true); cleaner = new HtmlCleaner(cleanerProperties); String testData = "" + "

        " + "\n" + "

        "; TagNode cleaned = cleaner.clean(testData); TagNode p = cleaned.findElementByName("p", true); // // We should have no CData nodes, instead the contents should // be processed as content and escaped as usual // assertTrue(p.getAllChildren().get(0) instanceof ContentNode); } @Test public void nonSafeCDATA(){ String testData = "" + ""; TagNode cleaned = cleaner.clean(testData); TagNode script = cleaned.findElementByName("script", true); // // We should have a CData node for the CDATA section // assertTrue(script.getAllChildren().get(0) instanceof CData); CData cdata = (CData)script.getAllChildren().get(0); String content = cdata.getContentWithoutStartAndEndTokens(); assertEquals("\nfunction helloWorld() {\n};\n", content); } @Test public void safeOutput(){ String testData = "" + ""; TagNode cleaned = cleaner.clean(testData); TagNode script = cleaned.findElementByName("script", true); // // We should have a CData node for the CDATA section // assertTrue(script.getAllChildren().get(0) instanceof CData); CData cdata = (CData)script.getAllChildren().get(0); String content = cdata.getContentWithoutStartAndEndTokens(); assertEquals("\nfunction helloWorld() {\n};\n", content); String safeContent = cdata.getContentWithStartAndEndTokens(); assertEquals("/**/", safeContent); } /** * For a CDATA section we need to ignore '<' and '>' and keep going to keep the content * within a single CData instance. 
*/ @Test public void safeCDATAAlternate(){ String testData = "" + ""; TagNode cleaned = cleaner.clean(testData); TagNode script = cleaned.findElementByName("script", true); // // We should have a CData node for the CDATA section // assertTrue(script.getAllChildren().get(1) instanceof CData); CData cdata = (CData)script.getAllChildren().get(1); String content = cdata.getContentWithoutStartAndEndTokens(); assertEquals("\nfunction escapeForXML(origtext) {\n return origtext.replace(/\\&/g,'&'+'amp;').replace(//g,'&'+'gt;').replace(/'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');}\n", content); } /** * For a CDATA section we need to ignore '<' and '>' and keep going to keep the content * within a single CData instance */ @Test public void safeCDATA(){ String testData = "" + ""; TagNode cleaned = cleaner.clean(testData); TagNode script = cleaned.findElementByName("script", true); // // We should have a CData node for the CDATA section // assertTrue(script.getAllChildren().get(1) instanceof CData); CData cdata = (CData)script.getAllChildren().get(1); String content = cdata.getContentWithoutStartAndEndTokens(); assertEquals("\nfunction escapeForXML(origtext) {\n return origtext.replace(/\\&/g,'&'+'amp;').replace(//g,'&'+'gt;').replace(/'/g,'&'+'apos;').replace(/\"/g,'&'+'quot;');}\n", content); } @Test public void style(){ String testData = ""; TagNode cleaned = cleaner.clean(testData); TagNode style = cleaned.findElementByName("style", true); assertTrue(style.getAllChildren().get(0) instanceof CData); String content = (((CData)style.getAllChildren().get(0)).getContentWithoutStartAndEndTokens()); assertEquals("\n#ampmep_188 { }\n", content); } @Test public void preserveComments() throws IOException{ cleaner.getProperties().setOmitXmlDeclaration(false); String initial = readFile("src/test/resources/test17.html"); String expected = readFile("src/test/resources/test17_expected.html"); assertCleaned(initial, expected); } @Test public void preserveCommentsXwiki() throws 
IOException{ cleaner.getProperties().setOmitXmlDeclaration(false); cleaner.getProperties().setAddNewlineToHeadAndBody(false); assertHTML("", "" ); } @Test public void preserveComments2() throws IOException{ cleaner.getProperties().setOmitXmlDeclaration(false); cleaner.getProperties().setAddNewlineToHeadAndBody(false); assertHTML("", "" ); } } src/test/java/org/htmlcleaner/BrowserCompactXmlSerializerTest.java package org.htmlcleaner; import java.io.*; import junit.framework.*; /** * Test cases for {@link BrowserCompactXmlSerializer} * * @author Konstantin Burov (aectann@gmail.com) * */ public class BrowserCompactXmlSerializerTest extends TestCase { private BrowserCompactXmlSerializer compactXmlSerializer; private CleanerProperties properties; @Override protected void setUp() throws Exception { properties = new CleanerProperties(); properties.setOmitHtmlEnvelope(true); properties.setOmitXmlDeclaration(true); compactXmlSerializer = new BrowserCompactXmlSerializer(properties); } public void testInlineWhitespaceHandling(){ String cleaned = compactXmlSerializer.getAsString("

        Test1 Linktext Test2

        "); assertEquals("

        Test1 Linktext Test2

        \n", cleaned); cleaned = compactXmlSerializer.getAsString("

        Test1LinktextTest2

        "); assertEquals("

        Test1LinktextTest2

        \n", cleaned); cleaned = compactXmlSerializer.getAsString("one
        two
        threefour"); assertEquals("one
        twothreefour", cleaned); cleaned = compactXmlSerializer.getAsString("one
        two
        three four"); assertEquals("one
        twothree four", cleaned); } /** * Tests that serializer removes white spaces properly. * @throws IOException */ public void testRemoveInsignificantWhitespaces() throws IOException{ String cleaned = compactXmlSerializer.getAsString( " text here, some text "); assertEquals("text here, some text", cleaned); cleaned = compactXmlSerializer.getAsString( "
        2 roots < here >
        "); assertEquals("
        2 roots < here >
        \n", cleaned); cleaned = compactXmlSerializer.getAsString( "
        2 roots \n < here >
        "); assertEquals("
        2 roots < here >
        \n", cleaned); cleaned = compactXmlSerializer.getAsString( "
        2 roots \n\n < here >
        "); assertEquals("
        2 roots
        < here >
        \n", cleaned); } /** * Non-breakable spaces also must be removed from start and end. * @throws IOException */ public void testRemoveLeadingAndEndingNbsp() throws IOException { String cleaned = compactXmlSerializer.getAsString( "  We have just released Jericho Road. Listen to Still Waters the lead-off track."); assertEquals("We have just released Jericho Road. Listen to Still Waters the lead-off track.", cleaned); cleaned = compactXmlSerializer.getAsString( " We have just released Jericho Road. Listen to Still Waters the lead-off track. "); assertEquals("We have just released Jericho Road. Listen to Still Waters the lead-off track.", cleaned); cleaned = compactXmlSerializer.getAsString( " We have just released Jericho Road. Listen to Still Waters the lead-off track. "); assertEquals("We have just released Jericho Road. Listen to Still Waters the lead-off track.", cleaned); cleaned = compactXmlSerializer.getAsString( SpecialEntities.NON_BREAKABLE_SPACE + "We have just released Jericho Road. Listen to Still Waters the lead-off track. " + SpecialEntities.NON_BREAKABLE_SPACE); assertEquals("We have just released Jericho Road. Listen to Still Waters the lead-off track.", cleaned); } /** * Tests that contents of 'pre' tag are untouched. * @throws IOException */ public void testPreTagIsUntouched() throws IOException{ String cleaned = compactXmlSerializer.getAsString( "
        some text
        "); assertEquals("
        some text
        \n", cleaned); cleaned = compactXmlSerializer.getAsString( "
             some text
        "); assertEquals("
             some text
        \n", cleaned); cleaned = compactXmlSerializer.getAsString( "
        some /n/n text
        "); assertEquals("
        some /n/n text
        \n", cleaned); } } src/test/java/org/htmlcleaner/BadTerminationTest.java0000644000000000000000000000271212113746650021742 0ustar rootrootpackage org.htmlcleaner; import junit.framework.TestCase; /** * @author patmoore * */ public class BadTerminationTest extends TestCase { public void testHandleGarbageInEndTag() throws Exception { CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitHtmlEnvelope(true); cleanerProperties.setOmitXmlDeclaration(true); cleanerProperties.setUseEmptyElementTags(false); String output = new SimpleXmlSerializer(cleanerProperties).getAsString( "
        "); assertEquals("
        ", output); } // public void testWhiteSpaceInTag() throws Exception { // String s = // "\n" // + // " \n" + // " \n" + // " < /table>"; // CleanerProperties cleanerProperties = new CleanerProperties(); // cleanerProperties.setOmitHtmlEnvelope(false); // cleanerProperties.setOmitXmlDeclaration(true); // cleanerProperties.setUseEmptyElementTags(false); // String output = new // SimpleXmlSerializer().getXmlAsString(cleanerProperties, s, "UTF-8"); // assertEquals("
         
         
        ",output); // } } src/test/java/org/htmlcleaner/AbstractHtmlCleanerTest.java0000644000000000000000000001603013077712530022722 0ustar rootroot/* Copyright (c) 2006-2013, the HtmlCleaner Project All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ package org.htmlcleaner; import java.io.File; import java.io.IOException; import java.io.StringWriter; import javax.xml.parsers.ParserConfigurationException; import javax.xml.transform.Transformer; import javax.xml.transform.TransformerFactory; import javax.xml.transform.dom.DOMSource; import javax.xml.transform.stream.StreamResult; import org.jdom2.input.DOMBuilder; import org.jdom2.output.Format; import org.jdom2.output.XMLOutputter; import org.junit.Assert; import org.junit.Before; import org.w3c.dom.Document; import static org.junit.Assert.assertEquals; /** * Abstract test class with utility methods */ public abstract class AbstractHtmlCleanerTest { protected HtmlCleaner cleaner; protected Serializer serializer; @Before public void setup(){ CleanerProperties cleanerProperties = new CleanerProperties(); cleanerProperties.setOmitXmlDeclaration(true); cleanerProperties.setOmitDoctypeDeclaration(false); cleanerProperties.setAdvancedXmlEscape(true); cleanerProperties.setTranslateSpecialEntities(false); cleanerProperties.setOmitComments(false); cleanerProperties.setIgnoreQuestAndExclam(false); cleaner = new HtmlCleaner(cleanerProperties); serializer = new SimpleXmlSerializer(cleanerProperties); } protected void assertCleaned(String initial, String expected) throws IOException { TagNode node = cleaner.clean(initial); StringWriter writer = new StringWriter(); serializer.write(node, writer, "UTF-8"); assertEquals(expected, writer.toString()); } protected void assertCleanedHtml(String initial, String expected) throws IOException { TagNode node = cleaner.clean(initial); StringWriter writer = new StringWriter(); Serializer ser = new SimpleHtmlSerializer(cleaner.getProperties()); ser.write(node, writer, "UTF-8"); assertEquals(expected, writer.toString()); } protected void assertCleanedDom(String initial, String expected) throws Exception { cleaner.getProperties().setOmitHtmlEnvelope(false); TagNode node = cleaner.clean(initial); StringWriter writer = new StringWriter(); 
DomSerializer domSerializer = new DomSerializer(cleaner.getProperties()); Document document = domSerializer.createDOM(node); TransformerFactory tf = TransformerFactory.newInstance(); Transformer transformer = tf.newTransformer(); transformer.transform(new DOMSource(document), new StreamResult(writer)); String actual = writer.getBuffer().toString(); actual = actual.substring(actual.indexOf("\n")+7, actual.indexOf("\n")); assertEquals(expected, actual); cleaner.getProperties().setOmitHtmlEnvelope(true); } protected void assertCleanedJDom(String initial, String expected) throws Exception { boolean env = cleaner.getProperties().isOmitHtmlEnvelope(); cleaner.getProperties().setOmitHtmlEnvelope(false); TagNode node = cleaner.clean(initial); StringWriter writer = new StringWriter(); JDomSerializer domSerializer = new JDomSerializer(cleaner.getProperties()); org.jdom2.Document document = domSerializer.createJDom(node); XMLOutputter out = new XMLOutputter(); out.output(document, writer); String actual = writer.getBuffer().toString(); actual = actual.substring(actual.indexOf("")+6, actual.indexOf("")); assertEquals(expected, actual); cleaner.getProperties().setOmitHtmlEnvelope(env); } protected String readFile(String filename) throws IOException { File file = new File(filename); CharSequence content = Utils.readUrl(file.toURI().toURL(), "UTF-8"); return content.toString(); } public static final String HEADER = "\n"; //+ "\n"; private static final String HEADER_FULL = HEADER + ""; private static final String FOOTER = ""; protected void assertHTML(String expected, String input) throws IOException { StringWriter writer = new StringWriter(); serializer.write(cleaner.clean(input), writer, "UTF-8"); String actual = writer.toString(); Assert.assertEquals(HEADER_FULL + expected + FOOTER, actual); } protected void assertHTMLUsingDomSerializer(String expected, String input) throws IOException, ParserConfigurationException { DomSerializer ser = new 
DomSerializer(cleaner.getProperties()); Document document = ser.createDOM(cleaner.clean(input)); DOMBuilder in = new DOMBuilder(); org.jdom2.Document jdomDoc = in.build(document); XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n")); String actual = outputter.outputString(jdomDoc); Assert.assertEquals(HEADER_FULL + expected + FOOTER + "\n", actual); } protected void assertHTMLUsingJDomSerializer(String expected, String input) throws IOException, ParserConfigurationException { JDomSerializer ser = new JDomSerializer(cleaner.getProperties()); org.jdom2.Document document = ser.createJDom(cleaner.clean(input)); XMLOutputter outputter = new XMLOutputter(Format.getRawFormat().setEncoding("UTF-8").setLineSeparator("\n")); String actual = outputter.outputString(document); Assert.assertEquals(HEADER_FULL + expected + FOOTER + "\n", actual); } } src/main/0000755000000000000000000000000013105122452011262 5ustar rootrootsrc/main/java/0000755000000000000000000000000013105122452012203 5ustar rootrootsrc/main/java/org/0000755000000000000000000000000013105122452012772 5ustar rootrootsrc/main/java/org/htmlcleaner/0000755000000000000000000000000013105122453015271 5ustar rootrootsrc/main/java/org/htmlcleaner/XPatherException.java0000644000000000000000000000436111504062110021364 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
* The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; /** *

        Exception that could occur during XPather evaluation.

        */ public class XPatherException extends Exception { public XPatherException() { this("Error in evaluating XPath expression!"); } public XPatherException(Throwable cause) { super(cause); } public XPatherException(String message) { super(message); } public XPatherException(String message, Throwable cause) { super(message, cause); } }src/main/java/org/htmlcleaner/XPather.java0000644000000000000000000006135312311632321017515 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. 
Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import java.util.ArrayList; import java.util.Collection; import java.util.Iterator; import java.util.LinkedHashSet; import java.util.List; import java.util.StringTokenizer; /** *

        Utility for searching cleaned document tree with XPath expressions.

        * Examples of supported axes: * *
          *
        • //div//a
        • *
        • //div//a[@id][@class]
        • *
        • /body/*[1]/@type
        • *
        • //div[3]//a[@id][@href='r/n4']
        • *
        • (//div[last() >= 4]//./div[position() = last()])[position() > 22]//li[2]//a
        • *
        • //div[2]/@*[2]
        • *
        • data(//div//a[@id][@class])
        • *
        • //p/last()
        • *
        • //body//div[3][@class]//span[12.2 *
        • data(//a['v' < @id])
        • *
        *
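        * A minimal usage sketch, assuming the input was already parsed into a
        * TagNode tree with HtmlCleaner.clean(String); the variable names and the
        * sample markup are illustrative only:
        *
        *   TagNode root = new HtmlCleaner().clean("<div><a id='x' href='#'>link</a></div>");
        *   Object[] hits = new XPather("//div//a[@id]").evaluateAgainstNode(root);
        *
        * The returned array holds TagNode instances for element paths and
        * attribute values for @-paths; evaluateAgainstNode throws
        * XPatherException when the expression cannot be evaluated or when the
        * node is null.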
        */ public class XPather { private static final int C0 = '0'; private static final int C9 = '9'; private static final int CD = '.'; private static final int CP = '+'; private static final int CM = '-'; private static final int CS = ' '; // array of basic tokens of which XPath expression is made private String tokenArray[]; /** * Constructor - creates XPather instance with specified XPath expression. * @param expression */ public XPather(String expression) { StringTokenizer tokenizer = new StringTokenizer(expression, "/()[]\"'=<>", true); int tokenCount = tokenizer.countTokens(); tokenArray = new String[tokenCount]; int index = 0; // this is not real XPath compiler, rather simple way to recognize basic XPaths expressions // and interpret them against some TagNode instance. while (tokenizer.hasMoreTokens()) { tokenArray[index++] = tokenizer.nextToken(); } } /** * Main public method for this class - a way to execute XPath expression against * specified TagNode instance. * @param node */ public Object[] evaluateAgainstNode(TagNode node) throws XPatherException { if (node == null) { throw new XPatherException("Cannot evaluate XPath expression against null value!"); } Collection collectionResult = evaluateAgainst(singleton(node), 0, tokenArray.length - 1, false, 1, 0, false, null); Object[] array = new Object[collectionResult.size()]; Iterator iterator = collectionResult.iterator(); int index = 0; while (iterator.hasNext()) { array[index++] = iterator.next(); } return array; } private void throwStandardException() throws XPatherException { throw new XPatherException(); } private Collection evaluateAgainst(Collection object, int from, int to, boolean isRecursive, int position, int last, boolean isFilterContext, Collection filterSource) throws XPatherException { if (from >= 0 && to < tokenArray.length && from <= to) { if ("".equals(tokenArray[from].trim())) { return evaluateAgainst(object, from + 1, to, isRecursive, position, last, isFilterContext, filterSource); } 
else if (isToken("(", from)) { int closingBracket = findClosingIndex(from, to); if (closingBracket > 0) { Collection value = evaluateAgainst(object, from + 1, closingBracket - 1, false, position, last, isFilterContext, filterSource); return evaluateAgainst(value, closingBracket + 1, to, false, position, last, isFilterContext, filterSource); } else { throwStandardException(); } } else if (isToken("[", from)) { int closingBracket = findClosingIndex(from, to); if (closingBracket > 0 && object != null) { Collection value = filterByCondition(object, from + 1, closingBracket - 1); return evaluateAgainst(value, closingBracket + 1, to, false, position, last, isFilterContext, filterSource); } else { throwStandardException(); } } else if (isToken("\"", from) || isToken("'", from)) { // string constant int closingQuote = findClosingIndex(from, to); if (closingQuote > from) { Collection value = singleton( flatten(from + 1, closingQuote - 1) ); return evaluateAgainst(value, closingQuote + 1, to, false, position, last, isFilterContext, filterSource); } else { throwStandardException(); } } else if ( (isToken("=", from) || isToken("<", from) || isToken(">", from)) && isFilterContext ) { // operator inside filter boolean logicValue; if ( isToken("=", from + 1) && (isToken("<", from) || isToken(">", from)) ) { Collection secondObject = evaluateAgainst(filterSource, from + 2, to, false, position, last, isFilterContext, filterSource); logicValue = evaluateLogic(object, secondObject, tokenArray[from] + tokenArray[from + 1]); } else { Collection secondObject = evaluateAgainst(filterSource, from + 1, to, false, position, last, isFilterContext, filterSource); logicValue = evaluateLogic(object, secondObject, tokenArray[from]); } return singleton(new Boolean(logicValue)); } else if (isToken("/", from)) { // children of the node boolean goRecursive = isToken("/", from + 1); if (goRecursive) { from++; } if ( from < to ) { int toIndex = findClosingIndex(from, to) - 1; if (toIndex <= from) { 
toIndex = to; } Collection value = evaluateAgainst(object, from + 1, toIndex, goRecursive, 1, last, isFilterContext, filterSource); return evaluateAgainst(value, toIndex + 1, to, false, 1, last, isFilterContext, filterSource); } else { throwStandardException(); } } else if (isFunctionCall(from, to)) { int closingBracketIndex = findClosingIndex(from + 1, to); Collection funcValue = evaluateFunction(object, from, to, position, last, isFilterContext); return evaluateAgainst(funcValue, closingBracketIndex + 1, to, false, 1, last, isFilterContext, filterSource); } else if (isValidInteger(tokenArray[from])) { Collection value = singleton(Integer.valueOf(tokenArray[from])); return evaluateAgainst(value, from + 1, to, false, position, last, isFilterContext, filterSource); } else if (isValidDouble(tokenArray[from])) { Collection value = singleton(Double.valueOf(tokenArray[from])); return evaluateAgainst(value, from + 1, to, false, position, last, isFilterContext, filterSource); } else { return getElementsByName(object, from, to, isRecursive, isFilterContext); } } else { return object; } throw new XPatherException(); } private String flatten(int from, int to) { if (from <= to) { StringBuffer result = new StringBuffer(); for (int i = from; i <= to; i++) { result.append(tokenArray[i]); } return result.toString(); } return ""; } private static boolean isValidInteger(String value) { final int l = value.length(); if(l > 0) { int i = 1, c = value.charAt(0); if(c == CP || c == CM || (c >= C0 && c <= C9)) { for (; i < l; i++) { c = value.charAt(i); if (c < C0 || c > C9) return false; } return true; } } return false; } private boolean isValidDouble(String value) { final int l = value.length(); if(l > 0) { int i = 1, c = value.charAt(0); if(c == CP || c == CM || c == CS || (c >= C0 && c <= C9)) { for (; i < l; i++) { c = value.charAt(i); if (c != CD && (c < C0 || c > C9)) return false; } return true; } } return false; } /** * Checks if given string is valid identifier. 
* @param s */ private boolean isIdentifier(String s) { if (s == null) { return false; } s = s.trim(); if (s.length() > 0) { if ( !Character.isLetter(s.charAt(0)) ) { return false; } for (int i = 1; i < s.length(); i++) { final char ch = s.charAt(i); if ( ch != '_' && ch != '-' && !Character.isLetterOrDigit(ch) ) { return false; } } } return false; } /** * Checks if tokens in specified range represents valid function call. * @param from * @param to * @return True if it is valid function call, false otherwise. */ private boolean isFunctionCall(int from, int to) { if ( !isIdentifier(tokenArray[from]) && !isToken("(", from + 1) ) { return false; } return findClosingIndex(from + 1, to) > from + 1; } /** * Evaluates specified function. * Currently, following XPath functions are supported: last, position, text, count, data * @param source * @param from * @param to * @param position * @param last * @return Collection as the result of evaluation. */ private Collection evaluateFunction(Collection source, int from, int to, int position, int last, boolean isFilterContext) throws XPatherException { String name = tokenArray[from].trim(); ArrayList result = new ArrayList(); final int size = source.size(); Iterator iterator = source.iterator(); int index = 0; while (iterator.hasNext()) { Object curr = iterator.next(); index++; if ( "last".equals(name) ) { result.add( Integer.valueOf(isFilterContext ? last : size) ); } else if ( "position".equals(name) ) { result.add( Integer.valueOf(isFilterContext ? 
position : index) ); } else if ( "text".equals(name) ) { if (curr instanceof TagNode) { result.add( ((TagNode)curr).getText() ); } else if (curr instanceof String) { result.add( curr.toString() ); } } else if ( "count".equals(name) ) { Collection argumentEvaluated = evaluateAgainst(source, from + 2, to - 1, false, position, 0, isFilterContext, null); result.add( Integer.valueOf(argumentEvaluated.size()) ); } else if ( "data".equals(name) ) { Collection argumentEvaluated = evaluateAgainst(source, from + 2, to - 1, false, position, 0, isFilterContext, null); Iterator it = argumentEvaluated.iterator(); while (it.hasNext()) { Object elem = it.next(); if (elem instanceof TagNode) { result.add( ((TagNode)elem).getText() ); } else if (elem instanceof String) { result.add( elem.toString() ); } } } else { throw new XPatherException("Unknown function " + name + "!"); } } return result; } /** * Filter nodes satisfying the condition * @param source * @param from * @param to */ private Collection filterByCondition(Collection source, int from, int to) throws XPatherException { ArrayList result = new ArrayList(); Iterator iterator = source.iterator(); int index = 0; int size = source.size(); while (iterator.hasNext()) { Object curr = iterator.next(); index++; ArrayList logicValueList = new ArrayList(evaluateAgainst(singleton(curr), from, to, false, index, size, true, singleton(curr))); if (logicValueList.size() >= 1) { Object first = logicValueList.get(0); if (first instanceof Boolean) { if ( ((Boolean)first).booleanValue() ) { result.add(curr); } } else if (first instanceof Integer) { if ( ((Integer)first).intValue() == index ) { result.add(curr); } } else { result.add(curr); } } } return result; } private boolean isToken(String token, int index) { int len = tokenArray.length; return index >= 0 && index < len && tokenArray[index].trim().equals(token.trim()); } /** * @param from * @param to * @return matching closing index in the token array for the current token, or -1 if there 
is * no closing token within expected bounds. */ private int findClosingIndex(int from, int to) { if (from < to) { String currToken = tokenArray[from]; if ("\"".equals(currToken)) { for (int i = from + 1; i <= to; i++) { if ("\"".equals(tokenArray[i])) { return i; } } } else if ("'".equals(currToken)) { for (int i = from + 1; i <= to; i++) { if ("'".equals(tokenArray[i])) { return i; } } } else if ( "(".equals(currToken) || "[".equals(currToken) || "/".equals(currToken) ) { boolean isQuoteClosed = true; boolean isAposClosed = true; int brackets = "(".equals(currToken) ? 1 : 0; int angleBrackets = "[".equals(currToken) ? 1 : 0; int slashes = "/".equals(currToken) ? 1 : 0; for (int i = from + 1; i <= to; i++) { if ( "\"".equals(tokenArray[i]) ) { isQuoteClosed = !isQuoteClosed; } else if ( "'".equals(tokenArray[i]) ) { isAposClosed = !isAposClosed; } else if ( "(".equals(tokenArray[i]) && isQuoteClosed && isAposClosed ) { brackets++; } else if ( ")".equals(tokenArray[i]) && isQuoteClosed && isAposClosed ) { brackets--; } else if ( "[".equals(tokenArray[i]) && isQuoteClosed && isAposClosed ) { angleBrackets++; } else if ( "]".equals(tokenArray[i]) && isQuoteClosed && isAposClosed ) { angleBrackets--; } else if ( "/".equals(tokenArray[i]) && isQuoteClosed && isAposClosed && brackets == 0 && angleBrackets == 0) { slashes--; } if (isQuoteClosed && isAposClosed && brackets == 0 && angleBrackets == 0 && slashes == 0) { return i; } } } } return -1; } /** * Checks if token is attribute (starts with @) * @param token */ private boolean isAtt(String token) { return token != null && token.length() > 1 && token.startsWith("@"); } /** * Creates one-element collection for the specified object. * @param element */ private Collection singleton(Object element) { ArrayList result = new ArrayList(); result.add(element); return result; } /** * For the given source collection and specified name, returns collection of subnodes * or attribute values. 
* @param source * @param from * @param to * @param isRecursive * @return Colection of TagNode instances or collection of String instances. */ private Collection getElementsByName(Collection source, int from, int to, boolean isRecursive, boolean isFilterContext) throws XPatherException { String name = tokenArray[from].trim(); if (isAtt(name)) { name = name.substring(1); Collection result = new ArrayList(); Collection nodes; if (isRecursive) { nodes = new LinkedHashSet(); Iterator iterator = source.iterator(); while (iterator.hasNext()) { Object next = iterator.next(); if (next instanceof TagNode) { TagNode node = (TagNode) next; nodes.addAll( node.getAllElementsList(true) ); } } } else { nodes = source; } Iterator iterator = nodes.iterator(); while (iterator.hasNext()) { Object next = iterator.next(); if (next instanceof TagNode) { TagNode node = (TagNode) next; if ("*".equals(name)) { result.addAll( evaluateAgainst(node.getAttributes().values(), from + 1, to, false, 1, 1, isFilterContext, null) ); } else { String attValue = node.getAttributeByName(name); if (attValue != null) { result.addAll( evaluateAgainst(singleton(attValue), from + 1, to, false, 1, 1, isFilterContext, null) ); } } } else { throwStandardException(); } } return result; } else { Collection result = new LinkedHashSet(); Iterator iterator = source.iterator(); int index = 0; while (iterator.hasNext()) { final Object next = iterator.next(); if (next instanceof TagNode) { TagNode node = (TagNode) next; index++; boolean isSelf = ".".equals(name); boolean isParent = "..".equals(name); boolean isAll = "*".equals(name); Collection subnodes; if (isSelf) { subnodes = singleton(node); } else if (isParent) { TagNode parent = node.getParent(); subnodes = parent != null ? singleton(parent) : new ArrayList(); } else { subnodes = isAll ? 
node.getChildTagList() : node.getElementListByName(name, false); } LinkedHashSet nodeSet = new LinkedHashSet(subnodes); Collection refinedSubnodes = evaluateAgainst(nodeSet, from + 1, to, false, index, nodeSet.size(), isFilterContext, null); if (isRecursive) { List childTags = node.getChildTagList(); if (isSelf || isParent || isAll) { result.addAll(refinedSubnodes); } Iterator childIterator = childTags.iterator(); while (childIterator.hasNext()) { TagNode childTag = (TagNode) childIterator.next(); Collection childrenByName = getElementsByName(singleton(childTag), from, to, isRecursive, isFilterContext); if ( !isSelf && !isParent && !isAll && refinedSubnodes.contains(childTag) ) { result.add(childTag); } result.addAll(childrenByName); } } else { result.addAll(refinedSubnodes); } } else { throwStandardException(); } } return result; } } /** * Evaluates logic operation on two collections. * @param first * @param second * @param logicOperator * @return Result of logic operation */ private boolean evaluateLogic(Collection first, Collection second, String logicOperator) { if (first == null || first.size() == 0 || second == null || second.size() == 0) { return false; } Object elem1 = first.iterator().next(); Object elem2 = second.iterator().next(); if (elem1 instanceof Number && elem2 instanceof Number) { double d1 = ((Number)elem1).doubleValue(); double d2 = ((Number)elem2).doubleValue(); if ("=".equals(logicOperator)) { return d1 == d2; } else if ("<".equals(logicOperator)) { return d1 < d2; } else if (">".equals(logicOperator)) { return d1 > d2; } else if ("<=".equals(logicOperator)) { return d1 <= d2; } else if (">=".equals(logicOperator)) { return d1 >= d2; } } else { String s1 = toText(elem1); String s2 = toText(elem2); int result = s1.compareTo(s2); if ("=".equals(logicOperator)) { return result == 0; } else if ("<".equals(logicOperator)) { return result < 0; } else if (">".equals(logicOperator)) { return result > 0; } else if ("<=".equals(logicOperator)) { return 
result <= 0; } else if (">=".equals(logicOperator)) { return result >= 0; } } return false; } private String toText(Object o) { if (o == null) { return ""; } if (o instanceof TagNode) { return ((TagNode)o).getText().toString(); } else { return o.toString(); } } }src/main/java/org/htmlcleaner/XmlSerializer.java0000644000000000000000000002655113077712530020750 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. 
*/ package org.htmlcleaner; import java.io.*; import java.util.*; /** *

        Abstract XML serializer - contains common logic for descendants.

        */ public abstract class XmlSerializer extends Serializer { public static final String XMLNS_NAMESPACE = "xmlns"; protected XmlSerializer(CleanerProperties props) { super(props); } private boolean creatingHtmlDom; /** * @param creatingHtmlDom the creatingHtmlDom to set */ public void setCreatingHtmlDom(boolean creatingHtmlDom) { this.creatingHtmlDom = creatingHtmlDom; } /** * @return the creatingHtmlDom */ public boolean isCreatingHtmlDom() { return creatingHtmlDom; } /** * @deprecated Use writeToStream() instead. */ @Deprecated public void writeXmlToStream(TagNode tagNode, OutputStream out, String charset) throws IOException { super.writeToStream(tagNode, out, charset); } /** * @deprecated Use writeToStream() instead. */ @Deprecated public void writeXmlToStream(TagNode tagNode, OutputStream out) throws IOException { super.writeToStream(tagNode, out); } /** * @deprecated Use writeToFile() instead. */ @Deprecated public void writeXmlToFile(TagNode tagNode, String fileName, String charset) throws IOException { super.writeToFile(tagNode, fileName, charset); } /** * @deprecated Use writeToFile() instead. */ @Deprecated public void writeXmlToFile(TagNode tagNode, String fileName) throws IOException { super.writeToFile(tagNode, fileName); } /** * @deprecated Use getAsString() instead. */ @Deprecated public String getXmlAsString(TagNode tagNode, String charset) { return super.getAsString(tagNode, charset); } /** * @deprecated Use getAsString() instead. */ @Deprecated public String getXmlAsString(TagNode tagNode) { return super.getAsString(tagNode); } /** * @deprecated Use write() instead. 
*/ @Deprecated public void writeXml(TagNode tagNode, Writer writer, String charset) throws IOException { super.write(tagNode, writer, charset); } protected String escapeXml(String xmlContent) { return Utils.escapeXml(xmlContent, props, isCreatingHtmlDom()); } protected boolean dontEscape(TagNode tagNode) { return props.isUseCdataFor(tagNode.getName()); } protected boolean isMinimizedTagSyntax(TagNode tagNode) { final TagInfo tagInfo = props.getTagInfoProvider().getTagInfo(tagNode.getName()); return tagNode.isEmpty() && (tagInfo == null || tagInfo.isMinimizedTagPermitted()) && ( props.isUseEmptyElementTags() || (tagInfo != null && tagInfo.isEmptyTag()) ); } protected void serializeOpenTag(TagNode tagNode, Writer writer) throws IOException { serializeOpenTag(tagNode, writer, true); } /** * Serialize a CDATA section. If the context is a script or style tag, and * using CDATA for script and style is set to true, then we just write the * actual content, as the whole section is wrapped in CDATA tokens. * Otherwise we escape the content as if it were regular text. * * @param item the CDATA instance * @param tagNode the TagNode within which the CDATA appears * @param writer the writer to output to * @throws IOException */ protected void serializeCData(CData item, TagNode tagNode, Writer writer) throws IOException{ if (dontEscape(tagNode)){ writer.write(item.getContentWithoutStartAndEndTokens()); } else { writer.write(escapeXml(item.getContentWithStartAndEndTokens())); } } /** * Serialize a content token, escaping where necessary. 
 * @param item the content token to serialize * @param tagNode the TagNode within which the content token appears * @param writer the writer to output to * @throws IOException */ protected void serializeContentToken(ContentNode item, TagNode tagNode, Writer writer) throws IOException { if (dontEscape(tagNode)){ writer.write(item.getContent()); }else { writer.write( escapeXml(item.getContent()) ); } } protected void serializeOpenTag(TagNode tagNode, Writer writer, boolean newLine) throws IOException { if ( !isForbiddenTag(tagNode)) { String tagName = tagNode.getName(); Map<String, String> tagAttributes = tagNode.getAttributes(); // always put head and body on a new line if (props.isAddNewlineToHeadAndBody() && isHeadOrBody(tagName)) { writer.write("\n"); } writer.write("<" + tagName); Iterator<Map.Entry<String, String>> it = tagAttributes.entrySet().iterator(); while (it.hasNext()) { Map.Entry<String, String> entry = it.next(); String attName = entry.getKey(); String attValue = entry.getValue(); serializeAttribute(tagNode, writer, attName, attValue); } if ( isMinimizedTagSyntax(tagNode) ) { writer.write(" />"); if (newLine) { writer.write("\n"); } } else if (dontEscape(tagNode)) { // because we are not considering if the file is xhtml or html, // we need to put a javascript comment in front of the CDATA in case this is NOT xhtml writer.write(">"); if (!tagNode.getText().toString().startsWith(CData.SAFE_BEGIN_CDATA)) { writer.write(CData.SAFE_BEGIN_CDATA); // // Insert a newline after the CDATA start marker if there isn't // already a newline character there // if (!tagNode.getText().toString().equals("")){ char firstchar = tagNode.getText().toString().charAt(0); if (firstchar != '\n' && firstchar !='\r') writer.write("\n"); } } } else { writer.write(">"); } } } /** * @param tagNode * @return true if the tag is forbidden */ protected boolean isForbiddenTag(TagNode tagNode) { // null tagName when rootNode is a dummy node. 
// this happens when omitting the html envelope elements ( <html>, <head>, <body> elements ) String tagName = tagNode.getName(); return tagName == null; } protected boolean isHeadOrBody(String tagName) { return "head".equalsIgnoreCase(tagName) || "body".equalsIgnoreCase(tagName); } /** * This allows overriding to eliminate forbidden attributes (for example javascript attributes onclick, onblur, etc.) * @param tagNode * @param writer * @param attName * @param attValue * @throws IOException */ protected void serializeAttribute(TagNode tagNode, Writer writer, String attName, String attValue) throws IOException { // // For XML, we can't use the lax definition of attribute names used in HTML5, so // we have to replace any invalid ones with a generated attribute name, or skip // them entirely. // if (!props.isAllowInvalidAttributeNames()){ attName = Utils.sanitizeXmlAttributeName(attName, props.getInvalidXmlAttributeNamePrefix()); } if (attName != null && (Utils.isValidXmlIdentifier(attName) || props.isAllowInvalidAttributeNames()) && !isForbiddenAttribute(tagNode, attName, attValue)) { writer.write(" " + attName + "=\"" + escapeXml(attValue) + "\""); } } /** * Override to add additional conditions. * @param tagNode * @param attName * @param value * @return true if the attribute should not be output. 
*/ protected boolean isForbiddenAttribute(TagNode tagNode, String attName, String value) { return !props.isNamespacesAware() && (XMLNS_NAMESPACE.equals(attName) || attName.startsWith(XMLNS_NAMESPACE +":")); } protected void serializeEndTag(TagNode tagNode, Writer writer) throws IOException { serializeEndTag(tagNode, writer, true); } protected void serializeEndTag(TagNode tagNode, Writer writer, boolean newLine) throws IOException { if ( !isForbiddenTag(tagNode)) { String tagName = tagNode.getName(); if (dontEscape(tagNode)) { // because we are not considering if the file is xhtml or html, // we need to put a javascript comment in front of the CDATA in case this is NOT xhtml if (!tagNode.getText().toString().trim().endsWith(CData.SAFE_END_CDATA)) { // // Insert a newline character before the CDATA end marker if there isn't one // already at the end of the tag node content // if (tagNode.getText().toString().length() > 0){ char lastchar = tagNode.getText().toString().charAt(tagNode.getText().toString().length()-1); if (lastchar != '\n' && lastchar !='\r') writer.write("\n"); } // Write the CDATA end marker writer.write(CData.SAFE_END_CDATA); } } writer.write( "</" + tagName + ">" ); if (newLine) { writer.write("\n"); } } } }src/main/java/org/htmlcleaner/Utils.java0000644000000000000000000007622213077712530017256 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
* The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import java.io.*; import java.net.URL; import java.util.StringTokenizer; import java.util.regex.Matcher; import java.util.regex.Pattern; /** *

        Common utilities.

        * * Created by: Vladimir Nikic
        * Date: November, 2006. */ public class Utils { /** * Removes the first newline and last newline (if present) of a string * @param str * @return */ static String bchomp(final String str){ return chomp(lchomp(str)); } /** * Removes the last newline (if present) of a string * @param str * @return */ static String chomp(final String str){ if (str.length() ==0) { return str; } if (str.length() == 1) { final char ch = str.charAt(0); if (ch == '\r' || ch == '\n') { return ""; } return str; } int lastIdx = str.length() - 1; final char last = str.charAt(lastIdx); if (last == '\n') { if (str.charAt(lastIdx - 1) == '\r') { lastIdx--; } } else if (last != '\r') { lastIdx++; } return str.substring(0, lastIdx); } /** * Removes the first newline (if present) of a string * @param str * @return */ static String lchomp(final String str){ if (str.length() == 0) { return str; } if (str.length() == 1) { final char ch = str.charAt(0); if (ch == '\r' || ch == '\n') { return ""; } return str; } int firstIndex = 0; final char first = str.charAt(0); if (first == '\n'){ firstIndex++; if (str.charAt(1) == '\r') { firstIndex++ ; } } else if (first != '\r') { firstIndex = 0; } return str.substring(firstIndex, str.length()); } /** * Reads content from the specified URL with specified charset into string * @param url * @param charset * @throws IOException */ @Deprecated // Removing network I/O will make htmlcleaner better suited to a server environment which needs managed connections static CharSequence readUrl(URL url, String charset) throws IOException { StringBuilder buffer = new StringBuilder(1024); InputStream inputStream = url.openStream(); try { InputStreamReader reader = new InputStreamReader(inputStream, charset); char[] charArray = new char[1024]; int charsRead = 0; do { charsRead = reader.read(charArray); if (charsRead >= 0) { buffer.append(charArray, 0, charsRead); } } while (charsRead > 0); } finally { inputStream.close(); } return buffer; } /** * Checks if specified link is 
full URL. * * @param link * @return True, if full URl, false otherwise. */ public static boolean isFullUrl(String link) { if (link == null) { return false; } link = link.trim().toLowerCase(); return link.startsWith("http://") || link.startsWith("https://") || link.startsWith("file://"); } /** * Calculates full URL for specified page URL and link * which could be full, absolute or relative like there can * be found in A or IMG tags. (Reinstated as per user request in bug 159) */ public static String fullUrl(String pageUrl, String link) { if (isFullUrl(link)) { return link; } else if (link != null && link.startsWith("?")) { int qindex = pageUrl.indexOf('?'); int len = pageUrl.length(); if (qindex < 0) { return pageUrl + link; } else if (qindex == len - 1) { return pageUrl.substring(0, len - 1) + link; } else { return pageUrl + "&" + link.substring(1); } } boolean isLinkAbsolute = link.startsWith("/"); if (!isFullUrl(pageUrl)) { pageUrl = "http://" + pageUrl; } int slashIndex = isLinkAbsolute ? pageUrl.indexOf("/", 8) : pageUrl.lastIndexOf("/"); if (slashIndex <= 8) { pageUrl += "/"; } else { pageUrl = pageUrl.substring(0, slashIndex + 1); } return isLinkAbsolute ? pageUrl + link.substring(1) : pageUrl + link; } /** * Escapes HTML string * @param s String to be escaped * @param props Cleaner properties affects escaping behaviour * @return */ public static String escapeHtml(String s, CleanerProperties props) { boolean advanced = props.isAdvancedXmlEscape(); boolean recognizeUnicodeChars = props.isRecognizeUnicodeChars(); boolean translateSpecialEntities = props.isTranslateSpecialEntities(); boolean transResCharsToNCR = props.isTransResCharsToNCR(); boolean transSpecialEntitiesToNCR = props.isTransSpecialEntitiesToNCR(); return escapeXml(s, advanced, recognizeUnicodeChars, translateSpecialEntities, false, transResCharsToNCR, transSpecialEntitiesToNCR, true); } /** * Escapes XML string. 
* @param s String to be escaped * @param props Cleaner properties affects escaping behaviour * @param isDomCreation Tells if escaped content will be part of the DOM */ public static String escapeXml(String s, CleanerProperties props, boolean isDomCreation) { boolean advanced = props.isAdvancedXmlEscape(); boolean recognizeUnicodeChars = props.isRecognizeUnicodeChars(); boolean translateSpecialEntities = props.isTranslateSpecialEntities(); boolean transResCharsToNCR = props.isTransResCharsToNCR(); boolean transSpecialEntitiesToNCR = props.isTransSpecialEntitiesToNCR(); return escapeXml(s, advanced, recognizeUnicodeChars, translateSpecialEntities, isDomCreation, transResCharsToNCR, transSpecialEntitiesToNCR, false); } /** * change notes: * 1) convert ascii characters encoded using &#xx; format to the ascii characters -- may be an attempt to slip in malicious html * 2) convert &#xxx; format characters to " style representation if available for the character. * 3) convert html special entities to xml &#xxx; when outputing in xml * @param s * @param advanced * @param recognizeUnicodeChars * @param translateSpecialEntities * @param isDomCreation * @return * TODO Consider moving to CleanerProperties since a long list of params is misleading. */ public static String escapeXml(String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR) { return escapeXml(s,advanced,recognizeUnicodeChars,translateSpecialEntities,isDomCreation,transResCharsToNCR,translateSpecialEntitiesToNCR,false); } /** * change notes: * 1) convert ascii characters encoded using &#xx; format to the ascii characters -- may be an attempt to slip in malicious html * 2) convert &#xxx; format characters to " style representation if available for the character. 
* 3) convert html special entities to xml &#xxx; when outputing in xml * @param s * @param advanced * @param recognizeUnicodeChars * @param translateSpecialEntities * @param isDomCreation * @param isHtmlOutput * @return * TODO Consider moving to CleanerProperties since a long list of params is misleading. */ public static String escapeXml(String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR, boolean isHtmlOutput) { if (s != null) { int len = s.length(); StringBuilder result = new StringBuilder(len); for (int i = 0; i < len; i++) { char ch = s.charAt(i); SpecialEntity code; if (ch == '&') { if ( (advanced || recognizeUnicodeChars) && (i < len-1) && (s.charAt(i+1) == '#') ) { i = convertToUnicode(s, isDomCreation, recognizeUnicodeChars, translateSpecialEntitiesToNCR, result, i+2); } else if ((translateSpecialEntities || advanced) && (code = SpecialEntities.INSTANCE.getSpecialEntity(s.substring(i, i+Math.min(10, len-i)))) != null) { if (translateSpecialEntities && code.isHtmlSpecialEntity()) { if (recognizeUnicodeChars) { result.append( (char)code.intValue() ); } else { result.append( code.getDecimalNCR() ); } i += code.getKey().length() + 1; } else if (advanced ) { // // If we are creating a HTML DOM or outputting to the HtmlSerializer, use HTML special entities; // otherwise we get their XML escaped version (see bug #118). // result.append(transResCharsToNCR ? code.getDecimalNCR() : code.getEscaped(isHtmlOutput || isDomCreation)); i += code.getKey().length()+1; } else { result.append(transResCharsToNCR ? 
getAmpNcr() : "&amp;"); } } // // If the serializer used to output is HTML rather than XML, and we have a match to a // known HTML entity such as &nbsp;, we output it as-is (see bug #118) // else if (isHtmlOutput) { // we have an ampersand and that's all we know so far code = SpecialEntities.INSTANCE.getSpecialEntity(s.substring(i, i+Math.min(10, len-i))); if ( code != null ) { // It is a special entity like &nbsp; - leave it in place. result.append(code.getEscapedValue()); // advance i by the length of the entity so we won't process each following character // key length excludes & and ; and we add 1 to skip the ; i += code.getKey().length()+1; } else if ( (i < len-1) && (s.charAt(i+1) == '#') ) { // if the next char is a # then convert entity number to entity name (if possible) i = convert_To_Entity_Name(s, false, false, false, result, i+2); // assuming 'i' is being incremented correctly... not verified. } else { // html output but not an entity name or number result.append(transResCharsToNCR ? getAmpNcr() : "&amp;"); } } else { result.append(transResCharsToNCR ? getAmpNcr() : "&amp;"); } } else if ((code = SpecialEntities.INSTANCE.getSpecialEntityByUnicode(ch)) != null ) { // It's a special entity character itself if ( isHtmlOutput ) { if ( "apos".equals(code.getKey()) ) { // leave the apostrophes alone for html output // this is a cheap hack to avoid removing apostrophe from the special entities list for html output result.append(ch); } else { // output as entity name, or as literal character if isDomCreation result.append(isDomCreation? code.getHtmlString() : code.getEscapedValue()); } } else { // output as entity number, or as literal character if isDomCreation result.append(transResCharsToNCR ? 
code.getDecimalNCR() : code.getEscaped(isDomCreation)); } } else { result.append(ch); } } return result.toString(); } return null; } private static String ampNcr; private static String getAmpNcr() { if (ampNcr == null) { ampNcr = SpecialEntities.INSTANCE.getSpecialEntityByUnicode('&').getDecimalNCR(); } return ampNcr; } private static final Pattern ASCII_CHAR = Pattern.compile("\\p{Print}"); /** * @param s * @param domCreation * @param recognizeUnicodeChars * @param translateSpecialEntitiesToNCR * @param result * @param i * @return */ // Converts Numeric Character References (NCRs) (Dec or Hex) to Character Entity References // i.e. &#8364; to &euro; // This is almost a copy of convertToUnicode // only called in the case of isHtmlOutput when we see &# in the input stream private static int convert_To_Entity_Name(String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, StringBuilder result, int i) { StringBuilder unicode = new StringBuilder(); int charIndex = extractCharCode(s, i, true, unicode); if (unicode.length() > 0) { try { boolean isHex = unicode.substring(0,1).equals("x"); // // Get the unicode character and code point // int codePoint = -1; char[] unicodeChar = null; if (isHex){ codePoint = Integer.parseInt(unicode.substring(1), 16); unicodeChar = Character.toChars(codePoint); } else { codePoint = Integer.parseInt(unicode.toString()); unicodeChar = Character.toChars(codePoint); } SpecialEntity specialEntity = SpecialEntities.INSTANCE.getSpecialEntityByUnicode(codePoint); if (unicodeChar.length == 1 && unicodeChar[0] == 0) { // null character &#0Peanut for example // just consume character & result.append("&amp;"); } else if ( specialEntity != null ) { if ( specialEntity.isHtmlSpecialEntity() ) { result.append( domCreation? specialEntity.getHtmlString() : specialEntity.getEscapedValue() ); } else { result.append(domCreation? specialEntity.getHtmlString(): (translateSpecialEntitiesToNCR? (isHex? 
specialEntity.getHexNCR(): specialEntity.getDecimalNCR()) : specialEntity.getHtmlString())); } } else if ( recognizeUnicodeChars ) { // output unicode characters as their actual byte code with the exception of characters that have special xml meaning. result.append( String.valueOf(unicodeChar)); } else if ( ASCII_CHAR.matcher(new String(unicodeChar)).find()) { // ascii printable character. this fancy escaping might be an attempt to slip in dangerous characters (i.e. spelling out doesn't get turned into return props.isUseCdataFor(element.getNodeName()) && (!element.hasChildNodes() || element.getTextContent() == null || element.getTextContent().trim().length() == 0); } protected String outputCData(CData cdata){ return cdata.getContentWithoutStartAndEndTokens(); } protected String deserializeCdataEntities(String input){ return Utils.deserializeEntities(input, props.isRecognizeUnicodeChars()); } /** * Serialize a given HTML Cleaner node. * * @param document the W3C Document to use for creating new DOM elements * @param element the W3C element to which we'll add the subnodes to * @param tagChildren the HTML Cleaner nodes to serialize for that node */ protected void createSubnodes(Document document, Element element, List tagChildren) { if (tagChildren != null) { CDATASection cdata = null; // // For script and style nodes, check if we're set to use CDATA // if (props.isUseCdataFor(element.getTagName())){ cdata = document.createCDATASection(""); element.appendChild(document.createTextNode(CSS_COMMENT_START)); element.appendChild(cdata); } Iterator it = tagChildren.iterator(); while (it.hasNext()) { Object item = it.next(); if (item instanceof CommentNode) { CommentNode commentNode = (CommentNode) item; Comment comment = document.createComment( commentNode.getContent() ); element.appendChild(comment); } else if (item instanceof ContentNode) { ContentNode contentNode = (ContentNode) item; String content = contentNode.getContent(); boolean specialCase = 
props.isUseCdataFor(element.getTagName()); if (escapeXml && !specialCase) { content = Utils.escapeXml(content, props, true); } if (specialCase && item instanceof CData){ // // For CDATA sections we don't want to return the start and // end tokens. See issue #106. // content = ((CData)item).getContentWithoutStartAndEndTokens(); } if (specialCase && deserializeCdataEntities){ content = this.deserializeCdataEntities(content); } if (cdata != null){ cdata.appendData(content); } else { element.appendChild(document.createTextNode(content) ); } } else if (item instanceof TagNode) { TagNode subTagNode = (TagNode) item; Element subelement = document.createElement( subTagNode.getName() ); Map<String, String> attributes = subTagNode.getAttributes(); Iterator<Map.Entry<String, String>> entryIterator = attributes.entrySet().iterator(); while (entryIterator.hasNext()) { Map.Entry<String, String> entry = entryIterator.next(); String attrName = entry.getKey(); String attrValue = entry.getValue(); if (escapeXml) { attrValue = Utils.escapeXml(attrValue, props, true); } // // Fix any invalid attribute names by adding a prefix // if (!props.isAllowInvalidAttributeNames()){ attrName = Utils.sanitizeXmlAttributeName(attrName, props.getInvalidXmlAttributeNamePrefix()); } if (attrName != null && (Utils.isValidXmlIdentifier(attrName) || props.isAllowInvalidAttributeNames())){ subelement.setAttribute(attrName, attrValue); // // Flag the attribute as an ID attribute if appropriate. 
Thanks to Chris173 // if (attrName.equalsIgnoreCase("id")) { subelement.setIdAttribute(attrName, true); } } } // recursively create subnodes createSubnodes(document, subelement, subTagNode.getAllChildren()); element.appendChild(subelement); } else if (item instanceof List) { List sublist = (List) item; createSubnodes(document, element, sublist); } } if (cdata != null){ if (!cdata.getData().startsWith(NEW_LINE)){ cdata.setData(CSS_COMMENT_END + NEW_LINE + cdata.getData()); } else { cdata.setData(CSS_COMMENT_END + cdata.getData()); } if (!cdata.getData().endsWith(NEW_LINE)){ cdata.appendData(NEW_LINE); } cdata.appendData(CSS_COMMENT_START); element.appendChild(document.createTextNode(CSS_COMMENT_END)); } } } }src/main/java/org/htmlcleaner/DoctypeToken.java0000644000000000000000000002435112234151627020560 0ustar rootroot/* Copyright (c) 2006-2013, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import java.io.IOException; import java.io.Writer; /** *

        HTML doctype token.

        */ public class DoctypeToken extends BaseTokenImpl implements HtmlNode{ // // Part 1 is the document type, typically 'html' or 'HTML' // private String part1; // // Part 2 is the PUBLIC or SYSTEM token // private String part2; // // Part 3 is the PUBLIC identifier, typically '-//W3C//DTD HTML 4.01//EN' or similar // private String part3; // // Part 4 is the SYSTEM identifier, typically a URL for the DTD // private String part4; /** * The identified DocType, if any */ private Integer type = null; // // Constants for identified doctypes // public static final int UNKNOWN = 0; public static final int HTML4_0 = 10; public static final int HTML4_01 = 20; public static final int HTML4_01_STRICT = 21; public static final int HTML4_01_TRANSITIONAL = 22; public static final int HTML4_01_FRAMESET = 23; public static final int XHTML1_0_STRICT = 31; public static final int XHTML1_0_TRANSITIONAL = 32; public static final int XHTML1_0_FRAMESET = 33; public static final int XHTML1_1 = 40; public static final int XHTML1_1_BASIC = 41; public static final int HTML5 = 60; public static final int HTML5_LEGACY_TOOL_COMPATIBLE = 61; // // Whether the DocType is valid // private Boolean valid = null; public DoctypeToken(String part1, String part2, String part3, String part4) { this.part1 = part1; this.part2 = part2 != null ? part2.toUpperCase() : part2; this.part3 = clean(part3); this.part4 = clean(part4); validate(); } /* * Constructor for 5-part DocTypes, e.g. . * For this we ignore part4 as we assume that must be "SYSTEM". */ public DoctypeToken(String part1, String part2, String part3, String part4, String part5) { this.part1 = part1; this.part2 = part2 != null ? 
part2.toUpperCase() : part2; this.part3 = clean(part3); this.part4 = clean(part5); validate(); } private String clean(String s) { if (s != null) { s = s.replace('>', ' '); s = s.replace('<', ' '); s = s.replace('&', ' '); s = s.replace('\'', ' '); s = s.replace('\"', ' '); } return s; } public boolean isValid(){ return valid; } /** * Checks the doctype according to W3C parsing rules and tries to identify * the type and validity * * See: *
        * http://www.w3.org/TR/html-markup/syntax.html#doctype-syntax
        * http://dev.w3.org/html5/html-author/#doctype-declaration
        */ private void validate() { // // No PUBLIC or SYSTEM token // if (!"public".equalsIgnoreCase(part2) && !"system".equalsIgnoreCase(part2)) { // // HTML 5 // if ("html".equalsIgnoreCase(part1) && (part2 == null)){ type = HTML5; valid = true; } } if ("public".equalsIgnoreCase(part2)){ // // HTML 4.0 is valid without an ID, or with strict DTD ID // if ("-//W3C//DTD HTML 4.0//EN".equals(getPublicId())){ type = HTML4_0; if ("http://www.w3.org/TR/REC-html40/strict.dtd".equals(part4) || "".equals(getSystemId())){ valid = true; } else { valid = false; } } // // HTML 4.0.1 STRICT is valid with Strict dtd ID or empty // if ("-//W3C//DTD HTML 4.01//EN".equals(getPublicId())){ type = HTML4_01_STRICT; if ("http://www.w3.org/TR/html4/strict.dtd".equals(part4) || "".equals(getSystemId())){ valid = true; } else { valid = false; } } // // HTML 4.0.1 TRANSITIONAL valid only with Transitional DTD ID // if ("-//W3C//DTD HTML 4.01 Transitional//EN".equals(getPublicId())){ type = HTML4_01_TRANSITIONAL; if ("http://www.w3.org/TR/html4/loose.dtd".equals(getSystemId())){ valid = true; } else { valid = false; } } // // HTML 4.0.1 FRAMESET valid only with Frameset ID // if ("-//W3C//DTD HTML 4.01 Frameset//EN".equals(getPublicId())){ type = HTML4_01_FRAMESET; if ("http://www.w3.org/TR/html4/frameset.dtd".equals(getSystemId())){ valid = true; } else { valid = false; } } // // XHTML 1.0 // if ("-//W3C//DTD XHTML 1.0 Strict//EN".equals(getPublicId())){ type = XHTML1_0_STRICT; if ("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd".equals(getSystemId())){ valid = true; } else { valid = false; } } // // XHTML 1.0 Transitional // if ("-//W3C//DTD XHTML 1.0 Transitional//EN".equals(getPublicId())){ type = XHTML1_0_TRANSITIONAL; if ("http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd".equals(getSystemId())){ valid = true; } else { valid = false; } } // // XHTML 1.0 Frameset // if ("-//W3C//DTD XHTML 1.0 Frameset//EN".equals(getPublicId())){ type = XHTML1_0_FRAMESET; if 
("http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd".equals(getSystemId())){ valid = true; } else { valid = false; } } // // XHTML 1.1 // if ("-//W3C//DTD XHTML 1.1//EN".equals(getPublicId())){ type = XHTML1_1; if ("http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd".equals(getSystemId())){ valid = true; } else { valid = false; } } // // XHTML 1.1 Basic // if ("-//W3C//DTD XHTML Basic 1.1//EN".equals(getPublicId())){ type = XHTML1_1_BASIC; if ("http://www.w3.org/TR/xhtml11/DTD/xhtml-basic11.dtd".equals(getSystemId())){ valid = true; } else { valid = false; } } } if ("system".equalsIgnoreCase(part2)){ // // HTML 5 legacy tool compatible // if ("about:legacy-compat".equals(getPublicId())){ type = HTML5_LEGACY_TOOL_COMPATIBLE; valid = true; } } if (type == null){ type = UNKNOWN; valid = false; } } public String getContent() { String result = "<!DOCTYPE "; if (type != UNKNOWN){ if (type >= 30){ result += "html"; } else { result += "HTML"; } } else { // // if it's an unknown doctype, just pass through as-is // result += part1; } if (part2 != null){ result += " " + part2 + " \"" + part3 + "\""; if (!"".equals(part4) ) { result += " \"" + part4 + "\""; } } result += ">"; return result; } @Override public String toString() { return getContent(); } /** * This will retrieve an integer representing the identified DocType */ public int getType(){ return type; } public String getName() { return ""; } public void serialize(Serializer serializer, Writer writer) throws IOException { writer.write(getContent() + "\n"); } /** * This will retrieve the public ID of an externally referenced DTD, or an empty String if none is referenced. */ public String getPublicId(){ return part3; } /** * This will retrieve the system ID of an externally referenced DTD, or an empty String if none is referenced. 
*/ public String getSystemId(){ return part4; } public String getPart1() { return part1; } public String getPart2() { return part2; } /** * Deprecated - use getPublicId() instead * @return */ @Deprecated public String getPart3() { return part3; } /** * Deprecated - use getSystemId() instead * @return */ @Deprecated public String getPart4() { return part4; } }src/main/java/org/htmlcleaner/Display.java0000644000000000000000000000352512113037735017554 0ustar rootrootpackage org.htmlcleaner; /** * Most HTML 4 elements permitted within the BODY are classified as either * block-level elements or inline elements. This enumeration contains * corresponding constants to distinguish them. * * @author Konstantin Burov (aectann@gmail.com) * */ public enum Display { /** * Block-level elements typically contain inline elements and other * block-level elements. When rendered visually, block-level elements * usually begin on a new line. */ block(true, false), /** * Inline elements typically may only contain text and other inline * elements. When rendered visually, inline elements do not usually begin on * a new line. */ inline(false, true), /** * The following elements may be used as either block-level elements or * inline elements. If used as inline elements (e.g., within another inline * element or a P), these elements should not contain any block-level * elements. */ any(true, false), /** * Elements that are not actually inline or block, usually such elements are * not rendered at all. */ none(true, false); private boolean afterTagLineBreakNeeded; private boolean leadingAndEndWhitespacesAllowed; private Display(boolean afterTagLineBreakNeeded, boolean leadingAndEndWhitespacesAllowed) { this.afterTagLineBreakNeeded = afterTagLineBreakNeeded; this.leadingAndEndWhitespacesAllowed = leadingAndEndWhitespacesAllowed; } /** * @return true to advise serializers to put line break after tags with such a display type. 
*/ public boolean isAfterTagLineBreakNeeded() { return afterTagLineBreakNeeded; } /** * @return true if tag contents can have single leading or end whitespace */ public boolean isLeadingAndEndWhitespacesAllowed() { return leadingAndEndWhitespacesAllowed; } } src/main/java/org/htmlcleaner/DefaultTagProvider.java0000644000000000000000000011047312424655677021723 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. 
*/ package org.htmlcleaner; import java.util.concurrent.ConcurrentHashMap; import java.util.concurrent.ConcurrentMap; /** * This is the default tag provider for HTML Cleaner * Note this is no longer generated from XML - see https://sourceforge.net/p/htmlcleaner/bugs/81/ */ public class DefaultTagProvider implements ITagInfoProvider { private static final String STRONG = "strong"; private ConcurrentMap tagInfoMap = new ConcurrentHashMap(); // singleton instance, used if no other TagInfoProvider is specified public final static DefaultTagProvider INSTANCE= new DefaultTagProvider(); private static final String CLOSE_BEFORE_COPY_INSIDE_TAGS = "bdo,"+STRONG+",em,q,b,i,u,tt,sub,sup,big,small,strike,s,font"; private static final String CLOSE_BEFORE_TAGS = "h1,h2,h3,h4,h5,h6,p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"; /** * Phrasing tags are those that can make up paragraphs along with text to make Phrasing Content */ private static final String PHRASING_TAGS = "a,abbr,area,audio,b,bdi,bdo,br,button,canvas,cite,code,data,datalist,del,dfn,em,embed,i,iframe,img,input,ins,kbd,keygen,label,link,map,mark,math,meta,meter,noscript,object,output,progress,q,ruby,s,samp,script,select,small,span,strong,sub,sup,svg,template,textarea,time,u,var,video,wbr"; /** * HTML5 Media Tags */ private static final String MEDIA_TAGS = "audio,video"; public DefaultTagProvider() { TagInfo tagInfo; // private static final Set END_TAG_OPTIONAL = Collections.unmodifiableSet(new HashSet(Arrays.asList( // "thead", "dt", "body", "tr", "colgroup", "td", "tfoot", "th", "li", "dd", "tbody", "p", "html", "head", "option"))); // private static final Set END_TAG_FORBIDDEN = Collections.unmodifiableSet(new HashSet(Arrays.asList( // "hr", "col", "param", "link", "img", "br", "meta", "input", "frame", "area", "basefont", "base", "isindex"))); // private static final Set END_TAG_REQUIRED = Collections.unmodifiableSet(new HashSet(Arrays.asList( // "noscript", "kbd", "center", "button", "h5", 
"h4", "samp", "ol", "h6", "h1", "h3", "h2", "form", "select", // "font", "menu", "ins", // "abbr", "label", "table", "code", "script", "cite", "iframe", "strong", "textarea", "noframes", "big", // "small", "span", "sub", "optgroup", "bdo", "var", "div", "object", "sup", "title", "strike", "style", // "dir", "map", "applet", "dl", "del", "fieldset", "ul", "b", "acronym", "a", "blockquote", // "caption", "i", "u", "s", "frameset", "tt", "address", "q", "pre", "legend", "em", "dfn"))); tagInfo = new TagInfo("div", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("div", tagInfo); /** * The HTML5 semantic flow tags */ // Sectioning tags tagInfo = new TagInfo("aside", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p"); this.put("aside", tagInfo); tagInfo = new TagInfo("section", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p"); this.put("section", tagInfo); tagInfo = new TagInfo("article", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p"); this.put("article", tagInfo); tagInfo = new TagInfo("main", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p"); this.put("main", tagInfo); tagInfo = new TagInfo("nav", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, 
Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p"); this.put("nav", tagInfo); tagInfo = new TagInfo("details", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p"); this.put("details", tagInfo); tagInfo = new TagInfo("summary", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineRequiredEnclosingTags("details"); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p"); this.put("summary", tagInfo); tagInfo = new TagInfo("figure", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p"); this.put("figure", tagInfo); tagInfo = new TagInfo("figcaption", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); tagInfo.defineRequiredEnclosingTags("figure"); this.put("figcaption", tagInfo); // header and footer tagInfo = new TagInfo("header", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,header,footer,main"); this.put("header", tagInfo); tagInfo = new TagInfo("footer", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,header,footer,main"); this.put("footer", tagInfo); /** * Html5 phrasing tags */ tagInfo = new TagInfo("mark", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineAllowedChildrenTags(PHRASING_TAGS); this.put("mark", tagInfo); tagInfo = new 
TagInfo("bdi", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineAllowedChildrenTags(PHRASING_TAGS); this.put("bdi", tagInfo); tagInfo = new TagInfo("time", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineAllowedChildrenTags(PHRASING_TAGS); this.put("time", tagInfo); tagInfo = new TagInfo("meter", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineAllowedChildrenTags(PHRASING_TAGS); tagInfo.defineCloseBeforeTags("meter"); this.put("meter", tagInfo); /** * Html5 Ruby text */ tagInfo = new TagInfo("ruby", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineAllowedChildrenTags("rt,rp"); this.put("ruby", tagInfo); tagInfo = new TagInfo("rt", ContentType.text, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.inline); // // If we include this rule, we get an out-of-memory error. See issue 126. // //tagInfo.defineRequiredEnclosingTags("ruby"); tagInfo.defineAllowedChildrenTags(PHRASING_TAGS); this.put("rt", tagInfo); tagInfo = new TagInfo("rp", ContentType.text, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.inline); // // If we include this rule, we get an out-of-memory error. See issue 126. 
// //tagInfo.defineRequiredEnclosingTags("ruby"); tagInfo.defineAllowedChildrenTags(PHRASING_TAGS); this.put("rp", tagInfo); /** * Html5 media tags */ tagInfo = new TagInfo("audio", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); tagInfo.defineCloseInsideCopyAfterTags(MEDIA_TAGS); this.put("audio", tagInfo); tagInfo = new TagInfo("video", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); tagInfo.defineCloseInsideCopyAfterTags(MEDIA_TAGS); this.put("video", tagInfo); tagInfo = new TagInfo("source", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.any); tagInfo.defineRequiredEnclosingTags(MEDIA_TAGS); this.put("source", tagInfo); tagInfo = new TagInfo("track", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.any); tagInfo.defineRequiredEnclosingTags(MEDIA_TAGS); this.put("track", tagInfo); tagInfo = new TagInfo("canvas", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); this.put("canvas", tagInfo); /** * Html5 interactive tags */ tagInfo = new TagInfo("dialog", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); this.put("dialog", tagInfo); tagInfo = new TagInfo("progress", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); tagInfo.defineAllowedChildrenTags(PHRASING_TAGS); tagInfo.defineCloseBeforeTags("progress"); this.put("progress", tagInfo); /** * HTML 4 and earlier tags */ tagInfo = new TagInfo("span", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("span", tagInfo); tagInfo = new TagInfo("meta", ContentType.none, BelongsTo.HEAD, false, false, false, CloseTag.forbidden, Display.none); this.put("meta", tagInfo); tagInfo = new TagInfo("link", ContentType.none, BelongsTo.HEAD, false, false, false, CloseTag.forbidden, Display.none); this.put("link", tagInfo); tagInfo = 
new TagInfo("title", ContentType.text, BelongsTo.HEAD, false, true, false, CloseTag.required, Display.none); this.put("title", tagInfo); tagInfo = new TagInfo("style", ContentType.text, BelongsTo.HEAD, false, false, false, CloseTag.required, Display.none); this.put("style", tagInfo); tagInfo = new TagInfo("bgsound", ContentType.none, BelongsTo.HEAD, false, false, false, CloseTag.forbidden, Display.none); this.put("bgsound", tagInfo); tagInfo = new TagInfo("h1", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags(CLOSE_BEFORE_TAGS); this.put("h1", tagInfo); tagInfo = new TagInfo("h2", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags(CLOSE_BEFORE_TAGS); this.put("h2", tagInfo); tagInfo = new TagInfo("h3", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags(CLOSE_BEFORE_TAGS); this.put("h3", tagInfo); tagInfo = new TagInfo("h4", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags(CLOSE_BEFORE_TAGS); this.put("h4", tagInfo); tagInfo = new TagInfo("h5", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags(CLOSE_BEFORE_TAGS); this.put("h5", tagInfo); tagInfo = new TagInfo("h6", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags(CLOSE_BEFORE_TAGS); 
this.put("h6", tagInfo); // jericho parser requires

        tagInfo = new TagInfo("p", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("p", tagInfo); tagInfo = new TagInfo(STRONG, ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put(STRONG, tagInfo); tagInfo = new TagInfo("em", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("em", tagInfo); tagInfo = new TagInfo("abbr", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("abbr", tagInfo); tagInfo = new TagInfo("acronym", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("acronym", tagInfo); tagInfo = new TagInfo("address", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("address", tagInfo); tagInfo = new TagInfo("bdo", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("bdo", tagInfo); tagInfo = new TagInfo("blockquote", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("blockquote", tagInfo); tagInfo = new TagInfo("cite", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("cite", tagInfo); tagInfo = new TagInfo("q", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("q", tagInfo); 
tagInfo = new TagInfo("code", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("code", tagInfo); tagInfo = new TagInfo("ins", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); this.put("ins", tagInfo); tagInfo = new TagInfo("del", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); this.put("del", tagInfo); tagInfo = new TagInfo("dfn", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("dfn", tagInfo); tagInfo = new TagInfo("kbd", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("kbd", tagInfo); tagInfo = new TagInfo("pre", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("pre", tagInfo); tagInfo = new TagInfo("samp", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("samp", tagInfo); tagInfo = new TagInfo("listing", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("listing", tagInfo); tagInfo = new TagInfo("var", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("var", tagInfo); tagInfo = new TagInfo("br", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.none); this.put("br", tagInfo); tagInfo = new TagInfo("wbr", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.none); this.put("wbr", tagInfo); tagInfo = new TagInfo("nobr", ContentType.all, 
BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseBeforeTags("nobr"); this.put("nobr", tagInfo); tagInfo = new TagInfo("xmp", ContentType.text, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("xmp", tagInfo); tagInfo = new TagInfo("a", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseBeforeTags("a"); this.put("a", tagInfo); tagInfo = new TagInfo("base", ContentType.none, BelongsTo.HEAD, false, false, false, CloseTag.forbidden, Display.none); this.put("base", tagInfo); tagInfo = new TagInfo("img", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.inline); this.put("img", tagInfo); tagInfo = new TagInfo("area", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.none); tagInfo.defineFatalTags("map"); tagInfo.defineCloseBeforeTags("area"); this.put("area", tagInfo); tagInfo = new TagInfo("map", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); tagInfo.defineCloseBeforeTags("map"); this.put("map", tagInfo); tagInfo = new TagInfo("object", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); this.put("object", tagInfo); tagInfo = new TagInfo("param", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.none); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("param", tagInfo); tagInfo = new TagInfo("applet", ContentType.all, BelongsTo.BODY, true, false, false, CloseTag.required, Display.any); this.put("applet", tagInfo); tagInfo = new TagInfo("xml", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.none); this.put("xml", tagInfo); tagInfo = new TagInfo("ul", ContentType.all, BelongsTo.BODY, false, false, false, 
CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("ul", tagInfo); tagInfo = new TagInfo("ol", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("ol", tagInfo); tagInfo = new TagInfo("li", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("li,p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("li", tagInfo); tagInfo = new TagInfo("dl", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("dl", tagInfo); tagInfo = new TagInfo("dt", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.block); tagInfo.defineCloseBeforeTags("dt,dd"); this.put("dt", tagInfo); tagInfo = new TagInfo("dd", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.block); tagInfo.defineCloseBeforeTags("dt,dd"); this.put("dd", tagInfo); tagInfo = new TagInfo("menu", ContentType.all, BelongsTo.BODY, true, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("menu", tagInfo); tagInfo = new TagInfo("dir", ContentType.all, BelongsTo.BODY, true, false, false, CloseTag.required, Display.block); 
tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("dir", tagInfo); tagInfo = new TagInfo("table", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineAllowedChildrenTags("tr,tbody,thead,tfoot,colgroup,caption"); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("tr,thead,tbody,tfoot,caption,colgroup,table,p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("table", tagInfo); tagInfo = new TagInfo("tr", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.block); tagInfo.defineFatalTags("table"); tagInfo.defineRequiredEnclosingTags("tbody"); tagInfo.defineAllowedChildrenTags("td,th"); tagInfo.defineHigherLevelTags("thead,tfoot"); tagInfo.defineCloseBeforeTags("tr,td,th,caption,colgroup"); this.put("tr", tagInfo); // jericho parser requires tagInfo = new TagInfo("td", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineFatalTags("table"); tagInfo.defineRequiredEnclosingTags("tr"); tagInfo.defineCloseBeforeTags("td,th,caption,colgroup"); this.put("td", tagInfo); tagInfo = new TagInfo("th", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.block); tagInfo.defineFatalTags("table"); tagInfo.defineRequiredEnclosingTags("tr"); tagInfo.defineCloseBeforeTags("td,th,caption,colgroup"); this.put("th", tagInfo); tagInfo = new TagInfo("tbody", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.block); tagInfo.defineFatalTags("table"); tagInfo.defineAllowedChildrenTags("tr,form"); tagInfo.defineCloseBeforeTags("td,th,tr,tbody,thead,tfoot,caption,colgroup"); this.put("tbody", tagInfo); tagInfo = new TagInfo("thead", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, 
Display.block); tagInfo.defineFatalTags("table"); tagInfo.defineAllowedChildrenTags("tr,form"); tagInfo.defineCloseBeforeTags("td,th,tr,tbody,thead,tfoot,caption,colgroup"); this.put("thead", tagInfo); tagInfo = new TagInfo("tfoot", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.block); tagInfo.defineFatalTags("table"); tagInfo.defineAllowedChildrenTags("tr,form"); tagInfo.defineCloseBeforeTags("td,th,tr,tbody,thead,tfoot,caption,colgroup"); this.put("tfoot", tagInfo); tagInfo = new TagInfo("col", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.block); tagInfo.defineFatalTags("colgroup"); this.put("col", tagInfo); tagInfo = new TagInfo("colgroup", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.optional, Display.block); tagInfo.defineFatalTags("table"); tagInfo.defineAllowedChildrenTags("col"); tagInfo.defineCloseBeforeTags("td,th,tr,tbody,thead,tfoot,caption,colgroup"); this.put("colgroup", tagInfo); tagInfo = new TagInfo("caption", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineFatalTags("table"); tagInfo.defineCloseBeforeTags("td,th,tr,tbody,thead,tfoot,caption,colgroup"); this.put("caption", tagInfo); tagInfo = new TagInfo("form", ContentType.all, BelongsTo.BODY, false, false, true, CloseTag.required, Display.block); tagInfo.defineForbiddenTags("form"); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("option,optgroup,textarea,select,fieldset,p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("form", tagInfo); tagInfo = new TagInfo("input", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.inline); tagInfo.defineCloseBeforeTags("select,optgroup,option"); this.put("input", tagInfo); tagInfo = new TagInfo("textarea", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); 
tagInfo.defineCloseBeforeTags("select,optgroup,option"); this.put("textarea", tagInfo); tagInfo = new TagInfo("select", ContentType.all, BelongsTo.BODY, false, false, true, CloseTag.required, Display.inline); tagInfo.defineAllowedChildrenTags("option,optgroup"); tagInfo.defineCloseBeforeTags("option,optgroup,select"); this.put("select", tagInfo); tagInfo = new TagInfo("option", ContentType.text, BelongsTo.BODY, false, false, true, CloseTag.optional, Display.inline); tagInfo.defineFatalTags("select"); tagInfo.defineCloseBeforeTags("option"); this.put("option", tagInfo); tagInfo = new TagInfo("optgroup", ContentType.all, BelongsTo.BODY, false, false, true, CloseTag.required, Display.inline); tagInfo.defineFatalTags("select"); tagInfo.defineAllowedChildrenTags("option"); tagInfo.defineCloseBeforeTags("optgroup"); this.put("optgroup", tagInfo); tagInfo = new TagInfo("button", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any); tagInfo.defineCloseBeforeTags("select,optgroup,option"); this.put("button", tagInfo); tagInfo = new TagInfo("label", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); this.put("label", tagInfo); tagInfo = new TagInfo("legend", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); // // If we include this rule, we get an out-of-memory error. See issue 129. 
// //tagInfo.defineRequiredEnclosingTags("fieldset"); tagInfo.defineAllowedChildrenTags(PHRASING_TAGS); this.put("legend", tagInfo); tagInfo = new TagInfo("fieldset", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("fieldset", tagInfo); tagInfo = new TagInfo("isindex", ContentType.none, BelongsTo.BODY, true, false, false, CloseTag.forbidden, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("isindex", tagInfo); tagInfo = new TagInfo("script", ContentType.all, BelongsTo.HEAD_AND_BODY, false, false, false, CloseTag.required, Display.none); this.put("script", tagInfo); tagInfo = new TagInfo("noscript", ContentType.all, BelongsTo.HEAD_AND_BODY, false, false, false, CloseTag.required, Display.block); this.put("noscript", tagInfo); tagInfo = new TagInfo("b", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("u,i,tt,sub,sup,big,small,strike,blink,s"); this.put("b", tagInfo); tagInfo = new TagInfo("i", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,tt,sub,sup,big,small,strike,blink,s"); this.put("i", tagInfo); tagInfo = new TagInfo("u", ContentType.all, BelongsTo.BODY, true, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,i,tt,sub,sup,big,small,strike,blink,s"); this.put("u", tagInfo); tagInfo = new TagInfo("tt", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,i,sub,sup,big,small,strike,blink,s"); this.put("tt", 
tagInfo); tagInfo = new TagInfo("sub", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,i,tt,sup,big,small,strike,blink,s"); this.put("sub", tagInfo); tagInfo = new TagInfo("sup", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,i,tt,sub,big,small,strike,blink,s"); this.put("sup", tagInfo); tagInfo = new TagInfo("big", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,i,tt,sub,sup,small,strike,blink,s"); this.put("big", tagInfo); tagInfo = new TagInfo("small", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,i,tt,sub,sup,big,strike,blink,s"); this.put("small", tagInfo); tagInfo = new TagInfo("strike", ContentType.all, BelongsTo.BODY, true, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,i,tt,sub,sup,big,small,blink,s"); this.put("strike", tagInfo); tagInfo = new TagInfo("blink", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,i,tt,sub,sup,big,small,strike,s"); this.put("blink", tagInfo); tagInfo = new TagInfo("marquee", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.block); tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS); tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml"); this.put("marquee", tagInfo); tagInfo = new TagInfo("s", ContentType.all, BelongsTo.BODY, true, false, false, CloseTag.required, Display.inline); tagInfo.defineCloseInsideCopyAfterTags("b,u,i,tt,sub,sup,big,small,strike,blink"); this.put("s", tagInfo); tagInfo = new TagInfo("hr", ContentType.none, BelongsTo.BODY, false, false, 
                false, CloseTag.forbidden, Display.block);
        tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS);
        tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml");
        this.put("hr", tagInfo);

        tagInfo = new TagInfo("font", ContentType.all, BelongsTo.BODY, true, false, false, CloseTag.required, Display.inline);
        this.put("font", tagInfo);

        tagInfo = new TagInfo("basefont", ContentType.none, BelongsTo.BODY, true, false, false, CloseTag.forbidden, Display.none);
        this.put("basefont", tagInfo);

        tagInfo = new TagInfo("center", ContentType.all, BelongsTo.BODY, true, false, false, CloseTag.required, Display.block);
        tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS);
        tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml");
        this.put("center", tagInfo);

        tagInfo = new TagInfo("comment", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.none);
        this.put("comment", tagInfo);

        tagInfo = new TagInfo("server", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.none);
        this.put("server", tagInfo);

        tagInfo = new TagInfo("iframe", ContentType.all, BelongsTo.BODY, false, false, false, CloseTag.required, Display.any);
        this.put("iframe", tagInfo);

        tagInfo = new TagInfo("embed", ContentType.none, BelongsTo.BODY, false, false, false, CloseTag.forbidden, Display.block);
        tagInfo.defineCloseBeforeCopyInsideTags(CLOSE_BEFORE_COPY_INSIDE_TAGS);
        tagInfo.defineCloseBeforeTags("p,address,label,abbr,acronym,dfn,kbd,samp,var,cite,code,param,xml");
        this.put("embed", tagInfo);
    }

    /**
     * Registers a tag definition in this provider's map.
     *
     * @param tagName name of the tag, used as the map key
     * @param tagInfo definition to associate with that tag name
     */
    protected void put(String tagName, TagInfo tagInfo) {
        this.tagInfoMap.put(tagName, tagInfo);
    }

    public TagInfo getTagInfo(String tagName) {
        if ( tagName == null) {
            // null named tagNode happens when a html fragment is being dealt with
            return null;
        } else {
            return this.tagInfoMap.get(tagName);
        }
    }
}
src/main/java/org/htmlcleaner/ContentType.java0000644000000000000000000000537112113037735020424 0ustar rootroot
/*
    Redistribution and use of this software in source and binary forms,
    with or without modification, are permitted provided that the following
    conditions are met:

    * Redistributions of source code must retain the above
      copyright notice, this list of conditions and the
      following disclaimer.

    * Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the
      following disclaimer in the documentation and/or other
      materials provided with the distribution.

    * The name of HtmlCleaner may not be used to endorse or promote
      products derived from this software without specific prior
      written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.

    You can contact Vladimir Nikic by sending e-mail to
    nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the
    subject line.
*/

package org.htmlcleaner;

/**
 * @author patmoore
 *
 */
public enum ContentType {
    all("all"),
    /**
     * Elements that have no children or content ( for example ). For these elements,
     * the check for null elements must be more than just a children/content check.
*/ none("none"), text("text"); private final String dbCode; private ContentType(String dbCode) { this.dbCode =dbCode; } /** * @return the dbCode */ public String getDbCode() { return dbCode; } public static ContentType toValue(Object value) { ContentType result = null; if ( value instanceof ContentType) { result = (ContentType) value; } else if ( value != null ) { String dbCode = value.toString().trim(); for(ContentType contentType: ContentType.values()) { if ( contentType.getDbCode().equalsIgnoreCase(dbCode) || contentType.name().equalsIgnoreCase(dbCode)) { result = contentType; break; } } } return result; } } src/main/java/org/htmlcleaner/ContentNode.java0000644000000000000000000000473712634514541020400 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import java.io.IOException; import java.io.Writer; /** *
        HTML text token.
        */ public class ContentNode extends BaseTokenImpl implements HtmlNode { protected final String content; protected final boolean blank; public ContentNode(String content) { this.content = content; this.blank = Utils.isEmptyString(this.content); } public String getContent() { return content; } @Override public String toString() { return getContent(); } public void serialize(Serializer serializer, Writer writer) throws IOException { writer.write( getContent() ); } public boolean isBlank() { return this.blank; } }src/main/java/org/htmlcleaner/ConfigFileTagProvider.java0000644000000000000000000002723012200742042022311 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import org.xml.sax.Attributes; import org.xml.sax.InputSource; import org.xml.sax.SAXException; import org.xml.sax.helpers.DefaultHandler; import javax.xml.parsers.ParserConfigurationException; import javax.xml.parsers.SAXParser; import javax.xml.parsers.SAXParserFactory; import java.io.*; import java.util.HashMap; import java.util.Map; import java.net.URL; /** * Configuration file tag provider - reads XML file in specified format and creates a Tag Provider. * Used to create custom tag providers when used on the command line. 
*/ public class ConfigFileTagProvider extends HashMap implements ITagInfoProvider { // obtaining instance of the SAX parser factory static SAXParserFactory parserFactory = SAXParserFactory.newInstance(); static { parserFactory.setValidating(false); parserFactory.setNamespaceAware(false); } // tells whether to generate code of the tag provider class based on XML configuration file // to the standard output private boolean generateCode = false; private ConfigFileTagProvider() { } public ConfigFileTagProvider(InputSource inputSource) { try { new ConfigParser(this).parse(inputSource); } catch (Exception e) { throw new HtmlCleanerException("Error parsing tag configuration file!", e); } } public ConfigFileTagProvider(File file) { try { new ConfigParser(this).parse(new InputSource(new FileReader(file))); } catch (Exception e) { throw new HtmlCleanerException("Error parsing tag configuration file!", e); } } public ConfigFileTagProvider(URL url) { try { Object content = url.getContent(); if (content instanceof InputStream) { InputStreamReader reader = new InputStreamReader((InputStream)content); new ConfigParser(this).parse(new InputSource(reader)); } } catch (Exception e) { throw new HtmlCleanerException("Error parsing tag configuration file!", e); } } public TagInfo getTagInfo(String tagName) { return (TagInfo) get(tagName); } /** * Generates code for tag provider class from specified configuration XML file. * In order to create custom tag info provider, make config file and call this main method * with the specified file. Output will be generated on the standard output. This way a custom * tag provider (class CustomTagProvider) is generated from an XML file. An example XML file, * "example.xml", can be found in the source distribution. 
* * @param args * @throws IOException * @throws SAXException * @throws ParserConfigurationException */ public static void main(String[] args) throws IOException, SAXException, ParserConfigurationException { final ConfigFileTagProvider provider = new ConfigFileTagProvider(); provider.generateCode = true; String fileName = "default.xml"; if (args != null && args.length>0){ fileName = args[0]; } File configFile = new File(fileName); String packagePath = "org.htmlcleaner"; String className = "CustomTagProvider"; final ConfigParser parser = provider.new ConfigParser(provider); System.out.println("package " + packagePath + ";"); System.out.println("import java.util.HashMap;"); System.out.println("public class " + className + " extends HashMap implements ITagInfoProvider {"); System.out.println("private ConcurrentMap tagInfoMap = new ConcurrentHashMap();"); System.out.println("// singleton instance, used if no other TagInfoProvider is specified"); System.out.println("public final static "+className+" INSTANCE= new "+className+"();"); System.out.println("public " + className + "() {"); System.out.println("TagInfo tagInfo;"); parser.parse( new InputSource(new FileReader(configFile)) ); System.out.println("}"); System.out.println("}"); } /** * SAX parser for tag configuration files. 
*/ private class ConfigParser extends DefaultHandler { private TagInfo tagInfo = null; private String dependencyName = null; private Map tagInfoMap; ConfigParser(Map tagInfoMap) { this.tagInfoMap = tagInfoMap; } public void parse(InputSource in) throws ParserConfigurationException, SAXException, IOException { SAXParser parser = parserFactory.newSAXParser(); parser.parse(in, this); } @Override public void characters(char[] ch, int start, int length) throws SAXException { if (tagInfo != null) { String value = new String(ch, start, length).trim(); if ( "fatal-tags".equals(dependencyName) ) { tagInfo.defineFatalTags(value); if (generateCode) { System.out.println("tagInfo.defineFatalTags(\"" + value + "\");"); } } else if ( "req-enclosing-tags".equals(dependencyName) ) { tagInfo.defineRequiredEnclosingTags(value); if (generateCode) { System.out.println("tagInfo.defineRequiredEnclosingTags(\"" + value + "\");"); } } else if ( "forbidden-tags".equals(dependencyName) ) { tagInfo.defineForbiddenTags(value); if (generateCode) { System.out.println("tagInfo.defineForbiddenTags(\"" + value + "\");"); } } else if ( "allowed-children-tags".equals(dependencyName) ) { tagInfo.defineAllowedChildrenTags(value); if (generateCode) { System.out.println("tagInfo.defineAllowedChildrenTags(\"" + value + "\");"); } } else if ( "higher-level-tags".equals(dependencyName) ) { tagInfo.defineHigherLevelTags(value); if (generateCode) { System.out.println("tagInfo.defineHigherLevelTags(\"" + value + "\");"); } } else if ( "close-before-copy-inside-tags".equals(dependencyName) ) { tagInfo.defineCloseBeforeCopyInsideTags(value); if (generateCode) { System.out.println("tagInfo.defineCloseBeforeCopyInsideTags(\"" + value + "\");"); } } else if ( "close-inside-copy-after-tags".equals(dependencyName) ) { tagInfo.defineCloseInsideCopyAfterTags(value); if (generateCode) { System.out.println("tagInfo.defineCloseInsideCopyAfterTags(\"" + value + "\");"); } } else if ( 
"close-before-tags".equals(dependencyName) ) { tagInfo.defineCloseBeforeTags(value); if (generateCode) { System.out.println("tagInfo.defineCloseBeforeTags(\"" + value + "\");"); } } } } @Override public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException { if ( "tag".equals(qName) ) { String name = attributes.getValue("name"); String content = attributes.getValue("content"); String section = attributes.getValue("section"); String deprecated = attributes.getValue("deprecated"); String unique = attributes.getValue("unique"); String ignorePermitted = attributes.getValue("ignore-permitted"); ContentType contentType = ContentType.toValue(content); BelongsTo belongsTo = BelongsTo.toValue(section); tagInfo = new TagInfo(name, contentType, belongsTo, deprecated != null && "true".equals(deprecated), unique != null && "true".equals(unique), ignorePermitted != null && "true".equals(ignorePermitted), CloseTag.required, Display.any ); if (generateCode) { String s = "tagInfo = new TagInfo(\"#1\", #2, #3, #4, #5, #6);"; s = s.replaceAll("#1", name); s = s.replaceAll("#2", ContentType.class.getCanonicalName()+"."+contentType.name()); s = s.replaceAll("#3", BelongsTo.class.getCanonicalName()+"."+belongsTo.name()); s = s.replaceAll("#4", Boolean.toString(deprecated != null && "true".equals(deprecated))); s = s.replaceAll("#5", Boolean.toString(unique != null && "true".equals(unique))); s = s.replaceAll("#6", Boolean.toString(ignorePermitted != null && "true".equals(ignorePermitted))); System.out.println(s); } } else if ( !"tags".equals(qName) ) { dependencyName = qName; } } @Override public void endElement(String uri, String localName, String qName) throws SAXException { if ( "tag".equals(qName) ) { if (tagInfo != null) { tagInfoMap.put(tagInfo.getName(), tagInfo); if (generateCode) { System.out.println("this.put(\"" + tagInfo.getName() + "\", tagInfo);\n"); } } tagInfo = null; } else if ( !"tags".equals(qName) ) { dependencyName 
= null; } } } }src/main/java/org/htmlcleaner/conditional/0000755000000000000000000000000013105122453017574 5ustar rootrootsrc/main/java/org/htmlcleaner/conditional/TagNodeNameCondition.java0000644000000000000000000000065512113037735024444 0ustar rootrootpackage org.htmlcleaner.conditional; import org.htmlcleaner.TagNode; /** * Checks if node has specified name. */ public class TagNodeNameCondition implements ITagNodeCondition { private String name; public TagNodeNameCondition(String name) { this.name = name; } public boolean satisfy(TagNode tagNode) { return tagNode == null ? false : tagNode.getName().equalsIgnoreCase(this.name); } }src/main/java/org/htmlcleaner/conditional/TagNodeInsignificantBrCondition.java0000644000000000000000000000224012113322304026612 0ustar rootrootpackage org.htmlcleaner.conditional; import java.util.List; import org.htmlcleaner.TagNode; /** * Checks if node is an insignificant br tag -- is placed at the end or at the * start of a block. * * @author Konstantin Burov (aectann@gmail.com) */ public class TagNodeInsignificantBrCondition implements ITagNodeCondition { private static final String BR_TAG = "br"; public TagNodeInsignificantBrCondition() { } public boolean satisfy(TagNode tagNode) { if (!isBrNode(tagNode)) { return false; } TagNode parent = tagNode.getParent(); List children = parent.getAllChildren(); int brIndex = children.indexOf(tagNode); return checkSublist(0, brIndex, children) || checkSublist (brIndex, children.size(), children); } private boolean isBrNode(TagNode tagNode) { return tagNode != null && BR_TAG.equals(tagNode.getName()); } private boolean checkSublist(int start, int end, List list) { List sublist = list.subList(start, end); for (Object object : sublist) { if(!(object instanceof TagNode)){ return false; } TagNode node = (TagNode) object; if(!isBrNode(node)&&!node.isPruned()){ return false; } } return true; } } 
src/main/java/org/htmlcleaner/conditional/TagNodeEmptyContentCondition.java0000644000000000000000000000666012113322304026204 0ustar rootrootpackage org.htmlcleaner.conditional; import java.util.HashSet; import java.util.Map; import java.util.Set; import org.htmlcleaner.ContentNode; import org.htmlcleaner.ITagInfoProvider; import org.htmlcleaner.TagInfo; import org.htmlcleaner.TagNode; import static org.htmlcleaner.Utils.isEmptyString; import static org.htmlcleaner.Display.*; /** * Checks if node is an inline or block element and has empty contents or white/non-breakable spaces only. Nodes that have * a non-empty id attribute are considered to be non-empty, since they can be used in javascript scenarios. * * Examples that should be pruned: *
         *
         * Examples of code that should NOT be pruned:
         *
         * - no content but image tags do not have text content.
         * hi - the first (empty) td is a placeholder so the second td is in the correct column
         *
        * @author Konstantin Burov */ public class TagNodeEmptyContentCondition implements ITagNodeCondition { private static final String ID_ATTRIBUTE_NAME = "id"; /** * Removal of element from this set can affect layout too hard. */ private static final Set < String > unsafeBlockElements = new HashSet < String >(); static { // cannot just remove a td unless removing the entire row. td's are place holders unsafeBlockElements.add("td"); unsafeBlockElements.add("th"); } private ITagInfoProvider tagInfoProvider; public TagNodeEmptyContentCondition(ITagInfoProvider provider) { this.tagInfoProvider = provider; } public boolean satisfy(TagNode tagNode) { return satisfy(tagNode, false); } private boolean satisfy(TagNode tagNode, boolean override) { String name = tagNode.getName(); TagInfo tagInfo = tagInfoProvider.getTagInfo(name); //Only _block_ elements can match. if (tagInfo != null && !hasIdAttributeSet(tagNode) && none != tagInfo.getDisplay() && !tagInfo.isEmptyTag() && (override || !unsafeBlockElements.contains(name))) { CharSequence contentString = tagNode.getText(); if(isEmptyString(contentString)) { // even though there may be no text need to make sure all children are empty or can be pruned if (tagNode.isEmpty()) { return true; } else { for(Object child: tagNode.getAllChildren()) { // TODO : similar check as in tagNode.isEmpty() argues for a visitor pattern // but allow empty td, ths to be pruned. 
if ( child instanceof TagNode) { if (!satisfy((TagNode)child, true)) { return false; } } else if (child instanceof ContentNode ) { if ( !((ContentNode)child).isBlank()) { return false; } } else { return false; } } return true; } } } return false; } private boolean hasIdAttributeSet(TagNode tagNode) { Map < String, String > attributes = tagNode.getAttributes(); return !isEmptyString(attributes.get(ID_ATTRIBUTE_NAME)); } }src/main/java/org/htmlcleaner/conditional/TagNodeAutoGeneratedCondition.java0000644000000000000000000000136012113037735026305 0ustar rootrootpackage org.htmlcleaner.conditional; import org.htmlcleaner.TagNode; /** * Remove empty autogenerated nodes. These nodes are created when an unclosed tag is immediately closed. * @author patmoore * */ public class TagNodeAutoGeneratedCondition implements ITagNodeCondition { public static final TagNodeAutoGeneratedCondition INSTANCE = new TagNodeAutoGeneratedCondition(); /** * @see org.htmlcleaner.conditional.ITagNodeCondition#satisfy(org.htmlcleaner.TagNode) */ public boolean satisfy(TagNode tagNode) { // auto-generated node that is not needed. return tagNode.isAutoGenerated() && tagNode.isEmpty(); } @Override public String toString() { return "auto generated tagNode"; } } src/main/java/org/htmlcleaner/conditional/TagNodeAttValueCondition.java0000644000000000000000000000160412113037735025304 0ustar rootrootpackage org.htmlcleaner.conditional; import org.htmlcleaner.TagNode; /** * Checks if node has specified attribute with specified value. 
*/ public class TagNodeAttValueCondition implements ITagNodeCondition { private String attName; private String attValue; private boolean isCaseSensitive; public TagNodeAttValueCondition(String attName, String attValue, boolean isCaseSensitive) { this.attName = attName; this.attValue = attValue; this.isCaseSensitive = isCaseSensitive; } public boolean satisfy(TagNode tagNode) { if (tagNode == null || attName == null || attValue == null) { return false; } else { return isCaseSensitive ? attValue.equals( tagNode.getAttributeByName(attName) ) : attValue.equalsIgnoreCase( tagNode.getAttributeByName(attName) ); } } }src/main/java/org/htmlcleaner/conditional/TagNodeAttNameValueRegexCondition.java0000644000000000000000000000171412113037735027102 0ustar rootrootpackage org.htmlcleaner.conditional; import java.util.Map; import java.util.regex.Pattern; import org.htmlcleaner.TagNode; /** * Checks if node has an attribute whose name and value match the specified regular expressions. A null pattern matches anything. */ public class TagNodeAttNameValueRegexCondition implements ITagNodeCondition { private Pattern attNameRegex; private Pattern attValueRegex; public TagNodeAttNameValueRegexCondition(Pattern attNameRegex, Pattern attValueRegex) { this.attNameRegex = attNameRegex; this.attValueRegex = attValueRegex; } public boolean satisfy(TagNode tagNode) { if (tagNode != null ) { for(Map.Entry<String, String> entry: tagNode.getAttributes().entrySet()) { if ( (attNameRegex == null || attNameRegex.matcher(entry.getKey()).find()) && (attValueRegex == null || attValueRegex.matcher( entry.getValue() ).find())) { return true; } } } return false; } }src/main/java/org/htmlcleaner/conditional/TagNodeAttExistsCondition.java0000644000000000000000000000073412113037735025512 0ustar rootrootpackage org.htmlcleaner.conditional; import org.htmlcleaner.TagNode; /** * Checks if node contains specified attribute. 
*/ public class TagNodeAttExistsCondition implements ITagNodeCondition { private String attName; public TagNodeAttExistsCondition(String attName) { this.attName = attName; } public boolean satisfy(TagNode tagNode) { return tagNode == null ? false : tagNode.getAttributes().containsKey( attName.toLowerCase() ); } }src/main/java/org/htmlcleaner/conditional/TagAllCondition.java0000644000000000000000000000034312113037735023460 0ustar rootrootpackage org.htmlcleaner.conditional; import org.htmlcleaner.TagNode; /** * All nodes. */ public class TagAllCondition implements ITagNodeCondition { public boolean satisfy(TagNode tagNode) { return true; } }src/main/java/org/htmlcleaner/conditional/ITagNodeCondition.java0000644000000000000000000000031712113037735023747 0ustar rootrootpackage org.htmlcleaner.conditional; import org.htmlcleaner.TagNode; /** * Used as base for different node checkers. */ public interface ITagNodeCondition { public boolean satisfy(TagNode tagNode); }src/main/java/org/htmlcleaner/CompactXmlSerializer.java0000644000000000000000000000773412234166103022252 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. 
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import java.io.IOException; import java.io.Writer; import java.util.*; /** *
        Compact XML serializer - creates resulting XML by stripping whitespace.
        */ public class CompactXmlSerializer extends XmlSerializer { public CompactXmlSerializer(CleanerProperties props) { super(props); } @Override protected void serialize(TagNode tagNode, Writer writer) throws IOException { serializeOpenTag(tagNode, writer, false); List tagChildren = tagNode.getAllChildren(); if ( !isMinimizedTagSyntax(tagNode) ) { ListIterator childrenIt = tagChildren.listIterator(); while ( childrenIt.hasNext() ) { Object item = childrenIt.next(); if (item != null) { if ( item instanceof ContentNode ) { String content = ((ContentNode) item).getContent().trim(); writer.write( dontEscape(tagNode) ? content.replaceAll("]]>", "]]&gt;") : escapeXml(content) ); if (childrenIt.hasNext()) { if ( !isWhitespaceString(childrenIt.next()) ) { writer.write("\n"); } childrenIt.previous(); } } else if (item instanceof CommentNode) { String content = ((CommentNode) item).getCommentedContent().trim(); writer.write(content); } else { ((BaseToken)item).serialize(this, writer); } } } serializeEndTag(tagNode, writer, false); } } /** * Checks whether the specified object's string representation is an empty string (consisting only of whitespace). * @param object Object whose string representation is checked * @return true, if empty string, false otherwise */ private boolean isWhitespaceString(Object object) { if (object != null) { String s = object.toString(); return s != null && "".equals(s.trim()); } return false; } }src/main/java/org/htmlcleaner/CompactHtmlSerializer.java0000644000000000000000000001067012234166103022407 0ustar rootroot/* Copyright (c) 2006-2013, HtmlCleaner project All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import java.io.*; import java.util.*; /** *
        Compact HTML serializer - creates resulting HTML by stripping whitespace wherever possible.
        */ public class CompactHtmlSerializer extends HtmlSerializer { private int openPreTags = 0; public CompactHtmlSerializer(CleanerProperties props) { super(props); } protected void serialize(TagNode tagNode, Writer writer) throws IOException { boolean isPreTag = "pre".equalsIgnoreCase(tagNode.getName()); if (isPreTag) { openPreTags++; } serializeOpenTag(tagNode, writer, false); List tagChildren = tagNode.getAllChildren(); if ( !isMinimizedTagSyntax(tagNode) ) { ListIterator childrenIt = tagChildren.listIterator(); while ( childrenIt.hasNext() ) { Object item = childrenIt.next(); if (item instanceof ContentNode) { String content = item.toString(); if (openPreTags > 0) { writer.write(content); } else { boolean startsWithSpace = content.length() > 0 && Character.isWhitespace( content.charAt(0) ); boolean endsWithSpace = content.length() > 1 && Character.isWhitespace( content.charAt(content.length() - 1) ); content = dontEscape(tagNode) ? content.trim() : escapeText(content.trim()); if (startsWithSpace) { writer.write(' '); } if (content.length() != 0) { writer.write(content); if (endsWithSpace) { writer.write(' '); } } if (childrenIt.hasNext()) { if ( !Utils.isWhitespaceString(childrenIt.next()) ) { writer.write("\n"); } childrenIt.previous(); } } } else if (item instanceof CommentNode) { String content = ((CommentNode) item).getCommentedContent().trim(); writer.write(content); } else if (item instanceof BaseToken) { ((BaseToken)item).serialize(this, writer); } } serializeEndTag(tagNode, writer, false); if (isPreTag) { openPreTags--; } } } }src/main/java/org/htmlcleaner/CommentNode.java0000644000000000000000000000465212113037735020361 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. 
Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact Vladimir Nikic by sending e-mail to nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the subject line. */ package org.htmlcleaner; import java.io.IOException; import java.io.Writer; /** *
        HTML comment token.
        */ public class CommentNode extends BaseTokenImpl implements HtmlNode { private String content; public CommentNode(String content) { this.content = content; } public String getCommentedContent() { return "<!--" + content + "-->"; } public String getContent() { return content; } @Override public String toString() { return getCommentedContent(); } public void serialize(Serializer serializer, Writer writer) throws IOException { writer.write( getCommentedContent() ); } }src/main/java/org/htmlcleaner/CommandLine.java0000644000000000000000000004040113100076320020315 0ustar rootroot/* Copyright (c) 2006-2007, Vladimir Nikic All rights reserved. Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * The name of HtmlCleaner may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
    IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
    ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
    (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
    LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
    ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
    SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

    You can contact Vladimir Nikic by sending e-mail to
    nikic_vladimir@yahoo.com. Please include the word "HtmlCleaner" in the
    subject line.
*/

package org.htmlcleaner;

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.net.URL;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;
import java.util.logging.Logger;

import org.htmlcleaner.audit.HtmlModificationListenerLogger;

/**
 * Command line usage class.
 */
public class CommandLine {

    private static final String OMITXMLDECL = "omitxmldecl";

    /**
     * If the specified argument name exists without a value, return true.
     * If it exists with a value, translate it as a boolean.
     * @param args the command line arguments
     * @param name the switch name
     * @return true, or false, depending on whether the switch has been specified
     */
    private static boolean getSwitchArgument(String[] args, String name) {
        boolean value = false;
        for (String curr : args) {
            int eqIndex = curr.indexOf('=');
            if (eqIndex >= 0) {
                String argName = curr.substring(0, eqIndex).trim();
                String argValue = curr.substring(eqIndex+1).trim();
                if (argName.toLowerCase().startsWith(name.toLowerCase())) {
                    value = toBoolean(argValue);
                }
            } else if (curr.trim().toLowerCase().startsWith(name.toLowerCase())) {
                // A bare switch (no '=') enables the option only if it names this argument.
                value = true;
            }
        }
        return value;
    }

    private static String getArgValue(String[] args, String name, String defaultValue) {
        for (String curr : args) {
            int eqIndex = curr.indexOf('=');
            if (eqIndex >= 0) {
                String argName = curr.substring(0, eqIndex).trim();
                String argValue = curr.substring(eqIndex+1).trim();
                if (argName.toLowerCase().startsWith(name.toLowerCase())) {
                    return argValue;
                }
            }
        }
        return defaultValue;
    }

    private static boolean toBoolean(String s) {
        return s != null && ( "on".equalsIgnoreCase(s) || "true".equalsIgnoreCase(s) || "yes".equalsIgnoreCase(s) );
    }

    private final static String className = CommandLine.class.getName();
    private final static Logger logger = Logger.getLogger(className);

    public static void main(String[] args) throws IOException, XPatherException {
        String source = getArgValue(args, "src", "");
        Scanner scan = new Scanner(System.in);

        String s = "";
        if ( "".equals(source) ) {
            // No src argument: read the whole input from standard input.
            while (scan.hasNext()) {
                s += scan.nextLine();
            }
            if (s.compareTo("") != 0) {
                System.err.println("Output:");
            } else {
                System.err.println("Usage: java -jar htmlcleanerXX.jar src= [htmlver=4] [incharset=] " +
                        "[dest=] [outcharset=] [taginfofile=] [options...]");
                System.err.println("Alternative: java -jar htmlcleanerXX.jar (reads the input from console)");
                System.err.println("");
                System.err.println("where options include:");
                System.err.println(" outputtype=simple* | compact | browser-compact | pretty");
                System.err.println(" advancedxmlescape=true* | false");
                System.err.println(" usecdata=true* | false");
                System.err.println(" usecdatafor= [script,style]");
                System.err.println(" specialentities=true* | false");
                System.err.println(" unicodechars=true* | false");
                System.err.println(" omitunknowntags=true | false*");
                System.err.println(" treatunknowntagsascontent=true | false*");
                System.err.println(" omitdeprtags=true | false*");
                System.err.println(" treatdeprtagsascontent=true | false*");
                System.err.println(" omitcomments=true | false*");
                System.err.println(" " + OMITXMLDECL + "=true* | false");
                System.err.println(" omitdoctypedecl=true* | false");
                System.err.println(" omithtmlenvelope=true | false*");
                System.err.println(" useemptyelementtags=true* | false");
                System.err.println(" allowmultiwordattributes=true* | false");
                System.err.println(" allowhtmlinsideattributes=true | false*");
                System.err.println(" ignoreqe=true | false*");
                System.err.println(" namespacesaware=true* | false");
                System.err.println(" hyphenreplacement= [=]");
                System.err.println(" prunetags= []");
                System.err.println(" booleanatts=self* | empty | true");
                System.err.println(" nodebyxpath=");
                System.err.println(" allowinvalidxmlattributenames=true | false*");
                System.err.println(" invalidxmlattributenameprefix= []");
                System.err.println(" t:[=[,]]");
                System.err.println(" t:.[=