Thursday, 7 June 2012

How to Identify Similar Content on Two Web Pages?

Hi,
Suppose I have two websites, and the same text content and image appear on a page of each. I want to find out whether the content and the image are repeated. I run a forum where I give solutions to my customers, and I don't want this data shared with any other website. If someone copies my content or uses my images on their site, I want to detect it. How can I identify similar content on two web pages?
#2
21-01-2010
Jackson2
Member

Join Date: Apr 2008
Posts: 2,265
Re: How to Identify Similar Content on Two Web Pages?
Yes, it is possible. For that you have to use PHP Simple HTML DOM Parser.
After installing HTML DOM Parser on your server, use the following PHP code:
Quote:
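include 'simple_html_dom.php'; // PHP Simple HTML DOM Parser

$html_a = file_get_html('here type your first web site web address');
$html_b = file_get_html('here type your second web site web address');

// Find all images and content.
// Note: 'content' only matches <content> tags; replace it with the
// selector of the element that actually holds your text.
$img_a = $html_a->find('img');
$con_a = $html_a->find('content');
$img_b = $html_b->find('img');
$con_b = $html_b->find('content');

// Collect the src of every image on the second page
$src_b = array();
foreach ($img_b as $element_b) {
    $src_b[] = $element_b->src;
}

// Compare src strings: element objects from two different pages never match
foreach ($img_a as $element_a) {
    if (in_array($element_a->src, $src_b)) {
        echo $element_a->src . " is on both sites\n";
    }
}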
With this code, I think you should be able to solve your problem.
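The code above only compares images. For the text itself, one possible approach is to strip the markup and compare the remaining text with PHP's built-in similar_text(). The sketch below is only an illustration under that assumption: page_text() and text_similarity() are hypothetical helper names, and the inline sample strings stand in for the two downloaded pages.

```php
<?php
// Hypothetical helper: reduces an HTML fragment to plain text with
// runs of whitespace collapsed to single spaces.
function page_text(string $html): string
{
    $text = html_entity_decode(strip_tags($html));
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical helper: similarity of the visible text of two pages,
// as a percentage between 0 and 100.
function text_similarity(string $html_a, string $html_b): float
{
    $percent = 0.0;
    similar_text(page_text($html_a), page_text($html_b), $percent);
    return $percent;
}

// Inline sample markup stands in for the two downloaded pages.
$page_a = '<div><p>Restart the router, then clear the cache.</p></div>';
$page_b = '<article><p>Restart the   router, then clear the cache.</p></article>';
$page_c = '<p>Completely unrelated text about gardening.</p>';

printf("a vs b: %.1f%%\n", text_similarity($page_a, $page_b));
printf("a vs c: %.1f%%\n", text_similarity($page_a, $page_c));
```

Pages a and b carry the same text in different markup, so they score 100%. Which threshold counts as "copied" is a judgment call that depends on your content, and similar_text() gets slow on very long pages.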
#3
21-01-2010
johnson22
Member

Join Date: May 2008
Posts: 2,111
Re: How to Identify Similar Content on Two Web Pages?
Create two DOMDocument objects, one for each file. Call DOMDocument::loadHTMLFile() to import an HTML file.
Quote:
$doc = new DOMDocument();
$doc->loadHTMLFile('./firstwebsitepage.html');
Create two instances of DOMXPath, one for each document. XPath helps you find XML/HTML elements: calling DOMXPath::query() with the expression '//img' returns all image elements. You could use the DOMDocument::getElementsByTagName() method for something this simple, but you will need the extra power of XPath later to find image tags based on their src attribute.
Quote:
$sspath = new DOMXPath($doc); // An instance of DOMXPath attached to a DOMDocument
$imagelst = $sspath->query('//img'); // A DOMNodeList containing all <img> tags

Loop through the DOMNodeList returned by the query and append each image's src attribute to an array. Use the DOMElement::getAttribute() method to read the attribute.
Quote:
$srcList = array(); // An array to hold src attribute strings
foreach ($imagelst as $img) { // Loops through the DOMNodeList of images
    $srcList[] = $img->getAttribute('src'); // Stores the src attribute of each image
}
This code stores all the src values in the $srcList array. Using this array, you can find the images that both pages share.
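Since the same steps have to run once per page, they can be wrapped in a function. This is a minimal sketch, not part of the original answer: collect_image_srcs() is a hypothetical name, and it loads an inline HTML string with DOMDocument::loadHTML() instead of a file so that it runs on its own.

```php
<?php
// Hypothetical helper: returns the src attribute of every <img> in an HTML string.
function collect_image_srcs(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $srcList = array();
    // '//img[@src]' skips images that have no src attribute at all
    foreach ($xpath->query('//img[@src]') as $img) {
        $srcList[] = $img->getAttribute('src');
    }
    return $srcList;
}

// Made-up sample page
$sample = '<html><body><img src="logo.png"><p>Hi</p>'
        . '<img src="photo.jpg"><img alt="no source"></body></html>';
print_r(collect_image_srcs($sample));
```

Calling it once per page gives the two src arrays that the comparison step needs.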
#4
21-01-2010
Trio
Member

Join Date: May 2008
Posts: 2,752
Re: How to Identify Similar Content on Two Web Pages?
Here is how to use the variables above to scan for similar content. Build the src arrays for both pages: $srcList from the first document and $srcList2, built the same way, from the second. Then use array_intersect(), which returns an array of the elements of one array that also appear in the other, and array_unique(), which removes any duplicate items. This finds image-based duplication:
Quote:
$srccom = array_unique(array_intersect($srcList, $srcList2));
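To see what that line produces, here is a small worked example with made-up src lists. The array_values() call is an addition for illustration: array_intersect() and array_unique() both preserve the keys of the first array, so renumbering keeps the result tidy.

```php
<?php
// Made-up src lists standing in for $srcList and $srcList2.
$srcList  = array('logo.png', 'photo.jpg', 'logo.png', 'banner.gif');
$srcList2 = array('photo.jpg', 'logo.png', 'other.png');

// array_intersect() keeps the values of the first array that also occur
// in the second, including repeats, and preserves their keys.
$shared = array_intersect($srcList, $srcList2);

// array_unique() drops the repeats; array_values() renumbers the keys from 0.
$srccom = array_values(array_unique($shared));

print_r($srccom);
```

Here $srccom ends up holding 'logo.png' and 'photo.jpg'. Note that the comparison is on the exact src string, so the same image referenced by a relative path on one page and an absolute URL on the other will not match.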
Next, add the following code, which uses the DOMNodeList of matching images to blank out each duplicate:
Quote:
foreach ($srccom as $src) {
    // Find every image whose src matches this shared value
    $imgs = $sspath->query('//img[@src="'.$src.'"]');
    foreach ($imgs as $img) {
        $img->setAttribute('src', '');
        $img->setAttribute('alt', 'Deleted');
    }
}
Alternatively, you can delete the image entirely with the removeChild() method of the image's parent node.
Quote:
$img->parentNode->removeChild($img); // Deletes the image element
Use the following code to output the modified document:
Quote:
echo $doc->saveHTML();
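Putting the removal and output steps together, a self-contained sketch might look like the following. remove_shared_images() is a hypothetical name, and the page is an inline string rather than a loaded file. Unlike the snippet above, it compares src values in PHP instead of building them into the XPath expression, which avoids trouble when a src contains a double quote.

```php
<?php
// Hypothetical helper: removes every <img> whose src appears in
// $sharedSrcs and returns the resulting HTML.
function remove_shared_images(string $html, array $sharedSrcs): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about imperfect HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array before removing nodes from the document.
    $images = iterator_to_array($xpath->query('//img[@src]'));
    foreach ($images as $img) {
        if (in_array($img->getAttribute('src'), $sharedSrcs, true)) {
            $img->parentNode->removeChild($img); // Deletes the image element
        }
    }
    return $doc->saveHTML();
}

// Made-up sample page and shared src list
$page = '<html><body><img src="logo.png"><p>Text</p>'
      . '<img src="mine.png"></body></html>';
echo remove_shared_images($page, array('logo.png'));
```

The output keeps mine.png and the paragraph but no longer contains logo.png. To write the result to disk instead of printing it, DOMDocument::saveHTMLFile() can be used inside the function.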
