Sitemaps

From UA Libraries Digital Services Planning and Documentation
Latest revision as of 16:47, 18 May 2015

Sitemaps are a way of telling web search engine crawlers where to find the content on your site that you want them to index. Crawlers have no way of constructing the URLs that your database or delivery system creates on the fly to provide access to online materials. Database content is therefore like a black hole on the web: without help, it will not appear in the results of web search engines such as Google.


A single sitemap cannot contain more than 50,000 URLs and must be no larger than 50 MB uncompressed. If you have multiple sitemaps, you need a sitemap index file that lists them all; that index file, rather than the individual sitemaps, is what you submit to the search engine for indexing.
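
As an illustration of the URL limit described above, here is a minimal Python sketch that splits a list of URLs into sitemap-sized groups and reports whether an index file would be needed. The function names are invented for this example and the example.org URLs in the usage note are placeholders; this is not the makeSiteMap script, and it checks only the URL count, not the 50 MB size limit.

```python
from typing import Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit stated above


def chunk_urls(urls: List[str], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[str]]:
    """Yield successive slices of `urls`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def needs_index(urls: List[str], size: int = MAX_URLS_PER_SITEMAP) -> bool:
    """Return True when the URLs do not fit in one sitemap, so an index file is needed."""
    return len(urls) > size
```

For example, `list(chunk_urls(urls))` on 120,001 placeholder URLs would give three groups, each of which would become its own sitemap file listed in the index.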


Our sitemaps are automatically regenerated once a month (by makeSiteMap in /srv/scripts/sitemaps/), using the file date for the <lastmod> value. All our entries are listed as changing "yearly", since the next more frequent option is "monthly" and they are rarely updated that often. The <priority> value is highest for finding aids and lowest for mass-digitized content (as that has little metadata to index).
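
To show how a file date can become a <lastmod> value, here is a short Python sketch. It assumes the file's modification time is the "file date" meant above and that times are written in UTC, as in the examples on this page; the function name is invented for this example and is not taken from makeSiteMap.

```python
import os
from datetime import datetime, timezone


def lastmod_for(path: str) -> str:
    """Return the file's modification time as a W3C datetime string in UTC."""
    mtime = os.path.getmtime(path)  # seconds since the epoch
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    # timespec="seconds" drops fractional seconds, matching the format used in our sitemaps
    return stamp.isoformat(timespec="seconds")
```

The returned string has the same shape as the <lastmod> values shown on this page, for example 2015-03-01T07:10:16+00:00.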


Our sitemaps are located in /srv/www/htdocs/acumen/sitemaps and /srv/www/htdocs/sitemaps/, with corresponding sitemapIndex files in the directory just above these locations (visible via the web at http://acumen.lib.ua.edu/sitemapIndex.xml and http://libcontent.lib.ua.edu/sitemapIndex.xml).


One is for Acumen, and the other for libcontent, but they contain the same links.

Here's what a sitemap index file looks like:

 <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
     <loc>http://acumen.lib.ua.edu/sitemaps/sitemap1.xml</loc>
     <lastmod>2015-03-01T07:10:16+00:00</lastmod>
   </sitemap>
   <sitemap>
     <loc>http://acumen.lib.ua.edu/sitemaps/sitemap2.xml</loc>
     <lastmod>2015-03-01T07:10:17+00:00</lastmod>
   </sitemap>
   <sitemap>
     <loc>http://acumen.lib.ua.edu/sitemaps/sitemap3.xml</loc>
     <lastmod>2015-03-01T07:10:17+00:00</lastmod>
   </sitemap>
 </sitemapindex>
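
An index file like the one above could be generated along these lines. This is a minimal Python sketch using the standard library's xml.etree.ElementTree; the function name and its (loc, lastmod) input format are assumptions made for illustration, not a description of how makeSiteMap works.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(entries):
    """Build <sitemapindex> XML from (loc, lastmod) pairs and return it as a string."""
    # Register the sitemap namespace as the default so no prefix is written out
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc, lastmod in entries:
        sitemap = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")
```

Calling it with pairs such as ("http://acumen.lib.ua.edu/sitemaps/sitemap1.xml", "2015-03-01T07:10:16+00:00") yields one <sitemap> element per pair, without the indentation shown above.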


And here is the first part of one of the sitemaps:

 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
     <loc>http://acumen.lib.ua.edu/u0003_0004090</loc>
     <lastmod>2015-02-24T16:22:54+00:00</lastmod>
     <changefreq>yearly</changefreq>
     <priority>1.0</priority>
   </url>
   <url>
     <loc>http://acumen.lib.ua.edu/u0003_0004091</loc>
     <lastmod>2015-02-27T22:59:57+00:00</lastmod>
     <changefreq>yearly</changefreq>
     <priority>1.0</priority>
   </url>
   <url>
     <loc>http://acumen.lib.ua.edu/u0003_0004092</loc>
     <lastmod>2015-02-27T22:59:58+00:00</lastmod>
     <changefreq>yearly</changefreq>
     <priority>1.0</priority>
   </url>
   <url>
     <loc>http://acumen.lib.ua.edu/u0003_0004093</loc>
     <lastmod>2015-02-27T22:59:58+00:00</lastmod>
     <changefreq>yearly</changefreq>
     <priority>1.0</priority>
   </url>
   <url>
     <loc>http://acumen.lib.ua.edu/u0004_0000001_0000001</loc>
     <lastmod>2014-08-05T16:43:00+00:00</lastmod>
     <changefreq>yearly</changefreq>
     <priority>0.8</priority>
   </url>

As you can see, the top 4 links have priority 1.0 and go to finding aids. The last one is for an item (not mass-digitized) and has priority 0.8; mass-digitized content (not shown) has priority 0.3. Because we have so few mass-digitized collections, they are identified by adding their collection numbers into the script itself.
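
The priority rule described above might be sketched in Python as follows. Several things here are assumptions for illustration only: the collection number in the set is a made-up placeholder, the caller is assumed to know whether an entry is a finding aid, and the collection number is assumed to be the first two underscore-separated parts of an item identifier. None of this is taken from the actual script.

```python
# Placeholder collection number; the real list is hard-coded in the script itself.
MASS_DIGITIZED_COLLECTIONS = {"u9999_0000001"}


def priority_for(identifier: str, is_finding_aid: bool) -> str:
    """Return the <priority> value for an entry, following the three tiers described above."""
    if is_finding_aid:
        return "1.0"
    # Assumption: the first two parts of the identifier name the collection
    collection = "_".join(identifier.split("_")[:2])
    if collection in MASS_DIGITIZED_COLLECTIONS:
        return "0.3"
    return "0.8"
```

Keeping the mass-digitized collections in a small hard-coded set mirrors the choice described above, which is workable only while there are few of them.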

In spring of 2015, we modified our sitemaps to include links to the first large image (not thumbnail) of each item in an image collection, in the hopes of wider access to content via search engines. (See Webmaster Tools Image sitemaps.) So, for example, the top of our 5th sitemap at the moment looks like this:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://acumen.lib.ua.edu/u0003_0002867_0000290</loc>
    <image:image>
      <image:loc>
          http://acumen.lib.ua.edu/content/u0001/2014021/0000040/u0001_2014021_0000040_2048.jpg
      </image:loc>
    </image:image>
    <lastmod>2014-08-05T16:41:50+00:00</lastmod>
    <changefreq>yearly</changefreq>
    <priority>0.8</priority>
  </url>
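
An entry with an image link, like the one above, could be assembled as in this minimal Python sketch using xml.etree.ElementTree. The function name, parameter names, and default values are assumptions chosen for illustration; the element order follows the example above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_url(loc, image_loc, lastmod, changefreq="yearly", priority="0.8"):
    """Build a <url> element that carries one <image:image> child."""
    # Register prefixes so serialized output uses a default namespace and "image:"
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
    ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_loc
    ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    ET.SubElement(url, f"{{{SITEMAP_NS}}}changefreq").text = changefreq
    ET.SubElement(url, f"{{{SITEMAP_NS}}}priority").text = priority
    return url
```

The returned element can be appended to a <urlset> root and serialized with ET.tostring.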


To submit a new sitemap to Google, or to check our indexing progress, log in with web services credentials to Google Webmaster Tools.