Diacritics

From UA Libraries Digital Services Planning and Documentation

Revision as of 09:35, 14 December 2012

As metadata spreadsheets change hands, and are often created from diverse sources, problems arise with diacritics. These characters often do not translate from one encoding to another, which produces poor results in the resulting MODS metadata files.
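
As an illustration of why diacritics break between encodings, here is a minimal Python sketch. It is not part of the workflow on this page. The sample phrase and the choice of Windows-1252 (cp1252) as the mismatched encoding are assumptions made for the example.

```python
# Illustration only: the sample text and the cp1252/UTF-8 pairing are assumptions.
original = "Révolution française"

# Case 1: text saved as UTF-8 but opened as Windows-1252.
# Each accented letter is two bytes in UTF-8, and each byte is shown as its own character.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(misread)  # RÃ©volution franÃ§aise

# Case 2: text saved as Windows-1252 but opened as UTF-8.
# The single accented byte is not valid UTF-8, so it becomes U+FFFD (replacement character).
cp1252_bytes = original.encode("cp1252")
lossy = cp1252_bytes.decode("utf-8", errors="replace")
print(lossy)  # R\ufffdvolution fran\ufffdaise, displayed with replacement-character glyphs
```

In the second case the original letter is lost in the decoded text, which is why the corrections in the steps below are made in the source spreadsheet rather than in the exported file.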


Based on our experience with French-language diacritics, the following process allowed MODS to be created without the numerous encoding problems that we initially encountered with the collection u0002_0000006 (French Revolutionary Pamphlets).

  To do these steps, you will need a Windows computer with Microsoft Excel and Notepad++ (http://notepad-plus.sourceforge.net/).
  You will also need the TextFX plug-in for Notepad++.
  To get plug-ins for Notepad++, refer to the application documentation on the Plugin Manager.

1. Export the metadata from Excel as a tab-delimited text file (use the Unicode option). If the Excel file is in the legacy .xls format, convert it to .xlsx before exporting.


2. Open the file in Notepad++ and choose "Encode in UTF-8".


3. Use the TextFX plug-in and choose: TextFX Characters > zap all non-printable characters to #


4. Search for all instances of "#" and correct each problem found in the *Excel* (.xlsx) file. Use Excel's built-in character map. To open the character map window in Excel, go to the Insert tab, then the Symbols group, then Symbol. Select the character you need as the replacement and choose "Unicode Hex" as the encoding while making changes.

This Excel file (http://intranet.lib.ua.edu/wiki/digcoll/images/e/e6/French_encoding_map.xlsx) may be a useful starting point for searching and replacing diacritics in French-language metadata.


5. From Excel, export the metadata as a Unicode tab-delimited file.


6. Open the Unicode export in Notepad++ and choose "Encode in UTF-8". Repeat steps 3 through 6 until no problems remain.


7. If all is well, save the text file with Notepad++ (it will be a UTF-8 file).

  This UTF-8 file is now the file from which to create MODS with Archivist Utility (http://acumen.lib.ua.edu/project/?f=Archivist%20Utility.txt). It is also the text version of the metadata that will go into long-term storage.
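
The check in the steps above (flag suspect characters, then fix them in the spreadsheet) can also be sketched as a script. The following Python sketch is offered only as a cross-check and is not a reimplementation of the TextFX command. The function names, the set of characters treated as suspect, and the assumption that Excel's Unicode export can be read with Python's "utf-16" codec are all hypothetical choices made for this example.

```python
import unicodedata

# Controls that legitimately occur in a tab-delimited export.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def find_suspect_characters(text):
    """Return (line, column, code point, name) for each character worth reviewing.

    Assumption for this sketch: a character is suspect if it is a control
    character other than tab/CR/LF, the replacement character U+FFFD,
    or an unassigned or private-use code point.
    """
    hits = []
    # Split on "\n" only, so that other control characters stay visible to the scan.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            category = unicodedata.category(ch)
            is_control = category == "Cc" and ch not in ALLOWED_CONTROLS
            is_replacement = ch == "\ufffd"
            is_unassigned = category in ("Cn", "Co")
            if is_control or is_replacement or is_unassigned:
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col_no, f"U+{ord(ch):04X}", name))
    return hits


def scan_export(path, encoding="utf-16"):
    """Read an exported text file and report suspect characters.

    The default encoding is an assumption about Excel's Unicode text export;
    pass encoding="utf-8" for the file saved from Notepad++.
    """
    with open(path, encoding=encoding, newline="") as handle:
        return find_suspect_characters(handle.read())


if __name__ == "__main__":
    sample = "titre\tR\ufffdvolution\nauteur\tAbb\x92\n"
    for hit in find_suspect_characters(sample):
        print(hit)
```

A scan like this does not catch mojibake made of ordinary printable characters (for example "Ã©" in place of "é"), so it complements, rather than replaces, a visual check of the metadata.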